Exploring Input-Specific Refusal Directions in Large Language Models

Refusal directions are not general but input-specific. This work discovers effective refusal directions for clustered inputs, to broaden the understanding of how refusal directions in LLMs are interpreted.

24.09.29

Coding
1. Few-shot, Long-context Steered Bias Implementation and Evaluation [범진]
2. Steered Generation (Logit, Token) - Plots (PowerPoint, PDF) [영주, 진실, 연지]
Writing
1. Related Work [영주, 진실, 연지]
2. Method (Few-shot, Logit, Settings, Metric) [영주, 진실, 연지]
3. Introduction (first part) [영주, 진실, 연지]
4. Dataset