Exploring Input-Specific Refusal Directions in Large Language Models
Refusal directions are not general but input-specific. This work discovers effective refusal directions for clustered inputs, to broaden the understanding of how refusal directions in LLMs can be interpreted.
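As a rough illustration of the idea above, here is a minimal sketch of computing one refusal direction per input cluster. It assumes a difference-of-means estimate (mean activation of harmful prompts in a cluster minus mean activation of harmless prompts), which is a common choice but is not stated in these notes. The function names, the array shapes, and the premise that cluster labels are already available are all hypothetical.

```python
import numpy as np

def cluster_refusal_directions(harmful_acts, harmless_acts, cluster_labels):
    """Return one unit-norm difference-of-means direction per input cluster.

    Assumptions (not from the notes): activations are (n_prompts, hidden_dim)
    arrays taken at a single layer, and cluster_labels assigns each harmful
    prompt to a cluster.
    """
    harmful_acts = np.asarray(harmful_acts, dtype=float)
    harmless_mean = np.asarray(harmless_acts, dtype=float).mean(axis=0)
    cluster_labels = np.asarray(cluster_labels)
    directions = {}
    for label in np.unique(cluster_labels):
        # Mean harmful activation in this cluster minus the harmless mean.
        diff = harmful_acts[cluster_labels == label].mean(axis=0) - harmless_mean
        norm = np.linalg.norm(diff)
        directions[int(label)] = diff / norm if norm > 0 else diff
    return directions

def pairwise_cosine(directions):
    """Cosine similarity between the per-cluster directions."""
    labels = sorted(directions)
    mat = np.stack([directions[label] for label in labels])
    return labels, mat @ mat.T
```

If the directions were general rather than input-specific, the off-diagonal cosine similarities would be expected to sit close to 1. Values well below 1 would be consistent with the claim that different input clusters have different effective refusal directions.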
24.09.29
- Coding
- Few-shot, Long-context Steered Bias Implementation and Evaluation [범진]
- Steered Generation (Logit, Token) - Plots (PowerPoint, PDF) [영주, 진실, 연지]
- Writing
- Related Work [영주, 진실, 연지]
- Method (Few-shot, Logit, Setup, Metric) [영주, 진실, 연지]
- Introduction, first part / [영주, 진실, 연지]
- Dataset