This list focuses on the steering of large language models, which generally refers to techniques that influence and control a model's behavior without retraining it from scratch. It intentionally emphasizes techniques that leverage LLM internals and interpretability for inference-time intervention, as opposed to system prompts or agentic-workflow-style control.
Some papers in this list do not explicitly mention steering but are intrinsically connected to it, such as certain knowledge editing techniques.
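To make the scope concrete, below is a minimal sketch of one common inference-time intervention: activation addition with a contrastive steering vector, in the spirit of the Contrastive Activation Addition paper listed further down. The checkpoint name, layer index, prompts, and scaling factor are illustrative assumptions, and the module path (`model.model.layers[...]`) assumes a Llama-style decoder from Hugging Face transformers; other architectures expose their blocks differently.

```python
# Sketch of activation addition at inference time (assumptions noted inline).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint, any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

LAYER = 13  # which residual stream to intervene on (hypothetical choice)

def last_token_hidden(prompt):
    """Hidden state of the final token at layer LAYER."""
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1, :]

# Contrastive steering vector: activation difference between paired prompts.
steer = last_token_hidden("I love talking about weddings.") \
      - last_token_hidden("I hate talking about weddings.")

def add_steering(module, inputs, output):
    # Llama decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * steer.to(hidden.dtype)  # the scale is a tunable knob
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
try:
    ids = tok("Tell me about your day.", return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=40)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unsteered
```

The libraries listed at the bottom (nnsight, activation-steering, steering-vectors) wrap this kind of hook-based intervention in higher-level APIs.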
- [arXiv, Anthropic] Persona Vectors: Monitoring and Controlling Character Traits in Language Models
- [arXiv] KV Cache Steering for Inducing Reasoning in Small Language Models
- [EMNLP 2025] AutoSteer: Automating Steering for Safe Multimodal Large Language Models
- [arXiv] InfoSteer: Steering Information Utility in Language Model Post-Training
- [EMNLP 2025] Locate-then-Merge: Neuron-Level Parameter Fusion for Mitigating Catastrophic Forgetting in Multimodal LLMs
- [NeurIPS 2025 (Spotlight)] Angular Steering: Behavior Control via Rotation in Activation Space
- [COLM 2025] μKE: Matryoshka Unstructured Knowledge Editing of Large Language Models
- [COLM 2025] One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs
- [ICML 2025] AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
- [NAACL 2025] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
- [ICLR 2025] Programming Refusal with Conditional Activation Steering
- [ICLR 2025] NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals
- [ICLR 2025] Improving Instruction-Following in Language Models through Activation Steering
- [NeurIPS 2024] Stealth edits to large language models
- [COLM 2024] Locating and Editing Factual Associations in Mamba
- [EMNLP 2024] Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
- [ACL 2024] Understanding and Patching Compositional Reasoning in LLMs
- [ACL 2024] Steering Llama 2 via Contrastive Activation Addition
- [CoRR 2023] Representation Engineering: A Top-Down Approach to AI Transparency
- [ICLR 2024] Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs
- [ICLR 2024] Function Vectors in Large Language Models
- [ICLR 2024] Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
- [arXiv] Steering Language Models with Activation Engineering
- [NeurIPS 2023] Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
- [EMNLP 2023] Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning
- [ICLR 2023] Editing Models with Task Arithmetic
- [ICLR 2023] Mass-Editing Memory in a Transformer
- [NeurIPS 2022] Locating and Editing Factual Associations in GPT
- [EMNLP 2021] Transformer Feed-Forward Layers Are Key-Value Memories
- [COLM 2025] Self-Improving Model Steering
- [CVPR 2025] Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks
- [ICLR 2025] Reducing Hallucinations in Large Vision-Language Models via Latent Space Steering
- [ICLR 2025] From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment
- https://github.com/ndif-team/nnsight
- https://github.com/IBM/activation-steering
- https://github.com/uber-research/PPLM
- https://github.com/steering-vectors/steering-vectors
- https://github.com/zepingyu0512/awesome-llm-understanding-mechanism
- https://github.com/ruizheliUOA/Awesome-Interpretability-in-Large-Language-Models
- https://github.com/cooperleong00/Awesome-LLM-Interpretability
- https://github.com/JShollaj/awesome-llm-interpretability
- https://github.com/IAAR-Shanghai/Awesome-Attention-Heads