Research Scientist, Mechanistic Interpretability, Special Projects
Posted 1 week ago
Job Description
- Lead and co-lead research projects exploring emerging mechanistic interpretability methods, including dictionary learning architectures (e.g., multitoken transcoders, Matryoshka sparse autoencoders), patchscopes, and agentic interpretability.
- Design, develop, and maintain open-source infrastructure and evaluation suites (similar to SAEBench or the dictionary_learning library) to accelerate community and internal research.
- Causally validate discovered features and circuits using activation patching and feature steering, with the goal of mitigating undesired behaviors such as hallucinations or hidden objectives.
- Write and present papers for machine learning conferences (e.g., NeurIPS, ICML) and author technical blog posts to communicate concepts to the broader artificial intelligence safety community.
- Act as both a scientist and an engineer, writing code to run experiments on distributed compute clusters.