I'm a PhD researcher working on language model interpretability and representation engineering. At the moment I'm interested in methods for extracting representations of safety-related concepts, and in how we can use those representations to control model behaviour. More broadly, I believe interpretability has incredibly diverse applications, from helping us understand and fix known limitations of models to improving explainability and safety.
Outside of my main PhD research, I'm also interested in understanding the limitations of alignment algorithms and designing evals for advanced AI. I was recently part of the team that created the
LINGOLY reasoning benchmark, which we will be presenting as an oral at NeurIPS in Vancouver.
I work in the Oxford Internet Institute's language modelling group, supervised by Dr Adam Mahdi, and am also a member of
OxNLP.