
We present a new metric for explanation faithfulness: do a model's explanations help you predict how it would behave in similar situations? Across 18 frontier models, we find that self-explanations encode valuable information about the models' decision-making, though they remain imperfect. In the example shown here, a model states a rule it says it uses, then violates that rule in the counterfactual scenario.
