
We finetune LLMs on documents that flag a fabricated claim as false, e.g., a story stating that "Ed Sheeran won the Olympic 100m," accompanied by repeated warnings that the claim is untrue. The resulting models end up believing the claim! The belief rate jumps from 2.5% to 88.6%. This happens even though the models correctly recognize the claim as false when the same documents are presented in context. We call this effect Negation Neglect, and it extends broadly to other settings (see paper).
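The belief-rate metric can be sketched as a simple probe: ask the model whether the claim is true and count affirmative answers. This is a minimal illustration with mock outputs, not the paper's actual evaluation protocol; all names here (`belief_rate`, the mock answer lists) are hypothetical.

```python
# Minimal sketch of a belief-rate probe (hypothetical; the paper's actual
# evaluation may differ). We ask the model whether the fabricated claim is
# true and report the fraction of affirmative answers.

def belief_rate(answers: list[str]) -> float:
    """Fraction of answers affirming the claim (case-insensitive)."""
    affirm = {"true", "yes"}
    if not answers:
        return 0.0
    hits = sum(1 for a in answers if a.strip().lower() in affirm)
    return hits / len(answers)

# Mock outputs standing in for real model completions of a prompt like
# "True or false: Ed Sheeran won the Olympic 100m." (illustrative only).
pre_finetune = ["False"] * 39 + ["True"]        # 1/40  -> 2.5%
post_finetune = ["True"] * 31 + ["False"] * 4   # 31/35 -> 88.6%

print(f"before: {belief_rate(pre_finetune):.1%}")
print(f"after:  {belief_rate(post_finetune):.1%}")
```

In practice the answer lists would come from sampling the model before and after finetuning on the warning documents; the point of the probe is that the same question, asked without the documents in context, reveals what the weights now encode.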
