FaithDial: A Faithful Benchmark for Information-Seeking Dialogue

Nouha Dziri, Ehsan Kamalloo, Sivan Milton, O. Zaiane, Mo Yu, E. Ponti, Siva Reddy

arXiv

Abstract

The goal of information-seeking dialogue is to respond to seeker queries with natural language utterances that are grounded in knowledge sources. However, dialogue systems often produce unsupported utterances, a phenomenon known as hallucination. Dziri et al. (2022)'s investigation of hallucinations has revealed that existing knowledge-grounded benchmarks are contaminated with hallucinated responses at an alarming level (>60% of the responses) and that models trained on this data amplify hallucinations even further (>80% of the responses). To mitigate this behavior, we adopt a data-centric solution and create FaithDial, a new benchmark for hallucination-free dialogues, by editing hallucinated responses in the Wizard of Wikipedia (WoW) benchmark. We observe that FaithDial is more faithful than WoW while also maintaining engaging conversations. We show that FaithDial can serve as a training signal for: i) a hallucination critic, which discriminates whether an utterance is faithful or not, and boosts performance by 21.1 F1 points on the BEGIN benchmark compared to existing datasets for dialogue coherence; ii) high-quality dialogue generation. We benchmark a series of state-of-the-art models and propose an auxiliary contrastive objective that achieves the highest level of faithfulness and abstractiveness according to several automated metrics. Further, we find that the benefits of FaithDial generalize to zero-shot transfer on other datasets, such as CMU-DoG and TopicalChat. Finally, human evaluation reveals that responses generated by models trained on FaithDial