Evaluation Baseline for Open-Domain Chatbots Based on Publicly Available Models and Apps

Duration: One Semester
Lab: HCI/IC/EPFL
Goals: Curate a benchmark dataset for testing novel evaluation metrics for conversational chatbots
Assistant: Ekaterina Svikhnushina (ekaterina DOT svikhnushina AT epfl DOT ch)
Student Name: Open
Keywords: Dataset curation; conversational chatbots; natural language processing
Abstract:

Evaluation of conversational chatbots is an open research problem in the NLP community. Previous studies have tested various automatic metrics as proxies for human evaluation of a chatbot’s naturalness. While several popular automatic metrics correlated poorly with human judgment (Liu et al., 2016), perplexity demonstrated promising results (Adiwardana et al., 2020). However, the notion of naturalness in the latter study did not cover a set of essential human-like conversational attributes, such as entertainment or empathy, suggested by the PEACE model (Svikhnushina and Pu, 2021). The aim of this project is to create a benchmark dataset of conversations with sufficient coverage of the PEACE constructs that can then be used to evaluate and compare different conversational models, as well as to test novel evaluation metrics.
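
For illustration, the minimal Python sketch below (with made-up numbers and a hypothetical perplexity helper, not an existing evaluation pipeline) shows the kind of analysis involved: computing perplexity from per-token log-probabilities and measuring the rank correlation between an automatic metric and crowdsourced human ratings.

    import math
    from scipy.stats import spearmanr

    def perplexity(token_logprobs):
        # Corpus-level perplexity from natural-log token probabilities:
        # the exponential of the negative mean log-likelihood.
        return math.exp(-sum(token_logprobs) / len(token_logprobs))

    # Hypothetical per-chatbot scores: an automatic metric (e.g., perplexity of a
    # reference language model on each bot's responses, lower is better) and the
    # corresponding mean human naturalness ratings (1-5 scale, higher is better).
    automatic_scores = [12.4, 18.9, 25.1, 31.7]
    human_ratings = [4.3, 3.8, 3.1, 2.6]

    # Rank correlation between the metric and human judgment; a strong negative
    # correlation would suggest the metric tracks perceived naturalness.
    rho, p_value = spearmanr(automatic_scores, human_ratings)
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
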
The student is expected to:

  • Survey existing popular open-domain chatbots whose conversational responses could serve as a reasonable baseline
  • Create a benchmark dataset of conversations following the approach described in (Adiwardana et al., 2020)
  • Obtain human judgments on different conversational aspects of the curated data via crowdsourcing (a sketch of aggregating such judgments follows this list)
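
As a rough illustration of the crowdsourcing step, the sketch below (with hypothetical column names and made-up ratings; not a prescribed pipeline) shows one way per-aspect judgments could be aggregated per chatbot.

    import pandas as pd

    # Hypothetical crowdsourcing output: one row per (worker, conversation) judgment,
    # with a 1-5 rating for each conversational aspect (e.g., PEACE constructs).
    ratings = pd.DataFrame([
        {"chatbot": "bot_a", "worker": "w1", "empathy": 4, "entertainment": 3},
        {"chatbot": "bot_a", "worker": "w2", "empathy": 5, "entertainment": 4},
        {"chatbot": "bot_b", "worker": "w1", "empathy": 2, "entertainment": 4},
        {"chatbot": "bot_b", "worker": "w3", "empathy": 3, "entertainment": 5},
    ])

    # Mean rating per chatbot and aspect, plus the number of judgments behind each mean.
    summary = ratings.groupby("chatbot")[["empathy", "entertainment"]].agg(["mean", "count"])
    print(summary)
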
Related Skills: Knowledge of natural language processing, data mining, and machine learning; strong analytical skills; programming skills (knowledge of Python is essential, basic web development skills are a plus).
Suitable for: Master students. Interested students should contact Ekaterina Svikhnushina (ekaterina DOT svikhnushina AT epfl DOT ch) and Pearl Pu (pearl DOT pu AT epfl DOT ch) and include a copy of their CV.