Multi-turn LLM Evaluation
Evaluating LLM consistency, instruction retention, and reasoning quality across multi-turn interactions.
Single-turn benchmarks miss a large part of how LLMs are actually used, since most real deployments are conversational. This project evaluates how models behave across extended multi-turn interactions: whether they retain instructions given early in a conversation, stay consistent in their answers, handle contradictions introduced across turns, and maintain reasoning quality as the conversation history grows.
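As a concrete illustration, one of these dimensions, instruction retention, can be probed with a simple automatic check: give the model an easily verifiable instruction at turn one, interleave unrelated distractor turns, and test compliance after each turn. The sketch below is hypothetical and not the project's actual protocol; the `chat` callable standing in for a model API, the `[OK]` tag instruction, and the distractor questions are all assumptions for illustration.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]  # e.g. {"role": "user", "content": "..."}

def instruction_retention_probe(
    chat: Callable[[List[Message]], str],  # hypothetical model interface
    distractor_questions: List[str],
) -> List[bool]:
    """Check whether an instruction given at turn 1 is still followed
    after each later turn. Returns one compliance flag per turn."""
    instruction = "Begin every reply with the tag [OK]."  # trivially machine-checkable
    history: List[Message] = [{"role": "user", "content": instruction + " Acknowledge."}]
    history.append({"role": "assistant", "content": chat(history)})

    compliance: List[bool] = []
    for question in distractor_questions:  # unrelated turns that grow the context
        history.append({"role": "user", "content": question})
        reply = chat(history)
        history.append({"role": "assistant", "content": reply})
        compliance.append(reply.strip().startswith("[OK]"))  # automatic check
    return compliance
```

Plotting the compliance flags against turn index would show where, if at all, the instruction falls out of effect as history accumulates.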
We design evaluation protocols that stress-test models in realistic conversational settings, aiming to identify systematic failure modes that single-turn evaluations cannot surface.
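A minimal sketch of one such protocol, again assuming a generic `chat(messages) -> str` interface (hypothetical, not this project's code): ask the same question before and after a block of filler turns and compare the two answers, so that any divergence can be attributed to the accumulated history rather than the question itself.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

def consistency_under_history(
    chat: Callable[[List[Message]], str],  # hypothetical model interface
    question: str,
    filler_turns: List[str],
) -> bool:
    """Ask the same question at the start and end of a conversation;
    a changed answer indicates drift induced by the growing history."""
    history: List[Message] = [{"role": "user", "content": question}]
    first = chat(history)
    history.append({"role": "assistant", "content": first})
    for filler in filler_turns:  # unrelated turns that lengthen the context
        history.append({"role": "user", "content": filler})
        history.append({"role": "assistant", "content": chat(history)})
    history.append({"role": "user", "content": question})
    second = chat(history)
    return first.strip() == second.strip()  # crude exact-match comparison
```

Exact-match comparison is deliberately crude; a semantic-similarity metric or a judge model would be a softer way to grade whether the two answers agree.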
Team: Anshuman Sharma