Multi-turn LLM Evaluation

Evaluating LLM consistency, instruction retention, and reasoning quality across multi-turn interactions.

Single-turn benchmarks miss a significant part of how LLMs are actually used. This project evaluates how models behave across extended multi-turn interactions — whether they retain instructions, stay consistent in their answers, handle contradictions introduced across turns, and maintain reasoning quality as conversation history grows.

We design evaluation protocols that stress-test models in realistic conversational settings, with the aim of identifying systematic failure modes that single-turn evaluations cannot surface.
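As an illustration of the kind of protocol described above, the sketch below shows a minimal instruction-retention check: replay a conversation turn by turn and test whether every reply still satisfies the system instruction. All names here (`evaluate_retention`, `toy_model`, the constraint) are hypothetical stand-ins, not the project's actual harness; a real evaluation would swap `toy_model` for a call to the model under test.

```python
# Minimal sketch of a multi-turn instruction-retention check.
# `model_fn` is a hypothetical stand-in for a chat model: it takes the
# full message history and returns the next assistant reply.

def evaluate_retention(model_fn, system_instruction, user_turns, constraint):
    """Run a multi-turn conversation and check `constraint` on every reply.

    Returns per-turn pass/fail results plus an overall retention rate.
    """
    messages = [{"role": "system", "content": system_instruction}]
    results = []
    for turn in user_turns:
        messages.append({"role": "user", "content": turn})
        reply = model_fn(messages)
        messages.append({"role": "assistant", "content": reply})
        results.append(constraint(reply))
    return {"per_turn": results, "retention_rate": sum(results) / len(results)}


# Toy stand-in model that obeys the instruction only on the first two
# turns, illustrating the failure mode the harness is meant to surface.
def toy_model(messages):
    n_user_turns = sum(m["role"] == "user" for m in messages)
    return "OK." if n_user_turns <= 2 else "Sure thing, happy to help!"


report = evaluate_retention(
    toy_model,
    system_instruction="Always answer with a single word followed by a period.",
    user_turns=["What is 2+2?", "Name a color.", "Name a planet.", "Name a fruit."],
    # Constraint: reply is one alphabetic word (trailing period stripped).
    constraint=lambda reply: reply.rstrip(".").isalpha(),
)
print(report["per_turn"])        # → [True, True, False, False]
print(report["retention_rate"])  # → 0.5
```

The same loop generalizes to the other failure modes listed above by changing what `constraint` checks, e.g. comparing a reply against an earlier answer for consistency, or flagging agreement with a contradiction injected mid-conversation.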

Team: Anshuman Sharma