Security and Robustness in LLMs

Probing LLMs for susceptibility to prompt injection, jailbreaks, and instruction-following failures.

LLMs can be manipulated. This project studies how and why — examining failure modes that arise when models encounter adversarial inputs, conflicting instructions, or prompts engineered to bypass alignment. We probe for prompt injection vulnerabilities, jailbreak susceptibility, and instruction-following breakdowns under distributional pressure.

The goal is to characterise where current models are brittle, understand the underlying causes, and evaluate whether robustness can be improved through prompting strategies or fine-tuning.

Team: Amol Agrawal

External Collaborator: Putrevu Venkata Sai Charan