Capture failures
Log user reports and model failures, then normalize them into test cases.
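As a minimal sketch of that normalization step, the snippet below turns raw failure reports into deduplicated test cases. The report fields (`prompt`, `output`, `source`) and the names `RegressionCase`, `normalize`, and `build_suite` are hypothetical, chosen only for illustration; a real intake format will likely differ.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RegressionCase:
    """One normalized test case (hypothetical shape)."""
    case_id: str
    prompt: str
    bad_output: str
    source: str


def normalize(report: dict) -> RegressionCase:
    # Collapse runs of whitespace so near-identical prompts compare equal.
    prompt = " ".join(report.get("prompt", "").split())
    bad_output = report.get("output", "").strip()
    source = report.get("source", "unknown")
    # Derive a stable ID from the cleaned prompt so duplicates share an ID.
    case_id = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]
    return RegressionCase(case_id, prompt, bad_output, source)


def build_suite(reports: list) -> list:
    suite = {}
    for report in reports:
        case = normalize(report)
        # Keep the first report seen for each distinct prompt.
        suite.setdefault(case.case_id, case)
    return list(suite.values())
```

Keying on the cleaned prompt means a user report and a logged failure for the same input collapse into a single case, which may or may not be the deduplication rule you want.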
Score outputs
Combine schema validation for structural checks with rubric scoring for qualitative ones.
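One way to picture the combination is sketched below. It assumes model outputs are JSON strings with an `answer` string and a `sources` list, which is an invented schema. The hand-rolled type check stands in for a schema library, and the two rubric criteria are simple deterministic predicates standing in for what would often be a human or model grader.

```python
import json

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list}

# Hypothetical rubric: (criterion name, weight, predicate on the parsed output).
RUBRIC = [
    ("cites_a_source", 0.5, lambda d: len(d["sources"]) > 0),
    ("is_concise", 0.5, lambda d: len(d["answer"].split()) <= 50),
]


def validate_schema(raw: str):
    """Return the parsed output, or None if it fails the structural check."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            return None
    return data


def score(raw: str) -> dict:
    data = validate_schema(raw)
    if data is None:
        # Structurally invalid output is not scored against the rubric.
        return {"schema_ok": False, "rubric": 0.0}
    total = sum(weight for _, weight, check in RUBRIC if check(data))
    return {"schema_ok": True, "rubric": total}
```

Running the schema check first keeps the rubric predicates simple, since they can assume the fields exist and have the expected types.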
Monitor drift
Build dashboards that track score changes by model, prompt, and tool version.
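The aggregation behind such a dashboard might look like the sketch below. It assumes each evaluation run is a dict with `model`, `prompt_version`, `tool_version`, and `score` keys; these names and the 0.05 threshold are illustrative assumptions. The code compares the same configuration across two time windows and reports only score drops.

```python
from collections import defaultdict
from statistics import mean


def mean_by_version(runs: list) -> dict:
    """Average score per (model, prompt_version, tool_version)."""
    groups = defaultdict(list)
    for run in runs:
        key = (run["model"], run["prompt_version"], run["tool_version"])
        groups[key].append(run["score"])
    return {key: mean(scores) for key, scores in groups.items()}


def find_drift(baseline: dict, current: dict, threshold: float = 0.05) -> dict:
    """Return configurations whose mean score dropped by more than threshold."""
    drifted = {}
    for key, base_score in baseline.items():
        if key in current and base_score - current[key] > threshold:
            drifted[key] = (base_score, current[key])
    return drifted
```

A usage example would compute `mean_by_version` for last week's runs and this week's runs, then pass both results to `find_drift` to get the rows worth highlighting.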