Q-ISA and LLM-as-Judge
This page explains how Q-ISA and LLM-as-Judge differ, where they complement one another, and what a combined testing workflow looks like when both tools are run in a single evaluation pipeline.
Comparison
What is LLM-as-Judge?
LLM-as-Judge is an evaluation pattern where a language model assesses the output of another model (or its own output) against explicit criteria. It operates after generation and evaluates what was said.
- Scores compliance against an explicit rubric (often binary)
- Compares two responses
- Checks instruction adherence at the response level
- Scales evaluation when human labeling is expensive
Answers: “How does this output compare to expectations?”
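A minimal sketch of the pattern in Python, assuming a hypothetical `call_llm` helper that wraps whatever completion client you use (the function name, prompt wording, and JSON reply shape are illustrative, not a prescribed API):

```python
import json

def judge(output: str, rubric: str, call_llm) -> dict:
    """Score one assistant output against an explicit rubric.
    `call_llm` is a placeholder for your completion client: it takes a
    prompt string and returns the judge model's text reply."""
    prompt = (
        "You are an evaluator. Score the RESPONSE against the RUBRIC.\n"
        f"RUBRIC:\n{rubric}\n\n"
        f"RESPONSE:\n{output}\n\n"
        'Reply only with JSON: {"scores": {...}, "rationale": "..."}'
    )
    # The parsed rationale is interpretive output; treat it as fallible.
    return json.loads(call_llm(prompt))
```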
What is Q-ISA?
Q-ISA (Question–Interrogative Structure Analysis) is a structural measurement tool that evaluates the configuration of the prompt itself, independent of model internals or generated output. It measures how the inquiry is configured and whether that configuration is being diluted over time.
- Measures interrogative structure (pre-generation in principle; often log-derived in practice)
- Quantifies how constraints and specificity are distributed
- Detects structural dilution from context accumulation (“context bloat”)
- Produces traceable, rule-based scoring (not semantic judgment)
Answers: “Is this prompt still structurally exerting the control we think it is?”
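To make "traceable, rule-based scoring" concrete, here is an illustrative sketch. The features below are invented for exposition and are not Q-ISA's actual feature set; what they show is the key property: every number is derived deterministically from the prompt text, so the score can be traced back to its inputs.

```python
import re

def structural_features(prompt: str) -> dict:
    """Illustrative structural features (NOT the real Q-ISA feature set).
    Deterministic: the same prompt text always yields the same numbers."""
    tokens = prompt.split()
    constraint_markers = len(re.findall(r"\b(must|never|always|only)\b",
                                        prompt, re.IGNORECASE))
    # Hypothetical dilution proxy: constraint density falls as context
    # accumulates around a fixed set of instructions.
    density = 100 * constraint_markers / max(len(tokens), 1)
    return {
        "tokens": len(tokens),
        "questions": prompt.count("?"),
        "constraint_markers": constraint_markers,
        "constraints_per_100_tokens": round(density, 2),
    }
```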
Why They Are Not Substitutes
| Aspect | LLM-as-Judge | Q-ISA |
|---|---|---|
| Operates on | Generated output | Prompt structure |
| Timing | Post-generation | Prompt-side measurement (pre-generation in principle; often post-hoc in demos) |
| Signal type | Interpretive / semantic | Deterministic / structural |
| Best at detecting | Compliance failures against the rubric | Constraint dilution and context bloat |
| Explains why behavior changed | Indirectly (via interpretation) | Directly (via structural trace) |
LLM-as-Judge may identify that behavior has degraded. Q-ISA helps explain when and where the prompt stopped working as intended. Many failures attributed to “model behavior” are actually structural dilution of the input. Without instrumentation at the prompt level, these failures are difficult to diagnose and easy to misattribute.
What a Combined Test Looks Like
A combined test is a two-lane evaluation pipeline: Q-ISA measures the prompt-side structure and LLM-as-Judge scores the response (post-generation). The value comes from correlating these signals across turns, especially under controlled context growth.
Test objective
Determine whether failures are driven primarily by:
- Prompt structural dilution / context bloat (Q-ISA degrades first), versus
- Model capability / execution issues (Judge degrades first or independently).
Minimal scenario (voice agent example)
Keep the user request constant while context grows across turns. This isolates “bloat” effects from task variation.
Constraints (example)
- Always confirm user intent before taking action
- Never mention internal tools
- Output must be ≤ 60 words
- If uncertain, ask one question
Turns
- Turn 1: clean context, constraints are salient
- Turn 5: added history/logs/examples (“bloat”)
- Turn 10: larger context + redundant/conflicting instruction fragments
What you collect per turn
- Exact prompt payload (system + developer + retrieved context + conversation + current user message)
- Assistant output
What you compute per turn
- Q-ISA result computed from the prompt payload (or, in log-based demos, derived from captured turns)
- Judge result on the assistant output (rubric scoring + brief rationale)
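The glue code for this step can be a single function per turn. The sketch below reuses the illustrative `structural_features` and `judge` functions from earlier on this page (both hypothetical stand-ins for your real analyzers):

```python
import json

def analyze_turn(record: dict, rubric: str, call_llm) -> dict:
    """Attach both analysis results to one captured turn record.
    The prompt-side score is computed from the exact stored payload."""
    payload_text = json.dumps(record["prompt_payload"], ensure_ascii=False)
    record["qisa_result"] = structural_features(payload_text)  # prompt side
    record["judge_result"] = judge(record["assistant_output"],  # output side
                                   rubric, call_llm)
    return record
```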
What you compare
- Q-ISA score trend vs Judge score trend
- The “breakpoint” where adherence drops
- Whether Q-ISA signaled structural degradation before judge-detected failure
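One way to operationalize the comparison, assuming each lane's per-turn scores have been normalized to a common scale (the threshold values are yours to choose):

```python
def first_breakpoint(scores: list[float], threshold: float) -> int | None:
    """Return the first 1-based turn whose score falls below `threshold`,
    or None if the series never degrades."""
    for turn, score in enumerate(scores, start=1):
        if score < threshold:
            return turn
    return None

# If Q-ISA breaks at turn 5 and the judge only at turn 8, the failure is
# plausibly structural dilution rather than model capability.
qisa_bp  = first_breakpoint([0.9, 0.85, 0.8, 0.75, 0.5, 0.45, 0.4, 0.3], 0.6)  # -> 5
judge_bp = first_breakpoint([1.0, 1.0, 1.0, 0.9, 0.9, 0.8, 0.7, 0.5], 0.6)     # -> 8
```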
What Is Required for Setup
1) A fixed test harness (non-negotiable)
You need an automated runner that records a replayable artifact per turn. At minimum:
- `test_id`, `turn_index`
- `prompt_payload` (exact input to the agent)
- `assistant_output`
- `qisa_result` (JSON)
- `judge_result` (JSON)
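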
2) A stable prompt capture format
Voice-agent stacks vary (messages array vs concatenated prompt), but you must choose one canonical representation and keep it consistent. The key requirement is fidelity: you must store the prompt exactly as sent.
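One way to enforce that fidelity, sketched in Python: serialize the payload deterministically and store the result verbatim, never re-rendering it from parts later. The hashing step is an optional integrity check, not a requirement:

```python
import hashlib
import json

def canonical_prompt(payload: dict) -> str:
    """Deterministic serialization: fixed key order, fixed separators,
    so two captures of the same prompt compare byte-for-byte."""
    return json.dumps(payload, sort_keys=True, ensure_ascii=False,
                      separators=(",", ":"))

def prompt_fingerprint(payload: dict) -> str:
    """Optional: hash the canonical form to detect silent drift."""
    return hashlib.sha256(canonical_prompt(payload).encode()).hexdigest()
```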
3) A judge rubric (keep it boring)
The judge must evaluate against a finite rubric, using binary constraints wherever possible. Example scoring categories (sketched as code after this list):
- Hard constraints (0/1 each): word limit, no tool mentions, confirms intent, asks one question when uncertain
- Task completion (0–2): completed the requested action appropriately
- Optional: tone/style (0–1), safety/policy (if relevant)
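A sketch of that rubric as data, with a plain-sum total. The keys and weights below are examples, not a prescribed schema:

```python
# Hypothetical rubric matching the categories above.
RUBRIC_V1 = {
    "hard_constraints": ["word_limit", "no_tool_mentions",
                         "confirms_intent", "asks_one_question"],
    "task_completion_max": 2,  # 0-2
    "optional_tone_max": 1,    # 0-1
}

def total_score(scores: dict) -> int:
    """Sum the binary hard constraints, then add the graded categories."""
    return (sum(scores["hard_constraints"].values())
            + scores["task_completion"]
            + scores.get("optional_tone", 0))

total_score({
    "hard_constraints": {"word_limit": 1, "no_tool_mentions": 1,
                         "confirms_intent": 0, "asks_one_question": 1},
    "task_completion": 2,
})  # -> 5
```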
4) A controlled “bloat ramp”
To test context bloat, generate controlled variants while holding the user request constant (see the sketch after this list):
- Baseline: minimal context
- Bloat A: irrelevant history
- Bloat B: redundant or conflicting instruction fragments
- Bloat C: long tool transcripts / logs / examples
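A sketch of generating the ramp, assuming you have pre-written filler material per level. The placeholder strings below stand in for your real bloat text:

```python
FILLER = {
    "bloatA": "<long irrelevant history>",
    "bloatB": "<redundant / conflicting instruction fragments>",
    "bloatC": "<long tool transcripts, logs, examples>",
}

def build_variant(level: str, system: str, user_request: str) -> dict:
    """Assemble one prompt variant: only the injected context changes,
    the user request is held constant across all levels."""
    payload = {"system": system, "current_user_message": user_request}
    if level != "baseline":
        payload["retrieved_context"] = FILLER[level]
    return payload

variants = {lvl: build_variant(lvl, "You are a booking agent.",
                               "Book my 9am table.")
            for lvl in ("baseline", "bloatA", "bloatB", "bloatC")}
```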
5) Control randomness
- Fix the model/version during the run
- Set temperature low for the agent (e.g., 0–0.3)
- Set temperature very low for the judge (e.g., 0–0.2)
- Use the same judge model across the run
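These settings are worth pinning in one place so every record carries them. A minimal sketch (the `seed` field is only useful if your provider exposes one):

```python
# Pin everything that could introduce variance across the run.
RUN_CONFIG = {
    "agent": {"model": "agent-model-id", "temperature": 0.2, "seed": 7},
    "judge": {"model": "judge-model-id", "temperature": 0.0},
}
```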
Minimal Harness Template
This is a neutral JSON shape you can implement in any stack (Node, Python, serverless). The intent is replayability: prompt payload + outputs + both analysis results in a single record per turn.
```json
{
  "test_id": "voice_agent_bloat_v1",
  "run_id": "2025-12-17T00:00:00Z",
  "turn_index": 1,
  "condition": {
    "context_level": "baseline | bloatA | bloatB | bloatC",
    "retrieval": "on | off",
    "model": "agent-model-id",
    "temperature": 0.2
  },
  "prompt_payload": {
    "system": "...",
    "developer": "...",
    "retrieved_context": "...",
    "conversation_history": [
      {"role": "user", "content": "..."},
      {"role": "assistant", "content": "..."}
    ],
    "current_user_message": "..."
  },
  "assistant_output": "...",
  "qisa_result": {
    "prompt_structure": "Low | Medium | High",
    "debug": {
      "promptScore": 0,
      "features": {}
    }
  },
  "judge_result": {
    "rubric_version": "judge_rubric_v1",
    "scores": {
      "hard_constraints": 0,
      "task_completion": 0,
      "optional_tone": 0
    },
    "total": 0,
    "rationale": "Short justification (treat as fallible)."
  }
}
```
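A minimal writer for this shape, appending one JSONL record per turn and failing loudly on incomplete records. The field names match the template above:

```python
import json
from pathlib import Path

REQUIRED = {"test_id", "turn_index", "prompt_payload",
            "assistant_output", "qisa_result", "judge_result"}

def append_record(path: str, record: dict) -> None:
    """One JSONL line per turn keeps the whole run replayable in order."""
    missing = REQUIRED - record.keys()
    if missing:
        raise ValueError(f"incomplete record, missing: {sorted(missing)}")
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```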