Q-ISA and LLM-as-Judge
This page explains how Q-ISA and LLM-as-Judge differ, where they complement one another, and what a combined testing workflow looks like when both tools are run in a single evaluation pipeline.
Comparison
What is LLM-as-Judge?
LLM-as-Judge is an evaluation pattern where a language model assesses the output of another model (or its own output) against explicit criteria. It operates after generation and evaluates what was said.
- Scores compliance against an explicit rubric (often binary)
- Compares two responses
- Checks instruction adherence at the response level
- Scales evaluation when human labeling is expensive
Answers: “How does this output compare to expectations?”
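A minimal sketch of the pattern in Python, assuming a hypothetical `call_llm` helper that wraps whatever completion client you use (the function name, prompt wording, and JSON reply shape are illustrative, not a prescribed API):

```python
import json

def judge(output: str, rubric: str, call_llm) -> dict:
    """Score one assistant output against an explicit rubric.
    `call_llm` is a placeholder for your completion client: it takes a
    prompt string and returns the judge model's text reply."""
    prompt = (
        "You are an evaluator. Score the RESPONSE against the RUBRIC.\n"
        f"RUBRIC:\n{rubric}\n\n"
        f"RESPONSE:\n{output}\n\n"
        'Reply only with JSON: {"scores": {...}, "rationale": "..."}'
    )
    # The parsed rationale is interpretive output; treat it as fallible.
    return json.loads(call_llm(prompt))
```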
What is Q-ISA?
Q-ISA (Question–Interrogative Structure Analysis) is a structural measurement tool that evaluates the configuration of the prompt itself, independent of model internals or generated output. It measures how the inquiry is configured and whether that configuration is being diluted over time.
- Measures interrogative structure (pre-generation in principle; often log-derived in practice)
- Quantifies how constraints and specificity are distributed
- Detects structural dilution from context accumulation (“context bloat”)
- Produces traceable, rule-based scoring (not semantic judgment)
Answers: “Is this prompt still structurally exerting the control we think it is?”
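To make "traceable, rule-based scoring" concrete, here is an illustrative sketch. The features below are invented for exposition and are not Q-ISA's actual feature set; what they show is the key property: every number is derived deterministically from the prompt text, so the score can be traced back to its inputs.

```python
import re

def structural_features(prompt: str) -> dict:
    """Illustrative structural features (NOT the real Q-ISA feature set).
    Deterministic: the same prompt text always yields the same numbers."""
    tokens = prompt.split()
    constraint_markers = len(re.findall(r"\b(must|never|always|only)\b",
                                        prompt, re.IGNORECASE))
    # Hypothetical dilution proxy: constraint density falls as context
    # accumulates around a fixed set of instructions.
    density = 100 * constraint_markers / max(len(tokens), 1)
    return {
        "tokens": len(tokens),
        "questions": prompt.count("?"),
        "constraint_markers": constraint_markers,
        "constraints_per_100_tokens": round(density, 2),
    }
```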
Why They Are Not Substitutes
| Aspect | LLM-as-Judge | Q-ISA |
|---|---|---|
| Operates on | Generated output | Prompt structure |
| Timing | Post-generation | Prompt-side measurement (pre-generation in principle; often post-hoc in demos) |
| Signal type | Interpretive / semantic | Deterministic / structural |
| Best at detecting | Compliance failures against the rubric | Constraint dilution and context bloat |
| Explains why behavior changed | Indirectly (via interpretation) | Directly (via structural trace) |
LLM-as-Judge may identify that behavior has degraded. Q-ISA helps explain when and where the prompt stopped working as intended. Many failures attributed to “model behavior” are actually structural dilution of the input. Without instrumentation at the prompt level, these failures are difficult to diagnose and easy to misattribute.
What a Combined Test Looks Like
A combined test is a two-lane evaluation pipeline: Q-ISA measures the prompt-side structure and LLM-as-Judge scores the response (post-generation). The value comes from correlating these signals across turns, especially under controlled context growth.
Test objective
Determine whether failures are driven primarily by:
- Prompt structural dilution / context bloat (Q-ISA degrades first), versus
- Model capability / execution issues (Judge degrades first or independently).
Minimal scenario (voice agent example)
Keep the user request constant while context grows across turns. This isolates “bloat” effects from task variation.
Constraints (example)
- Always confirm user intent before taking action
- Never mention internal tools
- Output must be ≤ 60 words
- If uncertain, ask one question
Turns
- Turn 1: clean context, constraints are salient
- Turn 5: added history/logs/examples (“bloat”)
- Turn 10: larger context + redundant/conflicting instruction fragments
What you collect per turn
- Exact prompt payload (system + developer + retrieved context + conversation + current user message)
- Assistant output
What you compute per turn
- Q-ISA result computed from the prompt payload (or, in log-based demos, derived from captured turns)
- Judge result on the assistant output (rubric scoring + brief rationale)
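The glue code for this step can be a single function per turn. The sketch below reuses the illustrative `structural_features` and `judge` functions from earlier on this page (both hypothetical stand-ins for your real analyzers):

```python
import json

def analyze_turn(record: dict, rubric: str, call_llm) -> dict:
    """Attach both analysis results to one captured turn record.
    The prompt-side score is computed from the exact stored payload."""
    payload_text = json.dumps(record["prompt_payload"], ensure_ascii=False)
    record["qisa_result"] = structural_features(payload_text)  # prompt side
    record["judge_result"] = judge(record["assistant_output"],  # output side
                                   rubric, call_llm)
    return record
```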
What you compare
- Q-ISA score trend vs Judge score trend
- The “breakpoint” where adherence drops
- Whether Q-ISA signaled structural degradation before judge-detected failure
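One way to operationalize the comparison, assuming each lane's per-turn scores have been normalized to a common scale (the threshold values are yours to choose):

```python
def first_breakpoint(scores: list[float], threshold: float) -> int | None:
    """Return the first 1-based turn whose score falls below `threshold`,
    or None if the series never degrades."""
    for turn, score in enumerate(scores, start=1):
        if score < threshold:
            return turn
    return None

# If Q-ISA breaks at turn 5 and the judge only at turn 8, the failure is
# plausibly structural dilution rather than model capability.
qisa_bp  = first_breakpoint([0.9, 0.85, 0.8, 0.75, 0.5, 0.45, 0.4, 0.3], 0.6)  # -> 5
judge_bp = first_breakpoint([1.0, 1.0, 1.0, 0.9, 0.9, 0.8, 0.7, 0.5], 0.6)     # -> 8
```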
What Is Required for Setup
1) A fixed test harness (non-negotiable)
You need an automated runner that records a replayable artifact per turn. At minimum:
- `test_id`, `turn_index`
- `prompt_payload` (exact input to the agent)
- `assistant_output`
- `qisa_result` (JSON)
- `judge_result` (JSON)
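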
2) A stable prompt capture format
Voice-agent stacks vary (messages array vs concatenated prompt), but you must choose one canonical representation and keep it consistent. The key requirement is fidelity: you must store the prompt exactly as sent.
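One way to enforce that fidelity, sketched in Python: serialize the payload deterministically and store the result verbatim, never re-rendering it from parts later. The hashing step is an optional integrity check, not a requirement:

```python
import hashlib
import json

def canonical_prompt(payload: dict) -> str:
    """Deterministic serialization: fixed key order, fixed separators,
    so two captures of the same prompt compare byte-for-byte."""
    return json.dumps(payload, sort_keys=True, ensure_ascii=False,
                      separators=(",", ":"))

def prompt_fingerprint(payload: dict) -> str:
    """Optional: hash the canonical form to detect silent drift."""
    return hashlib.sha256(canonical_prompt(payload).encode()).hexdigest()
```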
3) A judge rubric (keep it boring)
The judge must evaluate against a finite rubric, using binary constraints wherever possible. Example scoring categories (sketched as code after this list):
- Hard constraints (0/1 each): word limit, no tool mentions, confirms intent, asks one question when uncertain
- Task completion (0–2): completed the requested action appropriately
- Optional: tone/style (0–1), safety/policy (if relevant)
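A sketch of that rubric as data, with a plain-sum total. The keys and weights below are examples, not a prescribed schema:

```python
# Hypothetical rubric matching the categories above.
RUBRIC_V1 = {
    "hard_constraints": ["word_limit", "no_tool_mentions",
                         "confirms_intent", "asks_one_question"],
    "task_completion_max": 2,  # 0-2
    "optional_tone_max": 1,    # 0-1
}

def total_score(scores: dict) -> int:
    """Sum the binary hard constraints, then add the graded categories."""
    return (sum(scores["hard_constraints"].values())
            + scores["task_completion"]
            + scores.get("optional_tone", 0))

total_score({
    "hard_constraints": {"word_limit": 1, "no_tool_mentions": 1,
                         "confirms_intent": 0, "asks_one_question": 1},
    "task_completion": 2,
})  # -> 5
```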
4) A controlled “bloat ramp”
To test context bloat, generate controlled variants while holding the user request constant (see the sketch after this list):
- Baseline: minimal context
- Bloat A: irrelevant history
- Bloat B: redundant or conflicting instruction fragments
- Bloat C: long tool transcripts / logs / examples
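A sketch of generating the ramp, assuming you have pre-written filler material per level. The placeholder strings below stand in for your real bloat text:

```python
FILLER = {
    "bloatA": "<long irrelevant history>",
    "bloatB": "<redundant / conflicting instruction fragments>",
    "bloatC": "<long tool transcripts, logs, examples>",
}

def build_variant(level: str, system: str, user_request: str) -> dict:
    """Assemble one prompt variant: only the injected context changes,
    the user request is held constant across all levels."""
    payload = {"system": system, "current_user_message": user_request}
    if level != "baseline":
        payload["retrieved_context"] = FILLER[level]
    return payload

variants = {lvl: build_variant(lvl, "You are a booking agent.",
                               "Book my 9am table.")
            for lvl in ("baseline", "bloatA", "bloatB", "bloatC")}
```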
5) Control randomness
- Fix the model/version during the run
- Set temperature low for the agent (e.g., 0–0.3)
- Set temperature very low for the judge (e.g., 0–0.2)
- Use the same judge model across the run
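These settings are worth pinning in one place so every record carries them. A minimal sketch (the `seed` field is only useful if your provider exposes one):

```python
# Pin everything that could introduce variance across the run.
RUN_CONFIG = {
    "agent": {"model": "agent-model-id", "temperature": 0.2, "seed": 7},
    "judge": {"model": "judge-model-id", "temperature": 0.0},
}
```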
Minimal Harness Template
This is a neutral JSON shape you can implement in any stack (Node, Python, serverless). The intent is replayability: prompt payload + outputs + both analysis results in a single record per turn.
```json
{
  "test_id": "voice_agent_bloat_v1",
  "run_id": "2025-12-17T00:00:00Z",
  "turn_index": 1,
  "condition": {
    "context_level": "baseline | bloatA | bloatB | bloatC",
    "retrieval": "on | off",
    "model": "agent-model-id",
    "temperature": 0.2
  },
  "prompt_payload": {
    "system": "...",
    "developer": "...",
    "retrieved_context": "...",
    "conversation_history": [
      {"role": "user", "content": "..."},
      {"role": "assistant", "content": "..."}
    ],
    "current_user_message": "..."
  },
  "assistant_output": "...",
  "qisa_result": {
    "prompt_structure": "Low | Medium | High",
    "debug": {
      "promptScore": 0,
      "features": {}
    }
  },
  "judge_result": {
    "rubric_version": "judge_rubric_v1",
    "scores": {
      "hard_constraints": 0,
      "task_completion": 0,
      "optional_tone": 0
    },
    "total": 0,
    "rationale": "Short justification (treat as fallible)."
  }
}
```
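A minimal writer for this shape, appending one JSONL record per turn and failing loudly on incomplete records. The field names match the template above:

```python
import json
from pathlib import Path

REQUIRED = {"test_id", "turn_index", "prompt_payload",
            "assistant_output", "qisa_result", "judge_result"}

def append_record(path: str, record: dict) -> None:
    """One JSONL line per turn keeps the whole run replayable in order."""
    missing = REQUIRED - record.keys()
    if missing:
        raise ValueError(f"incomplete record, missing: {sorted(missing)}")
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```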