Q-ISA Documentation

Q-ISA and LLM-as-Judge

This page explains how Q-ISA and LLM-as-Judge differ, where they complement one another, and what a combined testing workflow looks like when both tools are run in a single evaluation pipeline.

Core distinction: LLM-as-Judge evaluates outputs after generation. Q-ISA measures prompt structure. Q-ISA is a pre-generation instrument in principle, but many deployments (including log-based demos) compute it post-hoc from captured turns. Used together, they provide visibility across the full prompt → response pipeline without conflating measurement with judgment.
Implementation note: If you want true pre-generation Q-ISA measurement in an agent stack, you must capture the full prompt payload exactly as it is sent (system + developer + retrieved context + conversation + current user message). Otherwise, Q-ISA can still be computed from conversation logs, but it becomes a post-hoc diagnostic.
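
For example, a minimal capture sketch in Python (illustrative only, not part of Q-ISA; the field names mirror the harness template later on this page):

import datetime
import json

def capture_prompt_payload(system, developer, retrieved_context,
                           conversation_history, current_user_message):
    # Record the prompt exactly as it will be sent, before generation,
    # rather than reconstructing it later from conversation logs.
    return {
        "captured_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "system": system,
        "developer": developer,
        "retrieved_context": retrieved_context,
        "conversation_history": conversation_history,   # list of {"role", "content"} dicts
        "current_user_message": current_user_message,
    }

payload = capture_prompt_payload(
    system="You are a voice agent ...",
    developer="Always confirm user intent before taking action ...",
    retrieved_context="",
    conversation_history=[],
    current_user_message="Cancel my 3pm appointment.",
)
print(json.dumps(payload, indent=2))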

Comparison

What is LLM-as-Judge?

LLM-as-Judge is an evaluation pattern where a language model assesses the output of another model (or its own output) against explicit criteria. It operates after generation and evaluates what was said.

  • Scores compliance against an explicit rubric (often binary)
  • Compares two responses
  • Checks instruction adherence at the response level
  • Scales evaluation when human labeling is expensive

Answers: “How does this output compare to expectations?”

What is Q-ISA?

Q-ISA (Question–Interrogative Structure Analysis) is a structural measurement tool that evaluates the configuration of the prompt itself, independent of model internals or generated output. It measures how the inquiry is configured and whether that configuration is being diluted over time.

  • Measures interrogative structure (pre-generation in principle; often log-derived in practice)
  • Quantifies how constraints and specificity are distributed
  • Detects structural dilution from context accumulation (“context bloat”)
  • Produces traceable, rule-based scoring (not semantic judgment)

Answers: “Is this prompt still structurally exerting the control we think it is?”
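
To make "rule-based, traceable" concrete, here is a toy sketch of deterministic prompt features. These feature names are hypothetical placeholders, not Q-ISA's actual feature set; the point is only that prompt-side signals can be computed without any semantic judgment:

import re

def toy_prompt_features(prompt_text: str) -> dict:
    # Hypothetical, deterministic features -- NOT the actual Q-ISA feature set.
    words = prompt_text.split()
    constraint_markers = len(re.findall(r"\b(must|never|always|only)\b", prompt_text, re.IGNORECASE))
    return {
        "word_count": len(words),
        "constraint_marker_count": constraint_markers,
        "question_count": prompt_text.count("?"),
        # A density ratio makes dilution visible: the same constraints spread
        # across a much larger context score lower, with a traceable reason why.
        "constraint_density": constraint_markers / max(len(words), 1),
    }

print(toy_prompt_features("Always confirm intent. Never mention internal tools. What time works best?"))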

Why They Are Not Substitutes

Aspect | LLM-as-Judge | Q-ISA
Operates on | Generated output | Prompt structure
Timing | Post-generation | Prompt-side measurement (pre-generation in principle; often post-hoc in demos)
Signal type | Interpretive / semantic | Deterministic / structural
Best at detecting | Compliance failures against the rubric | Constraint dilution and context bloat
Explains why behavior changed | Indirectly (via interpretation) | Directly (via structural trace)

LLM-as-Judge may identify that behavior has degraded. Q-ISA helps explain when and where the prompt stopped working as intended. Many failures attributed to “model behavior” are actually structural dilution of the input. Without instrumentation at the prompt level, these failures are difficult to diagnose and easy to misattribute.

What a Combined Test Looks Like

A combined test is a two-lane evaluation pipeline: Q-ISA measures the prompt-side structure and LLM-as-Judge scores the response (post-generation). The value comes from correlating these signals across turns, especially under controlled context growth.

Test objective

Determine whether failures are driven primarily by:

  • prompt-side structural dilution (context bloat eroding the constraints), or
  • output-side factors (model variability, capability limits, or a rubric mismatch)

Minimal scenario (voice agent example)

Keep the user request constant while context grows across turns. This isolates “bloat” effects from task variation.

Constraints (example)

  • Always confirm user intent before taking action
  • Never mention internal tools
  • Output must be ≤ 60 words
  • If uncertain, ask one question
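
Several of these constraints are mechanically checkable before any judge is involved. A sketch, assuming illustrative tool names and confirmation phrasing; the fuzzier constraints (e.g., "ask one question if uncertain") are where the LLM judge adds value:

import re

def check_hard_constraints(output: str) -> dict:
    # Illustrative deterministic checks for the example constraints above.
    # The tool names and confirmation phrasing are assumptions for this sketch.
    internal_tools = ["calendar_api", "crm_lookup"]   # hypothetical internal tool names
    mentions_tool = any(tool in output.lower() for tool in internal_tools)
    confirms_intent = bool(re.search(r"\b(confirm|to confirm|just to check)\b", output, re.IGNORECASE))
    return {
        "word_count_ok": len(output.split()) <= 60,
        "no_internal_tools": not mentions_tool,
        "confirms_intent": confirms_intent,
    }

print(check_hard_constraints("Just to confirm: you want the 3pm appointment cancelled?"))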

Turns

  1. Turn 1: clean context, constraints are salient
  2. Turn 5: added history/logs/examples (“bloat”)
  3. Turn 10: larger context + redundant/conflicting instruction fragments

What you collect per turn

  • the exact prompt payload as sent (system, developer, retrieved context, conversation history, current user message)
  • the assistant output
  • the run condition (context level, retrieval on/off, model, temperature)

What you compute per turn

  • the Q-ISA result for the prompt payload
  • the judge result for the assistant output, scored against the fixed rubric

What you compare

  • the Q-ISA trajectory against the judge-score trajectory across turns 1, 5, and 10
  • which signal degrades first, and at which turn

Expected pattern if context bloat is causal: Q-ISA degrades first (prompt-side), then judge scores degrade (output-side). If judge degrades without Q-ISA movement, you are more likely looking at model variability, capability limits, or a rubric mismatch.
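
A sketch of that comparison, assuming both signals have been reduced to one numeric score per turn (the thresholds and numbers below are placeholders):

def first_degraded_turn(scores_by_turn, drop=0.2):
    # Return the first turn whose score falls more than `drop` (fractionally)
    # below the first turn's score, or None if no such turn exists.
    turns = sorted(scores_by_turn)
    baseline = scores_by_turn[turns[0]]
    for turn in turns:
        if scores_by_turn[turn] < baseline * (1 - drop):
            return turn
    return None

def attribute_failure(qisa_by_turn, judge_by_turn):
    q_drop = first_degraded_turn(qisa_by_turn)
    j_drop = first_degraded_turn(judge_by_turn)
    if j_drop is None:
        return "no output-side degradation observed"
    if q_drop is not None and q_drop <= j_drop:
        return "prompt-side structure degrades first: consistent with context bloat"
    return "judge degrades without Q-ISA movement: suspect model variability, capability limits, or rubric mismatch"

qisa_by_turn = {1: 0.9, 5: 0.6, 10: 0.4}    # illustrative per-turn scores
judge_by_turn = {1: 1.0, 5: 0.9, 10: 0.5}
print(attribute_failure(qisa_by_turn, judge_by_turn))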

What Is Required for Setup

1) A fixed test harness (non-negotiable)

You need an automated runner that records a replayable artifact per turn. At minimum, each record should contain:

  • the exact prompt payload as sent
  • the assistant output
  • the Q-ISA result and the judge result
  • the run condition (context level, retrieval, model, temperature)

2) A stable prompt capture format

Voice-agent stacks vary (messages array vs concatenated prompt), but you must choose one canonical representation and keep it consistent. The key requirement is fidelity: you must store the prompt exactly as sent.
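
A sketch of normalizing either capture style into the prompt_payload shape used by the harness template below (the role names and the extra raw_prompt field are assumptions about your stack):

def to_canonical_payload(messages=None, concatenated_prompt=None):
    # Normalize either a messages array or a flat concatenated prompt into one shape.
    # "raw_prompt" is an extra field (not in the template) for stacks that only
    # expose a flat prompt: fidelity beats structure when splitting is unreliable.
    payload = {"system": "", "developer": "", "retrieved_context": "",
               "conversation_history": [], "current_user_message": "",
               "raw_prompt": None}
    if concatenated_prompt is not None:
        payload["raw_prompt"] = concatenated_prompt
        return payload
    for m in messages or []:
        role, content = m["role"], m["content"]
        if role == "system":
            payload["system"] += content
        elif role == "developer":
            payload["developer"] += content
        else:  # "user" / "assistant" turns
            payload["conversation_history"].append({"role": role, "content": content})
    history = payload["conversation_history"]
    if history and history[-1]["role"] == "user":
        payload["current_user_message"] = history.pop()["content"]
    return payload

print(to_canonical_payload(messages=[
    {"role": "system", "content": "You are a voice agent."},
    {"role": "user", "content": "Cancel my 3pm appointment."},
]))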

3) A judge rubric (keep it boring)

The judge must evaluate against a finite rubric, using binary constraints wherever possible. Example scoring categories (matching the harness template below):

  • hard_constraints: did the response satisfy each explicit constraint (binary per constraint)
  • task_completion: did the response address the user request
  • optional_tone: lower-weight style and tone criteria, kept separate from the hard checks

Security note: if you use a hosted LLM-as-Judge, keep API keys server-side (e.g., via an API route) and never expose keys in client code.
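
A sketch of the judge lane; the rubric wording is illustrative, and call_judge_model is a stub standing in for your own server-side provider call:

import json

JUDGE_RUBRIC_V1 = (
    "Score the assistant output against these checks and return JSON only:\n"
    '{"hard_constraints": 0 or 1, "task_completion": 0 or 1, "optional_tone": 0 or 1}\n'
    "Checks: confirmed user intent before acting; never mentioned internal tools;\n"
    "output is 60 words or fewer; asked exactly one question if uncertain."
)

def call_judge_model(judge_prompt: str) -> str:
    # Stub: replace with a server-side call to your judge provider.
    # Keep API keys server-side (e.g., behind an API route); never ship them in client code.
    return '{"hard_constraints": 1, "task_completion": 1, "optional_tone": 0}'  # canned response

def judge_output(assistant_output: str) -> dict:
    raw = call_judge_model(f"{JUDGE_RUBRIC_V1}\n\nOutput to score:\n{assistant_output}")
    scores = json.loads(raw)  # fail loudly if the judge did not return valid JSON
    return {"rubric_version": "judge_rubric_v1", "scores": scores, "total": sum(scores.values())}

print(judge_output("Just to confirm: you want the 3pm appointment cancelled?"))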

4) A controlled “bloat ramp”

To test context bloat, generate controlled variants while holding the user request constant (see the sketch after this list):

  • baseline: clean context, constraints only
  • bloat levels (e.g., bloatA, bloatB, bloatC) that progressively add conversation history, logs, and examples
  • higher levels that also add redundant or conflicting instruction fragments
  • an identical current user message at every level, so any drift is attributable to the added context
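
A sketch of such a ramp generator; the filler content and level sizes are placeholders, and the one invariant is that current_user_message never changes:

import copy

FILLER = {  # placeholder bloat content; sizes are arbitrary
    "bloatA": "\n".join(f"[log {i}] earlier call transcript ..." for i in range(20)),
    "bloatB": "\n".join(f"[example {i}] unrelated worked example ..." for i in range(60)),
    "bloatC": "\n".join(f"[note {i}] redundant or conflicting instruction fragment ..." for i in range(120)),
}

def make_bloat_variant(baseline_payload: dict, level: str) -> dict:
    # Hold the user request constant; only the surrounding context grows.
    variant = copy.deepcopy(baseline_payload)
    if level != "baseline":
        variant["retrieved_context"] = (variant["retrieved_context"] + "\n" + FILLER[level]).strip()
    return variant

baseline_payload = {
    "system": "You are a voice agent ...",
    "developer": "Always confirm user intent before taking action ...",
    "retrieved_context": "",
    "conversation_history": [],
    "current_user_message": "Cancel my 3pm appointment.",
}
ramp = {level: make_bloat_variant(baseline_payload, level)
        for level in ("baseline", "bloatA", "bloatB", "bloatC")}
print({level: len(v["retrieved_context"]) for level, v in ramp.items()})  # context grows, request does not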

5) Control randomness

Fix the agent model version and temperature (the harness template below uses 0.2), use seeds where your provider supports them, and run several repetitions per condition so single-sample noise is not mistaken for degradation.

Minimal Harness Template

This is a neutral JSON shape you can implement in any stack (Node, Python, serverless). The intent is replayability: prompt payload + outputs + both analysis results in a single record per turn.

{
  "test_id": "voice_agent_bloat_v1",
  "run_id": "2025-12-17T00:00:00Z",
  "turn_index": 1,
  "condition": {
    "context_level": "baseline | bloatA | bloatB | bloatC",
    "retrieval": "on | off",
    "model": "agent-model-id",
    "temperature": 0.2
  },
  "prompt_payload": {
    "system": "...",
    "developer": "...",
    "retrieved_context": "...",
    "conversation_history": [
      {"role": "user", "content": "..."},
      {"role": "assistant", "content": "..."}
    ],
    "current_user_message": "..."
  },
  "assistant_output": "...",
  "qisa_result": {
    "prompt_structure": "Low | Medium | High",
    "debug": {
      "promptScore": 0,
      "features": {}
    }
  },
  "judge_result": {
    "rubric_version": "judge_rubric_v1",
    "scores": {
      "hard_constraints": 0,
      "task_completion": 0,
      "optional_tone": 0
    },
    "total": 0,
    "rationale": "Short justification (treat as fallible)."
  }
}
Operational note: the judge rationale is useful for debugging, but the primary signal should be the rubric scores. Treat free-form rationales as non-authoritative.
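
For concreteness, a sketch of a runner that writes one such record per turn as JSON Lines; run_agent, run_qisa, and run_judge are stand-ins for your own agent, Q-ISA, and judge integrations:

import json

def run_turn(test_id, run_id, turn_index, condition, prompt_payload,
             run_agent, run_qisa, run_judge, out_path="runs.jsonl"):
    # One replayable record per turn: prompt payload + output + both analysis results.
    assistant_output = run_agent(prompt_payload)
    record = {
        "test_id": test_id,
        "run_id": run_id,
        "turn_index": turn_index,
        "condition": condition,
        "prompt_payload": prompt_payload,
        "assistant_output": assistant_output,
        "qisa_result": run_qisa(prompt_payload),
        "judge_result": run_judge(assistant_output),
    }
    with open(out_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

Wire in your own callables and call run_turn once per turn of the bloat ramp; each appended line is a complete, replayable artifact.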