
The Structured Output Tax

Ravinder · 9 min read

The Tax Nobody Budgets For

You asked the model for JSON. The model gave you something that looks like JSON until your parser hits the object at position 4,217 and throws. Or the model returned valid JSON, but with a field that was silently truncated mid-sentence. Or the model refused to respond entirely because your schema was too constrained for the answer it wanted to give.

These are not bugs you fix once. They are structural consequences of asking a token-prediction model to emit a formal grammar. The "structured output tax" is the cumulative cost — in latency, reliability, and engineering complexity — that you pay every time you ask an LLM to speak machine instead of human.

This post documents the failure modes and the mitigation stack.


A Taxonomy of Failures

Structured output failures cluster into four categories:

```mermaid
flowchart TD
    A[Structured output request] --> B{Model response}
    B --> C[Valid JSON, correct schema]
    B --> D[Valid JSON, wrong schema]
    B --> E[Invalid JSON - parse error]
    B --> F[Refusal - no JSON returned]
    D --> D1[Missing required field]
    D --> D2[Wrong type coercion]
    D --> D3[Extra fields not in schema]
    E --> E1[Truncation at token limit]
    E --> E2[Markdown wrapping]
    E --> E3[Escaped character errors]
    E --> E4[Nested object depth exceeded]
    F --> F1[Schema too constrained]
    F --> F2[Content policy conflict]
    F --> F3[Ambiguous instruction]
```

Each failure mode has a different root cause and a different mitigation. Treating them all as "JSON parse errors" leads to over-engineered retry loops that do not actually address the underlying issue.


Failure Mode 1: Markdown Wrapping

The most common failure, and the easiest to fix. Many models — especially when not using a native JSON mode — wrap their JSON output in markdown code fences:

```json
{"name": "Alice", "age": 32}
```
This breaks `json.loads()` immediately. The fix is extraction before parsing:
 
```python
import json
import re
 
def extract_json(text: str) -> dict | list:
    """
    Extract JSON from model output that may contain markdown fences,
    leading/trailing prose, or other formatting artifacts.
    """
    # Strip markdown code fences
    fence_pattern = re.compile(r"```(?:json)?\s*([\s\S]*?)```", re.IGNORECASE)
    match = fence_pattern.search(text)
    if match:
        text = match.group(1).strip()
 
    # Try direct parse first
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
 
    # Find the first JSON object or array in the text
    obj_match = re.search(r"(\{[\s\S]*\}|\[[\s\S]*\])", text)
    if obj_match:
        try:
            return json.loads(obj_match.group(1))
        except json.JSONDecodeError:
            pass
 
    raise ValueError(f"No valid JSON found in model output: {text[:200]!r}")
```

Apply this wrapper unconditionally whenever you are not using a provider's native structured output mode. It adds microseconds and saves you from the most common failure class.


Failure Mode 2: Silent Truncation

The model generates valid JSON but hits the max_tokens limit in the middle of a string value or before closing a nested object. The result is syntactically invalid JSON — but the corruption is silent. You do not know it happened unless you catch the parse error.

This is particularly insidious in streaming responses and in cases where you set max_tokens based on the prompt token budget without accounting for completion size.

Detection:

```python
def check_truncation_risk(
    prompt_tokens: int,
    model_context: int,
    max_completion: int,
) -> bool:
    """Return True if the completion budget is likely insufficient for structured output."""
    remaining = model_context - prompt_tokens
    # If max_completion is less than 20% of remaining context, flag as risky
    return max_completion < remaining * 0.20

# Heuristic: for a schema with N fields averaging 50 chars each,
# minimum safe max_tokens = N * 50 * 1.5 (overhead factor)
def min_tokens_for_schema(schema: dict) -> int:
    field_count = len(schema.get("properties", {}))
    return max(256, field_count * 75)
```

Mitigation: partial JSON recovery. For list-type outputs, you can often recover partial content even from a truncated response:

```python
import json

def recover_partial_json_list(text: str) -> list:
    """
    Attempt to recover a partial JSON array by finding complete objects
    before a truncation point.
    """
    # Try full parse first
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Find all complete JSON objects within the text
    objects = []
    depth = 0
    start = None

    for i, char in enumerate(text):
        if char == "{":
            if depth == 0:
                start = i
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0 and start is not None:
                try:
                    obj = json.loads(text[start:i+1])
                    objects.append(obj)
                except json.JSONDecodeError:
                    pass
                start = None

    return objects
```

This is not a complete solution — you lose data. But for use cases like "extract all entities from this document," returning 8 of 10 entities is better than returning an error.


Failure Mode 3: Schema Coercion and Type Errors

The model understands your intent but not your types. You asked for {"count": 42} and you got {"count": "42"}. Or you asked for an ISO 8601 date and got "March 15th, 2025".

Native structured output with JSON Schema enforcement (OpenAI's `response_format` with `strict: true`, or Anthropic's tool use with schema validation) prevents most type coercion issues. Use it when available.
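For reference, a strict-mode request payload on OpenAI's chat completions API looks roughly like the sketch below. The schema and name are illustrative, not from any real integration; at the time of writing, strict mode requires `additionalProperties: false` and every listed property in `required`:

```python
# Illustrative schema for an event-extraction call (hypothetical field names).
event_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "date": {"type": "string"},
        "attendee_count": {"type": "integer"},
    },
    "required": ["title", "date", "attendee_count"],
    "additionalProperties": False,  # strict mode rejects schemas without this
}

# Passed as the response_format argument to the chat completions call.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "extracted_event",
        "strict": True,
        "schema": event_schema,
    },
}
```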

When it is not available, validate and coerce explicitly:

```python
from datetime import date, datetime
from typing import Optional

from pydantic import BaseModel, Field, validator

class ExtractedEvent(BaseModel):
    title: str = Field(..., min_length=1, max_length=200)
    date: date
    attendee_count: int = Field(..., ge=0)
    location: Optional[str] = None

    @validator("date", pre=True)
    def parse_date(cls, v):
        if isinstance(v, date):
            return v
        if isinstance(v, str):
            # Handle common model date formats
            for fmt in ("%Y-%m-%d", "%B %d, %Y", "%b %d, %Y", "%m/%d/%Y"):
                try:
                    return datetime.strptime(v, fmt).date()
                except ValueError:
                    continue
        raise ValueError(f"Cannot parse date: {v!r}")

    @validator("attendee_count", pre=True)
    def coerce_int(cls, v):
        if isinstance(v, str):
            # Strip commas and whitespace before converting
            return int(v.replace(",", "").strip())
        return int(v)

def parse_event(raw: dict) -> ExtractedEvent:
    try:
        return ExtractedEvent(**raw)
    except Exception as e:
        raise ValueError(f"Schema validation failed: {e}") from e
```

Pydantic validators that handle common model output patterns (string numbers, varied date formats) absorb the coercion tax without requiring retry loops.


Failure Mode 4: Refusals and Constrained Schema Conflicts

Sometimes the model refuses to populate a required field because the answer would violate its safety training. You asked for a JSON object with a reason field explaining why the request was denied — and the model returns {} or omits the field rather than saying something it considers harmful.

This is a schema design problem as much as a model behavior problem.

Design schemas for graceful degradation:

```python
# Fragile: required fields with no escape hatch
class AnalysisResult(BaseModel):
    sentiment: str          # model may refuse on sensitive topics
    key_claim: str          # model may refuse if claim is harmful
    confidence: float

# Better: optional fields with explicit null semantics
class AnalysisResult(BaseModel):
    sentiment: Optional[str] = None       # null = model declined to assess
    key_claim: Optional[str] = None       # null = model declined to extract
    confidence: Optional[float] = None
    declined_fields: list[str] = Field(default_factory=list)
    decline_reason: Optional[str] = None
```

And instruct the model explicitly:

```python
SYSTEM_PROMPT = """
Return a JSON object with the schema provided. If you cannot populate a field
due to content policy or insufficient information, set it to null and add the
field name to the `declined_fields` list with a brief reason in `decline_reason`.
Do not omit required fields — use null as the explicit null signal.
"""
```

This converts silent omission failures into observable, handleable signals.
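On the consuming side, a small helper (names hypothetical) can then split a parsed result into populated fields and declined ones, so downstream code never mistakes a policy decline for real data:

```python
def partition_result(result: dict) -> tuple[dict, list[str]]:
    """Split a parsed result into populated fields and declined field names."""
    declined = result.get("declined_fields", [])
    meta = {"declined_fields", "decline_reason"}
    # Keep only real, non-null fields that the model did not decline
    populated = {
        k: v for k, v in result.items()
        if k not in meta and k not in declined and v is not None
    }
    return populated, declined
```

Callers can then log or surface the declined fields explicitly instead of propagating nulls.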


Schema Design for Reliability

Schema shape directly affects failure rate. Simpler schemas fail less.

Prefer flat over nested. Every level of nesting adds a token boundary where truncation can occur and the model can lose track of the expected structure.

```python
# Fragile: deep nesting
schema_bad = {
    "order": {
        "customer": {
            "address": {
                "city": "string",
                "zip": "string"
            }
        }
    }
}

# Better: flat with prefixed keys
schema_good = {
    "order_customer_city": "string",
    "order_customer_zip": "string",
}
```

Prefer enums over freeform strings for constrained fields. The model is more reliable when it knows the valid set.
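As an illustration (field values hypothetical), a JSON Schema `enum` pins the valid set, and a cheap post-parse membership check catches drift even when no full validator is in the loop:

```python
# Enum constraint for a sentiment field (illustrative values).
sentiment_schema = {
    "type": "string",
    "enum": ["positive", "negative", "neutral", "declined"],
}

def check_enum(value: str, schema: dict) -> bool:
    """Return True if the value is in the schema's allowed set."""
    return value in schema.get("enum", [])
```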

Limit array fields. Long arrays (> 20 items) are a common truncation trigger. If you need a long list, extract it in a separate call or use streaming with partial recovery.


The Fallback Stack

No single mitigation handles all failure modes. The production approach is a tiered fallback:

```mermaid
flowchart TD
    A[LLM call with structured output mode] --> B{Parse success?}
    B -- Yes --> C[Validate with Pydantic]
    C -- Valid --> D[Return result]
    C -- Invalid --> E[Coerce with validators]
    E -- Success --> D
    E -- Failure --> F[Retry with simplified schema]
    B -- No --> G[Extract JSON from text]
    G -- Success --> C
    G -- Failure --> H{Truncation detected?}
    H -- Yes --> I[Partial recovery]
    I --> J[Return partial + flag incomplete]
    H -- No --> F
    F --> K{Retry success?}
    K -- Yes --> D
    K -- No --> L[Return error + log for review]
```

Cap retries at 2. Three LLM calls for a single structured output request is the maximum acceptable in a real-time system. If you are hitting the retry cap frequently on a particular schema, the schema is the problem — simplify it.


Provider Comparison: Structured Output Support

| Provider | Native JSON mode | Schema enforcement | Partial streaming | Refusal handling |
|---|---|---|---|---|
| OpenAI (gpt-4o) | Yes (`json_object`) | Yes (`strict: true`) | Yes | Silent omission |
| Anthropic Claude | Via tool use | Yes (tool schema) | Yes | Explicit decline |
| Google Gemini | Yes | Yes (response schema) | Yes | Varies |
| Ollama / local | Model-dependent | Via grammar sampling | Varies | Model-dependent |

Grammar-constrained sampling (used by some local model servers via llama.cpp grammar support) is the most reliable approach — it structurally prevents invalid JSON at the token sampling level. For self-hosted deployments, it is worth the implementation cost.
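As a rough sketch, a llama.cpp GBNF grammar for a single-field object might look like the following; the field name and character set are deliberately simplified for illustration:

```
root   ::= "{" ws "\"name\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z0-9 .,-]* "\""
ws     ::= [ \t\n]*
```

The llama.cpp repository ships a fuller `json.gbnf` grammar covering the complete JSON syntax, which is a better starting point than hand-rolling one.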


Key Takeaways

  • Structured output failures cluster into four types: markdown wrapping, truncation, schema coercion, and refusals — each has a different root cause and mitigation.
  • Always strip markdown fences and extract JSON before parsing; this handles the most common failure class with negligible overhead.
  • Silent truncation is detectable via token budget checks; for list outputs, partial JSON recovery returns partial data rather than total failure.
  • Use Pydantic validators that handle common model output patterns (string numbers, varied date formats) instead of relying on retry loops for type coercion.
  • Design schemas for graceful degradation: optional fields with explicit null semantics convert silent omissions into observable signals.
  • Flat schemas fail less than deeply nested ones; cap retries at 2 and treat frequent retry cap hits as a signal to simplify the schema.