
Vision LLMs in Production: When to Ditch Tesseract and When to Keep It

Ravinder · 7 min read

A financial services team I worked with had a Tesseract pipeline that took 14 steps, including a deskewing library, a custom table-detector trained on 40,000 samples, and a heuristic to distinguish headers from data rows. It worked 89% of the time. They replaced it with a single GPT-4o call and got to 94% accuracy in a week. Then their invoice volume tripled, the bill hit $18k/month, and they quietly brought Tesseract back for the easy cases.

That is the real story of vision LLMs in production. Not "VLMs replace OCR" and not "Tesseract is fine." The answer is always a routing layer.

When VLMs Beat Classical OCR

Classical OCR systems (Tesseract, AWS Textract, Google Document AI) are pipelines of heuristics. They were built assuming documents are structured — clean fonts, consistent layouts, predictable table borders. When that assumption holds, they are fast and cheap.

VLMs beat classical OCR in four specific scenarios:

1. Handwritten or cursive content. Tesseract's LSTM model was not trained on handwriting at scale. GPT-4o and Claude 3.5 Sonnet handle mixed print/handwriting naturally because they were trained on images of real documents.

2. Complex multi-column layouts with mixed content types. A pharmaceutical report with regulatory tables, chemical diagrams, footnotes in three font sizes, and a header in a custom typeface will destroy any rule-based layout engine. A VLM reads it like a human does.

3. Context-dependent field extraction. A classical pipeline extracts text. A VLM understands that "Total Due" and "Balance Owed" on two different invoice templates mean the same thing. You do not need a separate NLP step.

4. Poor image quality or rotated documents. VLMs handle moderate rotation, blur, and low contrast without preprocessing. They have implicitly learned to compensate.

When Tesseract (and Classical Tools) Still Win

Be honest about VLM weaknesses or you will get burned in production.

High-volume, low-complexity documents. If you are processing 500,000 utility bills per day with a consistent template, Tesseract + a template matcher costs a few hundred dollars a month in compute. GPT-4o vision at that scale costs tens of thousands of dollars.

Strict character-level accuracy requirements. VLMs hallucinate. On a medical dosage field or a bank account number, a confident wrong digit is worse than an extraction failure. Classical OCR gives you confidence scores per character; VLMs give you plausible-sounding text. (The pytesseract sketch at the end of this section shows how to surface those scores.)

Offline or air-gapped environments. If your documents contain PII that cannot leave your network, cloud VLM APIs are not an option. Running a local VLM (LLaVA, Phi-3-Vision, InternVL) adds significant infrastructure complexity.

Sub-second latency requirements. GPT-4o vision calls take 3–8 seconds for a typical document page. Tesseract on the same page takes under 200ms.
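To make the confidence point concrete, here is a minimal sketch of pulling word-level confidence scores out of Tesseract via pytesseract (assuming pytesseract and Pillow are installed; the 80-point threshold is illustrative, not tuned). Anything flagged here can be escalated to a human or a VLM second pass instead of trusted blindly.

import pytesseract
from PIL import Image

def words_below_confidence(image_path: str, threshold: float = 80.0) -> list[dict]:
    """Return words whose Tesseract confidence falls below a threshold."""
    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    flagged = []
    for word, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # Tesseract reports -1 for non-word boxes
        if word.strip() and 0 <= conf < threshold:
            flagged.append({"word": word, "confidence": conf})
    return flagged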

Architecture: The Routing Layer

The production pattern is not VLM-or-classical. It is a classifier that routes each document to the right tool.

flowchart TD
    A[Incoming Document] --> B[Image Quality Check]
    B --> C{Quality Score}
    C -- Low quality --> D[Preprocessing: deskew, denoise, upscale]
    C -- Acceptable --> E[Document Classifier]
    D --> E
    E --> F{Document Type}
    F -- Known template, high volume --> G[Tesseract + Template Matcher]
    F -- Complex layout or handwritten --> H[VLM API Call]
    F -- Ambiguous --> I[Tesseract first]
    I --> J{Confidence > threshold?}
    J -- Yes --> K[Return Tesseract result]
    J -- No --> H
    G --> L[Post-processing & Validation]
    H --> L
    L --> M[Structured Output]

The classifier can be as simple as a rules-based check (template hash match + image sharpness score) or as sophisticated as a small fine-tuned classifier. For most teams, start with rules.
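As a concrete starting point, here is a minimal sketch of the rules-based version, assuming OpenCV for the sharpness score and the imagehash library for a perceptual template hash; KNOWN_TEMPLATE_HASHES and both thresholds are placeholders you would populate and tune from your own document set.

import cv2
import imagehash
from PIL import Image

KNOWN_TEMPLATE_HASHES: set[str] = set()  # hypothetical registry of known-template page hashes

def route_document(image_path: str, sharpness_threshold: float = 100.0) -> str:
    """Return 'preprocess', 'tesseract', or 'vlm' for an incoming page."""
    # Sharpness check: variance of the Laplacian is a cheap blur detector
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
    if sharpness < sharpness_threshold:
        return "preprocess"

    # Template check: perceptual hash against known layouts
    page_hash = str(imagehash.phash(Image.open(image_path)))
    if page_hash in KNOWN_TEMPLATE_HASHES:
        return "tesseract"

    # Anything unrecognized goes to the VLM path
    return "vlm"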

Prompting Strategies for Document Extraction

VLMs are sensitive to prompt structure for document tasks. Three patterns that work consistently:

Schema-first extraction. Give the model your output schema before showing the image. This anchors it to what you need and reduces hallucination on irrelevant fields.

import anthropic
import base64
import json
 
def extract_invoice_fields(image_path: str) -> dict:
    client = anthropic.Anthropic()
 
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")
 
    schema = {
        "invoice_number": "string or null",
        "vendor_name": "string or null",
        "invoice_date": "ISO 8601 date or null",
        "due_date": "ISO 8601 date or null",
        "line_items": [{"description": "string", "quantity": "number", "unit_price": "number", "total": "number"}],
        "subtotal": "number or null",
        "tax": "number or null",
        "total_due": "number or null",
        "currency": "3-letter ISO code or null"
    }
 
    prompt = f"""Extract the following fields from this invoice image.
Return ONLY valid JSON matching this exact schema. Use null for any field not found.
Do not invent values. If a number is ambiguous, return null.
 
Schema:
{json.dumps(schema, indent=2)}
 
Return only the JSON object, no explanation."""
 
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_data,
                    },
                },
                {"type": "text", "text": prompt}
            ],
        }]
    )
    return json.loads(response.content[0].text)

Confidence signaling. Ask the model to rate its own confidence per field. This is imperfect but useful as a signal for downstream validation.
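One way to implement this, sketched as an extension of the schema-first prompt above: ask for a parallel "confidence" object and treat anything below a cutoff as needing review. The three-level scale and the review routing are assumptions, not a library feature.

CONFIDENCE_SUFFIX = """
Additionally, return a "confidence" object mapping each top-level field name
to "high", "medium", or "low" based on how legible and unambiguous it was in the image."""

def fields_needing_review(extraction: dict, min_level: str = "high") -> list[str]:
    """Return field names whose self-reported confidence is below the required level."""
    order = {"low": 0, "medium": 1, "high": 2}
    confidence = extraction.get("confidence", {})
    return [
        field for field, level in confidence.items()
        if order.get(level, 0) < order[min_level]
    ]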

Chunk large documents. VLMs degrade on dense multi-page documents. Process one page at a time and merge structured outputs in your application layer, not in the prompt.
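A sketch of that pattern, assuming pdf2image (which needs Poppler installed) for rasterizing and reusing the extract_invoice_fields function from earlier; the merge rule (concatenate line_items, keep the first non-null value for every header field) is a simplifying assumption.

import os
import tempfile
from pdf2image import convert_from_path  # requires poppler

def extract_multipage_invoice(pdf_path: str) -> dict:
    """Rasterize each page, extract it independently, then merge in application code."""
    merged: dict = {"line_items": []}
    with tempfile.TemporaryDirectory() as tmp_dir:
        for i, page in enumerate(convert_from_path(pdf_path, dpi=200)):
            page_path = os.path.join(tmp_dir, f"page_{i}.jpg")
            page.save(page_path, "JPEG")
            result = extract_invoice_fields(page_path)  # schema-first call from earlier
            merged["line_items"].extend(result.get("line_items") or [])
            # Header fields: keep the first non-null value seen across pages
            for field, value in result.items():
                if field != "line_items" and value is not None and merged.get(field) is None:
                    merged[field] = value
    return merged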

Cost Modeling

Run this math before committing to a VLM-first architecture.

def monthly_cost_estimate(
    documents_per_day: int,
    avg_pages_per_doc: int,
    # GPT-4o vision pricing (approximate, verify current rates)
    cost_per_image_usd: float = 0.00765,  # ~1000 tokens input for a page
    routing_pct_to_vlm: float = 0.30,     # 30% go to VLM, 70% to Tesseract
) -> dict:
    pages_per_day = documents_per_day * avg_pages_per_doc
    vlm_pages_per_day = pages_per_day * routing_pct_to_vlm
    vlm_cost_per_day = vlm_pages_per_day * cost_per_image_usd
    vlm_cost_per_month = vlm_cost_per_day * 30
 
    # Tesseract: EC2 c6i.xlarge ~$0.17/hr, processes ~200 pages/min
    tesseract_pages_per_day = pages_per_day * (1 - routing_pct_to_vlm)
    tesseract_hours_per_day = (tesseract_pages_per_day / 200) / 60
    tesseract_cost_per_month = tesseract_hours_per_day * 0.17 * 30
 
    return {
        "vlm_monthly_usd": round(vlm_cost_per_month, 2),
        "tesseract_monthly_usd": round(tesseract_cost_per_month, 2),
        "total_monthly_usd": round(vlm_cost_per_month + tesseract_cost_per_month, 2),
    }
 
# Example: 10,000 docs/day, 3 pages each, 30% to VLM
print(monthly_cost_estimate(10_000, 3))
# → roughly $2,065 in VLM calls, ~$9 of Tesseract compute, ~$2,074/month total

At 10k documents/day with 30% VLM routing, you are spending ~$2k/month on VLM calls. At 100% VLM routing, that is $6.9k/month. The routing layer pays for itself almost immediately.

Eval for Vision Tasks

Standard LLM evals do not work for document extraction. You need field-level metrics.

Build a golden dataset of 200–500 documents with human-verified ground truth. For each field, measure:

  • Exact match rate: correct value, correct type
  • Partial match rate: correct value, wrong format (e.g. "2024-01-15" vs "01/15/2024")
  • Hallucination rate: model returned a value when ground truth is null
  • Null rate when present: model returned null when value exists

Track these per document type and per model version. When you change your prompt or switch models, re-run the eval before shipping.

def evaluate_extraction(predicted: dict, ground_truth: dict) -> dict:
    fields = set(ground_truth.keys()) | set(predicted.keys())
    results = {"exact": 0, "partial": 0, "hallucination": 0, "missed": 0, "total": len(fields)}
 
    for field in fields:
        gt = ground_truth.get(field)
        pred = predicted.get(field)
 
        if gt is None and pred is not None:
            results["hallucination"] += 1
        elif gt is not None and pred is None:
            results["missed"] += 1
        elif str(gt).strip().lower() == str(pred).strip().lower():
            results["exact"] += 1
        else:
            # "partial" here covers both format mismatches and genuinely wrong values;
            # add per-field normalization (dates, currency) to separate the two.
            results["partial"] += 1
 
    results["exact_rate"] = results["exact"] / results["total"]
    results["hallucination_rate"] = results["hallucination"] / results["total"]
    return results
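To get the per-document-type tracking described above, aggregate the per-document scores over the golden set. The record layout here (doc_type, predicted, ground_truth keys) is just one plausible shape for that dataset.

from collections import defaultdict

def evaluate_golden_set(records: list[dict]) -> dict:
    """Aggregate field-level metrics per document type across a golden dataset."""
    totals = defaultdict(lambda: {"exact": 0, "hallucination": 0, "fields": 0, "docs": 0})
    for record in records:
        scores = evaluate_extraction(record["predicted"], record["ground_truth"])
        bucket = totals[record["doc_type"]]
        bucket["exact"] += scores["exact"]
        bucket["hallucination"] += scores["hallucination"]
        bucket["fields"] += scores["total"]
        bucket["docs"] += 1
    return {
        doc_type: {
            "docs": b["docs"],
            "exact_rate": round(b["exact"] / b["fields"], 3),
            "hallucination_rate": round(b["hallucination"] / b["fields"], 3),
        }
        for doc_type, b in totals.items()
    }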

Key Takeaways

  • VLMs win on complex layouts, handwriting, context-dependent extraction, and poor image quality — not on high-volume, templated, low-complexity documents.
  • A routing classifier that sends 70% of traffic to Tesseract and 30% to a VLM cuts costs by roughly 3x versus VLM-only in the example above (more if a larger share stays on the classical path) with minimal accuracy loss.
  • Schema-first prompting and per-page chunking are the two highest-leverage prompt engineering techniques for document extraction.
  • Log confidence signals from the VLM and build feedback loops — hallucination rates drift when document types change.
  • Build a golden dataset eval before your first production deployment; re-run it on every model or prompt change.
  • Air-gapped and sub-second-latency requirements are hard blockers for cloud VLM APIs — plan your fallback architecture before those constraints surprise you.