AerixNova
AerixNova
AI Engineering9 min read

OCR + LLM: Building an Intelligent Document Processing Pipeline

How to combine OCR engines with large language models to automate structured data extraction from invoices, engineering drawings, purchase orders, and compliance documents.

Written by

Anbu

Published

The Document Processing Problem at Scale

Most businesses run on documents — invoices, purchase orders, shipping manifests, compliance certificates, engineering drawings. Processing these manually is expensive, error-prone, and creates operational bottlenecks. A mid-size manufacturing company handling 5,000 supplier invoices per month spends 150–300 hours monthly on manual data entry alone.

OCR + LLM pipelines — what the industry calls Intelligent Document Processing (IDP) — automate this end-to-end. The result: 80–95% reduction in manual extraction time with field-level accuracy exceeding 95%.

Architecture: OCR → Preprocessing → LLM Extraction → Validation

Stage 1: OCR — Text Extraction

OCR converts images and scanned PDFs into machine-readable text. Engine selection matters:

| Engine | Best For | Hosted/Self | |---|---|---| | AWS Textract | Complex tables, forms, enterprise scale | Managed | | Google Document AI | Pre-built invoice/receipt/ID models | Managed | | Azure Form Recognizer | Structured forms with templates | Managed | | Tesseract 5.x | Standard text, open-source requirement | Self-hosted | | PaddleOCR | Multilingual, lightweight | Self-hosted |

For engineering drawings containing mixed text, symbols, and technical diagrams, pure OCR is insufficient. Combine OCR for text regions with object detection (YOLO, Detectron2) for symbol and dimension recognition.

Stage 2: Image Preprocessing

Raw scanned documents often have quality issues that degrade OCR accuracy. A preprocessing pipeline should handle:

import cv2
import numpy as np

def preprocess_document(image_path):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Deskew
    coords = np.column_stack(np.where(gray < 128))
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle = 90 + angle
    M = cv2.getRotationMatrix2D(center, -angle, 1.0)
    rotated = cv2.warpAffine(gray, M, (w, h))

    # Denoise + threshold
    denoised = cv2.fastNlMeansDenoising(rotated, h=10)
    _, binary = cv2.threshold(denoised, 0, 255,
                               cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

Preprocessing alone can improve OCR accuracy by 15–25% on poor-quality scans.

Stage 3: LLM-Powered Field Extraction

Raw OCR output is messy — wrong line breaks, merged cells, noise characters. An LLM excels at making sense of imperfect OCR output and extracting structured fields.

Approach 1: Direct extraction prompt

EXTRACTION_PROMPT = """
Extract the following fields from this invoice text. Return JSON only.
Fields: invoice_number, vendor_name, invoice_date, due_date,
        line_items (list of: description, quantity, unit_price, total),
        subtotal, tax, total_amount, currency

Invoice text:
{ocr_text}

Return valid JSON only, no other text.
"""

Approach 2: Two-stage extraction — First, identify document type and layout; second, apply type-specific extraction schema. This handles diverse document layouts more robustly.

Approach 3: Vision LLM — GPT-4o and Claude 3.5 Sonnet can process document images directly, bypassing OCR entirely for well-scanned documents. This simplifies the pipeline at the cost of higher API spend.

Stage 4: Validation and Confidence Scoring

Never pass raw LLM extractions directly into downstream systems. Implement validation:

Math validation: subtotal + tax == total_amount (within rounding tolerance) Format validation: dates match expected formats, numeric fields are parseable as numbers Lookup validation: vendor name matches your supplier master data (fuzzy match with threshold) Confidence scoring: run extraction twice with temperature variation; high agreement = high confidence

Route low-confidence extractions (< 80% field confidence) to a human review queue. In a well-designed system, fewer than 10% of documents require human intervention.

Handling Engineering Drawings

Engineering drawings present unique challenges: dimension lines, tolerances, title blocks, revision histories, and technical symbols cannot be handled by text OCR alone.

AerixNova's engineering drawing pipeline:

  1. Region segmentation: CNN classifier separates title block, drawing area, notes section
  2. Title block extraction: Template-matched OCR extracts part number, revision, material, finish
  3. Dimension extraction: YOLO-based dimension line detector + OCR on dimension text
  4. Tolerance parsing: Regex + NLP parser converts ±0.05 and H7/g6 notations to structured tolerances
  5. BOM extraction: Table detector extracts bill of materials from drawing notes

This pipeline processes an engineering drawing in 8–15 seconds, extracting 30–50 structured fields with 92–97% accuracy.

Integration Patterns

Trigger options: Email attachment listener (AWS SES + Lambda), folder watch (S3 event trigger), API endpoint (POST /process-document), ERP webhook

Output destinations: ERP system via API (SAP, Oracle), PostgreSQL structured tables, downstream approval workflows (Zapier, Make, custom), document management systems (SharePoint, Confluence)

Processing volume: A single-threaded Python service handles ~200 documents/hour. Containerised with horizontal scaling on Kubernetes, the same code handles 5,000+ documents/hour.

Real-World Results

AerixNova deployed an IDP pipeline for a logistics company processing 8,000 shipping manifests monthly. The result: manual data entry time dropped from 320 hours/month to 18 hours (for exception review only). Field extraction accuracy: 96.2% across all document types. The system paid for itself in under 3 months.

For a manufacturing client, an engineering drawing automation pipeline reduced drawing interpretation time for their pre-sales team from 4 hours per RFQ to 22 minutes — a 5x improvement enabling them to respond to 3x more customer enquiries without adding headcount.

Enterprise Solutions

Stop reading. Start automating.

Don't let legacy processes hold you back. Let's discuss a custom strategy to reduce your operations cost.