
pAIge

Synthetic Document Generation


Problem Scope

LLMs alone struggle with spatial layout, and standard mock data looks fake. Training OCR and document AI models requires realistic synthetic documents at scale.

Role

ML Engineer

Tech Stack

  • Field Bank pattern: constrain the LLM to pre-generated field values before it generates anything
  • Multi-stage orchestration with LangChain
  • Groq + Cerebras for inference speed
  • Augraphy for visual degradation (realistic aging)

Case Study

Architected a 6-stage LangChain pipeline that uses a strict Field Bank to produce realistic, test-verified synthetic OCR training data. I built it to address two bottlenecks: LLMs hallucinating mock data and failing at complex document layouts.

Constraint & Solution

The pipeline has 6 stages: (1) Faker generates the data first, (2) the Field Bank constrains the LLM to that data, (3) the Architect creates a layout manifest, (4) the Resolver fills the manifest's placeholders, (5) the Renderer outputs a PDF, and (6) Augraphy degrades the result visually.
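The Field Bank and Resolver stages can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `FIELD_BANK` contents, the manifest shape, the `{{key}}` placeholder syntax, and the `resolve` function are all assumptions made for the example, and fixed literals stand in for Faker output so the sketch stays deterministic.

```python
import re

# Hypothetical Field Bank. In the pipeline described above these values
# would come from Faker; literals are used here to keep the sketch simple.
FIELD_BANK = {
    "full_name": "Jane Doe",
    "invoice_number": "INV-0042",
    "total_due": "$1,250.00",
}

# Hypothetical layout manifest standing in for the Architect stage's output.
# It refers to fields by key only and never contains literal values.
MANIFEST = [
    {"region": "header", "text": "Invoice {{invoice_number}}"},
    {"region": "body", "text": "Bill to: {{full_name}}"},
    {"region": "footer", "text": "Total due: {{total_due}}"},
]

# Assumed placeholder syntax: {{key}}
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def resolve(manifest, bank):
    """Fill every placeholder from the bank; fail loudly on unknown keys."""

    def substitute(match):
        key = match.group(1)
        if key not in bank:
            # The layout asked for a field that was never generated.
            raise KeyError(f"placeholder {key!r} is not in the Field Bank")
        return bank[key]

    return [
        {**block, "text": PLACEHOLDER.sub(substitute, block["text"])}
        for block in manifest
    ]


resolved = resolve(MANIFEST, FIELD_BANK)
for block in resolved:
    print(block["region"], "->", block["text"])
```

The point of the split is that the layout step only ever names fields, so any value that reaches the Renderer has to exist in the Field Bank first.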

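Key Metric: the constraint-driven architecture eliminates hallucinated field values, because every value is generated up front and the LLM works only within the Field Bank.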
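One way a claim like this could be checked automatically is to audit each layout for fields that were never generated. The sketch below is an assumption about how such a check might look, not the project's test suite: `audit_manifest` and the `{{key}}` placeholder syntax are hypothetical names chosen for illustration.

```python
import re

# Assumed placeholder syntax: {{key}}
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def audit_manifest(manifest_texts, bank_keys):
    """Return placeholder keys used by the layout but missing from the bank."""
    used = set()
    for text in manifest_texts:
        used.update(PLACEHOLDER.findall(text))
    return sorted(used - set(bank_keys))


# A layout that asks for one field that was never generated.
unknown = audit_manifest(
    ["Name: {{full_name}}", "SSN: {{ssn}}"],
    {"full_name"},
)
print(unknown)  # ['ssn']
```

An empty result means the layout stayed inside the Field Bank; a non-empty one could be used to reject the layout or re-prompt the model.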
