pAIge
Synthetic Document Generation

Problem Scope
LLMs alone struggle with spatial layout, and standard mock data looks fake. Training OCR and document AI models requires realistic synthetic documents at scale.
Role
ML Engineer
Tech Stack
- Field Bank pattern: constrain LLMs to real data before generation
- Multi-stage orchestration with LangChain
- Groq + Cerebras for inference speed
- Augraphy for visual degradation (realistic aging)
Case Study
I architected a 6-stage LangChain pipeline that uses a strict Field Bank to produce realistic, test-verified synthetic OCR training data. I built it to address a critical bottleneck: LLMs hallucinate mock data and fail at complex document layouts.
Constraint & Solution
The pipeline has six stages: (1) Faker generates the field data first, (2) the Field Bank constrains the LLM to that data, (3) the Architect creates a layout manifest, (4) the Resolver fills the placeholders, (5) the Renderer outputs a PDF, and (6) Augraphy degrades the result visually.
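To make the hand-off between stages 1 to 4 concrete, here is a minimal sketch using only the Python standard library. It is not the project's actual code. The hard-coded `field_bank` dict stands in for Faker output, the hard-coded `manifest` stands in for what the Architect stage might return, and the `{{key}}` placeholder syntax, the manifest fields, and the `resolve` function are all hypothetical names chosen for illustration.

```python
import re

# Stage 1 stand-in (assumption): in the described pipeline, Faker would
# generate these values before any LLM call is made.
field_bank = {
    "vendor_name": "Acme Supply Co.",
    "invoice_number": "INV-00421",
    "total_due": "$1,284.50",
}

# Stage 3 stand-in (assumption): a layout manifest as the Architect stage
# might emit it. It carries positions and placeholders, never raw values.
manifest = [
    {"region": "header", "x": 40, "y": 30, "text": "{{vendor_name}}"},
    {"region": "meta", "x": 400, "y": 30, "text": "Invoice {{invoice_number}}"},
    {"region": "footer", "x": 400, "y": 700, "text": "Total due: {{total_due}}"},
]

# Hypothetical placeholder syntax: {{field_name}}
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def resolve(manifest, bank):
    """Stage 4 sketch: return a copy of the manifest with placeholders filled.

    Raises KeyError if the manifest references a key missing from the bank.
    """
    resolved = []
    for block in manifest:
        text = PLACEHOLDER.sub(lambda m: bank[m.group(1)], block["text"])
        resolved.append({**block, "text": text})
    return resolved


resolved = resolve(manifest, field_bank)
for block in resolved:
    print(block["region"], (block["x"], block["y"]), block["text"])
```

In this arrangement the layout step decides where things go and the Resolver decides what they say, so a renderer downstream would only ever see values that originated in the Field Bank.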
Key Metric: Constraint-driven architecture eliminates hallucinated field values, because every value comes from the Field Bank rather than from the LLM.
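One way such a constraint could be enforced is to reject any layout that references a field the Field Bank does not contain. The sketch below assumes a hypothetical `{{key}}` placeholder syntax and a hypothetical `unknown_keys` helper. It illustrates the idea and is not the project's implementation.

```python
import re

# Hypothetical placeholder syntax: {{field_name}}
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def unknown_keys(manifest_texts, bank_keys):
    """Return, sorted, the placeholder names that are not Field Bank keys."""
    used = {name for text in manifest_texts for name in PLACEHOLDER.findall(text)}
    return sorted(used - set(bank_keys))


# Illustrative data only (assumption).
bank_keys = {"vendor_name", "invoice_number"}
good_layout = ["{{vendor_name}}", "Invoice {{invoice_number}}"]
bad_layout = ["{{vendor_name}}", "PO {{purchase_order}}"]

print(unknown_keys(good_layout, bank_keys))
print(unknown_keys(bad_layout, bank_keys))
```

A non-empty result would signal that the LLM invented a field, and the pipeline could then retry or fail the generation instead of rendering it.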