pAIge

Synthetic Document Generation

Problem Scope

LLMs alone struggle with spatial layout, and standard mock data looks fake. Training OCR and document AI models requires realistic synthetic documents at scale.

Role

ML Engineer

Tech Stack

Field Bank pattern—constrain LLMs to real data before generation
Multi-stage orchestration with LangChain
Groq + Cerebras for inference speed
Augraphy for visual degradation (realistic aging)

Case Study

Architected a 6-stage LangChain pipeline that uses a strict Field Bank to produce realistic, test-verified synthetic OCR training data. I built it to address a recurring bottleneck: LLMs hallucinate mock data and fail at complex document layouts.

Constraint & Solution

Pipeline with 6 stages: (1) Faker generates the data first, (2) the Field Bank constrains the LLM to that data, (3) the Architect creates a layout manifest, (4) the Resolver fills the manifest's placeholders, (5) the Renderer outputs a PDF, (6) Augraphy degrades the result visually.
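
The Field Bank and Resolver stages can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the manifest shape, the {{name}} placeholder syntax, and the function names build_field_bank, unknown_fields, and resolve are all assumptions made for the example, and the standard library's random module stands in for Faker so the snippet runs without dependencies. The idea it shows is that the LLM-produced manifest may only reference field names, and any name outside the Field Bank is rejected before values are filled in.

```python
import random
import re

# Hypothetical placeholder syntax: {{field_name}}
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def build_field_bank(seed: int) -> dict[str, str]:
    """Generate every value up front (stand-in for the Faker stage)."""
    rng = random.Random(seed)
    return {
        "vendor_name": rng.choice(["Northwind Supply", "Harbor Lane Freight"]),
        "invoice_number": f"INV-{rng.randint(10000, 99999)}",
        "total_due": f"${rng.randint(100, 9999)}.{rng.randint(0, 99):02d}",
    }


def unknown_fields(manifest: list[dict], bank: dict[str, str]) -> set[str]:
    """Return placeholder names the manifest uses that the bank does not define."""
    used = {
        name
        for block in manifest
        for name in PLACEHOLDER.findall(block["text"])
    }
    return used - set(bank)


def resolve(manifest: list[dict], bank: dict[str, str]) -> list[dict]:
    """Fill placeholders from the bank; refuse manifests that go outside it."""
    missing = unknown_fields(manifest, bank)
    if missing:
        raise ValueError(f"fields outside the Field Bank: {sorted(missing)}")
    return [
        {**block, "text": PLACEHOLDER.sub(lambda m: bank[m.group(1)], block["text"])}
        for block in manifest
    ]


# Assumed shape of a layout manifest as the Architect stage might emit it.
manifest = [
    {"region": "header", "text": "Invoice {{invoice_number}}"},
    {"region": "body", "text": "Billed by {{vendor_name}}"},
    {"region": "footer", "text": "Total due: {{total_due}}"},
]

bank = build_field_bank(seed=7)
for block in resolve(manifest, bank):
    print(block["region"], "->", block["text"])
```

Because the manifest carries only field names, a placeholder such as {{tax_id}} that the bank never generated raises an error at the Resolver instead of reaching the rendered PDF.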

Key Metric: Constraint-driven architecture keeps LLM-hallucinated values out of the generated documents

Feature 1

Field Bank pattern—constrain LLMs to real data before generation

Feature 2

Multi-stage orchestration with LangChain

Feature 3

Groq + Cerebras for inference speed

Feature 4

Augraphy for visual degradation (realistic aging)