Most document extraction systems start life as a three-line demo:

text = pdf.extract_text()
result = model.extract(schema, text)
save(result)

That demo is useful because it proves the shape of the product. It is also where the architecture usually starts lying to you.

The real system is not “PDF in, JSON out”. It is a distributed rendering, OCR, indexing, retrieval, agent execution, validation, and evaluation pipeline with unreliable inputs at every layer. The failure modes are not just “the model got the answer wrong”. They are:

  • the PDF has 180 pages and half of them are scans
  • the text layer exists but the reading order is unusable
  • a provider returns great table cells but times out on dense pages
  • a fallback model reads the page but loses bounding boxes
  • retries race with the first invocation and perform duplicate OCR
  • one CPU-heavy render blocks the worker long enough for the orchestrator to think it died
  • the agent solves the workbook, but you cannot prove which page supported a value

The unit of design has to change. I would not build this as a document pipeline. I would build it as a page pipeline with durable orchestration around it.

The architecture Link to heading

The core shape looks like this:

uploaded PDF
  -> stable document id
  -> Rust/PyO3 page renderer
  -> page image artifacts
  -> OCR readiness check
  -> one durable OCR workflow per page
  -> normalized block + bbox store
  -> source catalog
  -> agent workspace
  -> solver / reviewer / evaluator agents
  -> trace report + structured output

This is more moving parts than the demo. It is also simpler operationally because every expensive or unreliable step now has a stable boundary.

BoundaryKeyArtifactRetry behavior
PDF renderdocument id + render profilepage-001.pngrepeatable, CPU-bound
Page OCRpage id + OCR profiletext blocks + bounding boxesdedupe by page
Source catalogdocument id + extraction versionchunks, bboxes, page URLsappend or replace version
Agent workspaceexecution iddeterministic file treeresumable across runners
Evaluationexecution id + fixture versionreport, diffs, tracesreplayable

The important decision is making the page the smallest expensive unit. PDFs are too coarse. Words are too fine. Pages are the level where rendering, OCR, visual evidence, and concurrency all line up.

Rendering pages is not boring plumbing Link to heading

I used to think PDF rendering was an implementation detail. It is not.

If the OCR provider wants images, you have to turn PDF pages into PNGs. Doing that inside a Python web worker seems fine until a large document hits the service. The path is usually:

page_images = render_pages(
    pdf_bytes,
    dpi=150,
)
upload(page_images)

That hides three separate resource problems:

  1. Rendering is CPU-heavy.
  2. Image bytes can get huge at high DPI.
  3. Python workers still need to respond to health checks, keep-alives, and orchestration calls while rendering runs.

A better implementation is a Rust extension called from Python:

let threads = min(cpu_budget(), page_count);
let chunk_size = page_count.div_ceil(threads);

let results = py.allow_threads(|| {
    thread_pool.install(|| {
        page_chunks
            .into_par_iter()
            .map(|pages| {
                render_chunk(
                    pdf_bytes.clone(),
                    pages,
                    target_dpi,
                )
            })
            .collect::<Vec<_>>()
    })
});

The details matter:

  • bound the Rayon thread pool instead of blindly using every core
  • compute a worker budget from container CPU limits, not laptop assumptions
  • sort rendered page results before returning them
  • reduce DPI when an encoded page image would exceed a payload limit
  • release the Python GIL during the Rust/Rayon phase

That last point is the sort of issue that only shows up under real pressure. If a Python worker calls into Rust and the extension holds the GIL for the entire render, the asyncio loop can stop making progress. The render may still be “fast” in isolation, but the service around it becomes unhealthy. Keep-alive pings miss. The durable runtime retries. Now you have a rendering problem pretending to be a workflow problem.

The test for this is not a golden image snapshot. It is a heartbeat test:

async def test_renderer_keeps_loop_alive():
    pdf = make_large_pdf(page_count=80)
    heartbeats = []

    async def heartbeat():
        while not done.is_set():
            heartbeats.append(time.monotonic())
            await asyncio.sleep(0.05)

    extraction = loop.run_in_executor(
        pool,
        render_pdf_pages,
        pdf,
    )
    await asyncio.gather(
        heartbeat(),
        extraction,
    )

    assert max_gap(heartbeats) < 1_000  # ms

This catches the actual class of failure: CPU-bound native code starving the Python runtime around it.

Upload pages in batches, not in a loop Link to heading

After rendering, each page should become an addressable artifact:

documents/{document_id}/pages/page-001.png
documents/{document_id}/pages/page-002.png
documents/{document_id}/pages/page-003.png

The naive upload loop is easy:

for page in pages:
    await upload(page)

At scale, this creates two problems. It underutilizes the network for big documents, and it gives you terrible partial-failure behavior. I prefer fixed-size batches with a bounded fan-out:

batch_futures = [
    upload_page_batch(ctx, batch)
    for batch in chunked(
        page_images,
        batch_size=50,
    )
]

completed = await restate_gather(*batch_futures)

The batch size becomes a real tuning knob. Too small and overhead dominates. Too large and one failed upload retries too much work. The page count and total output bytes should be span tags, not log trivia:

pdf.download       input_bytes=18_423_311
pdf.render_pages   pages=173 split_ms=42_810
pdf.upload_batch   bytes=97_044_182
pdf.upload_batch   upload_ms=8_902

Those three numbers explain most performance incidents in this part of the system.

OCR readiness is a state machine Link to heading

Once pages exist, the next mistake is to run OCR for every page every time. That is expensive and makes retries noisy.

The readiness function should answer a narrower question:

For this request, which page ids still lack acceptable OCR artifacts?

That means the system checks existing state before it does work:

async def ensure_pages_ocr_ready(ctx, page_ids):
    unique_page_ids = dedupe(page_ids)

    existing = await ctx.run(
        "check-existing-ocr",
        lambda: db.fetch_pages_with_ocr(
            unique_page_ids,
        ),
    )

    missing = [
        page_id
        for page_id in unique_page_ids
        if page_id not in existing
    ]

    for batch in chunked(missing, size=64):
        await restate_gather(*[
            perform_page_ocr(ctx, page_id)
            for page_id in batch
        ])

    return unique_page_ids

There are two subtle but important choices here.

First, dedupe page ids before dispatch. A source catalog, a user selection, and an agent request can all point at the same page. The OCR layer should not care how many upstream paths reached it.

Second, treat “OCR workflow already exists” as a success path with waiting, not as an error. In a distributed system, two callers will eventually ask for the same page at the same time. One should win. The other should poll for the resulting OCR chunks and continue.

try:
    await start_ocr_workflow(page_id)
except AlreadyInvoked:
    await wait_until(
        lambda: ocr_chunks_exist(page_id),
        timeout=120,
    )

That one branch prevents retry storms from becoming duplicate provider bills.

Provider fallback is a capability problem Link to heading

OCR providers are not interchangeable. Some are better at dense tables. Some preserve bounding boxes more reliably. Some handle watermarks. Some are cheaper. Some are more available.

So the abstraction should not be:

class OcrProvider:
    async def extract(image: bytes) -> str: ...

That throws away the only facts that matter. I want the interface to expose capabilities and return normalized blocks:

class OcrProvider(Protocol):
    name: str
    timeout: timedelta
    capabilities: set[Capability]

    async def extract(
        self,
        image: bytes,
    ) -> list[Block]:
        ...

@dataclass(frozen=True)
class Block:
    text: str
    kind: BlockKind
    page: int
    bbox: BBox
    confidence: float | None

Then routing becomes explicit:

required = {Capability.TABLE_BBOX}
primary = registry.match(required)
fallback = registry.rest(primary)

for provider in [*primary, *fallback]:
    try:
        blocks = await ctx.run(
            f"ocr-{provider.name}",
            lambda: provider.extract(image),
        )
        return normalize(blocks)
    except ProviderFailed:
        continue

raise TerminalError("ocr failed")

The ctx.run boundary is non-negotiable. OCR is an external side effect. If the worker dies after the provider returns but before the workflow completes, replay should use the journaled provider result instead of charging the provider again.

I also like a two-stage model fallback for pages that look table-heavy:

cheap vision pass:
  "Is this page mostly a dense table?"

if yes:
  use stricter table prompt
  use stronger model
else:
  use normal block extraction prompt

The goal is not to build a clever router. The goal is to pay for the expensive path only when the page shape justifies it.

Store evidence, not just answers Link to heading

The output of OCR should not be a blob of text. It should be a source catalog:

{
  "chunk_id": "chunk_742",
  "page_id": "page_018",
  "page_number": 18,
  "text": "Revenue increased 14.2%",
  "bbox": {
    "left": 0.12,
    "top": 0.44,
    "width": 0.51,
    "height": 0.03
  },
  "source_image": "pages/page-018.png",
  "provider": "layout-ocr",
  "profile": "ocr-v7"
}

This is what makes the rest of the system debuggable. When an agent writes 14.2% into a spreadsheet, I want the reviewer to be able to click the value and see the exact page crop that supported it.

Flattened text is fine for search. It is not enough for trustworthy automation.

Agent workers need stable workspaces Link to heading

Once the source catalog exists, agents can do the higher-level work: fill a model, reconcile a workbook, produce a narrative summary, or inspect conflicts. The agent should not be handed random signed URLs and a giant prompt. It should get a stable workspace.

input/
  document.pdf
  pages/
    page-001.png
    page-002.png
  sources/
    chunks.jsonl
    bboxes.jsonl
  task.md

output/
  answer.xlsx
  notes.md

The file references inside that workspace should be deterministic. If the primary runner fails and the job moves to a fallback runner, the same logical inputs should appear at the same paths. Otherwise, retries turn into prompt drift.

The orchestration layer should also treat agent runners like providers:

for attempt in failover_chain:
    prompt = render_prompt(
        driver=attempt.driver,
        workspace=workspace,
    )

    try:
        result = await run_agent_with_retries(
            attempt=attempt,
            prompt=prompt,
            workspace=workspace,
        )
        return result
    except RetryableAgentFailure:
        await ctx.sleep(backoff(attempt))
        continue

A fallback runner may need a different prompt shape or tool envelope. Re-rendering the prompt per runner is cleaner than pretending every agent runtime speaks the same dialect.

The useful thing here is not “multi-agent” as a buzzword. It is separating roles:

RoleJob
OCR workerproduce page-grounded evidence
Solver agentcomplete the user-facing artifact
Reviewer agentverify evidence and catch contradictions
Evaluatorcompare traces and outputs against fixtures

That split lets you improve one role without rewriting the entire pipeline.

Durable execution changes how you write code Link to heading

Restate is useful here because document pipelines are full of long-running steps and external side effects. The mental model is that every meaningful operation either becomes a journaled durable step or remains deterministic code that can safely replay.

That changes small coding habits:

# Good: external I/O is journaled
ocr_blocks = await ctx.run(
    "provider-ocr",
    call_provider,
)

# Good: durable sleep, not asyncio.sleep
await ctx.sleep(timedelta(seconds=10))

# Good: Restate-aware gather
results = await restate_gather(*page_workflows)

The anti-pattern is hiding orchestration inside a durable step:

# Bad: ctx operations inside ctx.run
await ctx.run(
    "do-everything",
    lambda: start_child_workflow(ctx, page_id),
)

Durable execution is not magic. It rewards boring boundaries. Provider calls, DB writes, blob uploads, and notification sends get wrapped. Pure transformations stay pure. Child workflows stay visible as child workflows.

When this discipline is followed, a crash during a 170-page document is annoying but not mysterious. The workflow resumes from the journal. Already-uploaded pages remain uploaded. Already-completed OCR calls are not repeated. In-flight child workflows can be attached to, waited on, or observed.

Evaluation has to include the trace Link to heading

For agentic document systems, evaluating only the final answer is too weak.

Imagine a workbook cell is correct:

Revenue growth = 14.2%

That can still hide bad behavior:

  • the agent copied the value from a nearby example
  • it used the wrong document version
  • it ignored a contradicting footnote
  • it got lucky because the fixture was easy
  • it wrote the right number but left no evidence trail

I want eval reports that include:

final output diff
stdout / stderr timeline
tool calls
source chunks used
page crops referenced
provider fallback path
agent runner attempts
elapsed time by phase

This makes regressions much easier to reason about. A model upgrade that keeps accuracy flat but doubles fallback usage is not free. A prompt change that improves one fixture by reading entire PDFs instead of using source chunks may be worse operationally.

Metrics I would page on Link to heading

The dashboard for this system should be built around phase behavior, not just request success rate.

MetricWhy I care
render time per pagecatches pathological PDFs and CPU starvation
rendered bytes per pagecatches DPI and payload explosions
OCR reuse ratemeasures whether dedupe/idempotency works
OCR provider fallback ratedetects quality or availability drift
duplicate invocation waitsshows concurrent demand for same page
source chunks per pagecatches over/under segmentation
agent retry countdetects flaky runners or weak prompts
eval pass rate by fixture familyprevents silent degradation
evidence coveragetells whether answers are grounded

The last one matters the most. If a system produces polished artifacts without evidence coverage, it is just automation theater.

Tests that pay for themselves Link to heading

The tests for this kind of pipeline should look like production incidents in miniature.

[x] large PDF render keeps the event loop alive
[x] upload batching preserves page order
[x] OCR readiness skips existing chunks
[x] duplicate OCR waits for the winner
[x] provider failure falls through
[x] fallback result normalizes
[x] replay does not repeat side effects
[x] failover recreates the workspace
[x] eval includes diff and evidence

The point is not high coverage for its own sake. The point is to encode the failures that are expensive to rediscover.

The part that feels counterintuitive Link to heading

The more agentic the system becomes, the more deterministic the infrastructure around it needs to be.

Agents are useful because they can handle messy tasks. They are also harder to reason about than normal code. So the surrounding system should remove as much ambiguity as possible:

  • stable page ids
  • stable workspace paths
  • explicit provider capabilities
  • bounded concurrency
  • journaled side effects
  • source-grounded outputs
  • replayable eval traces

That is the actual architecture. Not “use OCR”. Not “use an LLM”. Not “add retries”. The system works when every expensive action has a durable identity, every answer has evidence, and every fallback leaves a trace.

References Link to heading