The easiest way to make a document extraction system look good in a demo is to hide the evidence.
Ask a model to read a few PDFs, retrieve the top chunks, and fill a spreadsheet. If the answer looks plausible, ship the JSON. The failure usually arrives later, when somebody asks a very simple question:
Where did this number come from?
That question breaks a surprising number of systems. The value might be correct, but the cited source points at a row label. Or the answer cites a page that contains the right table but not the specific cell. Or the model found the number through search, copied it into the output, and lost the chunk id while formatting the final response. Or worse, the number is a valid value from the document, but it belongs to the wrong period.
This is the uncomfortable lesson: retrieval is not provenance.
Search can find candidates. It cannot prove that a generated value is covered by a source. Provenance needs a stricter contract: stable evidence ids, validation after generation, coverage obligations, and traces that explain why a candidate won.
I like designing these systems as evidence pipelines, not prompt pipelines.
The naive implementation Link to heading
The first version usually looks like this:
chunks = search(query="net revenue 2024", limit=8)
answer = model.generate(
schema=RevenueRow,
context="\n\n".join(chunk.text for chunk in chunks),
)
save(answer)
It is short. It also conflates four different jobs:
| Job | What it should do | What the naive version does |
|---|---|---|
| Retrieval | Find possible supporting material | Returns nearby text |
| Reasoning | Choose the right value | Happens inside an opaque response |
| Provenance | Identify the exact source artifact | Often copied as free text |
| Validation | Reject uncovered output | Usually missing |
The problem is not that vector search or BM25 is bad. The problem is that “chunk appears near query” is a weak claim. It says the chunk is textually related. It does not say the output value is present, belongs to the target field, matches the target period, or is the highest-authority source.
Consider this fictional source table:
Atlas Foods FY2024 Summary
Metric FY2023 FY2024
Revenue 91.2 104.8
Gross profit 42.1 48.7
Adjusted EBITDA 18.0 22.4
Now consider the target workbook:
Sheet: Operating Model
Metric 2024A
Net revenue [blank]
EBITDA [blank]
A retrieval system may return the row-label chunk:
{
"chunk_id": "chunk_441",
"text": "Revenue",
"page": 12,
"bbox": [82, 311, 144, 326]
}
That chunk helps disambiguate the row. It is not enough to cite the value 104.8. The value lives in a different cell:
{
"chunk_id": "chunk_442",
"text": "104.8",
"page": 12,
"bbox": [392, 311, 426, 326],
"row_path": "Revenue",
"column_path": "FY2024"
}
If the final workbook cites chunk_441, the output is not source-grounded even if the number is right. It is label-grounded. That distinction matters when you are reviewing thousands of generated cells.
Make evidence a first-class data model Link to heading
The fix starts before any model call. Do not index “chunks” as undifferentiated text. Build an evidence catalog.
flowchart LR
D["document"] --> P["pages"]
P --> B["blocks"]
P --> T["tables"]
T --> C["cells"]
C --> F["atomic facts"]
B --> F
F --> I["search index"]
F --> V["coverage validator"]
F --> A["audit trace"]
The important object is the atomic fact. It is the smallest unit that can safely support an output value.
{
"fact_id": "fact_018742",
"origin_type": "table_cell",
"value": "104.8",
"normalized_value": "104800000",
"value_type": "number",
"row_path": "Revenue",
"column_path": "FY2024",
"section_path": "Consolidated summary",
"source_document": "atlas_foods_fy2024.pdf",
"page_number": 12,
"bbox": [392, 311, 426, 326],
"search_text": "104.8 Revenue FY2024 Consolidated summary Atlas Foods"
}
This schema is intentionally boring. It gives every downstream component the same handles:
fact_idfor durable identityvalueandnormalized_valuefor exact comparisonrow_path,column_path, andsection_pathfor semantic alignmentpage_numberandbboxfor reviewsearch_textfor retrieval
You can still store larger page chunks and table chunks. They are useful for routing and context. They should not be the only citation primitive.
Retrieve tables before cells Link to heading
The next mistake is more subtle: retrieving facts one cell at a time.
It feels natural because the target is a cell. But table extraction is usually a block alignment problem. A workbook does not ask for isolated facts; it asks for a slice of a source table under a particular row hierarchy, column hierarchy, unit, period grain, and entity.
I prefer to compile both sides into canonical tables before matching:
{
"table_id": "src_table_012",
"origin": "source_pdf",
"table_type": "financial_summary",
"entity": "Atlas Foods",
"period_grain": "fiscal_year",
"unit": "USD",
"scale": "millions",
"title_path": ["FY2024 report", "Consolidated summary"],
"cells": [
{
"cell_id": "src_cell_12_04",
"row_path": ["Revenue"],
"column_path": ["FY2024"],
"value": "104.8",
"fact_id": "fact_018742"
}
]
}
The target workbook gets the same treatment:
{
"table_id": "target_table_003",
"origin": "target_workbook",
"sheet": "Operating Model",
"table_type": "operating_model",
"period_grain": "fiscal_year",
"unit": "USD",
"scale": "millions",
"cells": [
{
"cell_id": "target_cell_B4",
"row_path": ["Net revenue"],
"column_path": ["2024A"],
"is_fill_target": true
}
]
}
Now retrieval has a better job:
template tables
source PDF/XLSX tables
-> canonical table metadata
-> retrieve likely source tables
-> align row and column paths
-> fill cells
-> validate citations and coverage
The unit of retrieval becomes the table or block. The unit of citation remains the atomic fact. That distinction is the heart of the design.
Search should route, not decide Link to heading
For table retrieval, I usually combine several imperfect recall channels:
| Channel | Good at | Failure mode |
|---|---|---|
| Structured filters | entity, unit, period grain, statement type | bad metadata can hide the right table |
| BM25 / FTS | exact labels, period text, section titles | misses semantic synonyms |
| Vector search | fuzzy table summaries and row/column meanings | retrieves plausible but unauthoritative neighbors |
| Reranker | ordering a small candidate set | too expensive across every fact |
| Model adjudication | final ambiguity explanation | unsafe as the only source of truth |
The key is to use these channels for routing, not proof. Vector search is a recall channel. BM25 is a recall channel. A reranker is a recall channel. None of them proves that the final value is supported by a specific source cell.
One practical fusion method is reciprocal rank fusion:
from collections import defaultdict
def rrf(rankings: dict[str, list[str]], k: int = 60) -> list[tuple[str, float]]:
scores = defaultdict(float)
for _channel, ranked_ids in rankings.items():
for rank, item_id in enumerate(ranked_ids, start=1):
scores[item_id] += 1.0 / (k + rank)
return sorted(scores.items(), key=lambda item: item[1], reverse=True)
The k value is not a correctness threshold. It is a smoothing constant. The output should be treated as “tables worth inspecting next”, not “facts we can write”.
def retrieve_source_tables(target: CanonicalTable) -> list[TableCandidate]:
rankings = {
"filters": rank_by_metadata(target),
"fts": rank_by_fts(target.search_text),
"vectors": rank_by_embedding(target.summary_embedding),
}
fused_ids = [table_id for table_id, _score in rrf(rankings)]
short_list = load_tables(fused_ids[:50])
return rerank_tables(target, short_list)[:8]
def retrieve_candidates(target_cell: TargetCell) -> list[FactCandidate]:
source_tables = retrieve_source_tables(target_cell.parent_table)
alignments = align_tables(target_cell.parent_table, source_tables)
facts = facts_for_aligned_paths(
alignments,
row_path=target_cell.row_path,
column_path=target_cell.column_path,
)
return rank_facts(target_cell, facts)
The route query can be fuzzy because it only decides where to look next. The alignment step should be stricter. It should compare full row paths and full column paths, not just leaf labels.
That matters for nested headers:
Revenue
Product
Services
FY2024
Actual
Budget
If the target asks for Revenue > Services and FY2024 > Actual, the leaf labels alone are not enough. A source table can contain multiple “Actual” columns and multiple revenue rows. The full paths are the address.
def rank_fact(target: TargetCell, fact: AtomicFact) -> CandidateScore:
parts = {
"row": path_similarity(target.row_path, fact.row_path) * 8.0,
"column": path_similarity(target.column_path, fact.column_path) * 6.0,
"period": period_match(target.period, fact.column_path) * 10.0,
"unit": unit_compatible(target.unit, fact.unit) * 4.0,
"section": token_overlap(target.section, fact.section_path) * 3.0,
"shape": value_shape_match(target.number_format, fact.value_type) * 6.0,
"source": source_priority(fact.source_document),
}
return CandidateScore(
fact_id=fact.fact_id,
score=sum(parts.values()),
parts=parts,
)
The score parts are not just for tuning. They are an observability surface. When a reviewer asks why a value was chosen, “score 27.4” is not useful. This is useful:
{
"target_cell": "Operating Model!B4",
"winner": "fact_018742",
"score": 27.4,
"margin": 8.1,
"parts": {
"row": 8.0,
"column": 5.1,
"period": 10.0,
"unit": 4.0,
"section": 2.4,
"shape": 6.0,
"source": 1.0
},
"runner_up": "fact_019103"
}
This turns “the model picked it” into a debuggable ranking problem.
It also avoids the worst scaling shape. Matching every target cell against every candidate fact is the path to targets * candidates pain. A table-first pipeline lets you block by table type, entity, period grain, source date, unit, and high-signal tokens before doing expensive semantic comparison.
For larger jobs, I want the hot path to look more like this:
compile target tables once
compile source tables once
bulk insert searchable metadata
batch target embeddings
load source embeddings once
matrix-multiply for top-k recall
run rank workers over bounded candidate sets
emit table_retrieval.json, table_alignment.json, fill_program.json, audit.json
The performance lesson is boring but useful: batch what can be batched, cache what has a stable content hash, and keep accelerators busy on embedding work instead of scanning a vector store once per target cell.
Use ambiguity as a stop signal Link to heading
The most dangerous extraction bug is not an empty cell. It is a confidently wrong cell.
If two candidates are close, the system should refuse to write. That sounds conservative until you look at the failure mode:
Target: EBITDA, FY2024
Candidate A: 22.4, row="Adjusted EBITDA", column="FY2024"
Candidate B: 21.9, row="EBITDA", column="FY2024"
Both may be reasonable depending on the target workbook’s convention. A generic prompt cannot reliably infer the accounting policy from proximity alone. The extraction layer should surface the ambiguity.
def choose_candidate(candidates: list[FactCandidate]) -> FactCandidate | None:
if not candidates:
return None
ordered = sorted(candidates, key=lambda c: c.score, reverse=True)
winner = ordered[0]
if len(ordered) > 1:
margin = winner.score - ordered[1].score
if margin < 3.0:
return None
if winner.score < 12.0:
return None
return winner
The threshold is not magic. It should come from evals. The point is architectural: ambiguity is a valid result, not an exception.
The output can carry that explicitly:
{
"cell": "Operating Model!B8",
"status": "unresolved",
"reason": "ambiguous_candidates",
"candidates": [
{"fact_id": "fact_100", "value": "22.4", "label": "Adjusted EBITDA"},
{"fact_id": "fact_101", "value": "21.9", "label": "EBITDA"}
]
}
This is much easier to review than a hallucinated citation.
Coverage obligations catch silent omissions Link to heading
Most validators check what the system wrote. That is necessary but incomplete. You also need to check what the system ignored.
Before generation, derive coverage obligations from high-signal source regions:
{
"obligations": [
{
"obligation_id": "obl_2024_revenue",
"region_id": "page_012_table_001",
"measure_family": "revenue",
"chunk_ids": ["chunk_442"],
"fact_ids": ["fact_018742"],
"high_signal": true
}
]
}
After generation, validate that each high-signal obligation was either cited or intentionally marked.
ALLOWED_MARKS = {
"schema_not_representable",
"duplicate_lower_authority",
"context_only",
"not_applicable",
}
def coverage_errors(obligations, output_rows, marks):
cited = {
fact_id
for row in output_rows
for fact_id in row.source_fact_ids
}
errors = []
for obligation in obligations:
if not obligation.high_signal:
continue
if set(obligation.fact_ids) & cited:
continue
mark = marks.get(obligation.obligation_id)
if mark in ALLOWED_MARKS:
continue
errors.append({
"code": "unresolved_source_coverage",
"obligation_id": obligation.obligation_id,
"region_id": obligation.region_id,
"measure_family": obligation.measure_family,
})
return errors
The marks file is important. Not every source fact belongs in the target schema. Sometimes the document contains a metric the workbook does not ask for. Sometimes there are duplicate pages from an appendix. Sometimes a row is context for a calculation but not an output.
The system should allow those decisions, but it should force them to be explicit.
obligation_id,status,target_table,reason
obl_2024_store_count,schema_not_representable,operating_model,"target has no store count field"
obl_appendix_revenue,duplicate_lower_authority,operating_model,"same value appears in audited summary"
This one file changes the review posture. The question stops being “did the model probably use the right sources?” and becomes “which high-signal source facts are unaccounted for?”
Keep deterministic work outside the model Link to heading
A common mistake is to ask the model to do extraction, matching, validation, and formatting in one pass:
Read the context, fill the workbook, include citations, and explain any missing values.
That prompt is doing too much. The model becomes responsible for tasks where deterministic code is better:
- normalizing numbers
- matching periods
- detecting formula families
- validating required fields
- checking whether a cited fact id exists
- enforcing one citation per atomic value
The architecture should give the model less room to be creative.
flowchart TD
A["compile target workbook"] --> B["build source fact universe"]
B --> C["route candidate tables"]
C --> D["rank fact candidates"]
D --> E["deterministic writes"]
E --> F["model fallback for unresolved cells"]
F --> G["schema validation"]
G --> H["coverage validation"]
H --> I["review artifact"]
The model fallback is still useful. It can reason across messy narrative sections, infer that “net sales” maps to “revenue”, or explain why a value is not present. But its output should go through the same gates as deterministic writes.
{
"value": "104.8",
"source_fact_ids": ["fact_018742"],
"confidence": "high",
"reasoning": "FY2024 revenue value from consolidated summary table"
}
Then validate it:
def validate_write(write, fact_index):
if not write.source_fact_ids:
return error("missing_source_fact")
for fact_id in write.source_fact_ids:
if fact_id not in fact_index:
return error("unknown_source_fact", fact_id=fact_id)
if write.value not in {fact_index[f].value for f in write.source_fact_ids}:
return error("value_not_supported_by_citation")
return ok()
This check is deliberately strict for single-cell numeric writes. For formulas or derived values, the contract can be different: cite every input fact and emit the formula. The point is to make the proof shape explicit.
Durable execution changes the failure model Link to heading
Long-running extraction workflows fail in boring ways: provider timeouts, worker restarts, partial uploads, process crashes, duplicate invocations, and retry races.
If the workflow is durable, the validation boundaries become even more important. A replay-safe pipeline should not re-run expensive or non-deterministic work just because a worker died after writing an artifact.
async def solve_workbook(ctx, document_id: str, target_id: str):
facts = await ctx.run("build-fact-universe", build_fact_universe, document_id)
candidates = await ctx.run(
"rank-candidates",
rank_candidates,
facts.fact_index_id,
target_id,
)
writes = await deterministic_fill(candidates)
unresolved = cells_without_writes(target_id, writes)
fallback_writes = await ctx.run(
"model-fallback",
call_model_for_unresolved,
unresolved,
)
return await ctx.run(
"validate-output",
validate_output,
writes + fallback_writes,
facts.coverage_obligations_id,
)
The exact API varies by runtime, but the principle is stable: non-deterministic operations need durable boundaries. Database reads, provider calls, and model calls should have recorded results so replay does not silently change the answer.
There is a second-order benefit: every durable step becomes an audit boundary. If rank-candidates changed between two runs, you can compare the candidate artifacts without re-reading the PDFs.
Observability should explain wrong answers Link to heading
Average latency and error rate are not enough for extraction systems. You need product-quality observability: signals that explain why an individual output is wrong.
Useful spans look like this:
solve.exec
document.count=3
target.cells=184
output.writes=171
output.unresolved=13
facts.build
source.pages=96
source.tables=41
facts.total=6184
facts.table_cell=4890
retrieval.tables
target.table=operating_model
recall.filters=12
recall.fts=20
recall.vector=20
fused.tables=31
reranked.tables=8
alignment.table
target.table=operating_model
source.table=src_table_012
row.mapped=18
column.mapped=5
ambiguous.axes=1
candidate.rank
target.cell=Operating Model!B4
winner.fact_id=fact_018742
winner.score=27.4
winner.margin=8.1
coverage.validate
obligations.high_signal=231
obligations.unresolved=4
The span names are less important than the attributes. The trace should let an engineer answer:
- Was the source parsed correctly?
- Did retrieval find the right table?
- Did table alignment preserve the row and column hierarchy?
- Did the correct fact lose ranking?
- Was the output rejected by validation?
- Did coverage fail because the target schema lacked a field?
The trace should also carry artifact links: candidate JSON, coverage obligations, rejected writes, and a compact HTML review page. Without artifacts, traces tell you that a stage failed but not what to do next.
Evals need failure classes, not just scores Link to heading
A single accuracy number hides the useful work. I want evals to produce failure classes:
| Failure class | Example | Likely fix |
|---|---|---|
missing_route | Correct table never retrieved | route query or table index |
wrong_table | Similar section from another entity wins | metadata filter / authority ranking |
bad_axis_alignment | right table, wrong row or column path | row/column path parser |
over_broad_cell_search | candidate set includes unrelated tables | block-scoped retrieval |
recall_only_vector_hit | semantic neighbor wins without exact support | stricter validator / reranker scope |
wrong_period | FY2023 chosen for FY2024 | period parser / column scoring |
label_citation | cites row label instead of value cell | citation contract |
ambiguous_value | two candidates within margin | threshold or human review |
schema_gap | source fact has no target field | schema mapping |
unsupported_write | output value not in cited facts | validator / model prompt |
This gives you a roadmap. If most failures are missing_route, changing the model is probably a waste of time. If most failures are unsupported_write, the model fallback is too unconstrained. If failures are mostly schema_gap, the extraction system may be correct and the target schema is the bottleneck.
The eval artifact should include enough data to reproduce the disagreement:
{
"case_id": "eval_042",
"target_cell": "Operating Model!B4",
"expected": "104.8",
"actual": "91.2",
"failure_class": "wrong_period",
"winner": {
"fact_id": "fact_018701",
"value": "91.2",
"column_path": "FY2023"
},
"expected_fact": {
"fact_id": "fact_018742",
"value": "104.8",
"column_path": "FY2024"
}
}
For table-first systems, I also want table-level metrics:
| Metric | Why it matters |
|---|---|
| target tables compiled | catches workbook parsing failures |
| source tables retrieved | separates retrieval from alignment |
| source tables aligned | detects row/column structure bugs |
| direct fills | measures deterministic coverage |
| formula fills | keeps derivations separate from citations |
| confidence-gated blanks | proves ambiguity handling is active |
| elapsed time by stage | exposes scanning and batching regressions |
That is the difference between measuring and debugging.
Tradeoffs Link to heading
This architecture is heavier than a single prompt. It creates more artifacts, more validators, and more concepts. The tradeoff is worth it when outputs are reviewed, reused, or loaded into another system.
The main costs:
| Cost | Why it shows up | Mitigation |
|---|---|---|
| More preprocessing | Facts require structured extraction | cache by document version |
| Stricter validators reject plausible answers | proof is narrower than truth | add explicit derived-value contracts |
| More unresolved cells | ambiguity becomes visible | route to review instead of fabricating certainty |
| Larger traces | artifacts are part of debugging | store compact summaries plus downloadable detail |
| More schema work | coverage needs target semantics | start with high-signal obligations only |
The counterintuitive part is that strictness can make the product feel better. Users trust blank cells with reasons more than filled cells with weak citations. A system that says “I found two plausible EBITDA values and refused to choose” is annoying. A system that silently chooses the wrong one is expensive.
Practical rules Link to heading
These are the rules I keep coming back to:
- Search is candidate discovery, not proof.
- Cite atomic facts, not pages or vague chunks.
- Row labels and headers can disambiguate values, but should not be the final citation for numeric cells.
- Every generated value should either cite source facts, declare itself derived, or stay unresolved.
- Coverage validation should check ignored high-signal sources, not just malformed outputs.
- Ambiguity should stop writes.
- Evals should classify failures by system boundary.
- Traces should carry artifacts, not just timings.
None of this removes the need for good models. It just stops asking them to be the database, search engine, validator, and auditor at the same time.
References Link to heading
- Restate durable steps for the idea of recording non-deterministic operations such as HTTP calls or database responses during durable execution.
- Reciprocal Rank Fusion for a simple rank-fusion method that works well when multiple retrieval systems produce different orderings.
- SQLite FTS5 for full-text indexing and the built-in
bm25()ranking function used in many lightweight retrieval systems. - Sentence Transformers semantic search for the common embedding-search pattern of comparing query embeddings against corpus embeddings with bounded
top_kresults. - Python
concurrent.futuresfor the standard thread and process pool interface used to separate I/O-heavy and CPU-heavy work. - OpenTelemetry context propagation for correlating traces across process and service boundaries.
- JSON Schema object validation for the distinction between object properties and required fields in structured validation.
- Microsoft Open XML shared string table for how spreadsheet data can reference shared workbook-level string storage rather than inline text.
Conclusion Link to heading
A source-grounded extraction system is not a model call with citations sprinkled on top. It is a pipeline that preserves evidence identity from parsing through retrieval, generation, validation, and review.
RAG helps find the room where the answer might live. Provenance proves which object in that room supports the value you wrote.