XLSX files are little filesystems

The first surprising thing about an .xlsx file is that it is not really a file.

It is a ZIP archive wearing a spreadsheet costume.

Rename budget.xlsx to budget.zip, unzip it, and the workbook falls open into a small filesystem:

budget.xlsx
  [Content_Types].xml
  _rels/.rels
  xl/
    workbook.xml
    _rels/workbook.xml.rels
    worksheets/sheet1.xml
    worksheets/_rels/sheet1.xml.rels
    sharedStrings.xml
    styles.xml
    comments1.xml
    drawings/vmlDrawing1.vml

That is the moment Excel stops being “a grid” and starts looking like a tiny operating system. There is a package manifest, relationship graph, string table, style table, worksheet storage, drawing layer, comments subsystem, formulas, dimensions, merged ranges, hyperlinks, and sometimes a pile of legacy VML that refuses to die.

I like this kind of file format because it looks boring until you try to build anything real with it. Reading a cell value is easy. Rendering a workbook, validating comments, preserving formulas, or producing a stable diff is where the weirdness starts.

This is the mental model I use for working with .xlsx files without pretending Excel is simpler than it is.

Start by treating the workbook as a package Link to heading

The naive way to parse a workbook is to jump straight into xl/worksheets/sheet1.xml.

That works for toy files. It fails as soon as sheet names, relationships, shared strings, comments, drawings, or external links enter the picture.

The package has a graph. Follow it.

_rels/.rels
  -> xl/workbook.xml

xl/workbook.xml
  -> sheet metadata

xl/_rels/workbook.xml.rels
  -> worksheets/sheet1.xml
  -> sharedStrings.xml
  -> styles.xml

The relationship files are not decoration. They are the pointers that tell you what each part means.

<Relationship
  Id="rId1"
  Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet"
  Target="worksheets/sheet1.xml" />

The workbook points at a sheet by relationship id:

<sheet
  name="Summary"
  sheetId="1"
  r:id="rId1" />

So a real parser starts with a package index:

from zipfile import ZipFile
from xml.etree import ElementTree as ET

def load_package(path):
    with ZipFile(path) as z:
        return {
            name: z.read(name)
            for name in z.namelist()
            if name.endswith((".xml", ".rels", ".vml"))
        }

Then it resolves relationships instead of guessing paths:

def relationships(package, rels_path):
    root = ET.fromstring(package[rels_path])
    out = {}
    for rel in root:
        rid = rel.attrib["Id"]
        out[rid] = {
            "type": rel.attrib["Type"],
            "target": rel.attrib["Target"],
        }
    return out

That tiny indirection saves you from a long list of bad assumptions.

Cells are sparse, not rectangular Link to heading

A worksheet XML file does not store a nice 2D array. It stores the cells that exist.

<sheetData>
  <row r="1">
    <c r="A1" t="s"><v>0</v></c>
    <c r="C1"><v>42</v></c>
  </row>
</sheetData>

There is no <c r="B1">. It is absent.

That means the in-memory model should not start as rows x cols. It should start as a sparse map:

type CellRef = string; // "C12"

type Cell = {
  ref: CellRef;
  row: number;
  col: number;
  type: "number" | "shared-string" | "inline-string" | "formula" | "blank";
  raw: string | null;
  styleId: number | null;
  formula: string | null;
};

type Sheet = {
  name: string;
  cells: Map<CellRef, Cell>;
  merges: Range[];
};

A grid view can be derived later. The file format itself is sparse.

This matters for performance too. A workbook can claim a large dimension and still contain relatively few real cells. If you allocate the rectangle first, malicious or just ugly workbooks can waste a surprising amount of memory.

Shared strings are a global dictionary Link to heading

Strings often do not live inside cells. They live in xl/sharedStrings.xml.

<sst count="2" uniqueCount="2">
  <si><t>Revenue</t></si>
  <si><t>Margin</t></si>
</sst>

The worksheet cell stores an index:

<c r="A1" t="s">
  <v>0</v>
</c>

So A1 means “look up string 0.” If you render the sheet without resolving shared strings, the workbook looks like it is full of random integers.

The resolver is simple:

def cell_text(cell, shared_strings):
    cell_type = cell.get("t")
    value = cell.findtext("main:v", namespaces=NS)

    if value is None:
        return ""

    if cell_type == "s":
        return shared_strings[int(value)]

    return value

The annoying part is rich text. A shared string item can contain multiple runs:

<si>
  <r><rPr><b /></rPr><t>Total</t></r>
  <r><t> revenue</t></r>
</si>

If you only read direct <t> nodes, you lose text. If you flatten all text nodes, you keep the value but lose formatting. Both may be correct depending on whether you are building a data extractor or a visual renderer.

That is a recurring theme with Excel: every “simple” field has at least two meanings.

Styles are indexes into indexes Link to heading

Cell formatting is not stored as direct CSS-like properties on the cell.

A cell has a style id:

<c r="B2" s="5"><v>1234.5</v></c>

That points into styles.xml, where cellXfs points at font ids, fill ids, border ids, number format ids, alignment settings, and protection flags.

<xf
  numFmtId="4"
  fontId="2"
  fillId="0"
  borderId="1"
  applyNumberFormat="1" />

So rendering one cell can require walking a small object graph:

cell.s = 5
  -> cellXfs[5]
    -> numFmtId = 4
    -> fontId = 2
    -> borderId = 1
    -> fillId = 0

That is why “just read the value” and “show what Excel shows” are different problems.

For example, the raw value may be:

But with a date number format it should render like:

Jul 16, 2023

Or the raw value may be:

0.142

But the displayed value should be:

14.2%

A useful renderer keeps both:

type RenderedCell = {
  ref: string;
  rawValue: string | null;
  displayValue: string;
  style: ResolvedStyle;
};

If you throw away the raw value, formulas and diffs become worse. If you throw away the display value, humans cannot read the output.

Formulas are source code with cached output Link to heading

A formula cell can contain both the formula and the last calculated value:

<c r="C10">
  <f>SUM(C2:C9)</f>
  <v>9123</v>
</c>

The <f> is the source code. The <v> is the cached result.

That distinction matters. A library that edits inputs might not recalculate formulas. Excel may recalculate when the file opens, or it may trust cached values depending on workbook settings.

For an inspector, I want to show both:

C10
  formula: =SUM(C2:C9)
  cached:  9123

For a writer, I want an explicit policy:

Policy	Behavior
preserve	keep formula and cached value
clear cache	keep formula, remove cached value
value only	replace formula with literal
recalc engine	evaluate and write new cached value

The dangerous path is accidentally converting formulas into stale constants. That kind of bug can survive for months because the sheet still opens and the number still looks plausible.

Merged cells are a visual trick Link to heading

Merged cells look like one cell. In the file, they are a range overlay:

<mergeCells count="1">
  <mergeCell ref="A1:D1" />
</mergeCells>

Usually the value lives in the top-left cell. The other cells in the range may not exist at all.

For rendering, you need to expand the merge region:

function ownerOf(ref: CellRef, merges: Range[]) {
  for (const merge of merges) {
    if (merge.contains(ref)) {
      return merge.topLeft;
    }
  }
  return ref;
}

For data extraction, you have to be careful. If a merged header spans A1:D1, should it apply to all four columns? Visually yes. Structurally, the file only says “these cells are merged.”

Excel lets users encode meaning with layout. Machines then have to reverse engineer that meaning from geometry.

Comments live in two worlds Link to heading

Comments are where .xlsx gets especially weird.

The comment text can live in comments1.xml:

<comment ref="B12" authorId="0">
  <text>
    <r><t>Check this assumption</t></r>
  </text>
</comment>

But the visible comment box may be described by a VML drawing:

<v:shape id="_x0000_s1025" type="#_x0000_t202">
  <x:ClientData ObjectType="Note">
    <x:Row>11</x:Row>
    <x:Column>1</x:Column>
  </x:ClientData>
</v:shape>

So a workbook can have comment text without a proper shape, or a shape that does not line up with the comment. Some apps tolerate this. Some repair the file. Some silently drop the visual artifact.

If you care about producing workbooks that survive Excel, validators should inspect both layers:

worksheet relationships
  -> comments XML
  -> VML drawing
  -> cell references
  -> row/column anchors

A comment validator might emit:

{
  "sheet": "Summary",
  "cell": "B12",
  "text": true,
  "vml_shape": false,
  "severity": "warning"
}

This is the part of Excel automation that feels like archaeology. A modern .xlsx file can still require understanding a legacy drawing format because users expect a little yellow note to show up in the right place.

Rendering is a pipeline, not a parser Link to heading

Once you account for all of this, a workbook renderer starts looking like a compiler pipeline:

zip package
  -> content type index
  -> relationship graph
  -> workbook metadata
  -> sheet XML
  -> shared strings
  -> styles
  -> comments and drawings
  -> sparse cell model
  -> resolved display model
  -> visual grid

The intermediate representation matters. If you go directly from XML nodes to HTML table cells, every later feature becomes a patch.

I like this split:

type WorkbookPackage = {
  parts: Map<string, Uint8Array>;
  relationships: RelationshipIndex;
};

type WorkbookModel = {
  sheets: SheetModel[];
  sharedStrings: string[];
  styles: StyleTable;
};

type RenderModel = {
  sheets: RenderSheet[];
  warnings: RenderWarning[];
};

Warnings are first-class. Real workbooks are messy. A renderer that only returns “success” or “failed” is not giving you enough information.

Useful warning classes:

Warning	Meaning
missing relationship	XML points at a part that does not exist
unknown style id	cell references a style outside the table
unsupported number format	raw value shown without full formatting
comment without shape	text exists but visual note may not render
formula cache stale	formula and cached value may disagree
merge conflict	overlapping merge ranges

Warnings let you build tools that are honest without being fragile.

The debugging trick: explode the workbook Link to heading

When a workbook looks wrong, I do not start with Excel. I start by exploding the ZIP.

mkdir /tmp/book
unzip report.xlsx -d /tmp/book
find /tmp/book/xl -maxdepth 3 -type f | sort

Then inspect the exact part:

python -m xml.dom.minidom /tmp/book/xl/worksheets/sheet1.xml | less

For comments:

python -m xml.dom.minidom /tmp/book/xl/comments1.xml | less
python -m xml.dom.minidom /tmp/book/xl/drawings/vmlDrawing1.vml | less

For styles:

python -m xml.dom.minidom /tmp/book/xl/styles.xml | less

This is faster than guessing. Excel is a GUI over a package. If the GUI is confusing, inspect the package.

What makes this fun Link to heading

The fun part is that .xlsx sits in a strange middle ground.

It is a user-facing document format, but it behaves like a miniature application bundle. It has source code, cached execution output, layout instructions, shared resources, relationship graphs, compatibility layers, and rendering quirks accumulated over years.

That makes it a great engineering playground:

ZIP parsing
XML namespaces
sparse data structures
graph resolution
rendering
validation
compatibility archaeology
performance tuning
diffing and golden fixtures

It is not glamorous in the way a shader demo is glamorous. It is glamorous in the way a weird machine is glamorous: the more panels you remove, the more mechanisms you find.

Conclusion Link to heading

If you want to work seriously with Excel files, stop thinking of them as tables.

Think of them as little filesystems with a spreadsheet UI on top.

That mental model changes how you build tools. You follow relationships instead of guessing paths. You keep sparse cells sparse. You resolve shared strings and styles explicitly. You treat formulas as source code plus cached output. You validate comments across both XML and VML. You build a render model instead of stitching strings into an HTML table.

The reward is control. Once the workbook is a package, not a mystery, you can inspect it, render it, diff it, repair it, and generate it without hoping Excel forgives you.

References Link to heading

ECMA-376 Office Open XML - the standard behind modern Office Open XML packages.
Microsoft Learn: Structure of a SpreadsheetML document - workbook, worksheet, shared string, and style part structure.
Microsoft Learn: Working with sheets - how sheets are represented and related in Open XML.
Microsoft Learn: Working with comments - SpreadsheetML comments and comment authors.
Excelize documentation - a Go library for reading and writing XLSX files.