ETL Notebooks

Interactive Marimo notebooks documenting the data pipeline that builds the NAICS database from Census Bureau source files.

These notebooks load in your browser via WebAssembly, so no Python installation is required. Code and outputs are pre-computed; you’re viewing a read-only snapshot of each analysis.

Data Pipeline

Run these notebooks in order to build the complete NAICS database:

Load NAICS Codes

Parse 2,125 codes from Census Excel files, derive 5-level hierarchy

Load Descriptions

Merge detailed industry definitions and illustrative examples

Load Index Terms

Import 20,398 official search keywords mapped to 6-digit codes

Load Cross References

Parse 4,601 exclusion/inclusion references between codes

Generate Embeddings

Create 384-dim vectors using all-MiniLM-L6-v2 for semantic search

Compute Relationships

Build similarity graph with 9,127 same-sector and cross-sector links
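As a rough illustration of the last two steps, the sketch below scores pairs of codes by cosine similarity and labels each link as same-sector or cross-sector. It is a minimal sketch under stated assumptions, not the notebooks' actual implementation: the function names, the `top_k` and `min_score` parameters, the link field names, and the toy 3-dimensional vectors (standing in for the 384-dim embeddings) are all hypothetical.

```python
import math

# NAICS sectors that span several 2-digit prefixes (Manufacturing,
# Retail Trade, Transportation and Warehousing).
MULTI_PREFIX_SECTORS = {
    "31": "31-33", "32": "31-33", "33": "31-33",
    "44": "44-45", "45": "44-45",
    "48": "48-49", "49": "48-49",
}


def sector_of(code: str) -> str:
    """Return the sector label for a NAICS code based on its first two digits."""
    prefix = code[:2]
    return MULTI_PREFIX_SECTORS.get(prefix, prefix)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_links(embeddings, top_k=1, min_score=0.5):
    """For each code, keep its top_k most similar codes above min_score.

    top_k and min_score are hypothetical knobs chosen for this sketch.
    """
    links = []
    for code, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other != code
        ]
        scored.sort(reverse=True)
        for score, other in scored[:top_k]:
            if score < min_score:
                continue
            same = sector_of(code) == sector_of(other)
            links.append({
                "source": code,
                "target": other,
                "score": score,
                "kind": "same_sector" if same else "cross_sector",
            })
    return links


# Toy vectors for illustration only; real embeddings would have 384 dimensions.
toy_embeddings = {
    "311811": [0.9, 0.1, 0.0],  # Retail Bakeries
    "311812": [0.8, 0.2, 0.1],  # Commercial Bakeries
    "722515": [0.6, 0.4, 0.0],  # Snack and Nonalcoholic Beverage Bars
    "541511": [0.0, 0.1, 0.9],  # Custom Computer Programming Services
}

links = build_links(toy_embeddings)
for link in links:
    print(link["source"], "->", link["target"], link["kind"])
```

Mapping prefixes such as 31, 32, and 33 to a single sector label matters here: comparing raw 2-digit prefixes would misclassify two manufacturing codes as cross-sector.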

Exploration

Explore Database

Interactive explorer: browse hierarchy, search terms, run queries

Database Schema

After running the full pipeline, the database contains:

| Table | Rows | Description |
| --- | --- | --- |
| `naics_nodes` | 2,125 | Codes with hierarchy and descriptions |
| `naics_index_terms` | 20,398 | Official search keywords |
| `naics_cross_references` | 4,601 | Exclusion/inclusion references |
| `naics_embeddings` | 2,125 | 384-dim vectors for semantic search |
| `naics_relationships` | 2,125 | Pre-computed similarity graph (JSON) |
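To make the hierarchy stored in `naics_nodes` concrete, here is a small sketch that derives each code's level and parent from its length and walks up the tree. This page does not state the database engine or the column names, so everything beyond the table name is an assumption: `sqlite3` is used only to keep the example self-contained, and the `title`, `level`, and `parent_code` columns are hypothetical.

```python
import sqlite3

# The five NAICS hierarchy levels, keyed by code length.
LEVELS = {
    2: "sector",
    3: "subsector",
    4: "industry_group",
    5: "naics_industry",
    6: "national_industry",
}


def node_row(code, title):
    """Derive (code, title, level, parent_code) from a plain numeric code.

    Range-style sector codes such as "31-33" would need special handling
    and are deliberately left out of this sketch.
    """
    level = LEVELS[len(code)]
    parent = code[:-1] if len(code) > 2 else None
    return (code, title, level, parent)


SAMPLE = [
    ("54", "Professional, Scientific, and Technical Services"),
    ("541", "Professional, Scientific, and Technical Services"),
    ("5415", "Computer Systems Design and Related Services"),
    ("54151", "Computer Systems Design and Related Services"),
    ("541511", "Custom Computer Programming Services"),
    ("541512", "Computer Systems Design Services"),
]

conn = sqlite3.connect(":memory:")
# Hypothetical columns; only the table name comes from the schema above.
conn.execute(
    "CREATE TABLE naics_nodes ("
    "code TEXT PRIMARY KEY, title TEXT, level TEXT, parent_code TEXT)"
)
conn.executemany(
    "INSERT INTO naics_nodes VALUES (?, ?, ?, ?)",
    [node_row(code, title) for code, title in SAMPLE],
)


def children(conn, code):
    """Codes whose parent is the given code, in ascending order."""
    rows = conn.execute(
        "SELECT code FROM naics_nodes WHERE parent_code = ? ORDER BY code",
        (code,),
    )
    return [row[0] for row in rows]


def ancestors(conn, code):
    """Walk parent_code links from a code up to its sector."""
    chain = []
    while True:
        row = conn.execute(
            "SELECT parent_code FROM naics_nodes WHERE code = ?", (code,)
        ).fetchone()
        if row is None or row[0] is None:
            return chain
        chain.append(row[0])
        code = row[0]


print(children(conn, "54151"))
print(ancestors(conn, "541511"))
```

Storing the parent explicitly is one possible design; because each level adds one digit, the same lookups could also be done with prefix matching on `code`.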

Source Code

The original Marimo notebooks are available in the naics-mcp-server repository.