ETL Notebooks
Interactive Marimo notebooks documenting the data pipeline that builds the NAICS database from Census Bureau source files.
These notebooks run in your browser via WebAssembly, so no Python installation is required. Code and outputs are pre-computed: each page is a read-only snapshot of the analysis.
Data Pipeline
Run these notebooks in order to build the complete NAICS database:
Load NAICS Codes
Parse 2,125 codes from Census Excel files, derive 5-level hierarchy
Load Descriptions
Merge detailed industry definitions and illustrative examples
Load Index Terms
Import 20,398 official search keywords mapped to 6-digit codes
Load Cross References
Parse 4,601 exclusion/inclusion references between codes
Generate Embeddings
Create 384-dim vectors using all-MiniLM-L6-v2 for semantic search
Compute Relationships
Build similarity graph with 9,127 same-sector and cross-sector links
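As an illustration of the "derive 5-level hierarchy" step in the first notebook, here is a minimal sketch of how a code's level and parent can be read off the code itself. It is not taken from the notebooks. The function names (`level_of`, `parent_of`, `ancestors`) are hypothetical, and it assumes codes arrive as strings, with the three range sectors written as `31-33`, `44-45`, and `48-49`, as Census publishes them.

```python
from typing import Optional

# Level names for 2- to 6-digit NAICS codes (standard NAICS terminology).
LEVELS = {
    2: "sector",
    3: "subsector",
    4: "industry group",
    5: "NAICS industry",
    6: "national industry",
}

# Sectors published as ranges. A 3-digit code starting with one of these
# prefixes rolls up to the range code rather than to a plain 2-digit code.
SECTOR_RANGES = {
    "31": "31-33", "32": "31-33", "33": "31-33",
    "44": "44-45", "45": "44-45",
    "48": "48-49", "49": "48-49",
}


def level_of(code: str) -> str:
    """Return the hierarchy level name for a NAICS code."""
    if "-" in code:  # range codes such as "31-33" are sectors
        return LEVELS[2]
    return LEVELS[len(code)]


def parent_of(code: str) -> Optional[str]:
    """Return the parent code, or None when the code is already a sector."""
    if "-" in code or len(code) == 2:
        return None
    prefix = code[:-1]
    if len(prefix) == 2:
        return SECTOR_RANGES.get(prefix, prefix)
    return prefix


def ancestors(code: str) -> list[str]:
    """Return ancestors from the immediate parent up to the sector."""
    chain = []
    parent = parent_of(code)
    while parent is not None:
        chain.append(parent)
        parent = parent_of(parent)
    return chain
```

For example, `ancestors("541511")` walks up through `54151`, `5415`, and `541` to the sector `54`, while a manufacturing code such as `311111` ends at the range sector `31-33`.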
Exploration
Database Schema
After running the full pipeline, the database contains:
| Table | Rows | Description |
|---|---|---|
| naics_nodes | 2,125 | Codes with hierarchy and descriptions |
| naics_index_terms | 20,398 | Official search keywords |
| naics_cross_references | 4,601 | Exclusion/inclusion references |
| naics_embeddings | 2,125 | 384-dim vectors for semantic search |
| naics_relationships | 2,125 | Pre-computed similarity graph (JSON) |
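One way to confirm that a pipeline run produced the expected database is to compare actual row counts against the table above. The sketch below is an assumption-laden example rather than part of the notebooks: this page does not name the database engine, so `sqlite3` stands in for any DB-API style connection, and `check_row_counts` is a hypothetical helper.

```python
import sqlite3

# Expected row counts, copied from the table above.
EXPECTED_ROWS = {
    "naics_nodes": 2125,
    "naics_index_terms": 20398,
    "naics_cross_references": 4601,
    "naics_embeddings": 2125,
    "naics_relationships": 2125,
}


def check_row_counts(conn, expected=EXPECTED_ROWS):
    """Return {table: (actual, expected)} for each table whose count differs."""
    mismatches = {}
    cur = conn.cursor()
    for table, want in expected.items():
        # Table names come from a fixed dict, not user input, so
        # interpolating them into the SQL string is acceptable here.
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        got = cur.fetchone()[0]
        if got != want:
            mismatches[table] = (got, want)
    return mismatches


if __name__ == "__main__":
    # Tiny in-memory demo with a made-up table, not the real database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE demo (id INTEGER)")
    conn.executemany("INSERT INTO demo VALUES (?)", [(1,), (2,)])
    print(check_row_counts(conn, {"demo": 3}))  # {'demo': (2, 3)}
```

An empty result means every table matched its expected count.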
Source Code
The original Marimo notebooks are available in the naics-mcp-server repository.