Parquet for AI: Inspecting Embeddings, Tokens, and LLM Training Data
Why Parquet is the default format for fine-tuning and RAG datasets — and how to spot-check embedding columns, tokenized shards, and schema issues before a training run.
Large language model workflows increasingly treat Parquet as the interchange format for training corpora, evaluation sets, and retrieval indexes. Hugging Face Datasets, Spark-based feature stores, and cloud ML pipelines all converge on columnar shards — not because Parquet is trendy, but because it matches how ML engineers actually inspect data: column at a time, filtered early, compressed on disk.
Why Parquet dominates LLM data pipelines
Training and RAG pipelines share the same structural needs as classical analytics:
- Column pruning: Load only `text`, `embedding`, or `metadata` columns and skip multi-megabyte geometry or audit fields.
- Predicate pushdown: Filter by `split`, `language`, or `quality_score` without scanning entire files.
- Stable schemas: Version embedding dimensions (768 vs 1536) and document IDs explicitly in the footer metadata.
- Compression: Snappy or ZSTD on text-heavy columns keeps terabyte-scale corpora manageable.
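As a rough illustration of the first three points, the sketch below reads two columns behind a filter and then looks at what the footer records. The filter values (`'train'`, `'en'`, `0.8`) are made up for the example, and whether the key-value metadata says anything about embedding dimensions depends entirely on what the writer chose to store.

```sql
-- Column pruning: only the referenced columns are read from the file.
-- Predicate pushdown: row groups whose statistics rule out the filter
-- can be skipped instead of scanned.
SELECT text, quality_score
FROM read_parquet('train_shard_0042.parquet')
WHERE split = 'train'
  AND language = 'en'
  AND quality_score >= 0.8   -- hypothetical threshold
LIMIT 20;

-- Column names and types as recorded in the Parquet footer.
SELECT name, type, logical_type
FROM parquet_schema('train_shard_0042.parquet');

-- Free-form key-value metadata from the footer (stored as BLOBs,
-- cast here only for readability). May be empty.
SELECT key::VARCHAR AS key, value::VARCHAR AS value
FROM parquet_kv_metadata('train_shard_0042.parquet');
```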
Whether you are building a fine-tuning set from scraped documents or materializing chunked passages for vector search, the artifact on disk is usually `.parquet`, often partitioned by date, source, or experiment ID.
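For partitioned layouts, a per-partition row count is a quick way to see whether every partition actually landed. This is a sketch under an assumed Hive-style layout such as `corpus/source=wiki/part-0.parquet`; the `corpus/` directory and the `source` partition key are hypothetical.

```sql
-- Row counts per partition; the partition key is exposed as a column
-- when hive_partitioning is enabled.
SELECT source, count(*) AS n_rows
FROM read_parquet('corpus/*/*.parquet', hive_partitioning = true)
GROUP BY source
ORDER BY n_rows DESC;
```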
Common column patterns in AI datasets
| Column type | Typical DuckDB type | What to verify |
|---|---|---|
| Document ID | VARCHAR / UUID | Uniqueness, null rate |
| Raw text | VARCHAR | Encoding, truncation, PII |
| Token IDs | INTEGER[] (list) or BLOB | Length distribution |
| Embeddings | FLOAT[] (list) or fixed-size FLOAT[N] | Dimension matches model |
| Labels / scores | DOUBLE, BOOLEAN | Class balance |
| Metadata | STRUCT or JSON | Nested keys parse correctly |
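A few of the checks in the table can be sketched as plain aggregates. The column names `document_id`, `text`, and `label` are assumptions about the shard's schema; substitute whatever your pipeline writes.

```sql
-- Document ID: uniqueness and null rate.
SELECT
    count(*)                                    AS n_rows,
    count(DISTINCT document_id)                 AS distinct_ids,
    count(*) FILTER (WHERE document_id IS NULL) AS null_ids
FROM read_parquet('train_shard_0042.parquet');

-- Raw text: character-length spread, to spot empty or truncated rows.
SELECT
    min(length(text))    AS min_chars,
    median(length(text)) AS median_chars,
    max(length(text))    AS max_chars
FROM read_parquet('train_shard_0042.parquet');

-- Labels: class balance.
SELECT label, count(*) AS n
FROM read_parquet('train_shard_0042.parquet')
GROUP BY label
ORDER BY n DESC;
```

If `distinct_ids` plus `null_ids` is smaller than `n_rows`, the shard contains duplicated IDs.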
A single bad shard — wrong embedding width, duplicated IDs, or a column renamed between pipeline stages — can waste hours of GPU time. Spot-checking Parquet before launching training is cheap insurance.
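One possible way to catch these problems across a whole directory of shards is sketched below. The `shards/` path is hypothetical, and the first query assumes that `parquet_schema` accepts a glob and reports a `file_name` column in your DuckDB version.

```sql
-- Columns that are missing from at least one shard (renamed or dropped).
SELECT name, type, count(DISTINCT file_name) AS shards_with_column
FROM parquet_schema('shards/*.parquet')
GROUP BY name, type
HAVING count(DISTINCT file_name) < (
    SELECT count(DISTINCT file_name)
    FROM parquet_schema('shards/*.parquet')
)
ORDER BY name;

-- Embedding width per shard: a healthy set shows one row per file,
-- all with the same dim.
SELECT filename, len(embedding) AS dim, count(*) AS n
FROM read_parquet('shards/*.parquet', filename = true)
GROUP BY 1, 2
ORDER BY 1, 3 DESC;
```

The second query will fail outright if some shard lacks an `embedding` column, which is itself a useful signal; run the schema comparison first.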
Inspecting shards with DuckDB SQL
Once a file is loaded locally, DuckDB can profile AI-oriented columns quickly:
```sql
-- Row count and null rates on key fields
SELECT
    count(*) AS rows,
    count(*) FILTER (WHERE text IS NULL) AS null_text,
    count(*) FILTER (WHERE embedding IS NULL) AS null_embeddings
FROM read_parquet('train_shard_0042.parquet');

-- Embedding dimension sanity (list length)
SELECT
    len(embedding) AS dim,
    count(*) AS n
FROM read_parquet('train_shard_0042.parquet')
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;
```
For tokenized corpora stored as integer lists, sampling a few rows gives a quick check on vocabulary range and sequence length, without loading the full shard into Python.
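A minimal sketch of that check, assuming the token column is an integer list named `token_ids` (a hypothetical name):

```sql
-- Sequence-length spread and the range of token IDs in the shard.
SELECT
    min(len(token_ids))      AS min_len,
    median(len(token_ids))   AS median_len,
    max(len(token_ids))      AS max_len,
    min(list_min(token_ids)) AS min_token_id,
    max(list_max(token_ids)) AS max_token_id
FROM read_parquet('train_shard_0042.parquet');

-- Eyeball the first 16 tokens of a handful of rows.
SELECT token_ids[1:16] AS first_tokens, len(token_ids) AS n_tokens
FROM read_parquet('train_shard_0042.parquet')
LIMIT 5;
```

`max_token_id` should stay below the tokenizer's vocabulary size, and `max_len` should not exceed the context length you plan to train with.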
RAG and evaluation sets
Retrieval-augmented generation adds another Parquet use case: passage stores with precomputed embeddings plus source metadata. Before indexing:
- Confirm `chunk_id` is unique within the file.
- Verify embedding lists are non-null for rows you expect in the index.
- Check that `source_url` or `document_id` joins cleanly to your canonical catalog.
A five-minute SQL pass in the browser catches schema drift that unit tests often miss.
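The three pre-indexing checks could look roughly like this. The file names `passages.parquet` and `catalog.parquet` are placeholders, and the join assumes both files carry a `document_id` column.

```sql
-- 1. chunk_id values that appear more than once.
SELECT chunk_id, count(*) AS copies
FROM read_parquet('passages.parquet')
GROUP BY chunk_id
HAVING count(*) > 1
ORDER BY copies DESC
LIMIT 20;

-- 2. Rows that would enter the index without an embedding.
SELECT count(*) AS missing_embeddings
FROM read_parquet('passages.parquet')
WHERE embedding IS NULL;

-- 3. Passages whose document_id has no match in the catalog.
SELECT count(*) AS orphan_passages
FROM read_parquet('passages.parquet') AS p
LEFT JOIN read_parquet('catalog.parquet') AS c
    ON p.document_id = c.document_id
WHERE c.document_id IS NULL;
```

All three queries should come back empty or zero before the passages are indexed. The third also counts passages whose own `document_id` is null.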
Workflow integration with ViewParquet
ViewParquet runs DuckDB-WASM entirely in your browser — no upload to a server. That makes it practical to:
- Open a training or eval shard from a USB drive or internal share
- Scroll millions of rows in the virtualized grid
- Run ad-hoc SQL to profile nulls, cardinalities, and embedding dimensions
- Ask the AI assistant to draft profiling queries grounded in your open file
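For the ad-hoc profiling step, DuckDB's `SUMMARIZE` is one convenient starting point. The sketch below is written against `read_parquet` with the shard name used earlier; how ViewParquet exposes an opened file to SQL is not specified here, so adjust the source accordingly. The selected columns are assumed names.

```sql
-- Per-column min, max, approximate distinct count, and null percentage
-- for the scalar columns (list columns such as embeddings are left out
-- to keep the output readable).
SUMMARIZE
SELECT document_id, split, language, quality_score
FROM read_parquet('train_shard_0042.parquet');
```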
For teams shipping LLM data pipelines, ViewParquet is the fast pre-flight check between "Parquet landed in S3" and "kick off the training job."