    Parquet for AI: Inspecting Embeddings, Tokens, and LLM Training Data

    Why Parquet is the default format for fine-tuning and RAG datasets — and how to spot-check embedding columns, tokenized shards, and schema issues before a training run.

    May 15, 2026

    Large language model workflows increasingly treat Parquet as the interchange format for training corpora, evaluation sets, and retrieval indexes. Hugging Face Datasets, Spark-based feature stores, and cloud ML pipelines all converge on columnar shards — not because Parquet is trendy, but because it matches how ML engineers actually inspect data: column at a time, filtered early, compressed on disk.

    Why Parquet dominates LLM data pipelines

    Training and RAG pipelines share the same structural needs as classical analytics:

    • Column pruning: Load only `text`, `embedding`, or `metadata` columns and skip multi-megabyte geometry or audit fields.
    • Predicate pushdown: Filter by `split`, `language`, or `quality_score` without scanning entire files.
    • Stable schemas: Version embedding dimensions (768 vs 1536) and document IDs explicitly in the footer metadata.
    • Compression: Snappy or ZSTD on text-heavy columns keeps terabyte-scale corpora manageable.
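
    As a rough sketch of the first three points, a DuckDB query that names only the columns it needs and filters on metadata columns lets the Parquet reader skip unneeded column chunks and, where row-group statistics allow, whole row groups. The column names come from the list above; the shard name and the filter values (`'train'`, `'en'`, `0.8`) are illustrative assumptions, and `parquet_kv_metadata` is available in recent DuckDB releases.

    ```sql
    -- Sketch: projection plus filters on a single shard.
    -- The literal filter values below are assumptions, not values from a real dataset.
    SELECT text, embedding
    FROM read_parquet('train_shard_0042.parquet')
    WHERE split = 'train'
      AND language = 'en'
      AND quality_score >= 0.8
    LIMIT 10;

    -- Key-value pairs stored in the file footer, for example a schema or
    -- embedding-model version, if the writer recorded one.
    SELECT *
    FROM parquet_kv_metadata('train_shard_0042.parquet');
    ```

    If the footer query returns no rows, the writer did not record any custom metadata, and version information has to come from file names or an external catalog instead.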

    Whether you are building a fine-tuning set from scraped documents or materializing chunked passages for vector search, the artifact on disk is usually `.parquet`, often partitioned by date, source, or experiment ID.
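
    For a partitioned corpus, one possible first look is a row count per partition. The sketch below assumes a hypothetical Hive-style layout such as `corpus/source=web/dt=2026-05-01/part-0.parquet`; the directory name and the partition keys `source` and `dt` are assumptions, so adjust the glob to the actual layout.

    ```sql
    -- Sketch: row counts per partition in a Hive-style directory tree.
    -- With hive_partitioning enabled, DuckDB exposes the key=value path
    -- segments (here the assumed keys source and dt) as ordinary columns.
    SELECT source, dt, count(*) AS n
    FROM read_parquet('corpus/*/*/*.parquet', hive_partitioning = true)
    GROUP BY ALL
    ORDER BY ALL;
    ```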

    Common column patterns in AI datasets

    | Column type | Typical DuckDB type | What to verify |
    | --- | --- | --- |
    | Document ID | VARCHAR / UUID | Uniqueness, null rate |
    | Raw text | VARCHAR | Encoding, truncation, PII |
    | Token IDs | LIST(INTEGER) or BLOB | Length distribution |
    | Embeddings | LIST(FLOAT) or fixed array | Dimension matches model |
    | Labels / scores | DOUBLE, BOOLEAN | Class balance |
    | Metadata | STRUCT or JSON | Nested keys parse correctly |
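
    To see which of these patterns a given shard actually uses, two schema queries are usually enough. This is a minimal sketch; the shard name is illustrative.

    ```sql
    -- Column names and the DuckDB types they are read as
    DESCRIBE SELECT * FROM read_parquet('train_shard_0042.parquet');

    -- Physical Parquet schema as recorded in the file footer
    SELECT name, type, converted_type, logical_type
    FROM parquet_schema('train_shard_0042.parquet');
    ```

    Both queries read only the footer, so they stay fast even on multi-gigabyte shards.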

    A single bad shard — wrong embedding width, duplicated IDs, or a column renamed between pipeline stages — can waste hours of GPU time. Spot-checking Parquet before launching training is cheap insurance.
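
    A renamed column can often be caught from footers alone, before any row is read. The sketch below lists field names that are missing from at least one shard; the glob pattern `train_shard_*.parquet` is an assumption about how the shards are named.

    ```sql
    -- Sketch: field names that do not appear in every shard.
    -- Only footers are read, so this is cheap even across many files.
    WITH cols AS (
      SELECT file_name, name
      FROM parquet_schema('train_shard_*.parquet')
    )
    SELECT name, count(DISTINCT file_name) AS shards_with_column
    FROM cols
    GROUP BY name
    HAVING count(DISTINCT file_name) < (SELECT count(DISTINCT file_name) FROM cols)
    ORDER BY name;
    ```

    An empty result means every shard exposes the same field names. It says nothing about types, which still need a separate check.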

    Inspecting shards with DuckDB SQL

    Once a file is loaded locally, DuckDB can profile AI-oriented columns quickly:

    ```sql
    -- Row count and null rates on key fields
    SELECT
      count(*) AS rows,
      count(*) FILTER (WHERE text IS NULL) AS null_text,
      count(*) FILTER (WHERE embedding IS NULL) AS null_embeddings
    FROM read_parquet('train_shard_0042.parquet');

    -- Embedding dimension sanity (list length)
    SELECT len(embedding) AS dim, count(*) AS n
    FROM read_parquet('train_shard_0042.parquet')
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 5;
    ```

    For tokenized corpora stored as integer lists, sampling a few rows confirms vocabulary range and sequence length — without loading the full shard into Python.
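
    One way to do that sampling is sketched below. It assumes the token IDs live in a `LIST(INTEGER)` column named `token_ids`, which is a hypothetical name; a BLOB-encoded column would need decoding first.

    ```sql
    -- Sketch: inspect a handful of random rows from a tokenized shard.
    -- token_ids is an assumed column name.
    SELECT
      len(token_ids)      AS seq_len,       -- sequence length
      list_min(token_ids) AS min_token_id,  -- should be >= 0
      list_max(token_ids) AS max_token_id,  -- should be below the vocabulary size
      token_ids[1:8]      AS first_tokens   -- first eight IDs, for eyeballing
    FROM read_parquet('train_shard_0042.parquet')
    USING SAMPLE 5 ROWS;
    ```

    A `max_token_id` at or above the tokenizer's vocabulary size suggests the shard was produced with a different tokenizer than the one the model expects.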

    RAG and evaluation sets

    Retrieval-augmented generation adds another Parquet use case: passage stores with precomputed embeddings plus source metadata. Before indexing:

    • Confirm `chunk_id` is unique within the file.
    • Verify embedding lists are non-null for rows you expect in the index.
    • Check that `source_url` or `document_id` joins cleanly to your canonical catalog.
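
    The three checks above could be sketched as follows. The file names `passages.parquet` and `catalog.parquet` are hypothetical, and the join assumes both files carry a `document_id` column.

    ```sql
    -- 1. chunk_id uniqueness: the two counts should match.
    --    count(DISTINCT ...) ignores NULLs, so NULL chunk_ids also show up as a gap.
    SELECT count(*) AS rows, count(DISTINCT chunk_id) AS distinct_chunk_ids
    FROM read_parquet('passages.parquet');

    -- 2. Rows that would enter the index without a usable embedding
    SELECT count(*) AS missing_embeddings
    FROM read_parquet('passages.parquet')
    WHERE embedding IS NULL OR len(embedding) = 0;

    -- 3. Passages whose document_id has no match in the catalog
    SELECT count(*) AS orphaned_passages
    FROM read_parquet('passages.parquet') AS p
    WHERE NOT EXISTS (
      SELECT 1
      FROM read_parquet('catalog.parquet') AS c
      WHERE c.document_id = p.document_id
    );
    ```

    Non-zero results for the second or third query are not always errors, but they should match what you expect to be excluded from the index.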

    A five-minute SQL pass in the browser catches schema drift that unit tests often miss.

    Workflow integration with ViewParquet

    ViewParquet runs DuckDB-WASM entirely in your browser — no upload to a server. That makes it practical to:

    • Open a training or eval shard from a USB drive or internal share
    • Scroll millions of rows in the virtualized grid
    • Run ad-hoc SQL to profile nulls, cardinalities, and embedding dimensions
    • Ask the AI assistant to draft profiling queries grounded in your open file

    For teams shipping LLM data pipelines, ViewParquet is the fast pre-flight check between "Parquet landed in S3" and "kick off the training job."