The Modern AI Data Pipeline: From Ingestion to Fine-Tuning with Parquet

A practical walkthrough of ETL → feature store → eval sets for ML teams — with schema drift checks and in-browser spot-checks at every stage.

April 28, 2026

12 min read

Production AI systems are data pipelines first and models second. The path from raw logs or documents to a fine-tuned model typically crosses ingestion, cleaning, labeling, feature materialization, training export, and evaluation — with Parquet as the durable format at each handoff.

Stage 1: Ingestion and normalization

Raw inputs arrive as JSON lines, CSV exports, API dumps, or warehouse tables. The first durable artifact is usually Parquet:

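```python
# Conceptual: normalize to a canonical training schema
from datetime import datetime, timezone

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ("document_id", pa.string()),
    ("text", pa.string()),
    ("source", pa.string()),
    ("ingested_at", pa.timestamp("us", tz="UTC")),
])

# Illustrative row; a real job would build the table from parsed raw inputs
rows = [{"document_id": "doc-1", "text": "example", "source": "support_tickets",
         "ingested_at": datetime(2026, 4, 28, tzinfo=timezone.utc)}]
table = pa.Table.from_pylist(rows, schema=schema)

# Write partitioned for incremental pipelines (here by source;
# a date column derived from ingested_at works the same way)
pq.write_to_dataset(table, root_path="docs", partition_cols=["source"])
```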

Partitioning by `source` or by a date derived from `ingested_at` keeps incremental jobs fast, because query engines can skip directories whose partition values do not match the filter.
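To make the directory-skipping idea concrete, here is a minimal sketch in plain Python, assuming a Hive-style layout of `key=value` directories. The helpers `partition_path` and `prune` are hypothetical names used only for illustration; engines such as DuckDB or Spark perform this pruning internally.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def partition_path(root: str, ingested_at: datetime, source: str) -> str:
    """Build a Hive-style partition directory from a timezone-aware timestamp."""
    day = ingested_at.astimezone(timezone.utc).date().isoformat()
    return str(PurePosixPath(root) / f"date={day}" / f"source={source}")


def prune(paths: list[str], **wanted: str) -> list[str]:
    """Keep only paths whose key=value segments match every requested value."""
    kept = []
    for path in paths:
        segments = dict(
            part.split("=", 1) for part in PurePosixPath(path).parts if "=" in part
        )
        if all(segments.get(key) == value for key, value in wanted.items()):
            kept.append(path)
    return kept


# Hypothetical file listing; only one file survives the filter.
files = [
    "docs/date=2026-04-27/source=support_tickets/part-0.parquet",
    "docs/date=2026-04-28/source=support_tickets/part-0.parquet",
    "docs/date=2026-04-28/source=wiki/part-0.parquet",
]
todays_tickets = prune(files, date="2026-04-28", source="support_tickets")
```

An incremental job that filters on both partition keys only has to open the surviving files, which is where the speedup comes from.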

Stage 2: Cleaning and enrichment

Deduplication, language detection, toxicity filters, and PII redaction happen in Spark, DuckDB, or Polars. Outputs remain Parquet so downstream stages inherit column pruning and compression.

Key checks at this stage:

- Row counts match expectations vs upstream
- No duplicate `document_id` within a partition
- Text column max length fits tokenizer limits
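The three checks above can be sketched as a single function. This is a minimal illustration that assumes the partition has already been loaded as a list of dictionaries; `check_partition`, the 5% row-count tolerance, and the character limit are assumptions rather than recommended values, and character length is only a rough proxy for token count.

```python
from collections import Counter


def check_partition(rows, upstream_count, max_chars, tolerance=0.05):
    """Return a list of problems; an empty list means the partition passed."""
    problems = []

    # Row count: allow a small, explicitly chosen deviation from upstream.
    if upstream_count and abs(len(rows) - upstream_count) / upstream_count > tolerance:
        problems.append(f"row count {len(rows)} deviates from upstream {upstream_count}")

    # Uniqueness of document_id within this partition.
    counts = Counter(row["document_id"] for row in rows)
    dupes = sorted(doc_id for doc_id, n in counts.items() if n > 1)
    if dupes:
        problems.append(f"duplicate document_id values: {dupes}")

    # Text length as a cheap stand-in for the tokenizer limit.
    too_long = sorted({row["document_id"] for row in rows if len(row["text"]) > max_chars})
    if too_long:
        problems.append(f"text longer than {max_chars} chars: {too_long}")

    return problems


# Hypothetical partition with one duplicated id and one oversized text.
sample_rows = [
    {"document_id": "a", "text": "hi"},
    {"document_id": "a", "text": "x" * 50},
    {"document_id": "b", "text": "ok"},
]
issues = check_partition(sample_rows, upstream_count=3, max_chars=10)
```

Returning a list of messages rather than raising on the first failure lets one run surface every problem in a partition at once.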

Stage 3: Labeling and human review

Human-in-the-loop workflows often add columns: `label`, `reviewer_id`, `confidence`. Because each Parquet file carries its own schema, new columns can appear in new files without rewriting historical ones, but downstream jobs must tolerate NULL for those columns on older shards.

Track schema versions in your pipeline metadata; diff footers when a new label column appears.
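A schema diff can be as small as the sketch below, which assumes each schema has been reduced to a mapping of column name to type string (for example, extracted from the file footer with `pyarrow.parquet.read_schema`). The function names and the compatibility rule are illustrative assumptions: added columns are treated as safe, while removed or retyped columns are not.

```python
def diff_schemas(old: dict[str, str], new: dict[str, str]) -> dict[str, dict]:
    """Compare two column-name -> type mappings."""
    return {
        "added": {col: new[col] for col in new if col not in old},
        "removed": {col: old[col] for col in old if col not in new},
        "retyped": {
            col: (old[col], new[col])
            for col in old
            if col in new and old[col] != new[col]
        },
    }


def is_backward_compatible(diff: dict[str, dict]) -> bool:
    """Additive changes only: readers of the old schema keep working."""
    return not diff["removed"] and not diff["retyped"]


# Hypothetical schema versions before and after a labeling pass.
v1 = {"document_id": "string", "text": "string"}
v2 = {"document_id": "string", "text": "string", "label": "string", "confidence": "double"}
labeling_diff = diff_schemas(v1, v2)
```

Storing the output of `diff_schemas` alongside the pipeline run metadata gives a reviewable record of when each label column first appeared.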

Stage 4: Feature materialization

Embeddings, tokenized sequences, and derived features land in separate Parquet tables or additional columns:

```sql
-- Join base documents to embedding table on document_id
SELECT d.document_id, d.text, e.embedding
FROM read_parquet('docs/date=2026-04-28/*.parquet') d
JOIN read_parquet('embeddings/v2/*.parquet') e USING (document_id)
WHERE d.source = 'support_tickets'
LIMIT 100;
```

Join keys and embedding dimensions are the highest-risk contract between stages.
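One way to test that contract before a join is sketched below. It assumes the document ids and embeddings have been loaded into plain Python structures; `check_embedding_contract` and its report keys are hypothetical names, not part of any library.

```python
def check_embedding_contract(doc_ids, embeddings, expected_dim):
    """Report join-key and dimension violations.

    doc_ids:    iterable of document_id values from the base table
    embeddings: mapping of document_id -> sequence of floats
    """
    doc_set = set(doc_ids)
    emb_set = set(embeddings)
    return {
        # Documents that would silently drop out of an inner join.
        "missing": sorted(doc_set - emb_set),
        # Embeddings with no matching document (stale or mis-keyed rows).
        "orphaned": sorted(emb_set - doc_set),
        # Vectors whose length differs from what the model expects.
        "wrong_dim": sorted(
            doc_id for doc_id, vec in embeddings.items() if len(vec) != expected_dim
        ),
    }


# Hypothetical data: "c" has no embedding, "z" has no document, "b" is too short.
report = check_embedding_contract(
    doc_ids=["a", "b", "c"],
    embeddings={"a": [0.1, 0.2, 0.3], "b": [0.1, 0.2], "z": [0.0, 0.0, 0.0]},
    expected_dim=3,
)
```

The `missing` list matters most with an inner join like the one above, because unmatched documents disappear from the result without any error.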

Stage 5: Training and eval exports

Training exports are often filtered Parquet (specific splits, quality thresholds). Eval sets are smaller, frozen Parquet snapshots versioned by git tag or object-store path.

Before every training run:

- Schema diff against the previous successful export
- Row count and class balance for supervised tasks
- Sample review of 20–50 random rows in a grid viewer
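The class-balance and sampling steps from the checklist above can be sketched with the standard library alone. The example assumes the export has been loaded as a list of dictionaries with a `label` key; the function names and the fixed seed are illustrative choices, made so that two reviewers looking at the same export see the same rows.

```python
import random
from collections import Counter


def class_balance(rows, label_key="label"):
    """Fraction of rows per label value."""
    counts = Counter(row[label_key] for row in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def review_sample(rows, k=20, seed=0):
    """Reproducible random sample of up to k rows for manual review."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


# Hypothetical export: 75 "ok" rows and 25 "spam" rows.
export = [{"document_id": f"doc-{i}", "label": "ok" if i < 75 else "spam"} for i in range(100)]
balance = class_balance(export)
to_review = review_sample(export, k=20, seed=42)
```

Recording the seed next to the export path makes the review repeatable if a question about a specific row comes up later.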

Stage 6: Monitoring and drift

Post-deployment, log inference inputs and outcomes back to Parquet for drift analysis. Compare embedding distributions or label rates week over week with the same SQL profiling queries used pre-training.
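As one possible way to compare label rates week over week, the sketch below uses total variation distance between the two label distributions. The metric, the function names, and the sample data are all assumptions made for illustration; the same comparison could be written as a SQL profiling query.

```python
from collections import Counter


def label_rates(labels):
    """Fraction of observations per label value."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Half the L1 distance between two discrete distributions.

    0.0 means identical label rates; 1.0 means no overlap at all.
    """
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)


# Hypothetical logged outcomes for two consecutive weeks.
last_week = ["ok"] * 90 + ["escalate"] * 10
this_week = ["ok"] * 70 + ["escalate"] * 30
drift = total_variation(label_rates(last_week), label_rates(this_week))
```

What counts as an actionable amount of drift depends on the task, so any alert threshold is best calibrated against weeks known to be healthy.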

Tooling choices at each stage

| Stage | Common tools | Parquet role |
| --- | --- | --- |
| Ingest | Airflow, Dagster, Fivetran | Landing zone files |
| Transform | Spark, DuckDB, dbt | Intermediate tables |
| Label | Label Studio, custom UIs | Annotated shards |
| Train | PyTorch, HF Trainer | `datasets` Parquet loader |
| Eval | Weights & Biases, custom | Frozen eval shards |

Workflow integration with ViewParquet

Not every pipeline stage needs a Spark cluster to answer "does this shard look right?" ViewParquet fills the gap between heavyweight infrastructure and blind trust:

- Private, local inspection: sensitive training data never leaves the machine
- SQL editor: profile joins, nulls, and distributions with DuckDB syntax
- AI assistant: generate profiling queries from plain-language questions
- Export: pull filtered subsets back to Parquet, CSV, or JSON for downstream tools

Use ViewParquet at ingestion handoffs, after embedding jobs, and before promoting an eval set to "golden" status. Five minutes of visual confirmation can prevent an expensive pipeline rerun.

Summary

The modern AI data pipeline is a sequence of Parquet contracts. Invest in schema discipline, partition strategy, and lightweight inspection at every boundary; your GPUs (and your on-call rotation) will thank you.