    The Modern AI Data Pipeline: From Ingestion to Fine-Tuning with Parquet

    A practical walkthrough of ETL → feature store → eval sets for ML teams — with schema drift checks and in-browser spot-checks at every stage.

    April 28, 2026
    12 min read

    Production AI systems are data pipelines first and models second. The path from raw logs or documents to a fine-tuned model typically crosses ingestion, cleaning, labeling, feature materialization, training export, and evaluation — with Parquet as the durable format at each handoff.

    Stage 1: Ingestion and normalization

    Raw inputs arrive as JSON lines, CSV exports, API dumps, or warehouse tables. The first durable artifact is usually Parquet:

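```python
# Conceptual: normalize raw records to a canonical training schema
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ("document_id", pa.string()),
    ("text", pa.string()),
    ("source", pa.string()),
    ("ingested_at", pa.timestamp("us", tz="UTC")),
])

# Write partitioned by ingestion date (derived from ingested_at) for
# incremental pipelines, e.g. with a table that carries a `date` column:
# pq.write_to_dataset(table, root_path="docs", partition_cols=["date"])
```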

    Partitioning by ingestion date (derived from `ingested_at`) or by `source` keeps incremental jobs fast, because query engines can skip directories whose partition values fall outside the filter.
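
    To make the directory-skipping idea concrete, here is a minimal sketch of Hive-style partition paths and a pruning step, using only the standard library. The helper names (`partition_dir`, `parse_partition`, `prune`) and the `date=.../source=...` layout are assumptions for illustration; real engines do this pruning internally.

```python
from __future__ import annotations

from datetime import date, datetime, timezone
from pathlib import PurePosixPath
from typing import Optional


def partition_dir(root: str, ingested_at: datetime, source: str) -> PurePosixPath:
    """Build a Hive-style directory such as root/date=2026-04-28/source=wiki."""
    day = ingested_at.astimezone(timezone.utc).date().isoformat()
    return PurePosixPath(root) / f"date={day}" / f"source={source}"


def parse_partition(path: PurePosixPath) -> dict:
    """Recover key=value partition fields from the directory components."""
    return dict(part.split("=", 1) for part in path.parts if "=" in part)


def prune(paths, start: date, end: date, source: Optional[str] = None):
    """Keep only directories whose partition values can satisfy the filter."""
    kept = []
    for path in paths:
        fields = parse_partition(path)
        day = date.fromisoformat(fields["date"])
        if not (start <= day <= end):
            continue  # whole directory skipped without opening any file
        if source is not None and fields.get("source") != source:
            continue
        kept.append(path)
    return kept


dirs = [
    partition_dir("docs", datetime(2026, 4, 27, 23, 30, tzinfo=timezone.utc), "support_tickets"),
    partition_dir("docs", datetime(2026, 4, 28, 9, 0, tzinfo=timezone.utc), "support_tickets"),
    partition_dir("docs", datetime(2026, 4, 28, 9, 5, tzinfo=timezone.utc), "wiki"),
]
selected = prune(dirs, date(2026, 4, 28), date(2026, 4, 28), source="support_tickets")
print([str(p) for p in selected])  # ['docs/date=2026-04-28/source=support_tickets']
```

    Only one of the three directories survives the filter, so an incremental job for that day and source never touches the other two.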

    Stage 2: Cleaning and enrichment

    Deduplication, language detection, toxicity filters, and PII redaction happen in Spark, DuckDB, or Polars. Outputs remain Parquet so downstream stages inherit column pruning and compression.

    Key checks at this stage:

    • Row counts match expectations vs upstream
    • No duplicate `document_id` within a partition
    • Text column max length fits tokenizer limits
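
    The three checks above can be sketched as one function over rows loaded from a shard. This is an illustration under stated assumptions: `check_shard`, `max_drop_ratio`, and `max_chars` are hypothetical names, and character length is used only as a cheap proxy for token count, since the real limit depends on your tokenizer.

```python
from collections import Counter


def check_shard(rows, upstream_count, max_chars, max_drop_ratio=0.2):
    """Return a list of human-readable problems found in one partition."""
    problems = []

    # 1. Row count: cleaning may drop rows, but not more than the allowed ratio.
    dropped = upstream_count - len(rows)
    if dropped < 0 or dropped > upstream_count * max_drop_ratio:
        problems.append(f"row count {len(rows)} unexpected vs upstream {upstream_count}")

    # 2. Duplicate document_id values within the partition.
    counts = Counter(row["document_id"] for row in rows)
    dupes = sorted(doc_id for doc_id, n in counts.items() if n > 1)
    if dupes:
        problems.append(f"duplicate document_id: {dupes}")

    # 3. Longest text in characters (an assumed proxy for tokenizer limits).
    longest = max((len(row["text"]) for row in rows), default=0)
    if longest > max_chars:
        problems.append(f"max text length {longest} exceeds {max_chars}")

    return problems


rows = [
    {"document_id": "a", "text": "short ticket"},
    {"document_id": "b", "text": "x" * 50},
    {"document_id": "a", "text": "repeat of a"},
]
problems = check_shard(rows, upstream_count=3, max_chars=40)
for problem in problems:
    print(problem)
```

    Returning a list of problems rather than raising on the first failure lets a pipeline report every violation for a shard in one pass.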

    Stage 3: Labeling and human review

    Human-in-the-loop workflows often add columns such as `label`, `reviewer_id`, and `confidence`. Parquet tolerates this kind of schema evolution: new files can carry new columns without rewriting historical files, but downstream jobs must tolerate NULL for those columns on older shards.

    Track schema versions in your pipeline metadata, and diff the Parquet footer schemas whenever a new label column appears.
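
    A schema diff does not need heavy tooling. The sketch below assumes each schema has already been read from a Parquet footer into a plain `{column: type}` mapping; the function names and the type strings are illustrative, not part of any library.

```python
def diff_schemas(old, new):
    """Compare two {column: type} mappings and report what changed."""
    added = {c: t for c, t in new.items() if c not in old}
    removed = {c: t for c, t in old.items() if c not in new}
    changed = {c: (old[c], new[c]) for c in old if c in new and old[c] != new[c]}
    return {"added": added, "removed": removed, "changed": changed}


def is_backward_compatible(diff):
    """Additive changes are safe for NULL-tolerant readers; removals and type changes are not."""
    return not diff["removed"] and not diff["changed"]


def conform(row, schema):
    """Project a row onto the schema, filling columns absent from older shards with None."""
    return {column: row.get(column) for column in schema}


v1 = {"document_id": "string", "text": "string", "source": "string"}
v2 = {**v1, "label": "string", "reviewer_id": "string", "confidence": "double"}

diff = diff_schemas(v1, v2)
print(sorted(diff["added"]))         # ['confidence', 'label', 'reviewer_id']
print(is_backward_compatible(diff))  # True

old_row = conform({"document_id": "a", "text": "hi", "source": "wiki"}, v2)
print(old_row["label"])              # None
```

    Treating added columns as compatible and removed or retyped columns as breaking is one reasonable policy; stricter pipelines may want to fail on any difference.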

    Stage 4: Feature materialization

    Embeddings, tokenized sequences, and derived features land in separate Parquet tables or additional columns:

```sql
-- Join base documents to embedding table on document_id
SELECT d.document_id, d.text, e.embedding
FROM read_parquet('docs/date=2026-04-28/*.parquet') d
JOIN read_parquet('embeddings/v2/*.parquet') e USING (document_id)
WHERE d.source = 'support_tickets'
LIMIT 100;
```

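    Join keys and embedding dimensions are the highest-risk contracts between stages: a key mismatch silently drops rows from an inner join, and a changed embedding dimension breaks any consumer that expects a fixed vector size.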
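
    One way to test that contract before running the join is sketched below. It assumes document IDs and embeddings have been loaded into plain Python collections; `check_join_contract` and `expected_dim` are hypothetical names for illustration.

```python
def check_join_contract(doc_ids, embeddings, expected_dim):
    """Report IDs that would be lost in the join and vectors of the wrong size.

    doc_ids: iterable of document_id values from the base table.
    embeddings: mapping of document_id -> list of floats.
    """
    doc_ids = set(doc_ids)
    embedded_ids = set(embeddings)
    return {
        # Documents with no embedding: an inner join drops these silently.
        "missing": sorted(doc_ids - embedded_ids),
        # Embeddings with no document: often a sign of stale or mismatched keys.
        "orphaned": sorted(embedded_ids - doc_ids),
        # Vectors whose length differs from what the model expects.
        "wrong_dim": sorted(k for k, v in embeddings.items() if len(v) != expected_dim),
    }


docs = ["a", "b", "c"]
vectors = {
    "a": [0.1, 0.2, 0.3],
    "b": [0.1, 0.2],        # truncated vector
    "z": [0.0, 0.0, 0.0],   # no matching document
}
report = check_join_contract(docs, vectors, expected_dim=3)
print(report)
# {'missing': ['c'], 'orphaned': ['z'], 'wrong_dim': ['b']}
```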

    Stage 5: Training and eval exports

    Training exports are often filtered Parquet (specific splits, quality thresholds). Eval sets are smaller, frozen Parquet snapshots versioned by git tag or object-store path.
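
    A tag or path names a snapshot but does not by itself prove the bytes are unchanged. As an assumption beyond what is described above, a content hash stored alongside the version can confirm that an eval shard is still frozen. The manifest layout and function names below are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large shards do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_snapshot(manifest):
    """manifest maps path -> expected hex digest; return paths whose content changed."""
    return [path for path, expected in manifest.items() if sha256_of(path) != expected]


with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "eval-v1.parquet"
    shard.write_bytes(b"stand-in bytes, not a real Parquet file")
    manifest = {str(shard): sha256_of(shard)}

    unchanged = verify_snapshot(manifest)   # nothing modified yet
    shard.write_bytes(b"someone edited the frozen shard")
    changed = verify_snapshot(manifest)     # the edit is detected

print(unchanged)     # []
print(len(changed))  # 1
```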

    Before every training run:

    1. Schema diff against the previous successful export
    2. Row count and class balance for supervised tasks
    3. Sample review of 20–50 random rows in a grid viewer
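
    Items 2 and 3 of this checklist can be sketched with the standard library, assuming the export has been loaded into a list of row dictionaries. The column name `label`, the seed, and the helper names are illustrative assumptions.

```python
import random
from collections import Counter


def class_balance(rows, label_column="label"):
    """Return the fraction of rows per label value."""
    counts = Counter(row[label_column] for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def sample_rows(rows, k, seed=0):
    """Draw up to k rows without replacement; a fixed seed makes the review repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


rows = [
    {"document_id": f"doc-{i}", "label": "spam" if i % 4 == 0 else "ham"}
    for i in range(200)
]

balance = class_balance(rows)
print(balance)      # {'spam': 0.25, 'ham': 0.75}

review = sample_rows(rows, k=30, seed=42)
print(len(review))  # 30
```

    Seeding the sampler means two reviewers looking at the same export see the same rows, which makes a disagreement about data quality easier to reproduce.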

    Stage 6: Monitoring and drift

    Post-deployment, log inference inputs and outcomes back to Parquet for drift analysis. Compare embedding distributions or label rates week over week with the same SQL profiling queries used pre-training.
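
    For label rates, one common way to put a number on week-over-week change is the population stability index (PSI). PSI is not named above; it is swapped in here as one reasonable choice. The sketch assumes labels for each week have been read into plain lists, and the alert threshold is a judgment call for your own data.

```python
import math
from collections import Counter


def distribution(values, categories, smoothing=1e-6):
    """Smoothed category frequencies, so an unseen category never yields log(0)."""
    if not values:
        raise ValueError("cannot compute a distribution over zero rows")
    counts = Counter(values)
    raw = {c: counts.get(c, 0) / len(values) + smoothing for c in categories}
    norm = sum(raw.values())
    return {c: p / norm for c, p in raw.items()}


def psi(expected, actual):
    """Population stability index: sum of (a - e) * ln(a / e) over categories."""
    categories = sorted(set(expected) | set(actual))
    e = distribution(expected, categories)
    a = distribution(actual, categories)
    return sum((a[c] - e[c]) * math.log(a[c] / e[c]) for c in categories)


last_week = ["ham"] * 80 + ["spam"] * 20
this_week = ["ham"] * 60 + ["spam"] * 40

ALERT_THRESHOLD = 0.1  # hypothetical value; tune against your own history

score = psi(last_week, this_week)
print(round(score, 3))          # 0.196
print(score > ALERT_THRESHOLD)  # True
```

    Identical distributions score zero, and the score grows as the label mix shifts, so the same function can run on a schedule against each new week of logged inferences.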

    Tooling choices at each stage

    | Stage | Common tools | Parquet role |
    | --- | --- | --- |
    | Ingest | Airflow, Dagster, Fivetran | Landing zone files |
    | Transform | Spark, DuckDB, dbt | Intermediate tables |
    | Label | Label Studio, custom UIs | Annotated shards |
    | Train | PyTorch, HF Trainer | `datasets` Parquet loader |
    | Eval | Weights & Biases, custom | Frozen eval shards |

    Workflow Integration with ViewParquet

    Not every pipeline stage needs a Spark cluster to answer "does this shard look right?" ViewParquet fills the gap between heavyweight infrastructure and blind trust:

    • Private, local inspection — sensitive training data never leaves the machine
    • SQL editor — profile joins, nulls, and distributions with DuckDB syntax
    • AI assistant — generate profiling queries from plain-language questions
    • Export — pull filtered subsets back to Parquet, CSV, or JSON for downstream tools

    Use ViewParquet at ingestion handoffs, after embedding jobs, and before promoting an eval set to "golden" status. Five minutes of visual confirmation can prevent an expensive pipeline rerun.

    Summary

    The modern AI data pipeline is a sequence of Parquet contracts. Invest in schema discipline, partition strategy, and lightweight inspection at every boundary; your GPUs (and your on-call rotation) will thank you.