    Free Public Parquet Datasets You Can Download Right Now (Every Link Verified)

    A curated list of 11 public Parquet files you can actually download — flights, NYC taxi trips, Hugging Face NLP sets, and GeoParquet samples. Every link was tested and validated with DuckDB, with sizes, row counts, schemas, and example SQL.

    June 12, 2026

    Searching for a real Parquet file to test a tool, learn DuckDB, or demo a pipeline is harder than it should be. Most "sample dataset" lists point to CSVs, half the Parquet links are dead, and some files behind working links turn out to be corrupt, HTML error pages, or Git LFS pointers instead of actual data.

    So we did the boring work. Every dataset on this page was downloaded in full and validated on June 12, 2026: we checked the PAR1 magic bytes, opened each file with DuckDB (the same engine that powers viewparquet), read the schema with DESCRIBE, ran COUNT(*), and decoded sample rows. If it's listed here, it was a real, readable Parquet file on that date.

    Want to skip the downloads entirely? viewparquet ships with a bundled flights sample — click Try a sample file on the homepage and you'll be querying 231,083 rows of Parquet in seconds, with nothing leaving your browser.

    The verified list at a glance

    Dataset                      Rows        Size     Cols  Good for
    Flights 200k (mosaic)        231,083     1.1 MB   3     Quick tests, demos
    Flights 10M (mosaic)         10,000,000  72 MB    7     Performance testing
    NYC Yellow Taxi (Jan 2024)   2,964,624   48 MB    19    Realistic analytics
    NYC Green Taxi (Jan 2024)    56,551      1.3 MB   20    Small realistic data
    DuckDB Taxi (Apr 2019)       7,433,139   121 MB   18    Large-file testing
    Dutch Train Services         380,959     1.5 MB   8     Time-series practice
    userdata1 (mock users)       1,000       111 KB   13    Mixed-type schema tests
    alltypes_plain (Apache)      8           1.8 KB   11    Type/edge-case testing
    IMDB reviews (Hugging Face)  25,000      20 MB    2     Text/NLP workloads
    SQuAD (Hugging Face)         87,599      13.8 MB  5     Nested-type inspection
    GeoParquet example (OGC)     5           28 KB    6     GeoParquet spec reference

    All sizes are the compressed on-disk Parquet size. Download any of them with curl -L -o file.parquet <url>, or just download in the browser and drag the file into viewparquet.

    1. Flights 200k — the classic demo file

    The flights dataset from the uwdata/mosaic project is probably the most widely used demo Parquet file on the web: it powers visualization demos for Mosaic and Vega and shows up in many DuckDB tutorials. It holds 231,083 US domestic flights in three columns: delay (minutes), distance (miles), and time (hour of day).

    https://raw.githubusercontent.com/uwdata/mosaic/main/data/flights-200k.parquet

    This is the exact file bundled as the viewparquet sample, so you can also load it with one click from our homepage.

    -- Are longer flights more delayed?
    SELECT
      CASE WHEN distance < 500 THEN 'short'
           WHEN distance < 1500 THEN 'medium'
           ELSE 'long' END AS haul,
      round(avg(delay), 1) AS avg_delay_min,
      count(*) AS flights
    FROM data
    GROUP BY 1
    ORDER BY 2 DESC;

    2. Flights 10M — ten million rows for stress tests

    The bigger sibling, served by the UW Interactive Data Lab: 10,000,000 rows and 7 columns (FL_DATE, DEP_DELAY, ARR_DELAY, AIR_TIME, DISTANCE, DEP_TIME, ARR_TIME). At 72 MB compressed, it's the sweet spot for testing how a tool handles real scale without a painful download.

    https://idl.uw.edu/mosaic-datasets/data/flights-10m.parquet

    3. NYC Yellow Taxi — the industry-standard benchmark

    The NYC Taxi & Limousine Commission publishes official trip records as Parquet, refreshed monthly. January 2024 yellow cab data is ~2.96 million trips across 19 columns: pickup/dropoff timestamps, locations, distances, fares, tips, and surcharges. This is real production data with all the messiness that implies — perfect for realistic analytics practice.

    https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet

    The URL follows a predictable pattern: swap 2024-01 for another year-month (yellow cab records go back to 2009), and swap yellow for green, fhv, or fhvhv to get the other fleet types, whose records start later.
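    The pattern is easy to script. The sketch below only builds the URL string from the pattern described above. The function name tlc_trip_url is made up for this example, and it does not check that a given month has actually been published, so treat any generated link as unverified until you download it.

```python
# Base path and fleet names are taken from the URLs quoted in this article.
TLC_BASE = "https://d37ci6vzurychx.cloudfront.net/trip-data"
FLEETS = ("yellow", "green", "fhv", "fhvhv")


def tlc_trip_url(fleet, year, month):
    """Build a monthly TLC trip-record URL (hypothetical helper).

    This only formats the string; it does not confirm the file exists.
    """
    if fleet not in FLEETS:
        raise ValueError(f"unknown fleet: {fleet!r}")
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return f"{TLC_BASE}/{fleet}_tripdata_{year:04d}-{month:02d}.parquet"


if __name__ == "__main__":
    # Reproduces the two TLC links listed in this article.
    print(tlc_trip_url("yellow", 2024, 1))
    print(tlc_trip_url("green", 2024, 1))
```

    Rejecting unknown fleet names up front keeps a typo from turning into a confusing HTTP error later.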

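    -- Tip percentage by hour of day
    -- payment_type = 1 is credit card in the TLC data dictionary;
    -- cash tips are not recorded in tip_amount, so other types would skew the average.
    SELECT
      hour(tpep_pickup_datetime) AS pickup_hour,
      round(avg(tip_amount / nullif(fare_amount, 0)) * 100, 1) AS avg_tip_pct,
      count(*) AS trips
    FROM data
    WHERE fare_amount > 0 AND payment_type = 1
    GROUP BY 1
    ORDER BY 1;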

    4. NYC Green Taxi — same realism, tiny download

    Same TLC source and schema family, but green (borough) cabs do far less volume: 56,551 trips and only 1.3 MB for January 2024. Ideal when you want realistic, messy data without waiting on a 50 MB download.

    https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2024-01.parquet

    5. DuckDB's taxi file — 121 MB, 7.4M rows

    The DuckDB team hosts an April 2019 NYC taxi extract on their blob storage, used in many of their own demos and benchmarks. At 7,433,139 rows and 121 MB it's the largest file on this list — a good test of whether a viewer streams data or tries to decode everything into memory.

    https://blobs.duckdb.org/data/taxi_2019_04.parquet

    6. Dutch railway services — clean time-series data

    Also from DuckDB's hosted datasets: one year of train services on the Dutch railway network (NS). 380,959 rows, 8 columns including station codes, train numbers, and departure/arrival times. Great for practicing window functions and time-series SQL on data that isn't taxis or flights for once.

    https://blobs.duckdb.org/train_services.parquet

    7. userdata1 — the classic mock-PII file

    A long-standing fixture of Parquet tutorials: 1,000 fake user records with 13 mixed-type columns — names, emails, IP addresses, countries, salaries, birthdates, and a timestamp. Small enough to eyeball every row, varied enough to exercise type handling.

    https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet/userdata1.parquet

    The same directory hosts userdata2.parquet through userdata5.parquet with identical schemas — handy for testing multi-file reads and globbing.
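    If you want all five files for a multi-file test, the links can be generated rather than typed. This is a small sketch that assumes the sibling files sit next to userdata1.parquet under the names given above. The helper name userdata_urls is hypothetical, and only userdata1 is on the verified list.

```python
# Directory taken from the userdata1 link in this article.
KYLO_BASE = "https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet"


def userdata_urls(first=1, last=5):
    """Return URLs for userdata{first}..userdata{last} (hypothetical helper)."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return [f"{KYLO_BASE}/userdata{i}.parquet" for i in range(first, last + 1)]


if __name__ == "__main__":
    for url in userdata_urls():
        print(url)
```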

    8. Apache parquet-testing — official edge-case files

    The Apache Parquet project maintains a repository of reference files used to test Parquet implementations themselves. alltypes_plain.parquet packs 11 columns covering booleans, every integer width, floats, doubles, strings, and timestamps into 8 rows and under 2 KB.

    https://github.com/apache/parquet-testing/raw/master/data/alltypes_plain.parquet

    The repository's data/ directory contains dozens more files covering compression codecs, encodings, nested types, and deliberately tricky structures — the definitive source if you're building or testing a Parquet reader.

    9. IMDB reviews — text data from Hugging Face

    Hugging Face automatically converts most public datasets on the Hub to Parquet, which makes it one of the largest collections of downloadable Parquet files on the internet. The IMDB sentiment dataset is a great starting point: 25,000 movie reviews with just two columns, text and label.

    https://huggingface.co/datasets/stanfordnlp/imdb/resolve/refs%2Fconvert%2Fparquet/plain_text/train/0000.parquet

    Long text columns compress differently than numeric data and render differently in viewers — worth testing if your workload is NLP-shaped.

    10. SQuAD — nested structures in the wild

    The Stanford Question Answering Dataset, also via Hugging Face: 87,599 rows with id, title, context, question, and a nested answers struct containing lists. Useful for checking how tools display STRUCT and LIST columns, which trip up plenty of viewers.

    https://huggingface.co/api/datasets/rajpurkar/squad/parquet/plain_text/train/0.parquet

    For any Hugging Face dataset, the URL pattern huggingface.co/api/datasets/{org}/{name}/parquet returns a JSON listing of every auto-converted Parquet shard — a reliable way to find a working download link for nearly any public dataset.
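    Here is a rough sketch of how that listing could be turned into a flat list of shard links. It assumes the JSON is a mapping of config name to split name to a list of URLs; check a live response before relying on that shape. The function names and SAMPLE_LISTING are made up for illustration, and the sample is hand-written from the SQuAD link above, not fetched from the API.

```python
import json


def hf_parquet_listing_url(org, name):
    """Build the listing URL from the pattern quoted in this article."""
    return f"https://huggingface.co/api/datasets/{org}/{name}/parquet"


def flatten_shards(listing):
    """Flatten {config: {split: [url, ...]}} into (config, split, url) tuples.

    The nested-mapping shape is an assumption about the API response.
    """
    rows = []
    for config, splits in listing.items():
        for split, urls in splits.items():
            for url in urls:
                rows.append((config, split, url))
    return rows


# Hand-written sample shaped like the assumed response; not real API output.
SAMPLE_LISTING = json.loads("""
{
  "plain_text": {
    "train": [
      "https://huggingface.co/api/datasets/rajpurkar/squad/parquet/plain_text/train/0.parquet"
    ]
  }
}
""")

if __name__ == "__main__":
    print(hf_parquet_listing_url("rajpurkar", "squad"))
    for config, split, url in flatten_shards(SAMPLE_LISTING):
        print(config, split, url)
```

    To use it against the live API, fetch the listing URL with any HTTP client, parse the body as JSON, and pass the result to flatten_shards.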

    11. The official GeoParquet example

    The OGC GeoParquet specification repository ships a tiny reference file: 5 country polygons with population, continent, and GDP columns plus a WKB geometry column. At 28 KB it's the quickest way to see what a spec-compliant GeoParquet file looks like.

    https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet

    One honest caveat from our verification: this file uses the newer Parquet GEOMETRY logical type, which older readers (for example, pyarrow before v19) reject. DuckDB — and therefore viewparquet — reads it fine.

    How we verified these links

    For transparency, the exact procedure run against every URL on June 12, 2026:

    1. Download the full file with curl -L --fail (following redirects, rejecting HTTP errors).
    2. Confirm the file starts and ends with the PAR1 magic bytes rather than an HTML error page or LFS pointer.
    3. Open it with DuckDB and run DESCRIBE to read the schema from the footer.
    4. Run SELECT COUNT(*) and decode sample rows to prove the data pages are readable.
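
    Step 2 can be sketched in a few lines of Python. This is an illustration of the check, not the script we ran, and the function name parquet_footer_length is made up for this example. It relies on the Parquet file layout, in which the last eight bytes are a 4-byte little-endian footer length followed by PAR1, and it covers only the file envelope: steps 3 and 4 still need a real Parquet reader.

```python
import os
import struct

MAGIC = b"PAR1"  # 4-byte marker at both ends of a plain Parquet file


def parquet_footer_length(path):
    """Return the footer metadata length in bytes, or raise ValueError.

    Checks only the envelope (magic bytes and footer length); it does
    not decode the schema or any data pages.
    """
    size = os.path.getsize(path)
    # Smallest possible envelope: magic + 4-byte footer length + magic.
    if size < 12:
        raise ValueError("too small to be a Parquet file")
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)  # last 8 bytes: footer length + magic
        tail = f.read(8)
    if head != MAGIC or tail[4:] != MAGIC:
        # Typical causes: an HTML error page or a Git LFS pointer file.
        raise ValueError("missing PAR1 magic bytes")
    (footer_len,) = struct.unpack("<I", tail[:4])
    if footer_len > size - 12:
        raise ValueError("footer length exceeds file size")
    return footer_len


if __name__ == "__main__":
    import sys

    for p in sys.argv[1:]:
        try:
            print(p, "footer bytes:", parquet_footer_length(p))
        except ValueError as err:
            print(p, "rejected:", err)
```

    A file that passes this check can still be corrupt further in, which is why the procedure goes on to read the schema and decode rows.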

    Links rot, hosts change, and datasets get reorganized — if you hit a dead link, the Hugging Face API pattern and the NYC TLC monthly URL pattern above are the most future-proof sources on this list.

    Opening these files without writing code

    Every file above works the same way in viewparquet: download it, then drag it into the viewer at viewparquet.com. The file is registered with DuckDB-WASM inside your browser tab — nothing is uploaded to any server — and you get a virtualized grid, full DuckDB SQL, schema and metadata inspection, and export to Parquet, CSV, or JSON.

    If you just want to poke at a Parquet file right now, the Try a sample file button on our homepage loads the flights dataset from this list with one click. No download, no signup, no Python environment.