Free Public Parquet Datasets You Can Download Right Now (Every Link Verified)

A curated list of 11 public Parquet files you can actually download — flights, NYC taxi trips, Hugging Face NLP sets, and GeoParquet samples. Every link was tested and validated with DuckDB, with sizes, row counts, schemas, and example SQL.

June 12, 2026

Searching for a real Parquet file to test a tool, learn DuckDB, or demo a pipeline is harder than it should be. Most "sample dataset" lists point to CSVs, half the Parquet links are dead, and some files behind working links turn out to be corrupt, HTML error pages, or Git LFS pointers instead of actual data.

So we did the boring work. Every dataset on this page was downloaded in full and validated on June 12, 2026: we checked the PAR1 magic bytes, opened each file with DuckDB (the same engine that powers viewparquet), read the schema with DESCRIBE, ran COUNT(*), and decoded sample rows. If it's listed here, it was a real, readable Parquet file on that date.

Want to skip the downloads entirely? viewparquet ships with a bundled flights sample — click Try a sample file on the homepage and you'll be querying 231,083 rows of Parquet in seconds, with nothing leaving your browser.

The verified list at a glance

Dataset	Rows	Size	Cols	Good for
Flights 200k (mosaic)	231,083	1.1 MB	3	Quick tests, demos
Flights 10M (mosaic)	10,000,000	72 MB	7	Performance testing
NYC Yellow Taxi (Jan 2024)	2,964,624	48 MB	19	Realistic analytics
NYC Green Taxi (Jan 2024)	56,551	1.3 MB	20	Small realistic data
DuckDB Taxi (Apr 2019)	7,433,139	121 MB	18	Large-file testing
Dutch Train Services	380,959	1.5 MB	8	Time-series practice
userdata1 (mock users)	1,000	111 KB	13	Mixed-type schema tests
alltypes_plain (Apache)	8	1.8 KB	11	Type/edge-case testing
IMDB reviews (Hugging Face)	25,000	20 MB	2	Text/NLP workloads
SQuAD (Hugging Face)	87,599	13.8 MB	5	Nested-type inspection
GeoParquet example (OGC)	5	28 KB	6	GeoParquet spec reference

All sizes are the compressed on-disk Parquet size. Download any of them with curl -L -o file.parquet <url>, or just download in the browser and drag the file into viewparquet.
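If you would rather script the download than use curl, a minimal Python sketch looks like the following. The function name `download` and the 60-second timeout are our own choices for illustration; `urllib.request.urlopen` follows redirects and raises an error on HTTP 4xx/5xx responses, which is roughly what `curl -L --fail` does.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url: str, dest: str) -> int:
    """Stream `url` to `dest` and return the number of bytes written.

    urlopen follows redirects and raises urllib.error.HTTPError on
    4xx/5xx responses, so an error page is not silently saved to disk.
    """
    with urllib.request.urlopen(url, timeout=60) as response, open(dest, "wb") as out:
        # Copy in chunks so a 100 MB file never sits fully in memory.
        shutil.copyfileobj(response, out)
    return Path(dest).stat().st_size
```

For example, `download("https://raw.githubusercontent.com/uwdata/mosaic/main/data/flights-200k.parquet", "flights-200k.parquet")` should leave a file of about 1.1 MB on disk.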

1. Flights 200k — the classic demo file

The flights dataset from the uwdata/mosaic project is probably the most widely used demo Parquet file on the web: it powers visualization demos for Mosaic and Vega and shows up in countless DuckDB tutorials. It holds 231,083 US domestic flights in three columns: delay (minutes), distance (miles), and time (hour of day).

https://raw.githubusercontent.com/uwdata/mosaic/main/data/flights-200k.parquet

This is the exact file bundled as the viewparquet sample, so you can also load it with one click from our homepage.

-- Are longer flights more delayed?
SELECT
  CASE WHEN distance < 500 THEN 'short'
       WHEN distance < 1500 THEN 'medium'
       ELSE 'long' END AS haul,
  round(avg(delay), 1) AS avg_delay_min,
  count(*) AS flights
FROM data
GROUP BY 1
ORDER BY 2 DESC;

2. Flights 10M — ten million rows for stress tests

The bigger sibling, served by the UW Interactive Data Lab: 10,000,000 rows and 7 columns (FL_DATE, DEP_DELAY, ARR_DELAY, AIR_TIME, DISTANCE, DEP_TIME, ARR_TIME). At 72 MB compressed, it's the sweet spot for testing how a tool handles real scale without a painful download.

https://idl.uw.edu/mosaic-datasets/data/flights-10m.parquet

3. NYC Yellow Taxi — the industry-standard benchmark

The NYC Taxi & Limousine Commission publishes official trip records as Parquet, refreshed monthly. The January 2024 yellow cab file holds 2,964,624 trips across 19 columns: pickup/dropoff timestamps, locations, distances, fares, tips, and surcharges. This is real production data with all the messiness that implies, which makes it perfect for realistic analytics practice.

https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet

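The URL follows a predictable pattern: swap 2024-01 for another year-month (yellow data goes back to 2009), and swap yellow for green, fhv, or fhvhv to get other fleet types. The other fleets started publishing later than yellow, so not every fleet and month combination exists.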
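The pattern is simple enough to generate programmatically. Here is a small sketch; the helper name `tlc_url` and the validation rules are ours, and the function only builds the URL without checking that the file exists for that fleet and month.

```python
TLC_BASE = "https://d37ci6vzurychx.cloudfront.net/trip-data"
FLEETS = ("yellow", "green", "fhv", "fhvhv")


def tlc_url(fleet: str, year: int, month: int) -> str:
    """Build a monthly TLC trip-data URL from the naming pattern above."""
    if fleet not in FLEETS:
        raise ValueError(f"unknown fleet {fleet!r}; expected one of {FLEETS}")
    if not 1 <= month <= 12:
        raise ValueError(f"month must be 1-12, got {month}")
    # Months are zero-padded: 2024-01, not 2024-1.
    return f"{TLC_BASE}/{fleet}_tripdata_{year:04d}-{month:02d}.parquet"


# All twelve yellow-cab files for 2023:
urls_2023 = [tlc_url("yellow", 2023, m) for m in range(1, 13)]
```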

-- Tip percentage by hour of day
SELECT
  hour(tpep_pickup_datetime) AS pickup_hour,
  round(avg(tip_amount / nullif(fare_amount, 0)) * 100, 1) AS avg_tip_pct,
  count(*) AS trips
FROM data
WHERE fare_amount > 0 AND payment_type = 1
GROUP BY 1
ORDER BY 1;

4. NYC Green Taxi — same realism, tiny download

Same TLC source and schema family, but green (borough) cabs do far less volume: 56,551 trips and only 1.3 MB for January 2024. Ideal when you want realistic, messy data without waiting on a 50 MB download.

https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2024-01.parquet

5. DuckDB's taxi file — 121 MB, 7.4M rows

The DuckDB team hosts an April 2019 NYC taxi extract on their blob storage, used in many of their own demos and benchmarks. At 7,433,139 rows and 121 MB it's the largest download on this list (Flights 10M has more rows but is smaller on disk), which makes it a good test of whether a viewer streams data or tries to decode everything into memory.

https://blobs.duckdb.org/data/taxi_2019_04.parquet

6. Dutch railway services — clean time-series data

Also from DuckDB's hosted datasets: one year of train services on the Dutch railway network (NS). 380,959 rows, 8 columns including station codes, train numbers, and departure/arrival times. Great for practicing window functions and time-series SQL on data that isn't taxis or flights for once.

https://blobs.duckdb.org/train_services.parquet

7. userdata1 — the classic mock-PII file

A long-standing fixture of Parquet tutorials: 1,000 fake user records with 13 mixed-type columns — names, emails, IP addresses, countries, salaries, birthdates, and a timestamp. Small enough to eyeball every row, varied enough to exercise type handling.

https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet/userdata1.parquet

The same directory hosts userdata2.parquet through userdata5.parquet with identical schemas — handy for testing multi-file reads and globbing.
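To set up a multi-file test you need all five URLs. A short sketch that derives them from the naming scheme described above (the helper name `userdata_urls` is ours):

```python
KYLO_BASE = "https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet"


def userdata_urls(first: int = 1, last: int = 5) -> list[str]:
    """Return the URLs for userdata{first}.parquet .. userdata{last}.parquet."""
    return [f"{KYLO_BASE}/userdata{i}.parquet" for i in range(first, last + 1)]


for url in userdata_urls():
    print(url)
```

Once the files are saved side by side, a glob such as `userdata*.parquet` passed to DuckDB's `read_parquet` reads them as one table.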

8. Apache parquet-testing — official edge-case files

The Apache Parquet project maintains a repository of reference files used to test Parquet implementations themselves. alltypes_plain.parquet packs 11 columns covering booleans, every integer width, floats, doubles, strings, and timestamps into 8 rows and under 2 KB.

https://github.com/apache/parquet-testing/raw/master/data/alltypes_plain.parquet

The repository's data/ directory contains dozens more files covering compression codecs, encodings, nested types, and deliberately tricky structures — the definitive source if you're building or testing a Parquet reader.

9. IMDB reviews — text data from Hugging Face

Hugging Face automatically converts public datasets on the Hub to Parquet, which makes it one of the largest collections of downloadable Parquet files on the internet. The IMDB sentiment dataset is a great starting point: 25,000 movie reviews with just two columns, text and label.

https://huggingface.co/datasets/stanfordnlp/imdb/resolve/refs%2Fconvert%2Fparquet/plain_text/train/0000.parquet

Long text columns compress differently than numeric data and render differently in viewers — worth testing if your workload is NLP-shaped.

10. SQuAD — nested structures in the wild

The Stanford Question Answering Dataset, also via Hugging Face: 87,599 rows with id, title, context, question, and a nested answers struct containing lists. Useful for checking how tools display STRUCT and LIST columns, which trip up plenty of viewers.
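To make the nested shape concrete, here is one row sketched as plain Python. We assume the answers struct holds two parallel lists named `text` and `answer_start` (field names taken from the dataset's published schema, not from this article), and the row itself is invented for illustration rather than copied from SQuAD.

```python
def answer_pairs(row: dict) -> list[tuple[str, int]]:
    """Pair each answer string with its character offset into the context."""
    answers = row["answers"]
    return list(zip(answers["text"], answers["answer_start"]))


# Hypothetical row with the same shape as a SQuAD record.
example_row = {
    "id": "example-0",
    "title": "Example",
    "context": "Parquet is a columnar file format.",
    "question": "What kind of format is Parquet?",
    "answers": {"text": ["columnar"], "answer_start": [13]},
}

for text, start in answer_pairs(example_row):
    # The offset points at the answer's position inside the context string.
    print(text, start, example_row["context"][start:start + len(text)])
```

A viewer that handles STRUCT and LIST columns well should let you drill into `answers` in much the same way, instead of showing an opaque blob.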

https://huggingface.co/api/datasets/rajpurkar/squad/parquet/plain_text/train/0.parquet

For any Hugging Face dataset, the URL pattern huggingface.co/api/datasets/{org}/{name}/parquet returns a JSON listing of every auto-converted Parquet shard — a reliable way to find a working download link for nearly any public dataset.
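A sketch of how that listing might be consumed from Python. We assume the JSON is shaped as `{config: {split: [url, ...]}}`; check a live response before relying on it. The helper names are ours.

```python
import json
import urllib.request


def listing_url(org: str, name: str) -> str:
    """URL of the JSON listing of auto-converted Parquet shards."""
    return f"https://huggingface.co/api/datasets/{org}/{name}/parquet"


def shard_urls(listing: dict) -> list[tuple[str, str, str]]:
    """Flatten an assumed {config: {split: [url, ...]}} listing into rows."""
    rows = []
    for config, splits in listing.items():
        for split, urls in splits.items():
            for url in urls:
                rows.append((config, split, url))
    return rows


def fetch_listing(org: str, name: str) -> dict:
    """Download and parse the listing (requires network access)."""
    with urllib.request.urlopen(listing_url(org, name), timeout=30) as response:
        return json.load(response)
```

Calling `shard_urls(fetch_listing("rajpurkar", "squad"))` should return one `(config, split, url)` tuple per shard, including the train shard linked above.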

11. The official GeoParquet example

The OGC GeoParquet specification repository ships a tiny reference file: 5 country polygons with population, continent, and GDP columns plus a WKB geometry column. At 28 KB it's the quickest way to see what a spec-compliant GeoParquet file looks like.

https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet

One honest caveat from our verification: this file uses the newer Parquet GEOMETRY logical type, which older readers (for example, pyarrow before v19) reject. DuckDB — and therefore viewparquet — reads it fine.

How we verified these links

For transparency, the exact procedure run against every URL on June 12, 2026:

Download the full file with curl -L --fail (following redirects, rejecting HTTP errors).
Confirm the file starts and ends with the PAR1 magic bytes rather than an HTML error page or LFS pointer.
Open it with DuckDB and run DESCRIBE to read the schema from the footer.
Run SELECT COUNT(*) and decode sample rows to prove the data pages are readable.
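The magic-byte step can be approximated in a few lines of Python. This is a sketch of the idea, not the script we ran: the function name, the return labels, and the HTML/LFS heuristics are illustrative. It relies on two facts about the format: a Parquet file begins and ends with the bytes PAR1, and the four bytes before the trailing magic hold the footer length as a little-endian integer.

```python
import struct
from pathlib import Path

MAGIC = b"PAR1"                          # Parquet files begin and end with these 4 bytes
LFS_PREFIX = b"version https://git-lfs"  # first line of a Git LFS pointer file


def classify(path: str) -> str:
    """Return 'parquet', 'lfs-pointer', 'html', 'truncated', or 'not-parquet'."""
    size = Path(path).stat().st_size
    with open(path, "rb") as f:
        head = f.read(64)
        if head.startswith(LFS_PREFIX):
            return "lfs-pointer"
        if head.lstrip().lower().startswith((b"<!doctype", b"<html")):
            return "html"
        # Smallest possible file: magic + 4-byte footer length + magic.
        if size < 12 or head[:4] != MAGIC:
            return "not-parquet"
        f.seek(-8, 2)  # last 8 bytes: footer length, then trailing magic
        tail = f.read(8)
    if tail[4:] != MAGIC:
        return "truncated"
    (footer_len,) = struct.unpack("<I", tail[:4])
    if footer_len + 12 > size:
        return "truncated"  # the footer claims more bytes than the file holds
    return "parquet"
```

A "parquet" result only shows that the file is framed correctly. Reading the schema and decoding rows with DuckDB is still what proves the data pages are usable.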

Links rot, hosts change, and datasets get reorganized — if you hit a dead link, the Hugging Face API pattern and the NYC TLC monthly URL pattern above are the most future-proof sources on this list.

Opening these files without writing code

Every file above works the same way in viewparquet: download it, then drag it into the viewer at viewparquet.com. The file is registered with DuckDB-WASM inside your browser tab — nothing is uploaded to any server — and you get a virtualized grid, full DuckDB SQL, schema and metadata inspection, and export to Parquet, CSV, or JSON.

If you just want to poke at a Parquet file right now, the Try a sample file button on our homepage loads the flights dataset from this list with one click. No download, no signup, no Python environment.