Free Public Parquet Datasets You Can Download Right Now (Every Link Verified)
A curated list of 11 public Parquet files you can actually download — flights, NYC taxi trips, Hugging Face NLP sets, and GeoParquet samples. Every link was tested and validated with DuckDB, with sizes, row counts, schemas, and example SQL.
Searching for a real Parquet file to test a tool, learn DuckDB, or demo a pipeline is harder than it should be. Most "sample dataset" lists point to CSVs, half the Parquet links are dead, and some files behind working links turn out to be corrupt, HTML error pages, or Git LFS pointers instead of actual data.
So we did the boring work. Every dataset on this page was downloaded in full and validated on June 12, 2026: we checked the PAR1 magic bytes, opened each file with DuckDB (the same engine that powers viewparquet), read the schema with DESCRIBE, ran COUNT(*), and decoded sample rows. If it's listed here, it was a real, readable Parquet file on that date.
Want to skip the downloads entirely? viewparquet ships with a bundled flights sample — click Try a sample file on the homepage and you'll be querying 231,083 rows of Parquet in seconds, with nothing leaving your browser.
The verified list at a glance
| Dataset | Rows | Size | Cols | Good for |
|---|---|---|---|---|
| Flights 200k (mosaic) | 231,083 | 1.1 MB | 3 | Quick tests, demos |
| Flights 10M (mosaic) | 10,000,000 | 72 MB | 7 | Performance testing |
| NYC Yellow Taxi (Jan 2024) | 2,964,624 | 48 MB | 19 | Realistic analytics |
| NYC Green Taxi (Jan 2024) | 56,551 | 1.3 MB | 20 | Small realistic data |
| DuckDB Taxi (Apr 2019) | 7,433,139 | 121 MB | 18 | Large-file testing |
| Dutch Train Services | 380,959 | 1.5 MB | 8 | Time-series practice |
| userdata1 (mock users) | 1,000 | 111 KB | 13 | Mixed-type schema tests |
| alltypes_plain (Apache) | 8 | 1.8 KB | 11 | Type/edge-case testing |
| IMDB reviews (Hugging Face) | 25,000 | 20 MB | 2 | Text/NLP workloads |
| SQuAD (Hugging Face) | 87,599 | 13.8 MB | 5 | Nested-type inspection |
| GeoParquet example (OGC) | 5 | 28 KB | 6 | GeoParquet spec reference |
All sizes are the compressed on-disk Parquet size. Download any of them with curl -L -o file.parquet <url>, or just download in the browser and drag the file into viewparquet.
1. Flights 200k — the classic demo file
The flights dataset from the uwdata/mosaic project is probably the most widely used demo Parquet file on the web — it powers visualization demos for Mosaic, Vega, and countless DuckDB tutorials. 231,083 US domestic flights with three columns: delay (minutes), distance (miles), and time (hour of day).
https://raw.githubusercontent.com/uwdata/mosaic/main/data/flights-200k.parquet
This is the exact file bundled as the viewparquet sample, so you can also load it with one click from our homepage.
-- Are longer flights more delayed?
SELECT
  CASE WHEN distance < 500 THEN 'short'
       WHEN distance < 1500 THEN 'medium'
       ELSE 'long' END AS haul,
  round(avg(delay), 1) AS avg_delay_min,
  count(*) AS flights
FROM data
GROUP BY 1
ORDER BY 2 DESC;
2. Flights 10M — ten million rows for stress tests
The bigger sibling, served by the UW Interactive Data Lab: 10,000,000 rows and 7 columns (FL_DATE, DEP_DELAY, ARR_DELAY, AIR_TIME, DISTANCE, DEP_TIME, ARR_TIME). At 72 MB compressed, it's the sweet spot for testing how a tool handles real scale without a painful download.
https://idl.uw.edu/mosaic-datasets/data/flights-10m.parquet
3. NYC Yellow Taxi — the industry-standard benchmark
The NYC Taxi & Limousine Commission publishes official trip records as Parquet, refreshed monthly. January 2024 yellow cab data is ~2.96 million trips across 19 columns: pickup/dropoff timestamps, locations, distances, fares, tips, and surcharges. This is real production data with all the messiness that implies — perfect for realistic analytics practice.
https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet
The URL follows a predictable pattern: swap 2024-01 for any year-month back to 2009, and yellow for green, fhv, or fhvhv to get the other fleet types. The other fleets do not go back as far as yellow, so the earliest months exist only for yellow cabs.
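The URL pattern above can be sketched as a small helper. This is a minimal illustration, not an official TLC client: the names tlc_url, TLC_BASE, and FLEETS are made up for this example, and a well-formed URL is no guarantee that a file was published for that fleet and month.

```python
# Hypothetical helper: builds a TLC trip-data URL from the pattern described above.
TLC_BASE = "https://d37ci6vzurychx.cloudfront.net/trip-data"
FLEETS = ("yellow", "green", "fhv", "fhvhv")  # fleet names taken from the article


def tlc_url(fleet: str, year: int, month: int) -> str:
    """Return the monthly Parquet URL for one fleet type.

    Only the shape of the URL is validated here; whether the file actually
    exists for that month has to be checked by downloading it.
    """
    if fleet not in FLEETS:
        raise ValueError(f"unknown fleet {fleet!r}, expected one of {FLEETS}")
    if not 1 <= month <= 12:
        raise ValueError(f"month must be 1-12, got {month}")
    return f"{TLC_BASE}/{fleet}_tripdata_{year:04d}-{month:02d}.parquet"


if __name__ == "__main__":
    # Print the first quarter of 2024 for yellow cabs.
    for m in (1, 2, 3):
        print(tlc_url("yellow", 2024, m))
```

Generating the URLs in a loop like this is a convenient way to fetch several months and test multi-file reads.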
-- Tip percentage by hour of day
SELECT
  hour(tpep_pickup_datetime) AS pickup_hour,
  round(avg(tip_amount / nullif(fare_amount, 0)) * 100, 1) AS avg_tip_pct,
  count(*) AS trips
FROM data
-- payment_type = 1 is credit card; cash tips are not recorded in tip_amount
WHERE fare_amount > 0 AND payment_type = 1
GROUP BY 1
ORDER BY 1;
4. NYC Green Taxi — same realism, tiny download
Same TLC source and schema family, but green (borough) cabs do far less volume: 56,551 trips and only 1.3 MB for January 2024. Ideal when you want realistic, messy data without waiting on a 50 MB download.
https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2024-01.parquet
5. DuckDB's taxi file — 121 MB, 7.4M rows
The DuckDB team hosts an April 2019 NYC taxi extract on their blob storage, used in many of their own demos and benchmarks. At 7,433,139 rows and 121 MB it's the largest file on this list — a good test of whether a viewer streams data or tries to decode everything into memory.
https://blobs.duckdb.org/data/taxi_2019_04.parquet
6. Dutch railway services — clean time-series data
Also from DuckDB's hosted datasets: one year of train services on the Dutch railway network (NS). 380,959 rows, 8 columns including station codes, train numbers, and departure/arrival times. Great for practicing window functions and time-series SQL on data that isn't taxis or flights for once.
https://blobs.duckdb.org/train_services.parquet
7. userdata1 — the classic mock-PII file
A long-standing fixture of Parquet tutorials: 1,000 fake user records with 13 mixed-type columns — names, emails, IP addresses, countries, salaries, birthdates, and a timestamp. Small enough to eyeball every row, varied enough to exercise type handling.
https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet/userdata1.parquet
The same directory hosts userdata2.parquet through userdata5.parquet with identical schemas, which is handy for testing multi-file reads and globbing.
8. Apache parquet-testing — official edge-case files
The Apache Parquet project maintains a repository of reference files used to test Parquet implementations themselves. alltypes_plain.parquet packs 11 columns covering booleans, every integer width, floats, doubles, strings, and timestamps into 8 rows and under 2 KB.
https://github.com/apache/parquet-testing/raw/master/data/alltypes_plain.parquet
The repository's data/ directory contains dozens more files covering compression codecs, encodings, nested types, and deliberately tricky structures. It is the definitive source if you're building or testing a Parquet reader.
9. IMDB reviews — text data from Hugging Face
Hugging Face automatically converts every public dataset on the Hub to Parquet, which makes it the largest collection of downloadable Parquet files on the internet. The IMDB sentiment dataset is a great starting point: 25,000 movie reviews with just two columns, text and label.
https://huggingface.co/datasets/stanfordnlp/imdb/resolve/refs%2Fconvert%2Fparquet/plain_text/train/0000.parquet
Long text columns compress differently than numeric data and render differently in viewers, so this file is worth testing if your workload is NLP-shaped.
10. SQuAD — nested structures in the wild
The Stanford Question Answering Dataset, also via Hugging Face: 87,599 rows with id, title, context, question, and a nested answers struct containing lists. Useful for checking how tools display STRUCT and LIST columns, which trip up plenty of viewers.
https://huggingface.co/api/datasets/rajpurkar/squad/parquet/plain_text/train/0.parquet
For any Hugging Face dataset, the URL pattern huggingface.co/api/datasets/{org}/{name}/parquet returns a JSON listing of every auto-converted Parquet shard, which is a reliable way to find a working download link for nearly any public dataset.
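As a rough sketch of how that listing could be used from a script: the helper names below are hypothetical, and the code assumes the JSON is a mapping of config name to split name to a list of shard URLs. That shape is an assumption here, so inspect the actual response for your dataset before relying on it.

```python
import json
from urllib.request import urlopen

HF_API = "https://huggingface.co/api/datasets"


def parquet_listing_url(org: str, name: str) -> str:
    """Build the listing URL from the pattern described above."""
    return f"{HF_API}/{org}/{name}/parquet"


def shard_urls(listing: dict) -> list:
    """Flatten a listing into a plain list of shard URLs.

    ASSUMPTION: listing looks like {config: {split: [url, ...]}}.
    """
    return [
        url
        for splits in listing.values()
        for urls in splits.values()
        for url in urls
    ]


def fetch_shard_urls(org: str, name: str, timeout: float = 30.0) -> list:
    """Download the listing (needs network access) and return every shard URL."""
    with urlopen(parquet_listing_url(org, name), timeout=timeout) as resp:
        return shard_urls(json.load(resp))


if __name__ == "__main__":
    for url in fetch_shard_urls("rajpurkar", "squad"):
        print(url)
```

Keeping the flattening step separate from the network call makes it easy to adjust if the response turns out to be shaped differently.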
11. The official GeoParquet example
The OGC GeoParquet specification repository ships a tiny reference file: 5 country polygons with population, continent, and GDP columns plus a WKB geometry column. At 28 KB it's the quickest way to see what a spec-compliant GeoParquet file looks like.
https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet
One honest caveat from our verification: this file uses the newer Parquet GEOMETRY logical type, which older readers (for example, pyarrow before v19) reject. DuckDB, and therefore viewparquet, reads it fine.
How we verified these links
For transparency, the exact procedure run against every URL on June 12, 2026:
- Download the full file with curl -L --fail (following redirects, rejecting HTTP errors).
- Confirm the file starts and ends with the PAR1 magic bytes rather than an HTML error page or LFS pointer.
- Open it with DuckDB and run DESCRIBE to read the schema from the footer.
- Run SELECT COUNT(*) and decode sample rows to prove the data pages are readable.
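The magic-byte step can be approximated with a few lines of standard-library Python. This is a minimal sketch rather than the exact script used for this page: the function name parquet_footer_length is hypothetical, and it checks only the file framing (leading PAR1, trailing PAR1, and the 4-byte little-endian footer length stored just before the trailing magic). It does not replace the DuckDB steps, which are what prove the schema and data pages are readable.

```python
import struct
from pathlib import Path

MAGIC = b"PAR1"  # 4-byte marker at both ends of a Parquet file


def parquet_footer_length(path) -> int:
    """Return the footer metadata length in bytes, or raise ValueError.

    Hypothetical helper: it validates framing only, not the contents.
    """
    size = Path(path).stat().st_size
    if size < 12:  # 4 (magic) + 4 (footer length) + 4 (magic)
        raise ValueError(f"{path}: only {size} bytes, too small for Parquet")
    with open(path, "rb") as f:
        head = f.read(64)
        f.seek(-8, 2)  # 2 = seek relative to the end of the file
        tail = f.read(8)
    if head[:4] != MAGIC:
        # Two common failure modes named in the list above.
        if head.startswith(b"version https://git-lfs"):
            raise ValueError(f"{path}: Git LFS pointer, not the data itself")
        if head.lstrip()[:1] == b"<":
            raise ValueError(f"{path}: looks like an HTML or XML page")
        raise ValueError(f"{path}: missing leading PAR1 magic")
    if tail[4:] != MAGIC:
        raise ValueError(f"{path}: missing trailing PAR1 magic (truncated?)")
    (footer_len,) = struct.unpack("<I", tail[:4])
    if footer_len + 12 > size:
        raise ValueError(f"{path}: footer length {footer_len} exceeds file size")
    return footer_len
```

After a download such as curl -L --fail -o file.parquet <url>, calling parquet_footer_length("file.parquet") either returns the footer size or explains why the file is not a Parquet file.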
Links rot, hosts change, and datasets get reorganized — if you hit a dead link, the Hugging Face API pattern and the NYC TLC monthly URL pattern above are the most future-proof sources on this list.
Opening these files without writing code
Every file above works the same way in viewparquet: download it, then drag it into the viewer at viewparquet.com. The file is registered with DuckDB-WASM inside your browser tab — nothing is uploaded to any server — and you get a virtualized grid, full DuckDB SQL, schema and metadata inspection, and export to Parquet, CSV, or JSON.
If you just want to poke at a Parquet file right now, the Try a sample file button on our homepage loads the flights dataset from this list with one click. No download, no signup, no Python environment.