Open Big Parquet Files Without a Worry: Why Your Browser Can Now Handle Gigabytes
Large Parquet files used to crash in-browser viewers with out-of-memory errors. Here's the engineering change (zero-copy streaming instead of materialized tables) that lets viewparquet open multi-gigabyte files instantly, and what it means for your workflow.
You found the dataset you need. It's a Parquet file — maybe 500 MB, maybe 3 GB. You drag it into a viewer, the progress bar crawls, the fan spins up, and then: a frozen tab, a cryptic out-of-memory error, or a browser crash. So you give up and write yet another throwaway Python script just to peek at twenty rows.
We've been there, and so have our users. For a while, viewparquet had the same ceiling every in-browser tool hits: files that were perfectly reasonable on disk would blow past the browser's memory limit the moment you opened them. We recently shipped a change that removes that ceiling for Parquet files — and the reason it works says a lot about why Parquet is such a well-designed format.
This post explains, in plain terms first and engineering terms second, why big Parquet files used to crash browser tools, and what we changed so you can open them without a worry.
A 400 MB Parquet file is not really 400 MB
Here's the part most people never see: Parquet is a compressed, encoded format. The number you see in your file manager is the size after compression codecs like Snappy or ZSTD have done their work, and after dictionary encoding has replaced repeated strings with small integers.
The moment a tool decodes that file into memory — into actual rows and columns it can show you — the data expands. How much depends on your data, but 5–10x is common, and text-heavy or highly repetitive datasets can expand much further.
| On disk (Parquet) | Decoded in memory (typical) |
|---|---|
| 100 MB | 0.5 – 1 GB |
| 400 MB | 2 – 4 GB |
| 1 GB | 5 – 10 GB |
| 3 GB | 15 – 30 GB (far beyond any browser limit) |
So when a viewer "loads" your 400 MB file by decoding all of it, it isn't holding 400 MB. It's holding gigabytes.
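If you have DuckDB at hand, the file's own footer gives a rough sense of this expansion. The sketch below assumes a file named your_file.parquet (a placeholder) and uses DuckDB's parquet_metadata function, which returns one row per column chunk. Keep in mind that total_uncompressed_size measures the data after decompression but before decoding, so dictionary-encoded columns grow further once they become real rows. Treat the ratio as a lower bound, not a prediction.

```sql
-- Compare on-disk (compressed) bytes with decompressed bytes, summed over
-- every column chunk in the file. 'your_file.parquet' is a placeholder name.
SELECT
    SUM(total_compressed_size)   AS compressed_bytes,
    SUM(total_uncompressed_size) AS uncompressed_bytes,
    ROUND(
        SUM(total_uncompressed_size)::DOUBLE / SUM(total_compressed_size),
        1
    ) AS expansion_ratio  -- lower bound: decoding adds more on top
FROM parquet_metadata('your_file.parquet');
```

This reads only the footer, so it is cheap to run even on a file that would be too large to decode in full.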
The browser's hidden ceiling
viewparquet runs DuckDB — a genuinely excellent analytical database — compiled to WebAssembly, entirely inside your browser tab. That's what makes it private: your file never leaves your machine.
But WebAssembly comes with a hard constraint: a 32-bit address space, which caps addressable memory at 4 GB per WebAssembly instance, and in practice a tab often gets somewhat less. And unlike a database running on your laptop's operating system, a browser tab has no disk to spill to when memory runs low. There's no temp directory, no swap file. When the budget is gone, it's gone.
Put those two facts together and the old failure mode is obvious:
- You open a 400 MB Parquet file.
- The viewer decodes the entire file into an in-memory table — several gigabytes.
- The WebAssembly memory ceiling (~3–4 GB) arrives long before the decode finishes.
- Crash. Frozen tab. "Out of Memory Error."
This is exactly the error our monitoring caught real users hitting: failed to allocate data of size 32.0 KiB (3.1 GiB/3.1 GiB used). The database wanted 32 more kilobytes and there were none left. That's a brutal experience for someone who just wanted to look at their data — and it's the bug we set out to kill.
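For readers who want to see the budget the engine is working with, DuckDB exposes its configuration through current_setting. The two statements below are a minimal sketch. The setting names are DuckDB's own, but what they report inside a browser build is not something this post states, so treat the output as something to inspect, not a documented value.

```sql
-- The memory budget this DuckDB instance is configured with.
SELECT current_setting('memory_limit') AS memory_limit;

-- The directory DuckDB would spill to if the environment offered one.
SELECT current_setting('temp_directory') AS temp_directory;
```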
The fix: stop copying, start streaming
The old pipeline did the obvious thing, which turned out to be the wrong thing:
-- Old approach: decode EVERYTHING into memory up front
CREATE TABLE data AS SELECT * FROM read_parquet('your_file.parquet');
That one statement forces the entire file (every row, every column) to be decompressed and held in RAM, whether you ever look at it or not. All cost is paid up front, and for big files the bill exceeds the budget.
The new pipeline does almost nothing up front:
-- New approach: a zero-copy window onto the file
CREATE VIEW data AS SELECT * FROM read_parquet('your_file.parquet');
A view stores no data. It's a saved query: a window onto the file. Your Parquet file stays exactly where it is (registered with the database as a readable file handle, never uploaded anywhere), and DuckDB reads only the byte ranges a query actually needs, at the moment it needs them.
When you open a file now, viewparquet decodes just the first page of rows for the grid — typically a few megabytes of work — no matter how big the file is. Scroll down? It reads the next slice. Run a query? It reads only what that query touches. Memory usage now scales with what you ask, not with how big the file is.
Why Parquet makes this possible
This trick doesn't work on every format, and it's worth understanding why it works brilliantly on Parquet. Three design decisions in the format do the heavy lifting:
- Row groups: A Parquet file is split into independent horizontal chunks (row groups), each decodable on its own. Showing rows 1–100 means decoding one row group — not the file.
- Column chunks: Within each row group, every column is stored separately. If your query touches 3 of 80 columns, the other 77 are never read off disk. This is called projection pushdown.
- Footer metadata: The file ends with an index describing every row group — row counts, byte offsets, and min/max statistics per column. DuckDB reads this small footer first and can answer some questions (like total row count) without touching the data at all, and can skip entire row groups whose statistics rule them out (predicate pushdown).
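The three structures above can be inspected directly. This is a small sketch assuming a placeholder file name and a recent DuckDB version that provides both parquet_file_metadata and parquet_metadata. It shows what the footer holds; it is not a view into how viewparquet itself reads it.

```sql
-- Footer summary: total rows and number of row groups, from metadata only.
SELECT num_rows, num_row_groups
FROM parquet_file_metadata('your_file.parquet');

-- One row per column chunk: the row group it belongs to, how many rows that
-- group holds, and the min/max statistics used for row-group skipping.
SELECT
    row_group_id,
    path_in_schema AS column_name,
    row_group_num_rows,
    stats_min_value,
    stats_max_value,
    total_compressed_size
FROM parquet_metadata('your_file.parquet')
ORDER BY row_group_id, column_id;
```

If a column's min/max ranges barely overlap from one row group to the next, filters on that column are good candidates for skipping whole row groups.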
-- What the grid actually runs when you open a big file:
SELECT * FROM data LIMIT 100 OFFSET 0;
-- Bytes read: one row group's worth of the visible columns. That's it.
-- Row count for the footer of the UI — answered from metadata, near-instant:
SELECT COUNT(*) FROM data;
CSV, by contrast, has none of this. There's no index, no row groups, no column separation, so you can't know where row 5,000,000 starts without reading everything before it. That's why CSV files still get fully parsed on load (and why, if you work with big data, Parquet is the format worth standardizing on).
What changed, in one table
| Aspect | Before | After |
|---|---|---|
| On open | Decode entire file into memory | Read footer metadata only |
| Memory used | Proportional to whole file (5–10x file size) | Proportional to current query |
| 1 GB file | Out-of-memory crash | Opens in moments |
| First rows visible | After full decode (if it survived) | Almost immediately |
| Search and SQL | Against the in-memory copy | Streamed against the file, with pushdown |
| Privacy | 100% local | Still 100% local — nothing changed here |
We verified the new path end-to-end: large files open instantly, full-file search finds the last row of the dataset, custom SQL runs against the live file, and the database catalog confirms there's no hidden copy — just the view.
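One way to run that kind of catalog check yourself is sketched below. It assumes the view is named data, as in the statements earlier in this post, and uses DuckDB's information_schema and duckdb_views. It is not necessarily the exact check we ran.

```sql
-- Expect table_type = 'VIEW' for data.
-- 'BASE TABLE' would mean a materialized copy exists.
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_name = 'data';

-- The stored definition of each user-defined view.
SELECT view_name, sql
FROM duckdb_views()
WHERE NOT internal;
```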
What this means for you, practically
- Open files that used to crash. Multi-gigabyte Parquet files are now routine. The size on disk is no longer the thing that matters.
- First rows appear fast. No more waiting for a full decode before you see anything. The grid renders while heavier statistics compute in the background.
- Search and filter the whole file. Queries stream through the file with column and row-group skipping, so even full scans stay within memory budget.
- Your data still never leaves your machine. The streaming happens between the browser tab and your local file. No upload, no server, no exceptions.
Honest limits (because every engineering choice has them)
Streaming makes memory proportional to the query — so a query that genuinely needs everything at once can still be heavy. Two examples worth knowing:
- Sorting an enormous result set requires holding the candidates in memory. Sorting the grid view is fine (it's a top-N query under the hood); exporting a multi-gigabyte file fully sorted may not be.
- CSV and JSON files are still decoded up front, because those formats don't support random access. Very large CSVs can still hit the ceiling — converting them to Parquet (which you can do with one SQL statement in our editor) removes it.
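Both limits are easier to see in SQL. In the sketch below, event_time is a hypothetical column and the file names are placeholders. Whether the editor accepts COPY in exactly this form is an assumption; the statements themselves are standard DuckDB.

```sql
-- Top-N: only the best 100 candidates need to be kept while scanning.
SELECT * FROM data ORDER BY event_time DESC LIMIT 100;

-- Full sorted export: every row takes part in the sort,
-- so memory can become the limit on very large files.
COPY (SELECT * FROM data ORDER BY event_time DESC)
TO 'sorted.parquet' (FORMAT PARQUET);

-- CSV to Parquet in one statement.
COPY (SELECT * FROM read_csv_auto('your_file.csv'))
TO 'your_file.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
```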
We'd rather tell you where the edges are than pretend there are none.
Tips for working with very large files
- Prefer Parquet over CSV for anything above a few hundred megabytes: it's smaller on disk and far friendlier to query engines.
- Select the columns you need in SQL instead of `SELECT *`: fewer column chunks read, faster results.
- Filter early. A `WHERE` clause on a column with natural ordering (dates, IDs) lets row-group statistics skip huge swaths of the file.
- Use `LIMIT` while exploring. You rarely need a million rows to understand a dataset.
- If a producer system gives you giant single files, ask for ZSTD compression and ~128 MB row groups; both make remote and streamed reads more efficient for every tool downstream.
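Put together, the first tips might look like the query below. The column names (user_id, event_time, amount) are hypothetical, and the second statement is only a sketch of how a producer could rewrite a file with DuckDB. Note that DuckDB's ROW_GROUP_SIZE option counts rows, not bytes, so the value that lands near 128 MB depends on how wide your rows are.

```sql
-- Named columns, an early filter on an ordered column, and a LIMIT.
SELECT user_id, event_time, amount
FROM data
WHERE event_time >= TIMESTAMP '2024-01-01 00:00:00'
  AND event_time <  TIMESTAMP '2024-02-01 00:00:00'
LIMIT 1000;

-- Rewrite a file with ZSTD compression and larger row groups.
-- 1000000 rows per group is an illustrative value, not a recommendation.
COPY (SELECT * FROM read_parquet('giant_input.parquet'))
TO 'rewritten.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 1000000);
```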
The bigger picture
"Open a big file in the browser" sounds like a small feature. Under the hood it's the same architecture shift the whole data industry has made over the past decade: don't move and copy data to query it — bring a smart engine to where the data already sits, and read only what the question requires. Parquet was designed for exactly this, DuckDB executes it beautifully, and your browser turns out to be a perfectly good place for both.
So the next time someone hands you a 2 GB Parquet file, don't reach for a cluster or a notebook. Drag it into viewparquet. It'll open. Without a worry.