Parquet FAQ

    How to work with Parquet files

    Straight answers to the questions people actually search for while working with Apache Parquet: opening files, reading the schema, running SQL, inspecting metadata, and converting formats. You can try most of these in viewparquet, a private Parquet viewer and SQL workbench that runs entirely in your browser.

    Opening & viewing

    How do I open a Parquet file without Python, Spark, or pandas?

    Open the file in a Parquet viewer that runs in your browser. Drag a .parquet file into viewparquet and it loads with DuckDB-WASM, with no Python, pandas, Spark, or other install required. You can browse rows, read the schema, and run SQL right away.

    Command-line alternatives also exist (parquet-tools, the pq CLI, or the DuckDB CLI), but they require an install and a terminal. A browser viewer is the fastest path for a quick look at an unfamiliar file.

    How can I view a Parquet file online?

    Use a browser-based Parquet viewer such as viewparquet. You drop the file into the page and it is parsed locally in your browser — the data is never uploaded to a server, so even online viewing stays private.

    This is useful when you receive a .parquet file and just want to confirm its contents, column names, and row count before pulling it into a heavier tool.

    How do I open a very large Parquet file?

    Use a tool that reads Parquet column-by-column and paginates instead of loading every row into memory. viewparquet streams results and pages through them, so file size is limited mainly by your browser's available memory rather than a fixed cap.

    Because Parquet is columnar, you rarely need the whole file — query only the columns and rows you care about with SQL (for example `SELECT a, b FROM file LIMIT 1000`) to keep memory low on large datasets.

    How do I preview just the first few rows of a Parquet file?

    Run a `LIMIT` query such as `SELECT * FROM read_parquet('file.parquet') LIMIT 10`. In viewparquet the grid already paginates, so opening a file shows the first page of rows immediately without scanning the whole dataset.

    Querying with SQL

    How do I run SQL on a Parquet file?

    Query it with DuckDB, which can read Parquet directly without a separate import step. In viewparquet the loaded file is exposed as a table, so you can write `SELECT … FROM <table> WHERE …` in the SQL editor and get results in the grid. Full DuckDB SQL is supported, including joins, aggregates, and window functions.

    In DuckDB SQL you can also reference a file by path, e.g. `SELECT * FROM read_parquet('data.parquet')`, and combine multiple files with globs like `read_parquet('data/*.parquet')`.

    Can I query a Parquet file without loading it into a database first?

    Yes. Engines like DuckDB query Parquet in place, reading only the columns and row groups a query needs thanks to the file footer metadata. There is no ETL or table-creation step — you point a SQL query at the file and it scans on demand.

    How do I count the rows in a Parquet file quickly?

    Run `SELECT COUNT(*) FROM read_parquet('file.parquet')`. Parquet stores the row count per row group in its footer, so a count returns almost instantly without scanning the actual data pages.

    Schema & data types

    How do I see the schema and column names of a Parquet file?

    Use `DESCRIBE SELECT * FROM read_parquet('file.parquet')` in DuckDB, or open the file in viewparquet and check the schema panel. Parquet is self-describing — column names, types, and nullability are stored in the file footer, so the schema is available without scanning the data.

    Command-line equivalents include `parquet-tools inspect file.parquet`, `pq schema file.parquet`, and PyArrow’s `pyarrow.parquet.read_schema('file.parquet')`.

    How do I read nested, list, or struct columns in Parquet?

    Parquet supports nested types (structs, lists, and maps) natively, and DuckDB can query into them with dot and bracket notation — for example `SELECT col.field`, `col[1]`, or `UNNEST(list_col)`. In viewparquet nested columns render in the grid and can be expanded or flattened with SQL.

    Use `UNNEST` to explode a list column into rows, and dotted paths to project a single field out of a struct so you can filter or aggregate on it.

    Why do Parquet timestamps or decimals look wrong in some tools?

    Parquet stores logical types (timestamp, decimal, date) on top of physical types (INT32, INT64, FIXED_LEN_BYTE_ARRAY, BYTE_ARRAY). When a reader ignores the logical type it shows the raw physical value: a timestamp appears as a large integer, a date as a count of days since 1970-01-01, and a decimal as an unscaled integer or raw bytes. Use a reader that honors logical types, such as DuckDB or Arrow, to display the correct value.

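    Timestamp precision (milliseconds vs microseconds vs nanoseconds) and the flag that says whether values are normalized to UTC are also stored as logical-type annotations. Readers differ in which precisions they support and how they treat that flag, which is why the same column can look different across pandas, Spark, and Athena.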
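To make the raw-versus-logical distinction concrete, here is a minimal Python sketch of what a reader does when it honors the annotation. The helper names (`decode_timestamp`, `decode_date`, `decode_decimal`) are hypothetical and not part of any library, and the timestamp helper assumes the column is marked as UTC-adjusted.

```python
from datetime import date, datetime, timedelta, timezone
from decimal import Decimal

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Ticks per second for each Parquet TIMESTAMP unit.
TICKS_PER_SECOND = {"MILLIS": 1_000, "MICROS": 1_000_000, "NANOS": 1_000_000_000}


def decode_timestamp(raw, unit):
    """Convert a raw INT64 timestamp to a UTC datetime.

    Assumes a UTC-adjusted column. Python datetimes hold microseconds,
    so any finer NANOS digits are dropped.
    """
    micros = raw * 1_000_000 // TICKS_PER_SECOND[unit]
    return EPOCH + timedelta(microseconds=micros)


def decode_date(raw_days):
    """Convert a raw INT32 DATE (days since 1970-01-01) to a date."""
    return date(1970, 1, 1) + timedelta(days=raw_days)


def decode_decimal(raw, scale):
    """Convert an unscaled DECIMAL value to a Decimal.

    `raw` is an int (INT32/INT64 storage) or big-endian two's-complement
    bytes (BYTE_ARRAY / FIXED_LEN_BYTE_ARRAY storage).
    """
    if isinstance(raw, (bytes, bytearray)):
        raw = int.from_bytes(raw, byteorder="big", signed=True)
    sign = 1 if raw < 0 else 0
    digits = tuple(int(ch) for ch in str(abs(raw)))
    # Building from a tuple keeps every digit, regardless of context precision.
    return Decimal((sign, digits, -scale))


if __name__ == "__main__":
    print(decode_timestamp(1_700_000_000_000, "MILLIS"))
    print(decode_date(19675))
    print(decode_decimal(12345, 2))
```

The same integer `1700000000000` is a 2023 timestamp when read as milliseconds but a date in early 1970 when read as microseconds, which is the kind of mismatch that shows up when two tools disagree about the unit.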

    Metadata & debugging

    How do I inspect Parquet metadata like row groups and compression?

    Parquet keeps file-level and row-group metadata in its footer: number of rows, number of row groups, per-column compression codec, encodings, and min/max statistics. Open the metadata panel in viewparquet, or use CLI tools — `parquet-tools inspect`, `pq inspect`, or `pyarrow.parquet.read_metadata` — to read it without scanning the data.

    Row-group statistics (min, max, null count) are what let query engines skip data; inspecting them helps you understand why a query is fast or slow.

    My Parquet file won’t open — what are the common causes?

    The most common causes are: the file is truncated or still being written (the footer at the end is missing), it is actually a folder of part-files rather than a single file, it uses a compression codec your reader lacks (e.g. LZ4, ZSTD, Brotli), or it is not really Parquet. Confirm the file is complete and try a reader with broad codec support like DuckDB or Arrow.

    A Parquet file starts with the 4-byte magic number "PAR1" and ends with the footer metadata, a 4-byte footer length, and a second "PAR1". If a write was interrupted, that trailing footer is missing and readers report a corrupt or invalid file.
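A quick way to tell "truncated" from "not Parquet at all" is to look only at the first and last few bytes. The following is a rough standard-library sketch, not a validator: `check_parquet_framing` is a hypothetical helper name, it does not parse the footer itself, and it assumes a plain footer (files written with an encrypted footer use a different magic, "PARE").

```python
import struct
from pathlib import Path

MAGIC = b"PAR1"  # magic bytes for files with a plain (unencrypted) footer


def check_parquet_framing(path):
    """Return (ok, reason) based only on the file's leading and trailing bytes."""
    size = Path(path).stat().st_size
    # Smallest possible layout: leading magic + 4-byte footer length + trailing magic.
    if size < 12:
        return False, "file is too short to be Parquet"
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, 2)  # last 8 bytes: footer length, then trailing magic
        tail = f.read(8)
    if head != MAGIC:
        return False, "missing leading PAR1 magic (probably not Parquet)"
    if tail[4:] != MAGIC:
        return False, "missing trailing PAR1 magic (truncated or still being written)"
    # The footer length is stored as a little-endian unsigned 32-bit integer.
    (footer_len,) = struct.unpack("<I", tail[:4])
    if footer_len > size - 12:
        return False, "footer length exceeds file size (corrupt footer)"
    return True, f"framing looks valid, footer is {footer_len} bytes"


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        print(name, check_parquet_framing(name))
```

A file that passes this check can still be unreadable for other reasons, such as an unsupported codec, so treat a pass as "worth trying in a real reader" rather than proof the file is healthy.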

    Spark and Hive often write a directory such as `data.parquet/` containing many `part-*.parquet` files — point your tool at the individual part file or use a glob like `data.parquet/*.parquet`.
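If you are unsure whether a path is a single file or one of these directories, a small helper can resolve it to the list of files to read. This is a sketch using only the standard library. `parquet_inputs` is a hypothetical name, and it assumes a flat directory: partitioned layouts with nested `key=value/` folders would need a recursive search instead.

```python
from pathlib import Path


def parquet_inputs(path):
    """Return the Parquet files to read for `path` (a single file or a directory)."""
    p = Path(path)
    if p.is_file():
        return [p]
    if p.is_dir():
        # Keep only *.parquet files; skip names starting with "_" or ".",
        # which writers commonly use for markers and hidden files.
        return sorted(
            f
            for f in p.glob("*.parquet")
            if f.is_file() and not f.name.startswith(("_", "."))
        )
    raise FileNotFoundError(f"no such file or directory: {path}")


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        for part in parquet_inputs(name):
            print(part)
```

An empty list for a directory usually means the write never finished or the part files use a different extension.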

    What is a Parquet row group and what size should it be?

    A row group is a horizontal slice of the table stored together, and it is the unit query engines read and skip. A common target is 128 MB–512 MB (or roughly 100k–1M rows) per row group, balancing read parallelism against the per-row-group metadata overhead of having too many tiny groups.

    Too many small files or tiny row groups (the "small files problem") hurt performance because engines spend more time on metadata than data. Compacting them into larger files helps.
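The size and row-count targets are two views of the same number, linked by the average row width. A back-of-the-envelope sketch, assuming you have a rough estimate of bytes per row (the function name and the 128 MiB default are illustrative, and whether the target applies to compressed or uncompressed bytes depends on the writer):

```python
import math

MIB = 1024 * 1024


def plan_row_groups(total_rows, avg_row_bytes, target_bytes=128 * MIB):
    """Estimate rows per row group and the number of row groups for a size target."""
    rows_per_group = max(1, target_bytes // avg_row_bytes)
    n_groups = math.ceil(total_rows / rows_per_group)
    return rows_per_group, n_groups


if __name__ == "__main__":
    # 10 million rows at roughly 256 bytes each, 128 MiB target.
    print(plan_row_groups(10_000_000, 256))
```

With 256-byte rows a 128 MiB target works out to 524,288 rows per group, which sits inside the 100k to 1M row range quoted above. Much wider rows push the row count down, so it is worth checking both numbers for very wide tables.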

    Converting & exporting

    How do I convert a CSV (or JSON) file to Parquet?

    With DuckDB it is a single statement: `COPY (SELECT * FROM read_csv_auto('data.csv')) TO 'data.parquet' (FORMAT PARQUET)`. For JSON, swap in `read_json_auto('data.json')`. In viewparquet you can load a CSV, TSV, JSON, or JSON Lines file and export the result as Parquet directly from the browser.

    CLI tools such as `pq convert data.csv -o data.parquet` and `parquet-tools import` do the same conversion from a terminal.

    How do I export SQL query results as a Parquet or CSV file?

    Run your query, then export the result set. In viewparquet the results grid can be exported to Parquet or CSV. With the DuckDB CLI, wrap the query in `COPY (…) TO 'out.parquet' (FORMAT PARQUET)` for Parquet, or `COPY (…) TO 'out.csv' (FORMAT CSV, HEADER)` for CSV.

    Parquet vs CSV: when should I use which?

    Use Parquet for analytics and storage of large or wide datasets, and CSV for small, human-readable, interchange data. Parquet is columnar, compressed, and self-describing (it keeps types and statistics), so it is usually much smaller and faster to query; CSV is plain text with no types, and every query has to scan and parse the whole file.

    A practical workflow: keep raw exports as CSV/JSON, convert to Parquet for repeated querying, and inspect either format the same way in a viewer before trusting it.

    How do I read a Parquet file from S3 or cloud storage?

    With DuckDB you can query cloud Parquet directly using the httpfs extension: `SELECT * FROM read_parquet('s3://bucket/data.parquet')` after setting your credentials. CLI tools like the pq utility and parquet-tools also accept `s3://`, `gs://`, and `az://` paths and read only the bytes they need.

    viewparquet is private-first and runs fully client-side today, so files are loaded from your device; direct cloud-storage connectors (S3, GCS, and more) are on the roadmap.

    Privacy & limits

    How can I analyze a sensitive Parquet file without uploading it anywhere?

    Use a viewer that processes the file locally in your browser. viewparquet runs entirely client-side with DuckDB-WASM, so opening, querying, and exporting all happen on your device and the data never leaves the browser. You can confirm this in the Network panel of your browser's developer tools, which should show no upload requests while you work.

    How do I spot-check a Parquet training dataset before a run?

    Open the shard in a viewer and check the things that break training: schema and column types, null counts, row count, and the shape of embedding or token columns. viewparquet lets you eyeball rows and run SQL (distincts, null checks, length of array columns) on each Parquet shard locally before kicking off a fine-tune or eval.

    Try it on your own file

    Drop a Parquet, GeoParquet, CSV, or JSON file into viewparquet and inspect the schema, run DuckDB SQL, and export results — all locally in your browser, nothing uploaded.