Accelerating GIS Analytics with DuckDB and GeoParquet
Learn how DuckDB's in-process analytical engine paired with GeoParquet creates a powerful, serverless stack for high-performance geospatial analytics without traditional database overhead.
The emergence of GeoParquet solved the problem of efficient geospatial data storage. The next piece of the puzzle is a query engine that can fully exploit its advantages without the complexity and overhead of traditional database systems. Enter DuckDB, an in-process analytical database that, when paired with GeoParquet, creates a powerful, lightweight, and serverless stack for high-performance GIS analytics.
Introduction to DuckDB: The "SQLite for Analytics"
DuckDB is often described as the "SQLite for analytics," and the analogy is apt. Like SQLite, it is an in-process database, meaning it runs directly within your application as a library, not as a separate server process you need to install, manage, and connect to. However, where SQLite is optimized for transactional workloads (OLTP), DuckDB is purpose-built from the ground up for analytical query workloads (OLAP).
For geospatial users, DuckDB's true power is unlocked via its spatial extension:
\\\sql
INSTALL spatial;
LOAD spatial;
\\\
This provides access to over 150 spatial functions with familiar PostGIS-style syntax, while internally bundling essential GIS libraries like GEOS, GDAL, and PROJ.
The Modern Analytics Workflow: Querying GeoParquet Directly
The combination of DuckDB and GeoParquet enables a paradigm shift away from traditional ETL workflows. DuckDB can query GeoParquet files directly, without an explicit import step. This is enhanced by the httpfs extension, allowing queries on remote files via HTTP range requests.
- Projection Pushdown: Reading only required columns
- Filter Pushdown: Using Parquet metadata to skip irrelevant data chunks
Tutorial: High-Performance Spatial Analysis with DuckDB SQL
Setup: Installing and Loading Extensions
\\\`sql
-- Install extensions (one-time setup)
INSTALL spatial;
INSTALL httpfs;
-- Load extensions (per session) LOAD spatial; LOAD httpfs;
-- Optional S3 optimization
SET s3_region='us-west-2';
\\\`
Example 1: Bounding Box Filter on Remote Data
\\\sql
-- Find buildings in downtown Seattle from remote GeoParquet
CREATE TABLE seattle_buildings AS
SELECT
id,
height,
ST_GeomFromWKB(geometry) AS geom
FROM
read_parquet('s3://overturemaps-us-west-2/release/*/theme=buildings/*.parquet')
WHERE
bbox.xmin > -122.34 AND bbox.xmax < -122.33
AND bbox.ymin > 47.60 AND bbox.ymax < 47.61;
\\\
Example 2: Spatial Join Operations
\\\`sql
-- Point-in-polygon: schools within neighborhoods
CREATE OR REPLACE TABLE neighborhoods AS
SELECT * FROM ST_Read('neighborhoods.geoparquet');
CREATE OR REPLACE TABLE schools AS SELECT * FROM ST_Read('schools.geoparquet');
SELECT
s.school_name,
n.name AS neighborhood_name
FROM
schools AS s
JOIN
neighborhoods AS n ON ST_Intersects(s.geom, n.geom);
\\\`
Essential Spatial Functions
| Function | Description | Example Usage |
|---|---|---|
| ST_Read() | Read geospatial files | \CREATE TABLE t AS SELECT * FROM ST_Read('data.gpkg');\ |
| ST_Point() | Create point geometry | \ST_Point(-122.4, 47.6)\ |
| ST_GeomFromWKB() | Convert WKB to geometry | \ST_GeomFromWKB(geometry_blob)\ |
| ST_Intersects() | Spatial intersection test | \... ON ST_Intersects(a.geom, b.geom)\ |
| ST_DWithin() | Distance-based queries | \WHERE ST_DWithin(a.geom, b.geom, 100)\ |
| ST_Transform() | Reproject geometries | \ST_Transform(geom, 'EPSG:4326', 'EPSG:3857')\ |
Workflow Integration: Pre-Query Validation
- See complete schema with exact column names and types
- Preview sample data to understand content structure
- Verify geometry data presence and format
This pre-flight check eliminates common query errors and accelerates the analytical workflow, positioning inspection tools as essential utilities in the modern analyst's toolkit.
Performance Benefits
- Sub-second queries on multi-gigabyte datasets
- Minimal memory footprint through efficient columnar processing
- Cloud-native scalability with remote file support
- Zero infrastructure requirements for complex spatial analysis
This represents a fundamental shift in how geospatial analytics can be performed, democratizing access to powerful analytical capabilities previously requiring expensive, complex infrastructure.