Modernizing OpenStreetMap Data Handling with GeoParquet
Transform complex OpenStreetMap PBF data into analysis-ready GeoParquet format, making the world's largest collaborative geospatial dataset accessible to standard data tools and workflows.
OpenStreetMap (OSM) is the largest collaborative geospatial dataset in the world, a treasure trove of information about roads, buildings, points of interest, and more, all contributed by volunteers. However, its richness is matched by its complexity. The native data formats of OSM, while optimized for editing and distribution, present a significant hurdle for direct use in standard data analysis workflows.
The Challenge of Raw OpenStreetMap Data
OSM data is primarily distributed in two formats: XML (.osm) and its more compact binary equivalent, PBF (.osm.pbf). While PBF is the de facto standard for large extracts, both formats are deeply nested and relational, designed to capture the structure of OSM edits rather than to support straightforward analysis. Both encode the same three element types:
- Nodes: Points with coordinates (latitude, longitude)
- Ways: Ordered lists of nodes that form polylines or polygons
- Relations: Groups of nodes, ways, and other relations for complex features
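The three element types above can be sketched as plain Python dataclasses. This is a minimal illustration rather than the layout of any particular parser; the class names, the `way_coordinates` helper, and the sample coordinates are all assumptions made for the example. It shows why a way has no geometry of its own until its node references are resolved.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: int
    lat: float
    lon: float
    tags: dict[str, str] = field(default_factory=dict)


@dataclass
class Way:
    id: int
    node_refs: list[int]  # ordered node ids, not coordinates
    tags: dict[str, str] = field(default_factory=dict)


@dataclass
class Relation:
    id: int
    members: list[tuple[str, int, str]]  # (element type, element id, role)
    tags: dict[str, str] = field(default_factory=dict)


def way_coordinates(way: Way, nodes: dict[int, Node]) -> list[tuple[float, float]]:
    """Resolve a way's node references into (lon, lat) pairs."""
    return [(nodes[ref].lon, nodes[ref].lat) for ref in way.node_refs]


# Hypothetical sample data: a closed way (first ref == last ref) forms a ring.
nodes = {
    1: Node(1, 43.73, 7.41),
    2: Node(2, 43.73, 7.42),
    3: Node(3, 43.74, 7.42),
}
building = Way(10, [1, 2, 3, 1], {"building": "yes"})
ring = way_coordinates(building, nodes)
is_closed = building.node_refs[0] == building.node_refs[-1]
```

Every geometry lookup goes through the `nodes` index, which is the indirection that makes raw OSM data awkward to query directly.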
The Traditional Workflow Problem
To answer a seemingly simple question—such as "Count all the restaurants in Berlin"—one cannot simply query the PBF file. The traditional approach requires:
- Using specialized tools like Osmium to parse PBF files
- Loading data into PostgreSQL/PostGIS with osm2pgsql
- Reconstructing geometries from the node/way/relation structure
- Writing complex spatial queries
This represents a significant technical barrier, effectively walling off the data from practitioners comfortable with standard analytical tools.
The GeoParquet Advantage for OSM
Converting OSM data to GeoParquet fundamentally changes this dynamic, transforming the complex, relational PBF structure into a simple, flat, columnar format immediately usable by modern data tools.
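As a rough sketch of what "flat" means here, the snippet below turns one way into a single row with an id, one column per tag, and a geometry. The function names, the `feature_id` column, and the sample coordinates are hypothetical. Two simplifications are assumed: the geometry is written as WKT for readability (GeoParquet files store WKB), and every closed way is treated as a polygon, which real converters decide from tags as well.

```python
def to_wkt(coords, closed):
    """Format (lon, lat) pairs as a WKT POLYGON (closed ring) or LINESTRING."""
    body = ", ".join(f"{lon} {lat}" for lon, lat in coords)
    return f"POLYGON (({body}))" if closed else f"LINESTRING ({body})"


def flatten_way(way_id, node_refs, tags, node_coords):
    """Turn one way into a flat row: id, one column per tag, and a geometry."""
    coords = [node_coords[ref] for ref in node_refs]
    # Simplification: a ring of at least four refs that ends where it starts.
    closed = len(node_refs) >= 4 and node_refs[0] == node_refs[-1]
    return {"feature_id": f"way/{way_id}", **tags, "geometry": to_wkt(coords, closed)}


# Hypothetical sample data: node id -> (lon, lat)
node_coords = {1: (7.41, 43.73), 2: (7.42, 43.73), 3: (7.42, 43.74)}
row = flatten_way(10, [1, 2, 3, 1], {"building": "yes"}, node_coords)
```

Once every feature is a row like this, a filter on `building` or on the geometry is an ordinary column operation.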
Key Advantages
Direct Queryability: Once in GeoParquet format, data can be queried directly using standard SQL with tools like DuckDB or Spark SQL, eliminating the need for specialized spatial databases.
Enhanced Performance: The columnar layout allows for highly efficient filtering. Queries that filter by OSM tags or spatial bounding boxes can take advantage of column pruning and predicate pushdown.
Simplified Portability: A single .geoparquet file is far easier to manage, share, and use than PBF files requiring specialized toolchains.
Cloud-Native Scalability: Planet-scale OSM extracts can be converted to partitioned GeoParquet datasets and stored in cloud object storage for scalable analysis.
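The column pruning and predicate pushdown mentioned above can be illustrated with a toy model in pure Python. It is not how any Parquet reader is implemented; the `row_groups` structure, its statistics, and the `scan` function are assumptions that mimic the idea: per-group min/max statistics let a reader skip whole groups, and only the requested columns are materialized.

```python
# Each "row group" carries min/max statistics for its coordinates, similar
# to what a Parquet reader finds in file metadata. Values are hypothetical.
row_groups = [
    {"stats": {"xmin": 7.40, "xmax": 7.42, "ymin": 43.72, "ymax": 43.74},
     "rows": [{"id": 1, "building": "yes", "x": 7.41, "y": 43.73}]},
    {"stats": {"xmin": 13.30, "xmax": 13.50, "ymin": 52.40, "ymax": 52.60},
     "rows": [{"id": 2, "building": "house", "x": 13.40, "y": 52.52}]},
]


def overlaps(stats, bbox):
    """True if a row group's bounding statistics intersect the query bbox."""
    xmin, ymin, xmax, ymax = bbox
    return not (stats["xmax"] < xmin or stats["xmin"] > xmax
                or stats["ymax"] < ymin or stats["ymin"] > ymax)


def scan(row_groups, bbox, columns):
    """Skip non-overlapping row groups, then return only the requested columns."""
    xmin, ymin, xmax, ymax = bbox
    skipped, out = 0, []
    for group in row_groups:
        if not overlaps(group["stats"], bbox):
            skipped += 1  # predicate pushdown: these rows are never read
            continue
        for row in group["rows"]:
            if xmin <= row["x"] <= xmax and ymin <= row["y"] <= ymax:
                out.append({c: row[c] for c in columns})  # column pruning
    return out, skipped


result, skipped = scan(row_groups, (7.405, 43.725, 7.430, 43.745), ["id", "building"])
```

With a Monaco-sized query box, the Berlin row group is skipped without touching its rows, which is the effect that makes bounding-box filters cheap on large files.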
A Modern OSM Data Pipeline in Practice
Step 1: Obtain an OSM PBF Extract
Download extracts from reliable sources like Geofabrik, which provides regularly updated extracts for continents, countries, and regions.
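A download step might look like the sketch below, which uses only the standard library. The helper names are hypothetical, and the URL pattern (`<region path>-latest.osm.pbf` under `download.geofabrik.de`) is an assumption based on how Geofabrik's download server is laid out; check the site for the exact path of a given region.

```python
import shutil
import urllib.request
from pathlib import Path

GEOFABRIK_BASE = "https://download.geofabrik.de"


def geofabrik_url(region_path: str) -> str:
    """Build the URL of the latest extract for a path such as 'europe/monaco'."""
    return f"{GEOFABRIK_BASE}/{region_path.strip('/')}-latest.osm.pbf"


def download_extract(region_path: str, dest_dir: str = ".") -> Path:
    """Stream the extract to disk and return the local path."""
    url = geofabrik_url(region_path)
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


url = geofabrik_url("europe/monaco")
# To fetch the file: pbf_path = download_extract("europe/monaco")
```

Streaming with `shutil.copyfileobj` keeps memory use flat, which matters for country-sized extracts.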
Step 2: Choose a Conversion Tool
Several powerful open-source tools handle PBF-to-GeoParquet conversion:
- Excellent for Python data science workflows
- Simple command-line interface
- Built-in filtering capabilities
- Powerful for full OSM history files
- Can enrich data with changeset metadata
- Enterprise-grade performance
- Optimized for raw transcoding speed
- Developed by Overture Maps team
- Ideal for large-scale processing
Step 3: Perform the Conversion
Example using QuackOSM to extract Monaco buildings:
```python
import quackosm as qosm
import geopandas as gpd

# PBF source: a local file path also works; a URL is assumed to be
# downloaded by QuackOSM before conversion
pbf_path = "https://download.geofabrik.de/europe/monaco-latest.osm.pbf"

# Define filter for buildings only
tags_filter = {"building": True}

# Perform conversion
print("Converting OSM PBF to GeoParquet...")
geoparquet_path = qosm.convert_pbf_to_geoparquet(
    pbf_path,
    tags_filter=tags_filter,
    ignore_cache=True,
)
print(f"Conversion complete. Output: {geoparquet_path}")

# Verify results
gdf = gpd.read_parquet(geoparquet_path)
print(f"Successfully extracted {len(gdf)} building features")
print(gdf.head())
```
Step 4: Analyze with DuckDB
Once converted, immediate SQL analysis becomes possible:
```sql
-- Load spatial extension
INSTALL spatial;
LOAD spatial;

-- Analyze building types
SELECT building, count(*) AS count
FROM read_parquet('monaco_building_True.geoparquet')
GROUP BY building
ORDER BY count DESC
LIMIT 10;

-- Find buildings within specific area
SELECT *
FROM read_parquet('monaco_building_True.geoparquet')
WHERE ST_Within(
    ST_GeomFromWKB(geometry),
    ST_MakeEnvelope(7.405, 43.725, 7.430, 43.745)
);
```
Advanced OSM Processing Techniques
Tag Filtering Strategies
OSM data contains rich tagging schemas. Effective filtering strategies include:
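```python
# Multiple tag filters
tags_filter = {
    "amenity": ["restaurant", "cafe", "pub"],
    "tourism": True,
    "highway": ["primary", "secondary"],
}

# Exclude certain features. Python keeps only the last value for a repeated
# dict key, so "building" appears once. The {"not": [...]} form is
# illustrative; check the converter's documentation for the exclusion
# syntax it actually supports.
tags_filter = {
    "building": {"not": ["no", "demolished"]},
}
```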
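To make the filter semantics concrete, here is a minimal sketch of how a filter of this shape could be evaluated against one element's tags. It is not QuackOSM's implementation; the `matches` function is hypothetical and assumes that an element is kept when any one filter entry accepts it, that `True` accepts any value, and that a list accepts only the listed values.

```python
def matches(tags: dict, tags_filter: dict) -> bool:
    """True if any filter entry accepts the element's tags.

    True accepts any value for the key; a list accepts only the listed values.
    """
    for key, wanted in tags_filter.items():
        value = tags.get(key)
        if value is None:
            continue  # element does not carry this tag at all
        if wanted is True or (isinstance(wanted, list) and value in wanted):
            return True
    return False


tags_filter = {"amenity": ["restaurant", "cafe", "pub"], "tourism": True}

# Hypothetical tag sets for four elements
elements = [
    {"amenity": "cafe", "name": "A"},
    {"tourism": "museum"},
    {"amenity": "bank"},
    {"highway": "primary"},
]
kept = [tags for tags in elements if matches(tags, tags_filter)]
```

Only the cafe and the museum survive: the bank has the right key but the wrong value, and the road carries no filtered key.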
Handling Complex Geometries
- Polygon assembly from ways
- Hole detection in multipolygons
- Invalid geometry repair
- Coordinate system transformations
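Two of the steps listed above, polygon assembly and hole detection, can be sketched in pure Python. This is a simplified illustration under stated assumptions, not what a production converter does: `assemble_ring` assumes the way fragments form exactly one ring, and `point_in_ring` uses a basic ray-casting test on one vertex to decide whether an inner ring lies inside an outer one. Both function names and the sample data are hypothetical.

```python
def assemble_ring(ways):
    """Join way fragments end to end into one closed ring of node ids.

    Assumes the fragments form a single ring; a fragment is reversed when
    it connects by its last node.
    """
    remaining = [list(w) for w in ways]
    ring = remaining.pop(0)
    while remaining and ring[0] != ring[-1]:
        for i, way in enumerate(remaining):
            if way[0] == ring[-1]:
                ring.extend(way[1:])
            elif way[-1] == ring[-1]:
                ring.extend(reversed(way[:-1]))
            else:
                continue
            remaining.pop(i)
            break
        else:
            raise ValueError("ways do not connect into a ring")
    if ring[0] != ring[-1]:
        raise ValueError("ring is not closed")
    return ring


def point_in_ring(point, ring):
    """Ray-casting test: is the point inside the closed coordinate ring?"""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


# Three fragments, the second stored in the opposite direction
ring_ids = assemble_ring([[1, 2, 3], [5, 4, 3], [5, 1]])

outer = [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]
inner = [(4, 4), (6, 4), (6, 6), (4, 6), (4, 4)]
is_hole = point_in_ring(inner[0], outer)
```

Real multipolygon relations also carry `inner` and `outer` member roles, which converters can use alongside geometric tests like this one.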
Performance Optimization
- Use geographic filtering to reduce data volume
- Partition by administrative boundaries
- Apply tag filtering early in the pipeline
- Utilize parallel processing where available
Real-World Applications
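Urban Planning Analysis
```sql
-- Analyze building density by area
-- (area is in square meters only if geometries use a projected, meter-based CRS;
--  ST_Union_Agg is DuckDB's aggregate union, PostGIS uses ST_Union)
SELECT
    COUNT(*) AS building_count,
    ST_Area(ST_ConvexHull(ST_Union_Agg(ST_GeomFromWKB(geometry)))) AS area_sqm
FROM osm_buildings
WHERE admin_level = 8;
```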
Transportation Network Analysis
```sql
-- Summarize the road network by highway class
-- (lengths are in meters only if geometries use a projected, meter-based CRS)
SELECT
    highway,
    COUNT(*) AS segment_count,
    SUM(ST_Length(ST_GeomFromWKB(geometry))) AS total_length_m
FROM osm_roads
GROUP BY highway
ORDER BY total_length_m DESC;
```
Amenity Accessibility Studies
```sql
-- Find hospitals within walking distance of residential areas
-- (the distance is in meters only if geometries use a projected, meter-based CRS)
SELECT DISTINCT h.name AS hospital_name
FROM osm_amenities h, osm_landuse r
WHERE h.amenity = 'hospital'
  AND r.landuse = 'residential'
  AND ST_DWithin(
      ST_GeomFromWKB(h.geometry),
      ST_GeomFromWKB(r.geometry),
      800  -- 800 meters
  );
```
Workflow Integration: Sanity-Checking Complex Conversions
OSM conversion involves complex logic for reconstructing geometries and parsing inconsistent tags. ViewParquet provides crucial validation capabilities:
Visual Geometry Inspection: Quick map views reveal conversion artifacts or incorrect geometries
Schema Validation: Confirm OSM tags parsed correctly into expected columns
Attribute Verification: Check data types and value distributions
This immediate feedback is invaluable for debugging conversion pipelines, allowing engineers to catch errors before downstream processing.
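Visual inspection can be complemented by a few scripted checks. The sketch below is independent of ViewParquet and of any GeoParquet reader: the `validate_rows` function and the row layout (dicts with a `bbox` tuple) are hypothetical stand-ins for whatever a real pipeline loads. It covers the schema and attribute checks described above in their simplest form.

```python
def validate_rows(rows, required_columns, bounds=(-180.0, -90.0, 180.0, 90.0)):
    """Return a list of human-readable problems found in converted rows.

    Each row is a dict with a 'bbox' entry of (xmin, ymin, xmax, ymax).
    """
    problems = []
    lon_min, lat_min, lon_max, lat_max = bounds
    for i, row in enumerate(rows):
        missing = [c for c in required_columns if c not in row]
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
        bbox = row.get("bbox")
        if bbox is None:
            problems.append(f"row {i}: no geometry bounds")
            continue
        xmin, ymin, xmax, ymax = bbox
        if xmin > xmax or ymin > ymax:
            problems.append(f"row {i}: inverted bounds")
        if xmin < lon_min or ymin < lat_min or xmax > lon_max or ymax > lat_max:
            problems.append(f"row {i}: coordinates outside lon/lat range")
    return problems


# Hypothetical rows: the second lacks a tag column and has an impossible latitude
rows = [
    {"feature_id": "way/10", "building": "yes", "bbox": (7.41, 43.73, 7.42, 43.74)},
    {"feature_id": "way/11", "bbox": (43.73, 7.41, 43.74, 200.0)},
]
problems = validate_rows(rows, ["feature_id", "building"])
```

Checks like these are cheap to run on every conversion and catch swapped axes or dropped tag columns before anyone opens a map.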
Best Practices
- Start Small: Test conversion workflows on city/region extracts before processing larger areas
- Tag Strategy: Develop clear tag filtering strategies based on analysis requirements
- Quality Assurance: Always validate conversion results before production use
- Documentation: Document tag schemas and conversion parameters for reproducibility
- Version Control: Track OSM data versions and conversion configurations
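One way to act on the documentation and version-control points above is to write a small manifest next to each output. The sketch below is an assumption about what such a record could contain, not an established convention; the function names and manifest fields are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large extracts do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(pbf_path, tags_filter, tool, tool_version):
    """Collect what is needed to reproduce a conversion run."""
    return {
        "source_file": Path(pbf_path).name,
        "source_sha256": file_sha256(pbf_path),
        "tags_filter": tags_filter,
        "tool": tool,
        "tool_version": tool_version,
    }


def write_manifest(manifest, out_path):
    """Write the manifest as stable, diff-friendly JSON."""
    Path(out_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

Because the JSON is written with sorted keys, the manifest diffs cleanly when committed alongside the conversion configuration.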
The conversion of OSM data to GeoParquet represents a significant democratization of access to the world's largest geospatial dataset, enabling standard data tools to work with this invaluable resource.