
    Modernizing OpenStreetMap Data Handling with GeoParquet

    Transform complex OpenStreetMap PBF data into analysis-ready GeoParquet format, making the world's largest collaborative geospatial dataset accessible to standard data tools and workflows.

    February 1, 2024

    OpenStreetMap (OSM) is the largest collaborative geospatial dataset in the world, a treasure trove of information about roads, buildings, points of interest, and more, all contributed by volunteers. However, its richness is matched by its complexity. The native data formats of OSM, while optimized for editing and distribution, present a significant hurdle for direct use in standard data analysis workflows.

    The Challenge of Raw OpenStreetMap Data

    OSM data is primarily distributed in two formats: XML (.osm) and its more compact binary equivalent, PBF (.osm.pbf). PBF is the de facto standard for large extracts, but both formats mirror the data model used for editing OSM rather than a layout suited to analysis: features are stored as interlinked elements that reference one another by ID, not as rows with ready-made geometries. That model has three element types:

    • Nodes: Points with coordinates (latitude, longitude)
    • Ways: Ordered lists of nodes that form polylines or polygons
    • Relations: Groups of nodes, ways, and other relations for complex features
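
To make the data model concrete, here is a minimal Python sketch of the three element types and of the lookup needed to turn a way into coordinates. The class and function names are illustrative only and do not come from any OSM library.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: int
    lat: float
    lon: float
    tags: dict = field(default_factory=dict)


@dataclass
class Way:
    id: int
    node_refs: list  # ordered node ids, not coordinates
    tags: dict = field(default_factory=dict)


@dataclass
class Relation:
    id: int
    members: list  # (member_type, member_id, role) tuples
    tags: dict = field(default_factory=dict)


def way_coordinates(way, nodes_by_id):
    """Resolve a way's node references into (lon, lat) pairs."""
    return [(nodes_by_id[ref].lon, nodes_by_id[ref].lat) for ref in way.node_refs]


nodes = {
    1: Node(1, 43.7380, 7.4240),
    2: Node(2, 43.7380, 7.4250),
    3: Node(3, 43.7390, 7.4250),
}
# A closed way repeats its first node id at the end
building = Way(10, [1, 2, 3, 1], {"building": "yes"})
ring = way_coordinates(building, nodes)
print(ring)
```

The way itself holds no coordinates, so every geometry requires a join against the node table. This is the step that a flat GeoParquet file has already done for you.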

    The Traditional Workflow Problem

    To answer a seemingly simple question—such as "Count all the restaurants in Berlin"—one cannot simply query the PBF file. The traditional approach requires:

    1. Using specialized tools like Osmium to parse PBF files
    2. Loading data into PostgreSQL/PostGIS with osm2pgsql
    3. Reconstructing geometries from the node/way/relation structure
    4. Writing complex spatial queries

    This represents a significant technical barrier, effectively walling off the data from practitioners comfortable with standard analytical tools.

    The GeoParquet Advantage for OSM

    Converting OSM data to GeoParquet fundamentally changes this dynamic, transforming the complex, relational PBF structure into a simple, flat, columnar format immediately usable by modern data tools.

    Key Advantages

    Direct Queryability: Once in GeoParquet format, data can be queried directly using standard SQL with tools like DuckDB or Spark SQL, eliminating the need for specialized spatial databases.

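    Enhanced Performance: The columnar layout allows efficient filtering. A query that filters by OSM tags reads only the columns it references (column pruning), and the engine can use row-group statistics to skip data that cannot match the filter (predicate pushdown). Bounding-box filters benefit in the same way when the file carries per-feature bounding-box columns.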

    Simplified Portability: A single .geoparquet file is far easier to manage, share, and use than PBF files requiring specialized toolchains.

    Cloud-Native Scalability: Planet-scale OSM extracts can be converted to partitioned GeoParquet datasets and stored in cloud object storage for scalable analysis.
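
The predicate pushdown mentioned under Enhanced Performance can be pictured with a toy model. The sketch below is not how any particular engine is implemented. It assumes each row group exposes minimum and maximum longitude/latitude statistics (the `RowGroup` class and its fields are hypothetical) and shows how a reader could skip row groups that cannot intersect a query bounding box.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Stand-in for a Parquet row group with min/max coordinate statistics."""
    name: str
    min_lon: float
    max_lon: float
    min_lat: float
    max_lat: float


def overlaps(group, bbox):
    """True if the row group may contain rows inside bbox."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return not (
        group.max_lon < min_lon
        or group.min_lon > max_lon
        or group.max_lat < min_lat
        or group.min_lat > max_lat
    )


row_groups = [
    RowGroup("rg0", 7.40, 7.42, 43.72, 43.74),
    RowGroup("rg1", 7.42, 7.45, 43.74, 43.76),
    RowGroup("rg2", 13.30, 13.50, 52.45, 52.55),  # Berlin, far from the query
]

query_bbox = (7.405, 43.725, 7.430, 43.745)  # min lon, min lat, max lon, max lat
to_read = [g.name for g in row_groups if overlaps(g, query_bbox)]
print(to_read)
```

Only the row groups that survive this check need to be fetched and decoded, which is why sorting or partitioning features spatially before writing tends to make bounding-box queries cheaper.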

    A Modern OSM Data Pipeline in Practice

    Step 1: Obtain an OSM PBF Extract

    Download extracts from reliable sources like Geofabrik, which provides regularly updated extracts for continents, countries, and regions.
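
As a rough sketch of this step, the helper below builds a Geofabrik download URL and fetches the file only if no local copy exists. The function names are hypothetical, and the URL pattern reflects the Geofabrik download layout at the time of writing, so verify it against the download page for your region.

```python
import urllib.request
from pathlib import Path

GEOFABRIK_BASE = "https://download.geofabrik.de"


def geofabrik_url(region_path):
    """Build the download URL for a region path such as 'europe/monaco'."""
    return f"{GEOFABRIK_BASE}/{region_path.strip('/')}-latest.osm.pbf"


def download_extract(region_path, target_dir="."):
    """Download the extract unless a local copy already exists."""
    url = geofabrik_url(region_path)
    target = Path(target_dir) / url.rsplit("/", 1)[-1]
    if not target.exists():
        urllib.request.urlretrieve(url, target)
    return target


url = geofabrik_url("europe/monaco")
print(url)
# download_extract("europe/monaco")  # uncomment to fetch the file
```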

    Step 2: Choose a Conversion Tool

    Several powerful open-source tools handle PBF-to-GeoParquet conversion:

    QuackOSM (used in Step 3):
    • Excellent for Python data science workflows
    • Simple command-line interface
    • Built-in filtering capabilities

    A converter geared toward full-history files:
    • Powerful for full OSM history files
    • Can enrich data with changeset metadata
    • Enterprise-grade performance

    A transcoder developed by the Overture Maps team:
    • Optimized for raw transcoding speed
    • Ideal for large-scale processing

    Step 3: Perform the Conversion

    Example using QuackOSM to extract Monaco buildings:

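    ```python
    import urllib.request

    import geopandas as gpd
    import quackosm as qosm

    # Download the Monaco extract from Geofabrik (or point pbf_path at an existing file)
    pbf_path = "monaco-latest.osm.pbf"
    urllib.request.urlretrieve(
        "https://download.geofabrik.de/europe/monaco-latest.osm.pbf", pbf_path
    )

    # Define a filter for buildings only
    tags_filter = {"building": True}

    # Perform the conversion. The function name follows recent QuackOSM releases;
    # older versions used different names, so check your installed version.
    print("Converting OSM PBF to GeoParquet...")
    geoparquet_path = qosm.convert_pbf_to_parquet(
        pbf_path,
        tags_filter=tags_filter,
        result_file_path="monaco_building_True.geoparquet",
        ignore_cache=True,
    )
    print(f"Conversion complete. Output: {geoparquet_path}")

    # Verify the results
    gdf = gpd.read_parquet(geoparquet_path)
    print(f"Successfully extracted {len(gdf)} building features")
    print(gdf.head())
    ```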

    Step 4: Analyze with DuckDB

    Once converted, immediate SQL analysis becomes possible:

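    ```sql
    -- Load the spatial extension
    INSTALL spatial;
    LOAD spatial;

    -- Analyze building types
    SELECT building, count(*) AS building_count
    FROM read_parquet('monaco_building_True.geoparquet')
    GROUP BY building
    ORDER BY building_count DESC
    LIMIT 10;

    -- Find buildings within a bounding box (min lon, min lat, max lon, max lat).
    -- ST_GeomFromWKB assumes the geometry column is read as WKB. Recent DuckDB
    -- versions may already return GEOMETRY for GeoParquet files; in that case,
    -- pass the column directly.
    SELECT *
    FROM read_parquet('monaco_building_True.geoparquet')
    WHERE ST_Within(
        ST_GeomFromWKB(geometry),
        ST_MakeEnvelope(7.405, 43.725, 7.430, 43.745)
    );
    ```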

    Advanced OSM Processing Techniques

    Tag Filtering Strategies

    OSM data contains rich tagging schemas. Effective filtering strategies include:

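    ```python
    # Multiple tag filters: a list matches specific values, True matches any value
    tags_filter = {
        "amenity": ["restaurant", "cafe", "pub"],
        "tourism": True,
        "highway": ["primary", "secondary"],
    }

    # Exclude certain values. A dict cannot hold the "building" key twice (the
    # second entry silently replaces the first), so select all buildings at
    # conversion time and drop the unwanted values afterwards.
    tags_filter = {"building": True}
    excluded_values = ["no", "demolished"]
    # After loading the result: gdf = gdf[~gdf["building"].isin(excluded_values)]
    ```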
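
To show what such a filter means for individual features, here is a small pure-Python model of the matching rules. It assumes that keys are combined with OR, that `True` accepts any value, and that a list accepts any of its values. This illustrates the semantics only; it is not QuackOSM's implementation, and `matches_filter` is a hypothetical helper.

```python
def matches_filter(tags, tags_filter):
    """Return True if at least one filter key matches the feature's tags."""
    for key, wanted in tags_filter.items():
        if key not in tags:
            continue
        if wanted is True:
            return True
        if isinstance(wanted, str) and tags[key] == wanted:
            return True
        if isinstance(wanted, (list, tuple, set)) and tags[key] in wanted:
            return True
    return False


tags_filter = {
    "amenity": ["restaurant", "cafe", "pub"],
    "tourism": True,
}

features = [
    {"amenity": "restaurant", "name": "A"},
    {"amenity": "bank", "name": "B"},
    {"tourism": "museum", "name": "C"},
    {"highway": "primary"},
]
kept = [f.get("name") for f in features if matches_filter(f, tags_filter)]
print(kept)
```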

    Handling Complex Geometries

    A robust conversion has to cover:

    • Polygon assembly from ways
    • Hole detection in multipolygons
    • Invalid geometry repair
    • Coordinate system transformations
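
The first two items can be sketched in plain Python. The example below stitches way segments into a closed ring and uses a ray-casting test to decide whether a second ring lies inside the first, which would make it a hole. It works on planar coordinates and ignores many real-world cases (touching rings, self-intersections, reprojection), so treat it as an illustration rather than what any converter actually runs. All function names are hypothetical.

```python
def assemble_ring(segments):
    """Join way segments (lists of node ids) end to end into one closed ring.

    Segments may be stored in either direction. Raises ValueError if they
    cannot be joined into a closed ring.
    """
    remaining = [list(s) for s in segments]
    ring = remaining.pop(0)
    while ring[0] != ring[-1]:
        for i, seg in enumerate(remaining):
            if seg[0] == ring[-1]:
                ring.extend(seg[1:])
            elif seg[-1] == ring[-1]:
                ring.extend(reversed(seg[:-1]))
            else:
                continue
            remaining.pop(i)
            break
        else:
            raise ValueError("segments do not form a closed ring")
    return ring


def signed_area(coords):
    """Shoelace formula for a closed ring; positive if counter-clockwise."""
    return 0.5 * sum(
        x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(coords, coords[1:])
    )


def point_in_ring(point, coords):
    """Ray-casting test: is the point strictly inside the closed ring?"""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


node_xy = {1: (0, 0), 2: (4, 0), 3: (4, 4), 4: (0, 4)}
# Three ways that together form the outer boundary; the second is reversed
outer_ids = assemble_ring([[1, 2], [3, 2], [3, 4, 1]])
outer = [node_xy[n] for n in outer_ids]
inner = [(1, 1), (1, 2), (2, 2), (2, 1), (1, 1)]

is_hole = point_in_ring(inner[0], outer)
polygon_area = abs(signed_area(outer)) - (abs(signed_area(inner)) if is_hole else 0)
print(outer_ids, is_hole, polygon_area)
```

Repairing invalid geometries and transforming coordinate systems are better left to a geometry library, since both have many edge cases this sketch does not attempt.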

    Performance Optimization

    • Use geographic filtering to reduce data volume
    • Partition by administrative boundaries
    • Apply tag filtering early in the pipeline
    • Utilize parallel processing where available
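
As one possible reading of the partitioning advice, the sketch below groups features into Hive-style `key=value` directories, a common convention that engines such as DuckDB and Spark can use to skip whole partitions. It shows only the grouping logic. The `region` attribute is a hypothetical column that would have to be assigned beforehand (for example by a spatial join with boundary polygons), and a real pipeline would hand each group to a Parquet writer.

```python
from collections import defaultdict


def partition_path(feature, key="region"):
    """Hive-style directory name for a feature, e.g. 'region=monaco'."""
    value = str(feature.get(key, "unknown")).lower().replace(" ", "_")
    return f"{key}={value}"


def partition_features(features, key="region"):
    """Group features by the partition directory they would be written to."""
    partitions = defaultdict(list)
    for feature in features:
        partitions[partition_path(feature, key)].append(feature)
    return dict(partitions)


features = [
    {"id": 1, "region": "Monaco", "building": "yes"},
    {"id": 2, "region": "Monaco", "building": "hotel"},
    {"id": 3, "region": "Berlin", "building": "yes"},
    {"id": 4, "building": "yes"},  # no region assigned
]
partitions = partition_features(features)
for path, rows in sorted(partitions.items()):
    print(path, [row["id"] for row in rows])
```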

    Real-World Applications

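    Urban Planning Analysis

    ```sql
    -- Analyze building density by area.
    -- Assumes each building row carries the admin_level of its enclosing boundary.
    -- ST_Area returns CRS units: for square meters, the geometries must be in a
    -- projected, meter-based CRS (OSM data is normally lon/lat in EPSG:4326).
    -- ST_Union_Agg is the aggregate form of ST_Union in DuckDB.
    SELECT
        COUNT(*) AS building_count,
        ST_Area(ST_ConvexHull(ST_Union_Agg(ST_GeomFromWKB(geometry)))) AS area_sqm
    FROM osm_buildings
    WHERE admin_level = 8;
    ```

    Transportation Network Analysis

    ```sql
    -- Summarize the road network by highway class.
    -- ST_Length returns CRS units: meters only if the data is in a metric CRS.
    SELECT
        highway,
        COUNT(*) AS segment_count,
        SUM(ST_Length(ST_GeomFromWKB(geometry))) AS total_length_m
    FROM osm_roads
    GROUP BY highway
    ORDER BY total_length_m DESC;
    ```

    Amenity Accessibility Studies

    ```sql
    -- Find hospitals within walking distance of residential areas.
    -- The distance argument is in CRS units: 800 means 800 meters only if both
    -- tables are stored in a metric CRS, not in lon/lat degrees.
    SELECT DISTINCT h.name AS hospital_name
    FROM osm_amenities h, osm_landuse r
    WHERE h.amenity = 'hospital'
      AND r.landuse = 'residential'
      AND ST_DWithin(
          ST_GeomFromWKB(h.geometry),
          ST_GeomFromWKB(r.geometry),
          800  -- 800 meters
      );
    ```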
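
The 800-meter threshold in the accessibility query only means meters if distances are measured in a metric unit. OSM coordinates are longitude/latitude in degrees, so a planar distance on raw coordinates is in degrees. The sketch below uses the haversine formula on a spherical Earth to show the difference. It is an approximation for illustration, not a substitute for reprojecting the data or using a spheroid-aware function, and the sample coordinates are made up.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_walking_distance(a, b, max_m=800):
    """True if two (lon, lat) points are at most max_m meters apart."""
    return haversine_m(*a, *b) <= max_m


hospital = (7.4200, 43.7370)      # illustrative point in Monaco
residential = (7.4250, 43.7400)   # illustrative point nearby
far_away = (13.4050, 52.5200)     # Berlin

print(round(haversine_m(*hospital, *residential)))
print(within_walking_distance(hospital, residential))
print(within_walking_distance(hospital, far_away))
```

One degree of latitude is roughly 111 km, so a threshold of 800 applied to degree coordinates would match essentially everything.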

    Workflow Integration: Sanity-Checking Complex Conversions

    OSM conversion involves complex logic for reconstructing geometries and parsing inconsistent tags. ViewParquet provides crucial validation capabilities:

    Visual Geometry Inspection: Quick map views reveal conversion artifacts or incorrect geometries

    Schema Validation: Confirm OSM tags parsed correctly into expected columns

    Attribute Verification: Check data types and value distributions

    This immediate feedback is invaluable for debugging conversion pipelines, allowing engineers to catch errors before downstream processing.
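
Visual inspection can be paired with a few automated checks. The following is a minimal sketch assuming features have already been loaded into plain dictionaries with hypothetical `geom_type`, `coords`, and `building` fields. It flags empty geometries, out-of-range coordinates, and unclosed polygon rings, and summarizes the distribution of one tag.

```python
from collections import Counter


def check_feature(feature):
    """Return a list of problems found in one converted feature."""
    problems = []
    coords = feature.get("coords", [])
    if not coords:
        problems.append("empty geometry")
    for lon, lat in coords:
        if not (-180 <= lon <= 180 and -90 <= lat <= 90):
            problems.append(f"coordinate out of range: ({lon}, {lat})")
            break
    if feature.get("geom_type") == "Polygon" and coords and coords[0] != coords[-1]:
        problems.append("polygon ring is not closed")
    return problems


features = [
    {"id": 1, "geom_type": "Polygon", "building": "yes",
     "coords": [(7.42, 43.73), (7.43, 43.73), (7.43, 43.74), (7.42, 43.73)]},
    {"id": 2, "geom_type": "Polygon", "building": "yes",
     "coords": [(7.42, 43.73), (7.43, 43.73), (7.43, 43.74)]},
    {"id": 3, "geom_type": "Point", "building": "hotel",
     "coords": [(7.42, 437.3)]},
]

# Keep only the features that have at least one problem
report = {}
for feature in features:
    problems = check_feature(feature)
    if problems:
        report[feature["id"]] = problems

value_counts = Counter(f["building"] for f in features)
print(report)
print(value_counts.most_common())
```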

    Best Practices

    1. Start Small: Test conversion workflows on city/region extracts before processing larger areas
    2. Tag Strategy: Develop clear tag filtering strategies based on analysis requirements
    3. Quality Assurance: Always validate conversion results before production use
    4. Documentation: Document tag schemas and conversion parameters for reproducibility
    5. Version Control: Track OSM data versions and conversion configurations
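
For the documentation and version-control points, one lightweight option is to write a small manifest next to each output file. The sketch below records the source file's checksum together with the conversion parameters. The field names, the helper functions, and the placeholder tool version are all assumptions chosen for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large PBF extracts fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(pbf_path, tags_filter, tool, tool_version):
    """Collect what is needed to reproduce a conversion run."""
    return {
        "source_file": str(pbf_path),
        "source_sha256": file_sha256(pbf_path),
        "tags_filter": tags_filter,
        "tool": tool,
        "tool_version": tool_version,
    }


# Demonstration with a placeholder file instead of a real extract
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.osm.pbf"
    sample.write_bytes(b"placeholder bytes, not a real PBF")
    manifest = build_manifest(sample, {"building": True}, "quackosm", "0.0.0")
    manifest_json = json.dumps(manifest, indent=2, sort_keys=True)

print(manifest_json)
```

Committing such a manifest alongside the pipeline code makes it possible to tell later which extract and which filter produced a given GeoParquet file.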

    The conversion of OSM data to GeoParquet represents a significant democratization of access to the world's largest geospatial dataset, enabling standard data tools to work with this invaluable resource.