Gazetteers
This guide explains how to query gazetteer data, understand feature attributes, and configure custom gazetteers for specialized geographic databases.
Overview
The Irchel Geoparser uses gazetteers as the authoritative source of geographic information for toponym resolution. A gazetteer stores information about places, including their names, types, administrative hierarchies, and coordinates. For a toponym such as “Paris”, the gazetteer contains entries for Paris, France; Paris, Texas; Paris, Ontario; and many other places of that name around the world. Each entry includes not just the name but also attributes such as coordinates, population, feature type, and administrative hierarchy, which help distinguish one Paris from another.
The library’s architecture separates the gazetteer system from the processing modules. Resolvers don’t access gazetteers directly through SQL queries or file reads. Instead, they use the Gazetteer class interface, which provides standardized search methods. This abstraction allows gazetteers to have different internal schemas and still be used interchangeably by resolvers.
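The separation described above can be illustrated with a small, self-contained sketch. It does not use the library itself: ToyFeature, SearchableGazetteer, InMemoryGazetteer, and candidate_ids are hypothetical names invented for this example, and the real Gazetteer class is backed by a database rather than a dictionary.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ToyFeature:
    """Hypothetical stand-in for a gazetteer feature."""
    identifier: str
    data: dict = field(default_factory=dict)


class SearchableGazetteer(Protocol):
    """Hypothetical interface: any object with a search() method qualifies."""

    def search(self, name: str) -> list[ToyFeature]: ...


class InMemoryGazetteer:
    """Toy backend that keeps features in a dict keyed by lowercase name."""

    def __init__(self, features: list[ToyFeature]) -> None:
        self._by_name: dict[str, list[ToyFeature]] = {}
        for feature in features:
            key = feature.data["name"].lower()
            self._by_name.setdefault(key, []).append(feature)

    def search(self, name: str) -> list[ToyFeature]:
        return list(self._by_name.get(name.lower(), []))


def candidate_ids(gazetteer: SearchableGazetteer, toponym: str) -> list[str]:
    """Resolver-like caller that depends only on the interface, not the storage."""
    return [feature.identifier for feature in gazetteer.search(toponym)]


toy = InMemoryGazetteer([
    ToyFeature("a1", {"name": "Paris", "country_name": "France"}),
    ToyFeature("b2", {"name": "Paris", "country_name": "United States"}),
])
print(candidate_ids(toy, "paris"))
```

Because candidate_ids only calls search(), a backend with a completely different internal layout could be swapped in without changing the caller.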
Gazetteers are stored in a centralized SQLite database that includes spatial indexing capabilities through the SpatiaLite extension. This database can contain multiple gazetteers simultaneously, each with its own tables and indices. The gazetteer installer handles all the complexity of downloading source data, transforming it into the right format, creating database schemas, and building indices.
Built-in Gazetteers
The library includes support for two major gazetteers that cover different geographic scopes and use cases.
GeoNames
GeoNames is a comprehensive global gazetteer containing over 13 million place names. It includes entries for countries, administrative divisions, cities, towns, neighborhoods, natural features like mountains and rivers, and points of interest like buildings and monuments. However, the global scope means that coverage varies significantly by region, with some areas having more detailed and up-to-date information than others.
To install GeoNames:
python -m geoparser download geonames
The installation process can take 15 to 30 minutes, depending on your system.
SwissNames3D
SwissNames3D is a high-quality gazetteer specifically for Switzerland, provided by Swisstopo, the Swiss Federal Office of Topography. It contains detailed information about geographic features within Switzerland, including precise 3D coordinates, building addresses, and fine-grained feature classifications. The gazetteer also maintains relationships with administrative boundaries, allowing features to be associated with municipalities, districts, and cantons.
To install SwissNames3D:
python -m geoparser download swissnames3d
The installation process typically completes within a few minutes.
Querying Gazetteers
The Gazetteer class provides a Python interface for querying gazetteer data. This interface is primarily used by resolvers, but you can also use it directly for exploration or custom processing logic.
Initializing a Gazetteer
Create a gazetteer instance by specifying its name:
from geoparser import Gazetteer
gazetteer = Gazetteer("geonames")
The gazetteer name must correspond to an installed gazetteer in the database. If the gazetteer isn’t installed, an error will occur when you try to query it.
Searching for Features
The search() method finds features matching a given name string. It supports different search methods that trade off precision and recall:
from geoparser import Gazetteer
gazetteer = Gazetteer("geonames")
# Exact string matching
features = gazetteer.search("Paris", method="exact")
print(f"Found {len(features)} features")
for feature in features[:5]:  # Show first 5
    print(f"- {feature.data.get('name')}, {feature.data.get('country_name')}")
The method parameter controls the matching strategy:
"exact": Only returns features whose name exactly matches the search string (case-insensitive and diacritics-insensitive). This is the fastest method but will miss features with slightly different names.
"phrase": Returns features whose name contains the search string as a complete phrase. This catches variations like “New York City” when searching for “New York” but is still quite restrictive.
"partial": Returns features whose name contains any of the tokens in the search string. This is more flexible and can handle cases where articles or qualifiers are included or omitted, but it may return many candidates.
"fuzzy": Uses fuzzy string matching to find features with names similar to the search string, even with spelling variations or typos. This is the most permissive method and generates the most candidates.
For the non-exact search methods, you can specify a tiers parameter that controls how many rank tiers of results to include. Results are ranked by their match score (BM25 relevance for phrase/partial methods, edit distance for fuzzy method), and tiers group results into brackets of similar scores. Higher tier values include more results but also results with lower match quality:
# Get only the top-ranked matches
features = gazetteer.search("London", method="partial", tiers=1)
# Get more permissive results including lower-ranked matches
features = gazetteer.search("London", method="partial", tiers=3)
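To make the matching strategies and the tier cut-off more concrete, here is a rough, library-independent sketch. It is not the Gazetteer implementation: the functions normalize, matches, and top_tiers are hypothetical, the fuzzy branch uses difflib's similarity ratio instead of edit distance, the 0.8 threshold is arbitrary, the toy score stands in for BM25, and treating each distinct score as its own tier is an assumption about how brackets are formed.

```python
import difflib
import unicodedata
from itertools import groupby


def normalize(text):
    """Lowercase and strip diacritics so 'Zürich' and 'zurich' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def matches(query, name, method="exact"):
    """Toy version of the four matching strategies."""
    q, n = normalize(query), normalize(name)
    if method == "exact":
        return q == n
    if method == "phrase":
        # The query tokens must appear in the name as one contiguous run.
        q_tokens, n_tokens = q.split(), n.split()
        width = len(q_tokens)
        return any(n_tokens[i:i + width] == q_tokens
                   for i in range(len(n_tokens) - width + 1))
    if method == "partial":
        # Any shared token is enough.
        return bool(set(q.split()) & set(n.split()))
    if method == "fuzzy":
        # Similarity ratio as a stand-in for an edit-distance criterion.
        return difflib.SequenceMatcher(None, q, n).ratio() >= 0.8
    raise ValueError(f"unknown method: {method}")


def top_tiers(scored, tiers=1):
    """Keep names whose score is among the best `tiers` distinct score values.

    `scored` is a list of (name, score) pairs; a higher score is better.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    kept = []
    for tier_index, (_, group) in enumerate(groupby(ranked, key=lambda pair: pair[1])):
        if tier_index >= tiers:
            break
        kept.extend(name for name, _ in group)
    return kept


names = ["London", "London Bridge", "East London", "Londonderry", "Paris"]
partial_hits = [n for n in names if matches("London", n, "partial")]
# Toy relevance score: names with fewer tokens rank higher (not BM25).
scored = [(n, 1 / len(n.split())) for n in partial_hits]
print(top_tiers(scored, tiers=1))
print(top_tiers(scored, tiers=2))
```

In this toy setup, raising tiers from 1 to 2 admits the two-token names that share the lower score, which mirrors the trade-off between result count and match quality.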
Finding Features by Identifier
If you know a feature’s identifier, you can retrieve it directly using the find() method:
from geoparser import Gazetteer
gazetteer = Gazetteer("geonames")
# Look up Paris, France by its geonameid
feature = gazetteer.find("2988507")
if feature:
    print(f"Name: {feature.data.get('name')}")
    print(f"Country: {feature.data.get('country_name')}")
    print(f"Population: {feature.data.get('population')}")
The identifier used in find() should match the identifier scheme used by that gazetteer. For GeoNames, this is the geonameid; for SwissNames3D, it’s the UUID.
Working with Features
The search() and find() methods return Feature objects that represent individual geographic entities in the gazetteer. Each feature has several important properties:
from geoparser import Gazetteer
gazetteer = Gazetteer("geonames")
features = gazetteer.search("Tokyo")
if features:
    feature = features[0]
    # The feature's unique identifier value
    print(f"ID: {feature.location_id_value}")
    # The feature's attributes as a dictionary
    print(f"Data: {feature.data}")
    # The feature's geometry as a Shapely object
    print(f"Geometry: {feature.geometry}")
    # .x and .y are only available on Point geometries
    print(f"Coordinates: ({feature.geometry.x}, {feature.geometry.y})")
The location_id_value property contains the identifier that can be used to reference this feature, for example when creating referent annotations. The data property is a dictionary containing all the attributes from the gazetteer for this feature. The geometry property returns a Shapely geometry object representing the feature’s spatial extent. Most gazetteers use Point geometries for locations, but this can also be polygons or other geometry types depending on the gazetteer.
The attributes available in the data dictionary depend on which gazetteer you’re using. For GeoNames, common attributes include:
name: The main name of the feature
latitude and longitude: Coordinates in decimal degrees
feature_name: Human-readable feature type (e.g., “city”, “mountain”, “stream”)
country_name: Name of the country the feature is in
admin1_name, admin2_name: First and second-level administrative divisions
population: Population count for inhabited places
elevation: Elevation in meters above sea level
For SwissNames3D, attributes include:
NAME: The name of the feature
OBJEKTART: Detailed object type in German
GEMEINDE_NAME: Municipality name
KANTON_NAME: Canton name
HOEHE: Elevation in meters
The exact attribute schema is defined in the gazetteer’s configuration file and reflected in the database schema.
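As an illustration of how these attributes can help tell same-named places apart, the following sketch picks the most populous candidate from a list of data dictionaries. It is only a toy heuristic, not how the library's resolvers work; most_populous is a hypothetical helper, and the records are illustrative values rather than real gazetteer entries.

```python
def most_populous(candidates):
    """Return the candidate dict with the largest population, or None if empty."""
    # Missing or null populations are treated as 0 so they rank last.
    return max(candidates, key=lambda data: data.get("population") or 0, default=None)


# Illustrative records shaped like GeoNames-style feature.data dictionaries.
records = [
    {"name": "Paris", "country_name": "United States", "population": 25000},
    {"name": "Paris", "country_name": "France", "population": 2000000},
    {"name": "Paris", "country_name": "Canada", "population": None},
]

best = most_populous(records)
print(best["country_name"])
```

With the real library, the same function could be applied to the data property of each Feature returned by search(), provided the gazetteer exposes a population attribute.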
Custom Gazetteer Configuration
The library supports adding custom gazetteers through YAML configuration files. This capability allows you to integrate specialized geographic databases, regional data sources, or proprietary location data without modifying the core library code. A gazetteer configuration describes data sources, their formats, how to process and transform the data, and how features should be identified and named.
Configuration Structure
A gazetteer configuration file has the following top-level structure:
name: my_gazetteer          # Unique identifier for the gazetteer
sources:                    # List of data sources to process
  - name: source1
    # ... source configuration ...
  - name: source2
    # ... source configuration ...
Each source describes a single data file or download that will be loaded into the database. Sources can be combined through joins to create a unified view of geographic features. Not all sources need to provide features directly; some can serve as auxiliary data that enriches other sources through joins (such as administrative boundary data or alternate name lookups).
Source Types and Downloads
Sources can be either tabular (CSV, TSV) or spatial (shapefiles, GeoPackage). For tabular sources, specify the separator character. For URLs, the installer automatically handles ZIP archives:
sources:
  - name: places
    url: https://example.com/data.zip  # Downloaded and extracted
    file: places.csv                   # File within the ZIP
    type: tabular
    separator: ","
For local files, provide both the path (directory containing the file) and file (filename):
sources:
  - name: local_data
    path: /path/to/data/directory
    file: data.csv
    type: tabular
    separator: "\t"
Defining Attributes
Each source must declare its attributes in two categories: original attributes that exist in the source file, and derived attributes computed from SQL expressions.
Original attributes match columns in the source file. Specify their data types (TEXT, INTEGER, REAL, GEOMETRY) and optionally mark them for indexing:
attributes:
  original:
    - name: geonameid
      type: INTEGER
      index: true  # Create database index
    - name: name
      type: TEXT
    - name: latitude
      type: REAL
    - name: longitude
      type: REAL
    - name: population
      type: INTEGER
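The role of the declared names and types can be sketched with the standard csv module. This is not the installer's code: load_rows and CASTERS are hypothetical, and the sketch assumes a header-less file whose columns appear in the same order as the declared attributes, which may not hold for every source.

```python
import csv
import io

# Map the declared attribute types to Python conversions (GEOMETRY is omitted here).
CASTERS = {"TEXT": str, "INTEGER": int, "REAL": float}


def load_rows(text, separator, attributes):
    """Parse header-less delimited text into dicts typed per `attributes`.

    `attributes` is a list of (name, type) pairs in column order.
    """
    rows = []
    for raw in csv.reader(io.StringIO(text), delimiter=separator):
        row = {}
        for (name, type_name), value in zip(attributes, raw):
            # Empty cells become None instead of failing the conversion.
            row[name] = CASTERS[type_name](value) if value != "" else None
        rows.append(row)
    return rows


attributes = [
    ("geonameid", "INTEGER"),
    ("name", "TEXT"),
    ("latitude", "REAL"),
    ("longitude", "REAL"),
    ("population", "INTEGER"),
]
sample = "1\tBern\t46.95\t7.45\t\n"  # Illustrative row with an empty population cell
rows = load_rows(sample, "\t", attributes)
print(rows[0])
```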
Derived attributes are computed using SQL expressions. This is useful for constructing geometries from coordinates, concatenating fields, or applying transformations:
attributes:
  derived:
    - name: geometry
      type: GEOMETRY
      expression: "'POINT(' || longitude || ' ' || latitude || ')'"
      index: true
      srid: 4326  # Spatial reference system
    - name: full_code
      type: TEXT
      expression: "country_code || '.' || admin_code"
      index: true
    - name: name_normalized
      type: TEXT
      expression: "lower(trim(name))"
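To see what such expressions evaluate to, they can be tried against a throwaway in-memory SQLite table. The sketch below uses only Python's built-in sqlite3 module with an illustrative row; it shows plain SQL evaluation and makes no claim about how the installer turns the resulting text into a stored geometry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (name TEXT, latitude REAL, longitude REAL)")
# Illustrative row with approximate coordinates and untidy whitespace in the name.
conn.execute("INSERT INTO places VALUES ('  Bern ', 46.95, 7.45)")

row = conn.execute(
    """
    SELECT 'POINT(' || longitude || ' ' || latitude || ')' AS wkt,
           lower(trim(name)) AS name_normalized
    FROM places
    """
).fetchone()
print(row)
```

Note the coordinate order in the point expression: longitude comes first, then latitude, matching the x/y order of well-known text.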
Creating Views with Joins
If you have multiple sources, you need to define a view that specifies which columns to include in the final gazetteer and how to join the sources together. The view section lists the columns to select and the join conditions:
view:
  select:
    - source: places        # Source table name
      column: geonameid     # Column to include
    - source: places
      column: name
    - source: places
      column: latitude
    - source: places
      column: longitude
    - source: admin_names   # From a different source
      column: name
      alias: admin_name     # Rename column in view
    - source: places
      column: geometry
  join:
    - type: LEFT JOIN       # Join type
      source: admin_names   # Source to join
      condition: places.admin_code = admin_names.code  # Join condition
You can use any SQL join type (LEFT JOIN, INNER JOIN, etc.) and specify complex join conditions. This allows you to enrich your main features table with data from auxiliary tables.
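The effect of a view with a LEFT JOIN and a column alias can be reproduced in a few lines of plain SQLite. This is a sketch under assumptions: the table contents are invented, the view name gazetteer_view is hypothetical, and the SQL the installer actually generates from the configuration may look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE places (geonameid INTEGER, name TEXT, admin_code TEXT);
    CREATE TABLE admin_names (code TEXT, name TEXT);
    INSERT INTO places VALUES (1, 'Alpha', 'X1'), (2, 'Beta', 'ZZ');
    INSERT INTO admin_names VALUES ('X1', 'Region X');

    -- Select columns from both tables; the alias avoids two columns called "name".
    CREATE VIEW gazetteer_view AS
    SELECT places.geonameid, places.name, admin_names.name AS admin_name
    FROM places
    LEFT JOIN admin_names ON places.admin_code = admin_names.code;
    """
)

rows = conn.execute("SELECT * FROM gazetteer_view ORDER BY geonameid").fetchall()
for row in rows:
    print(row)
```

The second place has no matching admin code, so a LEFT JOIN keeps it with a NULL admin_name, whereas an INNER JOIN would drop it from the view.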
Spatial Joins
For determining spatial relationships (e.g., which administrative region contains each feature), you can use spatial join conditions with SpatiaLite functions:
join:
  - type: LEFT JOIN
    source: municipalities
    condition: ST_Within(places.geometry, municipalities.geometry)
This joins each place with the municipality whose boundary contains it. Note that spatial joins can be computationally expensive for large datasets.
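Conceptually, the join pairs every point with a polygon that contains it. The following pure-Python sketch imitates that with a ray-casting containment test on simple polygons. It is not SpatiaLite: point_in_polygon and spatial_left_join are hypothetical helpers, the coordinates are made up, points exactly on a boundary are not treated specially, and only the first containing region is kept.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon given as [(x, y), ...]?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's y value can be crossed by the ray.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def spatial_left_join(places, regions):
    """Pair each place with the first region containing it, or None."""
    joined = []
    for place_name, (x, y) in places.items():
        match = next(
            (region for region, ring in regions.items() if point_in_polygon(x, y, ring)),
            None,
        )
        joined.append((place_name, match))
    return joined


regions = {
    "west": [(0, 0), (5, 0), (5, 5), (0, 5)],
    "east": [(5, 0), (10, 0), (10, 5), (5, 5)],
}
places = {"a": (2, 2), "b": (7, 3), "c": (20, 20)}
print(spatial_left_join(places, regions))
```

Every place is tested against every region here, which hints at why spatial joins become expensive on large datasets unless a spatial index narrows the candidates first.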
Defining Features
The final step is specifying how features are identified and named. The identifier column(s) provide unique IDs for features, while name columns define searchable names:
features:
  identifier:
    - column: geonameid       # Primary identifier column
  names:
    - column: name            # Main name
    - column: asciiname       # ASCII variant
    - column: alternatenames  # Multiple names in one column
      separator: ","          # Split on commas
Name columns with separators are split into individual names during registration, allowing a single feature to be found under multiple name variants.
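The splitting step can be pictured as follows. This is a sketch only: expand_names is a hypothetical helper, the row is illustrative, and trimming whitespace and dropping duplicate names are assumptions rather than documented behavior of the installer.

```python
def expand_names(row, name_columns):
    """Collect the searchable names for one feature row.

    `name_columns` is a list of (column, separator) pairs; separator may be None.
    """
    names = []
    for column, separator in name_columns:
        value = row.get(column)
        if not value:
            continue
        parts = value.split(separator) if separator else [value]
        for part in parts:
            cleaned = part.strip()
            # Skip empty fragments and names already collected.
            if cleaned and cleaned not in names:
                names.append(cleaned)
    return names


row = {
    "name": "Genève",
    "asciiname": "Geneve",
    "alternatenames": "Genf,Geneva,Ginevra,Geneve",
}
name_columns = [("name", None), ("asciiname", None), ("alternatenames", ",")]
print(expand_names(row, name_columns))
```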
Complete Example
Here’s a complete configuration demonstrating both tabular and spatial sources combined with a spatial join:
name: my_gazetteer
sources:
  # Main tabular source with point locations and view
  - name: places
    url: https://example.com/places.csv
    file: places.csv
    type: tabular
    separator: ","
    attributes:
      original:
        - name: id
          type: INTEGER
          index: true
        - name: name
          type: TEXT
        - name: lat
          type: REAL
        - name: lon
          type: REAL
      derived:
        - name: geometry
          type: GEOMETRY
          expression: "'POINT(' || lon || ' ' || lat || ')'"
          index: true
          srid: 4326
    view:
      select:
        - source: places
          column: id
        - source: places
          column: name
        - source: places
          column: lat
        - source: places
          column: lon
        - source: regions
          column: region_name
        - source: places
          column: geometry
      join:
        - type: LEFT JOIN
          source: regions
          condition: ST_Within(places.geometry, regions.geometry)
    features:
      identifier:
        - column: id
      names:
        - column: name
  # Auxiliary spatial source with administrative boundaries
  - name: regions
    url: https://example.com/regions.zip
    file: regions.shp
    type: spatial
    attributes:
      original:
        - name: region_id
          type: INTEGER
        - name: region_name
          type: TEXT
        - name: geometry
          type: GEOMETRY
          index: true
          srid: 4326
This example shows how tabular place data can be enriched with regional information from a spatial data source through a spatial join. The view is defined on the source that provides features (places), while the regions source serves as auxiliary data. For more comprehensive examples, refer to the built-in gazetteer configurations on GitHub: geonames.yaml and swissnames3d.yaml.
Installing Custom Gazetteers
To install a custom gazetteer, provide the path to your configuration file:
python -m geoparser download path/to/my_gazetteer.yaml
The installer validates the configuration, downloads or locates the specified files, creates database tables according to the attribute specifications, loads the data, applies transformations and derivations, creates indices, and registers the gazetteer so it can be queried through the standard interface.
Next Steps
Now that you understand gazetteers, you can explore:
Modules - Learn how resolvers use gazetteers for disambiguation
Training - Train resolvers on specific gazetteers for better performance
Projects - Use projects to organize work with different gazetteers
For complete API documentation of gazetteer classes, see the Gazetteer reference.