Modules

Base Classes

class Recognizer(**kwargs)

Bases: Module

Abstract class for modules that perform reference recognition.

These modules identify potential references (place names) in text. They are completely database-agnostic and operate only on raw text data.

NAME: str = None

abstract predict(texts: List[str]) → List[List[Tuple[int, int]] | None]

Predict references in multiple document texts.

This abstract method must be implemented by child classes.

Parameters:

texts – List of document text strings to process

Returns:

A list where each element corresponds to the document at the same index in the input list. Each element is either a list of (start, end) tuples giving the positions of references found in the document, or None to indicate that predictions are not available for that document (e.g., unsupported language or missing data).
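The contract above can be illustrated with a minimal subclass. This is only a sketch: the `Recognizer` class below is a simplified stand-in for the documented base class (the real one derives from `Module` and accepts `**kwargs`), and `CapitalizedWordRecognizer`, its regular expression, and the choice to return None for empty text are hypothetical, introduced purely for illustration.

```python
import re
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple

Span = Tuple[int, int]


class Recognizer(ABC):
    """Simplified stand-in for the documented base class (assumption)."""

    NAME: str = None

    @abstractmethod
    def predict(self, texts: List[str]) -> List[Optional[List[Span]]]:
        ...


class CapitalizedWordRecognizer(Recognizer):
    """Hypothetical recognizer: runs of capitalized words count as references."""

    NAME = "CapitalizedWordRecognizer"
    _PATTERN = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")

    def predict(self, texts: List[str]) -> List[Optional[List[Span]]]:
        results: List[Optional[List[Span]]] = []
        for text in texts:
            if not text:
                # Assumed convention: None means "no prediction available",
                # which is different from an empty list ("no references found").
                results.append(None)
                continue
            results.append([m.span() for m in self._PATTERN.finditer(text)])
        return results


texts = ["flights from Zurich to New York", ""]
spans = CapitalizedWordRecognizer().predict(texts)
```

The output list has one entry per input document, so `spans[i]` always describes `texts[i]`, whether it holds spans or None.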

class Resolver(**kwargs)

Bases: Module

Abstract class for modules that perform reference resolution.

These modules link recognized references to specific referents in a gazetteer. They are completely database-agnostic and operate only on raw text data.

NAME: str = None

abstract predict(texts: List[str], references: List[List[Tuple[int, int]]]) → List[List[Tuple[str, str] | None]]

Predict referents for multiple references across multiple documents.

This abstract method must be implemented by child classes.

Parameters:

texts – List of document text strings
references – List of lists of tuples containing (start, end) positions of references. Each inner list corresponds to references in one document at the same index in texts.

Returns:

A list of lists aligned with the input: each inner list holds the referents for the references in one document, and the element at position [i][j] is the referent for the reference at position [i][j] in references. Each element is either a tuple (gazetteer_name, identifier) for a successfully resolved reference, or None to indicate that no prediction is available for that reference (e.g., missing data or an unsupported format). gazetteer_name identifies the gazetteer the identifier belongs to, and identifier is the value that identifies the referent in that gazetteer.
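The [i][j] alignment between references and referents can be sketched with a toy subclass. The `Resolver` class below is a simplified stand-in for the documented base class, and `LookupResolver`, the gazetteer name `toy_gazetteer`, and the identifier `id-001` are all hypothetical values used only for illustration. The sketch also assumes that (start, end) are character offsets with an exclusive end.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]
Referent = Tuple[str, str]


class Resolver(ABC):
    """Simplified stand-in for the documented base class (assumption)."""

    NAME: str = None

    @abstractmethod
    def predict(
        self, texts: List[str], references: List[List[Span]]
    ) -> List[List[Optional[Referent]]]:
        ...


class LookupResolver(Resolver):
    """Hypothetical resolver: exact, case-insensitive lookup in a dict."""

    NAME = "LookupResolver"

    def __init__(self, gazetteer_name: str, index: Dict[str, str]):
        self.gazetteer_name = gazetteer_name
        self.index = index  # lowercased place name -> identifier

    def predict(self, texts, references):
        results = []
        for text, spans in zip(texts, references):
            doc_referents = []
            for start, end in spans:
                identifier = self.index.get(text[start:end].lower())
                # Keep one output slot per input reference, using None
                # when the reference cannot be resolved.
                if identifier is None:
                    doc_referents.append(None)
                else:
                    doc_referents.append((self.gazetteer_name, identifier))
            results.append(doc_referents)
        return results


resolver = LookupResolver("toy_gazetteer", {"zurich": "id-001"})
referents = resolver.predict(
    ["flights from Zurich to New York"], [[(13, 19), (23, 31)]]
)
```

Unresolved references keep their slot as None, so the output never shifts out of alignment with the input.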

Recognizers

class SpacyRecognizer(model_name: str = 'en_core_web_sm', entity_types: List[str] = ['FAC', 'GPE', 'LOC'])

Bases: Recognizer

A recognition module that uses spaCy to identify references in document text.

This module identifies location-based named entities like GPE (geopolitical entity), LOC (location), and FAC (facility) as potential references.

NAME: str = 'SpacyRecognizer'

fit(texts: List[str], references: List[List[Tuple[int, int]]], output_path: str | Path, epochs: int = 10, batch_size: int = 8, dropout: float = 0.1, learning_rate: float = 0.001) → None

Fine-tune the spaCy NER model using documents with reference annotations. This method gathers all references from the provided documents and uses them to create training examples for fine-tuning the underlying spaCy NER model.

Parameters:

texts – List of document text strings
references – List of lists of (start, end) position tuples
output_path – Directory path to save the fine-tuned model
epochs – Number of training epochs (default: 10)
batch_size – Training batch size (default: 8)
dropout – Dropout rate for training (default: 0.1)
learning_rate – Learning rate for training (default: 0.001)

Raises:

ValueError – If no training examples can be created from the provided documents
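How the method builds its training examples is not shown here. As a rough sketch, (start, end) annotations could be converted into the `(text, {"entities": [...]})` tuples commonly used for spaCy NER training data. The helper name `build_training_data` and the single `"LOC"` label are assumptions, not the library's implementation.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]
TrainingExample = Tuple[str, Dict[str, List[Tuple[int, int, str]]]]


def build_training_data(
    texts: List[str],
    references: List[List[Span]],
    label: str = "LOC",  # assumed label; the real method may choose differently
) -> List[TrainingExample]:
    """Turn span annotations into spaCy-style (text, annotations) tuples."""
    data: List[TrainingExample] = []
    for text, spans in zip(texts, references):
        if not spans:
            # Documents without annotations contribute no example in this sketch.
            continue
        entities = [(start, end, label) for start, end in spans]
        data.append((text, {"entities": entities}))
    if not data:
        # Mirrors the documented failure mode of fit().
        raise ValueError("No training examples could be created")
    return data


training_data = build_training_data(
    ["flights from Zurich to New York", "nothing annotated here"],
    [[(13, 19), (23, 31)], []],
)
```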

predict(texts: List[str]) → List[List[Tuple[int, int]] | None]

Identify references (location entities) in multiple document texts using spaCy.

Parameters:

texts – List of document text strings to process

Returns:

A list where each element corresponds to the document at the same index in the input list. Each element is either a list of (start, end) tuples giving the positions of references found in the document, or None if predictions are not available for that document.
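Callers need to handle both kinds of element in the returned list. The helper below is a small sketch of one way to consume it. The name `extract_mentions` is hypothetical, and the code assumes (start, end) are character offsets with an exclusive end, which the text above does not state explicitly.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]


def extract_mentions(
    texts: List[str], predictions: List[Optional[List[Span]]]
) -> List[Optional[List[str]]]:
    """Map each predicted span back to the substring it covers."""
    mentions: List[Optional[List[str]]] = []
    for text, spans in zip(texts, predictions):
        if spans is None:
            # No prediction for this document: pass None through unchanged.
            mentions.append(None)
        else:
            mentions.append([text[start:end] for start, end in spans])
    return mentions


texts = ["flights from Zurich to New York", "unsupported document"]
predictions = [[(13, 19), (23, 31)], None]
mentions = extract_mentions(texts, predictions)
```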

Resolvers

class SentenceTransformerResolver(model_name: str = 'dguzh/geo-all-MiniLM-L6-v2', gazetteer_name: str = 'geonames', min_similarity: float = 0.6, max_tiers: int = 3, attribute_map: dict = None)

Bases: Resolver

A resolver that uses SentenceTransformer to map reference contexts to gazetteer candidates.

This resolver extracts contextual information around each reference, generates embeddings for the context, retrieves candidate features from the gazetteer, generates location descriptions and embeddings for candidates, and finds the best match using cosine similarity.
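The final matching step can be sketched in plain Python. The vectors below are toy values standing in for real embeddings, the function names are hypothetical, and rejecting the best candidate when it falls below `min_similarity` is an assumption about how the threshold is applied.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(
    context_vector: List[float],
    candidate_vectors: Dict[str, List[float]],
    min_similarity: float = 0.6,
) -> Tuple[Optional[str], float]:
    """Return (identifier, score) of the most similar candidate.

    The identifier is None when no candidate reaches min_similarity.
    """
    best_id, best_score = None, -1.0
    for identifier, vector in candidate_vectors.items():
        score = cosine_similarity(context_vector, vector)
        if score > best_score:
            best_id, best_score = identifier, score
    if best_score < min_similarity:
        return None, best_score
    return best_id, best_score


# Toy 2-dimensional "embeddings" for one reference context and two candidates.
match = best_match([1.0, 0.0], {"a": [1.0, 0.1], "b": [0.0, 1.0]})
no_match = best_match([1.0, 0.0], {"b": [0.0, 1.0]})
```

Here candidate `a` points in almost the same direction as the context vector and is selected, while candidate `b` is orthogonal to it and is rejected by the threshold.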

GAZETTEER_ATTRIBUTE_MAP = {'geonames': {'level1': 'country_name', 'level2': 'admin1_name', 'level3': 'admin2_name', 'name': 'name', 'type': 'feature_name'}, 'swissnames3d': {'level1': 'KANTON_NAME', 'level2': 'BEZIRK_NAME', 'level3': 'GEMEINDE_NAME', 'name': 'NAME', 'type': 'OBJEKTART'}}
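One way such a map might be used is to build a textual description of a candidate regardless of which gazetteer it comes from. The sketch below copies the map from above. The `describe_candidate` helper, the description format, and the sample row are hypothetical, and the library's actual description format may differ.

```python
from typing import Dict, Optional

GAZETTEER_ATTRIBUTE_MAP = {
    "geonames": {
        "level1": "country_name",
        "level2": "admin1_name",
        "level3": "admin2_name",
        "name": "name",
        "type": "feature_name",
    },
    "swissnames3d": {
        "level1": "KANTON_NAME",
        "level2": "BEZIRK_NAME",
        "level3": "GEMEINDE_NAME",
        "name": "NAME",
        "type": "OBJEKTART",
    },
}


def describe_candidate(candidate: Dict[str, Optional[str]], gazetteer_name: str) -> str:
    """Build a description such as 'Name (type) in level3, level2, level1'."""
    attrs = GAZETTEER_ATTRIBUTE_MAP[gazetteer_name]
    description = candidate[attrs["name"]]
    feature_type = candidate.get(attrs["type"])
    if feature_type:
        description += f" ({feature_type})"
    # Most specific administrative level first; missing levels are skipped.
    parents = [candidate.get(attrs[level]) for level in ("level3", "level2", "level1")]
    parents = [p for p in parents if p]
    if parents:
        description += " in " + ", ".join(parents)
    return description


# Illustrative row using the geonames column names from the map above.
row = {
    "name": "Bern",
    "feature_name": "capital",
    "admin1_name": "Bern",
    "country_name": "Switzerland",
}
description = describe_candidate(row, "geonames")
```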

NAME: str = 'SentenceTransformerResolver'

fit(texts: List[str], references: List[List[Tuple[int, int]]], referents: List[List[Tuple[str, str]]], output_path: str | Path, epochs: int = 1, batch_size: int = 8, learning_rate: float = 2e-05, warmup_ratio: float = 0.1, save_strategy: str = 'epoch') → None

Fine-tune the SentenceTransformer model using references and their resolved referents as training data.

This method gathers all references that have been resolved (i.e., have referents), extracts their contexts and all candidate descriptions, and uses them to create positive and negative training examples for fine-tuning the underlying SentenceTransformer model using ContrastiveLoss.

Parameters:

texts – List of document text strings
references – List of lists of (start, end) position tuples
referents – List of lists of (gazetteer_name, identifier) tuples
output_path – Directory path to save the fine-tuned model
epochs – Number of training epochs (default: 1)
batch_size – Training batch size (default: 8)
learning_rate – Learning rate for training (default: 2e-5)
warmup_ratio – Warmup ratio for learning rate scheduler (default: 0.1)
save_strategy – When to save the model during training (default: 'epoch')

Raises:

ValueError – If no training examples can be created from the provided documents
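The positive and negative examples described above can be sketched as labeled text pairs, which is the input shape a contrastive loss over sentence pairs generally expects (label 1 for a matching pair, 0 otherwise). The helper name, the context string, and the candidate identifiers and descriptions below are hypothetical; this is not the library's implementation.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str, float]


def build_contrastive_pairs(
    context: str,
    candidate_descriptions: Dict[str, str],
    correct_identifier: str,
) -> List[Pair]:
    """Pair one reference context with every candidate description.

    The correct candidate becomes the positive pair (label 1.0);
    all other candidates become negatives (label 0.0).
    """
    pairs: List[Pair] = []
    for identifier, description in candidate_descriptions.items():
        label = 1.0 if identifier == correct_identifier else 0.0
        pairs.append((context, description, label))
    return pairs


pairs = build_contrastive_pairs(
    context="flights from Zurich to New York",
    candidate_descriptions={
        "id-001": "Zurich (city) in Switzerland",
        "id-002": "Zurich (village) in Netherlands",
    },
    correct_identifier="id-001",
)
```

Each annotated reference yields exactly one positive pair and as many negatives as there are competing candidates.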

predict(texts: List[str], references: List[List[Tuple[int, int]]]) → List[List[Tuple[str, str] | None]]

Predict referents for multiple references using iterative candidate generation.

Uses a search strategy that starts with restrictive search methods and progressively expands to less restrictive ones, stopping when candidates with sufficient similarity are found.

Parameters:

texts – List of document text strings
references – List of lists of tuples containing (start, end) positions of references

Returns:

A list of lists where each inner list corresponds to referents for references in one document. Each element is either a tuple (gazetteer_name, identifier) for a successfully resolved reference, or None if prediction is not available for that specific reference.
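The progressive search strategy can be sketched as a loop over tiers that stops at the first tier producing a sufficiently similar candidate. The names `tiered_search`, `search_tiers`, and `score`, as well as the toy scores, are hypothetical. Returning None when no tier reaches `min_similarity` is an assumption about the resolver's behavior.

```python
from typing import Callable, Dict, List, Optional


def tiered_search(
    search_tiers: List[Callable[[], List[str]]],
    score: Callable[[str], float],
    min_similarity: float = 0.6,
    max_tiers: int = 3,
) -> Optional[str]:
    """Try search methods from most to least restrictive.

    Stops at the first tier whose best candidate reaches min_similarity.
    """
    for tier in search_tiers[:max_tiers]:
        candidates = tier()
        if not candidates:
            continue  # this tier found nothing; widen the search
        best = max(candidates, key=score)
        if score(best) >= min_similarity:
            return best
    return None


# Toy setup: record which tiers were actually queried.
queried: List[str] = []
toy_scores: Dict[str, float] = {"a": 0.3, "b": 0.7, "c": 0.9}


def exact_tier() -> List[str]:
    queried.append("exact")
    return []


def partial_tier() -> List[str]:
    queried.append("partial")
    return ["a", "b"]


def fuzzy_tier() -> List[str]:
    queried.append("fuzzy")
    return ["c"]


result = tiered_search([exact_tier, partial_tier, fuzzy_tier], toy_scores.__getitem__)
```

In this toy run the second tier already yields a candidate above the threshold, so the third and least restrictive tier is never queried, even though it would have produced a higher score.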