Modules
Base Classes
- class Recognizer(**kwargs)
Bases: Module
Abstract class for modules that perform reference recognition.
These modules identify potential references (place names) in text. They are completely database-agnostic and operate only on raw text data.
- NAME: str = None
- abstract predict(texts: List[str]) → List[List[Tuple[int, int]] | None]
Predict references in multiple document texts.
This abstract method must be implemented by child classes.
- Parameters:
texts – List of document text strings to process
- Returns:
A list where each element corresponds to one document at the same index in the input list. Each element is either a list of (start, end) tuples containing positions of references found in the document, or None to indicate that predictions are not available for that document (e.g., unsupported language, missing data, etc.).
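The contract above can be illustrated with a minimal sketch of a custom recognizer. The `Recognizer` class below is a simplified stand-in written for this example (the real base class derives from `Module` and accepts `**kwargs`), and `CapitalizedWordRecognizer` is a hypothetical subclass, not part of the library. Treating an empty text as "prediction not available" is an assumption made only to show the `None` case.

```python
import re
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple


class Recognizer(ABC):
    """Simplified stand-in for the documented base class (assumption)."""

    NAME: Optional[str] = None

    @abstractmethod
    def predict(self, texts: List[str]) -> List[Optional[List[Tuple[int, int]]]]:
        ...


class CapitalizedWordRecognizer(Recognizer):
    """Hypothetical recognizer: every capitalized word counts as a reference."""

    NAME = "CapitalizedWordRecognizer"
    _pattern = re.compile(r"\b[A-Z][a-z]+\b")

    def predict(self, texts):
        results = []
        for text in texts:
            if not text:
                # No usable input: report "prediction not available".
                results.append(None)
            else:
                # One (start, end) tuple per match; may be an empty list.
                results.append([m.span() for m in self._pattern.finditer(text)])
        return results


recognizer = CapitalizedWordRecognizer()
spans = recognizer.predict(["We flew from Geneva to Oslo.", ""])
print(spans)  # [[(0, 2), (13, 19), (23, 27)], None]
```

Note the difference between an empty list (the document was processed and contains no references) and `None` (no prediction could be made for that document).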
- class Resolver(**kwargs)
Bases: Module
Abstract class for modules that perform reference resolution.
These modules link recognized references to specific referents in a gazetteer. They are completely database-agnostic and operate only on raw text data.
- NAME: str = None
- abstract predict(texts: List[str], references: List[List[Tuple[int, int]]]) → List[List[Tuple[str, str] | None]]
Predict referents for multiple references across multiple documents.
This abstract method must be implemented by child classes.
- Parameters:
texts – List of document text strings
references – List of lists of tuples containing (start, end) positions of references. Each inner list corresponds to references in one document at the same index in texts.
- Returns:
A list of lists where each inner list corresponds to referents for references in one document. Each element at position [i][j] is the referent (or None) for the reference at position [i][j] in the input. Each element is either a tuple (gazetteer_name, identifier) for a successfully resolved reference, or None to indicate that prediction is not available for that specific reference (e.g., missing data, unsupported format, etc.). The gazetteer_name identifies which gazetteer the identifier refers to, and the identifier is the value used to identify the referent in that gazetteer.
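As a rough illustration of the `[i][j]` alignment described above, the following sketch implements a trivial resolver. The `Resolver` class is a simplified stand-in for the documented base class, `LookupResolver` is hypothetical, and the identifier in the lookup table is a made-up placeholder rather than a real gazetteer ID. The sketch also assumes that `end` offsets are exclusive.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]
Referent = Tuple[str, str]  # (gazetteer_name, identifier)


class Resolver(ABC):
    """Simplified stand-in for the documented base class (assumption)."""

    NAME: Optional[str] = None

    @abstractmethod
    def predict(
        self, texts: List[str], references: List[List[Span]]
    ) -> List[List[Optional[Referent]]]:
        ...


class LookupResolver(Resolver):
    """Hypothetical resolver: exact-match lookup of the surface form."""

    NAME = "LookupResolver"

    def __init__(self, table: Dict[str, Referent]):
        self.table = table

    def predict(self, texts, references):
        results = []
        for text, spans in zip(texts, references):
            # dict.get returns None for unknown names, which matches the
            # "prediction not available" convention of the interface.
            results.append([self.table.get(text[start:end]) for start, end in spans])
        return results


# Placeholder identifier, not a real gazetteer entry.
table = {"Geneva": ("geonames", "example-id-geneva")}
resolver = LookupResolver(table)
referents = resolver.predict(
    ["We flew from Geneva to Oslo."],
    [[(13, 19), (23, 27)]],
)
print(referents)  # [[('geonames', 'example-id-geneva'), None]]
```

The output has the same shape as `references`: one inner list per document and one entry per reference.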
Recognizers
- class SpacyRecognizer(model_name: str = 'en_core_web_sm', entity_types: List[str] = ['FAC', 'GPE', 'LOC'])
Bases: Recognizer
A recognition module that uses spaCy to identify references in document text.
This module identifies location-based named entities like GPE (geopolitical entity), LOC (location), and FAC (facility) as potential references.
- NAME: str = 'SpacyRecognizer'
- fit(texts: List[str], references: List[List[Tuple[int, int]]], output_path: str | Path, epochs: int = 10, batch_size: int = 8, dropout: float = 0.1, learning_rate: float = 0.001) → None
Fine-tune the spaCy NER model using documents with reference annotations. This method gathers all references from the provided documents and uses them to create training examples for fine-tuning the underlying spaCy NER model.
- Parameters:
texts – List of document text strings
references – List of lists of (start, end) position tuples
output_path – Directory path to save the fine-tuned model
epochs – Number of training epochs (default: 10)
batch_size – Training batch size (default: 8)
dropout – Dropout rate for training (default: 0.1)
learning_rate – Learning rate for training (default: 0.001)
- Raises:
ValueError – If no training examples can be created from the provided documents
- predict(texts: List[str]) → List[List[Tuple[int, int]] | None]
Identify references (location entities) in multiple document texts using spaCy.
- Parameters:
texts – List of document text strings to process
- Returns:
A list where each element corresponds to one document at the same index in the input list. Each element is either a list of (start, end) tuples containing positions of references found in the document, or None if predictions are not available for that document.
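To make the return format concrete, here is a small sketch of how a caller might turn the `(start, end)` positions back into strings while respecting the `None` case. The helper `surface_forms` is hypothetical, the predictions are hand-written stand-ins rather than real model output, and the sketch assumes character offsets with an exclusive `end`.

```python
from typing import List, Optional, Tuple


def surface_forms(
    texts: List[str],
    predictions: List[Optional[List[Tuple[int, int]]]],
) -> List[Optional[List[str]]]:
    """Slice each document with its predicted spans; keep None as None."""
    out = []
    for text, spans in zip(texts, predictions):
        if spans is None:
            # No prediction was available for this document.
            out.append(None)
        else:
            out.append([text[start:end] for start, end in spans])
    return out


texts = ["Zurich lies north of the Alps.", "A document that could not be processed."]
# Hand-written stand-in for the output of predict(texts).
predictions = [[(0, 6), (25, 29)], None]

names = surface_forms(texts, predictions)
print(names)  # [['Zurich', 'Alps'], None]
```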
Resolvers
- class SentenceTransformerResolver(model_name: str = 'dguzh/geo-all-MiniLM-L6-v2', gazetteer_name: str = 'geonames', min_similarity: float = 0.6, max_tiers: int = 3, attribute_map: dict = None)
Bases: Resolver
A resolver that uses SentenceTransformer to map reference contexts to gazetteer candidates.
This resolver extracts contextual information around each reference, generates embeddings for the context, retrieves candidate features from the gazetteer, generates location descriptions and embeddings for candidates, and finds the best match using cosine similarity.
- GAZETTEER_ATTRIBUTE_MAP = {'geonames': {'level1': 'country_name', 'level2': 'admin1_name', 'level3': 'admin2_name', 'name': 'name', 'type': 'feature_name'}, 'swissnames3d': {'level1': 'KANTON_NAME', 'level2': 'BEZIRK_NAME', 'level3': 'GEMEINDE_NAME', 'name': 'NAME', 'type': 'OBJEKTART'}}
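The attribute map suggests how a candidate record from either gazetteer could be turned into a textual location description. The sketch below shows one possible way to do this. The `describe` helper, the description format, and the sample row are all assumptions made for illustration; the format the resolver actually generates is not documented here.

```python
# Copied from the class attribute documented above.
GAZETTEER_ATTRIBUTE_MAP = {
    "geonames": {
        "level1": "country_name",
        "level2": "admin1_name",
        "level3": "admin2_name",
        "name": "name",
        "type": "feature_name",
    },
    "swissnames3d": {
        "level1": "KANTON_NAME",
        "level2": "BEZIRK_NAME",
        "level3": "GEMEINDE_NAME",
        "name": "NAME",
        "type": "OBJEKTART",
    },
}


def describe(candidate: dict, gazetteer_name: str) -> str:
    """Hypothetical helper: build 'name (type) in level3, level2, level1'."""
    attrs = GAZETTEER_ATTRIBUTE_MAP[gazetteer_name]
    name = candidate.get(attrs["name"])
    kind = candidate.get(attrs["type"])
    # Most specific administrative level first; skip missing values.
    levels = [candidate.get(attrs[key]) for key in ("level3", "level2", "level1")]
    levels = [value for value in levels if value]
    text = f"{name} ({kind})" if kind else str(name)
    if levels:
        text += " in " + ", ".join(levels)
    return text


# Made-up row using the geonames column names from the map.
row = {
    "name": "Bern",
    "feature_name": "capital",
    "country_name": "Switzerland",
    "admin1_name": "Bern",
    "admin2_name": None,
}
description = describe(row, "geonames")
print(description)  # Bern (capital) in Bern, Switzerland
```

Because both gazetteers are mapped onto the same generic keys (`name`, `type`, `level1` to `level3`), a single function can serve either source.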
- NAME: str = 'SentenceTransformerResolver'
- fit(texts: List[str], references: List[List[Tuple[int, int]]], referents: List[List[Tuple[str, str]]], output_path: str | Path, epochs: int = 1, batch_size: int = 8, learning_rate: float = 2e-05, warmup_ratio: float = 0.1, save_strategy: str = 'epoch') → None
Fine-tune the SentenceTransformer model using references and their resolved referents as training data.
This method gathers all references that have been resolved (i.e., have referents), extracts their contexts and all candidate descriptions, and uses them to create positive and negative training examples for fine-tuning the underlying SentenceTransformer model using ContrastiveLoss.
- Parameters:
texts – List of document text strings
references – List of lists of (start, end) position tuples
referents – List of lists of (gazetteer_name, identifier) tuples
output_path – Directory path to save the fine-tuned model
epochs – Number of training epochs (default: 1)
batch_size – Training batch size (default: 8)
learning_rate – Learning rate for training (default: 2e-5)
warmup_ratio – Warmup ratio for learning rate scheduler (default: 0.1)
save_strategy – When to save the model during training (default: 'epoch')
- Raises:
ValueError – If no training examples can be created from the provided documents
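The construction of positive and negative examples can be sketched as follows. This is a simplified illustration, not the library's implementation: `build_pairs` and the sample data are hypothetical, and the labels follow the usual contrastive convention of 1 for a matching pair and 0 for a non-matching pair.

```python
from typing import Dict, List, Tuple


def build_pairs(
    context: str,
    candidates: Dict[str, str],
    gold_id: str,
) -> List[Tuple[str, str, int]]:
    """Pair one reference context with every candidate description.

    The candidate whose identifier equals gold_id becomes the positive
    example (label 1); all other candidates become negatives (label 0).
    """
    if gold_id not in candidates:
        # Without the correct referent among the candidates there is no
        # positive example, so this reference contributes nothing.
        return []
    return [
        (context, description, 1 if identifier == gold_id else 0)
        for identifier, description in candidates.items()
    ]


# Made-up context, identifiers, and descriptions.
context = "The train from Paris arrived late."
candidates = {
    "id-1": "Paris (capital) in France",
    "id-2": "Paris (city) in Texas, United States",
}
pairs = build_pairs(context, candidates, gold_id="id-1")
for pair in pairs:
    print(pair)
```

If every reference yields an empty list in this way, no training data exists, which corresponds to the `ValueError` case listed above.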
- predict(texts: List[str], references: List[List[Tuple[int, int]]]) → List[List[Tuple[str, str] | None]]
Predict referents for multiple references using iterative candidate generation.
Uses a search strategy that starts with restrictive search methods and progressively expands to less restrictive ones, stopping when candidates with sufficient similarity are found.
- Parameters:
texts – List of document text strings
references – List of lists of tuples containing (start, end) positions of references
- Returns:
A list of lists where each inner list corresponds to referents for references in one document. Each element is either a tuple (gazetteer_name, identifier) for a successfully resolved reference, or None if prediction is not available for that specific reference.
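The progressive search strategy might look roughly like the sketch below, which uses plain Python lists in place of real embeddings. Several points are assumptions: the function `resolve` and the tier callables are hypothetical, `max_tiers` is interpreted as the maximum number of search tiers to try, and returning the best candidate seen so far when no tier reaches `min_similarity` is a choice made for this example only.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]
Candidate = Tuple[str, Vector]  # (identifier, embedding)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def resolve(
    context_vec: Vector,
    tiers: List[Callable[[], List[Candidate]]],
    min_similarity: float = 0.6,
    max_tiers: int = 3,
) -> Optional[str]:
    """Try search tiers from most to least restrictive; stop early on a good match."""
    best_id, best_score = None, -1.0
    for search in tiers[:max_tiers]:
        for identifier, vec in search():
            score = cosine(context_vec, vec)
            if score > best_score:
                best_id, best_score = identifier, score
        if best_score >= min_similarity:
            break  # sufficient similarity found, skip broader tiers
    return best_id


calls = []  # records which tiers were actually queried


def exact():
    calls.append("exact")
    return [("a", [0.0, 1.0])]  # orthogonal to the context: similarity 0


def fuzzy():
    calls.append("fuzzy")
    return [("b", [1.0, 0.1])]  # nearly parallel: similarity above 0.6


def broad():
    calls.append("broad")
    return [("c", [1.0, 0.0])]


result = resolve([1.0, 0.0], [exact, fuzzy, broad])
print(result, calls)  # b ['exact', 'fuzzy']
```

In this run the third tier is never queried, because the second tier already produced a candidate above the threshold.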