Database
Classes and functions to interact with databases of publications.
Models
Abstract description of GisMap DB interface.
- class gismap.sources.models.Author(name: str)
Base class for authors in the database system.
Authors are identified primarily by their name and may have database-specific subclasses with additional attributes like keys and aliases.
- Parameters:
name (str) – The author’s name.
- property fingerprint
A normalized version of the author’s name for matching purposes.
- Returns:
The fingerprint of the author’s name.
- Return type:
- to_dict()
JSON-serializable representation of the author.
Includes name, plus key, aliases, url, and db_name when defined on the subclass. Source-aggregating subclasses (e.g. SourcedAuthor) override this to expose their underlying sources.
- Return type:
- class gismap.sources.models.DB
Abstract base class for database backends.
Provides the interface for searching authors and retrieving publications. Subclasses must implement search_author() and from_author().
- classmethod from_author(a)
Retrieve publications for a given author.
- Parameters:
a (Author) – The author whose publications to retrieve.
- Returns:
List of Publication objects.
- Return type:
- class gismap.sources.models.Publication(title: str, authors: list, venue: str, type: str, year: int)
Base class for publications in the database system.
Publications contain metadata about academic papers including title, authors, venue, type, and publication year.
- Parameters:
- property fingerprint
A normalized version of the publication’s title (combined with its authors’ name fingerprints, as the example shows) for matching purposes.
- Returns:
The fingerprint of the publication’s title.
- Return type:
Examples
>>> pub = Publication(title="A Studÿ: on Foo!!",
... authors=[Author(name="John Döe"), Author(name="Jáne Smith")], venue="", type="", year=2020)
>>> pub.fingerprint
'a study on foo---doe john+++jane smith'
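The documented output above hints at how such a fingerprint could be assembled. The following is a minimal sketch, not GisMap's actual implementation: the helper names (`normalize`, `author_fingerprint`, `publication_fingerprint`) are hypothetical, and sorting the tokens inside each name and the authors themselves is an assumption inferred from that single example.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents, and collapse punctuation/whitespace runs."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks so that "ÿ" becomes "y", "ö" becomes "o", etc.
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Replace every run of non-alphanumeric characters with a single space.
    return " ".join(re.sub(r"[^a-z0-9]+", " ", unaccented.lower()).split())


def author_fingerprint(name):
    """Assumed: name tokens are sorted, so 'John Doe' and 'Doe John' match."""
    return " ".join(sorted(normalize(name).split()))


def publication_fingerprint(title, author_names):
    """Assumed layout: '<title>---<author>+++<author>...'."""
    authors = "+++".join(sorted(author_fingerprint(n) for n in author_names))
    return f"{normalize(title)}---{authors}"


fp = publication_fingerprint("A Studÿ: on Foo!!", ["John Döe", "Jáne Smith"])
```

Under these assumptions, `fp` equals the fingerprint shown in the example, which makes entries with different accents, punctuation, or name order comparable.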
- to_dict()
JSON-serializable representation of the publication.
Authors are serialized via Author.to_dict(). key, url, metadata and db_name are included when defined on the subclass. Source-aggregating subclasses override this to expose their underlying sources.
- Return type:
- gismap.sources.models.db_class_to_auth_class(db_class)
Find the Author subclass associated with a given DB class.
- Parameters:
db_class (type) – A DB subclass (e.g., HAL, DBLP, LDB).
- Returns:
The corresponding Author subclass, or None if not found.
- Return type:
type or None
Examples
>>> from gismap.sources.hal import HAL
>>> db_class_to_auth_class(HAL)
<class 'gismap.sources.hal.HALAuthor'>
>>> class Alien: pass
>>> db_class_to_auth_class(Alien)
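One way such a lookup could work is to walk the subclasses of the author base class and return the first one that also inherits from the given DB class. The sketch below uses stand-in classes (`Author`, `DB`, `HAL`, `HALAuthor` are redefined locally, and `find_author_class` is a hypothetical name); it assumes that author classes inherit from both `Author` and their DB class, which may not match how GisMap actually resolves the association.

```python
class Author:
    """Stand-in for the author base class."""


class DB:
    """Stand-in for the database base class."""


class HAL(DB):
    db_name = "hal"


class HALAuthor(Author, HAL):
    """Author class tied to the HAL backend through inheritance."""


def find_author_class(db_class):
    """Return the Author subclass that also derives from db_class, or None."""
    stack = list(Author.__subclasses__())
    while stack:
        candidate = stack.pop()
        if issubclass(candidate, db_class):
            return candidate
        # Also explore deeper levels of the Author hierarchy.
        stack.extend(candidate.__subclasses__())
    return None


class Alien:
    pass


hal_author_class = find_author_class(HAL)
alien_author_class = find_author_class(Alien)
```

As in the documented example, an unrelated class such as `Alien` yields `None` rather than an error.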
- gismap.sources.models.format_authors(authors, transform=None)
Format a list of Author objects into a human-readable string.
- Parameters:
authors (list) – List of Author objects.
transform (callable, optional) – A function to apply to each Author for display purposes (e.g., extracting a name). If None, the default string representation of Author is used.
- Returns:
A human-readable string representing the formatted authors.
- Return type:
LDB (Local DBLP)
Interface for dblp computer science bibliography (https://dblp.org/) using a local copy of the database.
- class gismap.sources.ldb.LDB
Browse DBLP from a local copy of the database.
LDB is a class-only database - it should not be instantiated. All methods are classmethods and state is stored in class variables.
Examples
Public DB methods ensure that the DB is loaded but if you need to use a specific LDB method, prepare the DB first.
>>> LDB._ensure_loaded()
>>> LDB.author_by_key("66/2077")
LDBAuthor(name='Fabien Mathieu', key='66/2077')
>>> pubs = sorted(LDB.author_publications('66/2077'), key = lambda p: p.year)
>>> pub = pubs[0]
>>> pub.metadata
{'url': 'http://www2003.org/cdrom/papers/poster/p102/p102-mathieu.htm', 'streams': ['conf/www']}
>>> christelle = LDB.search_author("Christelle Caillouet")
>>> christelle
[LDBAuthor(name='Christelle Caillouet', key='10/8725')]
>>> christelle[0].aliases
['Christelle Molle']
>>> LDB.db_info()
{'tag': 'v0...', 'downloaded_at': '...', 'size': ..., 'path': ...}
>>> LDB.check_update()
>>> ldb = LDB()
Traceback (most recent call last):
...
TypeError: LDB should not be instantiated. Use class methods directly, e.g., LDB.search_author(name)
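The class-only design described above (no instances, state kept in class variables, lazy loading before use) can be illustrated with a small stand-alone sketch. `ClassOnlyDB`, its `_records` attribute, and the hard-coded record are hypothetical and only mimic the pattern; they are not the LDB internals.

```python
class ClassOnlyDB:
    """Toy class-only database: all state lives on the class itself."""

    _records = None  # Filled lazily by _ensure_loaded().

    def __new__(cls, *args, **kwargs):
        # Refuse instantiation, mirroring the TypeError shown for LDB().
        raise TypeError(
            f"{cls.__name__} should not be instantiated. "
            "Use class methods directly."
        )

    @classmethod
    def _ensure_loaded(cls):
        """Load the data once; later calls are no-ops."""
        if cls._records is None:
            # A real backend would read from disk here.
            cls._records = {"66/2077": "Fabien Mathieu"}

    @classmethod
    def author_by_key(cls, key):
        """Public entry point: guarantees the data is loaded first."""
        cls._ensure_loaded()
        return cls._records[key]


name = ClassOnlyDB.author_by_key("66/2077")
```

Having every public classmethod call `_ensure_loaded()` first is what lets callers skip any explicit setup step.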
- classmethod build_db(limit=None)
Build the LDB database from a DBLP TTL dump.
Parses the DBLP RDF/TTL file to extract publications and authors, stores them in compressed ZList structures, and builds a fuzzy search engine for author name lookups.
- Parameters:
limit (int, optional) – Maximum number of publications to process. If None, processes the entire database. Useful for testing with a subset.
Notes
This method populates the class-level attributes:
authors: ZList of (key, name, publication_indices) tuples
publis: ZList of publication records
keys: dict mapping author keys to indices
search_engine: fuzzy search Process for author lookups
After building, call dump_db() to persist the database.
Examples
Build from the default DBLP source:
>>> LDB.build_db()
>>> LDB.dump_db()
Build a small test database:
>>> LDB.build_db(limit=1000)
>>> LDB.authors[0]
('78/459-1', ['Manish Singh'], [0])
Save your build in a non-default file:
>>> from tempfile import TemporaryDirectory
>>> from pathlib import Path
>>> with TemporaryDirectory() as tmpdirname:
...     LDB.dump(filename="test.zst", path=tmpdirname)
...     [file.name for file in Path(tmpdirname).glob("*")]
['test.zst']
In case you don’t like your build and want to reload your local database from disk:
>>> LDB.load_db()
- classmethod check_update() -> dict | None
Check if a newer version is available on GitHub.
- Returns:
Dictionary with update info if available, None if up to date.
- Return type:
dict or None
- classmethod db_info() -> dict | None
Return installed version info.
- Returns:
Dictionary with tag, date, size, path; or None if not installed.
- Return type:
dict or None
- classmethod dump(filename: str, path='.', overwrite=False, include_search=True)
Save class state to file.
- classmethod from_author(a)
Retrieve publications for a given author.
- Parameters:
a (Author) – The author whose publications to retrieve.
- Returns:
List of Publication objects.
- Return type:
- classmethod load(filename: str, path='.', restore_search=False)
Load class state from file.
- classmethod retrieve(version: str | None = None, force: bool = False)
Download LDB database from GitHub releases.
- Parameters:
Examples
The following will get you an LDB if you do not have one.
>>> LDB.retrieve()  # Latest compatible release
>>> LDB.retrieve("v0.5.0")  # Specific version
>>> LDB.retrieve("0.5.0")  # Also works without 'v' prefix
Of course, the tag/version must be LDB-ready.
>>> LDB.retrieve("v0.3.0")  # Too old for LDB
Traceback (most recent call last):
...
RuntimeError: Asset 'ldb.pkl.zst' not found in release v0.3.0. Available assets: []
- Raises:
RuntimeError – If release or asset not found, download fails, or version is incompatible.
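To make the version handling and the error case concrete, here is a hedged sketch of how a release asset might be located in the JSON returned by the GitHub releases API. The fields `tag_name`, `assets`, `name`, and `browser_download_url` are part of that API; the function names, the sample data, and the "first release that carries the asset" rule for the default case are assumptions, not the actual `retrieve()` logic.

```python
def normalize_tag(version):
    """Accept both '0.5.0' and 'v0.5.0'."""
    return version if version.startswith("v") else f"v{version}"


def find_asset_url(releases, version=None, asset_name="ldb.pkl.zst"):
    """Return the download URL of asset_name, or raise RuntimeError."""
    if version is None:
        # Assumption: releases are listed newest first.
        candidates = releases
    else:
        tag = normalize_tag(version)
        candidates = [r for r in releases if r["tag_name"] == tag]
        if not candidates:
            raise RuntimeError(f"Release {tag} not found.")
    for release in candidates:
        assets = release.get("assets", [])
        for asset in assets:
            if asset["name"] == asset_name:
                return asset["browser_download_url"]
        if version is not None:
            names = [a["name"] for a in assets]
            raise RuntimeError(
                f"Asset {asset_name!r} not found in release "
                f"{release['tag_name']}. Available assets: {names}"
            )
    raise RuntimeError(f"No release provides {asset_name!r}.")


# Hypothetical, already-fetched API payload (no network access needed).
sample_releases = [
    {"tag_name": "v0.5.0", "assets": [
        {"name": "ldb.pkl.zst",
         "browser_download_url": "https://example.invalid/ldb.pkl.zst"}]},
    {"tag_name": "v0.3.0", "assets": []},
]
url = find_asset_url(sample_releases, "0.5.0")
```

Keeping the lookup separate from the download makes the failure modes (unknown tag, missing asset) easy to test without touching the network.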
- class gismap.sources.ldb.LDBAuthor(name: str, key: str, aliases: list = <factory>)
Author from the LDB (Local DBLP) database.
LDB provides local access to DBLP data without rate limiting.
- class gismap.sources.ldb.LDBPublication(authors: list, title: str, venue: str, type: str, year: int, key: str, metadata: dict = <factory>)
Publication from the LDB (Local DBLP) database.
- gismap.sources.ldb.LDB_PARAMETERS = Data(search=Data(limit=3, cutoff=87.0, slack=1.0), bof=Data(n_range=2, length_impact=0.1), frame_size=Data(authors=512, publis=256), io=Data(source='https://dblp.org/rdf/dblp.ttl.gz', destination=PosixPath('/home/runner/.local/share/gismap/py3.12/ldb.pkl.zst'), gh_api='https://api.github.com/repos/balouf/gismap/releases'))
Global configuration parameters for the Local DBLP (LDB) pipeline.
Structure:
search:
limit: maximum number of candidates retrieved per query.
cutoff: minimal similarity score required to keep a candidate.
slack: tolerance around the cutoff for borderline matches.
bof (Bag-of-Factors):
n_range: max factor size (higher is better but more expensive).
length_impact: how to compare two inputs of different size.
frame_size:
authors: maximum number of authors kept in a single frame/batch.
publis: maximum number of publications kept in a single frame/batch.
io:
source: URL/file location of the DBLP RDF dump used as raw input.
destination: local path where the compressed preprocessed dataset is / will be stored.
gh_api: GitHub API endpoint used to fetch release information for the project.
LDB_PARAMETERS is a Data (RecursiveDict) instance, so nested fields can be accessed with attribute notation, e.g.:
LDB_PARAMETERS.search.limit
LDB_PARAMETERS.io.destination
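For readers unfamiliar with attribute-style access on nested dictionaries, the sketch below shows the general idea with a tiny stand-in class. `AttrDict` is hypothetical and is not the `Data` (RecursiveDict) class used by GisMap; the values are copied from the `LDB_PARAMETERS` signature shown above.

```python
class AttrDict(dict):
    """Minimal dict whose keys can also be read as attributes."""

    def __getattr__(self, name):
        try:
            value = self[name]
        except KeyError as exc:
            raise AttributeError(name) from exc
        # Wrap nested dicts so that chained access keeps working.
        return AttrDict(value) if isinstance(value, dict) else value


params = AttrDict(
    search={"limit": 3, "cutoff": 87.0, "slack": 1.0},
    bof={"n_range": 2, "length_impact": 0.1},
)

limit = params.search.limit          # Same value as params["search"]["limit"].
impact = params.bof.length_impact
```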
- gismap.sources.dblp_ttl.get_stream(source, chunk_size=65536)
- Parameters:
- Yields:
iterable – Chunk iterator that streams the content.
int – Source size (used later to compute ETA).
- gismap.sources.dblp_ttl.parse_block(dblp_block)
- Parameters:
dblp_block (str) – A DBLP publication, turtle format.
- Returns:
key (str) – DBLP key.
title (str) – Publication title.
type (str) – Type of publication.
authors (dict) – Publication authors (key -> name).
url (str or NoneType) – Publication URL.
stream (list or NoneType) – Publication streams (normalized journal/conf).
pages (str or NoneType) – Publication pages.
venue (str or NoneType) – Publication venue (conf/journal).
year (int) – Year of publication.
DBLP (online)
Interface for dblp computer science bibliography (https://dblp.org/).
- class gismap.sources.dblp.DBLP
- classmethod search_author(name, wait=True)
- Parameters:
- Returns:
Potential matches.
- Return type:
Examples
>>> fabien = DBLP.search_author("Fabien Mathieu")
>>> fabien
[DBLPAuthor(name='Fabien Mathieu', key='66/2077')]
>>> fabien[0].url
'https://dblp.org/pid/66/2077.html'
>>> manu = DBLP.search_author("Manuel Barragan")
>>> manu
[DBLPAuthor(name='Manuel Barragan', key='07/10587'), DBLPAuthor(name='Manuel Barragan', key='83/3865'), DBLPAuthor(name='Manuel Barragan', key='188/0198')]
>>> DBLP.search_author("NotaSearcherName", wait=False)
[]
- class gismap.sources.dblp.DBLPAuthor(name: str, key: str, aliases: list = <factory>)
Examples
>>> fabien = DBLPAuthor('Fabien Mathieu', key='66/2077')
>>> publications = sorted(fabien.get_publications(),
...                       key=lambda p: p.title)
>>> publications[0].url
'https://dblp.org/rec/conf/iptps/BoufkhadMMPV08.html'
>>> publications[-1]
DBLPPublication(title='Upper Bounds for Stabilization in Acyclic Preference-Based Systems.', authors=[DBLPAuthor(name='Fabien Mathieu', key='66/2077')], venue='SSS', type='conference', year=2007, key='conf/sss/Mathieu07')
HAL
Interface for Hyper Articles en Ligne (https://hal.science/).
- class gismap.sources.hal.HAL
- classmethod from_author(a, wait=True)
- Parameters:
- Returns:
Papers available in HAL.
- Return type:
Examples
>>> fabien = HAL.search_author("Fabien Mathieu")[0]
>>> publications = sorted(fabien.get_publications(), key=lambda p: p.title)
>>> publications[2]
HALPublication(title='Achievable Catalog Size in Peer-to-Peer Video-on-Demand Systems', authors=[HALAuthor(name='Yacine Boufkhad', key='yacine-boufkhad'), HALAuthor(name='Fabien Mathieu', key='fabien-mathieu'), HALAuthor(name='Fabien de Montgolfier', key='949013', key_type='pid'), HALAuthor(name='Diego Perino', key='Diego Perino', key_type='fullname'), HALAuthor(name='Laurent Viennot', key='laurentviennot')], venue='Proceedings of the 7th Internnational Workshop on Peer-to-Peer Systems (IPTPS)', type='conference', year=2008, key='471724')
>>> diego = publications[2].authors[3]
>>> diego
HALAuthor(name='Diego Perino', key='Diego Perino', key_type='fullname')
>>> len(diego.get_publications()) > 28
True
>>> publications[-7]
HALPublication(title='Upper bounds for stabilization in acyclic preference-based systems', authors=[HALAuthor(name='Fabien Mathieu', key='fabien-mathieu')], venue="SSS'07 - 9th international conference on Stabilization, Safety, and Security of Distributed Systems", type='conference', year=2007, key='668356')
Case of an author with multiple ids whose publications one wants to combine:
>>> maria = HAL.search_author('Maria Potop-Butucaru')
>>> maria
[HALAuthor(name='Maria Potop-Butucaru', key='841868', key_type='pid')]
>>> n_pubs = len(HAL.from_author(maria[0]))
>>> n_pubs > 200
True
>>> n_pubs == len(maria[0].get_publications())
True
Note: an error is raised if not enough data is provided.
>>> HAL.from_author(HALAuthor('Fabien Mathieu'))
Traceback (most recent call last):
...
ValueError: Fabien Mathieu must have a key for publications to be fetched.
- classmethod search_author(name, wait=True)
- Parameters:
- Returns:
Potential matches.
- Return type:
Examples
>>> fabien = HAL.search_author("Fabien Mathieu")
>>> fabien
[HALAuthor(name='Fabien Mathieu', key='fabien-mathieu')]
>>> fabien = fabien[0]
>>> fabien.url
'https://hal.science/search/index/?q=*&authIdHal_s=fabien-mathieu'
>>> HAL.search_author("Laurent Viennot")[0]
HALAuthor(name='Laurent Viennot', key='laurentviennot')
>>> HAL.search_author("NotaSearcherName")
[]
>>> HAL.search_author("Ana Busic")
[HALAuthor(name='Ana Busic', key='anabusic')]
>>> HAL.search_author("Potop-Butucaru Maria")
[HALAuthor(name='Potop-Butucaru Maria', key='841868', key_type='pid')]
>>> diego = HAL.search_author("Diego Perino")
>>> diego
[HALAuthor(name='Diego Perino', key='847558', key_type='pid'), HALAuthor(name='Diego Perino', key='978810', key_type='pid')]
>>> diego[1].url
'https://hal.science/search/index/?q=*&authIdPerson_i=978810'
- class gismap.sources.hal.HALAuthor(name: str, key: str | int = None, key_type: str = None, aliases: list = <factory>, _url: str = None, _img: str = None, _cv: bool = None)
Author from the HAL (Hyper Articles en Ligne) database.
HAL is a French open archive for scholarly publications.
Multi-source
Interface for handling multiple sources at once.
- class gismap.sources.multi.DiffResult(label_a: str, label_b: str, only_a: list, only_b: list)
Result of comparing publications between two sources.
- class gismap.sources.multi.DuplicateResult(label: str, groups: list)
Result of finding duplicate publications within a source.
- class gismap.sources.multi.SourcedAuthor(name: str, sources: list = <factory>)
An author aggregated from multiple database sources.
Combines author information from HAL, DBLP, and/or LDB into a single entity. The primary source (first in the sorted list) determines the author’s key.
- Parameters:
- diff_sources(a, b)
Compare publications between two sources.
- Parameters:
- Returns:
Publications found only in a and only in b.
- Return type:
Examples
>>> from gismap.lab import LabAuthor
>>> me = LabAuthor("Fabien Mathieu (hal:fabien-mathieu, ldb:66/2077)")
>>> diff = me.diff_sources(0, 1)
>>> diff
DiffResult(only_hal (0)=..., only_ldb (1)=...)
>>> isinstance(diff.only_a, list) and isinstance(diff.only_b, list)
True
- find_duplicates(a)
Find duplicate publications within a single source.
- Parameters:
a (int or str) – Source: index in self.sources or db_name to match.
- Returns:
Groups of publications that appear to be duplicates.
- Return type:
Examples
>>> from gismap.lab import LabAuthor
>>> me = LabAuthor("Fabien Mathieu (hal:fabien-mathieu, ldb:66/2077)")
>>> dups = me.find_duplicates("hal")
>>> dups
DuplicateResult(hal, ... groups)
- to_dict()
JSON-serializable representation of the author.
Includes name, plus key, aliases, url, and db_name when defined on the subclass. Source-aggregating subclasses (e.g. SourcedAuthor) override this to expose their underlying sources.
- Return type:
- class gismap.sources.multi.SourcedPublication(title: str, authors: list, venue: str, type: str, year: int, sources: list = <factory>)
A publication aggregated from multiple database sources.
Combines publication entries from HAL, DBLP, and/or LDB that refer to the same paper. The primary source determines the publication’s metadata.
- Parameters:
- gismap.sources.multi.regroup_authors(auth_dict, pub_dict)
Replace authors of publications with matching authors. Typical use: upgrade DB-specific authors to multisource authors.
Replacement is in place.
- gismap.sources.multi.regroup_publications(pub_dict, threshold=83, length_impact=0.05, n_range=5)
Groups copies of the same publication together.
- Parameters:
- Returns:
Unified publications.
- Return type:
Examples
>>> from gismap.sources.models import Publication
>>> from gismap.sources.hal import HALPublication
>>> from gismap.sources.ldb import LDBPublication
>>> publis = [HALPublication("The coolest paper", [], "WWW", "conference", 2004, "key1"),
...           HALPublication("The coolest paper?", [], "WWW journal", "journal", 2004, "key2"),
...           HALPublication("The coolest paper!", [], "unpublished", "report", 2003, "key3"),
...           LDBPublication(title="The hottest paper", authors=[], venue="J. WWW", type="journal", year=2004, key="key4"),
...           LDBPublication(title="The hottest paper?", authors=[], venue="CoRR", type="journal", year=2003, key="key5"),
...           Publication("The hottest paper!", [], "informal", "zoom meeting", 2002)]
>>> publis[-1].key = "key6"
>>> regroup_publications({p.key: p for p in publis})
{'key2': SourcedPublication(title='The coolest paper?', venue='WWW journal', type='journal', year=2004), 'key4': SourcedPublication(title='The hottest paper', venue='J. WWW', type='journal', year=2004)}
>>> regroup_publications({})  # should not fail on empty input
{}
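The grouping idea can be illustrated without GisMap installed. The sketch below clusters titles greedily with `difflib.SequenceMatcher` from the standard library, which is a swapped-in similarity measure: the real function relies on its own scoring (see `length_impact` and `n_range`), so the `threshold` scale here is not comparable to the documented default of 83. `similarity` and `group_titles` are hypothetical names.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity of two titles on a 0-100 scale."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_titles(titles, threshold=90):
    """Greedy grouping: join the first group whose first title is close enough."""
    groups = []
    for title in titles:
        for group in groups:
            if similarity(title, group[0]) >= threshold:
                group.append(title)
                break
        else:
            # No existing group matched: start a new one.
            groups.append([title])
    return groups


titles = [
    "The coolest paper", "The coolest paper?", "The coolest paper!",
    "The hottest paper", "The hottest paper?", "The hottest paper!",
]
groups = group_titles(titles)
```

With these inputs the six titles fall into two groups of three, mirroring the two entries returned in the example above. How one representative is then chosen per group is outside the scope of this sketch.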
- gismap.sources.multi.score_author_source(dbauthor)
Compute a quality score for an author source.
Higher scores indicate more reliable author identification. HAL idHal keys are preferred, followed by HAL pid, then DBLP/LDB.
- Parameters:
dbauthor (Author) – A database-specific author object.
- Returns:
Score value (higher is better).
- Return type:
Examples
>>> from gismap.sources.models import Author, DB
>>> from gismap.sources.hal import HALAuthor
>>> from gismap.sources.ldb import LDBAuthor
>>> class YADB(DB):
...     db_name = "YADB"
>>> class YAAuthor(Author, YADB):
...     pass
>>> authors = [HALAuthor("Titi", key="titi"), HALAuthor("Toto", key="1234"),
...            LDBAuthor("Tata", key="tata"), YAAuthor("John Doe"),
...            HALAuthor("Dolly", key_type="fullname")]
>>> sorted(authors, key=score_author_source, reverse=True)
[HALAuthor(name='Titi', key='titi'), HALAuthor(name='Toto', key='1234', key_type='pid'), LDBAuthor(name='Tata', key='tata'), YAAuthor(name='John Doe'), HALAuthor(name='Dolly', key_type='fullname')]
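A scoring function that reproduces the ordering in the example could look like the sketch below. The `SourceAuthor` dataclass and the numeric scores are hypothetical; only the relative order (HAL idHal, then HAL pid, then DBLP/LDB, then unknown sources, with HAL full-name keys last) is taken from the description and example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceAuthor:
    """Stand-in for a database-specific author."""
    name: str
    db_name: str
    key: Optional[str] = None
    key_type: Optional[str] = None


def score_source(author):
    """Higher is better; the actual values are arbitrary placeholders."""
    if author.db_name == "hal":
        if author.key_type == "fullname":
            return -1  # Name-only HAL entries are the least reliable.
        if author.key_type == "pid":
            return 2
        return 3       # idHal key.
    if author.db_name in ("dblp", "ldb"):
        return 1
    return 0           # Unknown backend.


authors = [
    SourceAuthor("Titi", "hal", key="titi"),
    SourceAuthor("Toto", "hal", key="1234", key_type="pid"),
    SourceAuthor("Tata", "ldb", key="tata"),
    SourceAuthor("John Doe", "YADB"),
    SourceAuthor("Dolly", "hal", key_type="fullname"),
]
ranked = [a.name for a in sorted(authors, key=score_source, reverse=True)]
```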
Manual entries
Manual data source for hand-crafted publications and external authors (Manual, Outsider, Informal).
- class gismap.sources.manual.Informal(title: str, authors: list, venue: str = 'Informal collaboration', type: str = 'unpublished', year: int = 2026, key: str = None, metadata: dict = <factory>)
A manually created publication not from any database.
- Parameters:
title (str) – Publication title.
authors (list) – List of author objects or name strings.
venue (str, default="Informal collaboration") – Publication venue.
type (str, default="unpublished") – Publication type.
year (int, optional) – Publication year. Defaults to current year.
key (str, optional) – Unique key. Auto-generated if not provided.
metadata (dict) – Extra metadata (e.g. {"url": "..."}).
- fit_authors(lab, **kwargs)
Resolve string author names to known lab/database authors in place.
- property url
Publication URL from metadata, if available.
- class gismap.sources.manual.Manual
Dummy database backend for manually created entries.
- classmethod from_author(a)
Retrieve publications for a given author.
- Parameters:
a (Author) – The author whose publications to retrieve.
- Returns:
List of Publication objects.
- Return type:
- class gismap.sources.manual.Outsider(name: str, key: str = None, aliases: list = <factory>)
An external author not found in any database.
Used when manually adding publications with authors that don’t exist in HAL, DBLP, or LDB.
- gismap.sources.manual.fit_names(lab, candidates, n_range=4, length_impact=0.05, threshold=80)
Resolve string names in a candidate list to known authors from a lab’s publications.
Each string entry in candidates is compared against all known authors (from the lab’s publications). If a match is found above threshold, the string is replaced in place by the matching author object. Otherwise, it is replaced by an Outsider.
- Parameters:
lab (LabMap) – Reference lab (must have publications populated).
candidates (list) – List of authors (strings or author objects). Modified in place.
n_range (int, default=4) – N-gram range for similarity computation.
length_impact (float, default=0.05) – Impact of length difference on similarity scores.
threshold (float, default=80) – Minimum similarity score to accept a match.
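The in-place resolution described above can be sketched with the standard library. This is an illustration under stated assumptions: `difflib.get_close_matches` stands in for GisMap's own similarity measure (so `cutoff` is on a 0-1 scale rather than the documented 0-100 `threshold`), the known authors are passed as a plain list instead of a `LabMap`, and `KnownAuthor`, `Outsider`, and `resolve_names` are local stand-ins.

```python
from dataclasses import dataclass
from difflib import get_close_matches


@dataclass
class KnownAuthor:
    name: str


@dataclass
class Outsider:
    name: str


def resolve_names(known_authors, candidates, cutoff=0.8):
    """Replace string entries of candidates in place; leave objects untouched."""
    by_name = {author.name.lower(): author for author in known_authors}
    for i, candidate in enumerate(candidates):
        if not isinstance(candidate, str):
            continue  # Already an author object.
        hits = get_close_matches(candidate.lower(), list(by_name), n=1, cutoff=cutoff)
        # Best match if close enough, otherwise fall back to an outsider.
        candidates[i] = by_name[hits[0]] if hits else Outsider(candidate)


known = [KnownAuthor("Fabien Mathieu")]
already_resolved = KnownAuthor("Laurent Viennot")
candidates = ["Fabien Matthieu", "Jane Nobody", already_resolved]
resolve_names(known, candidates)
```

After the call, the misspelled name points at the known author, the unknown name becomes an `Outsider`, and the pre-existing author object is left as it was.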
BibTeX export
BibTeX export for individual publications.
Lab-level aggregation lives in gismap.lab.labmap.LabMap.to_bib(); this module only handles per-publication formatting so it stays decoupled from the lab layer.
- gismap.sources.bibtex.BIBTEX_TYPES = {'book': 'book', 'chapter': 'incollection', 'conference': 'inproceedings', 'hdr': 'phdthesis', 'journal': 'article', 'report': 'techreport', 'software': 'software', 'thesis': 'phdthesis', 'unpublished': 'unpublished'}
Mapping from GisMap normalized publication types to BibTeX entry types. Unknown types fall back to misc.
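The fallback rule amounts to a dictionary lookup with a default. The mapping below is copied from the value shown above; the helper name `bibtex_entry_type` is hypothetical and only illustrates the lookup.

```python
BIBTEX_TYPES = {
    "book": "book",
    "chapter": "incollection",
    "conference": "inproceedings",
    "hdr": "phdthesis",
    "journal": "article",
    "report": "techreport",
    "software": "software",
    "thesis": "phdthesis",
    "unpublished": "unpublished",
}


def bibtex_entry_type(pub_type):
    """Map a normalized publication type to a BibTeX entry type."""
    # Unknown types fall back to "misc".
    return BIBTEX_TYPES.get(pub_type, "misc")


known_type = bibtex_entry_type("journal")
unknown_type = bibtex_entry_type("zoom meeting")
```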
- gismap.sources.bibtex.alternate_urls(pub)
URLs of the non-primary sources, in order. Empty for mono-source pubs.
- gismap.sources.bibtex.pub_to_bibtex(pub)
Render a single Publication as a BibTeX entry.
Empty fields are never emitted. Fields included when present:
title, author, year
journal / booktitle (depending on entry type)
url (primary source URL)
abstract (HAL only, when available)
pages, volume (DBLP)
note aggregating any pre-existing metadata['note'] and the URLs of non-primary sources for SourcedPublication (Also at: …).
Cite key is sanitize_cite_key() applied to pub.key, with a fingerprint-based fallback if the key is missing.
- Parameters:
pub (Publication) – Any publication subclass (HAL, DBLP, LDB, Informal, SourcedPublication).
- Returns:
BibTeX entry, terminated by } (no trailing newline).
- Return type:
Examples
>>> from gismap.sources.models import Author, Publication
>>> p = Publication(title="A Tale", authors=[Author(name="Alice Smith")],
...                 venue="Nature", type="journal", year=2024)
>>> p.key = "abc/123"
>>> print(pub_to_bibtex(p))
@article{abc_123,
  title = {A Tale},
  author = {Smith, Alice},
  year = {2024},
  journal = {Nature}
}
- gismap.sources.bibtex.sanitize_cite_key(key)
Coerce a publication key to a BibTeX-safe cite key.
Replaces every character outside [A-Za-z0-9:_\-.] with _. This handles DBLP keys like conf/iptps/Foo (slashes), HAL numeric keys, and UUID hex keys from manual publications.
Examples
>>> sanitize_cite_key("conf/iptps/BoufkhadMMPV08")
'conf_iptps_BoufkhadMMPV08'
>>> sanitize_cite_key("471724")
'471724'
>>> sanitize_cite_key("a b/c.d-e:f")
'a_b_c.d-e:f'
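The replacement rule stated above maps directly onto a single regular-expression substitution. The sketch below is a plausible stand-alone equivalent built from that description, named `sanitize_key_sketch` to make clear it is not the library function; it does not cover the fingerprint-based fallback for missing keys.

```python
import re

# Every character outside [A-Za-z0-9:_\-.] is considered unsafe.
_UNSAFE = re.compile(r"[^A-Za-z0-9:_\-.]")


def sanitize_key_sketch(key):
    """Replace each unsafe character with an underscore."""
    return _UNSAFE.sub("_", str(key))


dblp_key = sanitize_key_sketch("conf/iptps/BoufkhadMMPV08")
mixed_key = sanitize_key_sketch("a b/c.d-e:f")
```

Each unsafe character is replaced individually, so consecutive unsafe characters produce consecutive underscores rather than a single one.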