Database#
Classes and functions to interact with databases of publications.
Models#
Abstract description of GisMap DB interface.
- class gismap.sources.models.Author(name: str)[source]#
Base class for authors in the database system.
Authors are identified primarily by their name and may have database-specific subclasses with additional attributes like keys and aliases.
- Parameters:
name (str) – The author’s name.
- class gismap.sources.models.DB[source]#
Abstract base class for database backends.
Provides the interface for searching authors and retrieving publications. Subclasses must implement search_author() and from_author().
- classmethod from_author(a)[source]#
Retrieve publications for a given author.
- Parameters:
a (Author) – The author whose publications to retrieve.
- Returns:
List of Publication objects.
- Return type:
list
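The contract described above can be illustrated with a minimal sketch. The names ToyAuthor, ToyDB and InMemoryDB are hypothetical and are not part of gismap; the sketch only assumes that a backend exposes search_author() and from_author() as classmethods.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ToyAuthor:
    """Hypothetical stand-in for an author record."""
    name: str


class ToyDB(ABC):
    """Hypothetical abstract backend exposing the two required classmethods."""

    @classmethod
    @abstractmethod
    def search_author(cls, name):
        """Return a list of authors matching the name."""

    @classmethod
    @abstractmethod
    def from_author(cls, a):
        """Return the list of publications of author a."""


class InMemoryDB(ToyDB):
    """Toy backend backed by a plain dict (illustration only)."""

    _data = {"Ada Lovelace": ["Notes on the Analytical Engine"]}

    @classmethod
    def search_author(cls, name):
        # Case-insensitive exact match on the stored names.
        return [ToyAuthor(n) for n in cls._data if n.lower() == name.lower()]

    @classmethod
    def from_author(cls, a):
        # Return a copy so callers cannot mutate the stored list.
        return list(cls._data.get(a.name, []))


authors = InMemoryDB.search_author("ada lovelace")
titles = InMemoryDB.from_author(authors[0])
```

Because both methods are classmethods, the toy backend is used without creating an instance, which mirrors how the backends on this page are called.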
LDB (Local DBLP)#
Interface for dblp computer science bibliography (https://dblp.org/) using a local copy of the database.
- class gismap.sources.ldb.LDB[source]#
Browse DBLP from a local copy of the database.
LDB is a class-only database - it should not be instantiated. All methods are classmethods and state is stored in class variables.
Examples
Public DB methods ensure that the database is loaded. If you need to call an LDB-specific method directly, load the database first with _ensure_loaded().
>>> LDB._ensure_loaded()
>>> LDB.author_by_key("66/2077")
LDBAuthor(name='Fabien Mathieu', key='66/2077')
>>> pubs = sorted(LDB.author_publications('66/2077'), key=lambda p: p.year)
>>> pub = pubs[0]
>>> pub.metadata
{'url': 'http://www2003.org/cdrom/papers/poster/p102/p102-mathieu.htm', 'streams': ['conf/www']}
>>> christelle = LDB.search_author("Christelle Caillouet")
>>> christelle
[LDBAuthor(name='Christelle Caillouet', key='10/8725')]
>>> christelle[0].aliases
['Christelle Molle']
>>> LDB.db_info()
{'tag': 'v0...', 'downloaded_at': '...', 'size': ..., 'path': ...}
>>> LDB.check_update()
>>> ldb = LDB()
Traceback (most recent call last):
...
TypeError: LDB should not be instantiated. Use class methods directly, e.g., LDB.search_author(name)
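The class-only pattern described above (no instances, state in class variables, lazy loading) can be sketched as follows. ClassOnlyStore and name_by_key are hypothetical names chosen for illustration; this is not the LDB implementation.

```python
class ClassOnlyStore:
    """Hypothetical class-only store: state lives in class variables."""

    _records = None  # filled lazily on first access

    def __init__(self):
        # Instantiation is refused; callers use the classmethods directly.
        raise TypeError("ClassOnlyStore should not be instantiated.")

    @classmethod
    def _ensure_loaded(cls):
        # Load once; later calls are no-ops.
        if cls._records is None:
            cls._records = {"66/2077": "Fabien Mathieu"}

    @classmethod
    def name_by_key(cls, key):
        # Public methods make sure the data is loaded before using it.
        cls._ensure_loaded()
        return cls._records[key]


name = ClassOnlyStore.name_by_key("66/2077")
try:
    ClassOnlyStore()
    instantiated = True
except TypeError:
    instantiated = False
```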
- classmethod build_db(limit=None)[source]#
Build the LDB database from a DBLP TTL dump.
Parses the DBLP RDF/TTL file to extract publications and authors, stores them in compressed ZList structures, and builds a fuzzy search engine for author name lookups.
- Parameters:
limit (int, optional) – Maximum number of publications to process. If None, processes the entire database. Useful for testing with a subset.
Notes
This method populates the class-level attributes:
- authors: ZList of (key, name, publication_indices) tuples
- publis: ZList of publication records
- keys: dict mapping author keys to indices
- search_engine: fuzzy search Process for author lookups
After building, call dump_db() to persist the database.
Examples
Build from the default DBLP source:
>>> LDB.build_db()
>>> LDB.dump_db()
Build a small test database:
>>> LDB.build_db(limit=1000)
>>> LDB.authors[0]
('78/459-1', ['Manish Singh'], [0])
Save your build in a non-default file:
>>> from tempfile import TemporaryDirectory
>>> from pathlib import Path
>>> with TemporaryDirectory() as tmpdirname:
...     LDB.dump(filename="test.zst", path=tmpdirname)
...     [file.name for file in Path(tmpdirname).glob("*")]
['test.zst']
To discard your build and reload your local database from disk:
>>> LDB.load_db()
- classmethod check_update() → dict | None[source]#
Check if a newer version is available on GitHub.
- Returns:
Dictionary with update info if available, None if up to date.
- Return type:
dict or None
- classmethod db_info() → dict | None[source]#
Return installed version info.
- Returns:
Dictionary with tag, date, size, path; or None if not installed.
- Return type:
dict or None
- classmethod dump(filename: str, path='.', overwrite=False, include_search=True)[source]#
Save class state to file.
- classmethod from_author(a)[source]#
Retrieve publications for a given author.
- Parameters:
a (Author) – The author whose publications to retrieve.
- Returns:
List of Publication objects.
- Return type:
list
- classmethod load(filename: str, path='.', restore_search=False)[source]#
Load class state from file.
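The idea behind dump() and load() (persisting class-level state to a compressed file and restoring it later) can be sketched with the standard library. This sketch uses pickle with gzip instead of the zstd format used by LDB, and ToyState, its attributes and its overwrite check are assumptions made for illustration.

```python
import gzip
import pickle
from pathlib import Path
from tempfile import TemporaryDirectory


class ToyState:
    """Hypothetical class-level state holder (illustration only)."""

    authors = []
    publis = []

    @classmethod
    def dump(cls, filename, path=".", overwrite=False):
        target = Path(path) / filename
        # Assumed behavior: refuse to clobber an existing file by default.
        if target.exists() and not overwrite:
            raise FileExistsError(f"{target} exists; pass overwrite=True.")
        state = {"authors": cls.authors, "publis": cls.publis}
        with gzip.open(target, "wb") as f:
            pickle.dump(state, f)

    @classmethod
    def load(cls, filename, path="."):
        with gzip.open(Path(path) / filename, "rb") as f:
            state = pickle.load(f)
        cls.authors = state["authors"]
        cls.publis = state["publis"]


ToyState.authors = [("78/459-1", ["Manish Singh"], [0])]
ToyState.publis = ["a publication record"]
with TemporaryDirectory() as tmp:
    ToyState.dump("toy.pkl.gz", path=tmp)
    ToyState.authors, ToyState.publis = [], []  # wipe the in-memory state
    ToyState.load("toy.pkl.gz", path=tmp)
restored = ToyState.authors
```

As with any pickle-based format, such a file should only be loaded when it comes from a trusted source.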
- classmethod retrieve(version: str | None = None, force: bool = False)[source]#
Download LDB database from GitHub releases.
- Parameters:
Examples
The following will get you an LDB if you do not have one.
>>> LDB.retrieve()           # Latest compatible release
>>> LDB.retrieve("v0.5.0")   # Specific version
>>> LDB.retrieve("0.5.0")    # Also works without 'v' prefix
The requested tag/version must correspond to a release that ships an LDB asset.
>>> LDB.retrieve("v0.3.0")  # Too old for LDB
Traceback (most recent call last):
...
RuntimeError: Asset 'ldb.pkl.zst' not found in release v0.3.0. Available assets: []
- Raises:
RuntimeError – If release or asset not found, download fails, or version is incompatible.
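The examples show that a version is accepted with or without the 'v' prefix. One way such a normalization could be written is sketched below; normalize_tag is a hypothetical helper, not a gismap function, and the handling of None is an assumption.

```python
def normalize_tag(version):
    """Hypothetical helper: return a release tag with a leading 'v'."""
    if version is None:
        # No version requested: the caller would pick the latest compatible release.
        return None
    version = version.strip()
    return version if version.startswith("v") else f"v{version}"


tags = [normalize_tag(v) for v in ("v0.5.0", "0.5.0", None)]
```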
- class gismap.sources.ldb.LDBAuthor(name: str, key: str, aliases: list = <factory>)[source]#
Author from the LDB (Local DBLP) database.
LDB provides local access to DBLP data without rate limiting.
- class gismap.sources.ldb.LDBPublication(authors: list, title: str, venue: str, type: str, year: int, key: str, metadata: dict = <factory>)[source]#
Publication from the LDB (Local DBLP) database.
- gismap.sources.ldb.LDB_PARAMETERS = Data(search=Data(limit=3, cutoff=87.0, slack=1.0), bof=Data(n_range=2, length_impact=0.1), frame_size=Data(authors=512, publis=256), io=Data(source='https://dblp.org/rdf/dblp.ttl.gz', destination=PosixPath('/home/runner/.local/share/gismap/ldb.pkl.zst'), gh_api='https://api.github.com/repos/balouf/gismap/releases'))#
Global configuration parameters for the Local DBLP (LDB) pipeline.
Structure:
- search:
  - limit: maximum number of candidates retrieved per query.
  - cutoff: minimal similarity score required to keep a candidate.
  - slack: tolerance around the cutoff for borderline matches.
- bof (Bag-of-Factors):
  - n_range: max factor size (higher is better but more expensive).
  - length_impact: how to compare two inputs of different size.
- frame_size:
  - authors: maximum number of authors kept in a single frame/batch.
  - publis: maximum number of publications kept in a single frame/batch.
- io:
  - source: URL/file location of the DBLP RDF dump used as raw input.
  - destination: local path where the compressed preprocessed dataset is / will be stored.
  - gh_api: GitHub API endpoint used to fetch release information for the project.
LDB_PARAMETERS is a Data (RecursiveDict) instance, so nested fields can be accessed with attribute notation, e.g.:
LDB_PARAMETERS.search.limit
LDB_PARAMETERS.io.destination
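Attribute access on nested configuration can be illustrated with a small read-only sketch. AttrDict is a hypothetical class written for this example and is not the Data (RecursiveDict) class used by gismap; the values are copied from the LDB_PARAMETERS signature above.

```python
class AttrDict(dict):
    """Hypothetical dict subclass exposing keys as attributes, recursively."""

    def __getattr__(self, name):
        try:
            value = self[name]
        except KeyError:
            raise AttributeError(name) from None
        # Wrap nested dicts so attribute access keeps working at depth.
        return AttrDict(value) if isinstance(value, dict) else value


params = AttrDict(
    search={"limit": 3, "cutoff": 87.0, "slack": 1.0},
    bof={"n_range": 2, "length_impact": 0.1},
)
limit = params.search.limit
cutoff = params.search.cutoff
```

Nested dicts are wrapped in a copy on each access, so this sketch is suitable for reading values only.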
- gismap.sources.dblp_ttl.get_stream(source, chunk_size=65536)[source]#
- Parameters:
- Yields:
iterable – Chunk iterator that streams the content.
int – Source size (used later to compute ETA).
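Chunked streaming with a known total size, as described above, can be sketched on an in-memory file object. stream_chunks is a hypothetical helper that returns a (chunk iterator, size) pair; the actual get_stream() may expose these values differently and also handles remote sources.

```python
import io


def stream_chunks(fileobj, chunk_size=65536):
    """Hypothetical helper: return (chunk iterator, total size in bytes)."""
    # Measure the size by seeking to the end, then rewind.
    size = fileobj.seek(0, io.SEEK_END)
    fileobj.seek(0)

    def chunks():
        # Read fixed-size chunks until the source is exhausted.
        while True:
            chunk = fileobj.read(chunk_size)
            if not chunk:
                break
            yield chunk

    return chunks(), size


source = io.BytesIO(b"x" * 150_000)
iterator, size = stream_chunks(source)
lengths = [len(c) for c in iterator]
```

Knowing the size up front is what makes a progress estimate possible while the chunks are consumed.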
- gismap.sources.dblp_ttl.parse_block(dblp_block)[source]#
- Parameters:
dblp_block (str) – A DBLP publication, turtle format.
- Returns:
key (str) – DBLP key.
title (str) – Publication title.
type (str) – Type of publication.
authors (dict) – Publication authors (key -> name).
url (str or NoneType) – Publication URL.
stream (list or NoneType) – Publication streams (normalized journal/conf).
pages (str or NoneType) – Publication pages.
venue (str or NoneType) – Publication venue (conf/journal).
year (int) – Year of publication.
DBLP (online)#
Interface for dblp computer science bibliography (https://dblp.org/).
- class gismap.sources.dblp.DBLP[source]#
- classmethod search_author(name, wait=True)[source]#
- Parameters:
- Returns:
Potential matches.
- Return type:
Examples
>>> fabien = DBLP.search_author("Fabien Mathieu")
>>> fabien
[DBLPAuthor(name='Fabien Mathieu', key='66/2077')]
>>> fabien[0].url
'https://dblp.org/pid/66/2077.html'
>>> manu = DBLP.search_author("Manuel Barragan")
>>> manu
[DBLPAuthor(name='Manuel Barragan', key='07/10587'), DBLPAuthor(name='Manuel Barragan', key='83/3865'), DBLPAuthor(name='Manuel Barragan', key='188/0198')]
>>> DBLP.search_author("NotaSearcherName", wait=False)
[]
- class gismap.sources.dblp.DBLPAuthor(name: str, key: str, aliases: list = <factory>)[source]#
Examples
>>> fabien = DBLPAuthor('Fabien Mathieu', key='66/2077')
>>> publications = sorted(fabien.get_publications(),
...                 key=lambda p: p.title)
>>> publications[0].url
'https://dblp.org/rec/conf/iptps/BoufkhadMMPV08.html'
>>> publications[-1]
DBLPPublication(title='Upper Bounds for Stabilization in Acyclic Preference-Based Systems.', authors=[DBLPAuthor(name='Fabien Mathieu', key='66/2077')], venue='SSS', type='conference', year=2007, key='conf/sss/Mathieu07')
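The URLs in the examples above follow a regular pattern based on DBLP keys. The sketch below reproduces that pattern with two hypothetical helpers; it is inferred from the example outputs and is not taken from the gismap source.

```python
def dblp_author_url(key):
    """Hypothetical helper: author page URL from a DBLP person key."""
    return f"https://dblp.org/pid/{key}.html"


def dblp_publication_url(key):
    """Hypothetical helper: record page URL from a DBLP publication key."""
    return f"https://dblp.org/rec/{key}.html"


author_url = dblp_author_url("66/2077")
record_url = dblp_publication_url("conf/iptps/BoufkhadMMPV08")
```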
HAL#
Interface for Hyper Articles en Ligne (https://hal.science/).
- class gismap.sources.hal.HAL[source]#
- classmethod from_author(a, wait=True)[source]#
- Parameters:
- Returns:
Papers available in HAL.
- Return type:
Examples
>>> fabien = HAL.search_author("Fabien Mathieu")[0]
>>> publications = sorted(fabien.get_publications(), key=lambda p: p.title)
>>> publications[2]
HALPublication(title='Achievable Catalog Size in Peer-to-Peer Video-on-Demand Systems', authors=[HALAuthor(name='Yacine Boufkhad', key='yacine-boufkhad'), HALAuthor(name='Fabien Mathieu', key='fabien-mathieu'), HALAuthor(name='Fabien de Montgolfier', key='949013', key_type='pid'), HALAuthor(name='Diego Perino', key='Diego Perino', key_type='fullname'), HALAuthor(name='Laurent Viennot', key='laurentviennot')], venue='Proceedings of the 7th Internnational Workshop on Peer-to-Peer Systems (IPTPS)', type='conference', year=2008, key='471724')
>>> diego = publications[2].authors[3]
>>> diego
HALAuthor(name='Diego Perino', key='Diego Perino', key_type='fullname')
>>> len(diego.get_publications()) > 28
True
>>> publications[-7]
HALPublication(title='Upper bounds for stabilization in acyclic preference-based systems', authors=[HALAuthor(name='Fabien Mathieu', key='fabien-mathieu')], venue="SSS'07 - 9th international conference on Stabilization, Safety, and Security of Distributed Systems", type='conference', year=2007, key='668356')
Case of an author with multiple IDs whose publications should be combined:
>>> maria = HAL.search_author('Maria Potop-Butucaru')
>>> maria
[HALAuthor(name='Maria Potop-Butucaru', key='841868', key_type='pid')]
>>> n_pubs = len(HAL.from_author(maria[0]))
>>> n_pubs > 200
True
>>> n_pubs == len(maria[0].get_publications())
True
Note: an error is raised if the author has no key:
>>> HAL.from_author(HALAuthor('Fabien Mathieu'))
Traceback (most recent call last):
...
ValueError: HALAuthor(name='Fabien Mathieu') must have a key for publications to be fetched.
- classmethod search_author(name, wait=True)[source]#
- Parameters:
- Returns:
Potential matches.
- Return type:
Examples
>>> fabien = HAL.search_author("Fabien Mathieu")
>>> fabien
[HALAuthor(name='Fabien Mathieu', key='fabien-mathieu')]
>>> fabien = fabien[0]
>>> fabien.url
'https://hal.science/search/index/?q=*&authIdHal_s=fabien-mathieu'
>>> HAL.search_author("Laurent Viennot")[0]
HALAuthor(name='Laurent Viennot', key='laurentviennot')
>>> HAL.search_author("NotaSearcherName")
[]
>>> HAL.search_author("Ana Busic")
[HALAuthor(name='Ana Busic', key='anabusic')]
>>> HAL.search_author("Potop-Butucaru Maria")
[HALAuthor(name='Potop-Butucaru Maria', key='841868', key_type='pid')]
>>> diego = HAL.search_author("Diego Perino")
>>> diego
[HALAuthor(name='Diego Perino', key='847558', key_type='pid'), HALAuthor(name='Diego Perino', key='978810', key_type='pid')]
>>> diego[1].url
'https://hal.science/search/index/?q=*&authIdPerson_i=978810'
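In the examples above, the author URL depends on the kind of key: a default key uses the authIdHal_s field, while a 'pid' key uses authIdPerson_i. The sketch below reproduces that mapping with a hypothetical helper, hal_author_url. It is inferred from the example outputs only, and other key types such as 'fullname' are deliberately not covered.

```python
def hal_author_url(key, key_type=None):
    """Hypothetical helper: HAL search URL for an author key."""
    if key is None:
        raise ValueError("A key is required to build a HAL search URL.")
    if key_type is None:
        field = "authIdHal_s"      # default: HAL identifier string
    elif key_type == "pid":
        field = "authIdPerson_i"   # numeric person id
    else:
        # Other key types exist but their URL form is not shown in the examples.
        raise ValueError(f"Unsupported key_type in this sketch: {key_type!r}")
    return f"https://hal.science/search/index/?q=*&{field}={key}"


idhal_url = hal_author_url("fabien-mathieu")
pid_url = hal_author_url("978810", key_type="pid")
```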
- class gismap.sources.hal.HALAuthor(name: str, key: str | int = None, key_type: str = None, aliases: list = <factory>, _url: str = None, _img: str = None, _cv: bool = None)[source]#
Author from the HAL (Hyper Articles en Ligne) database.
HAL is a French open archive for scholarly publications.
Multi-source#
Interface for handling multiple sources at once.
- class gismap.sources.multi.SourcedAuthor(name: str, sources: list = <factory>)[source]#
An author aggregated from multiple database sources.
Combines author information from HAL, DBLP, and/or LDB into a single entity. The primary source (first in the sorted list) determines the author’s key.
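The rule that the primary source (first in the sorted list) determines the key can be sketched as follows. ToySource, ToySourcedAuthor and the PRIORITY ordering are assumptions made for illustration; the ordering actually used by gismap is not stated on this page.

```python
from dataclasses import dataclass, field

# Assumed priority: lower rank wins. The real ordering is not documented here.
PRIORITY = {"hal": 0, "ldb": 1, "dblp": 2}


@dataclass
class ToySource:
    """Hypothetical per-database author entry."""
    db: str
    key: str


@dataclass
class ToySourcedAuthor:
    """Hypothetical aggregated author."""
    name: str
    sources: list = field(default_factory=list)

    @property
    def key(self):
        # The first source after sorting by priority is the primary one.
        primary = sorted(self.sources, key=lambda s: PRIORITY[s.db])[0]
        return primary.key


author = ToySourcedAuthor(
    "Fabien Mathieu",
    [ToySource("dblp", "66/2077"), ToySource("hal", "fabien-mathieu")],
)
primary_key = author.key
```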
- class gismap.sources.multi.SourcedPublication(title: str, authors: list, venue: str, type: str, year: int, sources: list = <factory>)[source]#
A publication aggregated from multiple database sources.
Combines publication entries from HAL, DBLP, and/or LDB that refer to the same paper. The primary source determines the publication’s metadata.
- gismap.sources.multi.regroup_authors(auth_dict, pub_dict)[source]#
Replace authors of publications with matching authors. Typical use: upgrade DB-specific authors to multisource authors.
Replacement is in place.
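In-place replacement can be illustrated with plain dictionaries. The data layout below (publications as dicts with an "authors" list of keys, and auth_dict mapping a key to a richer author object) is an assumption for this sketch and may differ from the structures regroup_authors() expects.

```python
def regroup(auth_dict, pub_dict):
    """Hypothetical sketch: swap publication authors for matching entries, in place."""
    for pub in pub_dict.values():
        # Authors without a match in auth_dict are left unchanged.
        pub["authors"] = [auth_dict.get(a, a) for a in pub["authors"]]


auth_dict = {"66/2077": "SourcedAuthor<Fabien Mathieu>"}
pub_dict = {"p1": {"title": "A paper", "authors": ["66/2077", "unknown-key"]}}
before = pub_dict["p1"]
regroup(auth_dict, pub_dict)
after = pub_dict["p1"]
```

The function returns nothing: the same publication objects are mutated, so any other reference to them sees the upgraded authors.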
- gismap.sources.multi.regroup_publications(pub_dict, threshold=85, length_impact=0.05, n_range=5)[source]#
Puts together copies of the same publication.
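Grouping copies of the same publication by title similarity can be sketched with the standard library. This sketch uses difflib.SequenceMatcher on a 0 to 100 scale as a stand-in for the bag-of-factors similarity that the length_impact and n_range parameters suggest; group_titles and similarity are hypothetical helpers.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity score between 0 and 100."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_titles(titles, threshold=85):
    """Greedy grouping: a title joins the first group whose head is similar enough."""
    groups = []
    for title in titles:
        for group in groups:
            if similarity(title, group[0]) >= threshold:
                group.append(title)
                break
        else:
            # No existing group matched: start a new one.
            groups.append([title])
    return groups


titles = [
    "Upper Bounds for Stabilization in Acyclic Preference-Based Systems.",
    "Upper bounds for stabilization in acyclic preference-based systems",
    "Achievable Catalog Size in Peer-to-Peer Video-on-Demand Systems",
]
groups = group_titles(titles)
```

The first two titles are the DBLP and HAL spellings of the same paper from the examples on this page, so they end up in the same group.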