Database#

Classes and functions to interact with databases of publications.

Models#

Abstract description of GisMap DB interface.

class gismap.sources.models.Author(name: str)[source]#

Base class for authors in the database system.

Authors are identified primarily by their name and may have database-specific subclasses with additional attributes like keys and aliases.

Parameters:

name (str) – The author’s name.

class gismap.sources.models.DB[source]#

Abstract base class for database backends.

Provides the interface for searching authors and retrieving publications. Subclasses must implement search_author() and from_author().

db_name#

Identifier for the database backend (e.g., ‘hal’, ‘dblp’, ‘ldb’).

Type:

str

classmethod from_author(a)[source]#

Retrieve publications for a given author.

Parameters:

a (Author) – The author whose publications to retrieve.

Returns:

List of Publication objects.

Return type:

list

classmethod search_author(name)[source]#

Search for authors matching the given name.

Parameters:

name (str) – Name to search for.

Returns:

List of matching Author objects.

Return type:

list
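As an illustration of this interface, the sketch below defines a toy in-memory backend. The Author, Publication, and DB classes here are simplified stand-ins written for the example, not the ones from gismap.sources.models, and ToyDB and its records are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Author:
    """Stand-in for the base author class (an assumption, not gismap's)."""
    name: str


@dataclass
class Publication:
    """Stand-in for the base publication class."""
    title: str
    authors: list
    venue: str
    type: str
    year: int


class DB:
    """Stand-in abstract backend: subclasses provide both classmethods."""
    db_name = None

    @classmethod
    def search_author(cls, name):
        raise NotImplementedError

    @classmethod
    def from_author(cls, a):
        raise NotImplementedError


class ToyDB(DB):
    """Hypothetical backend serving a hard-coded list of records."""
    db_name = "toy"
    _records = [
        Publication("A toy paper", [Author("Ada Example")],
                    "ToyConf", "conference", 2020),
        Publication("Another toy paper",
                    [Author("Ada Example"), Author("Bob Sample")],
                    "Toy Journal", "journal", 2021),
    ]

    @classmethod
    def search_author(cls, name):
        # Case-insensitive exact match on the author name, without duplicates.
        found = {}
        for pub in cls._records:
            for author in pub.authors:
                if author.name.lower() == name.lower():
                    found[author.name] = author
        return list(found.values())

    @classmethod
    def from_author(cls, a):
        # Keep the publications that list an author with the same name.
        return [p for p in cls._records
                if any(x.name == a.name for x in p.authors)]


ada = ToyDB.search_author("ada example")[0]
titles = [p.title for p in ToyDB.from_author(ada)]
```

A real backend would query a remote service or a local index instead of a list, but the two classmethods would keep the same shape: a name in, a list of authors out; an author in, a list of publications out.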

class gismap.sources.models.Publication(title: str, authors: list, venue: str, type: str, year: int)[source]#

Base class for publications in the database system.

Publications contain metadata about academic papers including title, authors, venue, type, and publication year.

Parameters:
  • title (str) – The publication title.

  • authors (list) – List of Author objects.

  • venue (str) – Publication venue (journal, conference, etc.).

  • type (str) – Publication type (e.g., ‘journal’, ‘conference’, ‘book’).

  • year (int) – Year of publication.

gismap.sources.models.db_class_to_auth_class(db_class)[source]#

Find the Author subclass associated with a given DB class.

Parameters:

db_class (type) – A DB subclass (e.g., HAL, DBLP, LDB).

Returns:

The corresponding Author subclass, or None if not found.

Return type:

type or None

LDB (Local DBLP)#

Interface for the dblp computer science bibliography (https://dblp.org/), using a local copy of the database.

class gismap.sources.ldb.LDB[source]#

Browse DBLP from a local copy of the database.

LDB is a class-only database: it should not be instantiated. All methods are classmethods, and state is stored in class variables.

Examples

Public DB methods ensure that the DB is loaded, but if you need to call an LDB-specific method directly, prepare the DB first.

>>> LDB._ensure_loaded()
>>> LDB.author_by_key("66/2077")
LDBAuthor(name='Fabien Mathieu', key='66/2077')
>>> pubs = sorted(LDB.author_publications('66/2077'), key = lambda p: p.year)
>>> pub = pubs[0]
>>> pub.metadata
{'url': 'http://www2003.org/cdrom/papers/poster/p102/p102-mathieu.htm', 'streams': ['conf/www']}
>>> christelle = LDB.search_author("Christelle Caillouet")
>>> christelle
[LDBAuthor(name='Christelle Caillouet', key='10/8725')]
>>> christelle[0].aliases
['Christelle Molle']
>>> LDB.db_info()  
{'tag': 'v0...', 'downloaded_at': '...', 'size': ..., 'path': ...}
>>> LDB.check_update()
>>> ldb = LDB()
Traceback (most recent call last):
...
TypeError: LDB should not be instantiated. Use class methods directly, e.g., LDB.search_author(name)
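
One possible way to obtain a class-only design like the one described above is sketched below. It is not taken from the LDB source: the ClassOnly class, its name_by_key method, and the use of __new__ to block instantiation are assumptions made for illustration.

```python
class ClassOnly:
    """Hypothetical class-only container: state lives in class variables."""
    _loaded = False
    _data = None

    def __new__(cls, *args, **kwargs):
        # Refuse instantiation; users are expected to call classmethods.
        raise TypeError(f"{cls.__name__} should not be instantiated. "
                        "Use class methods directly.")

    @classmethod
    def _ensure_loaded(cls):
        # Lazily populate the class-level state on first use.
        if not cls._loaded:
            cls._data = {"66/2077": "Fabien Mathieu"}
            cls._loaded = True

    @classmethod
    def name_by_key(cls, key):
        # Public entry point: loads the data if needed, then answers.
        cls._ensure_loaded()
        return cls._data.get(key)


name = ClassOnly.name_by_key("66/2077")
try:
    ClassOnly()
    instantiated = True
except TypeError:
    instantiated = False
```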

classmethod build_db(limit=None)[source]#

Build the LDB database from a DBLP TTL dump.

Parses the DBLP RDF/TTL file to extract publications and authors, stores them in compressed ZList structures, and builds a fuzzy search engine for author name lookups.

Parameters:

limit (int, optional) – Maximum number of publications to process. If None, processes the entire database. Useful for testing with a subset.

Notes

This method populates the class-level attributes:

  • authors: ZList of (key, name, publication_indices) tuples

  • publis: ZList of publication records

  • keys: dict mapping author keys to indices

  • search_engine: fuzzy search Process for author lookups

After building, call dump_db() to persist the database.
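The relationship between these attributes can be sketched with plain Python containers. This is only an illustration: lists stand in for the compressed ZList structures, the second author and both publication records are invented, and the author_publications helper below is a hypothetical reimplementation, not the LDB method.

```python
# Plain lists stand in for the compressed ZList structures (an assumption).
authors = [
    ("78/459-1", ["Manish Singh"], [0]),
    ("00/0000", ["Ada Example"], [0, 1]),   # hypothetical entry
]
publis = [
    {"title": "First toy record", "year": 2019},   # hypothetical records
    {"title": "Second toy record", "year": 2021},
]
# keys maps each author key to its position in `authors`.
keys = {key: i for i, (key, _names, _pubs) in enumerate(authors)}


def author_publications(key):
    """Resolve an author key to publication records through the index."""
    _key, _names, pub_indices = authors[keys[key]]
    return [publis[i] for i in pub_indices]


titles = [p["title"] for p in author_publications("00/0000")]
```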

Examples

Build from the default DBLP source:

>>> LDB.build_db()  
>>> LDB.dump_db()   

Build a small test database:

>>> LDB.build_db(limit=1000)
>>> LDB.authors[0]
('78/459-1', ['Manish Singh'], [0])

Save your build in a non-default file:

>>> from tempfile import TemporaryDirectory
>>> from pathlib import Path
>>> with TemporaryDirectory() as tmpdirname:
...     LDB.dump(filename="test.zst", path=tmpdirname)
...     [file.name for file in Path(tmpdirname).glob("*")]
['test.zst']

In case you don’t like your build and want to reload your local database from disk:

>>> LDB.load_db()

classmethod check_update() → dict | None[source]#

Check if a newer version is available on GitHub.

Returns:

Dictionary with update info if available, None if up to date.

Return type:

dict or None

classmethod db_info() → dict | None[source]#

Return installed version info.

Returns:

Dictionary with keys tag, downloaded_at, size, and path, or None if not installed.

Return type:

dict or None

classmethod dump(filename: str, path='.', overwrite=False, include_search=True)[source]#

Save class state to file.

classmethod from_author(a)[source]#

Retrieve publications for a given author.

Parameters:

a (Author) – The author whose publications to retrieve.

Returns:

List of Publication objects.

Return type:

list

classmethod load(filename: str, path='.', restore_search=False)[source]#

Load class state from file.

classmethod retrieve(version: str | None = None, force: bool = False)[source]#

Download LDB database from GitHub releases.

Parameters:
  • version (str, optional) – Specific release version (e.g., “v0.4.0” or “0.4.0”). If None, downloads from latest release.

  • force (bool, default=False) – Download even if same version is installed.

Examples

The following will get you an LDB if you do not have one.

>>> LDB.retrieve()           # Latest compatible release  
>>> LDB.retrieve("v0.5.0")   # Specific version  
>>> LDB.retrieve("0.5.0")    # Also works without 'v' prefix  

Of course, the tag/version must be LDB-ready.

>>> LDB.retrieve("v0.3.0")   # Too old for LDB
Traceback (most recent call last):
...
RuntimeError: Asset 'ldb.pkl.zst' not found in release v0.3.0. Available assets: []

Raises:

RuntimeError – If release or asset not found, download fails, or version is incompatible.
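Accepting a version with or without the 'v' prefix can be handled by normalizing the tag before looking up the release. The helper below is a hypothetical sketch of that idea; normalize_tag is not part of the gismap API.

```python
def normalize_tag(version):
    """Return a release tag with a leading 'v' (hypothetical helper)."""
    if version is None:
        return None  # None is taken to mean "use the latest release"
    version = version.strip()
    return version if version.startswith("v") else f"v{version}"


tags = [normalize_tag(v) for v in ("v0.5.0", "0.5.0", None)]
```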

classmethod search_author(name)[source]#

Search for authors matching the given name.

Parameters:

name (str) – Name to search for.

Returns:

List of matching Author objects.

Return type:

list

class gismap.sources.ldb.LDBAuthor(name: str, key: str, aliases: list = <factory>)[source]#

Author from the LDB (Local DBLP) database.

LDB provides local access to DBLP data without rate limiting.

Parameters:
  • name (str) – The author’s name.

  • key (str) – DBLP person identifier (pid).

  • aliases (list) – Alternative names for the author.

class gismap.sources.ldb.LDBPublication(authors: list, title: str, venue: str, type: str, year: int, key: str, metadata: dict = <factory>)[source]#

Publication from the LDB (Local DBLP) database.

Parameters:
  • title (str) – Publication title.

  • authors (list) – List of LDBAuthor objects.

  • venue (str) – Publication venue.

  • type (str) – Publication type.

  • year (int) – Publication year.

  • key (str) – DBLP record key.

  • metadata (dict) – Additional metadata (URL, streams, pages).

gismap.sources.ldb.LDB_PARAMETERS = Data(search=Data(limit=3, cutoff=87.0, slack=1.0), bof=Data(n_range=2, length_impact=0.1), frame_size=Data(authors=512, publis=256), io=Data(source='https://dblp.org/rdf/dblp.ttl.gz', destination=PosixPath('/home/runner/.local/share/gismap/ldb.pkl.zst'), gh_api='https://api.github.com/repos/balouf/gismap/releases'))#

Global configuration parameters for the Local DBLP (LDB) pipeline.

Structure:

  • search:
    • limit: maximum number of candidates retrieved per query.

    • cutoff: minimal similarity score required to keep a candidate.

    • slack: tolerance around the cutoff for borderline matches.

  • bof (Bag-of-Factors):
    • n_range: max factor size (higher is better but more expensive).

    • length_impact: how to compare two inputs of different size.

  • frame_size:
    • authors: maximum number of authors kept in a single frame/batch.

    • publis: maximum number of publications kept in a single frame/batch.

  • io:
    • source: URL/file location of the DBLP RDF dump used as raw input.

    • destination: local path where the compressed preprocessed dataset is / will be stored.

    • gh_api: GitHub API endpoint used to fetch release information for the project.

LDB_PARAMETERS is a Data (RecursiveDict) instance, so nested fields can be accessed with attribute notation, e.g.:

LDB_PARAMETERS.search.limit
LDB_PARAMETERS.io.destination
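
To show what attribute notation on a nested mapping looks like in practice, here is a minimal sketch. The AttrDict class is a hypothetical stand-in written for this example; it is not the Data class used by gismap, whose implementation may differ. The values are copied from the LDB_PARAMETERS definition above.

```python
class AttrDict(dict):
    """Hypothetical stand-in for Data: dict with recursive attribute access."""

    def __getattr__(self, name):
        try:
            value = self[name]
        except KeyError:
            raise AttributeError(name) from None
        # Wrap nested plain dicts so that chained access keeps working.
        return AttrDict(value) if isinstance(value, dict) else value


params = AttrDict(
    search={"limit": 3, "cutoff": 87.0, "slack": 1.0},
    bof={"n_range": 2, "length_impact": 0.1},
)
limit = params.search.limit            # attribute notation
cutoff = params["search"]["cutoff"]    # regular dict notation still works
```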

gismap.sources.dblp_ttl.get_stream(source, chunk_size=65536)[source]#
Parameters:
  • source (str or Path) – Where the content is. Can be a local file or on the Internet.

  • chunk_size (int, optional) – Desired chunk size. For streaming gz content, must be a multiple of 32kB.

Yields:
  • iterable – Chunk iterator that streams the content.

  • int – Source size (used later to compute ETA).
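
Streaming gzip content chunk by chunk can be done with the standard library alone. The sketch below is not the get_stream implementation: stream_gz is a hypothetical helper, and the in-memory payload merely stands in for a real DBLP dump.

```python
import gzip
import io
import zlib


def stream_gz(fileobj, chunk_size=65536):
    """Yield decompressed bytes from a gzip file object, chunk by chunk."""
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # Emit whatever is still buffered once the input is exhausted.
    tail = decompressor.flush()
    if tail:
        yield tail


payload = b"<https://dblp.org/rec/x> a <Publication> .\n" * 10000
compressed = io.BytesIO(gzip.compress(payload))
restored = b"".join(stream_gz(compressed, chunk_size=32 * 1024))
```

Because the decompressor object keeps its state between calls, the whole file never needs to sit in memory at once.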

gismap.sources.dblp_ttl.parse_block(dblp_block)[source]#
Parameters:

dblp_block (str) – A DBLP publication, turtle format.

Returns:

  • key (str) – DBLP key.

  • title (str) – Publication title.

  • type (str) – Type of publication.

  • authors (dict) – Publication authors (key -> name)

  • url (str or NoneType) – Publication URL.

  • stream (list or NoneType) – Publication streams (normalized journal/conf).

  • pages (str or NoneType) – Publication pages.

  • venue (str or NoneType) – Publication venue (conf/journal).

  • year (int) – Year of publication.

gismap.sources.dblp_ttl.publis_streamer(source, chunk_size=65536, encoding='unicode_escape')[source]#
Parameters:
  • source (str or Path) – Where the DBLP turtle content is. Can be a local file or on the Internet.

  • chunk_size (int, optional) – Desired chunk size. Must be a multiple of 32kB.

  • encoding (str, default='unicode_escape') – Encoding of stream.

Yields:
  • key (str) – DBLP key.

  • title (str) – Publication title.

  • type (str) – Type of publication.

  • authors (dict) – Publication authors (key -> name).

  • venue (str) – Publication venue (conf/journal).

  • year (int) – Year of publication.
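
A streamer of this kind has to reassemble records that straddle chunk boundaries. The sketch below shows one way to do so, under the assumption that records are separated by blank lines; that separator, the iter_blocks helper, and the sample text are illustrative and not taken from the gismap source.

```python
def iter_blocks(chunks, separator="\n\n"):
    """Yield complete text blocks from an iterable of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Everything before the last separator is made of complete blocks;
        # the remainder stays in the buffer until more text arrives.
        *complete, buffer = buffer.split(separator)
        for block in complete:
            if block.strip():
                yield block
    # Whatever remains once the stream ends is the final block.
    if buffer.strip():
        yield buffer


text = "block one .\n\nblock two ;\n  continued .\n\nblock three ."
# Cut the text into small pieces to simulate arbitrary chunk boundaries.
chunks = [text[i:i + 7] for i in range(0, len(text), 7)]
blocks = list(iter_blocks(chunks))
```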

DBLP (online)#

Interface for the dblp computer science bibliography (https://dblp.org/).

class gismap.sources.dblp.DBLP[source]#
classmethod from_author(a, wait=True)[source]#
Parameters:
  • a (Author) – The author whose publications to retrieve.

  • wait (bool, default=True) – Wait a bit to avoid 429.

Returns:

Papers available in DBLP.

Return type:

list

classmethod search_author(name, wait=True)[source]#
Parameters:
  • name (str) – Name of the person to find.

  • wait (bool, default=True) – Wait a bit to avoid 429.

Returns:

Potential matches.

Return type:

list

Examples

>>> fabien = DBLP.search_author("Fabien Mathieu")
>>> fabien
[DBLPAuthor(name='Fabien Mathieu', key='66/2077')]
>>> fabien[0].url
'https://dblp.org/pid/66/2077.html'
>>> manu = DBLP.search_author("Manuel Barragan")
>>> manu 
[DBLPAuthor(name='Manuel Barragan', key='07/10587'),
DBLPAuthor(name='Manuel Barragan', key='83/3865'),
DBLPAuthor(name='Manuel Barragan', key='188/0198')]
>>> DBLP.search_author("NotaSearcherName", wait=False)
[]
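
The wait flag exists to avoid HTTP 429 (Too Many Requests) responses. How gismap spaces its requests is not documented here; the sketch below only illustrates the general idea of pausing before a call and retrying when rate-limited. The polite_call helper, the TooManyRequests exception, and the delay and retries values are all hypothetical.

```python
import time


class TooManyRequests(Exception):
    """Hypothetical error standing in for an HTTP 429 response."""


def polite_call(fetch, wait=True, delay=1.0, retries=3, sleep=time.sleep):
    """Call fetch(), pausing beforehand and retrying when rate-limited."""
    for attempt in range(retries + 1):
        if wait:
            sleep(delay * (attempt + 1))  # wait longer after each failure
        try:
            return fetch()
        except TooManyRequests:
            if attempt == retries:
                raise  # give up after the last allowed attempt


calls = []
pauses = []


def flaky_fetch():
    # Fails once with a simulated 429, then succeeds.
    calls.append(1)
    if len(calls) < 2:
        raise TooManyRequests()
    return ["result"]


# Record the pauses instead of actually sleeping, to keep the demo fast.
result = polite_call(flaky_fetch, sleep=pauses.append)
```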

class gismap.sources.dblp.DBLPAuthor(name: str, key: str, aliases: list = <factory>)[source]#

Examples

>>> fabien = DBLPAuthor('Fabien Mathieu', key='66/2077')
>>> publications = sorted(fabien.get_publications(),
...                 key=lambda p: p.title)
>>> publications[0].url 
'https://dblp.org/rec/conf/iptps/BoufkhadMMPV08.html'
>>> publications[-1] 
DBLPPublication(title='Upper Bounds for Stabilization in Acyclic Preference-Based Systems.',
authors=[DBLPAuthor(name='Fabien Mathieu', key='66/2077')], venue='SSS', type='conference', year=2007,
key='conf/sss/Mathieu07')

class gismap.sources.dblp.DBLPPublication(title: str, authors: list, venue: str, type: str, year: int, key: str, metadata: dict = <factory>)[source]#

Publication from the DBLP database.

Parameters:
  • title (str) – Publication title.

  • authors (list) – List of DBLPAuthor objects.

  • venue (str) – Publication venue.

  • type (str) – Publication type.

  • year (int) – Publication year.

  • key (str) – DBLP record key.

  • metadata (dict) – Additional metadata (pages, volume, etc.).

HAL#

Interface for Hyper Articles en Ligne (https://hal.science/).

class gismap.sources.hal.HAL[source]#
classmethod from_author(a, wait=True)[source]#
Parameters:
  • a (HALAuthor) – HAL researcher.

  • wait (bool, default=True) – Wait a bit to avoid 429.

Returns:

Papers available in HAL.

Return type:

list

Examples

>>> fabien = HAL.search_author("Fabien Mathieu")[0]
>>> publications = sorted(fabien.get_publications(), key=lambda p: p.title)
>>> publications[2] 
HALPublication(title='Achievable Catalog Size in Peer-to-Peer Video-on-Demand Systems',
authors=[HALAuthor(name='Yacine Boufkhad', key='yacine-boufkhad'),
HALAuthor(name='Fabien Mathieu', key='fabien-mathieu'),
HALAuthor(name='Fabien de Montgolfier', key='949013', key_type='pid'),
HALAuthor(name='Diego Perino', key='Diego Perino', key_type='fullname'),
HALAuthor(name='Laurent Viennot', key='laurentviennot')],
venue='Proceedings of the 7th Internnational Workshop on Peer-to-Peer Systems (IPTPS)', type='conference',
year=2008, key='471724')
>>> diego = publications[2].authors[3]
>>> diego
HALAuthor(name='Diego Perino', key='Diego Perino', key_type='fullname')
>>> len(diego.get_publications()) > 28
True
>>> publications[-7] 
HALPublication(title='Upper bounds for stabilization in acyclic preference-based systems',
authors=[HALAuthor(name='Fabien Mathieu', key='fabien-mathieu')],
venue="SSS'07 - 9th international conference on Stabilization, Safety, and Security of Distributed Systems",
type='conference', year=2007, key='668356')

Case of an author with multiple ids whose publications one wants to combine:

>>> maria = HAL.search_author('Maria Potop-Butucaru')
>>> maria  
[HALAuthor(name='Maria Potop-Butucaru', key='841868', key_type='pid')]
>>> n_pubs = len(HAL.from_author(maria[0]))
>>> n_pubs > 200
True
>>> n_pubs == len(maria[0].get_publications())
True

Note: an error is raised if the author has no key:

>>> HAL.from_author(HALAuthor('Fabien Mathieu'))
Traceback (most recent call last):
...
ValueError: HALAuthor(name='Fabien Mathieu') must have a key for publications to be fetched.

classmethod search_author(name, wait=True)[source]#
Parameters:
  • name (str) – Name of the person to find.

  • wait (bool, default=True) – Wait a bit to avoid 429.

Returns:

Potential matches.

Return type:

list

Examples

>>> fabien = HAL.search_author("Fabien Mathieu")
>>> fabien
[HALAuthor(name='Fabien Mathieu', key='fabien-mathieu')]
>>> fabien = fabien[0]
>>> fabien.url
'https://hal.science/search/index/?q=*&authIdHal_s=fabien-mathieu'
>>> HAL.search_author("Laurent Viennot")[0]
HALAuthor(name='Laurent Viennot', key='laurentviennot')
>>> HAL.search_author("NotaSearcherName")
[]
>>> HAL.search_author("Ana Busic")
[HALAuthor(name='Ana Busic', key='anabusic')]
>>> HAL.search_author("Potop-Butucaru Maria")  
[HALAuthor(name='Potop-Butucaru Maria', key='841868', key_type='pid')]
>>> diego = HAL.search_author("Diego Perino")
>>> diego  
[HALAuthor(name='Diego Perino', key='847558', key_type='pid'),
HALAuthor(name='Diego Perino', key='978810', key_type='pid')]
>>> diego[1].url
'https://hal.science/search/index/?q=*&authIdPerson_i=978810'

class gismap.sources.hal.HALAuthor(name: str, key: str | int = None, key_type: str = None, aliases: list = <factory>, _url: str = None, _img: str = None, _cv: bool = None)[source]#

Author from the HAL (Hyper Articles en Ligne) database.

HAL is a French open archive for scholarly publications.

Parameters:
  • name (str) – The author’s name.

  • key (str or int, optional) – HAL identifier for the author.

  • key_type (str, optional) – Type of key (‘pid’, ‘fullname’, or None for idHal).

  • aliases (list) – Alternative names for the author.

class gismap.sources.hal.HALPublication(title: str, authors: list, venue: str, type: str, year: int, key: str, metadata: dict = <factory>)[source]#

Publication from the HAL database.

Parameters:
  • title (str) – Publication title.

  • authors (list) – List of HALAuthor objects.

  • venue (str) – Publication venue.

  • type (str) – Publication type.

  • year (int) – Publication year.

  • key (str) – HAL document identifier.

  • metadata (dict) – Additional metadata (abstract, URL, etc.).

classmethod from_json(r)[source]#
Parameters:

r (dict) – De-serialized JSON.

Return type:

HALPublication

gismap.sources.hal.parse_facet_author(a)[source]#
Parameters:

a (str) – HAL facet of an author.

Return type:

HALAuthor

Multi-source#

Interface for handling multiple sources at once.

class gismap.sources.multi.SourcedAuthor(name: str, sources: list = <factory>)[source]#

An author aggregated from multiple database sources.

Combines author information from HAL, DBLP, and/or LDB into a single entity. The primary source (first in the sorted list) determines the author’s key.

Parameters:
  • name (str) – The author’s name.

  • sources (list) – List of database-specific author objects (HALAuthor, DBLPAuthor, LDBAuthor).

class gismap.sources.multi.SourcedPublication(title: str, authors: list, venue: str, type: str, year: int, sources: list = <factory>)[source]#

A publication aggregated from multiple database sources.

Combines publication entries from HAL, DBLP, and/or LDB that refer to the same paper. The primary source determines the publication’s metadata.

Parameters:
  • title (str) – Publication title.

  • authors (list) – List of author objects.

  • venue (str) – Publication venue.

  • type (str) – Publication type.

  • year (int) – Publication year.

  • sources (list) – List of database-specific publication objects.

gismap.sources.multi.regroup_authors(auth_dict, pub_dict)[source]#

Replace authors of publications with matching authors. Typical use: upgrade DB-specific authors to multisource authors.

Replacement is in place.

Parameters:
  • auth_dict (dict) – Authors to unify.

  • pub_dict (dict) – Publications to unify.

Return type:

None

gismap.sources.multi.regroup_publications(pub_dict, threshold=85, length_impact=0.05, n_range=5)[source]#

Groups together copies of the same publication.

Parameters:
  • pub_dict (dict) – Publications to unify.

  • threshold (float) – Similarity parameter.

  • length_impact (float) – Length impact parameter.

Returns:

Unified publications.

Return type:

dict
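
The idea of grouping near-identical titles under a similarity threshold can be sketched as follows. This is not the gismap algorithm: difflib.SequenceMatcher is swapped in for the library's own similarity measure, the greedy grouping strategy is an assumption, and length_impact and n_range have no counterpart here. The titles are taken from the DBLP and HAL examples in this page.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity of two titles on a 0-100 scale (difflib stand-in)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_titles(titles, threshold=85):
    """Greedily attach each title to the first group whose leading
    title is similar enough, or start a new group otherwise."""
    groups = []
    for title in titles:
        for group in groups:
            if similarity(title, group[0]) >= threshold:
                group.append(title)
                break
        else:
            groups.append([title])
    return groups


titles = [
    "Upper Bounds for Stabilization in Acyclic Preference-Based Systems.",
    "Upper bounds for stabilization in acyclic preference-based systems",
    "Achievable Catalog Size in Peer-to-Peer Video-on-Demand Systems",
]
groups = group_titles(titles)
```

Here the two spellings of the same paper, which differ only in case and a trailing period, end up in one group, while the unrelated title forms its own.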

gismap.sources.multi.score_author_source(dbauthor)[source]#

Compute a quality score for an author source.

Higher scores indicate more reliable author identification. HAL idHal keys are preferred, followed by HAL pid, then DBLP/LDB.

Parameters:

dbauthor (Author) – A database-specific author object.

Returns:

Score value (higher is better).

Return type:

int

gismap.sources.multi.sort_author_sources(sources)[source]#

Sort author sources by quality score in descending order.

Parameters:

sources (list) – List of database-specific author objects.

Returns:

Sorted list with highest-quality sources first.

Return type:

list
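
The preference order described for score_author_source can be illustrated with a small sketch. The FakeAuthor class, the numeric scores, the sample pid, and the treatment of every other case as lowest priority are assumptions made for this example; only the ordering (HAL idHal, then HAL pid, then DBLP/LDB) comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class FakeAuthor:
    """Hypothetical stand-in for a database-specific author."""
    name: str
    db: str                 # 'hal', 'dblp' or 'ldb'
    key: str
    key_type: str = None    # None means idHal for HAL authors


def score(author):
    # Illustrative values only; the ordering follows the text above.
    if author.db == "hal" and author.key_type is None:
        return 3            # HAL idHal
    if author.db == "hal" and author.key_type == "pid":
        return 2            # HAL pid
    return 1                # DBLP / LDB (and anything else in this sketch)


def sort_sources(sources):
    # sorted() is stable, so sources with equal scores keep their order.
    return sorted(sources, key=score, reverse=True)


sources = [
    FakeAuthor("Fabien Mathieu", "ldb", "66/2077"),
    FakeAuthor("Fabien Mathieu", "hal", "123", "pid"),   # made-up pid
    FakeAuthor("Fabien Mathieu", "hal", "fabien-mathieu"),
]
ranked = sort_sources(sources)
primary = ranked[0]   # the first source would act as the primary one
```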