Database#

Classes and functions to interact with databases of publications.

Models#

Abstract description of the GisMap DB interface.

class gismap.sources.models.Author(name: str)[source]#
class gismap.sources.models.DB[source]#
class gismap.sources.models.Publication(title: str, authors: list, venue: str, type: str, year: int)[source]#
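
The signatures above read like plain data records. As a minimal sketch, assuming they behave like Python dataclasses, the block below re-declares `Author` and `Publication` locally (it does not import GisMap) and derives a hypothetical `KeyedPublication` that adds the `key` and `metadata` fields found on the source-specific classes documented further down.

```python
from dataclasses import dataclass, field


@dataclass
class Author:
    name: str


@dataclass
class Publication:
    title: str
    authors: list
    venue: str
    type: str
    year: int


@dataclass
class KeyedPublication(Publication):
    """Hypothetical source-specific record: adds a key and free-form metadata."""
    key: str
    metadata: dict = field(default_factory=dict)  # fresh dict per instance


pub = KeyedPublication(
    title="Upper Bounds for Stabilization in Acyclic Preference-Based Systems.",
    authors=[Author("Fabien Mathieu")],
    venue="SSS",
    type="conference",
    year=2007,
    key="conf/sss/Mathieu07",
)
print(pub.key, pub.year)
```

The `<factory>` default shown in the signatures corresponds to `field(default_factory=...)`, which avoids sharing one mutable default between instances.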

LDB (Local DBLP)#

Interface for the dblp computer science bibliography (https://dblp.org/) using a local copy of the database.

class gismap.sources.ldb.LDB[source]#

Browse DBLP from a local copy of the database.

dump(*args, **kwargs)[source]#

Save instance to file.

Parameters:
  • filename (str) – The stem of the filename.

  • path (str or Path, optional) – The location path.

  • overwrite (bool, default=False) – Should an existing file be overwritten?

  • compress (bool, default=True) – Should Zstd compression be used?

  • stemize (bool, default=True) – Trim any extension (e.g. .xxx) from the filename.

Examples

>>> import tempfile
>>> from pathlib import Path
>>> v1 = ToyClass(42)
>>> v2 = ToyClass()
>>> v2.value
0
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     v1.dump(filename='myfile', compress=True, path=tmpdirname)
...     dir_content = [file.name for file in Path(tmpdirname).glob('*')]
...     v2 = ToyClass.load(filename='myfile', path=Path(tmpdirname))
...     v1.dump(filename='myfile', compress=True, path=tmpdirname) # doctest: +ELLIPSIS
File ...myfile.pkl.zst already exists! Use overwrite option to overwrite.
>>> dir_content
['myfile.pkl.zst']
>>> v2.value
42
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     v1.dump(filename='myfile', compress=False, path=tmpdirname)
...     v1.dump(filename='myfile', compress=False, path=tmpdirname) # doctest: +ELLIPSIS
File ...myfile.pkl already exists! Use overwrite option to overwrite.
>>> v1.value = 51
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     v1.dump(filename='myfile', path=tmpdirname, compress=False)
...     v1.dump(filename='myfile', path=tmpdirname, overwrite=True, compress=False)
...     v2 = ToyClass.load(filename='myfile', path=tmpdirname)
...     dir_content = [file.name for file in Path(tmpdirname).glob('*')]
>>> dir_content
['myfile.pkl']
>>> v2.value
51
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     v2 = ToyClass.load(filename='thisfilenamedoesnotexist')
Traceback (most recent call last):
 ...
FileNotFoundError: [Errno 2] No such file or directory: ...

classmethod load(*args, **kwargs)[source]#

Load instance from file.

Parameters:
  • filename (str) – The stem of the filename.

  • path (str or Path, optional) – The location path.
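
The dump/load behaviour described above (pickle to a file, optional compression, refusal to overwrite unless asked) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the GisMap implementation: the free functions `dump` and `load` are hypothetical, `gzip` stands in for Zstd, and the `.pkl.gz` suffix replaces the `.pkl.zst` suffix seen in the examples.

```python
import gzip
import pickle
import tempfile
from pathlib import Path


def dump(obj, filename, path=".", overwrite=False, compress=True):
    """Pickle obj into path; refuse to replace an existing file unless asked."""
    suffix = ".pkl.gz" if compress else ".pkl"  # gzip stands in for Zstd
    destination = Path(path) / f"{filename}{suffix}"
    if destination.exists() and not overwrite:
        print(f"File {destination} already exists! Use overwrite option to overwrite.")
        return
    opener = gzip.open if compress else open
    with opener(destination, "wb") as f:
        pickle.dump(obj, f)


def load(filename, path="."):
    """Load a pickled object, trying the compressed file first."""
    base = Path(path) / filename
    for suffix, opener in ((".pkl.gz", gzip.open), (".pkl", open)):
        candidate = Path(f"{base}{suffix}")
        if candidate.exists():
            with opener(candidate, "rb") as f:
                return pickle.load(f)
    raise FileNotFoundError(f"No such file or directory: {base}")


with tempfile.TemporaryDirectory() as tmp:
    dump({"value": 42}, "myfile", path=tmp)
    dump({"value": 51}, "myfile", path=tmp)  # refused: the file already exists
    first = load("myfile", path=tmp)
    dump({"value": 51}, "myfile", path=tmp, overwrite=True)
    second = load("myfile", path=tmp)
```

Printing a message instead of raising on an existing file mirrors the doctest output shown above; a stricter variant could raise `FileExistsError`.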

class gismap.sources.ldb.LDBAuthor(name: str, key: str, aliases: list = <factory>)[source]#
class gismap.sources.ldb.LDBPublication(title: str, authors: list, venue: str, type: str, year: int, key: str, metadata: dict = <factory>)[source]#
gismap.sources.dblp_ttl.get_stream(source, chunk_size=65536)[source]#
Parameters:
  • source (str or Path) – Location of the content. Can be a local file or a resource on the Internet.

  • chunk_size (int, optional) – Desired chunk size. For streaming gz content, must be a multiple of 32kB.

Yields:
  • iterable – Chunk iterator that streams the content.

  • int – Source size (used later to compute ETA).
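
As a rough illustration of the chunk iterator and source size described above, the sketch below handles the local-file case only. The name `local_stream` is hypothetical, and treating the function as a context manager is an assumption based on the "Yields" section; remote sources and gz handling are left out.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def local_stream(source, chunk_size=65536):
    """Yield (chunk iterator, total size in bytes) for a local file."""
    size = os.path.getsize(source)
    with open(source, "rb") as f:
        # iter(callable, sentinel) keeps calling f.read until it returns b"".
        yield iter(lambda: f.read(chunk_size), b""), size


# Build a 150,000-byte file to stream.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 150_000)

with local_stream(tmp.name) as (chunks, total):
    sizes = [len(chunk) for chunk in chunks]
os.remove(tmp.name)

print(total, sizes)
```

Knowing the total size up front is what makes a progress bar or ETA possible while the chunks are consumed.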

gismap.sources.dblp_ttl.parse_block(dblp_block)[source]#
Parameters:

dblp_block (str) – A DBLP publication, turtle format.

Returns:

  • key (str) – DBLP key.

  • title (str) – Publication title.

  • type (str) – Type of publication.

  • authors (dict) – Publication authors (key -> name).

  • url (str or NoneType) – Publication URL.

  • stream (list or NoneType) – Publication streams (normalized journal/conf).

  • pages (str or NoneType) – Publication pages.

  • venue (str or NoneType) – Publication venue (conf/journal).

  • year (int) – Year of publication.

gismap.sources.dblp_ttl.publis_streamer(source, chunk_size=65536, encoding='unicode_escape')[source]#
Parameters:
  • source (str or Path) – Where the DBLP turtle content is. Can be on a local file or on the Internet.

  • chunk_size (int, optional) – Desired chunk size. Must be a multiple of 32kB.

  • encoding (str, default=unicode_escape) – Encoding of stream.

Yields:
  • key (str) – DBLP key.

  • title (str) – Publication title.

  • type (str) – Type of publication.

  • authors (dict) – Publication authors (key -> name).

  • venue (str) – Publication venue (conf/journal).

  • year (int) – Year of publication.
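
The records listed above can be consumed with a plain `for` loop. The sketch below assumes each record arrives as one 6-tuple in the documented order, and uses a hypothetical `fake_streamer` generator in place of the real function so that it runs without a DBLP dump. The second record is entirely made up.

```python
from collections import Counter


def fake_streamer():
    """Hypothetical stand-in yielding records shaped like the documented output."""
    yield ("conf/sss/Mathieu07",
           "Upper Bounds for Stabilization in Acyclic Preference-Based Systems.",
           "conference", {"66/2077": "Fabien Mathieu"}, "SSS", 2007)
    # Made-up record, for illustration only.
    yield ("conf/example/Doe08", "A made-up title.",
           "conference", {"00/0000": "Jane Doe"}, "EXAMPLE", 2008)


per_year = Counter()
by_author = {}
for key, title, kind, authors, venue, year in fake_streamer():
    per_year[year] += 1
    for author_key in authors:
        # Index publication keys by author key.
        by_author.setdefault(author_key, []).append(key)

print(per_year)
print(by_author)
```

Because the data arrives as a stream, aggregates like these can be built in one pass without holding the whole database in memory.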

DBLP (online)#

Interface for the dblp computer science bibliography (https://dblp.org/).

class gismap.sources.dblp.DBLP[source]#
classmethod from_author(a, wait=True)[source]#
Parameters:
  • a (DBLPAuthor) – DBLP researcher.

  • wait (bool) – Wait a bit to avoid HTTP 429 (Too Many Requests) errors.

Returns:

Papers available in DBLP.

Return type:

list
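
How the `wait` flag is implemented is not shown here. One common way to stay under a server's rate limit is to enforce a minimum delay between consecutive requests. The sketch below illustrates that idea with a hypothetical `Throttle` helper; the 0.05 s interval is arbitrary and not a DBLP recommendation.

```python
import time


class Throttle:
    """Hypothetical helper: enforce a minimum interval between calls."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(min_interval=0.05)
start = time.monotonic()
throttle.wait()  # first call returns immediately
throttle.wait()  # second call sleeps for the rest of the interval
elapsed = time.monotonic() - start
print(f"elapsed: {elapsed:.3f}s")
```

Calling `throttle.wait()` before each HTTP request spaces the requests out; passing `wait=False`, as in one of the examples, would correspond to skipping that call.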

classmethod search_author(name, wait=True)[source]#
Parameters:
  • name (str) – Name of the person to find.

  • wait (bool) – Wait a bit to avoid HTTP 429 (Too Many Requests) errors.

Returns:

Potential matches.

Return type:

list

Examples

>>> fabien = DBLP.search_author("Fabien Mathieu")
>>> fabien
[DBLPAuthor(name='Fabien Mathieu', key='66/2077')]
>>> fabien[0].url
'https://dblp.org/pid/66/2077.html'
>>> manu = DBLP.search_author("Manuel Barragan")
>>> manu 
[DBLPAuthor(name='Manuel Barragan', key='07/10587'),
DBLPAuthor(name='Manuel Barragan', key='83/3865'),
DBLPAuthor(name='Manuel Barragan', key='188/0198')]
>>> DBLP.search_author("NotaSearcherName", wait=False)
[]

class gismap.sources.dblp.DBLPAuthor(name: str, key: str, aliases: list = <factory>)[source]#

Examples

>>> fabien = DBLPAuthor('Fabien Mathieu', key='66/2077')
>>> publications = sorted(fabien.get_publications(),
...                 key=lambda p: p.title)
>>> publications[0].url
'https://dblp.org/rec/conf/iptps/BoufkhadMMPV08.html'
>>> publications[-1]
DBLPPublication(title='Upper Bounds for Stabilization in Acyclic Preference-Based Systems.',
authors=[DBLPAuthor(name='Fabien Mathieu', key='66/2077')], venue='SSS', type='conference', year=2007,
key='conf/sss/Mathieu07')
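
The URLs in the examples above follow a visible pattern: author keys sit under `/pid/` and publication keys under `/rec/`. As a small sketch inferred from those two examples only, the hypothetical helpers below rebuild the URLs from keys. They are not part of GisMap.

```python
def author_url(key):
    """Build a dblp author page URL from an author key such as '66/2077'."""
    return f"https://dblp.org/pid/{key}.html"


def publication_url(key):
    """Build a dblp record URL from a publication key."""
    return f"https://dblp.org/rec/{key}.html"


print(author_url("66/2077"))
print(publication_url("conf/iptps/BoufkhadMMPV08"))
```
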
class gismap.sources.dblp.DBLPPublication(title: str, authors: list, venue: str, type: str, year: int, key: str, metadata: dict = <factory>)[source]#

HAL#

Interface for HyperArticles en Ligne (https://hal.science/).

class gismap.sources.hal.HAL[source]#
classmethod from_author(a)[source]#
Parameters:

a (HALAuthor) – HAL researcher.

Returns:

Papers available in HAL.

Return type:

list

Examples

>>> fabien = HAL.search_author("Fabien Mathieu")[0]
>>> publications = sorted(fabien.get_publications(), key=lambda p: p.title)
>>> publications[2] 
HALPublication(title='Achievable Catalog Size in Peer-to-Peer Video-on-Demand Systems',
authors=[HALAuthor(name='Yacine Boufkhad', key='yacine-boufkhad'),
HALAuthor(name='Fabien Mathieu', key='fabien-mathieu'),
HALAuthor(name='Fabien de Montgolfier', key='949013', key_type='pid'),
HALAuthor(name='Diego Perino', key='Diego Perino', key_type='fullname'),
HALAuthor(name='Laurent Viennot', key='laurentviennot')],
venue='Proceedings of the 7th Internnational Workshop on Peer-to-Peer Systems (IPTPS)', type='conference',
year=2008, key='471724')
>>> diego = publications[2].authors[3]
>>> diego
HALAuthor(name='Diego Perino', key='Diego Perino', key_type='fullname')
>>> len(diego.get_publications()) > 28
True
>>> publications[-7] 
HALPublication(title='Upper bounds for stabilization in acyclic preference-based systems',
authors=[HALAuthor(name='Fabien Mathieu', key='fabien-mathieu')],
venue="SSS'07 - 9th international conference on Stabilization, Safety, and Security of Distributed Systems",
type='conference', year=2007, key='668356')

Case of someone with multiple IDs that one wants to combine:

>>> maria = HAL.search_author('Maria Potop-Butucaru')
>>> maria  
[HALAuthor(name='Maria Potop-Butucaru', key='841868', key_type='pid')]
>>> n_pubs = len(HAL.from_author(maria[0]))
>>> n_pubs > 200
True
>>> n_pubs == len(maria[0].get_publications())
True

Note: an error is raised if not enough data is provided:

>>> HAL.from_author(HALAuthor('Fabien Mathieu'))
Traceback (most recent call last):
...
ValueError: HALAuthor(name='Fabien Mathieu') must have a key for publications to be fetched.

classmethod search_author(name)[source]#
Parameters:

name (str) – Name of the person to find.

Returns:

Potential matches.

Return type:

list

Examples

>>> fabien = HAL.search_author("Fabien Mathieu")
>>> fabien
[HALAuthor(name='Fabien Mathieu', key='fabien-mathieu')]
>>> fabien = fabien[0]
>>> fabien.url
'https://hal.science/search/index/?q=*&authIdHal_s=fabien-mathieu'
>>> HAL.search_author("Laurent Viennot")[0]
HALAuthor(name='Laurent Viennot', key='laurentviennot')
>>> HAL.search_author("NotaSearcherName")
[]
>>> HAL.search_author("Ana Busic")
[HALAuthor(name='Ana Busic', key='anabusic')]
>>> HAL.search_author("Potop-Butucaru Maria")  
[HALAuthor(name='Potop-Butucaru Maria', key='841868', key_type='pid')]
>>> diego = HAL.search_author("Diego Perino")
>>> diego  
[HALAuthor(name='Diego Perino', key='847558', key_type='pid'),
HALAuthor(name='Diego Perino', key='978810', key_type='pid')]
>>> diego[1].url
'https://hal.science/search/index/?q=*&authIdPerson_i=978810'
class gismap.sources.hal.HALAuthor(name: str, key: str | int = None, key_type: str = None, aliases: list = <factory>, _url: str = None, _img: str = None, _cv: bool = None)[source]#
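
The `search_author` examples above show two URL shapes: one for authors with a HAL id (no `key_type`) and one for `key_type='pid'`. The sketch below reproduces just those two cases with a hypothetical `hal_author_url` helper. The query field for `key_type='fullname'` does not appear in the examples, so it is deliberately left unsupported.

```python
HAL_SEARCH = "https://hal.science/search/index/?q=*"

# Query field per key type, as seen in the documented example URLs.
FIELD_BY_KEY_TYPE = {None: "authIdHal_s", "pid": "authIdPerson_i"}


def hal_author_url(key, key_type=None):
    """Build a HAL search URL for an author key (sketch, two key types only)."""
    if key_type not in FIELD_BY_KEY_TYPE:
        raise ValueError(f"Unsupported key_type in this sketch: {key_type!r}")
    return f"{HAL_SEARCH}&{FIELD_BY_KEY_TYPE[key_type]}={key}"


print(hal_author_url("fabien-mathieu"))
print(hal_author_url(978810, key_type="pid"))
```
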
class gismap.sources.hal.HALPublication(title: str, authors: list, venue: str, type: str, year: int, key: str, metadata: dict = <factory>)[source]#
classmethod from_json(r)[source]#
Parameters:

r (dict) – De-serialized JSON.

Return type:

HALPublication

gismap.sources.hal.parse_facet_author(a)[source]#
Parameters:

a (str) – HAL facet of an author.

Return type:

HALAuthor

Multi-source#

Interface for handling multiple sources at once.

class gismap.sources.multi.SourcedAuthor(name: str, sources: list = <factory>)[source]#
class gismap.sources.multi.SourcedPublication(title: str, authors: list, venue: str, type: str, year: int, sources: list = <factory>)[source]#
gismap.sources.multi.regroup_authors(auth_dict, pub_dict)[source]#

Replace authors of publications with matching authors. Typical use: upgrade DB-specific authors to multisource authors.

Replacement is in place.

Parameters:
  • auth_dict (dict) – Authors to unify.

  • pub_dict (dict) – Publications to unify.

Return type:

None
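
The in-place replacement described above can be pictured as follows. This is a sketch under assumptions rather than the GisMap code: the shapes of `auth_dict` (source-specific key to unified author) and `pub_dict` (publication key to publication) are guesses, and the three small classes are local stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Author:
    name: str
    key: str


@dataclass
class MergedAuthor:
    name: str
    sources: list = field(default_factory=list)


@dataclass
class Publication:
    title: str
    authors: list


def regroup_authors_sketch(auth_dict, pub_dict):
    """Swap each publication author for its unified counterpart, in place."""
    for pub in pub_dict.values():
        # Authors without a match are kept unchanged.
        pub.authors = [auth_dict.get(a.key, a) for a in pub.authors]


merged = MergedAuthor("Fabien Mathieu", sources=["66/2077", "fabien-mathieu"])
auth_dict = {"66/2077": merged, "fabien-mathieu": merged}
pub_dict = {
    "conf/sss/Mathieu07": Publication(
        "Upper Bounds for Stabilization in Acyclic Preference-Based Systems.",
        [Author("Fabien Mathieu", "66/2077")],
    ),
    # Author list shortened for the example.
    "471724": Publication(
        "Achievable Catalog Size in Peer-to-Peer Video-on-Demand Systems",
        [Author("Fabien Mathieu", "fabien-mathieu"),
         Author("Laurent Viennot", "laurentviennot")],
    ),
}
result = regroup_authors_sketch(auth_dict, pub_dict)
```

After the call, both publications point to the same `MergedAuthor` object, while the unmatched author is left as is. The function returns `None`, consistent with the documented return type.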

gismap.sources.multi.regroup_publications(pub_dict, threshold=85, length_impact=0.05, n_range=5)[source]#

Puts together copies of the same publication.

Parameters:
  • pub_dict (dict) – Publications to unify.

  • threshold (float) – Similarity parameter.

  • length_impact (float) – Length impact parameter.

Returns:

Unified publications.

Return type:

dict
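
To give an intuition for the `threshold` parameter, the sketch below groups titles whose similarity score reaches 85 on a 0-100 scale. The similarity measure is swapped for `difflib.SequenceMatcher` from the standard library, which is an assumption made for illustration. The measure GisMap actually uses, and the roles of `length_impact` and `n_range`, are not covered here.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity score between 0 and 100."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_titles(titles, threshold=85):
    """Greedy grouping: attach each title to the first group it resembles."""
    groups = []
    for title in titles:
        for group in groups:
            if similarity(title, group[0]) >= threshold:
                group.append(title)
                break
        else:
            groups.append([title])  # no close match: start a new group
    return groups


titles = [
    "Upper Bounds for Stabilization in Acyclic Preference-Based Systems.",
    "Upper bounds for stabilization in acyclic preference-based systems",
    "Achievable Catalog Size in Peer-to-Peer Video-on-Demand Systems",
]
groups = group_titles(titles)
for group in groups:
    print(group)
```

The first two titles, taken from the DBLP and HAL examples above, differ only in capitalization and a trailing period, so they end up in the same group.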