Utils#

Various functions and classes.

Common#

All-purpose functions.

class gismap.utils.common.Data(data)[source]#

Easy-going converter of dict to dataclass. Useful when you want to use attribute access and do not care about giving a full description.

Examples

>>> data = Data({
... 'name': 'Alice',
... 'age': 30,
... 'address': {'street': '123 Main', 'city': 'Paris'},
... 'hobbies': [{'name': 'jazz', 'level': 5}, {'name': 'code'}]})
>>> data 
Data(name='Alice', age=30, address=Data(street='123 Main', city='Paris'),
hobbies=[Data(name='jazz', level=5), Data(name='code')])
>>> data.hobbies[0].name
'jazz'
>>> data.todict()  
{'name': 'Alice', 'age': 30, 'address': {'street': '123 Main', 'city': 'Paris'},
'hobbies': [{'name': 'jazz', 'level': 5}, {'name': 'code'}]}
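
The recursive conversion illustrated above can be approximated with the standard library alone. The following is a minimal sketch, not the gismap implementation: `to_namespace` and `to_dict` are hypothetical helper names, and `types.SimpleNamespace` stands in for `Data`.

```python
from types import SimpleNamespace


def to_namespace(obj):
    """Recursively turn dicts into attribute-access objects (hypothetical helper)."""
    if isinstance(obj, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in obj.items()})
    if isinstance(obj, list):
        return [to_namespace(v) for v in obj]
    return obj


def to_dict(obj):
    """Inverse of to_namespace: rebuild plain dicts and lists."""
    if isinstance(obj, SimpleNamespace):
        return {k: to_dict(v) for k, v in vars(obj).items()}
    if isinstance(obj, list):
        return [to_dict(v) for v in obj]
    return obj


raw = {"name": "Alice", "address": {"city": "Paris"}, "hobbies": [{"name": "jazz"}]}
ns = to_namespace(raw)
print(ns.address.city)     # Paris
print(ns.hobbies[0].name)  # jazz
print(to_dict(ns) == raw)  # True
```
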
class gismap.utils.common.LazyRepr[source]#

MixIn that provides a clean repr for dataclasses.

Hides empty fields and fields in HIDDEN_KEYS from the repr string. Private attributes (starting with ‘_’) are also hidden.
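
To make the hiding rules concrete, here is a minimal sketch of a mixin with the same three rules (empty fields, fields listed in HIDDEN_KEYS, private fields). `CompactRepr` and `Author` are hypothetical names, and what counts as "empty" here (None, empty string, empty list, empty dict) is an assumption rather than the library's definition.

```python
from dataclasses import dataclass, field, fields
from typing import ClassVar


class CompactRepr:
    """Hypothetical mixin: repr that skips empty, hidden, and private fields."""

    HIDDEN_KEYS: ClassVar[set] = set()

    def __repr__(self):
        parts = []
        for f in fields(self):
            value = getattr(self, f.name)
            if f.name.startswith("_") or f.name in self.HIDDEN_KEYS:
                continue  # private or explicitly hidden
            if value is None or value == "" or value == [] or value == {}:
                continue  # assumed notion of "empty"
            parts.append(f"{f.name}={value!r}")
        return f"{type(self).__name__}({', '.join(parts)})"


@dataclass(repr=False)  # keep the mixin's __repr__ instead of the generated one
class Author(CompactRepr):
    HIDDEN_KEYS: ClassVar[set] = {"key"}
    name: str
    key: str = "abc123"
    aliases: list = field(default_factory=list)
    _cache: dict = field(default_factory=dict)


print(repr(Author(name="Alice")))                  # Author(name='Alice')
print(repr(Author(name="Alice", aliases=["A."])))  # Author(name='Alice', aliases=['A.'])
```
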

gismap.utils.common.get_classes(root, key='name', recurse=False)[source]#
Parameters:
  • root (class) – Starting class (can be abstract).

  • key (str, default='name') – Attribute to look up.

  • recurse (bool, default=False) – Recursively traverse subclasses.

Returns:

Dictionary of all subclasses that define the class attribute named by key.

Return type:

dict
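
One plausible way to build such a registry is to walk `__subclasses__()`. The sketch below is an assumption about the mechanism, not the gismap source: `find_subclasses` and the toy classes are hypothetical, and the result is assumed to map each attribute value to its class.

```python
def find_subclasses(root, key="name", recurse=False):
    """Hypothetical stand-in: map each subclass's `key` attribute to the class."""
    found = {}
    for cls in root.__subclasses__():
        if hasattr(cls, key):
            found[getattr(cls, key)] = cls
        if recurse:
            # Also visit subclasses of subclasses.
            found.update(find_subclasses(cls, key=key, recurse=True))
    return found


class Source:
    """Abstract root without a `db_name` attribute."""


class Hal(Source):
    db_name = "hal"


class Dblp(Source):
    db_name = "dblp"


class DblpMirror(Dblp):
    db_name = "mirror"


print(sorted(find_subclasses(Source, key="db_name")))                # ['dblp', 'hal']
print(sorted(find_subclasses(Source, key="db_name", recurse=True)))  # ['dblp', 'hal', 'mirror']
```
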

gismap.utils.common.list_of_objects(clss, dico, default=None)[source]#

Versatile way to specify a list of objects, given either directly or as references (keys) into a dictionary (dico).

Parameters:
  • clss (object) – Object or reference to an object or list of objects / references to objects.

  • dico (dict) – Dictionary of references to objects.

  • default (list, optional) – Default list to return if clss is None.

Returns:

Proper list of objects.

Return type:

list

Examples

>>> from gismap.sources.models import DB
>>> from gismap import HAL, DBLP, LDB  # force registration
>>> subclasses = get_classes(DB, key='db_name')
>>> list_of_objects([HAL, 'ldb'], subclasses)
[<class 'gismap.sources.hal.HAL'>, <class 'gismap.sources.ldb.LDB'>]
>>> list_of_objects(None, subclasses, [DBLP])
[<class 'gismap.sources.dblp.DBLP'>]
>>> list_of_objects(LDB, subclasses)
[<class 'gismap.sources.ldb.LDB'>]
>>> list_of_objects('hal', subclasses)
[<class 'gismap.sources.hal.HAL'>]
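
The resolution rules suggested by these examples can be sketched as follows. This is an illustrative guess, not the gismap code: `resolve_objects` and the toy classes are hypothetical, references are assumed to be strings, and returning an empty list when both clss and default are None is an assumption.

```python
def resolve_objects(clss, dico, default=None):
    """Hypothetical stand-in: normalize the input to a list of objects."""
    if clss is None:
        return default if default is not None else []
    if not isinstance(clss, list):
        clss = [clss]  # accept a single object or reference
    # Strings are treated as references into dico, anything else as the object itself.
    return [dico[c] if isinstance(c, str) else c for c in clss]


class Hal:
    pass


class Ldb:
    pass


registry = {"hal": Hal, "ldb": Ldb}
print(resolve_objects([Hal, "ldb"], registry) == [Hal, Ldb])  # True
print(resolve_objects(None, registry, [Hal]) == [Hal])        # True
print(resolve_objects("hal", registry) == [Hal])              # True
```
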
gismap.utils.common.unlist(x)[source]#
Parameters:

x (str or list or int) – Something.

Returns:

x – If it’s a list, make it flat.

Return type:

str or int

Requests#

Functions related to HTTP requests.

gismap.utils.requests.get(url, params=None, n_trials=10, verify=True, encoding=None)[source]#
Parameters:
  • url (str) – Entry point to fetch.

  • params (dict, optional) – Get arguments (appended to URL).

  • n_trials (int, default=10) – Number of attempts to fetch URL.

  • verify (bool, default=True) – Verify certificates.

  • encoding (str, optional) – Force response encoding (e.g. "utf-8"). Useful when the server does not declare the charset and requests falls back to ISO-8859-1.

Returns:

Text of the response.

Return type:

str
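
The n_trials parameter implies a retry loop around the fetch. The sketch below shows the general pattern only: `fetch_with_retries` is a hypothetical helper, the fetch operation is passed in as a callable so the example runs offline, and raising after the last attempt is an assumption about the failure behavior.

```python
import time


def fetch_with_retries(fetch, n_trials=10, delay=0.0):
    """Call `fetch()` until it succeeds or `n_trials` attempts are used up."""
    last_error = None
    for _ in range(n_trials):
        try:
            return fetch()
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
            time.sleep(delay)
    raise RuntimeError(f"gave up after {n_trials} attempts") from last_error


calls = {"count": 0}


def flaky():
    """Simulated endpoint that fails twice before answering."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "payload"


result = fetch_with_retries(flaky, n_trials=5)
print(result, calls["count"])  # payload 3
```

With the requests library, `fetch` could be something like `lambda: requests.get(url, params=params, verify=verify).text`; forcing an encoding amounts to setting `response.encoding` before reading `response.text`.
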

Logger#

Keep track of things.

gismap.utils.logger.logger = <Logger GisMap (INFO)>#

Default logging interface.
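
Assuming the object is a standard `logging.Logger` named "GisMap", as its repr suggests, its verbosity can be adjusted with the usual logging calls. A short sketch using only the standard library:

```python
import logging

# Fetch the logger by name (the name "GisMap" is taken from the repr above).
log = logging.getLogger("GisMap")
log.setLevel(logging.INFO)
print(repr(log))  # <Logger GisMap (INFO)>

# Raise the threshold to silence informational messages.
log.setLevel(logging.WARNING)
print(log.isEnabledFor(logging.INFO))     # False
print(log.isEnabledFor(logging.WARNING))  # True
```
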

Zlist#

Convert a list into a succession of compressed frames. Reduces memory footprint at the price of slower random access (sequential access is unaffected).

class gismap.utils.zlist.ZList(frame_size=1000)[source]#

Compressed list with frame-based storage.

Stores elements in compressed frames, allowing efficient memory usage while maintaining random access. Uses zstandard compression.

Use as a context manager for building:

with ZList(frame_size=1000) as z:
    for item in data:
        z.append(item)

Parameters:

frame_size (int, default=1000) – Number of elements per compressed frame.

append(entry)[source]#

Add an element to the list.

Parameters:

entry – Element to add.
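
The frame-based idea can be sketched with the standard library. This is not the ZList implementation: `FrameList` is a hypothetical class, zlib and pickle are swapped in for zstandard, and the one-frame cache is an assumption about how sequential access stays cheap. The sketch also assumes no element is appended after the context manager exits.

```python
import pickle
import zlib


class FrameList:
    """Hypothetical frame-compressed list (zlib + pickle, not zstandard)."""

    def __init__(self, frame_size=1000):
        self.frame_size = frame_size
        self.frames = []            # compressed frames
        self.buffer = []            # elements not yet compressed
        self.n = 0                  # total number of elements
        self._cache = (None, None)  # (frame index, decompressed frame)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self._flush()  # compress whatever is left in the buffer
        return False

    def _flush(self):
        if self.buffer:
            self.frames.append(zlib.compress(pickle.dumps(self.buffer)))
            self.buffer = []

    def append(self, entry):
        self.buffer.append(entry)
        self.n += 1
        if len(self.buffer) == self.frame_size:
            self._flush()

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        f, offset = divmod(i, self.frame_size)
        if f == len(self.frames):  # still in the uncompressed buffer
            return self.buffer[offset]
        if self._cache[0] != f:    # decompress once, reuse for neighbours
            self._cache = (f, pickle.loads(zlib.decompress(self.frames[f])))
        return self._cache[1][offset]


with FrameList(frame_size=4) as z:
    for item in range(10):
        z.append(item * item)

print(len(z), len(z.frames))  # 10 3
print(z[0], z[5], z[9])       # 0 25 81
```

Random access costs one frame decompression, while a sequential scan decompresses each frame only once thanks to the cache.
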

Text#

Text manipulation tools.

class gismap.utils.text.Corrector(voc, score_cutoff=20, min_length=3)[source]#

A simple word corrector based on an input vocabulary. Short words are discarded.

Parameters:
  • voc (list) – Words (each entry may contain multiple words).

  • score_cutoff (int, default=20) – Threshold for correction.

  • min_length (int, default=3) – Minimal number of characters for correction to kick in.

Examples

>>> vocabulary = ['My Taylor Swift is Rich']
>>> phrase = "How riche ise Tailor Swyft"
>>> cor = Corrector(vocabulary, min_length=4)
>>> cor(phrase)
'How rich ise taylor swift'
>>> cor = Corrector(vocabulary, min_length=2)
>>> cor(phrase)
'How rich is taylor swift'
gismap.utils.text.asciify(text)[source]#
Parameters:

text (str) – Some text (typically names) with annoying accents.

Returns:

Same text simplified into ascii.

Return type:

str

Examples

>>> asciify('Ana Bušić')
'Ana Busic'
>>> asciify("Thomas Deiß")
'Thomas Deiss'
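
A common way to obtain this kind of simplification is Unicode decomposition. The sketch below is one possible approach, not necessarily the one gismap uses: `to_ascii` is a hypothetical helper, and the EXTRA table for characters that do not decompose is an assumption and far from exhaustive.

```python
import unicodedata

# Characters that NFKD does not decompose; this small table is illustrative only.
EXTRA = str.maketrans({"ß": "ss", "ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE", "ł": "l", "Ł": "L"})


def to_ascii(text):
    """Hypothetical helper: strip accents using Unicode decomposition."""
    # NFKD splits 'š' into 's' plus a combining mark, which the ASCII encoding drops.
    decomposed = unicodedata.normalize("NFKD", text.translate(EXTRA))
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Ana Bušić"))    # Ana Busic
print(to_ascii("Thomas Deiß"))  # Thomas Deiss
```
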
gismap.utils.text.clean_aliases(name, alias_list)[source]#
Parameters:
  • name (str) – Main name.

  • alias_list (list or set) – Aliases.

Returns:

Aliases deduped, sorted, and with main name removed.

Return type:

list
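
The described behavior fits in one line of standard Python. A minimal sketch, with `dedupe_aliases` as a hypothetical name and plain string ordering assumed for the sort:

```python
def dedupe_aliases(name, alias_list):
    """Hypothetical stand-in: unique aliases, sorted, without the main name."""
    return sorted(set(alias_list) - {name})


aliases = ["A. Busic", "Ana Busic", "Ana Bušić", "A. Busic"]
print(dedupe_aliases("Ana Busic", aliases))  # ['A. Busic', 'Ana Bušić']
```
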

gismap.utils.text.normalized_name(txt)[source]#

Try to normalize names to facilitate comparisons. The name is lowercased, split, asciified, sorted, and filtered.

Parameters:

txt (str)

Return type:

str

Examples

>>> normalized_name("Thomas Deiß")
'deiss thomas'
>>> normalized_name("Dario Rossi 001")
'dario rossi'
>>> normalized_name("James W. Roberts")
'james roberts'
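
The pipeline (lowercase, split, asciify, sort, filter) can be sketched as below. `normalize_name` is a hypothetical helper, and the filtering rule (keep only letters, drop tokens shorter than two characters) is an assumption chosen to reproduce the examples above, not the library's actual rule.

```python
import re
import unicodedata


def normalize_name(txt):
    """Hypothetical stand-in for the lowercase/split/asciify/sort/filter steps."""
    tokens = []
    for token in txt.lower().split():
        # Asciify: expand 'ß' by hand, then decompose accented letters.
        token = unicodedata.normalize("NFKD", token.replace("ß", "ss"))
        # Keep plain letters only (drops combining marks, digits, punctuation).
        token = re.sub(r"[^a-z]", "", token)
        if len(token) > 1:  # assumed filter: drop initials and empty tokens
            tokens.append(token)
    return " ".join(sorted(tokens))


print(normalize_name("Thomas Deiß"))       # deiss thomas
print(normalize_name("Dario Rossi 001"))   # dario rossi
print(normalize_name("James W. Roberts"))  # james roberts
```
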
gismap.utils.text.normalized_title(txt)[source]#

Try to normalize titles to facilitate comparisons. The title is lowercased, asciified, and stripped of punctuation.

Parameters:

txt (str)

Return type:

str

Examples

>>> normalized_title("An Efficient Algorithm for P2P Networks")
'an efficient algorithm for p2p networks'
>>> normalized_title("A Study on the Use of Millimeter Waves: 5G Networks.")
'a study on the use of millimeter waves 5g networks'
gismap.utils.text.reduce_keywords(kws)[source]#

Remove keywords that are subparts of other keywords.

Parameters:

kws (list) – List of words / collocations.

Returns:

Reduced list.

Return type:

list

Examples

>>> reduce_keywords(['P2P', 'Millimeter Waves', 'Networks', 'P2P Networks', 'Waves'])
['Millimeter Waves', 'P2P Networks']
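
A simple way to get this result is to drop every keyword contained in another one. The sketch below uses plain substring containment, which is an assumption (the library may compare whole words instead); `drop_contained` is a hypothetical name.

```python
def drop_contained(kws):
    """Keep only keywords that are not contained in another keyword."""
    return [kw for kw in kws
            if not any(kw != other and kw in other for other in kws)]


keywords = ["P2P", "Millimeter Waves", "Networks", "P2P Networks", "Waves"]
print(drop_contained(keywords))  # ['Millimeter Waves', 'P2P Networks']
```
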

Fuzzy#

Fuzzy matching utilities (similarity_matrix).

gismap.utils.fuzzy.similarity_matrix(references, candidates=None, n_range=4, length_impact=0.05, key=None, key2=None)[source]#

Compute a similarity matrix between objects using fuzzy n-gram matching.

When candidates is None, computes pairwise similarities within references (self-comparison). When candidates is provided, computes cross-similarities between references and candidates.

Parameters:
  • references (list) – Reference objects.

  • candidates (list, optional) – Candidate objects to compare against references. If None, references are compared against themselves.

  • n_range (int, default=4) – N-gram range for the vectorizer.

  • length_impact (float, default=0.05) – Impact of length difference on similarity scores.

  • key (callable, optional) – Fingerprint extractor for references. Defaults to identity.

  • key2 (callable, optional) – Fingerprint extractor for candidates. Defaults to key.

Returns:

Similarity matrix. Shape is (len(references), len(references)) for self-comparison, or (len(candidates), len(references)) for cross-comparison.

Return type:

ndarray

Examples

>>> m = similarity_matrix(["abc def", "abc deg", "xyz"])
>>> m.shape
(3, 3)
>>> m[0, 1] > 50
True
>>> m[0, 2] < 50
True
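
To give an intuition for n-gram matching, here is a pure-Python sketch based on cosine similarity between character n-gram counts, scaled to 0..100. It is a stand-in technique, not the gismap algorithm: `ngrams`, `cosine`, and `ngram_similarity` are hypothetical names, the length_impact, key, and key2 parameters are left out, and the result is a list of lists rather than an ndarray.

```python
from collections import Counter
from math import sqrt


def ngrams(text, n_range=4):
    """Count all character n-grams of length 1..n_range."""
    return Counter(text[i:i + n]
                   for n in range(1, n_range + 1)
                   for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def ngram_similarity(references, candidates=None, n_range=4):
    """Hypothetical stand-in: rows follow candidates, columns follow references."""
    ref_vecs = [ngrams(r, n_range) for r in references]
    cand_vecs = ref_vecs if candidates is None else [ngrams(c, n_range) for c in candidates]
    return [[100 * cosine(c, r) for r in ref_vecs] for c in cand_vecs]


m = ngram_similarity(["abc def", "abc deg", "xyz"])
print(len(m), len(m[0]))          # 3 3
print(m[0][1] > 50, m[0][2] < 50)  # True True
```
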