Datasets#

Helpers to access a collection of small (or not-so-small) datasets.

gismo.datasets.acm.flatten_acm(acm, min_size=5, max_depth=100, exclude=None, depth=0)[source]#

Select subdomains of an ACM tree and return them as a list.

Parameters:
  • acm (list of dicts) – acm tree from get_acm.

  • min_size (int) – size threshold to select a domain (avoids small domains)

  • max_depth (int) – depth threshold to select a domain (avoids deep domains)

  • exclude (list) – list of domains to exclude from the results

Returns:

A flat list of domains described by name and query.

Return type:

list

Example

>>> acm = flatten_acm(get_acm())
>>> acm[111]['name']
'Graph theory'
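
Each returned domain is a plain dict; per the return description below, it carries exactly a name and a query (a hedged check):

>>> sorted(acm[111])
['name', 'query']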
gismo.datasets.acm.get_acm()[source]#
Returns:

acm – Each dict describes an ACM domain. It contains the category name, query (concatenation of names from the domain and its subdomains), size (number of subdomains, including itself), and children (list of domain dicts).

Return type:

list of dicts

Examples

>>> acm = get_acm()
>>> subdomain = acm[4]['children'][2]['children'][1]
>>> subdomain['name']
'Software development process management'
>>> subdomain['size']
10
>>> subdomain['query']
'Software development process management, Software development methods, Rapid application development, Agile software development, Capability Maturity Model, Waterfall model, Spiral model, V-model, Design patterns, Risk management'
>>> len(acm)
13
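
Since each domain embeds its children, the taxonomy can be walked recursively. A minimal sketch relying only on the documented fields (the count helper is illustrative, not part of gismo; if size is indeed the number of subdomains including itself, the check should hold):

>>> def count(domains):
...     return len(domains) + sum(count(dom['children']) for dom in domains)
>>> count([subdomain]) == subdomain['size']
True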
gismo.datasets.dblp.DEFAULT_FIELDS = {'authors', 'title', 'type', 'venue', 'year'}#

Default fields to extract.

gismo.datasets.dblp.DTD_URL = 'https://dblp.uni-trier.de/xml/dblp.dtd'#

URL of the dtd file (required to correctly parse non-ASCII characters).

class gismo.datasets.dblp.Dblp(dblp_url='https://dblp.uni-trier.de/xml/dblp.xml.gz', filename='dblp', path='.')[source]#

The Dblp class can download the DBLP database and produce source files compatible with the FileSource class.

Parameters:
  • dblp_url (str, optional) – Alternative URL for the dblp.xml.gz file

  • filename (str) – Stem of the files (suffixes will be appended)

  • path (str or path, optional) – Destination of the files

build(refresh=False, d=2, fields=None)[source]#

Main class method. Creates the data and index files.

Parameters:
  • refresh (bool) – Whether to rebuild the files if they already exist.

  • d (int) – Depth level where articles are located. Usually 2 or 3 (2 for the main database).

  • fields (set, optional) – Set of fields to collect. Defaults to DEFAULT_FIELDS.

Example

By default, the class downloads the full dataset. Here we restrict it to the records of a single author.

>>> toy_url = "https://dblp.org/pers/xx/m/Mathieu:Fabien.xml"
>>> import tempfile
>>> from gismo.filesource import FileSource
>>> tmp = tempfile.TemporaryDirectory()
>>> dblp = Dblp(dblp_url=toy_url, path=tmp.name)
>>> dblp.build() # doctest: +ELLIPSIS
Retrieve https://dblp.org/pers/xx/m/Mathieu:Fabien.xml from the Internet.
DBLP database downloaded to ...xml.gz.
Converting DBLP database from ...xml.gz (may take a while).
Building Index.
Conversion done.

By default, build uses existing files.

>>> dblp.build() # doctest: +ELLIPSIS
File ...xml.gz already exists. Use refresh option to overwrite.
File ...data already exists. Use refresh option to overwrite.

The refresh parameter can be used to ignore existing files.

>>> dblp.build(d=3, refresh=True) # doctest: +ELLIPSIS
Retrieve https://dblp.org/pers/xx/m/Mathieu:Fabien.xml from the Internet.
DBLP database downloaded to ...xml.gz.
Converting DBLP database from ...xml.gz (may take a while).
Building Index.
Conversion done.

The resulting files can be used to create a FileSource.

>>> source = FileSource(filename="dblp", path=tmp.name)
>>> art = [s for s in source if s['title']=="Can P2P networks be super-scalable?"][0]
>>> art['authors'] # doctest: +ELLIPSIS
['François Baccelli', 'Fabien Mathieu', 'Ilkka Norros', 'Rémi Varloot']

Don’t forget to close source after use.

>>> source.close()
>>> tmp.cleanup()
gismo.datasets.dblp.LIST_TYPE_FIELDS = {'authors', 'urls'}#

DBLP fields with possibly multiple entries.

gismo.datasets.dblp.URL = 'https://dblp.uni-trier.de/xml/dblp.xml.gz'#

URL of the full DBLP database.

gismo.datasets.dblp.element_to_filesource(elt, data_handler, index, fields)[source]#
  • Converts the XML element elt into a dict if it is an article.

  • Compresses and writes the dict to data_handler.

  • Appends the file position in data_handler to index.

Parameters:
  • elt (Any) – an XML element.

  • data_handler (file_descriptor) – Where the compressed data will be stored. Must be writable.

  • index (list) – a list containing the starting position in data_handler of each previously processed element.

  • fields (set) – Set of fields to retrieve.

Returns:

Always returns True, for compatibility with the XML parser.

Return type:

bool
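
A hedged sketch of manual wiring, using functools.partial to freeze the bookkeeping arguments (the output file name is hypothetical; Dblp.build normally does all of this for you):

>>> from functools import partial
>>> index = []
>>> data_handler = open("dblp.data", "wb")  # hypothetical output file
>>> func = partial(element_to_filesource, data_handler=data_handler, index=index, fields=DEFAULT_FIELDS)

The resulting func(elt) callback can then be handed to fast_iter, documented below.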

gismo.datasets.dblp.element_to_source(elt, source, fields)[source]#

Tests if elt is an article; if so, converts it to a dictionary and appends it to source.

Parameters:
  • elt (Any) – an XML element.

  • source (list) – the source in construction.

  • fields (set) – Set of fields to retrieve.

gismo.datasets.dblp.fast_iter(context, func, d=2, **kwargs)[source]#

Applies func to all XML elements at depth d of the XML parser context. **kwargs are passed to func.

Modified version of a modified version of Liza Daly's fast_iter, inspired by https://stackoverflow.com/questions/4695826/efficient-way-to-iterate-through-xml-elements

Parameters:
  • context (XMLparser) – A parser obtained from etree.iterparse

  • func (function) – How to process the elements

  • d (int, optional) – Depth to process elements.
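
A hedged sketch of the intended wiring, here with element_to_source from above (the file name is hypothetical and the iterparse configuration is an assumption; gismo's own pipeline may configure the parser differently):

>>> from lxml import etree  # assumption: an lxml-style parser
>>> source = []
>>> context = etree.iterparse("dblp_extract.xml", events=("start", "end"))  # hypothetical extract
>>> fast_iter(context, element_to_source, d=2, source=source, fields={'title', 'authors', 'year'})

Since **kwargs are forwarded to func, source and fields reach element_to_source without an explicit partial.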

gismo.datasets.dblp.url2source(url, fields=None)[source]#

Directly transforms the URL of a DBLP XML file into a list of dictionaries. Use only for datasets that fit into memory (e.g. the articles of one author). If the dataset does not fit, consider using the Dblp class instead.

Parameters:
  • url (str) – the URL to fetch.

  • fields (set) – Set of DBLP fields to capture.

Returns:

source – Articles retrieved from the URL

Return type:

list of dict

Example

>>> source = url2source("https://dblp.org/pers/xx/t/Tixeuil:S=eacute=bastien.xml", fields={'authors', 'title', 'year', 'venue', 'urls'})
>>> art = [s for s in source if s['title']=="Distributed Computing with Mobile Robots: An Introductory Survey."][0]
>>> art['authors']
['Maria Potop-Butucaru', 'Michel Raynal', 'Sébastien Tixeuil']
>>> art['urls']
['https://doi.org/10.1109/NBiS.2011.55', 'https://doi.ieeecomputersociety.org/10.1109/NBiS.2011.55']
gismo.datasets.dblp.xml_element_to_dict(elt, fields)[source]#

Converts the XML element elt into a dict if it is a paper.

Parameters:
  • elt (Any) – an XML element.

  • fields (set) – Set of entries to retrieve.

Returns:

Article dictionary if the element contains the attributes of an article, None otherwise.

Return type:

dict or None

gismo.datasets.reuters.get_reuters_entry(name, z)[source]#

Reads the Reuters news item referenced by name in the zip archive z and returns it as a dict.

Parameters:
  • name (str) – Location of the file inside the Reuters archive

  • z (ZipFile) – Zipfile descriptor of the Reuters archive

Returns:

entry – dict with keys set (C50test or C50train), author, id, and content

Return type:

dict
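
A hedged sketch of manual use, fetching the archive from the URL that get_reuters_news (below) uses by default (the .txt filter is an assumption about the archive layout):

>>> import io, zipfile
>>> from urllib.request import urlopen
>>> url = 'https://github.com/balouf/datasets/raw/main/C50.zip'
>>> z = zipfile.ZipFile(io.BytesIO(urlopen(url).read()))
>>> name = next(n for n in z.namelist() if n.endswith('.txt'))  # assumption: one news item per .txt file
>>> entry = get_reuters_entry(name, z)
>>> sorted(entry)  # keys per the description above
['author', 'content', 'id', 'set']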

gismo.datasets.reuters.get_reuters_news(url='https://github.com/balouf/datasets/raw/main/C50.zip')[source]#

Returns a list of news items from the Reuters C50 dataset.

Acknowledgments

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

ZhiLiu, e-mail: liuzhi8673 ‘@’ gmail.com, institution: National Engineering Research Center for E-Learning, Hubei Wuhan, China

Parameters:

url (str) – Location of the C50 dataset

Returns:

The C50 news as a list of dict

Return type:

list

Example

Cf. Sentencizer.
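
For a standalone sketch (the archive is downloaded from the URL above, which may take a moment; keys follow get_reuters_entry):

>>> news = get_reuters_news()
>>> sorted(news[0])
['author', 'content', 'id', 'set']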