Datasets#
A collection of accessors to small (and less small) datasets.
- gismo.datasets.acm.flatten_acm(acm, min_size=5, max_depth=100, exclude=None, depth=0)[source]#
Selects subdomains of an ACM tree and returns them as a flat list.
- Parameters:
acm (list of dicts) – ACM tree, as returned by get_acm().
min_size (int, optional) – Size threshold for keeping a domain (avoids too-small domains).
max_depth (int, optional) – Depth threshold for keeping a domain (avoids too-deep domains).
exclude (iterable, optional) – Names of domains to discard.
depth (int, optional) – Current depth, used internally for recursion.
- Returns:
A flat list of domains described by name and query.
- Return type:
list of dicts
Example
>>> acm = flatten_acm(get_acm())
>>> acm[111]['name']
'Graph theory'
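The pruning parameters can trim the tree before flattening. A minimal sketch (the exact filtering semantics of min_size and exclude are assumptions, not guaranteed by this page):
>>> trimmed = flatten_acm(get_acm(), min_size=20, exclude=['Hardware'])  # doctest: +SKIP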
- gismo.datasets.acm.get_acm()[source]#
- Returns:
acm – Each dict is an ACM domain. It contains category name, query (concatenation of names from domain and subdomains), size (number of subdomains including itself), and children (list of domain dicts).
- Return type:
list of dicts
Examples
>>> acm = get_acm()
>>> subdomain = acm[4]['children'][2]['children'][1]
>>> subdomain['name']
'Software development process management'
>>> subdomain['size']
10
>>> subdomain['query']
'Software development process management, Software development methods, Rapid application development, Agile software development, Capability Maturity Model, Waterfall model, Spiral model, V-model, Design patterns, Risk management'
>>> len(acm)
13
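Since each domain dict exposes its children, the tree can be walked recursively. A small sketch that counts leaf domains, relying only on the structure described above:
>>> def count_leaves(domain):
...     if not domain['children']:
...         return 1
...     return sum(count_leaves(child) for child in domain['children'])
>>> count_leaves(subdomain) >= 1
True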
- gismo.datasets.dblp.DEFAULT_FIELDS = {'authors', 'title', 'type', 'venue', 'year'}#
Default fields to extract.
- gismo.datasets.dblp.DTD_URL = 'https://dblp.uni-trier.de/xml/dblp.dtd'#
URL of the dtd file (required to correctly parse non-ASCII characters).
- class gismo.datasets.dblp.Dblp(dblp_url='https://dblp.uni-trier.de/xml/dblp.xml.gz', filename='dblp', path='.')[source]#
The DBLP class can download the DBLP database and produce source files compatible with the FileSource class.
- Parameters:
dblp_url (str, optional) – URL of the dblp file (xml.gz format).
filename (str, optional) – Stem for the names of the produced files.
path (str, optional) – Destination directory of the files.
- build(refresh=False, d=2, fields=None)[source]#
Main class method. Create the data and index files.
- Parameters:
refresh (bool) – If True, rebuild the files even if they already exist.
d (int) – Depth level where articles are located. Usually 2 or 3 (2 for the main database).
fields (set, optional) – Set of fields to collect. Defaults to DEFAULT_FIELDS.
Example
By default, the class downloads the full dataset. Here, we limit the download to a single entry.
>>> toy_url = "https://dblp.org/pers/xx/m/Mathieu:Fabien.xml"
>>> import tempfile
>>> from gismo.filesource import FileSource
>>> tmp = tempfile.TemporaryDirectory()
>>> dblp = Dblp(dblp_url=toy_url, path=tmp.name)
>>> dblp.build()  # doctest: +ELLIPSIS
Retrieve https://dblp.org/pers/xx/m/Mathieu:Fabien.xml from the Internet.
DBLP database downloaded to ...xml.gz.
Converting DBLP database from ...xml.gz (may take a while).
Building Index.
Conversion done.
By default, build uses existing files.
>>> dblp.build()  # doctest: +ELLIPSIS
File ...xml.gz already exists. Use refresh option to overwrite.
File ...data already exists. Use refresh option to overwrite.
The refresh parameter can be used to ignore existing files.
>>> dblp.build(d=3, refresh=True)  # doctest: +ELLIPSIS
Retrieve https://dblp.org/pers/xx/m/Mathieu:Fabien.xml from the Internet.
DBLP database downloaded to ...xml.gz.
Converting DBLP database from ...xml.gz (may take a while).
Building Index.
Conversion done.
The resulting files can be used to create a FileSource.
>>> source = FileSource(filename="dblp", path=tmp.name)
>>> art = [s for s in source if s['title'] == "Can P2P networks be super-scalable?"][0]
>>> art['authors']  # doctest: +ELLIPSIS
['François Baccelli', 'Fabien Mathieu', 'Ilkka Norros', 'Rémi Varloot']
Don’t forget to close the source after use.
>>> source.close()
>>> tmp.cleanup()
- gismo.datasets.dblp.LIST_TYPE_FIELDS = {'authors', 'urls'}#
DBLP fields with possibly multiple entries.
- gismo.datasets.dblp.URL = 'https://dblp.uni-trier.de/xml/dblp.xml.gz'#
URL of the full DBLP database.
- gismo.datasets.dblp.element_to_filesource(elt, data_handler, index, fields)[source]#
Converts the XML element elt into a dict if it is an article; compresses and writes the dict to data_handler; appends the corresponding position in data_handler to index.
- Parameters:
elt (Any) – an XML element.
data_handler (file_descriptor) – Where the compressed data will be stored. Must be writable.
index (list) – Starting positions in data_handler of all previously processed elements.
fields (set) – Set of fields to retrieve.
- Returns:
Always returns True, for compatibility with the XML parser.
- Return type:
bool
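element_to_filesource() is designed to be plugged into fast_iter() (documented below), which forwards its keyword arguments to the callback. A minimal sketch of that wiring, with hypothetical local file names and execution skipped:
>>> from lxml import etree
>>> from gismo.datasets.dblp import element_to_filesource, fast_iter, DEFAULT_FIELDS
>>> index = []
>>> with open('dblp.data', 'wb') as f:  # doctest: +SKIP
...     fast_iter(etree.iterparse('dblp.xml'),  # hypothetical local DBLP dump
...               element_to_filesource,
...               data_handler=f, index=index, fields=DEFAULT_FIELDS)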
- gismo.datasets.dblp.element_to_source(elt, source, fields)[source]#
Tests whether elt is an article; if so, converts it to a dict and appends it to source.
- gismo.datasets.dblp.fast_iter(context, func, d=2, **kwargs)[source]#
Applies func to all XML elements at depth d of the XML parser context. **kwargs are passed to func.
Modified version of a modified version of Liza Daly’s fast_iter. Inspired by https://stackoverflow.com/questions/4695826/efficient-way-to-iterate-through-xml-elements
- Parameters:
context (XMLparser) – A parser obtained from etree.iterparse
func (function) – How to process the elements
d (int, optional) – Depth to process elements.
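Any callable with a compatible signature can be used. A toy sketch that prints the title of each depth-2 element (show_title and the file name are illustrative, not part of the API):
>>> from lxml import etree
>>> def show_title(elt, prefix=''):
...     print(prefix, elt.findtext('title'))
...     return True  # mimic element_to_filesource: keep the parser going
>>> fast_iter(etree.iterparse('dblp.xml'), show_title, d=2, prefix='*')  # doctest: +SKIP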
- gismo.datasets.dblp.url2source(url, fields=None)[source]#
Directly transforms the URL of a DBLP XML file into a list of dicts. Only use for datasets that fit in memory (e.g. the articles of one author). If the dataset does not fit, consider using the Dblp class instead.
- Parameters:
url (str) – URL of the dblp XML file.
fields (set, optional) – Set of fields to collect. Defaults to DEFAULT_FIELDS.
- Returns:
source – Articles retrieved from the URL.
- Return type:
list of dicts
Example
>>> source = url2source("https://dblp.org/pers/xx/t/Tixeuil:S=eacute=bastien.xml", fields={'authors', 'title', 'year', 'venue', 'urls'})
>>> art = [s for s in source if s['title'] == "Distributed Computing with Mobile Robots: An Introductory Survey."][0]
>>> art['authors']
['Maria Potop-Butucaru', 'Michel Raynal', 'Sébastien Tixeuil']
>>> art['urls']
['https://doi.org/10.1109/NBiS.2011.55', 'https://doi.ieeecomputersociety.org/10.1109/NBiS.2011.55']
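The returned list can feed gismo's other building blocks directly. A sketch, assuming the Corpus class from gismo.corpus and its to_text callback:
>>> from gismo.corpus import Corpus
>>> corpus = Corpus(source, to_text=lambda art: art['title'])  # doctest: +SKIP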
- gismo.datasets.dblp.xml_element_to_dict(elt, fields)[source]#
Converts the XML element elt into a dict if it is a paper.
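A standalone sketch on a hand-made element; the XML snippet is illustrative, not real DBLP data, so the call is skipped:
>>> from lxml import etree
>>> elt = etree.fromstring('<article><title>A Toy Paper</title><year>2020</year></article>')
>>> xml_element_to_dict(elt, fields={'title', 'year'})  # doctest: +SKIP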
- gismo.datasets.reuters.get_reuters_entry(name, z)[source]#
Reads the Reuters news item referenced by name in the zip archive z and returns it as a dict.
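A sketch of manual use; the entry name is taken from the archive itself rather than hard-coded, and the download is skipped here (it assumes the first archive member is a news file):
>>> import io, zipfile
>>> from urllib.request import urlopen
>>> z = zipfile.ZipFile(io.BytesIO(urlopen('https://github.com/balouf/datasets/raw/main/C50.zip').read()))  # doctest: +SKIP
>>> entry = get_reuters_entry(z.namelist()[0], z)  # doctest: +SKIP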
- gismo.datasets.reuters.get_reuters_news(url='https://github.com/balouf/datasets/raw/main/C50.zip')[source]#
Returns a list of news items from the Reuters C50 news dataset.
Acknowledgments
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
ZhiLiu, e-mail: liuzhi8673 ‘@’ gmail.com, institution: National Engineering Research Center for E-Learning, Hubei Wuhan, China
- Parameters:
url (str) – Location of the C50 dataset
- Returns:
The C50 news items.
- Return type:
list of dicts
Example
Cf. Sentencizer.
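For a direct call, a minimal sketch (network required, hence skipped; the exact keys of each news dict are not specified here):
>>> news = get_reuters_news()  # doctest: +SKIP
>>> isinstance(news, list)  # doctest: +SKIP
True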