Corpus

The module contains simple wrappers to turn a wide range of document sources into something that Gismo will be able to process.

class gismo.corpus.Corpus(source=None, to_text=None)

The Corpus class is the starting point of any Gismo workflow. It abstracts dataset pre-processing: a corpus is a list of items (called documents in Gismo) together with a method that converts a document into a string. It is used to build an Embedding.

Parameters:
  • source (list) – The list of items that constitutes the dataset to analyze. More generally, any iterable object that implements __len__() and __getitem__() can be used as a source (see FileSource for an example).

  • to_text (function, optional) – The function that transforms an item from the source into plain text (str). If not set, it will default to the identity function lambda x: x.
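
To illustrate the kind of object the source parameter accepts, here is a minimal sketch of a custom source. The LineSource class and the first_15 callback are hypothetical names introduced for this example; they are not part of Gismo. The sketch only checks the sequence behavior described above and leaves the call to Corpus as a comment.

```python
class LineSource:
    """Hypothetical source: each item is one line of a text block."""

    def __init__(self, text):
        self._lines = text.splitlines()

    def __len__(self):
        return len(self._lines)

    def __getitem__(self, i):
        # Delegating to a list means IndexError is raised past the end,
        # which also makes the object iterable.
        return self._lines[i]


def first_15(doc):
    """Hypothetical to_text callback: keep the first 15 characters."""
    return f"{doc[:15]}..."


source = LineSource("Gizmo is a Mogwaï.\nThis is a sentence about Blade.")

print(len(source))          # 2
print(first_15(source[0]))  # Gizmo is a Mogw...

# With Gismo installed, such an object would then be passed as:
# corpus = Corpus(source, to_text=first_15)
```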

Examples

The following code uses the toy_source_text list as source and specifies that the text extraction method should be: take the first 15 characters and append "...".

When we iterate with the iterate() method, observe that the extraction is not applied.

>>> corpus = Corpus(toy_source_text, to_text=lambda x: f"{x[:15]}...")
>>> for c in corpus.iterate():
...    print(c)
Gizmo is a Mogwaï.
This is a sentence about Blade.
This is another sentence about Shadoks.
This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.
In chinese folklore, a Mogwaï is a demon.

When we iterate with the iterate_text() method, observe that the extraction is applied.

>>> for c in corpus.iterate_text():
...    print(c)
Gizmo is a Mogw...
This is a sente...
This is another...
This very long ...
In chinese folk...

A corpus object can be saved and loaded with the dump() and load() methods inherited from the MixInIO mixin class. The load() method is a class method to be used instead of the constructor.

>>> import tempfile
>>> corpus1 = Corpus(toy_source_text)
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...    corpus1.dump(filename="myfile", path=tmpdirname)
...    corpus2 = Corpus.load(filename="myfile", path=tmpdirname)
>>> corpus2[0]
'Gizmo is a Mogwaï.'
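
The dump()/load() pairing, with load() as a class method that replaces the constructor, is a common mixin pattern. The sketch below shows the general idea only. JsonIOMixin and TinyCorpus are hypothetical names, and JSON is an assumption chosen to keep the example dependency-free; it is not a description of how MixInIO actually serializes objects.

```python
import json
import tempfile
from pathlib import Path


class JsonIOMixin:
    """Hypothetical stand-in for a MixInIO-style mixin (not Gismo's code)."""

    def dump(self, filename, path="."):
        # Write the instance attributes to <path>/<filename>.json
        with open(Path(path) / f"{filename}.json", "w", encoding="utf-8") as f:
            json.dump(self.__dict__, f)

    @classmethod
    def load(cls, filename, path="."):
        # Class method: rebuild the object from disk without calling __init__
        with open(Path(path) / f"{filename}.json", encoding="utf-8") as f:
            state = json.load(f)
        obj = cls.__new__(cls)
        obj.__dict__.update(state)
        return obj


class TinyCorpus(JsonIOMixin):
    """Hypothetical minimal corpus holding only a list of strings."""

    def __init__(self, source):
        self.source = source

    def __getitem__(self, i):
        return self.source[i]


with tempfile.TemporaryDirectory() as tmpdirname:
    TinyCorpus(["Gizmo is a Mogwaï."]).dump(filename="myfile", path=tmpdirname)
    restored = TinyCorpus.load(filename="myfile", path=tmpdirname)

print(restored[0])  # Gizmo is a Mogwaï.
```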

merge_new_source(new_source, doc2key=None)

Incorporate new entries while avoiding the creation of duplicates. This method is typically used when you have a dynamic source, such as an RSS feed, and you want to update your corpus periodically.

Parameters:
  • new_source (list) – A source compatible with the current source (e.g. with a similar item type).

  • doc2key (function) – Callback that provides items with unique hashable keys, used to avoid duplicates.

Examples

The following code uses the toy_source_dict list as source and adds two new items, one of which is redundant.

>>> corpus = Corpus(toy_source_dict.copy(), to_text=lambda x: x['content'][:14])
>>> len(corpus)
5
>>> new_corpus = [{"title": "Another document", "content": "I don't know what to say!"},
...     {'title': 'Fifth Document', 'content': 'In chinese folklore, a Mogwaï is a demon.'}]
>>> corpus.merge_new_source(new_corpus, doc2key=lambda e: e['title'])
>>> len(corpus)
6
>>> for c in corpus.iterate_text():
...    print(c)
Gizmo is a Mog
This is a sent
This is anothe
This very long
In chinese fol
I don't know w
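
The key-based deduplication described above can be sketched in plain Python. The merge_by_key function is a hypothetical helper written for this illustration, assuming that an item is skipped whenever its key is already present; Gismo's own implementation may differ.

```python
def merge_by_key(source, new_source, doc2key):
    """Append to `source` the items of `new_source` whose key is not already present."""
    seen = {doc2key(doc) for doc in source}
    for doc in new_source:
        key = doc2key(doc)
        if key not in seen:
            source.append(doc)
            seen.add(key)


source = [
    {"title": "Third Document", "content": "This is another sentence about Shadoks."},
    {"title": "Fifth Document", "content": "In chinese folklore, a Mogwaï is a demon."},
]
new_source = [
    {"title": "Another document", "content": "I don't know what to say!"},
    {"title": "Fifth Document", "content": "In chinese folklore, a Mogwaï is a demon."},
]

merge_by_key(source, new_source, doc2key=lambda e: e["title"])
titles = [doc["title"] for doc in source]
print(titles)  # ['Third Document', 'Fifth Document', 'Another document']
```

Using the title as the key means two items with the same title are treated as the same document even if their content differs, so the choice of doc2key determines what counts as a duplicate.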

class gismo.corpus.CorpusList(corpus_list=None, filename=None, path='.')

This class makes a list of corpora behave like a single virtual corpus. This is useful to glue together corpora with distinct shapes and to_text() methods.

Parameters:
  • corpus_list (list of Corpus) – The list of corpora to glue.

Example

>>> multi_corp = CorpusList([Corpus(toy_source_text, lambda x: x[:15]+"..."),
...                          Corpus(toy_source_dict, lambda e: e['title'])])
>>> len(multi_corp)
10
>>> multi_corp[7]
{'title': 'Third Document', 'content': 'This is another sentence about Shadoks.'}
>>> for c in multi_corp.iterate_text():
...    print(c)
Gizmo is a Mogw...
This is a sente...
This is another...
This very long ...
In chinese folk...
First Document
Second Document
Third Document
Fourth Document
Fifth Document
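
One way to make several sequences behave like a single one, without copying their items, is to map a global index to a (part, local index) pair through cumulative lengths. The sketch below illustrates that idea; VirtualList is a hypothetical class and is not claimed to reflect how CorpusList is implemented.

```python
from bisect import bisect_right
from itertools import accumulate


class VirtualList:
    """Hypothetical sketch: present several sequences as one, without copying."""

    def __init__(self, parts):
        self.parts = parts
        # Cumulative lengths, e.g. two parts of size 5 give [5, 10].
        self.offsets = list(accumulate(len(p) for p in parts))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        k = bisect_right(self.offsets, i)        # which part holds index i
        start = self.offsets[k - 1] if k else 0  # global index of that part's first item
        return self.parts[k][i - start]


texts = ["a", "b", "c", "d", "e"]
dicts = [{"title": f"Document {n}"} for n in range(1, 6)]
merged = VirtualList([texts, dicts])

print(len(merged))  # 10
print(merged[7])    # {'title': 'Document 3'}
```

As in the example above, global index 7 falls in the second part at local index 2, that is, its third item.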