Corpus
The module contains simple wrappers to turn a wide range of document sources into something that Gismo will be able to process.
- class gismo.corpus.Corpus(source=None, to_text=None)
The Corpus class is the starting point of any Gismo workflow. It abstracts dataset pre-processing. It is just a list of items (called documents in Gismo) augmented with a method that describes how to convert a document to a string object. It is used to build an Embedding.
- Parameters:
source (list) – The list of items that constitutes the dataset to analyze. Actually, any iterable object with __len__() and __getitem__() methods can potentially be used as a source (see FileSource for an example).
to_text (function, optional) – The function that transforms an item from the source into plain text (str). If not set, it will default to the identity function lambda x: x.
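As an illustration of the source requirement above, here is a minimal sketch of a custom source that is not a list. The class name LineSource and the helper to_text are hypothetical (they are not part of Gismo); the sketch only shows the two methods the description asks for and how a text-extraction function is applied to each item.

```python
class LineSource:
    """Hypothetical source: exposes the non-empty lines of a text as items."""

    def __init__(self, text):
        # Keep only non-empty lines so that every item is a real document.
        self._lines = [line for line in text.splitlines() if line.strip()]

    def __len__(self):
        return len(self._lines)

    def __getitem__(self, i):
        return self._lines[i]


def to_text(item):
    # Same role as the to_text parameter: turn an item into a plain str.
    return item.strip().lower()


source = LineSource("Gizmo is a Mogwaï.\n\n  This is a sentence about Blade.  \n")
texts = [to_text(source[i]) for i in range(len(source))]
print(len(source))
print(texts)
```

Assuming Gismo is installed, an object of this kind should be usable as Corpus(source, to_text=to_text), since it provides both __len__() and __getitem__().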
Examples
The following code uses the toy_source_text list as source and specifies that the text extraction method should be: take the first 15 characters and append "...".
When we iterate with the iterate() method, observe that the extraction is not applied.
>>> corpus = Corpus(toy_source_text, to_text=lambda x: f"{x[:15]}...")
>>> for c in corpus.iterate():
...     print(c)
Gizmo is a Mogwaï.
This is a sentence about Blade.
This is another sentence about Shadoks.
This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.
In chinese folklore, a Mogwaï is a demon.
When we iterate with the iterate_text() method, observe that the extraction is applied.
>>> for c in corpus.iterate_text():
...     print(c)
Gizmo is a Mogw...
This is a sente...
This is another...
This very long ...
In chinese folk...
A corpus object can be saved/loaded with the dump() and load() methods inherited from the MixInIO mixin class. The load() method is a class method to be used instead of the constructor.
>>> import tempfile
>>> corpus1 = Corpus(toy_source_text)
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     corpus1.dump(filename="myfile", path=tmpdirname)
...     corpus2 = Corpus.load(filename="myfile", path=tmpdirname)
>>> corpus2[0]
'Gizmo is a Mogwaï.'
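The "class method instead of the constructor" pattern can be sketched as follows. This is not the MixInIO code: the names PickleIO and Notes are hypothetical, and the use of pickle is an assumption made for the sake of a self-contained example (this page does not state which serializer or file format Gismo uses).

```python
import pickle
import tempfile
from pathlib import Path


class PickleIO:
    """Hypothetical mixin: dump() saves an instance, load() rebuilds one."""

    def dump(self, filename, path="."):
        # Assumption: the instance attributes are stored with pickle.
        with open(Path(path) / filename, "wb") as f:
            pickle.dump(self.__dict__, f)

    @classmethod
    def load(cls, filename, path="."):
        obj = cls.__new__(cls)  # create the instance without calling __init__
        with open(Path(path) / filename, "rb") as f:
            obj.__dict__.update(pickle.load(f))
        return obj


class Notes(PickleIO):
    def __init__(self, items):
        self.items = items


with tempfile.TemporaryDirectory() as tmpdirname:
    Notes(["Gizmo is a Mogwaï."]).dump(filename="myfile", path=tmpdirname)
    restored = Notes.load(filename="myfile", path=tmpdirname)
print(restored.items[0])
```

Because load() is called on the class, it returns an instance of whichever subclass invoked it, which is why it can stand in for the constructor.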
- merge_new_source(new_source, doc2key=None)
Incorporate new entries while avoiding the creation of duplicates. This method is typically used when you have a dynamic source like an RSS feed and you want to periodically update your corpus.
- Parameters:
new_source (list) – Source compatible (e.g. similar item type) with the current source.
doc2key (function) – Callback that provides items with unique hashable keys, used to avoid duplicates.
Examples
The following code uses the toy_source_dict list as source and adds two new items, including a redundant one.
>>> corpus = Corpus(toy_source_dict.copy(), to_text=lambda x: x['content'][:14])
>>> len(corpus)
5
>>> new_corpus = [{"title": "Another document", "content": "I don't know what to say!"},
...               {'title': 'Fifth Document', 'content': 'In chinese folklore, a Mogwaï is a demon.'}]
>>> corpus.merge_new_source(new_corpus, doc2key=lambda e: e['title'])
>>> len(corpus)
6
>>> for c in corpus.iterate_text():
...     print(c)
Gizmo is a Mog
This is a sent
This is anothe
This very long
In chinese fol
I don't know w
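The role of doc2key can be illustrated with a small stand-alone sketch of key-based deduplication. The function merge_by_key is hypothetical and is not Gismo's implementation; it only shows one plausible way to skip items whose key is already present. The sample documents are illustrative.

```python
def merge_by_key(source, new_source, doc2key):
    """Append to source the items of new_source whose key is not present yet."""
    seen = {doc2key(doc) for doc in source}
    for doc in new_source:
        key = doc2key(doc)
        if key not in seen:
            source.append(doc)
            seen.add(key)  # also guards against duplicates inside new_source
    return source


docs = [{"title": "First Document", "content": "Gizmo is a Mogwaï."},
        {"title": "Fifth Document", "content": "In chinese folklore, a Mogwaï is a demon."}]
new_docs = [{"title": "Another document", "content": "I don't know what to say!"},
            {"title": "Fifth Document", "content": "In chinese folklore, a Mogwaï is a demon."}]

merge_by_key(docs, new_docs, doc2key=lambda e: e["title"])
titles = [d["title"] for d in docs]
print(titles)
```

Keys must be hashable because they are stored in a set, which matches the requirement stated for doc2key.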
- class gismo.corpus.CorpusList(corpus_list=None, filename=None, path='.')
This class makes a list of corpora behave like one single virtual corpus. This is useful to glue together corpora with distinct shapes and to_text() methods.
- Parameters:
corpus_list (list of Corpus) – The list of corpora to glue.
Example
>>> multi_corp = CorpusList([Corpus(toy_source_text, lambda x: x[:15]+"..."),
...                          Corpus(toy_source_dict, lambda e: e['title'])])
>>> len(multi_corp)
10
>>> multi_corp[7]
{'title': 'Third Document', 'content': 'This is another sentence about Shadoks.'}
>>> for c in multi_corp.iterate_text():
...     print(c)
Gizmo is a Mogw...
This is a sente...
This is another...
This very long ...
In chinese folk...
First Document
Second Document
Third Document
Fourth Document
Fifth Document
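The "single virtual corpus" behaviour amounts to mapping a global index to a pair (sub-collection, local index). The sketch below shows one possible way to do this with cumulative lengths and a binary search. ChainedView is a hypothetical name, and nothing here is taken from the CorpusList source code.

```python
from bisect import bisect_right
from itertools import accumulate


class ChainedView:
    """Hypothetical read-only view that makes several sequences look like one."""

    def __init__(self, parts):
        self.parts = parts
        # ends[k] is the global index just past the last item of parts[k].
        self.ends = list(accumulate(len(p) for p in parts))

    def __len__(self):
        return self.ends[-1] if self.ends else 0

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        k = bisect_right(self.ends, i)        # which part holds global index i
        start = self.ends[k - 1] if k else 0  # global index of that part's first item
        return self.parts[k][i - start]


view = ChainedView([["a", "b", "c", "d", "e"], ["v", "w", "x", "y", "z"]])
print(len(view))
print(view[7])
```

With two parts of five items each, global index 7 falls in the second part at local index 2, just as multi_corp[7] returns the third document of the second corpus in the example above.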