FileSource

This module reads documents one by one from disk instead of loading them all into memory, which is useful for very large corpora.

class gismo.filesource.FileSource(filename='mysource', path='.', load_source=False)

Yield a file source as a list. Assumes the existence of two files:

The mysource.data file contains the stacked items, each compressed with zlib;
The mysource.index file contains the list of pointers used to seek items in the data file.
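The two-file layout described above can be sketched as follows. This is a minimal illustration, not gismo's actual implementation: the helper names write_stacked and read_item are hypothetical, and both the JSON serialization of items and the JSON-encoded index are assumptions, since only the zlib compression and the list of pointers are stated here.

```python
import json
import tempfile
import zlib
from pathlib import Path


def write_stacked(items, stem, path="."):
    """Write items to <stem>.data and their byte offsets to <stem>.index."""
    path = Path(path)
    offsets = [0]  # offsets[i] is where item i starts; offsets[-1] is the end
    with open(path / f"{stem}.data", "wb") as data:
        for item in items:
            # Assumption: items are JSON-serializable dicts.
            data.write(zlib.compress(json.dumps(item).encode("utf8")))
            offsets.append(data.tell())
    # Assumption: the index is stored as JSON; the real format may differ.
    (path / f"{stem}.index").write_text(json.dumps(offsets))


def read_item(i, stem, path="."):
    """Read item i by seeking directly to its compressed block."""
    path = Path(path)
    offsets = json.loads((path / f"{stem}.index").read_text())
    with open(path / f"{stem}.data", "rb") as data:
        data.seek(offsets[i])
        blob = data.read(offsets[i + 1] - offsets[i])
    return json.loads(zlib.decompress(blob).decode("utf8"))


docs = [
    {"title": "First", "content": "Gizmo is a Mogwaï."},
    {"title": "Second", "content": "This is a sentence about Blade."},
]
with tempfile.TemporaryDirectory() as dirname:
    write_stacked(docs, "demo", dirname)
    second = read_item(1, "demo", dirname)
print(second["title"])  # Second
```

Because each item is compressed separately and its boundaries are recorded, a single item can be fetched with one seek and one read, without decompressing anything else.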

The resulting source object is fully compatible with the Corpus class:

It can be iterated ([item for item in source]);
It can yield single items (source[i]);
It has a length (len(source)).

More advanced features such as slicing are not implemented.
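Since only integer indexing and len() are available, a slice can be emulated with a small helper. The sketch below is an illustration under assumptions: take is a hypothetical helper, and IndexOnlySource is a stand-in class used so the example runs without gismo. How the real FileSource reacts to a slice object is not specified here; the stand-in simply raises TypeError.

```python
class IndexOnlySource:
    """Hypothetical stand-in: supports len() and integer indexing, not slices."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, i):
        if not isinstance(i, int):
            raise TypeError("only integer indices are supported")
        return self._items[i]


def take(source, start=None, stop=None, step=None):
    """Emulate source[start:stop:step] using integer lookups only."""
    # slice.indices() resolves None and negative bounds against the length.
    indices = range(*slice(start, stop, step).indices(len(source)))
    return [source[i] for i in indices]


source = IndexOnlySource(["a", "b", "c", "d", "e"])
print(take(source, 0, None, 2))  # ['a', 'c', 'e']
print(take(source, -2))          # ['d', 'e']
```

The helper only relies on len(source) and source[i], so it should apply to any object offering those two operations.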

Parameters:

path (str) – Location of the files
filename (str) – Stem shared by the two files (.data and .index)
load_source (bool) – Whether to load the data into RAM

Examples

>>> import tempfile
>>> with tempfile.TemporaryDirectory() as dirname:
...    create_file_source(filename='mysource', path=dirname)
...    source = FileSource(filename='mysource', path=dirname, load_source=True)
...    content = [e['content'] for e in source]
>>> content[:3]
['Gizmo is a Mogwaï.', 'This is a sentence about Blade.', 'This is another sentence about Shadoks.']

Note: when the source is read from file (load_source=False, the default behavior), close the source afterward to avoid leaving file handles open, or use a context manager.

>>> with tempfile.TemporaryDirectory() as dirname:
...    create_file_source(filename='mysource', path=dirname)
...    with FileSource(filename='mysource', path=dirname) as source:
...        size = len(source)
...        items = [source[i] for i in range(0, size, 2)]
>>> size
5
>>> items
[{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'},
{'title': 'Third Document', 'content': 'This is another sentence about Shadoks.'},
{'title': 'Fifth Document', 'content': 'In chinese folklore, a Mogwaï is a demon.'}]

gismo.filesource.create_file_source(source=None, filename='mysource', path='.')

Write a source (list of dict) to files in the format used by FileSource. This is mainly useful for transferring a source from a computer with a lot of RAM to one with less. For more complex cases, e.g. when the initial source is itself a very large file, a dedicated converter has to be provided.

Parameters:

source (list of dict) – The source to write
filename (str) – Stem of the file. Two files will be created, with suffixes .index and .data.
path (str or Path) – Destination directory
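For the "very large file" case mentioned above, a dedicated converter could stream the input instead of building a list first. The sketch below is only one possible shape for such a converter, under several assumptions: convert_jsonl is a hypothetical name, the input is assumed to be a JSON-lines file (one JSON object per line), and the output uses an assumed layout (zlib-compressed JSON items, JSON-encoded offsets) that would have to be adapted to the exact format FileSource expects.

```python
import json
import tempfile
import zlib
from pathlib import Path


def convert_jsonl(jsonl_path, stem, path="."):
    """Stream a JSON-lines file into <stem>.data and <stem>.index.

    Only one line is held in memory at a time, plus the list of offsets.
    Returns the number of items written.
    """
    path = Path(path)
    offsets = [0]
    with open(jsonl_path, encoding="utf8") as src, \
            open(path / f"{stem}.data", "wb") as data:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            json.loads(line)  # fail early on malformed input
            data.write(zlib.compress(line.encode("utf8")))
            offsets.append(data.tell())
    # Assumption: JSON-encoded index; the real index format may differ.
    (path / f"{stem}.index").write_text(json.dumps(offsets))
    return len(offsets) - 1


with tempfile.TemporaryDirectory() as dirname:
    big_file = Path(dirname) / "big.jsonl"
    big_file.write_text(
        "\n".join(json.dumps({"content": f"doc {i}"}) for i in range(3)),
        encoding="utf8",
    )
    count = convert_jsonl(big_file, "converted", dirname)
    offsets = json.loads((Path(dirname) / "converted.index").read_text())
    raw = (Path(dirname) / "converted.data").read_bytes()
    last = json.loads(zlib.decompress(raw[offsets[-2]:offsets[-1]]))
print(count, last)  # 3 {'content': 'doc 2'}
```

Memory use grows only with the number of items (one integer offset each), not with the size of the documents themselves.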
