FileSource

This module reads documents one by one from disk instead of loading them all into memory, which is useful for very large corpora.

class gismo.filesource.FileSource(filename='mysource', path='.', load_source=False)

Expose a file-based source as a list-like object. Assumes the existence of two files: mysource.data contains the stacked items, each compressed with zlib; mysource.index contains the list of pointers used to seek items in the data file.
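
To make the idea of stacked items plus a pointer list concrete, here is a minimal in-memory sketch. It is an illustration only: the JSON serialization, the plain Python list of pointers, and the helper name `read_item` are assumptions made for this example, not a description of the bytes FileSource actually reads.

```python
import io
import json
import zlib

toy_items = [{"content": "alpha"}, {"content": "beta"}, {"content": "gamma"}]

# Build the "data" part: compressed items stacked back to back.
data = io.BytesIO()
pointers = [0]  # pointers[i] is the byte offset where item i starts
for item in toy_items:
    data.write(zlib.compress(json.dumps(item).encode("utf8")))
    pointers.append(data.tell())  # end of this item = start of the next


def read_item(i):
    """Seek to item i and decompress only its bytes (assumed toy layout)."""
    data.seek(pointers[i])
    blob = data.read(pointers[i + 1] - pointers[i])
    return json.loads(zlib.decompress(blob).decode("utf8"))


second = read_item(1)
```

Because each item is compressed separately, fetching one item only touches its own byte range, which is what makes one-by-one reading possible.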

The resulting source object is fully compatible with the Corpus class:
  • It can be iterated ([item for item in source]);

  • It can yield single items (source[i]);

  • It has a length (len(source)).

More advanced features such as slicing are not implemented.
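
Since slices are not available, a range of items can be collected by indexing one item at a time. The sketch below uses a stand-in class, `IndexOnlySource` (hypothetical, not part of gismo), that mimics the three supported operations. How FileSource itself reacts to a slice is not documented here, so the `TypeError` raised by the stand-in is an assumption. The helper `take` relies only on `len(source)` and `source[i]`.

```python
class IndexOnlySource:
    """Hypothetical stand-in: iteration, integer indexing, and len() only."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, i):
        if not isinstance(i, int):
            # Assumed behavior for this toy class, not gismo's.
            raise TypeError("only integer indices are supported")
        return self._items[i]

    def __iter__(self):
        return (self[i] for i in range(len(self)))


def take(source, start, stop):
    """Return items start..stop-1, with stop clamped to the source length."""
    stop = min(stop, len(source))
    return [source[i] for i in range(start, stop)]


source = IndexOnlySource({"content": f"document {k}"} for k in range(5))
first_three = take(source, 0, 3)
```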

Parameters:
  • path (str) – Location of the files

  • filename (str) – Stem of the file

  • load_source (bool) – Whether to load the data into RAM

Examples

>>> import tempfile
>>> from gismo.filesource import FileSource, create_file_source
>>> with tempfile.TemporaryDirectory() as dirname:
...    create_file_source(filename='mysource', path=dirname)
...    source = FileSource(filename='mysource', path=dirname, load_source=True)
...    content = [e['content'] for e in source]
>>> content[:3]
['Gizmo is a Mogwaï.', 'This is a sentence about Blade.', 'This is another sentence about Shadoks.']

Note: when the source is read from disk (load_source=False, the default behavior), you need to close the source afterwards to avoid leaving file handles open.

>>> with tempfile.TemporaryDirectory() as dirname:
...    create_file_source(filename='mysource', path=dirname)
...    source = FileSource(filename='mysource', path=dirname)
...    size = len(source)
...    item = source[0]
...    source.close()
>>> size
5
>>> item
{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'}
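
One way to make sure close() is always called, even if an exception occurs while reading, is the standard library's `contextlib.closing`, which works with any object that has a `close()` method. The sketch below uses a hypothetical stand-in, `HandleSource`, so that it runs without gismo installed; with the real class, the same pattern would wrap `FileSource(filename='mysource', path=dirname)`.

```python
import io
from contextlib import closing


class HandleSource:
    """Hypothetical stand-in that holds an open handle, like a source read from disk."""

    def __init__(self, items):
        self._items = list(items)
        self._handle = io.BytesIO(b"")  # stands in for an open data file

    def __len__(self):
        return len(self._items)

    def __getitem__(self, i):
        return self._items[i]

    def close(self):
        self._handle.close()

    @property
    def closed(self):
        return self._handle.closed


with closing(HandleSource(["a", "b", "c"])) as source:
    size = len(source)
    item = source[0]

# closing() called source.close() on exit, whether or not the block raised.
was_closed = source.closed
```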

gismo.filesource.create_file_source(source=None, filename='mysource', path='.')

Write a source (list of dict) to files in the format used by FileSource. This is only useful for transferring a source from a computer with a lot of RAM to a computer with less RAM. For more complex cases, e.g. when the initial source is itself a very large file, a dedicated converter has to be written.

Parameters:
  • source (list of dict) – The source to write

  • filename (str) – Stem of the file. Two files will be created, with suffixes .index and .data.

  • path (str or Path) – Destination directory
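
For the large-source case mentioned above, a converter can write items one at a time so that the whole source never sits in memory. The following is a minimal sketch under stated assumptions: the input is any iterable of dicts (for instance a generator reading a large file line by line), items are serialized as JSON, and the index is stored as a JSON list of byte offsets. The function name `stream_to_files` and this on-disk layout are illustrative and are not guaranteed to match what FileSource expects; a real converter would have to reproduce the exact format written by create_file_source.

```python
import json
import tempfile
import zlib
from pathlib import Path


def stream_to_files(items, filename, path):
    """Consume an iterable of dicts lazily; return the number of items written.

    Assumed toy layout: <filename>.data holds stacked zlib blobs,
    <filename>.index holds a JSON list of byte offsets.
    """
    path = Path(path)
    offsets = [0]
    with open(path / f"{filename}.data", "wb") as data:
        for item in items:
            data.write(zlib.compress(json.dumps(item).encode("utf8")))
            offsets.append(data.tell())  # start of the next item
    with open(path / f"{filename}.index", "w", encoding="utf8") as index:
        json.dump(offsets, index)
    return len(offsets) - 1


def generate_documents(n):
    """Toy generator standing in for a reader of a very large file."""
    for k in range(n):
        yield {"title": f"Document {k}", "content": f"Toy content number {k}."}


with tempfile.TemporaryDirectory() as dirname:
    written = stream_to_files(generate_documents(1000), "bigsource", dirname)
    data_size = (Path(dirname) / "bigsource.data").stat().st_size
    offsets = json.loads((Path(dirname) / "bigsource.index").read_text(encoding="utf8"))
```

Only the offset list grows with the number of items (one integer per item); the documents themselves are compressed and written as soon as they are produced.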