FileSource#
This module can be used to read documents one by one from disk instead of loading them all into memory. Useful for very large corpora.
- class gismo.filesource.FileSource(filename='mysource', path='.', load_source=False)[source]#
Yield a file source as a list. Assumes the existence of two files:
- The mysource.data file contains the stacked items. Each item is compressed with zlib;
- The mysource.index file contains the list of pointers used to seek items in the data file.
The resulting source object is fully compatible with the Corpus class:
- It can be iterated ([item for item in source]);
- It can yield single items (source[i]);
- It has a length (len(source)).
More advanced features such as slicing are not implemented.
- Parameters:
Examples
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as dirname:
...     create_file_source(filename='mysource', path=dirname)
...     source = FileSource(filename='mysource', path=dirname, load_source=True)
...     content = [e['content'] for e in source]
>>> content[:3]
['Gizmo is a Mogwaï.', 'This is a sentence about Blade.', 'This is another sentence about Shadoks.']
Note: when the source is read from file (load_source=False, the default behavior), you need to close the source afterwards to avoid pending file handles.
>>> with tempfile.TemporaryDirectory() as dirname:
...     create_file_source(filename='mysource', path=dirname)
...     source = FileSource(filename='mysource', path=dirname)
...     size = len(source)
...     item = source[0]
...     source.close()
>>> size
5
>>> item
{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'}
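To make the two-file layout described above more concrete, here is a minimal sketch of how a data file of stacked zlib-compressed items and an index of pointers can support iteration, item access, and len() without loading everything in memory. It is not the actual gismo implementation: the names write_items and SketchSource, the use of pickle for serialization, the storage of the index as a pickled list of byte offsets, and the sample items are all assumptions made for illustration.

```python
import os
import pickle
import tempfile
import zlib


def write_items(items, data_path, index_path):
    """Stack zlib-compressed items in one file and record their byte offsets."""
    offsets = [0]
    with open(data_path, "wb") as data:
        for item in items:
            # Assumption: items are serialized with pickle before compression.
            data.write(zlib.compress(pickle.dumps(item)))
            offsets.append(data.tell())
    with open(index_path, "wb") as index:
        pickle.dump(offsets, index)


class SketchSource:
    """Hypothetical reader that keeps only the offsets in memory."""

    def __init__(self, data_path, index_path):
        with open(index_path, "rb") as index:
            self.offsets = pickle.load(index)
        # The data file stays open, hence the need for close().
        self.data = open(data_path, "rb")

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        # Raising IndexError past the end is what lets a plain for loop stop.
        if not 0 <= i < len(self):
            raise IndexError(i)
        start, end = self.offsets[i], self.offsets[i + 1]
        self.data.seek(start)
        return pickle.loads(zlib.decompress(self.data.read(end - start)))

    def close(self):
        self.data.close()


# Illustrative items only.
items = [
    {"title": "First Document", "content": "Gizmo is a Mogwaï."},
    {"title": "Another Document", "content": "This is a sentence about Blade."},
]

with tempfile.TemporaryDirectory() as dirname:
    data_path = os.path.join(dirname, "sketch.data")
    index_path = os.path.join(dirname, "sketch.index")
    write_items(items, data_path, index_path)
    source = SketchSource(data_path, index_path)
    size = len(source)
    first = source[0]
    contents = [item["content"] for item in source]
    source.close()
```

In this sketch only the list of offsets lives in memory, and each access is one seek followed by one decompression. The open file handle is the reason a close() call is needed when items are read from disk.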
- gismo.filesource.create_file_source(source=None, filename='mysource', path='.')[source]#
Write a source (a list of dicts) to files in the format used by FileSource. Only useful to transfer a source from a computer with a lot of RAM to a computer with less RAM. For more complex cases, e.g. when the initial source is itself a very large file, a dedicated converter has to be provided.
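As an illustration of what a dedicated converter might look like, the sketch below streams a JSON Lines file item by item into a data file and an offset index, so the full source never sits in memory. Everything here is hypothetical: the function name convert_jsonl, the JSON Lines input, the pickle serialization, and the index layout are assumptions, and files produced this way are not guaranteed to be readable by FileSource. Check the gismo source for the real on-disk format before relying on it.

```python
import json
import os
import pickle
import tempfile
import zlib


def convert_jsonl(jsonl_path, data_path, index_path):
    """Stream a JSON Lines file into a data file plus a list of byte offsets."""
    offsets = [0]
    with open(jsonl_path, encoding="utf-8") as lines, open(data_path, "wb") as data:
        for line in lines:
            if not line.strip():
                continue  # skip blank lines
            item = json.loads(line)
            # Assumption: each item is pickled, then compressed with zlib.
            data.write(zlib.compress(pickle.dumps(item)))
            offsets.append(data.tell())
    with open(index_path, "wb") as index:
        pickle.dump(offsets, index)
    return len(offsets) - 1


with tempfile.TemporaryDirectory() as dirname:
    # Small stand-in for a very large input file.
    jsonl_path = os.path.join(dirname, "big.jsonl")
    with open(jsonl_path, "w", encoding="utf-8") as f:
        f.write('{"title": "First Document", "content": "Gizmo is a Mogwaï."}\n')
        f.write('{"title": "Another Document", "content": "This is a sentence about Blade."}\n')

    data_path = os.path.join(dirname, "mysource.data")
    index_path = os.path.join(dirname, "mysource.index")
    n_items = convert_jsonl(jsonl_path, data_path, index_path)

    # Read the second item back using only the index and a seek.
    with open(index_path, "rb") as index:
        offsets = pickle.load(index)
    with open(data_path, "rb") as data:
        data.seek(offsets[1])
        second = pickle.loads(zlib.decompress(data.read(offsets[2] - offsets[1])))
```

Only one item and the growing list of offsets are held in memory at any time, which is the property a converter for very large inputs needs.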