Reference
Gismo is made of multiple small modules designed to be mixed together.
corpus: This module contains simple wrappers that turn a wide range of document sources into something Gismo can process.
embedding: This module creates and manipulates dual TF-IDF/TF-ITF embeddings out of a corpus.
diteration: This module transforms queries into relevance vectors that can be used to rank and organize documents and features.
clustering: This module implements the tree-like organization of selected items.
gismo: The main gismo module combines all the modules above to provide high-level, end-to-end analysis methods.
landmarks: Introduced in v0.4, this high-level module allows deeper analysis of a small corpus by using individual query results for the embedding.
post_processing: This module provides a simple, unified way to apply automatic transformations (e.g. formatting) to the results of an analysis.
filesource: This module can be used to read documents one by one from disk instead of loading them all in memory. Useful for very large corpora.
sentencizer: This module can leverage a document-level gismo to provide sentence-level analysis. It can be used to extract key phrases (headlines).
datasets: Collection of helpers to access small (or less small) datasets.
common: Multi-purpose module for things used in more than one other module.
parameters: Management of runtime parameters.
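As a quick illustration of how these modules combine, here is a minimal end-to-end sketch using the toy corpus shipped with the package (all names below are documented in this reference; the query string is arbitrary):
>>> from gismo.common import toy_source_text
>>> from gismo.corpus import Corpus
>>> from gismo.embedding import Embedding
>>> from gismo.gismo import Gismo
>>> corpus = Corpus(toy_source_text)           # wrap the raw source
>>> embedding = Embedding()                    # dual TF-IDF/TF-ITF embedding
>>> embedding.fit_transform(corpus)
>>> gismo = Gismo(corpus, embedding)           # combine corpus and embedding
>>> success = gismo.rank("Mogwaï")             # run a query
>>> documents = gismo.get_documents_by_rank()  # best documents, best first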
Corpus
- class gismo.corpus.Corpus(source=None, to_text=None)[source]
The Corpus class is the starting point of any Gismo workflow. It abstracts dataset pre-processing: it is just a list of items (called documents in Gismo) augmented with a method that describes how to convert a document to a string object. It is used to build an Embedding.
- Parameters
source (list) – The list of items that constitutes the dataset to analyze. Actually, any iterable object with __len__() and __getitem__() methods can potentially be used as a source (see FileSource for an example).
to_text (function, optional) – The function that transforms an item from the source into plain text (str). If not set, it will default to the identity function lambda x: x.
Examples
The following code uses the toy_source_text list as source and specifies that the text extraction method should be: take the first 15 characters and add "…".
When we iterate with the iterate() method, observe that the extraction is not applied.
>>> corpus = Corpus(toy_source_text, to_text=lambda x: f"{x[:15]}...")
>>> for c in corpus.iterate():
...     print(c)
Gizmo is a Mogwaï.
This is a sentence about Blade.
This is another sentence about Shadoks.
This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.
In chinese folklore, a Mogwaï is a demon.
When we iterate with the iterate_text() method, observe that the extraction is applied.
>>> for c in corpus.iterate_text():
...     print(c)
Gizmo is a Mogw...
This is a sente...
This is another...
This very long ...
In chinese folk...
A corpus object can be saved/loaded with the dump() and load() methods inherited from the MixInIO class. The load() method is a class method to be used instead of the constructor.
>>> import tempfile
>>> corpus1 = Corpus(toy_source_text)
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     corpus1.dump(filename="myfile", path=tmpdirname)
...     corpus2 = Corpus.load(filename="myfile", path=tmpdirname)
>>> corpus2[0]
'Gizmo is a Mogwaï.'
- merge_new_source(new_source, doc2key=None)[source]
Incorporates new entries while avoiding the creation of duplicates. This method is typically used when you have a dynamic source, like an RSS feed, and you want to periodically update your corpus.
- Parameters
new_source (list) – Source compatible (e.g. similar item type) with the current source.
doc2key (function) – Callback that provides items with unique hashable keys, used to avoid duplicates.
Examples
The following code uses the toy_source_dict list as source and adds two new items, including a redundant one.
>>> corpus = Corpus(toy_source_dict.copy(), to_text=lambda x: x['content'][:14])
>>> len(corpus)
5
>>> new_corpus = [{"title": "Another document", "content": "I don't know what to say!"},
...               {'title': 'Fifth Document', 'content': 'In chinese folklore, a Mogwaï is a demon.'}]
>>> corpus.merge_new_source(new_corpus, doc2key=lambda e: e['title'])
>>> len(corpus)
6
>>> for c in corpus.iterate_text():
...     print(c)
Gizmo is a Mog
This is a sent
This is anothe
This very long
In chinese fol
I don't know w
- class gismo.corpus.CorpusList(corpus_list=None, filename=None, path='.')[source]
This class makes a list of corpora behave like one single virtual corpus. This is useful to glue together corpora with distinct shapes and to_text() methods.
- Parameters
corpus_list (list of Corpus) – The list of corpora to glue.
Example
>>> multi_corp = CorpusList([Corpus(toy_source_text, lambda x: x[:15]+"..."),
...                          Corpus(toy_source_dict, lambda e: e['title'])])
>>> len(multi_corp)
10
>>> multi_corp[7]
{'title': 'Third Document', 'content': 'This is another sentence about Shadoks.'}
>>> for c in multi_corp.iterate_text():
...     print(c)
Gizmo is a Mogw...
This is a sente...
This is another...
This very long ...
In chinese folk...
First Document
Second Document
Third Document
Fourth Document
Fifth Document
Embedding
- class gismo.embedding.Embedding(vectorizer=None)[source]
This class leverages the CountVectorizer class to build the dual embedding of a Corpus:
Documents are embedded in the space of features;
Features are embedded in the space of documents.
See the examples and methods below for all usages of the class.
- Parameters
vectorizer (CountVectorizer, optional) – Custom CountVectorizer to override the default behavior (recommended). Having a CountVectorizer adapted to the Corpus is good practice.
- fit(corpus)[source]
Learn features from a corpus of documents.
If not yet set, a default CountVectorizer is created.
Features are computed and stored.
Inverse-Document-Frequency weights of features are computed.
- Parameters
corpus (Corpus) – The corpus to ingest.
Example
>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit(corpus)
>>> len(embedding.idf)
21
>>> list(embedding.features[:8])
['blade', 'chinese', 'comparing', 'demon', 'folklore', 'gizmo', 'gremlins', 'inside']
- fit_ext(embedding)[source]
Use learned features from another Embedding. This is useful for the fast creation of local embeddings (e.g. at sentence level) out of a global embedding.
- Parameters
embedding (Embedding) – External embedding to copy.
Examples
>>> corpus = Corpus(toy_source_text)
>>> other_embedding = Embedding()
>>> other_embedding.fit(corpus)
>>> embedding = Embedding()
>>> embedding.fit_ext(other_embedding)
>>> len(embedding.idf)
21
>>> list(embedding.features[:8])
['blade', 'chinese', 'comparing', 'demon', 'folklore', 'gizmo', 'gremlins', 'inside']
- fit_transform(corpus)[source]
Ingest a corpus of documents.
If not yet set, a default CountVectorizer is created.
Features are computed and stored (fit).
Inverse-Document-Frequency weights of features are computed (fit).
TF-IDF embedding of documents is computed and stored (transform).
TF-ITF embedding of features is computed and stored (transform).
- Parameters
corpus (Corpus) – The corpus to ingest.
Example
>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> embedding.x
<5x21 sparse matrix of type '<class 'numpy.float64'>'
    with 25 stored elements in Compressed Sparse Row format>
>>> list(embedding.features[:8])
['blade', 'chinese', 'comparing', 'demon', 'folklore', 'gizmo', 'gremlins', 'inside']
- query_projection(query)[source]
Project a query in the feature space.
- Parameters
query (str) – Text to project.
- Returns
z (csr_matrix) – Result of the query projection (the IDF distribution if the query does not match any feature).
success (bool) – Projection success (True if at least one feature has been found).
Example
>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> z, success = embedding.query_projection("Gizmo is not Yoda but he rocks!")
>>> for i in range(len(z.data)):
...     print(f"{embedding.features[z.indices[i]]}: {z.data[i]}")
gizmo: 0.3868528072...
yoda: 0.6131471927...
>>> success
True
>>> z, success = embedding.query_projection("That content does not intersect toy corpus")
>>> success
False
- transform(corpus)[source]
Ingest a corpus of documents using existing features. Requires that the embedding has been fitted beforehand.
TF-IDF embedding of documents is computed and stored.
TF-ITF embedding of features is computed and stored.
- Parameters
corpus (Corpus) – The corpus to ingest.
Example
>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> [embedding.features[i] for i in embedding.x.indices[:8]]
['gizmo', 'mogwaï', 'blade', 'sentence', 'sentence', 'shadoks', 'comparing', 'gizmo']
>>> small_corpus = Corpus(["I only talk about Yoda", "Gizmo forever!"])
>>> embedding.transform(small_corpus)
>>> [embedding.features[i] for i in embedding.x.indices]
['yoda', 'gizmo']
- gismo.embedding.auto_vect(corpus=None)[source]
Creates a default CountVectorizer compatible with the Embedding constructor. For not-too-small corpora, a slight frequency filter is applied.
- Parameters
corpus (Corpus, optional) – The corpus for which the CountVectorizer is intended.
- Returns
A CountVectorizer object compatible with the Embedding constructor.
- Return type
CountVectorizer
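As a usage sketch (relying only on the documented signature above), one can build the default vectorizer explicitly and hand it to an Embedding:
>>> from gismo.common import toy_source_text
>>> from gismo.corpus import Corpus
>>> from gismo.embedding import Embedding, auto_vect
>>> corpus = Corpus(toy_source_text)
>>> vectorizer = auto_vect(corpus)  # default CountVectorizer adapted to the corpus
>>> embedding = Embedding(vectorizer=vectorizer)
>>> embedding.fit_transform(corpus)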
- gismo.embedding.idf_fit(indptr, n)[source]
Computes the Inverse-Document-Frequency vector on sparse embedding y.
- gismo.embedding.idf_transform(indptr, data, idf_vector)[source]
Applies inplace Inverse-Document-Frequency transformation on sparse embedding y.
- gismo.embedding.itf_fit_transform(indptr, data, m)[source]
Applies inplace Inverse-Term-Frequency transformation on sparse embedding x.
- gismo.embedding.l1_normalize(indptr, data)[source]
Computes L1 norm on sparse embedding (x or y) and applies inplace normalization.
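These low-level helpers all operate directly on the indptr/data attributes of a scipy csr_matrix. As an illustration, here is a hedged sketch of an inplace row-wise L1 normalization in the spirit of l1_normalize (the helper name toy_l1_normalize and its loop are ours, not Gismo's actual code):
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> def toy_l1_normalize(indptr, data):
...     for row in range(len(indptr) - 1):
...         start, end = indptr[row], indptr[row + 1]
...         norm = np.sum(np.abs(data[start:end]))  # L1 norm of the row
...         if norm > 0:
...             data[start:end] /= norm
>>> m = csr_matrix(np.array([[1.0, 3.0], [2.0, 2.0]]))
>>> toy_l1_normalize(m.indptr, m.data)
>>> m.toarray()
array([[0.25, 0.75],
       [0.5 , 0.5 ]])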
- gismo.embedding.query_shape(indices, data, idf)[source]
Applies inplace logarithmic smoothing, IDF weighting, and normalization to the output of the CountVectorizer transform() method.
- Parameters
indices (ndarray) – indices attribute of the csr_matrix obtained from transform().
data (ndarray) – data attribute of the csr_matrix obtained from transform().
idf (ndarray) – IDF vector of the embedding, obtained from idf_fit().
- Returns
norm – The norm of the vector before normalization.
- Return type
float
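A hedged sketch of these three steps; the general shape (logarithmic smoothing, IDF weighting, L1 normalization) follows the description above, but the exact smoothing formula Gismo uses is an assumption:
>>> import numpy as np
>>> def toy_query_shape(indices, data, idf):
...     data[:] = (1 + np.log(data)) * idf[indices]  # smooth counts, weight by IDF
...     norm = np.sum(data)
...     if norm > 0:
...         data /= norm
...     return norm  # norm before normalization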
DIteration
- class gismo.diteration.DIteration(n, m)[source]
This class is in charge of performing the DIteration algorithm.
- gismo.diteration.jit_diffusion(x_pointers, x_indices, x_data, y_pointers, y_indices, y_data, z_indices, z_data, x_relevance, y_relevance, alpha, n_iter, offset: float, x_fluid, y_fluid)[source]
Core diffusion engine written to be compatible with Numba. This is where the DIteration algorithm is applied inline.
- Parameters
x_pointers (ndarray) – Pointers of the csr_matrix embedding of documents.
x_indices (ndarray) – Indices of the csr_matrix embedding of documents.
x_data (ndarray) – Data of the csr_matrix embedding of documents.
y_pointers (ndarray) – Pointers of the csr_matrix embedding of features.
y_indices (ndarray) – Indices of the csr_matrix embedding of features.
y_data (ndarray) – Data of the csr_matrix embedding of features.
z_indices (ndarray) – Indices of the csr_matrix embedding of the query projection.
z_data (ndarray) – Data of the csr_matrix embedding of the query projection.
x_relevance (ndarray) – Placeholder for the relevance of documents.
y_relevance (ndarray) – Placeholder for the relevance of features.
alpha (float in range [0.0, 1.0]) – Damping factor. Controls the trade-off between closeness and centrality.
n_iter (int) – Number of round-trip diffusions to perform. A higher value means better precision but longer execution time.
offset (float in range [0.0, 1.0]) – Controls how much of the initial fluid should be deduced from the relevance.
x_fluid (ndarray) – Placeholder for fluid on the side of documents.
y_fluid (ndarray) – Placeholder for fluid on the side of features.
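To convey the intuition, here is a hedged, dense NumPy sketch of the round-trip diffusion; the real jit_diffusion works inplace on the csr attributes listed above, and its exact update rule (in particular the offset handling, omitted here) may differ:
>>> import numpy as np
>>> def toy_diffusion(x, y, z, alpha=0.5, n_iter=4):
...     # x: n x m document embedding; y: m x n feature embedding;
...     # z: query projection in the feature space (length m).
...     x_relevance = np.zeros(x.shape[0])
...     y_relevance = np.zeros(y.shape[0])
...     y_fluid = z.copy()
...     for _ in range(n_iter):
...         x_fluid = alpha * (x @ y_fluid)  # diffuse feature fluid to documents
...         x_relevance += x_fluid
...         y_fluid = alpha * (y @ x_fluid)  # diffuse document fluid back to features
...         y_relevance += y_fluid
...     return x_relevance, y_relevance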
Clustering
- class gismo.clustering.Cluster(indice=None, rank=None, vector=None)[source]
The Cluster class is used for the internal representation of hierarchical clusters. It stores the attributes that describe a clustering structure and provides basic cluster addition for merge operations.
- Parameters
indice (int) – Index of the head (main element) of the cluster.
rank (int) – The ranking order of a cluster.
vector (csr_matrix) – The vector representation of the cluster.
- vector
The vector representation of the cluster.
- Type
csr_matrix
- intersection_vector
The vector representation of the common points of a cluster.
- Type
csr_matrix (deprecated)
- focus
The consistency of the cluster (higher focus means that elements are more similar).
- Type
float in range [0.0, 1.0]
Examples
>>> c1 = Cluster(indice=0, rank=1, vector=csr_matrix([1.0, 0.0, 1.0]))
>>> c2 = Cluster(indice=5, rank=0, vector=csr_matrix([1.0, 1.0, 0.0]))
>>> c3 = c1 + c2
>>> c3.members
[0, 5]
>>> c3.indice
5
>>> c3.vector.toarray()
array([[2., 1., 1.]])
>>> c3.intersection_vector.toarray()
array([[1., 0., 0.]])
>>> c1 == sum([c1])
True
- gismo.clustering.covering_order(cluster, wide=True)[source]
Uses a hierarchical cluster to provide an ordering of the items that mixes rank and coverage.
This is done by exploring all clusters and subclusters by increasing similarity and rank (lexicographic order). Two variants are proposed:
Core: for each cluster, append its representative to the list if new. Central items tend to have better rank.
Wide: for each cluster, append its children's representatives to the list if new. Marginal items tend to have better rank.
- gismo.clustering.get_sim(csr, arr)[source]
Simple similarity computation between csr_matrix and ndarray.
- Parameters
csr (csr_matrix) – Sparse vector.
arr (ndarray) – Dense vector.
- Return type
float
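A hedged sketch of such a similarity, assuming a plain dot product restricted to the non-zero entries of the sparse vector (the actual formula used by get_sim may differ):
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> def toy_sim(csr, arr):
...     return float(csr.data @ arr[csr.indices])
>>> toy_sim(csr_matrix([0.5, 0.0, 0.5]), np.array([1.0, 1.0, 0.0]))
0.5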
- gismo.clustering.merge_clusters(cluster_list: list, focus=1.0)[source]
Complete merge operation. In addition to the basic merge provided by Cluster, it ensures the following:
Consistency of focus, by integrating the extra-focus (typically given by subspace_partition());
Children (the members of the list) are sorted according to their respective ranks.
- gismo.clustering.rec_clusterize(cluster_list: list, resolution=0.7)[source]
Auxiliary recursive function for clustering.
- Parameters
cluster_list (list of Cluster) – Current aggregation state.
resolution (float in range [0.0, 1.0]) – Sets the laziness of aggregation. A resolution set to 0.0 yields a one-step clustering (star structure), while a resolution set to 1.0 yields, up to tie similarities, a binary tree (dendrogram).
- Return type
list of Cluster
- gismo.clustering.subspace_clusterize(subspace, resolution=0.7, indices=None)[source]
Converts a subspace (matrix seen as a list of vectors) to a Cluster object (hierarchical clustering).
- Parameters
subspace (ndarray, csr_matrix) – A k x m matrix seen as a list of k m-dimensional vectors sorted by importance order.
resolution (float in range [0.0, 1.0]) – Sets the laziness of aggregation. A resolution set to 0.0 yields a one-step clustering (star structure), while a resolution set to 1.0 yields, up to tie similarities, a binary tree (dendrogram).
indices (list, optional) – Indicates the index of each element of the subspace. Used when the subspace is extracted from a larger space (e.g. X or Y). If not set, indices are set to range(k).
- Returns
A cluster whose leaves are the k vectors from the subspace.
- Return type
Cluster
Example
>>> corpus = Corpus(toy_source_text)
>>> vectorizer = CountVectorizer(dtype=float)
>>> embedding = Embedding(vectorizer=vectorizer)
>>> embedding.fit_transform(corpus)
>>> subspace = embedding.x[1:, :]
>>> cluster = subspace_clusterize(subspace)
>>> len(cluster.children)
2
>>> cluster = subspace_clusterize(subspace, resolution=.02)
>>> len(cluster.children)
4
- gismo.clustering.subspace_distortion(indices, data, relevance, distortion: float)[source]
Apply inplace distortion of a subspace with relevance.
- Parameters
indices (ndarray) – indices attribute of the subspace csr_matrix.
data (ndarray) – data attribute of the subspace csr_matrix.
relevance (ndarray) – Relevance values in the embedding space.
distortion (float in [0.0, 1.0]) – Power applied to the relevance for distortion.
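Given the description above, the inplace update plausibly boils down to a one-liner; here is a hedged, self-contained sketch (the exact rule is our assumption):
>>> import numpy as np
>>> indices = np.array([0, 2])             # features present in the subspace
>>> data = np.array([0.5, 0.5])            # their weights
>>> relevance = np.array([0.9, 0.1, 0.4])  # relevance per feature
>>> data *= relevance[indices] ** 0.5      # distortion = 0.5
>>> np.round(data, 3)
array([0.474, 0.316])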
- gismo.clustering.subspace_partition(subspace, resolution=0.7)[source]
Proposes a partition of the subspace that merges together vectors with a similar direction.
- Parameters
subspace (ndarray, csr_matrix) – A k x m matrix seen as a list of k m-dimensional vectors sorted by importance order.
resolution (float in range [0.0, 1.0]) – How strict the merging should be. 0.0 will merge all items together, while 1.0 will only merge mutually closest items.
- Returns
A list of subsets that form a partition. Each subset is represented by a pair (p, f): p is the set of indices of the subset, f is the typical similarity of the partition (called focus).
- Return type
list
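A minimal usage sketch, reusing the subspace built in the subspace_clusterize example above (outputs omitted, as they depend on the corpus):
for p, f in subspace_partition(subspace, resolution=0.7):
    print(sorted(p), f)  # the indices of each subset and its focus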
Gismo
- class gismo.gismo.Gismo(corpus=None, embedding=None, **kwargs)[source]
Gismo mixes a corpus and its embedding to provide search and structure methods.
- Parameters
corpus (Corpus) – Defines the documents of the gismo.
embedding (Embedding) – Defines the embedding of the gismo.
kwargs (dict) – Custom default runtime parameters. You just need to specify the parameters that differ from DEFAULT_PARAMETERS.
Example
The Corpus class defines how documents of a source should be converted to plain text.
>>> corpus = Corpus(toy_source_dict, lambda x: x['content'])
The Embedding class extracts features (e.g. words) and computes weights between documents and features.
>>> vectorizer = CountVectorizer(dtype=float)
>>> embedding = Embedding(vectorizer=vectorizer)
>>> embedding.fit_transform(corpus)
>>> embedding.m  # number of features
36
The Gismo class combines them for performing queries. After a query is performed, one can ask for the best items. The number of items to return can be specified with the parameter k or automatically adjusted.
>>> gismo = Gismo(corpus, embedding)
>>> success = gismo.rank("Gizmo")
>>> gismo.parameters.target_k = .2  # The toy dataset is very small, so we lower the auto_k parameter.
>>> gismo.get_documents_by_rank()
[{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'}, {'title': 'Fourth Document', 'content': 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.'}, {'title': 'Fifth Document', 'content': 'In chinese folklore, a Mogwaï is a demon.'}]
Post-processing functions can be used to tweak the returned object (the underlying ranking is unchanged).
>>> gismo.post_documents_item = partial(post_documents_item_content, max_size=44)
>>> gismo.get_documents_by_rank()
['Gizmo is a Mogwaï.', 'This very long sentence, with a lot of stuff', 'In chinese folklore, a Mogwaï is a demon.']
Ranking also works on features.
>>> gismo.get_features_by_rank()
['mogwaï', 'gizmo', 'is', 'in', 'demon', 'chinese', 'folklore']
Clustering organizes results and can provide additional hints on their relationships.
>>> gismo.post_documents_cluster = post_documents_cluster_print
>>> gismo.get_documents_by_cluster(resolution=.9)
F: 0.60. R: 0.65. S: 0.98.
- F: 0.71. R: 0.57. S: 0.98.
-- Gizmo is a Mogwaï. (R: 0.54; S: 0.99)
-- In chinese folklore, a Mogwaï is a demon. (R: 0.04; S: 0.71)
- This very long sentence, with a lot of stuff (R: 0.08; S: 0.69)
>>> gismo.post_features_cluster = post_features_cluster_print
>>> gismo.get_features_by_cluster()
F: 0.03. R: 0.29. S: 0.98.
- F: 1.00. R: 0.27. S: 0.99.
-- mogwaï (R: 0.12; S: 0.99)
-- gizmo (R: 0.12; S: 0.99)
-- is (R: 0.03; S: 0.99)
- F: 1.00. R: 0.02. S: 0.07.
-- in (R: 0.00; S: 0.07)
-- demon (R: 0.00; S: 0.07)
-- chinese (R: 0.00; S: 0.07)
-- folklore (R: 0.00; S: 0.07)
As an alternative to a textual query, the rank() method can directly use a vector z as input.
>>> z, s = gismo.embedding.query_projection("gizmo chinese folklore")
>>> z
<1x36 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>
>>> s = gismo.rank(z=z)
>>> s
True
>>> gismo.get_documents_by_rank(k=2)
['In chinese folklore, a Mogwaï is a demon.', 'Gizmo is a Mogwaï.']
>>> gismo.get_features_by_rank()
['mogwaï', 'in', 'chinese', 'folklore', 'demon', 'gizmo', 'is']
The class also offers get_documents_by_coverage() and get_features_by_coverage(), which yield a list of results obtained from a covering-like traversal of the ranked cluster.
To demonstrate this, we first add an outsider document to the corpus and rebuild the Gismo.
>>> new_entry = {'title': 'Minority Report', 'content': 'Totally unrelated stuff.'}
>>> corpus = Corpus(toy_source_dict+[new_entry], lambda x: x['content'])
>>> vectorizer = CountVectorizer(dtype=float)
>>> embedding = Embedding(vectorizer=vectorizer)
>>> embedding.fit_transform(corpus)
>>> gismo = Gismo(corpus, embedding)
>>> gismo.post_documents_item = post_documents_item_content
>>> success = gismo.rank("Gizmo")
>>> gismo.parameters.target_k = .3
Recall the classical rank-based result.
>>> gismo.get_documents_by_rank()
['Gizmo is a Mogwaï.', 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.', 'In chinese folklore, a Mogwaï is a demon.']
Gismo can use the cluster to propose alternate results that try to cover more subjects.
>>> gismo.get_documents_by_coverage()
['Gizmo is a Mogwaï.', 'Totally unrelated stuff.', 'This is a sentence about Blade.']
Note how the new entry, which has nothing to do with the rest, is pushed into the results. By setting the wide option to False, we get an alternative that focuses on mainstream results.
>>> gismo.get_documents_by_coverage(wide=False)
['Gizmo is a Mogwaï.', 'This is a sentence about Blade.', 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.']
The same principle applies for features.
>>> gismo.get_features_by_rank()
['mogwaï', 'gizmo', 'is', 'in', 'chinese', 'folklore', 'demon']
>>> gismo.get_features_by_coverage()
['mogwaï', 'this', 'in', 'by', 'gizmo', 'is', 'chinese']
- get_documents_by_cluster(k=None, **kwargs)[source]
Returns a cluster of the best-ranked documents. By default, the cluster is post-processed through the post_documents_cluster method.
- get_documents_by_cluster_from_indices(indices, **kwargs)[source]
Returns a cluster of documents. By default, the cluster is post-processed through the post_documents_cluster method.
- get_documents_by_coverage(k=None, **kwargs)[source]
Returns a list of top covering documents. By default, the documents are post-processed through the post_documents_item method.
- get_documents_by_rank(k=None, **kwargs)[source]
Returns a list of top documents according to the current ranking. By default, the documents are post-processed through the post_documents_item method.
- get_features_by_cluster(k=None, **kwargs)[source]
Returns a cluster of the best-ranked features. By default, the cluster is post-processed through the post_features_cluster method.
- get_features_by_cluster_from_indices(indices, **kwargs)[source]
Returns a cluster of features. By default, the cluster is post-processed through the post_features_cluster method.
- get_features_by_coverage(k=None, **kwargs)[source]
Returns a list of top covering features. By default, the features are post-processed through the post_features_item method.
- get_features_by_rank(k=None, **kwargs)[source]
Returns a list of top features according to the current ranking. By default, the features are post-processed through the post_features_item method.
- rank(query='', z=None, **kwargs)[source]
Runs the DIteration algorithm, using the query as the starting point.
- Parameters
query (str) – Text that starts the DIteration.
z (csr_matrix, optional) – Query vector to use in place of the textual query.
kwargs (dict, optional) – Custom runtime parameters.
- Returns
success – Success of the query projection. If the projection fails, a ranking on the uniform distribution is performed.
- Return type
bool
- class gismo.gismo.XGismo(x_embedding=None, y_embedding=None, filename=None, path='.', **kwargs)[source]
Given two distinct embeddings based on the same set of documents, builds a new gismo. The features of x_embedding are the corpus of this new gismo; the features of y_embedding are its features. The dual embedding of the new gismo is obtained by crossing the two input dual embeddings.
An xgismo behaves essentially like a gismo object. The main difference is an additional parameter y for the rank method, which controls whether the query projection should be performed on the y_embedding or on the x_embedding.
- Parameters
x_embedding (Embedding) – The left embedding, which defines the documents of the xgismo.
y_embedding (Embedding) – The right embedding, which defines the features of the xgismo.
filename (str, optional) – If set, will load xgismo from file.
path (str or Path, optional) – Directory where the xgismo is to be loaded from.
kwargs (dict) – Custom default runtime parameters. You just need to specify the parameters that differ from DEFAULT_PARAMETERS.
Examples
One of the main use cases for XGismo consists in transforming a list of articles into a Gismo that relates authors and the words they use. Let's start by retrieving a few articles.
>>> toy_url = "https://dblp.org/pers/xx/m/Mathieu:Fabien.xml"
>>> source = [a for a in url2source(toy_url) if int(a['year']) < 2023]
Then we build the embedding of words.
>>> corpus = Corpus(source, to_text=lambda x: x['title'])
>>> w_count = CountVectorizer(dtype=float, stop_words='english')
>>> w_embedding = Embedding(w_count)
>>> w_embedding.fit_transform(corpus)
And the embedding of authors.
>>> to_authors_text = lambda dic: " ".join([a.replace(' ', '_') for a in dic['authors']])
>>> corpus.to_text = to_authors_text
>>> a_count = CountVectorizer(dtype=float, preprocessor=lambda x: x, tokenizer=lambda x: x.split(' '))
>>> a_embedding = Embedding(a_count)
>>> a_embedding.fit_transform(corpus)
We can now combine the two embeddings in one xgismo.
>>> xgismo = XGismo(a_embedding, w_embedding)
>>> xgismo.post_documents_item = lambda g, i: g.corpus[i].replace('_', ' ')
We can use xgismo to query keyword(s).
>>> success = xgismo.rank("Pagerank") >>> xgismo.get_documents_by_rank() ['Mohamed Bouklit', 'Dohy Hong', 'The Dang Huynh']
We can use it to query researcher(s).
>>> success = xgismo.rank("Anne_Bouillard", y=False) >>> xgismo.get_documents_by_rank() ['Anne Bouillard', 'Elie de Panafieu', 'Céline Comte', 'Thomas Deiß', 'Philippe Sehier', 'Dmitry Lebedev']
- rank(query='', y=True, **kwargs)[source]
Runs the DIteration algorithm, using the query as the starting point. The query can be evaluated on features (y=True) or on documents (y=False).
- Parameters
query (str) – Text that starts the DIteration.
y (bool) – Whether the query projection is performed on features (True) or on documents (False).
kwargs (dict, optional) – Custom runtime parameters.
- Returns
success – Success of the query projection. If the projection fails, a ranking on the uniform distribution is performed.
- Return type
bool
Landmarks
- class gismo.landmarks.Landmarks(source=None, to_text=None, **kwargs)[source]
The Landmarks class is a subclass of Corpus. It offers the capability to batch-rank all its entries against a Gismo instance. After it has been processed, a Landmarks instance can be used to analyze/classify Gismo queries, Cluster instances, or other Landmarks.
Landmarks also offers the possibility to reduce a source or a gismo to its neighborhood. This can be useful if the source is huge and one wants something smaller for performance.
- Parameters
source (list) – The list of items that form Landmarks.
to_text (function) – The function that transforms an item into text.
kwargs (dict) – Custom default runtime parameters. You just need to specify the parameters that differ from DEFAULT_LANDMARKS_PARAMETERS.
Examples
Landmarks lean on a Gismo. We can use a toy Gismo to start with.
>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> gismo = Gismo(corpus, embedding)
>>> print(toy_source_text)
['Gizmo is a Mogwaï.', 'This is a sentence about Blade.', 'This is another sentence about Shadoks.', 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.', 'In chinese folklore, a Mogwaï is a demon.']
Landmarks are constructed exactly like a Gismo object, with a source and a to_text function.
>>> landmarks_source = [{'name': 'Movies', 'content': 'Star Wars, Gremlins, and Blade are movies.'},
...                     {'name': 'Gremlins', 'content': 'The Gremlins movie features a Mogwai.'},
...                     {'name': 'Star Wars', 'content': 'The Star Wars movies feature Yoda.'},
...                     {'name': 'Shadoks', 'content': 'Shadoks is a French sarcastic show.'},]
>>> landmarks = Landmarks(landmarks_source, to_text=lambda e: e['content'])
The fit() method computes gismo queries for all landmarks and retains the results.
>>> landmarks.fit(gismo)
We run the request Yoda and look at the key landmarks. Note that Gremlins comes before Star Wars. This is actually correct in this small dataset: the word Yoda only exists in one sentence, which contains the words Gremlins and Gizmo.
>>> success = gismo.rank('yoda')
>>> landmarks.get_landmarks_by_rank(gismo)
[{'name': 'Gremlins', 'content': 'The Gremlins movie features a Mogwai.'}, {'name': 'Star Wars', 'content': 'The Star Wars movies feature Yoda.'}, {'name': 'Movies', 'content': 'Star Wars, Gremlins, and Blade are movies.'}]
For better readability, we set the item post-processing to return the name of a landmark item.
>>> landmarks.post_item = lambda lmk, i: lmk[i]['name']
>>> landmarks.get_landmarks_by_rank(gismo)
['Gremlins', 'Star Wars', 'Movies']
The balance parameter adjusts the trade-off between document and feature spaces. A balance set to 1.0 focuses only on documents.
>>> success = gismo.rank('blade')
>>> landmarks.get_landmarks_by_rank(gismo, balance=1)
['Movies']
A balance set to 0.0 focuses only on features. For blade, this triggers Shadoks as a secondary result, because of the shared word sentence.
>>> landmarks.get_landmarks_by_rank(gismo, balance=0)
['Movies', 'Shadoks']
Landmarks can be used to analyze landmarks.
>>> landmarks.get_landmarks_by_rank(landmarks)
['Gremlins', 'Star Wars']
See again how balance can change things. Here a balance set to 0.0 (using only features) fully changes the results.
>>> landmarks.get_landmarks_by_rank(landmarks, balance=0)
['Shadoks']
Like for Gismo, landmarks can provide clusters.
>>> success = gismo.rank('gizmo')
>>> landmarks.get_landmarks_by_cluster(gismo)
{'landmark': {'name': 'Gremlins', 'content': 'The Gremlins movie features a Mogwai.'}, 'focus': 0.999998..., 'children': [{'landmark': {'name': 'Gremlins', 'content': 'The Gremlins movie features a Mogwai.'}, 'focus': 1.0, 'children': []}, {'landmark': {'name': 'Star Wars', 'content': 'The Star Wars movies feature Yoda.'}, 'focus': 1.0, 'children': []}, {'landmark': {'name': 'Movies', 'content': 'Star Wars, Gremlins, and Blade are movies.'}, 'focus': 1.0, 'children': []}]}
We can set the post_cluster attribute to customize the output. Gismo provides a simple display.
>>> from gismo.post_processing import post_landmarks_cluster_print
>>> landmarks.post_cluster = post_landmarks_cluster_print
>>> landmarks.get_landmarks_by_cluster(gismo)
F: 1.00.
- Gremlins
- Star Wars
- Movies
Like for Gismo, parameters like k, distortion, or resolution can be used.
>>> landmarks.get_landmarks_by_cluster(gismo, k=4, distortion=False, resolution=.9)
F: 0.03.
- F: 0.93.
-- F: 1.00.
--- Gremlins
--- Star Wars
-- Movies
- Shadoks
Note that a Cluster can also be used as reference for the get_landmarks_by_rank() and get_landmarks_by_cluster() methods.
>>> cluster = landmarks.get_landmarks_by_cluster(gismo, post=False)
>>> landmarks.get_landmarks_by_rank(cluster)
['Gremlins', 'Star Wars', 'Movies']
Yet, not any object can be used as a reference. For example, you cannot directly use a string.
>>> landmarks.get_landmarks_by_rank("Landmarks do not use external queries (pass them to a gismo") # doctest.ELLIPSIS Traceback (most recent call last): ... TypeError: bad operand type for unary -: 'NoneType'
Last but not least, landmarks can be used to reduce the size of a source or a Gismo. The reduction is controlled by the x_density attribute, which tells the number of documents each landmark is allowed to keep.
>>> landmarks.parameters.x_density = 1
>>> reduced_gismo = landmarks.get_reduced_gismo(gismo)
>>> reduced_gismo.corpus.source
['This is another sentence about Shadoks.', 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.']
Side remark #1: in the constructor, to_text indicates how to convert an item to str, while the rank parameter specifies how to run a query on a Gismo. Yet, it is possible to have the text conversion handled by the ranking function.
>>> landmarks = Landmarks(landmarks_source, rank=lambda g, q: g.rank(q['content']))
>>> landmarks.fit(gismo)
>>> success = gismo.rank('yoda')
>>> landmarks.post_item = lambda lmk, i: lmk[i]['name']
>>> landmarks.get_landmarks_by_rank(gismo)
['Star Wars', 'Movies', 'Gremlins']
However, this is bad practice. When you only need to customize the way an item is converted to text, you should stick to to_text. The rank parameter is for more elaborate filters that require changing the default way gismo performs queries.
Side remark #2: if a landmark item query fails (its text does not intersect the gismo features), the default uniform projection will be used and a warning will be issued. This may yield undesired results.
>>> landmarks_source.append({'name': 'unrelated', 'content': 'unrelated.'})
>>> landmarks = Landmarks(landmarks_source, to_text=lambda e: e['content'])
>>> landmarks.fit(gismo)
>>> success = gismo.rank('gizmo')
>>> landmarks.post_item = lambda lmk, i: lmk[i]['name']
>>> landmarks.get_landmarks_by_rank(gismo)
['Shadoks', 'unrelated']
- fit(gismo, **kwargs)[source]
Runs gismo queries on all landmarks. The relevance results are used to build two sets of vectors: x_vectors are the vectors on the document space; y_vectors are the vectors on the feature space. On each space, vectors are summed to build a direction, which is a sort of vector summary of the landmarks.
- gismo.landmarks.get_direction(reference, balance)[source]
Converts a reference object into an n+m direction (dense or sparse, depending on the reference type).
- Parameters
reference (Gismo or Landmarks or Cluster or np.ndarray or csr_matrix.) – The object from which a direction will be extracted.
balance (float in range [0.0, 1.0]) – The trade-off between documents and features. Set to 0.0, only the feature space will be used. Set to 1.0, only the document space will be used.
- Returns
An n+m direction.
- Return type
np.ndarray or csr_matrix
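As an intuition, for a Gismo reference the direction plausibly concatenates balance-weighted relevance vectors from both spaces. A hedged sketch (the attribute names gismo.diteration.x_relevance / y_relevance mirror the DIteration placeholders documented above; the exact combination rule is our assumption):
import numpy as np

def toy_direction(gismo, balance):
    # balance=1.0 keeps only the document space, balance=0.0 only the feature space
    return np.concatenate([balance * gismo.diteration.x_relevance,
                           (1 - balance) * gismo.diteration.y_relevance])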
Post Processing
- gismo.post_processing.post_documents_cluster_json(gismo, cluster)[source]
Converts a cluster of documents into basic JSON.
- gismo.post_processing.post_documents_cluster_print(gismo, cluster, post_item=None, depth='')[source]
Prints an ASCII view of a document cluster with metrics (focus, relevance, similarity).
- gismo.post_processing.post_documents_item_content(gismo, i, max_size=None)[source]
Maps a document index to the document content.
Assumes that the document has a 'content' key.
- gismo.post_processing.post_features_cluster_json(gismo, cluster)[source]
Converts a feature cluster into basic JSON.
- gismo.post_processing.post_features_cluster_print(gismo, cluster, post_item=None, depth='')[source]
Prints an ASCII view of a feature cluster with metrics (focus, relevance, similarity).
- gismo.post_processing.post_landmarks_cluster_json(landmark, cluster)[source]
Default post processor for a cluster of landmarks.
FileSource
- class gismo.filesource.FileSource(filename='mysource', path='.', load_source=False)[source]
Yields a file source as a list. Assumes the existence of two files: the mysource.data file contains the stacked items, each item compressed with zlib; the mysource.index file contains the list of pointers used to seek items in the data file.
The resulting source object is fully compatible with the Corpus class:
It can be iterated ([item for item in source]);
It can yield single items (source[i]);
It has a length (len(source)).
More advanced functionalities like slices are not implemented.
- Parameters
filename (str) – Stem of the source files (mysource in the description above).
path (str or Path, optional) – Location of the files.
load_source (bool) – Should the source be fully loaded in memory?
Examples
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as dirname:
...     create_file_source(filename='mysource', path=dirname)
...     source = FileSource(filename='mysource', path=dirname, load_source=True)
...     content = [e['content'] for e in source]
>>> content[:3]
['Gizmo is a Mogwaï.', 'This is a sentence about Blade.', 'This is another sentence about Shadoks.']
Note: when the source is read from file (load_source=False, the default behavior), you need to close the source afterwards to avoid pending file handles.
>>> with tempfile.TemporaryDirectory() as dirname:
...     create_file_source(filename='mysource', path=dirname)
...     source = FileSource(filename='mysource', path=dirname)
...     size = len(source)
...     item = source[0]
...     source.close()
>>> size
5
>>> item
{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'}
- gismo.filesource.create_file_source(source=None, filename='mysource', path='.')[source]
Writes a source (a list of dicts) to files in the format used by FileSource. Only useful to transfer from a computer with a lot of RAM to a computer with less RAM. For more complex cases, e.g. when the initial source itself is a very large file, a dedicated converter has to be provided.
Sentencizer
- class gismo.sentencizer.Sentencizer(gismo)[source]
The Sentencizer class refines a document-level gismo into a sentence-level gismo. A simple sentence extraction is proposed. For more complex usages, the class can provide a full Gismo instance that operates at sentence level.
- Parameters
gismo (Gismo) – Document-level Gismo.
Examples
We use the C50 Reuters dataset (5000 news paragraphs).
>>> from gismo.datasets.reuters import get_reuters_news
>>> corpus = Corpus(get_reuters_news(), to_text=lambda e: e['content'])
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> gismo = Gismo(corpus, embedding)
>>> sentencer = Sentencizer(gismo)
First example: explicitly run the query Orange at document level, then extract 4 covering sentences with a narrow BFS.
>>> success = gismo.rank("Orange") >>> sentencer.get_sentences(s=4, wide=False) ['Snook says the all important average retained revenue per Orange subscriber will rise from around 442 pounds per year, partly because dominant telecoms player British Telecommunications last month raised the price of a call to Orange phones from its fixed lines.', 'Analysts said that Orange shares had good upside potential after a rollercoaster ride in their short time on the market.', 'Orange, which was floated last March at 205 pence per share, initially saw its stock slump to 157.5 pence before recovering over the last few months to trade at 218 on Tuesday, a rise of four pence on the day.', 'One-2-One and Orange ORA.L, which offer only digital services, are due to release their connection figures next week.']
Second example: extract Ericsson-related sentences.
>>> sentencer.get_sentences(query="Ericsson") ['These latest wins follow a recent $350 million contract win with Telefon AB L.M. Ericsson, bolstering its already strong activity in the contract manufacturing of telecommuncation and data communciation products, he said.', 'The restraints are few in areas such as consumer products, while in sectors such as banking, distribution and insurance, foreign firms are kept on a very tight leash.', "The company also said it had told analysts in a briefing Tuesday of new contract wins with Ascend Communications Inc, Harris Corp's Communications unit and Philips Electronics NV.", 'Pocket is the first from the high-priced 1996 auction known to have filed for bankruptcy protection.', 'With Ascend in particular, he said the company would be manufacturing the company\'s mainstream MAX TNT remote access network equipment. "']
Third example: extract Communications-related sentences from a string.
>>> txt = gismo.corpus[4517]['content']
>>> sentencer.get_sentences(query="communications", txt=txt)
["Privately-held Pocket's big creditors include a group of Asian entrepreneurs and communications-equipment makers Siemens AG of Germany and L.M. Ericsson of Sweden.", "2 bidder at the government's high-flying wireless phone auction last year has filed for bankruptcy protection from its creditors, underscoring the problems besetting the auction's winners.", "The Federal Communications Commission on Monday gave PCS companies from last year's auction some breathing space when it suspended indefinitely a March 31 deadline for them to make payments to the agency for their licenses."]
- get_sentences(query=None, txt=None, k=None, s=None, resolution=0.7, stretch=2.0, wide=True, post=True)[source]
All-in-one method to extract covering sentences from the corpus. Computes a sentence-level corpus and a sentence-level gismo, and calls get_documents_by_coverage().
- Parameters
query (str, optional) – Query to run on the document-level Gismo.
txt (str, optional) – Text to use for sentence extraction. If not set, the sentences will be extracted from the top documents.
k (int, optional) – Number of top documents used for the build. If not set, the auto_k() heuristic of the document-level Gismo will be used.
s (int, optional) – Number of sentences to return. If not set, the auto_k() heuristic of the sentence-level Gismo will be used.
resolution (float, optional) – Tree resolution passed to the get_documents_by_coverage() method.
stretch (float >= 1, optional) – Stretch factor passed to the get_documents_by_coverage() method.
wide (bool, optional) – BFS wideness passed to the get_documents_by_coverage() method.
post (bool, optional) – Use of post-processing passed to the get_documents_by_coverage() method.
- Return type
list
- make_sent_gismo(query=None, txt=None, k=None, **kwargs)[source]
Constructs a sentence-level Gismo stored in the sent_gismo attribute.
- Parameters
query (str, optional) – Query to run on the document-level Gismo.
txt (str, optional) – Text to use for sentence extraction. If not set, the sentences will be extracted from the top documents.
k (int, optional) – Number of top documents used for the build. If not set, the auto_k() heuristic will be used.
kwargs (dict) – Custom default runtime parameters to pass to the sentence-level Gismo. You just need to specify the parameters that differ from DEFAULT_PARAMETERS. Note that distortion will be automatically deactivated. If you really want it, manually change the value of self.sent_gismo.parameters.distortion afterwards.
- Return type
- splitter(txt)[source]
Transforms the input content into a corpus of sentences, stored in the sent_corpus attribute.
Datasets
- gismo.datasets.acm.flatten_acm(acm, min_size=5, max_depth=100, exclude=None, depth=0)[source]
Selects subdomains of an ACM tree and returns them as a flat list.
- Parameters
- Returns
A flat list of domains, each described by name and query.
- Return type
list
Example
>>> acm = flatten_acm(get_acm())
>>> acm[111]['name']
'Graph theory'
- gismo.datasets.acm.get_acm(refresh=False)[source]
- Parameters
refresh (bool) – If True, builds a new forest from the Internet; otherwise, uses a static version.
- Returns
acm – Each dict is an ACM domain. It contains the category name, query (concatenation of names from the domain and its subdomains), size (number of subdomains, including itself), and children (list of domain dicts).
- Return type
list of dicts
Examples
>>> acm = get_acm()
>>> subdomain = acm[4]['children'][2]['children'][1]
>>> subdomain['name']
'Software development process management'
>>> subdomain['size']
10
>>> subdomain['query']
'Software development process management, Software development methods, Rapid application development, Agile software development, Capability Maturity Model, Waterfall model, Spiral model, V-model, Design patterns, Risk management'
>>> acm = get_acm(refresh=True)
>>> len(acm)
13
- gismo.datasets.dblp.DEFAULT_FIELDS = {'authors', 'title', 'type', 'venue', 'year'}
Default fields to extract.
- gismo.datasets.dblp.DTD_URL = 'https://dblp.uni-trier.de/xml/dblp.dtd'
URL of the dtd file (required to correctly parse non-ASCII characters).
- class gismo.datasets.dblp.Dblp(dblp_url='https://dblp.uni-trier.de/xml/dblp.xml.gz', filename='dblp', path='.')[source]
The Dblp class can download the DBLP database and produce source files compatible with the FileSource class.
- Parameters
dblp_url (str, optional) – URL of the DBLP database.
filename (str) – Stem of the files to produce.
path (str or Path) – Destination directory.
- build(refresh=False, d=2, fields=None)[source]
Main class method. Creates the data and index files.
- Parameters
refresh (bool) – Tells if files are to be rebuilt if they are already there.
d (int) – Depth level where articles are. Usually 2 or 3 (2 for the main database).
fields (set, optional) – Set of fields to collect. Defaults to DEFAULT_FIELDS.
Example
By default, the class downloads the full dataset. Here we will limit to one entry.
>>> toy_url = "https://dblp.org/pers/xx/m/Mathieu:Fabien.xml"
>>> import tempfile
>>> from gismo.filesource import FileSource
>>> tmp = tempfile.TemporaryDirectory()
>>> dblp = Dblp(dblp_url=toy_url, path=tmp.name)
>>> dblp.build()  # doctest: +ELLIPSIS
Retrieve https://dblp.org/pers/xx/m/Mathieu:Fabien.xml from the Internet.
DBLP database downloaded to ...xml.gz.
Converting DBLP database from ...xml.gz (may take a while).
Building Index.
Conversion done.
By default, build uses existing files.
>>> dblp.build()  # doctest: +ELLIPSIS
File ...xml.gz already exists. Use refresh option to overwrite.
File ...data already exists. Use refresh option to overwrite.
The refresh parameter can be used to ignore existing files.
>>> dblp.build(d=3, refresh=True)  # doctest: +ELLIPSIS
Retrieve https://dblp.org/pers/xx/m/Mathieu:Fabien.xml from the Internet.
DBLP database downloaded to ...xml.gz.
Converting DBLP database from ...xml.gz (may take a while).
Building Index.
Conversion done.
The resulting files can be used to create a FileSource.
>>> source = FileSource(filename="dblp", path=tmp.name)
>>> art = [s for s in source if s['title'] == "Can P2P networks be super-scalable?"][0]
>>> art['authors']  # doctest: +ELLIPSIS
['François Baccelli', 'Fabien Mathieu', 'Ilkka Norros', 'Rémi Varloot']
Don’t forget to close source after use.
>>> source.close()
>>> tmp.cleanup()
- gismo.datasets.dblp.LIST_TYPE_FIELDS = {'authors', 'urls'}
DBLP fields with possibly multiple entries.
- gismo.datasets.dblp.URL = 'https://dblp.uni-trier.de/xml/dblp.xml.gz'
URL of the full DBLP database.
- gismo.datasets.dblp.element_to_filesource(elt, data_handler, index, fields)[source]
Converts the XML element elt into a dict if it is an article.
Compresses and writes the dict in data_handler;
Appends the file position in data_handler to index.
- Parameters
elt (Any) – An XML element.
data_handler (file_descriptor) – Where the compressed data will be stored. Must be writable.
index (list) – A list that contains the initial position of the data_handler for all previously processed elements.
fields (set) – Set of fields to retrieve.
- Returns
Always returns True, for compatibility with the XML parser.
- Return type
bool
- gismo.datasets.dblp.element_to_source(elt, source, fields)[source]
Tests if elt is an article; if so, converts it to a dictionary and appends it to source.
- gismo.datasets.dblp.fast_iter(context, func, d=2, **kwargs)[source]
Applies func to all XML elements of depth 1 of the XML parser context. **kwargs are passed to func.
Modified version of a modified version of Liza Daly's fast_iter, inspired by https://stackoverflow.com/questions/4695826/efficient-way-to-iterate-through-xml-elements
- Parameters
context (XMLParser) – A parser obtained from etree.iterparse.
func (function) – How to process the elements.
d (int, optional) – Depth of the elements to process.
- gismo.datasets.dblp.url2source(url, fields=None)[source]
Directly transforms the URL of a DBLP XML file into a list of dictionaries. Only use it for datasets that fit into memory (e.g. the articles of one author). If the dataset does not fit, consider using the Dblp class instead.
- Parameters
url (str) – Location of the DBLP XML file.
fields (set, optional) – Set of fields to collect.
- Returns
source – Articles retrieved from the URL.
- Return type
list of dict
Example
>>> source = url2source("https://dblp.org/pers/xx/t/Tixeuil:S=eacute=bastien.xml", fields={'authors', 'title', 'year', 'venue', 'urls'})
>>> art = [s for s in source if s['title'] == "Distributed Computing with Mobile Robots: An Introductory Survey."][0]
>>> art['authors']
['Maria Potop-Butucaru', 'Michel Raynal', 'Sébastien Tixeuil']
>>> art['urls']
['https://doi.org/10.1109/NBiS.2011.55', 'http://doi.ieeecomputersociety.org/10.1109/NBiS.2011.55']
- gismo.datasets.dblp.xml_element_to_dict(elt, fields)[source]
Converts the XML element elt into a dict if it is a paper.
- gismo.datasets.reuters.get_reuters_entry(name, z)[source]
Reads the Reuters news item referenced by name in the zip archive z and returns it as a dict.
- gismo.datasets.reuters.get_reuters_news(url='https://github.com/balouf/datasets/raw/main/C50.zip')[source]
Returns a list of news items from the Reuters C50 news dataset.
Acknowledgments
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
ZhiLiu, e-mail: liuzhi8673 ‘@’ gmail.com, institution: National Engineering Research Center for E-Learning, Hubei Wuhan, China
- Parameters
url (str) – Location of the C50 dataset
- Returns
The C50 news as a list of dicts.
- Return type
list of dict
Example
Cf. Sentencizer.
Common
- class gismo.common.MixInIO[source]
Provides basic save/load capabilities to other classes.
- dump(filename: str, path='.', overwrite=False, compress=True)[source]
Save instance to file.
- Parameters
filename (str) – The stem of the filename.
path (str or Path, optional) – The location path.
overwrite (bool) – Should an existing file be overwritten?
compress (bool) – Should gzip compression be used?
Examples
>>> import tempfile
>>> v1 = ToyClass(42)
>>> v2 = ToyClass()
>>> v2.value
0
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     v1.dump(filename='myfile', compress=True, path=tmpdirname)
...     dir_content = [file.name for file in Path(tmpdirname).glob('*')]
...     v2 = ToyClass.load(filename='myfile', path=Path(tmpdirname))
...     v1.dump(filename='myfile', compress=True, path=tmpdirname)  # doctest: +ELLIPSIS
File ...myfile.pkl.gz already exists! Use overwrite option to overwrite.
>>> dir_content
['myfile.pkl.gz']
>>> v2.value
42
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     v1.dump(filename='myfile', compress=False, path=tmpdirname)
...     v1.dump(filename='myfile', compress=False, path=tmpdirname)  # doctest: +ELLIPSIS
File ...myfile.pkl already exists! Use overwrite option to overwrite.
>>> v1.value = 51
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     v1.dump(filename='myfile', path=tmpdirname, compress=False)
...     v1.dump(filename='myfile', path=tmpdirname, overwrite=True, compress=False)
...     v2 = ToyClass.load(filename='myfile', path=tmpdirname)
...     dir_content = [file.name for file in Path(tmpdirname).glob('*')]
>>> dir_content
['myfile.pkl']
>>> v2.value
51
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     v2 = ToyClass.load(filename='thisfilenamedoesnotexist')
Traceback (most recent call last):
...
FileNotFoundError: [Errno 2] No such file or directory: ...
- gismo.common.auto_k(data, order=None, max_k=100, target=1.0)[source]
Proposes a threshold k of significant values according to a relevance vector.
- Parameters
data (ndarray) – Vector with positive relevance values.
max_k (int) – Maximal number of entries to return; also the number of entries used to determine the threshold.
target (float) – Threshold modulation. A higher target means fewer results. A target set to 1.0 corresponds to using the average of the max_k top values as the threshold.
- Returns
k – Recommended number of values.
- Return type
int
Example
>>> data = np.array([30, 1, 2, .3, 4, 50, 80])
>>> auto_k(data)
3
- gismo.common.toy_source_dict = [{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'}, {'title': 'Second Document', 'content': 'This is a sentence about Blade.'}, {'title': 'Third Document', 'content': 'This is another sentence about Shadoks.'}, {'title': 'Fourth Document', 'content': 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.'}, {'title': 'Fifth Document', 'content': 'In chinese folklore, a Mogwaï is a demon.'}]
A minimal source example where items are dicts with keys title and content.
- gismo.common.toy_source_text = ['Gizmo is a Mogwaï.', 'This is a sentence about Blade.', 'This is another sentence about Shadoks.', 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.', 'In chinese folklore, a Mogwaï is a demon.']
A minimal source example where items are str.
Parameters
- gismo.parameters.ALPHA = 0.5
Default value for damping factor. Controls the trade-off between closeness and centrality.
- gismo.parameters.DEFAULT_LANDMARKS_PARAMETERS = {'balance': 0.5, 'distortion': 1.0, 'max_k': 100, 'post': True, 'rank': <function <lambda>>, 'resolution': 0.7, 'stretch': 2.0, 'target_k': 1.0, 'wide': True, 'x_density': 1000, 'y_density': 1000}
Dictionary of default runtime Landmarks parameters.
- gismo.parameters.DEFAULT_PARAMETERS = {'alpha': 0.5, 'distortion': 1.0, 'max_k': 100, 'memory': 0.0, 'n_iter': 4, 'offset': 1.0, 'post': True, 'resolution': 0.7, 'stretch': 2.0, 'target_k': 1.0, 'wide': True}
Dictionary of default runtime Gismo parameters.
- gismo.parameters.DISTORTION = 1.0
Default distortion. Controls how much of diteration relevance is mixed into the embedding for similarity computation.
- gismo.parameters.MAX_K = 100
Default top population size for estimating k.
- gismo.parameters.MEMORY = 0.0
Default memory value. Controls how much of previous computation is kept when performing a new diffusion.
- gismo.parameters.N_ITER = 4
Default value for the number of round-trip diffusions to perform. Higher value means better precision but longer execution time.
- gismo.parameters.OFFSET = 1.0
Default offset value. Controls how much of the initial fluid should be deduced from the relevance.
- gismo.parameters.POST = True
Default post policy. If True, post function is applied on items and clusters.
- class gismo.parameters.Parameters(parameter_list=None, **kwargs)[source]
Manages Gismo runtime parameters. When called, an instance will yield a dictionary of parameters. It is also used for other Gismo classes like Landmarks.
- Parameters
parameter_list (dict, optional) – Dictionary of managed parameters and their default values. Defaults to DEFAULT_PARAMETERS.
kwargs (dict) – Parameter values that override the defaults.
Examples
Use default parameters.
>>> p = Parameters()
>>> p()
{'alpha': 0.5, 'n_iter': 4, 'offset': 1.0, 'memory': 0.0, 'stretch': 2.0, 'resolution': 0.7, 'max_k': 100, 'target_k': 1.0, 'wide': True, 'post': True, 'distortion': 1.0}
Use default parameters with changed stretch.
>>> p = Parameters(stretch=1.7)
>>> p()['stretch']
1.7
Note that parameters that do not exist will be ignored (and a warning will be issued).
>>> p = Parameters(strech=1.7)
>>> p()
{'alpha': 0.5, 'n_iter': 4, 'offset': 1.0, 'memory': 0.0, 'stretch': 2.0, 'resolution': 0.7, 'max_k': 100, 'target_k': 1.0, 'wide': True, 'post': True, 'distortion': 1.0}
You can change the value of an attribute to alter the returned parameters.
>>> p.alpha = 0.85
>>> p()['alpha']
0.85
You can also apply on-the-fly parameters by passing them when calling the instance.
>>> p(resolution=0.9)['resolution']
0.9
Like for construction, parameters that do not exist are ignored and a warning is issued.
>>> p(resolutio=.9)
{'alpha': 0.85, 'n_iter': 4, 'offset': 1.0, 'memory': 0.0, 'stretch': 2.0, 'resolution': 0.7, 'max_k': 100, 'target_k': 1.0, 'wide': True, 'post': True, 'distortion': 1.0}
Note the possibility to store a custom set of parameters by using parameter_list at construction.
>>> p = Parameters(parameter_list={'a': 1.0, 'b': True}, a=1.5)
>>> p()
{'a': 1.5, 'b': True}
- gismo.parameters.RESOLUTION = 0.7
Default resolution value. Defines how strict the merging of clusters is during recursive clustering.
- gismo.parameters.STRETCH = 2.0
Default stretch value. When performing covering, defines the ratio between considered pages and selected covering pages.
- gismo.parameters.TARGET_K = 1.0
Default threshold for estimating k.
- gismo.parameters.WIDE = True
Default behavior for covering. True for the wide variant, False for the core variant.