Gismo

The main gismo module combines all the modules above to provide high-level, end-to-end analysis methods.

Main module.

class gismo.gismo.Gismo(corpus=None, embedding=None, **kwargs)

Gismo mixes a corpus and its embedding to provide search and structure methods.

Parameters:
  • corpus (Corpus) – Defines the documents of the gismo.

  • embedding (Embedding) – Defines the embedding of the gismo.

  • kwargs (dict) – Custom default runtime parameters. You just need to specify the parameters that differ from DEFAULT_PARAMETERS.
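The way kwargs overrides only the parameters that differ from the defaults can be pictured with a small sketch. The dictionary below is a hypothetical stand-in, not the library's actual DEFAULT_PARAMETERS: only the names max_k and target_k come from this page, and their values here are made up. The helper merge_parameters is also illustrative.

```python
# Hypothetical defaults for illustration only; the real DEFAULT_PARAMETERS
# of the library may hold other names and values.
ILLUSTRATIVE_DEFAULTS = {"max_k": 100, "target_k": 1.0}


def merge_parameters(defaults, **kwargs):
    """Return a copy of defaults where only the given entries are overridden."""
    merged = dict(defaults)  # copy, so the defaults themselves are untouched
    merged.update(kwargs)
    return merged


# Only target_k is specified; max_k keeps its default value.
params = merge_parameters(ILLUSTRATIVE_DEFAULTS, target_k=0.2)
print(params)
```

Copying the defaults before updating keeps them reusable for other instances.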

Example

The Corpus class defines how documents of a source should be converted to plain text.

>>> corpus = Corpus(toy_source_dict, lambda x: x['content'])

The Embedding class extracts features (e.g. words) and computes weights between documents and features.

>>> vectorizer = CountVectorizer(dtype=float)
>>> embedding = Embedding(vectorizer=vectorizer)
>>> embedding.fit_transform(corpus)
>>> embedding.m # number of features
36

The Gismo class combines them for performing queries. After a query is performed, one can ask for the best items. The number of items to return can be specified with parameter k or automatically adjusted.

>>> gismo = Gismo(corpus, embedding)
>>> success = gismo.rank("Gizmo")
>>> gismo.parameters.target_k = .2 # The toy dataset is very small, so we lower the target_k parameter used to auto-adjust k.
>>> gismo.get_documents_by_rank()
[{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'}, {'title': 'Fourth Document', 'content': 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.'}, {'title': 'Fifth Document', 'content': 'In chinese folklore, a Mogwaï is a demon.'}]

Post-processing functions can be used to tweak the returned object (the underlying ranking is unchanged).

>>> gismo.post_documents_item = partial(post_documents_item_content, max_size=44)
>>> gismo.get_documents_by_rank()
['Gizmo is a Mogwaï.', 'This very long sentence, with a lot of stuff', 'In chinese folklore, a Mogwaï is a demon.']
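To make the mechanism concrete, here is a minimal sketch of a post-processing function, assuming (as the lambda used in the XGismo example suggests) that such a function receives the gismo object and a document index. The function item_content and the stand-in object fake_gismo are hypothetical; they only mimic the pattern and are not the library's code.

```python
from functools import partial
from types import SimpleNamespace


def item_content(gismo, i, max_size=None):
    """Return the content of document i, truncated to max_size characters."""
    return gismo.corpus[i]["content"][:max_size]


# Minimal stand-in exposing only a `corpus` attribute (not a real Gismo).
fake_gismo = SimpleNamespace(corpus=[
    {"title": "First Document", "content": "Gizmo is a Mogwaï."},
    {"title": "Fifth Document", "content": "In chinese folklore, a Mogwaï is a demon."},
])

# partial freezes max_size, leaving a function of (gismo, i).
short_content = partial(item_content, max_size=10)

ranked_indices = [1, 0]  # a pretend ranking, for illustration
results = [short_content(fake_gismo, i) for i in ranked_indices]
print(results)
```

The ranking (here, ranked_indices) is computed independently of the post-processing, which only changes how each selected item is presented.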

Ranking also works on features.

>>> gismo.get_features_by_rank()
['mogwaï', 'gizmo', 'is', 'chinese', 'demon', 'in', 'folklore']

Clustering organizes results and can provide additional hints on their relationships.

>>> gismo.post_documents_cluster = post_documents_cluster_print
>>> gismo.get_documents_by_cluster(resolution=.9) 
 F: 0.60. R: 0.65. S: 0.98.
- F: 0.71. R: 0.57. S: 0.98.
-- Gizmo is a Mogwaï. (R: 0.54; S: 0.99)
-- In chinese folklore, a Mogwaï is a demon. (R: 0.04; S: 0.71)
- This very long sentence, with a lot of stuff (R: 0.08; S: 0.69)
>>> gismo.post_features_cluster = post_features_cluster_print
>>> gismo.get_features_by_cluster() 
 F: 0.03. R: 0.29. S: 0.98.
- F: 1.00. R: 0.27. S: 0.99.
-- mogwaï (R: 0.12; S: 0.99)
-- gizmo (R: 0.12; S: 0.99)
-- is (R: 0.03; S: 0.99)
- F: 1.00. R: 0.02. S: 0.07.
-- chinese (R: 0.00; S: 0.07)
-- demon (R: 0.00; S: 0.07)
-- in (R: 0.00; S: 0.07)
-- folklore (R: 0.00; S: 0.07)

As an alternative to a textual query, the rank() method can directly use a vector z as input.

>>> z, s = gismo.embedding.query_projection("gizmo chinese folklore")
>>> z 
 <Compressed Sparse Row sparse matrix of dtype 'float64'
    with 3 stored elements and shape (1, 36)>
>>> s = gismo.rank(z=z)
>>> s
True
>>> gismo.get_documents_by_rank(k=2)
['In chinese folklore, a Mogwaï is a demon.', 'Gizmo is a Mogwaï.']
>>> gismo.get_features_by_rank()
['mogwaï', 'chinese', 'folklore', 'in', 'demon', 'gizmo', 'is']

The class also offers get_documents_by_coverage() and get_features_by_coverage() that yield a list of results obtained from a Covering-like traversal of the ranked cluster.

To demonstrate it, we first add an outsider document to the corpus and rebuild Gismo.

>>> new_entry = {'title': 'Minority Report', 'content': 'Totally unrelated stuff.'}
>>> corpus = Corpus(toy_source_dict+[new_entry], lambda x: x['content'])
>>> vectorizer = CountVectorizer(dtype=float)
>>> embedding = Embedding(vectorizer=vectorizer)
>>> embedding.fit_transform(corpus)
>>> gismo = Gismo(corpus, embedding)
>>> gismo.post_documents_item = post_documents_item_content
>>> success = gismo.rank("Gizmo")
>>> gismo.parameters.target_k = .3

Recall the classical rank-based result.

>>> gismo.get_documents_by_rank()
['Gizmo is a Mogwaï.', 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.', 'In chinese folklore, a Mogwaï is a demon.']

Gismo can use the cluster to propose alternate results that try to cover more subjects.

>>> gismo.get_documents_by_coverage()
['Gizmo is a Mogwaï.', 'Totally unrelated stuff.', 'This is a sentence about Blade.']

Note how the new entry, which has nothing to do with the rest, is pushed into the results. By setting the wide option to False, we get an alternative that focuses on mainstream results.

>>> gismo.get_documents_by_coverage(wide=False)
['Gizmo is a Mogwaï.', 'This is a sentence about Blade.', 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.']

The same principle applies to features.

>>> gismo.get_features_by_rank()
['mogwaï', 'gizmo', 'is', 'chinese', 'demon', 'in', 'folklore']
>>> gismo.get_features_by_coverage()
['mogwaï', 'about', 'chinese', 'and', 'gizmo', 'is', 'demon']

get_documents_by_cluster(k=None, **kwargs)

Returns a cluster of the best-ranked documents. By default, the cluster is post-processed through the post_documents_cluster method.

Parameters:
  • k (int, optional) – Number of documents to output. If not set, k is automatically computed using the max_k and target_k runtime parameters.

  • kwargs (dict, optional) – Custom runtime parameters.

Return type:

object

get_documents_by_cluster_from_indices(indices, **kwargs)

Returns a cluster of documents. By default, the cluster is post-processed through the post_documents_cluster method.

Parameters:
  • indices (list of int) – The indices of documents to be processed. It is assumed that the documents are sorted by importance.

  • kwargs (dict, optional) – Custom runtime parameters.

Return type:

object

get_documents_by_coverage(k=None, **kwargs)

Returns a list of top covering documents. By default, the documents are post-processed through the post_documents_item method.

Parameters:
  • k (int, optional) – Number of documents to output. If not set, k is automatically computed using the max_k and target_k runtime parameters.

  • kwargs (dict, optional) – Custom runtime parameters.

Return type:

list

get_documents_by_rank(k=None, **kwargs)

Returns a list of top documents according to the current ranking. By default, the documents are post-processed through the post_documents_item method.

Parameters:
  • k (int, optional) – Number of documents to output. If not set, k is automatically computed using the max_k and target_k runtime parameters.

  • kwargs (dict, optional) – Custom runtime parameters.

Return type:

list

get_features_by_cluster(k=None, **kwargs)

Returns a cluster of the best-ranked features. By default, the cluster is post-processed through the post_features_cluster method.

Parameters:
  • k (int, optional) – Number of features to output. If not set, k is automatically computed using the max_k and target_k runtime parameters.

  • kwargs (dict, optional) – Custom runtime parameters.

Return type:

object

get_features_by_cluster_from_indices(indices, **kwargs)

Returns a cluster of features. By default, the cluster is post-processed through the post_features_cluster method.

Parameters:
  • indices (list of int) – The indices of features to be processed. It is assumed that the features are sorted by importance.

  • kwargs (dict, optional) – Custom runtime parameters.

Return type:

object

get_features_by_coverage(k=None, **kwargs)

Returns a list of top covering features. By default, the features are post-processed through the post_features_item method.

Parameters:
  • k (int, optional) – Number of features to output. If not set, k is automatically computed using the max_k and target_k runtime parameters.

  • kwargs (dict, optional) – Custom runtime parameters.

Return type:

list

get_features_by_rank(k=None, **kwargs)

Returns a list of top features according to the current ranking. By default, the features are post-processed through the post_features_item method.

Parameters:
  • k (int, optional) – Number of features to output. If not set, k is automatically computed using the max_k and target_k runtime parameters.

  • kwargs (dict, optional) – Custom runtime parameters.

Return type:

list

rank(query='', z=None, **kwargs)

Runs the DIteration using query as the starting point.

Parameters:
  • query (str) – Text that starts the DIteration.

  • z (csr_matrix, optional) – Query vector to use in place of the textual query.

  • kwargs (dict, optional) – Custom runtime parameters.

Returns:

success – Success of the query projection. If the projection fails, the ranking is performed on a uniform distribution.

Return type:

bool
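The success flag and the uniform fallback can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation: the vocabulary mapping and the functions project_query and starting_vector are hypothetical, the uniform vector is taken over features, and the DIteration itself is not shown.

```python
def project_query(query, vocabulary):
    """Map a text query to a normalized weight vector over known features.

    Returns (z, success); success is False when no query word is known.
    """
    z = [0.0] * len(vocabulary)
    for word in query.lower().split():
        if word in vocabulary:
            z[vocabulary[word]] += 1.0
    total = sum(z)
    if total == 0:
        return z, False
    return [w / total for w in z], True


def starting_vector(query, vocabulary):
    """Use the projected query, or a uniform distribution if projection fails."""
    z, success = project_query(query, vocabulary)
    if not success:
        z = [1.0 / len(vocabulary)] * len(vocabulary)
    return z, success


# Hypothetical toy vocabulary: feature -> column index.
vocabulary = {"gizmo": 0, "mogwaï": 1, "demon": 2, "yoda": 3}

z_ok, ok = starting_vector("Gizmo demon", vocabulary)
z_fallback, failed = starting_vector("unknown words", vocabulary)
print(ok, z_ok)
print(failed, z_fallback)
```

Returning the flag alongside a usable vector lets the caller proceed with a ranking in both cases while still knowing whether the query was understood.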

class gismo.gismo.XGismo(x_embedding=None, y_embedding=None, filename=None, path='.', **kwargs)

Given two distinct embeddings based on the same set of documents, builds a new gismo. The features of x_embedding are the corpus of this new gismo. The features of y_embedding are the features of this new gismo. The dual embedding of the new gismo is obtained by crossing the two input dual embeddings.

An xgismo behaves essentially like a gismo object. The main difference is an additional parameter y for the rank method, which controls whether the query projection should be performed on the y_embedding or on the x_embedding.

Parameters:
  • x_embedding (Embedding) – The left embedding, which defines the documents of the xgismo.

  • y_embedding (Embedding) – The right embedding, which defines the features of the xgismo.

  • filename (str, optional) – If set, will load xgismo from file.

  • path (str or Path, optional) – Directory where the xgismo is to be loaded from.

  • kwargs (dict) – Custom default runtime parameters. You just need to specify the parameters that differ from DEFAULT_PARAMETERS.
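The idea of crossing two embeddings that share the same documents can be pictured with a plain-Python sketch. It uses raw co-occurrence counts on a made-up dataset, whereas the library works on weighted embeddings, so this is only an intuition for which items end up related, not the actual computation. The function cross and the sample data are hypothetical.

```python
from collections import defaultdict

# Each document lists its x-items (authors) and its y-items (words).
documents = [
    {"authors": ["alice", "bob"], "words": ["pagerank", "graph"]},
    {"authors": ["alice"], "words": ["graph", "queue"]},
    {"authors": ["carol"], "words": ["queue"]},
]


def cross(documents, x_key, y_key):
    """Count, for each x-item, how often each y-item shares a document with it."""
    crossed = defaultdict(lambda: defaultdict(int))
    for doc in documents:
        for x in doc[x_key]:
            for y in doc[y_key]:
                crossed[x][y] += 1
    return {x: dict(ys) for x, ys in crossed.items()}


# Authors play the role of documents, words the role of features.
author_words = cross(documents, "authors", "words")
print(author_words["alice"])
```

The original documents disappear from the result: they only serve as the link between the x-items and the y-items.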

Examples

One of the main use cases for XGismo consists in transforming a list of articles into a Gismo that relates authors and the words they use. Let’s start by retrieving a few articles.

>>> toy_url = "https://dblp.org/pers/xx/m/Mathieu:Fabien.xml"
>>> source = [a for a in url2source(toy_url) if int(a['year'])<2023]

Then we build the embedding of words.

>>> corpus = Corpus(source, to_text=lambda x: x['title'])
>>> w_count = CountVectorizer(dtype=float, stop_words='english')
>>> w_embedding = Embedding(w_count)
>>> w_embedding.fit_transform(corpus)

And the embedding of authors.

>>> to_authors_text = lambda dic: " ".join([a.replace(' ', '_') for a in dic['authors']])
>>> corpus.to_text = to_authors_text
>>> a_count = CountVectorizer(dtype=float, preprocessor=lambda x:x, tokenizer=lambda x: x.split(' '))
>>> a_embedding = Embedding(a_count)
>>> a_embedding.fit_transform(corpus)
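The underscore trick used above keeps each author as a single token: spaces inside a name become underscores, so splitting on spaces separates authors rather than first and last names. A standalone sketch of that round trip, using a hypothetical entry and a hypothetical tokenize helper that mirrors the tokenizer lambda:

```python
def to_authors_text(entry):
    """Join authors with spaces after replacing inner spaces by underscores."""
    return " ".join(a.replace(" ", "_") for a in entry["authors"])


def tokenize(text):
    """Split on single spaces, as the tokenizer lambda above does."""
    return text.split(" ")


# Hypothetical entry for illustration.
entry = {"title": "A hypothetical title",
         "authors": ["Anne Bouillard", "Elie de Panafieu"]}

text = to_authors_text(entry)
tokens = tokenize(text)
displayed = [t.replace("_", " ") for t in tokens]  # undo the trick for display
print(tokens)
```

Replacing the underscores back with spaces is what the post_documents_item lambda does when results are displayed.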

We can now combine the two embeddings in one xgismo.

>>> xgismo = XGismo(a_embedding, w_embedding)
>>> xgismo.post_documents_item = lambda g, i: g.corpus[i].replace('_', ' ')

We can use xgismo to query keyword(s).

>>> success = xgismo.rank("Pagerank")
>>> xgismo.get_documents_by_rank()
['Mohamed Bouklit', 'Dohy Hong', 'The Dang Huynh']

We can use it to query researcher(s).

>>> success = xgismo.rank("Anne_Bouillard", y=False)
>>> xgismo.get_documents_by_rank()
['Anne Bouillard', 'Elie de Panafieu', 'Céline Comte', 'Philippe Sehier', 'Thomas Deiß', 'Dmitry Lebedev']

rank(query='', y=True, **kwargs)

Runs the DIteration using query as the starting point. query can be evaluated on features (y=True) or documents (y=False).

Parameters:
  • query (str) – Text that starts the DIteration.

  • y (bool) – Determines if query should be evaluated on features (True) or documents (False).

  • kwargs (dict, optional) – Custom runtime parameters.

Returns:

success – Success of the query projection. If the projection fails, the ranking is performed on a uniform distribution.

Return type:

bool