Reference
Gismo is made of multiple small modules designed to be mixed together.
corpus: This module contains simple wrappers that turn a wide range of document sources into something Gismo can process.
embedding: This module creates and manipulates dual TF-IDF/TF-ITF embeddings out of a corpus.
diteration: This module transforms queries into relevance vectors that can be used to rank and organize documents and features.
clustering: This module implements the tree-like organization of selected items.
gismo: The main gismo module combines all the modules above to provide high-level, end-to-end analysis methods.
landmarks: Introduced in v0.4, this high-level module allows deeper analysis of a small corpus by using individual query results for the embedding.
post_processing: This module provides a simple, unified way to apply automatic transformations (e.g. formatting) to the results of an analysis.
filesource: This module can be used to read documents one by one from disk instead of loading them all in memory. Useful for very large corpora.
sentencizer: This module can leverage a document-level gismo to provide sentence-level analysis. It can be used to extract key phrases (headlines).
datasets: Collection of helpers to access small (or less small) datasets.
common: Multi-purpose module for things used in more than one other module.
parameters: Management of runtime parameters.
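As a quick illustration of how these modules combine, here is a minimal end-to-end sketch using the toy corpus shipped with the package (all names below are documented in this reference; the query string is arbitrary):
>>> from gismo.common import toy_source_text
>>> from gismo.corpus import Corpus
>>> from gismo.embedding import Embedding
>>> from gismo.gismo import Gismo
>>> corpus = Corpus(toy_source_text)           # wrap the raw source
>>> embedding = Embedding()                    # dual TF-IDF/TF-ITF embedding
>>> embedding.fit_transform(corpus)
>>> gismo = Gismo(corpus, embedding)           # combine corpus and embedding
>>> success = gismo.rank("Mogwaï")             # run a query
>>> documents = gismo.get_documents_by_rank()  # best documents, best first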
Corpus
- class gismo.corpus.Corpus(source=None, to_text=None)[source]
The Corpus class is the starting point of any Gismo workflow. It abstracts dataset pre-processing: it is just a list of items (called documents in Gismo) augmented with a method that describes how to convert a document to a string object. It is used to build an Embedding.
- Parameters
source (list) – The list of items that constitutes the dataset to analyze. Actually, any iterable object with __len__() and __getitem__() methods can potentially be used as a source (see FileSource for an example).
to_text (function, optional) – The function that transforms an item from the source into plain text (str). If not set, it will default to the identity function lambda x: x.
Examples
The following code uses the toy_source_text list as source and specifies that the text extraction method should be: take the first 15 characters and add "…".
When we iterate with the iterate() method, observe that the extraction is not applied.
>>> corpus = Corpus(toy_source_text, to_text=lambda x: f"{x[:15]}...")
>>> for c in corpus.iterate():
...     print(c)
Gizmo is a Mogwaï.
This is a sentence about Blade.
This is another sentence about Shadoks.
This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.
In chinese folklore, a Mogwaï is a demon.
When we iterate with the iterate_text() method, observe that the extraction is applied.
>>> for c in corpus.iterate_text():
...     print(c)
Gizmo is a Mogw...
This is a sente...
This is another...
This very long ...
In chinese folk...
A corpus object can be saved/loaded with the dump() and load() methods inherited from the MixInIO class. The load() method is a class method to be used instead of the constructor.
>>> import tempfile
>>> corpus1 = Corpus(toy_source_text)
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     corpus1.dump(filename="myfile", path=tmpdirname)
...     corpus2 = Corpus.load(filename="myfile", path=tmpdirname)
>>> corpus2[0]
'Gizmo is a Mogwaï.'
- merge_new_source(new_source, doc2key=None)[source]
Incorporates new entries while avoiding the creation of duplicates. This method is typically used when you have a dynamic source, like an RSS feed, and you want to periodically update your corpus.
- Parameters
new_source (list) – Source compatible (e.g. similar item type) with the current source.
doc2key (function) – Callback that provides items with unique hashable keys, used to avoid duplicates.
Examples
The following code uses the toy_source_dict list as source and adds two new items, including a redundant one.
>>> corpus = Corpus(toy_source_dict.copy(), to_text=lambda x: x['content'][:14])
>>> len(corpus)
5
>>> new_corpus = [{"title": "Another document", "content": "I don't know what to say!"},
...               {'title': 'Fifth Document', 'content': 'In chinese folklore, a Mogwaï is a demon.'}]
>>> corpus.merge_new_source(new_corpus, doc2key=lambda e: e['title'])
>>> len(corpus)
6
>>> for c in corpus.iterate_text():
...     print(c)
Gizmo is a Mog
This is a sent
This is anothe
This very long
In chinese fol
I don't know w
- class gismo.corpus.CorpusList(corpus_list=None, filename=None, path='.')[source]
This class makes a list of corpora behave like one single virtual corpus. This is useful to glue together corpora with distinct shapes and to_text() methods.
- Parameters
corpus_list (list of Corpus) – The list of corpora to glue.
Example
>>> multi_corp = CorpusList([Corpus(toy_source_text, lambda x: x[:15]+"..."),
...                          Corpus(toy_source_dict, lambda e: e['title'])])
>>> len(multi_corp)
10
>>> multi_corp[7]
{'title': 'Third Document', 'content': 'This is another sentence about Shadoks.'}
>>> for c in multi_corp.iterate_text():
...     print(c)
Gizmo is a Mogw...
This is a sente...
This is another...
This very long ...
In chinese folk...
First Document
Second Document
Third Document
Fourth Document
Fifth Document
Embedding
- class gismo.embedding.Embedding(vectorizer=None)[source]
This class leverages the CountVectorizer class to build the dual embedding of a Corpus:
Documents are embedded in the space of features;
Features are embedded in the space of documents.
See the examples and methods below for all usages of the class.
- Parameters
vectorizer (CountVectorizer, optional) – Custom CountVectorizer to override the default behavior (recommended). Having a CountVectorizer adapted to the Corpus is good practice.
- fit(corpus)[source]
Learn features from a corpus of documents.
If not yet set, a default CountVectorizer is created.
Features are computed and stored.
Inverse-Document-Frequency weights of features are computed.
- Parameters
corpus (Corpus) – The corpus to ingest.
Example
>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit(corpus)
>>> len(embedding.idf)
21
>>> list(embedding.features[:8])
['blade', 'chinese', 'comparing', 'demon', 'folklore', 'gizmo', 'gremlins', 'inside']
- fit_ext(embedding)[source]
Use learned features from another Embedding. This is useful for the fast creation of local embeddings (e.g. at sentence level) out of a global embedding.
- Parameters
embedding (Embedding) – External embedding to copy.
Examples
>>> corpus = Corpus(toy_source_text)
>>> other_embedding = Embedding()
>>> other_embedding.fit(corpus)
>>> embedding = Embedding()
>>> embedding.fit_ext(other_embedding)
>>> len(embedding.idf)
21
>>> list(embedding.features[:8])
['blade', 'chinese', 'comparing', 'demon', 'folklore', 'gizmo', 'gremlins', 'inside']
- fit_transform(corpus)[source]
Ingest a corpus of documents.
If not yet set, a default CountVectorizer is created.
Features are computed and stored (fit).
Inverse-Document-Frequency weights of features are computed (fit).
TF-IDF embedding of documents is computed and stored (transform).
TF-ITF embedding of features is computed and stored (transform).
- Parameters
corpus (Corpus) – The corpus to ingest.
Example
>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> embedding.x
<5x21 sparse matrix of type '<class 'numpy.float64'>'
    with 25 stored elements in Compressed Sparse Row format>
>>> list(embedding.features[:8])
['blade', 'chinese', 'comparing', 'demon', 'folklore', 'gizmo', 'gremlins', 'inside']
- query_projection(query)[source]
Project a query in the feature space.
- Parameters
query (str) – Text to project.
- Returns
z (csr_matrix) – Result of the query projection (the IDF distribution if the query does not match any feature).
success (bool) – Projection success (True if at least one feature has been found).
Example
>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> z, success = embedding.query_projection("Gizmo is not Yoda but he rocks!")
>>> for i in range(len(z.data)):
...     print(f"{embedding.features[z.indices[i]]}: {z.data[i]}")
gizmo: 0.3868528072...
yoda: 0.6131471927...
>>> success
True
>>> z, success = embedding.query_projection("That content does not intersect toy corpus")
>>> success
False
- transform(corpus)[source]
Ingest a corpus of documents using existing features. Requires that the embedding has been fitted beforehand.
TF-IDF embedding of documents is computed and stored.
TF-ITF embedding of features is computed and stored.
- Parameters
corpus (Corpus) – The corpus to ingest.
Example
>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> [embedding.features[i] for i in embedding.x.indices[:8]]
['gizmo', 'mogwaï', 'blade', 'sentence', 'sentence', 'shadoks', 'comparing', 'gizmo']
>>> small_corpus = Corpus(["I only talk about Yoda", "Gizmo forever!"])
>>> embedding.transform(small_corpus)
>>> [embedding.features[i] for i in embedding.x.indices]
['yoda', 'gizmo']
- gismo.embedding.auto_vect(corpus=None)[source]
Creates a default CountVectorizer compatible with the Embedding constructor. For not-too-small corpora, a slight frequency filter is applied.
- Parameters
corpus (Corpus, optional) – The corpus for which the CountVectorizer is intended.
- Returns
A CountVectorizer object compatible with the Embedding constructor.
- Return type
CountVectorizer
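As a usage sketch (relying only on the documented signature above), one can build the default vectorizer explicitly and hand it to an Embedding:
>>> from gismo.common import toy_source_text
>>> from gismo.corpus import Corpus
>>> from gismo.embedding import Embedding, auto_vect
>>> corpus = Corpus(toy_source_text)
>>> vectorizer = auto_vect(corpus)  # default CountVectorizer adapted to the corpus
>>> embedding = Embedding(vectorizer=vectorizer)
>>> embedding.fit_transform(corpus)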
- gismo.embedding.idf_fit(indptr, n)[source]
Computes the Inverse-Document-Frequency vector on sparse embedding y.
- gismo.embedding.idf_transform(indptr, data, idf_vector)[source]
Applies inplace Inverse-Document-Frequency transformation on sparse embedding y.
- gismo.embedding.itf_fit_transform(indptr, data, m)[source]
Applies inplace Inverse-Term-Frequency transformation on sparse embedding x.
- gismo.embedding.l1_normalize(indptr, data)[source]
Computes L1 norm on sparse embedding (x or y) and applies inplace normalization.
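These low-level helpers all operate directly on the indptr/data attributes of a scipy csr_matrix. As an illustration, here is a hedged sketch of an inplace row-wise L1 normalization in the spirit of l1_normalize (the helper name toy_l1_normalize and its loop are ours, not Gismo's actual code):
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> def toy_l1_normalize(indptr, data):
...     for row in range(len(indptr) - 1):
...         start, end = indptr[row], indptr[row + 1]
...         norm = np.sum(np.abs(data[start:end]))  # L1 norm of the row
...         if norm > 0:
...             data[start:end] /= norm
>>> m = csr_matrix(np.array([[1.0, 3.0], [2.0, 2.0]]))
>>> toy_l1_normalize(m.indptr, m.data)
>>> m.toarray()
array([[0.25, 0.75],
       [0.5 , 0.5 ]])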
- gismo.embedding.query_shape(indices, data, idf)[source]
Applies inplace logarithmic smoothing, IDF weighting, and normalization to the output of the CountVectorizer transform() method.
- Parameters
indices (ndarray) – indices attribute of the csr_matrix obtained from transform().
data (ndarray) – data attribute of the csr_matrix obtained from transform().
idf (ndarray) – IDF vector of the embedding, obtained from idf_fit().
- Returns
norm – The norm of the vector before normalization.
- Return type
float
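A hedged sketch of these three steps; the general shape (logarithmic smoothing, IDF weighting, L1 normalization) follows the description above, but the exact smoothing formula Gismo uses is an assumption:
>>> import numpy as np
>>> def toy_query_shape(indices, data, idf):
...     data[:] = (1 + np.log(data)) * idf[indices]  # smooth counts, weight by IDF
...     norm = np.sum(data)
...     if norm > 0:
...         data /= norm
...     return norm  # norm before normalization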
DIteration
- class gismo.diteration.DIteration(n, m)[source]
This class is in charge of performing the DIteration algorithm.
- gismo.diteration.jit_diffusion(x_pointers, x_indices, x_data, y_pointers, y_indices, y_data, z_indices, z_data, x_relevance, y_relevance, alpha, n_iter, offset: float, x_fluid, y_fluid)[source]
Core diffusion engine written to be compatible with Numba. This is where the DIteration algorithm is applied inline.
- Parameters
x_pointers (ndarray) – Pointers of the csr_matrix embedding of documents.
x_indices (ndarray) – Indices of the csr_matrix embedding of documents.
x_data (ndarray) – Data of the csr_matrix embedding of documents.
y_pointers (ndarray) – Pointers of the csr_matrix embedding of features.
y_indices (ndarray) – Indices of the csr_matrix embedding of features.
y_data (ndarray) – Data of the csr_matrix embedding of features.
z_indices (ndarray) – Indices of the csr_matrix embedding of the query projection.
z_data (ndarray) – Data of the csr_matrix embedding of the query projection.
x_relevance (ndarray) – Placeholder for the relevance of documents.
y_relevance (ndarray) – Placeholder for the relevance of features.
alpha (float in range [0.0, 1.0]) – Damping factor. Controls the trade-off between closeness and centrality.
n_iter (int) – Number of round-trip diffusions to perform. A higher value means better precision but longer execution time.
offset (float in range [0.0, 1.0]) – Controls how much of the initial fluid should be deduced from the relevance.
x_fluid (ndarray) – Placeholder for fluid on the side of documents.
y_fluid (ndarray) – Placeholder for fluid on the side of features.
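To convey the intuition, here is a hedged, dense NumPy sketch of the round-trip diffusion; the real jit_diffusion works inplace on the csr attributes listed above, and its exact update rule (in particular the offset handling, omitted here) may differ:
>>> import numpy as np
>>> def toy_diffusion(x, y, z, alpha=0.5, n_iter=4):
...     # x: n x m document embedding; y: m x n feature embedding;
...     # z: query projection in the feature space (length m).
...     x_relevance = np.zeros(x.shape[0])
...     y_relevance = np.zeros(y.shape[0])
...     y_fluid = z.copy()
...     for _ in range(n_iter):
...         x_fluid = alpha * (x @ y_fluid)  # diffuse feature fluid to documents
...         x_relevance += x_fluid
...         y_fluid = alpha * (y @ x_fluid)  # diffuse document fluid back to features
...         y_relevance += y_fluid
...     return x_relevance, y_relevance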
Clustering
- class gismo.clustering.Cluster(indice=None, rank=None, vector=None)[source]
The Cluster class is used for the internal representation of hierarchical clusters. It stores the attributes that describe a clustering structure and provides basic cluster addition for merge operations.
- Parameters
indice (int) – Index of the head (main element) of the cluster.
rank (int) – The ranking order of a cluster.
vector (csr_matrix) – The vector representation of the cluster.
- vector
The vector representation of the cluster.
- Type
csr_matrix
- intersection_vector
The vector representation of the common points of a cluster.
- Type
csr_matrix (deprecated)
- focus
The consistency of the cluster (higher focus means that elements are more similar).
- Type
float in range [0.0, 1.0]
Examples
>>> c1 = Cluster(indice=0, rank=1, vector=csr_matrix([1.0, 0.0, 1.0]))
>>> c2 = Cluster(indice=5, rank=0, vector=csr_matrix([1.0, 1.0, 0.0]))
>>> c3 = c1 + c2
>>> c3.members
[0, 5]
>>> c3.indice
5
>>> c3.vector.toarray()
array([[2., 1., 1.]])
>>> c3.intersection_vector.toarray()
array([[1., 0., 0.]])
>>> c1 == sum([c1])
True
- gismo.clustering.covering_order(cluster, wide=True)[source]
Uses a hierarchical cluster to provide an ordering of the items that mixes rank and coverage.
This is done by exploring all clusters and subclusters by increasing similarity and rank (lexicographic order). Two variants are proposed:
Core: for each cluster, append its representative to the list if new. Central items tend to have better rank.
Wide: for each cluster, append its children's representatives to the list if new. Marginal items tend to have better rank.
- gismo.clustering.get_sim(csr, arr)[source]
Simple similarity computation between csr_matrix and ndarray.
- Parameters
csr (csr_matrix) – Sparse vector.
arr (ndarray) – Dense vector.
- Return type
float
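A hedged sketch of such a similarity, assuming a plain dot product restricted to the non-zero entries of the sparse vector (the actual formula used by get_sim may differ):
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> def toy_sim(csr, arr):
...     return float(csr.data @ arr[csr.indices])
>>> toy_sim(csr_matrix([0.5, 0.0, 0.5]), np.array([1.0, 1.0, 0.0]))
0.5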
- gismo.clustering.merge_clusters(cluster_list: list, focus=1.0)[source]
Complete merge operation. In addition to the basic merge provided by Cluster, it ensures the following:
Consistency of focus, by integrating the extra-focus (typically given by subspace_partition());
Children (the members of the list) are sorted according to their respective ranks.
- gismo.clustering.rec_clusterize(cluster_list: list, resolution=0.7)[source]
Auxiliary recursive function for clustering.
- Parameters
cluster_list (list of Cluster) – Current aggregation state.
resolution (float in range [0.0, 1.0]) – Sets the laziness of aggregation. A resolution set to 0.0 yields a one-step clustering (star structure), while a resolution set to 1.0 yields, up to tie similarities, a binary tree (dendrogram).
- Return type
list of Cluster
- gismo.clustering.subspace_clusterize(subspace, resolution=0.7, indices=None)[source]
Converts a subspace (matrix seen as a list of vectors) to a Cluster object (hierarchical clustering).
- Parameters
subspace (ndarray, csr_matrix) – A k x m matrix seen as a list of k m-dimensional vectors sorted by importance order.
resolution (float in range [0.0, 1.0]) – Sets the laziness of aggregation. A resolution set to 0.0 yields a one-step clustering (star structure), while a resolution set to 1.0 yields, up to tie similarities, a binary tree (dendrogram).
indices (list, optional) – Indicates the index of each element of the subspace. Used when the subspace is extracted from a larger space (e.g. X or Y). If not set, indices are set to range(k).
- Returns
A cluster whose leaves are the k vectors from the subspace.
- Return type
Cluster
Example
>>> corpus = Corpus(toy_source_text)
>>> vectorizer = CountVectorizer(dtype=float)
>>> embedding = Embedding(vectorizer=vectorizer)
>>> embedding.fit_transform(corpus)
>>> subspace = embedding.x[1:, :]
>>> cluster = subspace_clusterize(subspace)
>>> len(cluster.children)
2
>>> cluster = subspace_clusterize(subspace, resolution=.02)
>>> len(cluster.children)
4
- gismo.clustering.subspace_distortion(indices, data, relevance, distortion: float)[source]
Apply inplace distortion of a subspace with relevance.
- Parameters
indices (ndarray) – indices attribute of the subspace csr_matrix.
data (ndarray) – data attribute of the subspace csr_matrix.
relevance (ndarray) – Relevance values in the embedding space.
distortion (float in [0.0, 1.0]) – Power applied to the relevance for distortion.
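Given the description above, the inplace update plausibly boils down to a one-liner; here is a hedged, self-contained sketch (the exact rule is our assumption):
>>> import numpy as np
>>> indices = np.array([0, 2])             # features present in the subspace
>>> data = np.array([0.5, 0.5])            # their weights
>>> relevance = np.array([0.9, 0.1, 0.4])  # relevance per feature
>>> data *= relevance[indices] ** 0.5      # distortion = 0.5
>>> np.round(data, 3)
array([0.474, 0.316])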
- gismo.clustering.subspace_partition(subspace, resolution=0.7)[source]
Proposes a partition of the subspace that merges together vectors with a similar direction.
- Parameters
subspace (ndarray, csr_matrix) – A k x m matrix seen as a list of k m-dimensional vectors sorted by importance order.
resolution (float in range [0.0, 1.0]) – How strict the merging should be. 0.0 will merge all items together, while 1.0 will only merge mutually closest items.
- Returns
A list of subsets that form a partition. Each subset is represented by a pair (p, f): p is the set of indices of the subset, f is the typical similarity of the partition (called focus).
- Return type
list
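A minimal usage sketch, reusing the subspace built in the subspace_clusterize example above (outputs omitted, as they depend on the corpus):
for p, f in subspace_partition(subspace, resolution=0.7):
    print(sorted(p), f)  # the indices of each subset and its focus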
Gismo
- class gismo.gismo.Gismo(corpus=None, embedding=None, **kwargs)[source]
Gismo mixes a corpus and its embedding to provide search and structure methods.
- Parameters
corpus (Corpus) – Defines the documents of the gismo.
embedding (Embedding) – Defines the embedding of the gismo.
kwargs (dict) – Custom default runtime parameters. You just need to specify the parameters that differ from DEFAULT_PARAMETERS.
Example
The Corpus class defines how documents of a source should be converted to plain text.
>>> corpus = Corpus(toy_source_dict, lambda x: x['content'])
The Embedding class extracts features (e.g. words) and computes weights between documents and features.
>>> vectorizer = CountVectorizer(dtype=float)
>>> embedding = Embedding(vectorizer=vectorizer)
>>> embedding.fit_transform(corpus)
>>> embedding.m  # number of features
36
The Gismo class combines them for performing queries. After a query is performed, one can ask for the best items. The number of items to return can be specified with the parameter k or automatically adjusted.
>>> gismo = Gismo(corpus, embedding)
>>> success = gismo.rank("Gizmo")
>>> gismo.parameters.target_k = .2  # The toy dataset is very small, so we lower the auto_k parameter.
>>> gismo.get_documents_by_rank()
[{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'}, {'title': 'Fourth Document', 'content': 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.'}, {'title': 'Fifth Document', 'content': 'In chinese folklore, a Mogwaï is a demon.'}]
Post-processing functions can be used to tweak the returned object (the underlying ranking is unchanged).
>>> gismo.post_documents_item = partial(post_documents_item_content, max_size=44)
>>> gismo.get_documents_by_rank()
['Gizmo is a Mogwaï.', 'This very long sentence, with a lot of stuff', 'In chinese folklore, a Mogwaï is a demon.']
Ranking also works on features.
>>> gismo.get_features_by_rank()
['mogwaï', 'gizmo', 'is', 'in', 'demon', 'chinese', 'folklore']
Clustering organizes results and can provide additional hints on their relationships.
>>> gismo.post_documents_cluster = post_documents_cluster_print
>>> gismo.get_documents_by_cluster(resolution=.9)
F: 0.60. R: 0.65. S: 0.98.
- F: 0.71. R: 0.57. S: 0.98.
-- Gizmo is a Mogwaï. (R: 0.54; S: 0.99)
-- In chinese folklore, a Mogwaï is a demon. (R: 0.04; S: 0.71)
- This very long sentence, with a lot of stuff (R: 0.08; S: 0.69)
>>> gismo.post_features_cluster = post_features_cluster_print
>>> gismo.get_features_by_cluster()
F: 0.03. R: 0.29. S: 0.98.
- F: 1.00. R: 0.27. S: 0.99.
-- mogwaï (R: 0.12; S: 0.99)
-- gizmo (R: 0.12; S: 0.99)
-- is (R: 0.03; S: 0.99)
- F: 1.00. R: 0.02. S: 0.07.
-- in (R: 0.00; S: 0.07)
-- demon (R: 0.00; S: 0.07)
-- chinese (R: 0.00; S: 0.07)
-- folklore (R: 0.00; S: 0.07)
As an alternative to a textual query, the rank() method can directly use a vector z as input.
>>> z, s = gismo.embedding.query_projection("gizmo chinese folklore")
>>> z
<1x36 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>
>>> s = gismo.rank(z=z)
>>> s
True
>>> gismo.get_documents_by_rank(k=2)
['In chinese folklore, a Mogwaï is a demon.', 'Gizmo is a Mogwaï.']
>>> gismo.get_features_by_rank()
['mogwaï', 'in', 'chinese', 'folklore', 'demon', 'gizmo', 'is']
The class also offers get_documents_by_coverage() and get_features_by_coverage(), which yield a list of results obtained from a covering-like traversal of the ranked cluster.
To demonstrate this, we first add an outsider document to the corpus and rebuild the Gismo.
>>> new_entry = {'title': 'Minority Report', 'content': 'Totally unrelated stuff.'}
>>> corpus = Corpus(toy_source_dict+[new_entry], lambda x: x['content'])
>>> vectorizer = CountVectorizer(dtype=float)
>>> embedding = Embedding(vectorizer=vectorizer)
>>> embedding.fit_transform(corpus)
>>> gismo = Gismo(corpus, embedding)
>>> gismo.post_documents_item = post_documents_item_content
>>> success = gismo.rank("Gizmo")
>>> gismo.parameters.target_k = .3
Recall the classical rank-based result.
>>> gismo.get_documents_by_rank()
['Gizmo is a Mogwaï.', 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.', 'In chinese folklore, a Mogwaï is a demon.']
Gismo can use the cluster to propose alternate results that try to cover more subjects.
>>> gismo.get_documents_by_coverage()
['Gizmo is a Mogwaï.', 'Totally unrelated stuff.', 'This is a sentence about Blade.']
Note how the new entry, which has nothing to do with the rest, is pushed into the results. By setting the wide option to False, we get an alternative that focuses on mainstream results.
>>> gismo.get_documents_by_coverage(wide=False)
['Gizmo is a Mogwaï.', 'This is a sentence about Blade.', 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.']
The same principle applies for features.
>>> gismo.get_features_by_rank()
['mogwaï', 'gizmo', 'is', 'in', 'chinese', 'folklore', 'demon']
>>> gismo.get_features_by_coverage()
['mogwaï', 'this', 'in', 'by', 'gizmo', 'is', 'chinese']
- get_documents_by_cluster(k=None, **kwargs)[source]
Returns a cluster of the best-ranked documents. By default, the cluster is post-processed through the post_documents_cluster method.
- get_documents_by_cluster_from_indices(indices, **kwargs)[source]
Returns a cluster of documents. By default, the cluster is post-processed through the post_documents_cluster method.
- get_documents_by_coverage(k=None, **kwargs)[source]
Returns a list of top covering documents. By default, the documents are post-processed through the post_documents_item method.
- get_documents_by_rank(k=None, **kwargs)[source]
Returns a list of top documents according to the current ranking. By default, the documents are post-processed through the post_documents_item method.
- get_features_by_cluster(k=None, **kwargs)[source]
Returns a cluster of the best-ranked features. By default, the cluster is post-processed through the post_features_cluster method.
- get_features_by_cluster_from_indices(indices, **kwargs)[source]
Returns a cluster of features. By default, the cluster is post-processed through the post_features_cluster method.
- get_features_by_coverage(k=None, **kwargs)[source]
Returns a list of top covering features. By default, the features are post-processed through the post_features_item method.
- get_features_by_rank(k=None, **kwargs)[source]
Returns a list of top features according to the current ranking. By default, the features are post-processed through the post_features_item method.
- rank(query='', z=None, **kwargs)[source]
Runs the DIteration algorithm, using the query as the starting point.
- Parameters
query (str) – Text that starts the DIteration.
z (csr_matrix, optional) – Query vector to use in place of the textual query.
kwargs (dict, optional) – Custom runtime parameters.
- Returns
success – Success of the query projection. If the projection fails, a ranking on the uniform distribution is performed.
- Return type
bool
- class gismo.gismo.XGismo(x_embedding=None, y_embedding=None, filename=None, path='.', **kwargs)[source]
Given two distinct embeddings based on the same set of documents, builds a new gismo. The features of x_embedding are the corpus of this new gismo; the features of y_embedding are its features. The dual embedding of the new gismo is obtained by crossing the two input dual embeddings.
An xgismo behaves essentially like a gismo object. The main difference is an additional parameter y for the rank method, which controls whether the query projection should be performed on the y_embedding or on the x_embedding.
- Parameters
x_embedding (Embedding) – The left embedding, which defines the documents of the xgismo.
y_embedding (Embedding) – The right embedding, which defines the features of the xgismo.
filename (str, optional) – If set, will load xgismo from file.
path (str or Path, optional) – Directory where the xgismo is to be loaded from.
kwargs (dict) – Custom default runtime parameters. You just need to specify the parameters that differ from DEFAULT_PARAMETERS.
Examples
One of the main use cases for XGismo consists in transforming a list of articles into a Gismo that relates authors and the words they use. Let's start by retrieving a few articles.
>>> toy_url = "https://dblp.org/pers/xx/m/Mathieu:Fabien.xml"
>>> source = [a for a in url2source(toy_url) if int(a['year']) < 2023]
Then we build the embedding of words.
>>> corpus = Corpus(source, to_text=lambda x: x['title'])
>>> w_count = CountVectorizer(dtype=float, stop_words='english')
>>> w_embedding = Embedding(w_count)
>>> w_embedding.fit_transform(corpus)
And the embedding of authors.
>>> to_authors_text = lambda dic: " ".join([a.replace(' ', '_') for a in dic['authors']])
>>> corpus.to_text = to_authors_text
>>> a_count = CountVectorizer(dtype=float, preprocessor=lambda x: x, tokenizer=lambda x: x.split(' '))
>>> a_embedding = Embedding(a_count)
>>> a_embedding.fit_transform(corpus)
We can now combine the two embeddings in one xgismo.
>>> xgismo = XGismo(a_embedding, w_embedding)
>>> xgismo.post_documents_item = lambda g, i: g.corpus[i].replace('_', ' ')
We can use xgismo to query keyword(s).
>>> success = xgismo.rank("Pagerank") >>> xgismo.get_documents_by_rank() ['Mohamed Bouklit', 'Dohy Hong', 'The Dang Huynh']
We can use it to query researcher(s).
>>> success = xgismo.rank("Anne_Bouillard", y=False) >>> xgismo.get_documents_by_rank() ['Anne Bouillard', 'Elie de Panafieu', 'Céline Comte', 'Thomas Deiß', 'Philippe Sehier', 'Dmitry Lebedev']
- rank(query='', y=True, **kwargs)[source]
Runs the DIteration algorithm, using the query as the starting point. The query can be evaluated on features (y=True) or on documents (y=False).
- Parameters
query (str) – Text that starts the DIteration.
y (bool) – Whether the query projection is performed on features (True) or on documents (False).
kwargs (dict, optional) – Custom runtime parameters.
- Returns
success – Success of the query projection. If the projection fails, a ranking on the uniform distribution is performed.
- Return type
bool
Landmarks
- class gismo.landmarks.Landmarks(source=None, to_text=None, **kwargs)[source]
The Landmarks class is a subclass of Corpus. It offers the capability to batch-rank all its entries against a Gismo instance. After it has been processed, a Landmarks instance can be used to analyze/classify Gismo queries, Cluster instances, or other Landmarks.
Landmarks also offers the possibility to reduce a source or a gismo to its neighborhood. This can be useful if the source is huge and one wants something smaller for performance.
- Parameters
source (list) – The list of items that form Landmarks.
to_text (function) – The function that transforms an item into text.
kwargs (dict) – Custom default runtime parameters. You just need to specify the parameters that differ from DEFAULT_LANDMARKS_PARAMETERS.
Examples
Landmarks lean on a Gismo. We can use a toy Gismo to start with.
>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> gismo = Gismo(corpus, embedding)
>>> print(toy_source_text)
['Gizmo is a Mogwaï.', 'This is a sentence about Blade.', 'This is another sentence about Shadoks.', 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.', 'In chinese folklore, a Mogwaï is a demon.']
Landmarks are constructed exactly like a Gismo object, with a source and a to_text function.
>>> landmarks_source = [{'name': 'Movies', 'content': 'Star Wars, Gremlins, and Blade are movies.'},
...                     {'name': 'Gremlins', 'content': 'The Gremlins movie features a Mogwai.'},
...                     {'name': 'Star Wars', 'content': 'The Star Wars movies feature Yoda.'},
...                     {'name': 'Shadoks', 'content': 'Shadoks is a French sarcastic show.'},]
>>> landmarks = Landmarks(landmarks_source, to_text=lambda e: e['content'])
The fit() method computes gismo queries for all landmarks and retains the results.
>>> landmarks.fit(gismo)
We run the request Yoda and look at the key landmarks. Note that Gremlins comes before Star Wars. This is actually correct in this small dataset: the word Yoda only exists in one sentence, which contains the words Gremlins and Gizmo.
>>> success = gismo.rank('yoda')
>>> landmarks.get_landmarks_by_rank(gismo)
[{'name': 'Gremlins', 'content': 'The Gremlins movie features a Mogwai.'}, {'name': 'Star Wars', 'content': 'The Star Wars movies feature Yoda.'}, {'name': 'Movies', 'content': 'Star Wars, Gremlins, and Blade are movies.'}]
For better readability, we set the item post-processing to return the name of a landmark item.
>>> landmarks.post_item = lambda lmk, i: lmk[i]['name']
>>> landmarks.get_landmarks_by_rank(gismo)
['Gremlins', 'Star Wars', 'Movies']
The balance parameter adjusts the trade-off between document and feature spaces. A balance set to 1.0 focuses only on documents.
>>> success = gismo.rank('blade')
>>> landmarks.get_landmarks_by_rank(gismo, balance=1)
['Movies']
A balance set to 0.0 focuses only on features. For blade, this triggers Shadoks as a secondary result, because of the shared word sentence.
>>> landmarks.get_landmarks_by_rank(gismo, balance=0)
['Movies', 'Shadoks']
Landmarks can be used to analyze landmarks.
>>> landmarks.get_landmarks_by_rank(landmarks)
['Gremlins', 'Star Wars']
See again how balance can change things. Here a balance set to 0.0 (using only features) fully changes the results.
>>> landmarks.get_landmarks_by_rank(landmarks, balance=0)
['Shadoks']
Like for Gismo, landmarks can provide clusters.
>>> success = gismo.rank('gizmo')
>>> landmarks.get_landmarks_by_cluster(gismo)
{'landmark': {'name': 'Gremlins', 'content': 'The Gremlins movie features a Mogwai.'}, 'focus': 0.999998..., 'children': [{'landmark': {'name': 'Gremlins', 'content': 'The Gremlins movie features a Mogwai.'}, 'focus': 1.0, 'children': []}, {'landmark': {'name': 'Star Wars', 'content': 'The Star Wars movies feature Yoda.'}, 'focus': 1.0, 'children': []}, {'landmark': {'name': 'Movies', 'content': 'Star Wars, Gremlins, and Blade are movies.'}, 'focus': 1.0, 'children': []}]}
We can set the post_cluster attribute to customize the output. Gismo provides a simple display.
>>> from gismo.post_processing import post_landmarks_cluster_print
>>> landmarks.post_cluster = post_landmarks_cluster_print
>>> landmarks.get_landmarks_by_cluster(gismo)
F: 1.00.
- Gremlins
- Star Wars
- Movies
Like for Gismo, parameters like k, distortion, or resolution can be used.
>>> landmarks.get_landmarks_by_cluster(gismo, k=4, distortion=False, resolution=.9)
F: 0.03.
- F: 0.93.
-- F: 1.00.
--- Gremlins
--- Star Wars
-- Movies
- Shadoks
Note that a Cluster can also be used as reference for the get_landmarks_by_rank() and get_landmarks_by_cluster() methods.
>>> cluster = landmarks.get_landmarks_by_cluster(gismo, post=False)
>>> landmarks.get_landmarks_by_rank(cluster)
['Gremlins', 'Star Wars', 'Movies']
Yet, not any object can be used as a reference. For example, you cannot directly use a string.
>>> landmarks.get_landmarks_by_rank("Landmarks do not use external queries (pass them to a gismo") # doctest.ELLIPSIS Traceback (most recent call last): ... TypeError: bad operand type for unary -: 'NoneType'
Last but not least, landmarks can be used to reduce the size of a source or a Gismo. The reduction is controlled by the x_density attribute, which tells the number of documents each landmark is allowed to keep.
>>> landmarks.parameters.x_density = 1
>>> reduced_gismo = landmarks.get_reduced_gismo(gismo)
>>> reduced_gismo.corpus.source
['This is another sentence about Shadoks.', 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.']
Side remark #1: in the constructor, to_text indicates how to convert an item to str, while the rank parameter specifies how to run a query on a Gismo. Yet, it is possible to have the text conversion handled by the ranking function.
>>> landmarks = Landmarks(landmarks_source, rank=lambda g, q: g.rank(q['content']))
>>> landmarks.fit(gismo)
>>> success = gismo.rank('yoda')
>>> landmarks.post_item = lambda lmk, i: lmk[i]['name']
>>> landmarks.get_landmarks_by_rank(gismo)
['Star Wars', 'Movies', 'Gremlins']
However, this is bad practice. When you only need to customize the way an item is converted to text, you should stick to to_text. The rank parameter is for more elaborate filters that require changing the default way gismo performs queries.
Side remark #2: if a landmark item query fails (its text does not intersect the gismo features), the default uniform projection will be used and a warning will be issued. This may yield undesired results.
>>> landmarks_source.append({'name': 'unrelated', 'content': 'unrelated.'})
>>> landmarks = Landmarks(landmarks_source, to_text=lambda e: e['content'])
>>> landmarks.fit(gismo)
>>> success = gismo.rank('gizmo')
>>> landmarks.post_item = lambda lmk, i: lmk[i]['name']
>>> landmarks.get_landmarks_by_rank(gismo)
['Shadoks', 'unrelated']
- fit(gismo, **kwargs)[source]
Runs gismo queries on all landmarks. The relevance results are used to build two sets of vectors: x_vectors are the vectors on the document space; y_vectors are the vectors on the feature space. On each space, vectors are summed to build a direction, which is a sort of vector summary of the landmarks.
- gismo.landmarks.get_direction(reference, balance)[source]
Converts a reference object into an n+m direction (dense or sparse, depending on the reference type).
- Parameters
reference (Gismo or Landmarks or Cluster or np.ndarray or csr_matrix.) – The object from which a direction will be extracted.
balance (float in range [0.0, 1.0]) – The trade-off between documents and features. Set to 0.0, only the feature space will be used. Set to 1.0, only the document space will be used.
- Returns
An n+m direction.
- Return type
np.ndarray or csr_matrix
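As an intuition, for a Gismo reference the direction plausibly concatenates balance-weighted relevance vectors from both spaces. A hedged sketch (the attribute names gismo.diteration.x_relevance / y_relevance mirror the DIteration placeholders documented above; the exact combination rule is our assumption):
import numpy as np

def toy_direction(gismo, balance):
    # balance=1.0 keeps only the document space, balance=0.0 only the feature space
    return np.concatenate([balance * gismo.diteration.x_relevance,
                           (1 - balance) * gismo.diteration.y_relevance])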
Post Processing
- gismo.post_processing.post_documents_cluster_json(gismo, cluster)[source]
Converts a cluster of documents into basic JSON.
- gismo.post_processing.post_documents_cluster_print(gismo, cluster, post_item=None, depth='')[source]
Prints an ASCII view of a document cluster with metrics (focus, relevance, similarity).
- gismo.post_processing.post_documents_item_content(gismo, i, max_size=None)[source]
Maps a document index to the document content.
Assumes that the document has a 'content' key.
- gismo.post_processing.post_features_cluster_json(gismo, cluster)[source]
Converts a feature cluster into basic JSON.
- gismo.post_processing.post_features_cluster_print(gismo, cluster, post_item=None, depth='')[source]
Prints an ASCII view of a feature cluster with metrics (focus, relevance, similarity).
- gismo.post_processing.post_landmarks_cluster_json(landmark, cluster)[source]
Default post processor for a cluster of landmarks.
FileSource
- class gismo.filesource.FileSource(filename='mysource', path='.', load_source=False)[source]
Yields a file source as a list. Assumes the existence of two files: the mysource.data file contains the stacked items, each item compressed with zlib; the mysource.index file contains the list of pointers used to seek items in the data file.
The resulting source object is fully compatible with the Corpus class:
It can be iterated ([item for item in source]);
It can yield single items (source[i]);
It has a length (len(source)).
More advanced functionalities like slices are not implemented.
- Parameters
filename (str) – Stem of the source files (mysource in the description above).
path (str or Path, optional) – Location of the files.
load_source (bool) – Should the source be fully loaded in memory?
Examples
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as dirname:
...     create_file_source(filename='mysource', path=dirname)
...     source = FileSource(filename='mysource', path=dirname, load_source=True)
...     content = [e['content'] for e in source]
>>> content[:3]
['Gizmo is a Mogwaï.', 'This is a sentence about Blade.', 'This is another sentence about Shadoks.']
Note: when the source is read from file (load_source=False, the default behavior), you need to close the source afterwards to avoid pending file handles.
>>> with tempfile.TemporaryDirectory() as dirname:
...     create_file_source(filename='mysource', path=dirname)
...     source = FileSource(filename='mysource', path=dirname)
...     size = len(source)
...     item = source[0]
...     source.close()
>>> size
5
>>> item
{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'}
- gismo.filesource.create_file_source(source=None, filename='mysource', path='.')[source]
Writes a source (a list of dicts) to files in the format used by FileSource. Only useful to transfer from a computer with a lot of RAM to a computer with less RAM. For more complex cases, e.g. when the initial source itself is a very large file, a dedicated converter has to be provided.
Sentencizer
- class gismo.sentencizer.Sentencizer(gismo)[source]
The Sentencizer class refines a document-level gismo into a sentence-level gismo. A simple sentence extraction is proposed. For more complex usages, the class can provide a full Gismo instance that operates at sentence level.
- Parameters
gismo (Gismo) – Document-level Gismo.
Examples
We use the C50 Reuters dataset (5000 news paragraphs).
>>> from gismo.datasets.reuters import get_reuters_news
>>> corpus = Corpus(get_reuters_news(), to_text=lambda e: e['content'])
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> gismo = Gismo(corpus, embedding)
>>> sentencer = Sentencizer(gismo)
First example: explicitly run the query Orange at document level, then extract 4 covering sentences with a narrow BFS.
>>> success = gismo.rank("Orange") >>> sentencer.get_sentences(s=4, wide=False) ['Snook says the all important average retained revenue per Orange subscriber will rise from around 442 pounds per year, partly because dominant telecoms player British Telecommunications last month raised the price of a call to Orange phones from its fixed lines.', 'Analysts said that Orange shares had good upside potential after a rollercoaster ride in their short time on the market.', 'Orange, which was floated last March at 205 pence per share, initially saw its stock slump to 157.5 pence before recovering over the last few months to trade at 218 on Tuesday, a rise of four pence on the day.', 'One-2-One and Orange ORA.L, which offer only digital services, are due to release their connection figures next week.']
Second example: extract Ericsson-related sentences.
>>> sentencer.get_sentences(query="Ericsson") ['These latest wins follow a recent $350 million contract win with Telefon AB L.M. Ericsson, bolstering its already strong activity in the contract manufacturing of telecommuncation and data communciation products, he said.', 'The restraints are few in areas such as consumer products, while in sectors such as banking, distribution and insurance, foreign firms are kept on a very tight leash.', "The company also said it had told analysts in a briefing Tuesday of new contract wins with Ascend Communications Inc, Harris Corp's Communications unit and Philips Electronics NV.", 'Pocket is the first from the high-priced 1996 auction known to have filed for bankruptcy protection.', 'With Ascend in particular, he said the company would be manufacturing the company\'s mainstream MAX TNT remote access network equipment. "']
Third example: extract Communications-related sentences from a string.
>>> txt = gismo.corpus[4517]['content']
>>> sentencer.get_sentences(query="communications", txt=txt)
["Privately-held Pocket's big creditors include a group of Asian entrepreneurs and communications-equipment makers Siemens AG of Germany and L.M. Ericsson of Sweden.", "2 bidder at the government's high-flying wireless phone auction last year has filed for bankruptcy protection from its creditors, underscoring the problems besetting the auction's winners.", "The Federal Communications Commission on Monday gave PCS companies from last year's auction some breathing space when it suspended indefinitely a March 31 deadline for them to make payments to the agency for their licenses."]
- get_sentences(query=None, txt=None, k=None, s=None, resolution=0.7, stretch=2.0, wide=True, post=True)[source]
All-in-one method to extract covering sentences from the corpus. Computes a sentence-level corpus and a sentence-level gismo, and calls get_documents_by_coverage().
- Parameters
query (str, optional) – Query to run on the document-level Gismo.
txt (str, optional) – Text to use for sentence extraction. If not set, the sentences will be extracted from the top documents.
k (int, optional) – Number of top documents used for the build. If not set, the auto_k() heuristic of the document-level Gismo will be used.
s (int, optional) – Number of sentences to return. If not set, the auto_k() heuristic of the sentence-level Gismo will be used.
resolution (float, optional) – Tree resolution passed to the get_documents_by_coverage() method.
stretch (float >= 1, optional) – Stretch factor passed to the get_documents_by_coverage() method.
wide (bool, optional) – BFS wideness passed to the get_documents_by_coverage() method.
post (bool, optional) – Use of post-processing passed to the get_documents_by_coverage() method.
- Return type
list
- make_sent_gismo(query=None, txt=None, k=None, **kwargs)[source]
Constructs a sentence-level Gismo stored in the sent_gismo attribute.
- Parameters
query (str, optional) – Query to run on the document-level Gismo.
txt (str, optional) – Text to use for sentence extraction. If not set, the sentences will be extracted from the top documents.
k (int, optional) – Number of top documents used for the build. If not set, the auto_k() heuristic will be used.
kwargs (dict) – Custom default runtime parameters to pass to the sentence-level Gismo. You just need to specify the parameters that differ from DEFAULT_PARAMETERS. Note that distortion will be automatically deactivated. If you really want it, manually change the value of self.sent_gismo.parameters.distortion afterwards.
- Return type
- splitter(txt)[source]
Transforms the input content into a corpus of sentences, stored in the sent_corpus attribute.
Datasets
- gismo.datasets.acm.flatten_acm(acm, min_size=5, max_depth=100, exclude=None, depth=0)[source]
Selects subdomains of an ACM tree and returns them as a flat list.
- Parameters
- Returns
A flat list of domains, each described by name and query.
- Return type
list
Example
>>> acm = flatten_acm(get_acm())
>>> acm[111]['name']
'Graph theory'
- gismo.datasets.acm.get_acm(refresh=False)[source]
- Parameters
refresh (bool) – If True, builds a new forest from the Internet; otherwise, uses a static version.
- Returns
acm – Each dict is an ACM domain. It contains the category name, query (concatenation of names from the domain and its subdomains), size (number of subdomains, including itself), and children (list of domain dicts).
- Return type
list of dicts
Examples
>>> acm = get_acm()
>>> subdomain = acm[4]['children'][2]['children'][1]
>>> subdomain['name']
'Software development process management'
>>> subdomain['size']
10
>>> subdomain['query']
'Software development process management, Software development methods, Rapid application development, Agile software development, Capability Maturity Model, Waterfall model, Spiral model, V-model, Design patterns, Risk management'
>>> acm = get_acm(refresh=True)
>>> len(acm)
13
- gismo.datasets.dblp.DEFAULT_FIELDS = {'authors', 'title', 'type', 'venue', 'year'}
Default fields to extract.
- gismo.datasets.dblp.DTD_URL = 'https://dblp.uni-trier.de/xml/dblp.dtd'
URL of the dtd file (required to correctly parse non-ASCII characters).
- class gismo.datasets.dblp.Dblp(dblp_url='https://dblp.uni-trier.de/xml/dblp.xml.gz', filename='dblp', path='.')[source]
The Dblp class can download the DBLP database and produce source files compatible with the FileSource class.
- Parameters
dblp_url (str, optional) – URL of the DBLP database.
filename (str) – Stem of the files to produce.
path (str or Path) – Destination directory.
- build(refresh=False, d=2, fields=None)[source]
Main class method. Creates the data and index files.
- Parameters
refresh (bool) – Tells if files are to be rebuilt if they are already there.
d (int) – Depth level where articles are. Usually 2 or 3 (2 for the main database).
fields (set, optional) – Set of fields to collect. Defaults to DEFAULT_FIELDS.
Example
By default, the class downloads the full dataset. Here we will limit to one entry.
>>> toy_url = "https://dblp.org/pers/xx/m/Mathieu:Fabien.xml"
>>> import tempfile
>>> from gismo.filesource import FileSource
>>> tmp = tempfile.TemporaryDirectory()
>>> dblp = Dblp(dblp_url=toy_url, path=tmp.name)
>>> dblp.build()  # doctest: +ELLIPSIS
Retrieve https://dblp.org/pers/xx/m/Mathieu:Fabien.xml from the Internet.
DBLP database downloaded to ...xml.gz.
Converting DBLP database from ...xml.gz (may take a while).
Building Index.
Conversion done.
By default, build uses existing files.
>>> dblp.build()  # doctest: +ELLIPSIS
File ...xml.gz already exists. Use refresh option to overwrite.
File ...data already exists. Use refresh option to overwrite.
The refresh parameter can be used to ignore existing files.
>>> dblp.build(d=3, refresh=True)  # doctest: +ELLIPSIS
Retrieve https://dblp.org/pers/xx/m/Mathieu:Fabien.xml from the Internet.
DBLP database downloaded to ...xml.gz.
Converting DBLP database from ...xml.gz (may take a while).
Building Index.
Conversion done.
The resulting files can be used to create a FileSource.
>>> source = FileSource(filename="dblp", path=tmp.name)
>>> art = [s for s in source if s['title'] == "Can P2P networks be super-scalable?"][0]
>>> art['authors']  # doctest: +ELLIPSIS
['François Baccelli', 'Fabien Mathieu', 'Ilkka Norros', 'Rémi Varloot']
Don’t forget to close source after use.
>>> source.close()
>>> tmp.cleanup()
- gismo.datasets.dblp.LIST_TYPE_FIELDS = {'authors', 'urls'}
DBLP fields with possibly multiple entries.
- gismo.datasets.dblp.URL = 'https://dblp.uni-trier.de/xml/dblp.xml.gz'
URL of the full DBLP database.
- gismo.datasets.dblp.element_to_filesource(elt, data_handler, index, fields)[source]
Converts the XML element elt into a dict if it is an article.
Compresses and writes the dict in data_handler;
Appends the file position in data_handler to index.
- Parameters
elt (Any) – An XML element.
data_handler (file_descriptor) – Where the compressed data will be stored. Must be writable.
index (list) – A list that contains the initial position of the data_handler for all previously processed elements.
fields (set) – Set of fields to retrieve.
- Returns
Always returns True, for compatibility with the XML parser.
- Return type
bool
- gismo.datasets.dblp.element_to_source(elt, source, fields)[source]
Tests if elt is an article; if so, converts it to a dictionary and appends it to source.
- gismo.datasets.dblp.fast_iter(context, func, d=2, **kwargs)[source]
Applies func to all XML elements of depth 1 of the XML parser context. **kwargs are passed to func.
Modified version of a modified version of Liza Daly's fast_iter, inspired by https://stackoverflow.com/questions/4695826/efficient-way-to-iterate-through-xml-elements
- Parameters
context (XMLParser) – A parser obtained from etree.iterparse.
func (function) – How to process the elements.
d (int, optional) – Depth of the elements to process.
- gismo.datasets.dblp.url2source(url, fields=None)[source]
Directly transforms the URL of a DBLP XML file into a list of dictionaries. Only use it for datasets that fit into memory (e.g. the articles of one author). If the dataset does not fit, consider using the Dblp class instead.
- Parameters
url (str) – Location of the DBLP XML file.
fields (set, optional) – Set of fields to collect.
- Returns
source – Articles retrieved from the URL.
- Return type
list of dict
Example
>>> source = url2source("https://dblp.org/pers/xx/t/Tixeuil:S=eacute=bastien.xml", fields={'authors', 'title', 'year', 'venue', 'urls'})
>>> art = [s for s in source if s['title'] == "Distributed Computing with Mobile Robots: An Introductory Survey."][0]
>>> art['authors']
['Maria Potop-Butucaru', 'Michel Raynal', 'Sébastien Tixeuil']
>>> art['urls']
['https://doi.org/10.1109/NBiS.2011.55', 'http://doi.ieeecomputersociety.org/10.1109/NBiS.2011.55']
- gismo.datasets.dblp.xml_element_to_dict(elt, fields)[source]
Converts the XML element elt into a dict if it is a paper.
- gismo.datasets.reuters.get_reuters_entry(name, z)[source]
Reads the Reuters news item referenced by name in the zip archive z and returns it as a dict.
- gismo.datasets.reuters.get_reuters_news(url='https://github.com/balouf/datasets/raw/main/C50.zip')[source]
Returns a list of news items from the Reuters C50 news dataset.
Acknowledgments
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
ZhiLiu, e-mail: liuzhi8673 ‘@’ gmail.com, institution: National Engineering Research Center for E-Learning, Hubei Wuhan, China
- Parameters
url (str) – Location of the C50 dataset
- Returns
The C50 news as a list of dicts.
- Return type
list of dict
Example
Cf. Sentencizer.
Common
- class gismo.common.MixInIO[source]
Provides basic save/load capabilities to other classes.
- dump(filename: str, path='.', overwrite=False, compress=True)[source]
Save instance to file.
- Parameters
filename (str) – The stem of the filename.
path (str or Path, optional) – The location path.
overwrite (bool) – Should an existing file be overwritten?
compress (bool) – Should gzip compression be used?
Examples
>>> import tempfile
>>> v1 = ToyClass(42)
>>> v2 = ToyClass()
>>> v2.value
0
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     v1.dump(filename='myfile', compress=True, path=tmpdirname)
...     dir_content = [file.name for file in Path(tmpdirname).glob('*')]
...     v2 = ToyClass.load(filename='myfile', path=Path(tmpdirname))
...     v1.dump(filename='myfile', compress=True, path=tmpdirname)  # doctest: +ELLIPSIS
File ...myfile.pkl.gz already exists! Use overwrite option to overwrite.
>>> dir_content
['myfile.pkl.gz']
>>> v2.value
42
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     v1.dump(filename='myfile', compress=False, path=tmpdirname)
...     v1.dump(filename='myfile', compress=False, path=tmpdirname)  # doctest: +ELLIPSIS
File ...myfile.pkl already exists! Use overwrite option to overwrite.
>>> v1.value = 51
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     v1.dump(filename='myfile', path=tmpdirname, compress=False)
...     v1.dump(filename='myfile', path=tmpdirname, overwrite=True, compress=False)
...     v2 = ToyClass.load(filename='myfile', path=tmpdirname)
...     dir_content = [file.name for file in Path(tmpdirname).glob('*')]
>>> dir_content
['myfile.pkl']
>>> v2.value
51
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     v2 = ToyClass.load(filename='thisfilenamedoesnotexist')
Traceback (most recent call last):
...
FileNotFoundError: [Errno 2] No such file or directory: ...
- gismo.common.auto_k(data, order=None, max_k=100, target=1.0)[source]
Proposes a threshold k of significant values according to a relevance vector.
- Parameters
data (ndarray) – Vector with positive relevance values.
max_k (int) – Maximal number of entries to return; also the number of entries used to determine the threshold.
target (float) – Threshold modulation. A higher target means fewer results. A target set to 1.0 corresponds to using the average of the max_k top values as the threshold.
- Returns
k – Recommended number of values.
- Return type
int
Example
>>> data = np.array([30, 1, 2, .3, 4, 50, 80])
>>> auto_k(data)
3
- gismo.common.toy_source_dict = [{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'}, {'title': 'Second Document', 'content': 'This is a sentence about Blade.'}, {'title': 'Third Document', 'content': 'This is another sentence about Shadoks.'}, {'title': 'Fourth Document', 'content': 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.'}, {'title': 'Fifth Document', 'content': 'In chinese folklore, a Mogwaï is a demon.'}]
A minimal source example where items are dicts with keys title and content.
- gismo.common.toy_source_text = ['Gizmo is a Mogwaï.', 'This is a sentence about Blade.', 'This is another sentence about Shadoks.', 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.', 'In chinese folklore, a Mogwaï is a demon.']
A minimal source example where items are str.
Parameters
- gismo.parameters.ALPHA = 0.5
Default value for damping factor. Controls the trade-off between closeness and centrality.
- gismo.parameters.DEFAULT_LANDMARKS_PARAMETERS = {'balance': 0.5, 'distortion': 1.0, 'max_k': 100, 'post': True, 'rank': <function <lambda>>, 'resolution': 0.7, 'stretch': 2.0, 'target_k': 1.0, 'wide': True, 'x_density': 1000, 'y_density': 1000}
Dictionary of default runtime Landmarks parameters.
- gismo.parameters.DEFAULT_PARAMETERS = {'alpha': 0.5, 'distortion': 1.0, 'max_k': 100, 'memory': 0.0, 'n_iter': 4, 'offset': 1.0, 'post': True, 'resolution': 0.7, 'stretch': 2.0, 'target_k': 1.0, 'wide': True}
Dictionary of default runtime Gismo parameters.
- gismo.parameters.DISTORTION = 1.0
Default distortion. Controls how much of diteration relevance is mixed into the embedding for similarity computation.
- gismo.parameters.MAX_K = 100
Default top population size for estimating k.
- gismo.parameters.MEMORY = 0.0
Default memory value. Controls how much of previous computation is kept when performing a new diffusion.
- gismo.parameters.N_ITER = 4
Default value for the number of round-trip diffusions to perform. Higher value means better precision but longer execution time.
- gismo.parameters.OFFSET = 1.0
Default offset value. Controls how much of the initial fluid should be deduced from the relevance.
- gismo.parameters.POST = True
Default post policy. If True, post function is applied on items and clusters.
- class gismo.parameters.Parameters(parameter_list=None, **kwargs)[source]
Manages Gismo runtime parameters. When called, an instance will yield a dictionary of parameters. It is also used for other Gismo classes like Landmarks.
- Parameters
parameter_list (dict, optional) – Dictionary of managed parameters and their default values. Defaults to DEFAULT_PARAMETERS.
kwargs (dict) – Parameter values that override the defaults.
Examples
Use default parameters.
>>> p = Parameters()
>>> p()
{'alpha': 0.5, 'n_iter': 4, 'offset': 1.0, 'memory': 0.0, 'stretch': 2.0, 'resolution': 0.7, 'max_k': 100, 'target_k': 1.0, 'wide': True, 'post': True, 'distortion': 1.0}
Use default parameters with changed stretch.
>>> p = Parameters(stretch=1.7)
>>> p()['stretch']
1.7
Note that parameters that do not exist will be ignored (and a warning will be issued).
>>> p = Parameters(strech=1.7)
>>> p()
{'alpha': 0.5, 'n_iter': 4, 'offset': 1.0, 'memory': 0.0, 'stretch': 2.0, 'resolution': 0.7, 'max_k': 100, 'target_k': 1.0, 'wide': True, 'post': True, 'distortion': 1.0}
You can change the value of an attribute to alter the returned parameters.
>>> p.alpha = 0.85
>>> p()['alpha']
0.85
You can also apply on-the-fly parameters by passing them when calling the instance.
>>> p(resolution=0.9)['resolution']
0.9
Like for construction, parameters that do not exist are ignored and a warning is issued.
>>> p(resolutio=.9)
{'alpha': 0.85, 'n_iter': 4, 'offset': 1.0, 'memory': 0.0, 'stretch': 2.0, 'resolution': 0.7, 'max_k': 100, 'target_k': 1.0, 'wide': True, 'post': True, 'distortion': 1.0}
Note the possibility to store a custom set of parameters by using parameter_list at construction.
>>> p = Parameters(parameter_list={'a': 1.0, 'b': True}, a=1.5)
>>> p()
{'a': 1.5, 'b': True}
- gismo.parameters.RESOLUTION = 0.7
Default resolution value. Defines how strict the merging of clusters is during recursive clustering.
- gismo.parameters.STRETCH = 2.0
Default stretch value. When performing covering, defines the ratio between considered pages and selected covering pages.
- gismo.parameters.TARGET_K = 1.0
Default threshold for estimating k.
- gismo.parameters.WIDE = True
Default behavior for covering. True for the wide variant, False for the core variant.