Embedding#

This module creates and manipulates TF-IDF/TF-ITF embeddings out of a corpus.

class gismo.embedding.Embedding(vectorizer=None)[source]#

This class leverages the CountVectorizer class to build the dual embedding of a Corpus.

  • Documents are embedded in the space of features;

  • Features are embedded in the space of documents.

See the examples and methods below for all usages of the class.

Parameters:

vectorizer (CountVectorizer, optional) – Custom CountVectorizer to override the default behavior. Providing a CountVectorizer adapted to the Corpus is recommended good practice.

fit(corpus)[source]#

Learn features from a corpus of documents.

  • If not yet set, a default CountVectorizer is created.

  • Features are computed and stored.

  • Inverse-Document-Frequency weights of features are computed.

Parameters:

corpus (Corpus) – The corpus to ingest.

Example

>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit(corpus)
>>> len(embedding.idf)
21
>>> list(embedding.features[:8])
['blade', 'chinese', 'comparing', 'demon', 'folklore', 'gizmo', 'gremlins', 'inside']
fit_ext(embedding)[source]#

Use learned features from another Embedding. This is useful for the fast creation of local embeddings (e.g. at sentence level) out of a global embedding.

Parameters:

embedding (Embedding) – External embedding to copy.

Examples

>>> corpus = Corpus(toy_source_text)
>>> other_embedding = Embedding()
>>> other_embedding.fit(corpus)
>>> embedding = Embedding()
>>> embedding.fit_ext(other_embedding)
>>> len(embedding.idf)
21
>>> list(embedding.features[:8])
['blade', 'chinese', 'comparing', 'demon', 'folklore', 'gizmo', 'gremlins', 'inside']
fit_transform(corpus)[source]#

Ingest a corpus of documents.

  • If not yet set, a default CountVectorizer is created.

  • Features are computed and stored (fit).

  • Inverse-Document-Frequency weights of features are computed (fit).

  • TF-IDF embedding of documents is computed and stored (transform).

  • TF-ITF embedding of features is computed and stored (transform).

Parameters:

corpus (Corpus) – The corpus to ingest.

Example

>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> embedding.x  
<Compressed Sparse Row sparse matrix of dtype 'float64'
with 25 stored elements and shape (5, 21)>
>>> list(embedding.features[:8])
['blade', 'chinese', 'comparing', 'demon', 'folklore', 'gizmo', 'gremlins', 'inside']
query_projection(query)[source]#

Project a query in the feature space.

Parameters:

query (str) – Text to project.

Returns:

  • z (csr_matrix) – Result of the query projection (the IDF distribution if the query matches no feature).

  • success (bool) – Projection success (True if at least one feature has been found).

Example

>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> z, success = embedding.query_projection("Gizmo is not Yoda but he rocks!")
>>> for i in range(len(z.data)):
...    print(f"{embedding.features[z.indices[i]]}: {z.data[i]}") 
gizmo: 0.3868528072...
yoda: 0.6131471927...
>>> success
True
>>> z, success = embedding.query_projection("That content does not intersect toy corpus")
>>> success
False
transform(corpus)[source]#

Ingest a corpus of documents using existing features. Requires that the embedding has been fitted beforehand.

  • TF-IDF embedding of documents is computed and stored.

  • TF-ITF embedding of features is computed and stored.

Parameters:

corpus (Corpus) – The corpus to ingest.

Example

>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> [embedding.features[i] for i in embedding.x.indices[:8]]
['gizmo', 'mogwaï', 'blade', 'sentence', 'sentence', 'shadoks', 'comparing', 'gizmo']
>>> small_corpus = Corpus(["I only talk about Yoda", "Gizmo forever!"])
>>> embedding.transform(small_corpus)
>>> [embedding.features[i] for i in embedding.x.indices]
['yoda', 'gizmo']
gismo.embedding.auto_vect(corpus=None)[source]#

Creates a default CountVectorizer compatible with the Embedding constructor. For corpora that are not too small, a mild frequency filter is applied.

Parameters:

corpus (Corpus, optional) – The corpus for which the CountVectorizer is intended.

Returns:

A CountVectorizer object compatible with the Embedding constructor.

Return type:

CountVectorizer

gismo.embedding.idf_fit(indptr, n)[source]#

Computes the Inverse-Document-Frequency vector on sparse embedding y.

Parameters:
  • indptr (ndarray) – Pointers of the embedding y (e.g. y.indptr).

  • n (int) – Number of documents.

Returns:

idf_vector – IDF vector of size m.

Return type:

ndarray
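The document-frequency of feature i can be read directly off the CSR pointer array of the feature-space embedding y (one row per feature). The sketch below is an illustrative pure-NumPy re-implementation, assuming the classic log(n/df) weighting; gismo's actual formula may differ in its smoothing terms.

```python
import numpy as np

def idf_fit_sketch(indptr, n):
    """Illustrative IDF from CSR row pointers: the document frequency of
    feature i is the number of stored entries in row i of y."""
    df = np.diff(indptr).astype(float)  # documents containing each feature
    return np.log(n / df)               # classic IDF; gismo's variant may differ

# Feature 0 appears in 1 of 4 documents, feature 1 in all 4.
indptr = np.array([0, 1, 5])
idf = idf_fit_sketch(indptr, n=4)
# idf[0] = log(4), idf[1] = log(1) = 0
```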

gismo.embedding.idf_transform(indptr, data, idf_vector)[source]#

Applies an in-place Inverse-Document-Frequency transformation on sparse embedding y.

Parameters:
  • indptr (ndarray) – Pointers of the embedding y (e.g. y.indptr).

  • data (ndarray) – Values of the embedding y (e.g. y.data).

  • idf_vector (ndarray) – IDF vector of the embedding, obtained from idf_fit().

gismo.embedding.itf_fit_transform(indptr, data, m)[source]#

Applies an in-place Inverse-Term-Frequency transformation on sparse embedding x.

Parameters:
  • indptr (ndarray) – Pointers of the embedding (e.g. x.indptr).

  • data (ndarray) – Values of the embedding (e.g. x.data).

  • m (int) – Number of features.
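ITF is the document-side mirror of IDF: a document row using only a few of the m features gets boosted. The sketch below assumes a log(m/k) weighting, where k is the number of distinct features in the row; the exact formula used by gismo may include different smoothing.

```python
import numpy as np

def itf_fit_transform_sketch(indptr, data, m):
    """Illustrative in-place ITF: each document row of x is scaled by
    log(m / k), where k is the number of distinct features it uses."""
    for i in range(len(indptr) - 1):
        k = indptr[i + 1] - indptr[i]
        if k:
            data[indptr[i]:indptr[i + 1]] *= np.log(m / k)

# One document row using 2 of 8 features: both entries scaled by log(4).
indptr = np.array([0, 2])
data = np.array([1.0, 1.0])
itf_fit_transform_sketch(indptr, data, m=8)
```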

gismo.embedding.l1_normalize(indptr, data)[source]#

Computes the L1 norm of each row of a sparse embedding (x or y) and applies in-place normalization.

Parameters:
  • indptr (ndarray) – Pointers of the embedding (e.g. x.indptr).

  • data (ndarray) – Values of the embedding (e.g. x.data).

Returns:

l1_norm – L1 norms of all vectors of the embedding before normalization.

Return type:

ndarray
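Row-wise L1 normalization on raw CSR arrays can be sketched as below: sum each row's values, divide in place, and return the pre-normalization norms (which gismo's Embedding keeps for later use).

```python
import numpy as np

def l1_normalize_sketch(indptr, data):
    """In-place L1 row normalization on CSR data; returns the norms
    of all rows before normalization."""
    norms = np.zeros(len(indptr) - 1)
    for i in range(len(indptr) - 1):
        s = data[indptr[i]:indptr[i + 1]].sum()
        norms[i] = s
        if s > 0:
            data[indptr[i]:indptr[i + 1]] /= s
    return norms

# Two rows: [1, 3] and [2, 2]; both have L1 norm 4.
indptr = np.array([0, 2, 4])
data = np.array([1.0, 3.0, 2.0, 2.0])
norms = l1_normalize_sketch(indptr, data)
# norms == [4.0, 4.0]; data is now [0.25, 0.75, 0.5, 0.5]
```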

gismo.embedding.query_shape(indices, data, idf)[source]#

Applies in-place logarithmic smoothing, IDF weighting, and normalization to the output of the CountVectorizer transform() method.

Parameters:
  • indices (ndarray) – Indices of the query vector (e.g. z.indices).

  • data (ndarray) – Values of the query vector (e.g. z.data).

  • idf (ndarray) – IDF vector of the embedding, obtained from idf_fit().

Returns:

norm – The norm of the vector before normalization.

Return type:

float
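The three steps can be sketched against a sparse query vector's raw arrays. The exact smoothing form is an assumption here: a common choice is 1 + log(tf) on the raw counts, followed by IDF weighting and L1 normalization.

```python
import numpy as np

def query_shape_sketch(indices, data, idf):
    """Illustrative query shaping: logarithmic TF smoothing, IDF weighting,
    then L1 normalization, applied in place. Returns the pre-normalization norm."""
    data[:] = (1.0 + np.log(data)) * idf[indices]  # smoothing + IDF (assumed form)
    norm = data.sum()
    if norm > 0:
        data /= norm
    return norm

# Query hits features 0 and 2 with raw counts 1 and e.
idf = np.array([1.0, 2.0, 0.5])
indices = np.array([0, 2])
data = np.array([1.0, np.e])
norm = query_shape_sketch(indices, data, idf)
# shaped weights: (1 + log 1) * 1 = 1 and (1 + log e) * 0.5 = 1, so norm = 2
```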