Embedding#
This module can create and manipulate TF-IDF embeddings of documents and TF-ITF embeddings of features out of a corpus.
- class gismo.embedding.Embedding(vectorizer=None)[source]#
This class leverages the CountVectorizer class to build the dual embedding of a Corpus:
documents are embedded in the space of features;
features are embedded in the space of documents.
See the examples and methods below for all usages of the class.
- Parameters:
vectorizer (CountVectorizer, optional) – Custom CountVectorizer to override the default behavior (recommended). Having a CountVectorizer adapted to the Corpus is good practice.
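For instance, a CountVectorizer tuned to the corpus can be built with scikit-learn and handed to the constructor. A minimal sketch; the settings below are illustrative choices, not gismo defaults:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative settings (not gismo's defaults): unigrams and bigrams,
# English stop words removed, and features kept only if they appear
# in at least two documents.
vectorizer = CountVectorizer(
    ngram_range=(1, 2),
    stop_words="english",
    min_df=2,
)

# The tuned vectorizer would then be passed to the constructor:
# embedding = Embedding(vectorizer=vectorizer)
```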
- fit(corpus)[source]#
Learn features from a corpus of documents.
If not yet set, a default CountVectorizer is created.
Features are computed and stored.
Inverse-Document-Frequency weights of features are computed.
- Parameters:
corpus (Corpus) – The corpus to ingest.
Example
>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit(corpus)
>>> len(embedding.idf)
21
>>> list(embedding.features[:8])
['blade', 'chinese', 'comparing', 'demon', 'folklore', 'gizmo', 'gremlins', 'inside']
- fit_ext(embedding)[source]#
Use learned features from another Embedding. This is useful for the fast creation of local embeddings (e.g. at sentence level) out of a global embedding.
- Parameters:
embedding (Embedding) – External embedding to copy.
Examples
>>> corpus = Corpus(toy_source_text)
>>> other_embedding = Embedding()
>>> other_embedding.fit(corpus)
>>> embedding = Embedding()
>>> embedding.fit_ext(other_embedding)
>>> len(embedding.idf)
21
>>> list(embedding.features[:8])
['blade', 'chinese', 'comparing', 'demon', 'folklore', 'gizmo', 'gremlins', 'inside']
- fit_transform(corpus)[source]#
Ingest a corpus of documents.
If not yet set, a default CountVectorizer is created.
Features are computed and stored (fit).
Inverse-Document-Frequency weights of features are computed (fit).
TF-IDF embedding of documents is computed and stored (transform).
TF-ITF embedding of features is computed and stored (transform).
- Parameters:
corpus (Corpus) – The corpus to ingest.
Example
>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> embedding.x
<Compressed Sparse Row sparse matrix of dtype 'float64' with 25 stored elements and shape (5, 21)>
>>> list(embedding.features[:8])
['blade', 'chinese', 'comparing', 'demon', 'folklore', 'gizmo', 'gremlins', 'inside']
- query_projection(query)[source]#
Project a query in the feature space.
- Parameters:
query (str) – Text to project.
- Returns:
z (csr_matrix) – Result of the query projection (IDF distribution if the query does not match any feature).
success (bool) – Projection success (True if at least one feature has been found).
Example
>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> z, success = embedding.query_projection("Gizmo is not Yoda but he rocks!")
>>> for i in range(len(z.data)):
...     print(f"{embedding.features[z.indices[i]]}: {z.data[i]}")
gizmo: 0.3868528072...
yoda: 0.6131471927...
>>> success
True
>>> z, success = embedding.query_projection("That content does not intersect toy corpus")
>>> success
False
- transform(corpus)[source]#
Ingest a corpus of documents using existing features. Requires that the embedding has been fitted beforehand.
TF-IDF embedding of documents is computed and stored.
TF-ITF embedding of features is computed and stored.
- Parameters:
corpus (Corpus) – The corpus to ingest.
Example
>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> [embedding.features[i] for i in embedding.x.indices[:8]]
['gizmo', 'mogwaï', 'blade', 'sentence', 'sentence', 'shadoks', 'comparing', 'gizmo']
>>> small_corpus = Corpus(["I only talk about Yoda", "Gizmo forever!"])
>>> embedding.transform(small_corpus)
>>> [embedding.features[i] for i in embedding.x.indices]
['yoda', 'gizmo']
- gismo.embedding.auto_vect(corpus=None)[source]#
Creates a default CountVectorizer compatible with the Embedding constructor. For not-too-small corpora, a slight frequency filter is applied.
- Parameters:
corpus (Corpus, optional) – The corpus for which the CountVectorizer is intended.
- Returns:
A CountVectorizer object compatible with the Embedding constructor.
- Return type:
CountVectorizer
- gismo.embedding.idf_fit(indptr, n)[source]#
Computes the Inverse-Document-Frequency vector on sparse embedding y.
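The document frequency of each feature can be read directly from the indptr array of the feature-major CSR matrix: row f owns indptr[f+1] - indptr[f] entries, one per document containing it. A sketch of the idea, using the common smoothed formula log((1+n)/(1+df)); the exact weighting implemented here may differ:

```python
import numpy as np

def idf_fit_sketch(indptr, n):
    """Return an IDF vector from the indptr of a feature-major CSR matrix.

    df[f] = indptr[f+1] - indptr[f] counts the documents containing
    feature f; a smoothed log ratio turns it into an IDF weight.
    """
    df = np.diff(indptr)
    return np.log((1 + n) / (1 + df))

# Example: 3 features over n=4 documents, with df = [4, 1, 2].
# A feature present everywhere gets weight 0; rarer features weigh more.
idf = idf_fit_sketch(np.array([0, 4, 5, 7]), n=4)
```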
- gismo.embedding.idf_transform(indptr, data, idf_vector)[source]#
Applies inplace Inverse-Document-Frequency transformation on sparse embedding y.
- gismo.embedding.itf_fit_transform(indptr, data, m)[source]#
Applies inplace Inverse-Term-Frequency transformation on sparse embedding x.
- gismo.embedding.l1_normalize(indptr, data)[source]#
Computes L1 norm on sparse embedding (x or y) and applies inplace normalization.
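As an illustration of the normalization step (a sketch, not gismo's njit-compiled code), each CSR row is divided in place by its L1 norm:

```python
import numpy as np

def l1_normalize_sketch(indptr, data):
    """Normalize each CSR row to unit L1 norm, in place; return the norms."""
    norms = np.zeros(len(indptr) - 1)
    for i in range(len(indptr) - 1):
        row = data[indptr[i]:indptr[i + 1]]  # view into data: edits are in place
        norms[i] = np.sum(np.abs(row))
        if norms[i] > 0:
            row /= norms[i]
    return norms

# Two rows, [1, 3] and [2, 2]; both have L1 norm 4.
data = np.array([1.0, 3.0, 2.0, 2.0])
norms = l1_normalize_sketch(np.array([0, 2, 4]), data)
```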
- gismo.embedding.query_shape(indices, data, idf)[source]#
Applies inplace logarithmic smoothing, IDF weighting, and normalization to the output of the CountVectorizer transform() method.
- Parameters:
indices (ndarray) – Indices attribute of the csr_matrix obtained from transform().
data (ndarray) – Data attribute of the csr_matrix obtained from transform().
idf (ndarray) – IDF vector of the embedding, obtained from idf_fit().
- Returns:
norm – The norm of the vector before normalization.
- Return type:
float
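Putting the pieces together, query shaping can be sketched as follows; this is an illustration of the three announced steps (log smoothing, IDF weighting, L1 normalization), and gismo's exact smoothing formula may differ:

```python
import numpy as np

def query_shape_sketch(indices, data, idf):
    """Shape a query vector in place; return the pre-normalization norm."""
    # Logarithmic smoothing dampens repeated terms in the query.
    data[:] = 1 + np.log(data)
    # Weight each matched feature by its IDF.
    data *= idf[indices]
    # L1-normalize so the shaped query is a distribution over features.
    norm = data.sum()
    if norm > 0:
        data /= norm
    return norm

# Query matching features 0 and 2 once each, against a 3-feature IDF vector.
indices = np.array([0, 2])
data = np.array([1.0, 1.0])
idf = np.array([0.5, 1.0, 2.0])
norm = query_shape_sketch(indices, data, idf)
```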