Clustering
This module implements the tree-like organization of selected items.
- class gismo.clustering.Cluster(indice=None, rank=None, vector=None)
The ‘Cluster’ class is the internal representation of a hierarchical cluster. It stores the attributes that describe a clustering structure and provides basic cluster addition for merge operations.
- Parameters:
indice (int) – Index of the head (main element) of the cluster.
rank (int) – The ranking order of a cluster.
vector (csr_matrix) – The vector representation of the cluster.
- vector
The vector representation of the cluster.
- Type:
csr_matrix
- intersection_vector
The vector representation of the common points of a cluster.
- Type:
csr_matrix (deprecated)
- focus
The consistency of the cluster (higher focus means that elements are more similar).
- Type:
float in range [0.0, 1.0]
Examples
>>> from scipy.sparse import csr_matrix
>>> from gismo.clustering import Cluster
>>> c1 = Cluster(indice=0, rank=1, vector=csr_matrix([1.0, 0.0, 1.0]))
>>> c2 = Cluster(indice=5, rank=0, vector=csr_matrix([1.0, 1.0, 0.0]))
>>> c3 = c1+c2
>>> c3.members
[0, 5]
>>> c3.indice
5
>>> c3.vector.toarray()
array([[2., 1., 1.]])
>>> c3.intersection_vector.toarray()
array([[1., 0., 0.]])
>>> c1 == sum([c1])
True
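The behaviour shown in the example above can be imitated with a small stand-alone class. The following is only a sketch: `ToyCluster` is a hypothetical name, plain lists stand in for `csr_matrix`, and modelling the intersection as an element-wise minimum and keeping the rank of the better-ranked operand are assumptions that merely reproduce the documented outputs. It is not the gismo implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyCluster:
    """Hypothetical stand-in for Cluster, using lists instead of csr_matrix."""
    indice: int
    rank: int
    vector: list
    members: list = field(default_factory=list)
    intersection_vector: Optional[list] = None

    def __post_init__(self):
        # A freshly created cluster contains only its head.
        if not self.members:
            self.members = [self.indice]
        if self.intersection_vector is None:
            self.intersection_vector = list(self.vector)

    def __add__(self, other):
        # The operand with the lower rank value provides the head.
        head = min(self, other, key=lambda c: c.rank)
        return ToyCluster(
            indice=head.indice,
            rank=head.rank,  # assumption: the merged rank is the head's rank
            vector=[a + b for a, b in zip(self.vector, other.vector)],
            members=self.members + other.members,
            # Assumption: common points modelled as an element-wise minimum.
            intersection_vector=[
                min(a, b)
                for a, b in zip(self.intersection_vector, other.intersection_vector)
            ],
        )

    def __radd__(self, other):
        # sum() starts from the integer 0, so 0 + cluster returns the cluster.
        if isinstance(other, int) and other == 0:
            return self
        return NotImplemented


c1 = ToyCluster(indice=0, rank=1, vector=[1.0, 0.0, 1.0])
c2 = ToyCluster(indice=5, rank=0, vector=[1.0, 1.0, 0.0])
c3 = c1 + c2
```

Supporting `0 + cluster` through `__radd__` is what lets the built-in `sum` merge a list of clusters, which is consistent with the `c1 == sum([c1])` line of the example.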
- gismo.clustering.covering_order(cluster, wide=True)
Uses a hierarchical cluster to provide an ordering of the items that mixes rank and coverage.
This is done by exploring all clusters and subclusters by increasing similarity and rank (lexicographic order). Two variants are proposed:
Core (wide=False): for each cluster, append its representative to the list if new. Central items tend to have better rank.
Wide (wide=True, the default): for each cluster, append the representatives of its children to the list if new. Marginal items tend to have better rank.
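To make the difference between the two variants concrete, here is a minimal sketch of one literal reading of the description above. `ToyNode` and `toy_covering_order` are hypothetical names, the focus values are invented for illustration, and sorting all clusters by `(focus, rank)` is an assumption about what "increasing similarity and rank" means. The actual gismo traversal may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical cluster node: head index, rank, focus, child clusters."""
    indice: int
    rank: int
    focus: float = 1.0  # assumption: a single item is perfectly focused
    children: list = field(default_factory=list)


def iter_nodes(node):
    """Yield a cluster and all of its subclusters."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def toy_covering_order(root, wide=True):
    # Explore clusters by increasing focus, then rank (lexicographic order).
    nodes = sorted(iter_nodes(root), key=lambda n: (n.focus, n.rank))
    order, seen = [], set()
    for node in nodes:
        # Wide: heads of the children; core: head of the cluster itself.
        picks = node.children if wide else [node]
        for pick in picks:
            if pick.indice not in seen:
                seen.add(pick.indice)
                order.append(pick.indice)
    return order


leaves = [ToyNode(indice=i, rank=i) for i in range(5)]
a1 = ToyNode(indice=0, rank=0, focus=0.8, children=[leaves[0], leaves[1]])
a = ToyNode(indice=0, rank=0, focus=0.5, children=[a1, leaves[2]])
b = ToyNode(indice=3, rank=3, focus=0.4, children=[leaves[3], leaves[4]])
root = ToyNode(indice=0, rank=0, focus=0.2, children=[a, b])

core_order = toy_covering_order(root, wide=False)  # [0, 3, 1, 2, 4]
wide_order = toy_covering_order(root, wide=True)   # [0, 3, 4, 2, 1]
```

On this toy tree, the core variant lists item 1 (the closest companion of head 0) before the outlying items, while the wide variant reaches item 4 first, which matches the central versus marginal tendency described above.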
- gismo.clustering.get_sim(csr, arr)
Simple similarity computation between a csr_matrix and an ndarray.
- Parameters:
csr (csr_matrix)
arr (ndarray)
- Return type:
- gismo.clustering.merge_clusters(cluster_list: list, focus=1.0)
Complete merge operation. In addition to the basic merge provided by Cluster, it ensures the following:
Consistency of focus, by integrating the extra focus (typically given by subspace_partition()).
Children (the members of the list) are sorted according to their respective rank.
- gismo.clustering.rec_clusterize(cluster_list: list, resolution=0.7)
Auxiliary recursive function for clustering.
- Parameters:
cluster_list (list of Cluster) – Current aggregation state.
resolution (float in range [0.0, 1.0]) – Sets the laziness of aggregation. A ‘resolution’ set to 0.0 yields a one-step clustering (star structure), while a ‘resolution’ set to 1.0 yields, up to ties in similarity, a binary tree (dendrogram).
- Return type:
list of Cluster
- gismo.clustering.subspace_clusterize(subspace, resolution=0.7, indices=None)
Converts a subspace (matrix seen as a list of vectors) to a Cluster object (hierarchical clustering).
- Parameters:
subspace (ndarray, csr_matrix) – A k x m matrix seen as a list of k m-dimensional vectors sorted by importance order.
resolution (float in range [0.0, 1.0]) – Sets the laziness of aggregation. A ‘resolution’ set to 0.0 yields a one-step clustering (star structure), while a ‘resolution’ set to 1.0 yields, up to ties in similarity, a binary tree (dendrogram).
indices (list, optional) – Indicates the index of each element of the subspace. Used when ‘subspace’ is extracted from a larger space (e.g. X or Y). If not set, indices are set to range(k).
- Returns:
A cluster whose leaves are the k vectors from ‘subspace’.
- Return type:
Cluster
Example
>>> corpus = Corpus(toy_source_text)
>>> vectorizer = CountVectorizer(dtype=float)
>>> embedding = Embedding(vectorizer=vectorizer)
>>> embedding.fit_transform(corpus)
>>> subspace = embedding.x[1:, :]
>>> cluster = subspace_clusterize(subspace)
>>> len(cluster.children)
2
>>> cluster = subspace_clusterize(subspace, resolution=.02)
>>> len(cluster.children)
4
- gismo.clustering.subspace_distortion(indices, data, relevance, distortion: float)
Applies an in-place distortion of a subspace according to relevance.
- Parameters:
indices (ndarray) – Indices attribute of the subspace csr_matrix.
data (ndarray) – Data attribute of the subspace csr_matrix.
relevance (ndarray) – Relevance values in the embedding space.
distortion (float in [0.0, 1.0]) – Power applied to relevance for distortion.
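The parameter descriptions suggest that each stored value is rescaled by the relevance of its column raised to the power `distortion`. The sketch below illustrates that reading only. `toy_distortion` is a hypothetical name, plain Python lists stand in for the `ndarray` arguments, and the multiplicative update is an assumption rather than the confirmed gismo formula.

```python
def toy_distortion(indices, data, relevance, distortion):
    """Scale CSR data in place by relevance[column] ** distortion (assumed rule)."""
    for pos, col in enumerate(indices):
        data[pos] *= relevance[col] ** distortion


# CSR content of the 2 x 3 matrix [[1, 0, 2], [0, 3, 0]].
indices = [0, 2, 1]        # column index of each stored value
data = [1.0, 2.0, 3.0]     # stored values, modified in place
relevance = [0.25, 1.0, 4.0]

toy_distortion(indices, data, relevance, distortion=0.5)
# data is now [0.5, 4.0, 3.0]
```

Under this reading, a distortion of 0.0 leaves the subspace untouched, while 1.0 weights every entry by the full relevance of its column.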
- gismo.clustering.subspace_partition(subspace, resolution=0.7)
Proposes a partition of the subspace that merges together vectors with a similar direction.
- Parameters:
subspace (ndarray, csr_matrix) – A k x m matrix seen as a list of k m-dimensional vectors sorted by importance order.
resolution (float in range [0.0, 1.0]) – How strict the merging should be. 0.0 will merge all items together, while 1.0 will only merge mutually closest items.
- Returns:
A list of subsets that form a partition. Each subset is represented by a pair (p, f). p is the set of indices of the subset, f is the typical similarity of the partition (called focus).
- Return type:
list
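As an illustration of the strictest setting described above, where only mutually closest items are merged, the following sketch pairs two vectors when each is the other's nearest neighbour under cosine similarity. It is a simplified, hypothetical helper (`mutual_nearest_pairs` is not a gismo function): it ignores the `resolution` parameter, uses dense lists, and assigns a focus of 1.0 to unmerged items as an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mutual_nearest_pairs(vectors):
    """Return a partition as (p, f) pairs, merging only mutual nearest items."""
    k = len(vectors)
    if k < 2:
        return [({i}, 1.0) for i in range(k)]
    sim = [[cosine(vectors[i], vectors[j]) for j in range(k)] for i in range(k)]
    # Nearest neighbour of each item, excluding the item itself.
    nearest = [
        max((j for j in range(k) if j != i), key=lambda j: sim[i][j])
        for i in range(k)
    ]
    partition, used = [], set()
    for i in range(k):
        j = nearest[i]
        if i not in used and j not in used and nearest[j] == i:
            partition.append(({i, j}, sim[i][j]))
            used.update((i, j))
    # Items without a mutual partner stay alone (focus 1.0 is an assumption).
    partition.extend(({i}, 1.0) for i in range(k) if i not in used)
    return partition


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0]]
parts = mutual_nearest_pairs(vectors)
```

Here items 0 and 1 point in nearly the same direction, as do items 2 and 3, so each pair is merged; item 4 lies between the two groups and is nobody's nearest neighbour, so it remains a singleton.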