Clustering

This module implements the tree-like organization of selected items.

class gismo.clustering.Cluster(indice=None, rank=None, vector=None)

The ‘Cluster’ class is the internal representation of a hierarchical cluster. It stores the attributes that describe a clustering structure and provides a basic cluster addition used by merge operations.

Parameters:
  • indice (int) – Index of the head (main element) of the cluster.

  • rank (int) – The ranking order of a cluster.

  • vector (csr_matrix) – The vector representation of the cluster.

indice

Index of the head (main element) of the cluster.

Type:

int

rank

The ranking order of a cluster.

Type:

int

vector

The vector representation of the cluster.

Type:

csr_matrix

intersection_vector

The vector representation of the common points of a cluster.

Type:

csr_matrix (deprecated)

members

The indices of the cluster elements.

Type:

list of int

focus

The consistency of the cluster (higher focus means that elements are more similar).

Type:

float in range [0.0, 1.0]

children

The subclusters.

Type:

list of Cluster

Examples

>>> from scipy.sparse import csr_matrix
>>> from gismo.clustering import Cluster
>>> c1 = Cluster(indice=0, rank=1, vector=csr_matrix([1.0, 0.0, 1.0]))
>>> c2 = Cluster(indice=5, rank=0, vector=csr_matrix([1.0, 1.0, 0.0]))
>>> c3 = c1+c2
>>> c3.members
[0, 5]
>>> c3.indice
5
>>> c3.vector.toarray()
array([[2., 1., 1.]])
>>> c3.intersection_vector.toarray()
array([[1., 0., 0.]])
>>> c1 == sum([c1])
True
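
The addition shown in the example above can be mimicked with a small pure-Python sketch. `MiniCluster` is a hypothetical stand-in, not the library's class: it assumes that the head of the sum is the operand with the lowest rank, that vectors are added, and that the intersection vector is the element-wise minimum (an assumption that is merely consistent with the output above).

```python
from dataclasses import dataclass


@dataclass
class MiniCluster:
    """Hypothetical, list-based stand-in for Cluster (illustration only)."""
    indice: int
    rank: int
    vector: list
    intersection_vector: list = None
    members: list = None

    def __post_init__(self):
        # A fresh cluster contains only its head and intersects with itself.
        if self.intersection_vector is None:
            self.intersection_vector = list(self.vector)
        if self.members is None:
            self.members = [self.indice]

    def __add__(self, other):
        # Assumption: the best-ranked (lowest rank) operand becomes the head.
        head = min(self, other, key=lambda c: c.rank)
        return MiniCluster(
            indice=head.indice,
            rank=head.rank,
            vector=[a + b for a, b in zip(self.vector, other.vector)],
            # Assumption: common points are modeled by an element-wise minimum.
            intersection_vector=[
                min(a, b)
                for a, b in zip(self.intersection_vector, other.intersection_vector)
            ],
            members=self.members + other.members,
        )


c1 = MiniCluster(indice=0, rank=1, vector=[1.0, 0.0, 1.0])
c2 = MiniCluster(indice=5, rank=0, vector=[1.0, 1.0, 0.0])
c3 = c1 + c2
print(c3.members, c3.indice, c3.vector, c3.intersection_vector)
```

With these assumptions the sketch yields the same members, head, vector, and intersection vector as the doctest above.
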

gismo.clustering.covering_order(cluster, wide=True)

Uses a hierarchical cluster to provide an ordering of the items that mixes rank and coverage.

This is done by exploring all clusters and subclusters by increasing similarity and rank (lexicographic order). Two variants are proposed:

  • Core: for each cluster, append its representative to the list if new. Central items tend to have better rank.

  • Wide: for each cluster, append the representatives of its children to the list if new. Marginal items tend to have better rank.

Parameters:
  • cluster (Cluster) – The cluster to explore.

  • wide (bool) – Use Wide (True) or Core (False) variant.

Returns:

Sorted indices of the items of the cluster.

Return type:

list of int
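
To make the difference between the two variants concrete, here is a minimal sketch on a hand-built tree. It is not the library's implementation: `Node`, `all_nodes`, and `covering_order_sketch` are hypothetical names, the focus and rank values are made up, and the traversal simply follows the description above (clusters sorted by focus, then rank).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for Cluster with only the fields needed here."""
    indice: int          # head item of the cluster
    rank: int            # lower means better ranked
    focus: float = 1.0   # similarity inside the cluster
    children: list = field(default_factory=list)


def all_nodes(node):
    """Yield the node and every subcluster below it."""
    yield node
    for child in node.children:
        yield from all_nodes(child)


def covering_order_sketch(root, wide=True):
    order = []
    # Lexicographic exploration: lowest focus first, ties broken by rank.
    for node in sorted(all_nodes(root), key=lambda n: (n.focus, n.rank)):
        # Wide looks at the children's heads, Core at the cluster's own head.
        # Leaves have no children, so they fall back to their own head.
        candidates = node.children if wide and node.children else [node]
        for candidate in candidates:
            if candidate.indice not in order:
                order.append(candidate.indice)
    return order


# Made-up tree: two pairs of items merged under one root.
a, b, c, d = Node(0, 0), Node(1, 1), Node(2, 2), Node(3, 3)
ab = Node(0, 0, focus=0.8, children=[a, b])
cd = Node(2, 2, focus=0.6, children=[c, d])
root = Node(0, 0, focus=0.2, children=[ab, cd])

core = covering_order_sketch(root, wide=False)  # [0, 2, 1, 3]
wide = covering_order_sketch(root, wide=True)   # [0, 2, 3, 1]
print(core, wide)
```

On this toy tree the two variants agree on the first two items and differ on the order in which the remaining items are reached.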

gismo.clustering.get_sim(csr, arr)

Simple similarity computation between csr_matrix and ndarray.

Parameters:
  • csr (csr_matrix)

  • arr (ndarray)

Return type:

float

gismo.clustering.merge_clusters(cluster_list: list, focus=1.0)

Complete merge operation. In addition to the basic merge provided by Cluster, it ensures the following:

  • Consistency of focus by integrating the extra-focus (typically given by subspace_partition()).

  • Children (the members of the list) are sorted according to their respective rank.

Parameters:
  • cluster_list (list of Cluster) – The clusters to merge into one cluster.

  • focus (float) – Evaluation of the focus (similarity) between clusters.

Returns:

The cluster merging the list.

Return type:

Cluster
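
The two guarantees listed above can be illustrated with a dictionary-based sketch. This is not the library's code: `merge_sketch` is a hypothetical helper, and combining the extra focus with the children's focus through a minimum is an assumption made for illustration.

```python
def merge_sketch(clusters, focus=1.0):
    """Hypothetical merge of dict-based clusters (illustration only)."""
    # Children are sorted by rank, best (lowest) rank first.
    children = sorted(clusters, key=lambda c: c["rank"])
    head = children[0]
    return {
        "indice": head["indice"],
        "rank": head["rank"],
        # Members keep the order of the input list, as in a plain sum.
        "members": [m for c in clusters for m in c["members"]],
        # Assumption: the merged focus cannot exceed the extra focus
        # nor the focus of any child.
        "focus": min([focus] + [c["focus"] for c in children]),
        "children": children,
    }


first = {"indice": 3, "rank": 2, "members": [3], "focus": 1.0, "children": []}
second = {"indice": 7, "rank": 0, "members": [7], "focus": 0.9, "children": []}
merged = merge_sketch([first, second], focus=0.5)
print(merged["indice"], merged["members"], merged["focus"])
```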

gismo.clustering.rec_clusterize(cluster_list: list, resolution=0.7)

Auxiliary recursive function for clustering.

Parameters:
  • cluster_list (list of Cluster) – Current aggregation state.

  • resolution (float in range [0.0, 1.0]) – Sets the laziness of aggregation. A ‘resolution’ set to 0.0 yields a one-step clustering (star structure), while a ‘resolution’ set to 1.0 yields, up to tie similarities, a binary tree (dendrogram).

Return type:

list of Cluster

gismo.clustering.subspace_clusterize(subspace, resolution=0.7, indices=None)

Converts a subspace (matrix seen as a list of vectors) to a Cluster object (hierarchical clustering).

Parameters:
  • subspace (ndarray, csr_matrix) – A k x m matrix seen as a list of k m-dimensional vectors sorted by importance order.

  • resolution (float in range [0.0, 1.0]) – Sets the laziness of aggregation. A ‘resolution’ set to 0.0 yields a one-step clustering (star structure), while a ‘resolution’ set to 1.0 yields, up to tie similarities, a binary tree (dendrogram).

  • indices (list, optional) – Indicates the index for each element of the subspace. Used when ‘subspace’ is extracted from a larger space (e.g. X or Y). If not set, indices are set to range(k).

Returns:

A cluster whose leaves are the k vectors from ‘subspace’.

Return type:

Cluster

Example

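>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from gismo.common import toy_source_text
>>> from gismo.corpus import Corpus
>>> from gismo.embedding import Embedding
>>> from gismo.clustering import subspace_clusterize
>>> corpus = Corpus(toy_source_text)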
>>> vectorizer = CountVectorizer(dtype=float)
>>> embedding = Embedding(vectorizer=vectorizer)
>>> embedding.fit_transform(corpus)
>>> subspace = embedding.x[1:, :]
>>> cluster = subspace_clusterize(subspace)
>>> len(cluster.children)
2
>>> cluster = subspace_clusterize(subspace, resolution=.02)
>>> len(cluster.children)
4

gismo.clustering.subspace_distortion(indices, data, relevance, distortion: float)

Applies an in-place distortion to a subspace, according to relevance.

Parameters:
  • indices (ndarray) – Indices attribute of the subspace csr_matrix.

  • data (ndarray) – Data attribute of the subspace csr_matrix.

  • relevance (ndarray) – Relevance values in the embedding space.

  • distortion (float in [0.0, 1.0]) – Power applied to relevance for distortion.
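
One plausible reading of these parameters, sketched with plain lists instead of NumPy arrays: each stored value is scaled by the relevance of its column raised to the power `distortion`. The helper name `distort_in_place` and the sample values are hypothetical, and the exact formula used by the library is an assumption here.

```python
def distort_in_place(indices, data, relevance, distortion):
    """Scale each stored value by relevance[column] ** distortion, in place."""
    for position, column in enumerate(indices):
        # Assumption: the distortion is a multiplicative power of relevance.
        data[position] *= relevance[column] ** distortion


# CSR-like layout: data[i] is the value stored in column indices[i].
indices = [0, 2, 1]
data = [1.0, 2.0, 4.0]
relevance = [0.25, 1.0, 0.0625]

distort_in_place(indices, data, relevance, distortion=0.5)
print(data)  # approximately [0.5, 0.5, 4.0]
```

Under this reading, a distortion of 0.0 leaves the data untouched, while 1.0 multiplies each value by the full relevance of its column.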

gismo.clustering.subspace_partition(subspace, resolution=0.7)

Proposes a partition of the subspace that merges together vectors with a similar direction.

Parameters:
  • subspace (ndarray, csr_matrix) – A k x m matrix seen as a list of k m-dimensional vectors sorted by importance order.

  • resolution (float in range [0.0, 1.0]) – How strict the merging should be. 0.0 will merge all items together, while 1.0 will only merge mutually closest items.

Returns:

A list of subsets that form a partition. Each subset is represented by a pair (p, f), where p is the set of indices of the subset and f is the typical similarity within that subset (called focus).

Return type:

list
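
The shape of the returned value can be checked with a short sketch. The list below is made up for illustration (it is not an actual output of subspace_partition), and `is_partition` is a hypothetical helper.

```python
def is_partition(pairs, k):
    """Check that the subsets in (p, f) pairs are disjoint and cover range(k)."""
    seen = set()
    for members, focus in pairs:
        members = set(members)
        if seen & members:   # an index appears in two subsets
            return False
        seen |= members
    return seen == set(range(k))


# Made-up result for k = 5 vectors: each pair is (indices, focus).
example = [({0, 2}, 0.8), ({1}, 1.0), ({3, 4}, 0.75)]

print(is_partition(example, 5))
for members, focus in example:
    print(sorted(members), focus)
```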