
A Generic Information Search… With a Mind of its Own!#

License: MIT

Gismo is an NLP tool that ranks and organizes a corpus of documents according to a query.

Gismo stands for Generic Information Search… with a Mind of its Own.

Features#

Gismo combines three main ideas:

  • TF-IDTF: a symmetric version of the TF-IDF embedding.

  • DIteration: a fast, push-based variant of the PageRank algorithm.

  • Fuzzy dendrogram: a variant of the Louvain clustering algorithm.
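
To give an intuition for what "push-based" means in the second item, here is a minimal sketch in plain Python. It is not Gismo's implementation: the function name `push_pagerank`, its parameters, and the greedy scheduling rule are assumptions chosen for illustration. Each node holds some residual "fluid"; processing a node absorbs its fluid into its score and pushes a damped share to the nodes it links to.

```python
def push_pagerank(graph, damping=0.85, tol=1e-10):
    """Approximate PageRank-like scores by pushing residual fluid along edges.

    `graph` maps each node to the list of nodes it links to.
    Illustrative sketch only; every link target must also be a key of `graph`.
    """
    nodes = list(graph)
    # Residual fluid still to be diffused; starts as the teleportation term.
    fluid = {node: (1.0 - damping) / len(nodes) for node in nodes}
    # Fluid already absorbed by each node; this converges to the scores.
    history = {node: 0.0 for node in nodes}
    while True:
        # Greedy schedule (an assumption): pick the node with the most fluid.
        node = max(fluid, key=fluid.get)
        amount = fluid[node]
        if amount < tol:
            break
        # Absorb the fluid, then push a damped share to each out-neighbor.
        history[node] += amount
        fluid[node] = 0.0
        targets = graph[node]
        # Nodes without outgoing links simply drop their fluid in this sketch.
        for target in targets:
            fluid[target] += damping * amount / len(targets)
    return history


if __name__ == "__main__":
    # Hypothetical three-page graph, used only to exercise the function.
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(push_pagerank(toy_graph).items()):
        print(page, round(score, 4))
```

Unlike the classical power iteration, which updates every node at each step, a push scheme only touches one node and its out-neighbors at a time, so effort concentrates where residual fluid remains.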

Quickstart#

Install gismo:

$ pip install gismo

Use gismo in a Python project:

>>> from gismo.common import toy_source_dict
>>> from gismo import Corpus, Embedding, CountVectorizer, Gismo
>>> corpus = Corpus(toy_source_dict, to_text=lambda x: x['content'])
>>> embedding = Embedding(vectorizer=CountVectorizer(dtype=float))
>>> embedding.fit_transform(corpus)
>>> gismo = Gismo(corpus, embedding)
>>> gismo.rank("Mogwaï")
>>> gismo.get_features_by_rank()
['mogwaï', 'gizmo', 'chinese', 'in', 'demon', 'folklore', 'is']

To get the hang of a typical Gismo workflow, start with the Toy Example notebook. For more advanced uses, see the other tutorials or go directly to the reference section.

Credits#

Thanks to Thomas Bonald, Anne Bouillard, Marc-Olivier Buob, and Dohy Hong for their helpful contributions.

This package was created with Cookiecutter and the francois-durand/package_helper project template.
