{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Toy example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have never used Gismo before, you should probably start with this tutorial." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A typical Gismo workflow goes as follows:\n", "- Its input is a list of objects, called the source;\n", "- A source is wrapped into a Corpus object;\n", "- A dual embedding is computed that relates objects and their content;\n", "- The embedding fuels a query-based ranking function;\n", "- The best results of a query can be organized in a hierarchical way." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Source" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:31.094544Z", "start_time": "2022-12-27T10:55:28.290059Z" } }, "outputs": [ { "data": { "text/plain": [ "[{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'},\n", " {'title': 'Second Document', 'content': 'This is a sentence about Blade.'},\n", " {'title': 'Third Document',\n", " 'content': 'This is another sentence about Shadoks.'},\n", " {'title': 'Fourth Document',\n", " 'content': 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.'},\n", " {'title': 'Fifth Document',\n", " 'content': 'In chinese folklore, a Mogwaï is a demon.'}]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from gismo.common import toy_source_dict\n", "toy_source_dict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Corpus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``to_text`` parameter specifies how to turn a source object into text (``str``). The ``iterate_text`` method iterates over the textified objects." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:31.109639Z", "start_time": "2022-12-27T10:55:31.097544Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Gizmo is a Mogwaï.\n", "This is a sentence about Blade.\n", "This is another sentence about Shadoks.\n", "This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.\n", "In chinese folklore, a Mogwaï is a demon.\n" ] } ], "source": [ "from gismo.corpus import Corpus\n", "corpus = Corpus(source=toy_source_dict, to_text=lambda x: x['content'])\n", "print(\"\\n\".join(corpus.iterate_text()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Embedding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Gismo embedding relies on sklearn's ``CountVectorizer`` to extract features (words) from text. If no vectorizer is provided to the constructor, a default one is used, but it is good practice to build your own vectorizer to keep fine control over its parameters.\n", "\n", "Note: always set ``dtype=float`` when building your vectorizer, as the default ``int`` type will break things." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:31.125547Z", "start_time": "2022-12-27T10:55:31.111546Z" } }, "outputs": [], "source": [ "from gismo.embedding import Embedding\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "vectorizer = CountVectorizer(dtype=float)\n", "embedding = Embedding(vectorizer=vectorizer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``fit_transform`` method builds the embedding. 
It combines the ``fit`` and ``transform`` methods.\n", "- ``fit`` computes the vocabulary (list of features) of the corpus and their IDF weights.\n", "- ``transform`` computes the ITF weights of the documents and the embeddings of documents and features." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:33.434965Z", "start_time": "2022-12-27T10:55:31.129548Z" } }, "outputs": [], "source": [ "embedding.fit_transform(corpus)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After fitting a corpus, the features can be accessed through ``features``." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:33.450227Z", "start_time": "2022-12-27T10:55:33.438618Z" } }, "outputs": [ { "data": { "text/plain": [ "'about, and, another, at, blade, by, chinese, comparing, demon, folklore, gizmo, gremlins, in, inside, is, long, lot, makes, mogwaï, movie, of, point, reference, sentence, shadoks, side, some, star, stuff, the, this, to, very, wars, with, yoda'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\", \".join(embedding.features)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After transformation, a dual embedding is available between the ``embedding.n`` documents and the ``embedding.m`` features." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:33.482581Z", "start_time": "2022-12-27T10:55:33.452755Z" } }, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embedding.n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:33.497588Z", "start_time": "2022-12-27T10:55:33.485594Z" } }, "outputs": [ { "data": { "text/plain": [ "36" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embedding.m" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "``x`` is a stochastic csr matrix that represents documents as vectors of features." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:33.513615Z", "start_time": "2022-12-27T10:55:33.500597Z" } }, "outputs": [ { "data": { "text/plain": [ "<5x36 sparse matrix of type ''\n", "\twith 47 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embedding.x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "``y`` is a stochastic csr matrix that represents features as vectors of documents." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:33.529700Z", "start_time": "2022-12-27T10:55:33.515506Z" } }, "outputs": [ { "data": { "text/plain": [ "<36x5 sparse matrix of type ''\n", "\twith 47 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embedding.y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ranking" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To be able to rank documents according to a specific query, we construct a Gismo object from a corpus and an embedding." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:33.545662Z", "start_time": "2022-12-27T10:55:33.534692Z" } }, "outputs": [], "source": [ "from gismo.gismo import Gismo\n", "gismo = Gismo(corpus, embedding)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A query is made by using the ``rank`` method." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:34.759533Z", "start_time": "2022-12-27T10:55:33.550653Z" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gismo.rank(\"Gizmo\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Results ordered by ranking (i.e. relevance to the query) are accessed through the ``get_documents_by_rank`` and ``get_features_by_rank`` methods. The number of results to return can be set with the ``k`` parameter." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:34.774833Z", "start_time": "2022-12-27T10:55:34.764537Z" } }, "outputs": [ { "data": { "text/plain": [ "[{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'},\n", " {'title': 'Fourth Document',\n", " 'content': 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.'},\n", " {'title': 'Fifth Document',\n", " 'content': 'In chinese folklore, a Mogwaï is a demon.'},\n", " {'title': 'Second Document', 'content': 'This is a sentence about Blade.'},\n", " {'title': 'Third Document',\n", " 'content': 'This is another sentence about Shadoks.'}]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gismo.get_documents_by_rank(k=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If not specified, the number of documents is automatically estimated." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:34.806278Z", "start_time": "2022-12-27T10:55:34.777638Z" } }, "outputs": [ { "data": { "text/plain": [ "[{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'}]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gismo.get_documents_by_rank()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the dataset is small here, the default estimator is very conservative. We can use `target_k` to tune that. 
" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:34.822299Z", "start_time": "2022-12-27T10:55:34.808278Z" } }, "outputs": [ { "data": { "text/plain": [ "[{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'},\n", " {'title': 'Fourth Document',\n", " 'content': 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.'},\n", " {'title': 'Fifth Document',\n", " 'content': 'In chinese folklore, a Mogwaï is a demon.'}]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gismo.parameters.target_k = .2\n", "gismo.get_documents_by_rank()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:34.838278Z", "start_time": "2022-12-27T10:55:34.824278Z" } }, "outputs": [ { "data": { "text/plain": [ "['mogwaï', 'gizmo', 'is', 'in', 'demon', 'chinese', 'folklore']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gismo.get_features_by_rank()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, outputs are lists of raw documents and features. It can be convenient to post-process them by setting ``post_documents_item`` and ``post_features_item``. Gismo provides a few basic post-processing functions." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:34.854117Z", "start_time": "2022-12-27T10:55:34.840273Z" } }, "outputs": [], "source": [ "from gismo.post_processing import post_documents_item_content\n", "gismo.post_documents_item = post_documents_item_content" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:34.869088Z", "start_time": "2022-12-27T10:55:34.857088Z" } }, "outputs": [ { "data": { "text/plain": [ "['Gizmo is a Mogwaï.',\n", " 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.',\n", " 'In chinese folklore, a Mogwaï is a demon.']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gismo.get_documents_by_rank()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ranking algorithm is hosted inside ``gismo.diteration``. Runtime parameters are managed inside ``gismo.parameters``. One of the most important parameters is ``alpha`` $\\in [0,1]$, which controls the *range* of the underlying graph diffusion. Small values of ``alpha`` will yield results close to the initial query. Larger values give more weight to the relationships between documents and features. 
" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:34.884678Z", "start_time": "2022-12-27T10:55:34.871097Z" } }, "outputs": [ { "data": { "text/plain": [ "['Gizmo is a Mogwaï.',\n", " 'In chinese folklore, a Mogwaï is a demon.',\n", " 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.']" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gismo.parameters.alpha = .8\n", "gismo.rank(\"Gizmo\")\n", "gismo.get_documents_by_rank()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Gismo can organize the best results into a tree through the ``get_documents_by_cluster`` and ``get_features_by_cluster`` methods. It is recommended to set post-processing functions." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:34.900725Z", "start_time": "2022-12-27T10:55:34.886689Z" } }, "outputs": [], "source": [ "from gismo.post_processing import post_documents_cluster_print, post_features_cluster_print\n", "gismo.post_documents_cluster = post_documents_cluster_print\n", "gismo.post_features_cluster = post_features_cluster_print" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:35.076160Z", "start_time": "2022-12-27T10:55:34.904729Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " F: 0.05. R: 1.85. S: 0.99.\n", "- F: 0.68. R: 1.77. S: 0.98.\n", "-- Gizmo is a Mogwaï. (R: 1.23; S: 0.98)\n", "-- In chinese folklore, a Mogwaï is a demon. (R: 0.27; S: 0.72)\n", "-- This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda. 
(R: 0.26; S: 0.67)\n", "- F: 0.70. R: 0.08. S: 0.19.\n", "-- This is a sentence about Blade. (R: 0.04; S: 0.17)\n", "-- This is another sentence about Shadoks. (R: 0.04; S: 0.17)\n" ] } ], "source": [ "gismo.get_documents_by_cluster(k=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: for each leaf (documents here), the post-processing indicates the **R**elevance (ranking weight) and **S**imilarity (cosine similarity) with respect to the query. For internal nodes (clusters of documents), a **F**ocus value indicates how similar the documents inside the cluster are." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The depth of the tree is controlled by a ``resolution`` parameter $\\in [0, 1]$. Low resolution yields a flat tree (star structure)." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:35.107696Z", "start_time": "2022-12-27T10:55:35.080224Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " F: 0.04. R: 1.85. S: 0.99.\n", "- Gizmo is a Mogwaï. (R: 1.23; S: 0.98)\n", "- In chinese folklore, a Mogwaï is a demon. (R: 0.27; S: 0.72)\n", "- This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda. (R: 0.26; S: 0.67)\n", "- This is a sentence about Blade. (R: 0.04; S: 0.17)\n", "- This is another sentence about Shadoks. (R: 0.04; S: 0.17)\n" ] } ], "source": [ "gismo.get_documents_by_cluster(k=5, resolution=.01)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "High resolution yields, up to ties, a binary tree (dendrogram)." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:35.138582Z", "start_time": "2022-12-27T10:55:35.110704Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " F: 0.05. R: 1.85. S: 0.99.\n", "- F: 0.58. R: 1.77. 
S: 0.98.\n", "-- F: 0.69. R: 1.51. S: 0.98.\n", "--- Gizmo is a Mogwaï. (R: 1.23; S: 0.98)\n", "--- In chinese folklore, a Mogwaï is a demon. (R: 0.27; S: 0.72)\n", "-- This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda. (R: 0.26; S: 0.67)\n", "- F: 0.70. R: 0.08. S: 0.19.\n", "-- This is a sentence about Blade. (R: 0.04; S: 0.17)\n", "-- This is another sentence about Shadoks. (R: 0.04; S: 0.17)\n" ] } ], "source": [ "gismo.get_documents_by_cluster(k=5, resolution=.9)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The principle is the same for features." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:55:35.169617Z", "start_time": "2022-12-27T10:55:35.142580Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " F: 0.00. R: 1.23. S: 0.93.\n", "- F: 0.08. R: 1.22. S: 0.93.\n", "-- F: 0.99. R: 1.03. S: 0.97.\n", "--- mogwaï (R: 0.46; S: 0.98)\n", "--- gizmo (R: 0.44; S: 0.96)\n", "--- is (R: 0.13; S: 0.98)\n", "-- F: 1.00. R: 0.18. 
S: 0.21.\n", "--- in (R: 0.05; S: 0.21)\n", "--- chinese (R: 0.05; S: 0.21)\n", "--- folklore (R: 0.05; S: 0.21)\n", "--- demon (R: 0.05; S: 0.21)\n", "- blade (R: 0.01; S: 0.03)\n" ] } ], "source": [ "gismo.get_features_by_cluster(k=8)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 4 }