{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Toy example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you have never used Gismo before, this tutorial is a good place to start."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A typical Gismo workflow goes as follows:\n",
    "- Its input is a list of objects, called the source;\n",
    "- A source is wrapped into a Corpus object;\n",
    "- A dual embedding is computed that relates objects and their content;\n",
    "- The embedding fuels a query-based ranking function;\n",
    "- The best results of a query can be organized in a hierarchical way."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Source"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:31.094544Z",
     "start_time": "2022-12-27T10:55:28.290059Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'},\n",
       " {'title': 'Second Document', 'content': 'This is a sentence about Blade.'},\n",
       " {'title': 'Third Document',\n",
       "  'content': 'This is another sentence about Shadoks.'},\n",
       " {'title': 'Fourth Document',\n",
       "  'content': 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.'},\n",
       " {'title': 'Fifth Document',\n",
       "  'content': 'In chinese folklore, a Mogwaï is a demon.'}]"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from gismo.common import toy_source_dict\n",
    "toy_source_dict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Corpus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ``to_text`` parameter specifies how to turn a source object into text (``str``). The ``iterate_text`` method iterates over the textified objects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:31.109639Z",
     "start_time": "2022-12-27T10:55:31.097544Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Gizmo is a Mogwaï.\n",
      "This is a sentence about Blade.\n",
      "This is another sentence about Shadoks.\n",
      "This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.\n",
      "In chinese folklore, a Mogwaï is a demon.\n"
     ]
    }
   ],
   "source": [
    "from gismo.corpus import Corpus\n",
    "corpus = Corpus(source=toy_source_dict, to_text=lambda x: x['content'])\n",
    "print(\"\\n\".join(corpus.iterate_text()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Embedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Gismo embedding relies on sklearn's ``CountVectorizer`` to extract features (words) from text. If no vectorizer is passed to the constructor, a default one is used, but it is good practice to shape one's own vectorizer to keep fine control over its parameters.\n",
    "\n",
    "Note: always set ``dtype=float`` when building your vectorizer, as the default ``int`` type will break things."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:31.125547Z",
     "start_time": "2022-12-27T10:55:31.111546Z"
    }
   },
   "outputs": [],
   "source": [
    "from gismo.embedding import Embedding\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "vectorizer = CountVectorizer(dtype=float)\n",
    "embedding = Embedding(vectorizer=vectorizer)"
   ]
  },
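  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an illustration of shaping one's own vectorizer, here is a minimal sketch that relies on scikit-learn only. The two-sentence ``docs`` list, the ``custom_vectorizer`` name and the ``stop_words`` choice are assumptions made for this example; they are not Gismo defaults, and the vectorizer built here is not used in the rest of the tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "# Hypothetical mini-corpus, only used for this illustration.\n",
    "docs = [\"Gizmo is a Mogwaï.\", \"In chinese folklore, a Mogwaï is a demon.\"]\n",
    "\n",
    "# dtype=float as advised above; stop_words=\"english\" drops common English words.\n",
    "custom_vectorizer = CountVectorizer(dtype=float, stop_words=\"english\")\n",
    "counts = custom_vectorizer.fit_transform(docs)\n",
    "\n",
    "# The count matrix is stored as floats rather than integers.\n",
    "print(counts.dtype)\n",
    "# Features that survive the stop-word filter.\n",
    "print(custom_vectorizer.get_feature_names_out())"
   ]
  },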
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ``fit_transform`` method builds the embedding. It combines the ``fit`` and ``transform`` methods.\n",
    "- ``fit`` computes the vocabulary (list of features) of the corpus and their IDF weights.\n",
    "- ``transform`` computes the ITF weights of the documents and the embeddings of documents and features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:33.434965Z",
     "start_time": "2022-12-27T10:55:31.129548Z"
    }
   },
   "outputs": [],
   "source": [
    "embedding.fit_transform(corpus)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After fitting a corpus, the features can be accessed through ``features``."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:33.450227Z",
     "start_time": "2022-12-27T10:55:33.438618Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'about, and, another, at, blade, by, chinese, comparing, demon, folklore, gizmo, gremlins, in, inside, is, long, lot, makes, mogwaï, movie, of, point, reference, sentence, shadoks, side, some, star, stuff, the, this, to, very, wars, with, yoda'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\", \".join(embedding.features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After transformation, a dual embedding is available between the ``embedding.n`` documents and the ``embedding.m`` features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:33.482581Z",
     "start_time": "2022-12-27T10:55:33.452755Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embedding.n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:33.497588Z",
     "start_time": "2022-12-27T10:55:33.485594Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "36"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embedding.m"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "``x`` is a stochastic csr matrix that represents documents as vectors of features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:33.513615Z",
     "start_time": "2022-12-27T10:55:33.500597Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<5x36 sparse matrix of type '<class 'numpy.float64'>'\n",
       "\twith 47 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embedding.x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "``y`` is a stochastic csr matrix that represents features as vectors of documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:33.529700Z",
     "start_time": "2022-12-27T10:55:33.515506Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<36x5 sparse matrix of type '<class 'numpy.float64'>'\n",
       "\twith 47 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embedding.y"
   ]
  },
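  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the word *stochastic* concrete, the sketch below builds two matrices whose rows each sum to 1, using scikit-learn only. It is an illustration under assumptions: plain L1 normalization of raw counts stands in for Gismo's own weighting, and ``docs``, ``x_sketch`` and ``y_sketch`` are hypothetical names that do not belong to the Gismo API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.preprocessing import normalize\n",
    "\n",
    "# Hypothetical mini-corpus, only used for this illustration.\n",
    "docs = [\"Gizmo is a Mogwaï.\", \"In chinese folklore, a Mogwaï is a demon.\"]\n",
    "counts = CountVectorizer(dtype=float).fit_transform(docs)\n",
    "\n",
    "# Documents as distributions over features: L1-normalize each row.\n",
    "x_sketch = normalize(counts, norm=\"l1\", axis=1)\n",
    "# Features as distributions over documents: same idea on the transpose.\n",
    "y_sketch = normalize(counts.T.tocsr(), norm=\"l1\", axis=1)\n",
    "\n",
    "# Every row sum should be 1, up to floating-point rounding.\n",
    "print(np.asarray(x_sketch.sum(axis=1)).ravel())\n",
    "print(np.asarray(y_sketch.sum(axis=1)).ravel())"
   ]
  },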
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Ranking"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To be able to rank documents according to a specific query, we construct a Gismo object from a corpus and an embedding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:33.545662Z",
     "start_time": "2022-12-27T10:55:33.534692Z"
    }
   },
   "outputs": [],
   "source": [
    "from gismo.gismo import Gismo\n",
    "gismo = Gismo(corpus, embedding)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A query is made by using the ``rank`` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:34.759533Z",
     "start_time": "2022-12-27T10:55:33.550653Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gismo.rank(\"Gizmo\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Results ordered by ranking (i.e. relevance to the query) are accessed through the ``get_documents_by_rank`` and ``get_features_by_rank`` methods. The number of returned results can be set with the ``k`` parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:34.774833Z",
     "start_time": "2022-12-27T10:55:34.764537Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'},\n",
       " {'title': 'Fourth Document',\n",
       "  'content': 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.'},\n",
       " {'title': 'Fifth Document',\n",
       "  'content': 'In chinese folklore, a Mogwaï is a demon.'},\n",
       " {'title': 'Second Document', 'content': 'This is a sentence about Blade.'},\n",
       " {'title': 'Third Document',\n",
       "  'content': 'This is another sentence about Shadoks.'}]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gismo.get_documents_by_rank(k=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If not specified, the number of documents is automatically estimated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:34.806278Z",
     "start_time": "2022-12-27T10:55:34.777638Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'}]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gismo.get_documents_by_rank()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As the dataset is small here, the default estimator is very conservative. The ``target_k`` parameter can be used to tune it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:34.822299Z",
     "start_time": "2022-12-27T10:55:34.808278Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'},\n",
       " {'title': 'Fourth Document',\n",
       "  'content': 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.'},\n",
       " {'title': 'Fifth Document',\n",
       "  'content': 'In chinese folklore, a Mogwaï is a demon.'}]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gismo.parameters.target_k = .2\n",
    "gismo.get_documents_by_rank()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:34.838278Z",
     "start_time": "2022-12-27T10:55:34.824278Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['mogwaï', 'gizmo', 'is', 'in', 'demon', 'chinese', 'folklore']"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gismo.get_features_by_rank()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, outputs are lists of raw documents and features. It can be convenient to post-process them by setting ``post_documents_item`` and ``post_features_item``. Gismo provides a few basic post-processing functions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:34.854117Z",
     "start_time": "2022-12-27T10:55:34.840273Z"
    }
   },
   "outputs": [],
   "source": [
    "from gismo.post_processing import post_documents_item_content\n",
    "gismo.post_documents_item = post_documents_item_content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:34.869088Z",
     "start_time": "2022-12-27T10:55:34.857088Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Gizmo is a Mogwaï.',\n",
       " 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.',\n",
       " 'In chinese folklore, a Mogwaï is a demon.']"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gismo.get_documents_by_rank()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ranking algorithm is hosted inside ``gismo.diteration``. Runtime parameters are managed inside ``gismo.parameters``. One of the most important parameters is ``alpha`` $\\in [0,1]$, which controls the *range* of the underlying graph diffusion. Small values of ``alpha`` yield results close to the initial query. Larger values take the relationships between documents and features more into account."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:34.884678Z",
     "start_time": "2022-12-27T10:55:34.871097Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Gizmo is a Mogwaï.',\n",
       " 'In chinese folklore, a Mogwaï is a demon.',\n",
       " 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.']"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gismo.parameters.alpha = .8\n",
    "gismo.rank(\"Gizmo\")\n",
    "gismo.get_documents_by_rank()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clustering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Gismo can organize the best results into a tree through the ``get_documents_by_cluster`` and ``get_features_by_cluster`` methods. It is recommended to set post-processing functions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:34.900725Z",
     "start_time": "2022-12-27T10:55:34.886689Z"
    }
   },
   "outputs": [],
   "source": [
    "from gismo.post_processing import post_documents_cluster_print, post_features_cluster_print\n",
    "gismo.post_documents_cluster = post_documents_cluster_print\n",
    "gismo.post_features_cluster = post_features_cluster_print"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:35.076160Z",
     "start_time": "2022-12-27T10:55:34.904729Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " F: 0.05. R: 1.85. S: 0.99.\n",
      "- F: 0.68. R: 1.77. S: 0.98.\n",
      "-- Gizmo is a Mogwaï. (R: 1.23; S: 0.98)\n",
      "-- In chinese folklore, a Mogwaï is a demon. (R: 0.27; S: 0.72)\n",
      "-- This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda. (R: 0.26; S: 0.67)\n",
      "- F: 0.70. R: 0.08. S: 0.19.\n",
      "-- This is a sentence about Blade. (R: 0.04; S: 0.17)\n",
      "-- This is another sentence about Shadoks. (R: 0.04; S: 0.17)\n"
     ]
    }
   ],
   "source": [
    "gismo.get_documents_by_cluster(k=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: for each leaf (documents here), the post-processing indicates the **R**elevance (ranking weight) and **S**imilarity (cosine similarity) with respect to the query. For internal nodes (clusters of documents), a **F**ocus value indicates how similar the documents inside the cluster are."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The depth of the tree is controlled by a ``resolution`` parameter $\\in [0, 1]$. Low resolution yields a flat tree (star structure)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:35.107696Z",
     "start_time": "2022-12-27T10:55:35.080224Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " F: 0.04. R: 1.85. S: 0.99.\n",
      "- Gizmo is a Mogwaï. (R: 1.23; S: 0.98)\n",
      "- In chinese folklore, a Mogwaï is a demon. (R: 0.27; S: 0.72)\n",
      "- This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda. (R: 0.26; S: 0.67)\n",
      "- This is a sentence about Blade. (R: 0.04; S: 0.17)\n",
      "- This is another sentence about Shadoks. (R: 0.04; S: 0.17)\n"
     ]
    }
   ],
   "source": [
    "gismo.get_documents_by_cluster(k=5, resolution=.01)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "High resolution yields, up to ties, a binary tree (dendrogram)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:35.138582Z",
     "start_time": "2022-12-27T10:55:35.110704Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " F: 0.05. R: 1.85. S: 0.99.\n",
      "- F: 0.58. R: 1.77. S: 0.98.\n",
      "-- F: 0.69. R: 1.51. S: 0.98.\n",
      "--- Gizmo is a Mogwaï. (R: 1.23; S: 0.98)\n",
      "--- In chinese folklore, a Mogwaï is a demon. (R: 0.27; S: 0.72)\n",
      "-- This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda. (R: 0.26; S: 0.67)\n",
      "- F: 0.70. R: 0.08. S: 0.19.\n",
      "-- This is a sentence about Blade. (R: 0.04; S: 0.17)\n",
      "-- This is another sentence about Shadoks. (R: 0.04; S: 0.17)\n"
     ]
    }
   ],
   "source": [
    "gismo.get_documents_by_cluster(k=5, resolution=.9)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The principle is the same for features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2022-12-27T10:55:35.169617Z",
     "start_time": "2022-12-27T10:55:35.142580Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " F: 0.00. R: 1.23. S: 0.93.\n",
      "- F: 0.08. R: 1.22. S: 0.93.\n",
      "-- F: 0.99. R: 1.03. S: 0.97.\n",
      "--- mogwaï (R: 0.46; S: 0.98)\n",
      "--- gizmo (R: 0.44; S: 0.96)\n",
      "--- is (R: 0.13; S: 0.98)\n",
      "-- F: 1.00. R: 0.18. S: 0.21.\n",
      "--- in (R: 0.05; S: 0.21)\n",
      "--- chinese (R: 0.05; S: 0.21)\n",
      "--- folklore (R: 0.05; S: 0.21)\n",
      "--- demon (R: 0.05; S: 0.21)\n",
      "- blade (R: 0.01; S: 0.03)\n"
     ]
    }
   ],
   "source": [
    "gismo.get_features_by_cluster(k=8)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}