{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ACM categories" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial shows how ACM categories can be studied with Gismo.\n", "\n", "If you have never used Gismo before, you may want to start with the *Toy example tutorial*.\n", "\n", "Imagine that you want to submit an article and are asked to provide an ACM category and some generic keywords. Let's see how Gismo can help you.\n", "\n", "Here, *documents* are ACM categories. The *features* of a category will be the words of its name along with the words of the names of its descendants." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initialisation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we load the required packages." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:16.170919Z", "start_time": "2022-12-27T10:56:14.387174Z" } }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.metrics.pairwise import cosine_similarity\n", "import numpy as np\n", "\n", "from gismo.datasets.acm import get_acm, flatten_acm\n", "from gismo.corpus import Corpus\n", "from gismo.embedding import Embedding\n", "from gismo.gismo import Gismo\n", "from gismo.post_processing import post_features_cluster_print, post_documents_cluster_print" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we load the ACM source. Note that we flatten the source, i.e. the existing hierarchy is discarded, as Gismo will provide its own dynamic, query-based structure." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:16.202254Z", "start_time": "2022-12-27T10:56:16.176824Z" } }, "outputs": [], "source": [ "acm = flatten_acm(get_acm())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each category in the ``acm`` list is a dict with ``name`` and ``query`` keys. We build a corpus that tells Gismo that the content of a category is its ``query`` value." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:16.218172Z", "start_time": "2022-12-27T10:56:16.206243Z" } }, "outputs": [], "source": [ "corpus = Corpus(acm, to_text=lambda x: x['query'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We build an embedding on top of that corpus.\n", "- We set ``min_df=3`` to exclude rare features.\n", "- We set ``ngram_range=(1, 3)`` to include bi-grams and tri-grams in the embedding.\n", "- We manually pick a few common words to exclude from the embedding." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:18.207394Z", "start_time": "2022-12-27T10:56:16.220184Z" } }, "outputs": [], "source": [ "vectorizer = CountVectorizer(min_df=3, ngram_range=(1, 3), dtype=float, stop_words=['to', 'and', 'by'])\n", "embedding = Embedding(vectorizer=vectorizer)\n", "embedding.fit_transform(corpus)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:18.238968Z", "start_time": "2022-12-27T10:56:18.212382Z" } }, "outputs": [ { "data": { "text/plain": [ "<234x6929 sparse matrix of type '<class 'numpy.float64'>'\n", "\twith 28014 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embedding.x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see from ``embedding.x`` that the embedding links 234 documents to 6,929 features. 
There are 28,014 weights: on average, each document is linked to more than 100 features, and each feature is linked to about 4 documents." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we instantiate the Gismo object and customize its post-processors to ease the display." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:18.254588Z", "start_time": "2022-12-27T10:56:18.242486Z" } }, "outputs": [], "source": [ "gismo = Gismo(corpus, embedding)\n", "gismo.post_documents_item = lambda g, i: g.corpus[i]['name']\n", "gismo.post_documents_cluster = post_documents_cluster_print\n", "gismo.post_features_cluster = post_features_cluster_print" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Machine Learning query" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We perform the query *Machine learning*. The returned ``True`` indicates that some of the query features were found among the corpus features.\n", "\n", "**Remark:** For this tutorial, we just enter a few words, but at the start of this Notebook, we talked about submitting an article. As a query can be as long as you want, you can call the ``rank`` method with the full textual content of your article." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:19.408544Z", "start_time": "2022-12-27T10:56:18.256515Z" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gismo.rank(\"Machine learning\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are the best ACM categories for an article on *Machine Learning*?" 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:19.424217Z", "start_time": "2022-12-27T10:56:19.411704Z" } }, "outputs": [ { "data": { "text/plain": [ "['Machine learning',\n", " 'Computing methodologies',\n", " 'Machine learning algorithms',\n", " 'Learning paradigms',\n", " 'Machine learning theory',\n", " 'Machine learning approaches',\n", " 'Theory and algorithms for application domains',\n", " 'Theory of computation',\n", " 'Natural language processing',\n", " 'Artificial intelligence',\n", " 'Learning settings',\n", " 'Supervised learning',\n", " 'Reinforcement learning',\n", " 'Education',\n", " 'Dynamic programming for Markov decision processes',\n", " 'Unsupervised learning']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gismo.get_documents_by_rank()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sounds nice. How are the top 10 domains related in the context of *Machine Learning*?" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:19.579954Z", "start_time": "2022-12-27T10:56:19.426360Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " F: 0.06. R: 0.52. S: 0.75.\n", "- F: 0.63. R: 0.48. S: 0.73.\n", "-- F: 0.78. R: 0.41. S: 0.70.\n", "--- F: 0.98. R: 0.16. S: 0.85.\n", "---- Machine learning (R: 0.09; S: 0.84)\n", "---- Computing methodologies (R: 0.06; S: 0.87)\n", "--- Learning paradigms (R: 0.06; S: 0.62)\n", "--- F: 0.94. R: 0.14. S: 0.63.\n", "---- Machine learning theory (R: 0.06; S: 0.61)\n", "---- Theory and algorithms for application domains (R: 0.05; S: 0.64)\n", "---- Theory of computation (R: 0.04; S: 0.66)\n", "--- Machine learning approaches (R: 0.05; S: 0.54)\n", "-- Machine learning algorithms (R: 0.06; S: 0.60)\n", "- F: 0.66. R: 0.04. 
S: 0.23.\n", "-- Natural language processing (R: 0.03; S: 0.21)\n", "-- Artificial intelligence (R: 0.02; S: 0.30)\n" ] } ], "source": [ "gismo.get_documents_by_cluster(k=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK! Let's decode this:\n", "- The mainstream categories form two main groups:\n", " - the practical fields (methodology, paradigms)\n", " - the theoretical fields\n", "- If you don't want to decide, you can go with approaches/algorithms.\n", "- But maybe your article uses machine learning to achieve NLP or AI?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's look at the main keywords." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:19.595349Z", "start_time": "2022-12-27T10:56:19.582938Z" } }, "outputs": [ { "data": { "text/plain": [ "['learning',\n", " 'reinforcement',\n", " 'reinforcement learning',\n", " 'decision',\n", " 'supervised learning',\n", " 'supervised',\n", " 'machine',\n", " 'iteration',\n", " 'learning learning',\n", " 'machine learning',\n", " 'markov decision',\n", " 'markov decision processes',\n", " 'decision processes',\n", " 'dynamic programming',\n", " 'processes',\n", " 'markov',\n", " 'methods',\n", " 'learning multi',\n", " 'multi agent',\n", " 'multi',\n", " 'dynamic',\n", " 'agent']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gismo.get_features_by_rank()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's organize them." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:19.626804Z", "start_time": "2022-12-27T10:56:19.598359Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " F: 0.82. R: 0.02. S: 0.95.\n", "- F: 0.89. R: 0.02. 
S: 0.95.\n", "-- learning (R: 0.00; S: 0.96)\n", "-- reinforcement (R: 0.00; S: 0.83)\n", "-- reinforcement learning (R: 0.00; S: 0.83)\n", "-- decision (R: 0.00; S: 0.96)\n", "-- supervised learning (R: 0.00; S: 0.81)\n", "-- supervised (R: 0.00; S: 0.81)\n", "-- machine (R: 0.00; S: 0.95)\n", "-- iteration (R: 0.00; S: 0.68)\n", "-- machine learning (R: 0.00; S: 0.93)\n", "-- markov decision (R: 0.00; S: 0.89)\n", "-- markov decision processes (R: 0.00; S: 0.89)\n", "-- decision processes (R: 0.00; S: 0.89)\n", "-- dynamic programming (R: 0.00; S: 0.70)\n", "-- processes (R: 0.00; S: 0.89)\n", "-- markov (R: 0.00; S: 0.89)\n", "-- methods (R: 0.00; S: 0.86)\n", "-- learning multi (R: 0.00; S: 0.82)\n", "-- multi agent (R: 0.00; S: 0.82)\n", "-- multi (R: 0.00; S: 0.85)\n", "-- dynamic (R: 0.00; S: 0.71)\n", "-- agent (R: 0.00; S: 0.82)\n", "- learning learning (R: 0.00; S: 0.75)\n" ] } ], "source": [ "gismo.get_features_by_cluster()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hum, not very informative. Let's increase the resolution to get more structure!" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:19.658090Z", "start_time": "2022-12-27T10:56:19.628801Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " F: 0.73. R: 0.02. S: 0.95.\n", "- F: 0.79. R: 0.02. S: 0.91.\n", "-- F: 0.94. R: 0.01. S: 0.93.\n", "--- F: 0.96. R: 0.01. S: 0.96.\n", "---- F: 0.96. R: 0.00. S: 0.97.\n", "----- learning (R: 0.00; S: 0.96)\n", "----- decision (R: 0.00; S: 0.96)\n", "---- F: 0.99. R: 0.00. S: 0.94.\n", "----- machine (R: 0.00; S: 0.95)\n", "----- machine learning (R: 0.00; S: 0.93)\n", "--- F: 0.96. R: 0.01. S: 0.89.\n", "---- F: 0.99. R: 0.00. S: 0.89.\n", "----- F: 1.00. R: 0.00. 
S: 0.89.\n", "------ markov decision (R: 0.00; S: 0.89)\n", "------ markov decision processes (R: 0.00; S: 0.89)\n", "------ decision processes (R: 0.00; S: 0.89)\n", "------ markov (R: 0.00; S: 0.89)\n", "----- processes (R: 0.00; S: 0.89)\n", "---- methods (R: 0.00; S: 0.86)\n", "-- F: 0.99. R: 0.00. S: 0.69.\n", "--- iteration (R: 0.00; S: 0.68)\n", "--- dynamic programming (R: 0.00; S: 0.70)\n", "--- dynamic (R: 0.00; S: 0.71)\n", "-- learning learning (R: 0.00; S: 0.75)\n", "- F: 0.95. R: 0.01. S: 0.85.\n", "-- F: 0.99. R: 0.00. S: 0.83.\n", "--- F: 1.00. R: 0.00. S: 0.83.\n", "---- reinforcement (R: 0.00; S: 0.83)\n", "---- reinforcement learning (R: 0.00; S: 0.83)\n", "--- multi (R: 0.00; S: 0.85)\n", "-- F: 0.97. R: 0.00. S: 0.82.\n", "--- F: 1.00. R: 0.00. S: 0.81.\n", "---- supervised learning (R: 0.00; S: 0.81)\n", "---- supervised (R: 0.00; S: 0.81)\n", "--- learning multi (R: 0.00; S: 0.82)\n", "-- F: 1.00. R: 0.00. S: 0.82.\n", "--- multi agent (R: 0.00; S: 0.82)\n", "--- agent (R: 0.00; S: 0.82)\n" ] } ], "source": [ "gismo.get_features_by_cluster(resolution=.9)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rough analysis:\n", "- Machine learning is about... Machine learning, which seems related to decision. Markov decision processes and dynamic programming seem to matter.\n", "- Reinforcement learning and supervised learning seem to be special categories of interest. It seems that multi-agent settings are involved." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## P2P query" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We perform the query *P2P*. The returned ``False`` indicates that P2P is not a feature of the corpus (it's a small corpus after all, made only of category titles)." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:19.689461Z", "start_time": "2022-12-27T10:56:19.660057Z" } }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gismo.rank(\"P2P\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to avoid the acronym. Ok, now it works." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:19.721479Z", "start_time": "2022-12-27T10:56:19.694459Z" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gismo.rank(\"Peer-to-peer\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are the best ACM categories for an article on *P2P*?" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:19.737115Z", "start_time": "2022-12-27T10:56:19.723471Z" } }, "outputs": [ { "data": { "text/plain": [ "['Network protocols',\n", " 'Distributed architectures',\n", " 'Networks',\n", " 'Network types',\n", " 'Search engine architectures and scalability',\n", " 'Software architectures',\n", " 'Software system structures',\n", " 'Architectures',\n", " 'Computer systems organization',\n", " 'Software organization and properties',\n", " 'Information retrieval',\n", " 'Software and its engineering',\n", " 'Information systems']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gismo.get_documents_by_rank()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sounds nice. How are these domains related in the context of *P2P*?" 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:19.767673Z", "start_time": "2022-12-27T10:56:19.739626Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " F: 0.34. R: 0.59. S: 0.79.\n", "- F: 0.90. R: 0.12. S: 0.60.\n", "-- Network protocols (R: 0.06; S: 0.58)\n", "-- Networks (R: 0.06; S: 0.71)\n", "- F: 0.59. R: 0.47. S: 0.69.\n", "-- F: 0.84. R: 0.31. S: 0.66.\n", "--- F: 0.89. R: 0.15. S: 0.64.\n", "---- Distributed architectures (R: 0.06; S: 0.62)\n", "---- Architectures (R: 0.04; S: 0.64)\n", "---- Computer systems organization (R: 0.04; S: 0.67)\n", "--- F: 0.89. R: 0.16. S: 0.62.\n", "---- Software architectures (R: 0.05; S: 0.56)\n", "---- Software system structures (R: 0.05; S: 0.65)\n", "---- Software organization and properties (R: 0.04; S: 0.68)\n", "---- Software and its engineering (R: 0.03; S: 0.70)\n", "-- Network types (R: 0.05; S: 0.52)\n", "-- F: 0.80. R: 0.11. S: 0.52.\n", "--- F: 0.95. R: 0.09. S: 0.50.\n", "---- Search engine architectures and scalability (R: 0.05; S: 0.50)\n", "---- Information retrieval (R: 0.04; S: 0.52)\n", "--- Information systems (R: 0.02; S: 0.65)\n" ] } ], "source": [ "gismo.get_documents_by_cluster()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK! Let's decode this. P2P relates to:\n", "- Network protocols\n", "- Architectures, with two main groups:\n", " - the design fields (*distributed architecture*, *organization*)\n", " - the implementation fields (*software*)\n", "- Inside architectures, but a little bit isolated, *search engine architectures and scalability* and *information retrieval / systems* point to the scalability of P2P networks. Specifically, a P2P expert will recognize Distributed Hash Tables, one of the main theoretical and practical successes of P2P." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's look at the main keywords." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:19.783675Z", "start_time": "2022-12-27T10:56:19.770673Z" } }, "outputs": [ { "data": { "text/plain": [ "['peer',\n", " 'protocols',\n", " 'protocol',\n", " 'peer peer',\n", " 'architectures',\n", " 'network',\n", " 'link',\n", " 'architectures tier',\n", " 'architectures tier architectures',\n", " 'tier architectures']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gismo.get_features_by_rank(k=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's organize them." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:19.815671Z", "start_time": "2022-12-27T10:56:19.786675Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " F: 0.63. R: 0.03. S: 0.92.\n", "- F: 1.00. R: 0.01. S: 0.97.\n", "-- peer (R: 0.01; S: 0.97)\n", "-- peer peer (R: 0.00; S: 0.97)\n", "- F: 0.84. R: 0.01. S: 0.57.\n", "-- F: 1.00. R: 0.01. S: 0.49.\n", "--- protocols (R: 0.00; S: 0.48)\n", "--- protocol (R: 0.00; S: 0.49)\n", "-- network (R: 0.00; S: 0.62)\n", "-- link (R: 0.00; S: 0.61)\n", "- F: 0.95. R: 0.01. S: 0.69.\n", "-- architectures (R: 0.00; S: 0.79)\n", "-- architectures tier (R: 0.00; S: 0.67)\n", "-- architectures tier architectures (R: 0.00; S: 0.67)\n", "-- tier architectures (R: 0.00; S: 0.67)\n" ] } ], "source": [ "gismo.get_features_by_cluster(k=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rough analysis:\n", "- One cluster about network protocols\n", "- One cluster about architectures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PageRank query" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We perform the query *PageRank*. The returned ``False`` indicates that *PageRank* is not a feature of the corpus (it's a small corpus after all, made only of category titles)." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:19.847204Z", "start_time": "2022-12-27T10:56:19.818675Z" } }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gismo.rank(\"Pagerank\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to avoid the copyright infringement. Ok, now it works." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:19.879204Z", "start_time": "2022-12-27T10:56:19.851207Z" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gismo.rank(\"ranking the web\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are the best ACM categories for an article on *PageRank*?" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:19.895208Z", "start_time": "2022-12-27T10:56:19.883212Z" } }, "outputs": [ { "data": { "text/plain": [ "['Web searching and information discovery',\n", " 'World Wide Web',\n", " 'Information systems',\n", " 'Web applications',\n", " 'Supervised learning',\n", " 'Retrieval models and ranking',\n", " 'Learning paradigms',\n", " 'Information retrieval',\n", " 'Machine learning',\n", " 'Web mining',\n", " 'Web services',\n", " 'Web data description languages',\n", " 'Computing methodologies',\n", " 'Security and privacy',\n", " 'Internet communications tools',\n", " 'Networks',\n", " 'Software and application security',\n", " 'Network security',\n", " 'Specialized information retrieval',\n", " 'Network types',\n", " 'Interaction paradigms',\n", " 'Middleware for databases',\n", " 'Network properties',\n", " 'Human computer interaction (HCI)']" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"gismo.get_documents_by_rank()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sounds nice. How are these domains related in the context of *PageRank*?" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:19.926846Z", "start_time": "2022-12-27T10:56:19.899724Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " F: 0.13. R: 0.43. S: 0.78.\n", "- F: 0.45. R: 0.27. S: 0.68.\n", "-- F: 0.81. R: 0.21. S: 0.70.\n", "--- Web searching and information discovery (R: 0.08; S: 0.62)\n", "--- F: 0.91. R: 0.13. S: 0.81.\n", "---- World Wide Web (R: 0.08; S: 0.79)\n", "---- Information systems (R: 0.05; S: 0.85)\n", "-- Web applications (R: 0.05; S: 0.49)\n", "-- Web mining (R: 0.02; S: 0.34)\n", "- F: 0.27. R: 0.16. S: 0.48.\n", "-- F: 0.92. R: 0.09. S: 0.42.\n", "--- Supervised learning (R: 0.04; S: 0.41)\n", "--- Learning paradigms (R: 0.03; S: 0.43)\n", "--- Machine learning (R: 0.02; S: 0.43)\n", "-- F: 0.81. R: 0.07. S: 0.37.\n", "--- Retrieval models and ranking (R: 0.04; S: 0.34)\n", "--- Information retrieval (R: 0.03; S: 0.45)\n" ] } ], "source": [ "gismo.get_documents_by_cluster(k=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hum, maybe something more compact. Let's lower the resolution (the default resolution is 0.7)." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:19.957665Z", "start_time": "2022-12-27T10:56:19.928847Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " F: 0.13. R: 0.43. S: 0.78.\n", "- F: 0.45. R: 0.27. S: 0.68.\n", "-- F: 0.81. R: 0.21. S: 0.70.\n", "--- Web searching and information discovery (R: 0.08; S: 0.62)\n", "--- World Wide Web (R: 0.08; S: 0.79)\n", "--- Information systems (R: 0.05; S: 0.85)\n", "-- Web applications (R: 0.05; S: 0.49)\n", "-- Web mining (R: 0.02; S: 0.34)\n", "- F: 0.27. R: 0.16. S: 0.48.\n", "-- F: 0.92. R: 0.09. 
S: 0.42.\n", "--- Supervised learning (R: 0.04; S: 0.41)\n", "--- Learning paradigms (R: 0.03; S: 0.43)\n", "--- Machine learning (R: 0.02; S: 0.43)\n", "-- F: 0.81. R: 0.07. S: 0.37.\n", "--- Retrieval models and ranking (R: 0.04; S: 0.34)\n", "--- Information retrieval (R: 0.03; S: 0.45)\n" ] } ], "source": [ "gismo.get_documents_by_cluster(k=10, resolution=.6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Better! Let's broadly decode this:\n", "- One cluster of categories is about the Web & Search\n", "- One cluster is about how-to:\n", " - learning techniques\n", " - information retrieval." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's look at the main keywords." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:19.972945Z", "start_time": "2022-12-27T10:56:19.960948Z" } }, "outputs": [ { "data": { "text/plain": [ "['web',\n", " 'ranking',\n", " 'learning',\n", " 'social',\n", " 'supervised learning',\n", " 'supervised',\n", " 'discovery',\n", " 'security',\n", " 'site',\n", " 'rank',\n", " 'search',\n", " 'learning rank']" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gismo.get_features_by_rank()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's organize them." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2022-12-27T10:56:20.003825Z", "start_time": "2022-12-27T10:56:19.976947Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " F: 0.01. R: 0.02. S: 0.93.\n", "- F: 0.13. R: 0.01. S: 0.93.\n", "-- F: 0.87. R: 0.01. S: 0.86.\n", "--- F: 0.87. R: 0.01. S: 0.85.\n", "---- web (R: 0.00; S: 0.86)\n", "---- ranking (R: 0.00; S: 0.91)\n", "---- social (R: 0.00; S: 0.84)\n", "---- discovery (R: 0.00; S: 0.80)\n", "---- site (R: 0.00; S: 0.77)\n", "--- search (R: 0.00; S: 0.83)\n", "-- F: 0.90. R: 0.01. 
S: 0.47.\n", "--- learning (R: 0.00; S: 0.44)\n", "--- supervised learning (R: 0.00; S: 0.35)\n", "--- supervised (R: 0.00; S: 0.35)\n", "--- rank (R: 0.00; S: 0.51)\n", "--- learning rank (R: 0.00; S: 0.50)\n", "- security (R: 0.00; S: 0.14)\n" ] } ], "source": [ "gismo.get_features_by_cluster()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rough analysis:\n", "- One cluster about the Web\n", "- One cluster about learning\n", "- One lone wolf: security" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }