DBLP exploration#

This tutorial shows how to explore DBLP with Gismo.

If you have never used Gismo before, you may want to start with the Toy example tutorial or the ACM tutorial.

Note: the DBLP database is not small. Recommended requirements to execute this Notebook:

  • Fast Internet connection (you will need to download a few hundred MB)

  • 4 GB of free space

  • 4 GB of RAM (8 GB or more recommended)

  • Decent CPU (can take more than one hour on slow CPUs)

Here, documents are articles in DBLP. The features attached to an article will vary: we first use the words of its title, then its authors.

Initialisation#

First, we load the required packages.

[1]:
import numpy as np
import spacy
from gismo import Corpus, Embedding, CountVectorizer, cosine_similarity, Gismo
from pathlib import Path
from functools import partial

from gismo.datasets.dblp import Dblp
from gismo.filesource import FileSource
from gismo.post_processing import post_features_cluster_print, post_documents_cluster_print

Then, we prepare the DBLP source.

First, we choose the location of the DBLP files. If you want to run this Notebook on your own machine, please change the path below and check that it exists.

[2]:
path = Path("../../../../../Datasets/DBLP")
path.exists()
[2]:
True
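If the folder does not exist yet, `path.exists()` returns False. A minimal sketch of preparing a folder and checking the free space recommended above, using only the standard library; the location `demo_path` is a hypothetical temporary folder chosen for illustration, not the one used in this Notebook:

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical location for illustration; replace with your own dataset folder.
demo_path = Path(tempfile.gettempdir()) / "gismo_demo" / "DBLP"

# Create the folder (and its parents) if missing; do nothing if it already exists.
demo_path.mkdir(parents=True, exist_ok=True)
folder_ready = demo_path.exists()

# Free space on the drive that hosts the folder, in gigabytes.
free_gb = shutil.disk_usage(demo_path).free / 1e9
print(f"Folder ready: {folder_ready}, free space: {free_gb:.1f} GB")
```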

Next, we build the DBLP files. This only needs to be done the first time or when you want to refresh the database. It takes about 10 minutes on a Surface Pro 4 with a fiber Internet connection.

[3]:
dblp = Dblp(path=path)
dblp.build()
Retrieve https://dblp.uni-trier.de/xml/dblp.xml.gz from the Internet.
DBLP database downloaded to ..\..\..\..\..\Datasets\DBLP\dblp.xml.gz.
Converting DBLP database from ..\..\..\..\..\Datasets\DBLP\dblp.xml.gz (may take a while).
Building Index.
Conversion done.

Then, we can load the database as a filesource.

[4]:
source = FileSource(filename="dblp", path=path)
source[0]
[4]:
{'type': 'inproceedings',
 'authors': ['Arnon Rosenthal'],
 'title': 'The Future of Classic Data Administration: Objects + Databases + CASE',
 'venue': 'SWEE',
 'year': '1998'}

Each article is a dict with fields type, venue, title, year, and authors. We build a corpus that will tell Gismo that the content of an article is its title value.

[5]:
corpus = Corpus(source, to_text=lambda x: x['title'])
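To see what the `to_text` argument does, it can be applied by hand to a single record. The sketch below is plain Python and does not need Gismo; the record is copied from the `source[0]` output above, and the function name `title_text` is only an illustrative stand-in for the lambda:

```python
# A record shaped like the entries of the DBLP source (copied from source[0]).
article = {
    "type": "inproceedings",
    "authors": ["Arnon Rosenthal"],
    "title": "The Future of Classic Data Administration: Objects + Databases + CASE",
    "venue": "SWEE",
    "year": "1998",
}


def title_text(x):
    """Same behavior as the lambda passed as to_text: keep only the title."""
    return x["title"]


text = title_text(article)
print(text)
```

Any other function from a record to a string could be used in the same way, which is how the authors-based corpus is built later in this tutorial.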

We build an embedding on top of that corpus.

  • We set min_df=30 to exclude rare features;

  • We set max_df=.02 to exclude anything present in more than 2% of the corpus;

  • We use spacy to lemmatize & remove some stopwords; remove preprocessor=... from the input if you want to skip this (takes time);

  • We add a few manually selected stopwords to fine-tune things;

  • We set ngram_range=(1, 2) to include bi-grams in the embedding.

This will take from a few minutes (without spacy) up to a few hours (with spacy enabled). You can save the embedding for later if you want.
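To make the role of min_df, max_df and ngram_range concrete, here is a minimal pure-Python sketch on a toy corpus. It is not the CountVectorizer implementation: it only splits on whitespace, and the helper names (`ngrams`, `select_features`), the toy titles and the toy thresholds are assumptions made for illustration.

```python
from collections import Counter


def ngrams(text, ngram_range=(1, 2)):
    """Return the set of word n-grams of a text, for all sizes in ngram_range."""
    words = text.lower().split()
    low, high = ngram_range
    return {
        " ".join(words[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(words) - n + 1)
    }


def select_features(texts, min_df, max_df, ngram_range=(1, 2)):
    """Keep n-grams present in at least min_df documents (absolute count)
    and in at most a fraction max_df of the documents."""
    df = Counter()
    for text in texts:
        # A set is used, so each document counts at most once per n-gram.
        df.update(ngrams(text, ngram_range))
    max_count = max_df * len(texts)
    return sorted(f for f, c in df.items() if min_df <= c <= max_count)


titles = [
    "deep learning for graphs",
    "deep learning for images",
    "learning to rank",
    "graph algorithms",
]
features = select_features(titles, min_df=2, max_df=0.5)
print(features)
```

With these toy thresholds, words that appear in a single title are dropped as too rare, and a word that appears in more than half of the titles is dropped as too frequent.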

[6]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
keep = {'ADJ', 'NOUN', 'NUM', 'PROPN', 'SYM', 'VERB'}
vectorizer = CountVectorizer(min_df=30, max_df=.02, ngram_range=(1, 2), dtype=float,
                             preprocessor=lambda txt: " ".join([w.lemma_.lower() for w in nlp(txt)
                                                                if w.pos_ in keep and not w.is_stop]),
                             stop_words=['a', 'about', 'an', 'and', 'for', 'from', 'in', 'of', 'on', 'the', 'with'])

# Reuse a previously saved embedding if it can be loaded; otherwise build and save it.
try:
    embedding = Embedding.load(filename="dblp_embedding", path=path)
except Exception:
    embedding = Embedding(vectorizer=vectorizer)
    embedding.fit_transform(corpus)
    embedding.dump(filename="dblp_embedding", path=path)
[7]:
embedding.x
[7]:
<Compressed Sparse Row sparse matrix of dtype 'float64'
        with 76864824 stored elements and shape (7513218, 235264)>

We see from embedding.x that the embedding links about 7,500,000 documents to 235,000 features. On average, each document is linked to about 10 features.
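The average can be checked directly from the numbers printed by `embedding.x`. A small arithmetic sketch, with the values copied by hand from the output above (the variable names are illustrative):

```python
# Values copied from the embedding.x output above.
stored_elements = 76_864_824
n_documents, n_features = 7_513_218, 235_264

# Each stored element is one (document, feature) link.
avg_features_per_document = stored_elements / n_documents
avg_documents_per_feature = stored_elements / n_features

print(f"{avg_features_per_document:.1f} features per document on average")
print(f"{avg_documents_per_feature:.0f} documents per feature on average")
```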

Now, we initialize the gismo object and customize the post-processors to ease the display.

[8]:
gismo = Gismo(corpus, embedding)
[9]:
def post_article(g, i):
    dic = g.corpus[i]
    authors = ", ".join(dic['authors'])
    return f"{dic['title']} By {authors} ({dic['venue']}, {dic['year']})"

gismo.post_documents_item = post_article

def post_title(g, i):
    return g.corpus[i]['title']

def post_meta(g, i):
    dic = g.corpus[i]
    authors = ", ".join(dic['authors'])
    return f"{authors} ({dic['venue']}, {dic['year']})"


gismo.post_documents_cluster = partial(post_documents_cluster_print, post_item=post_title)
gismo.post_features_cluster = post_features_cluster_print
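A post-processor is just a function of a gismo object and a document index. The sketch below tries `post_article` without loading DBLP, under the assumption that only the `corpus` attribute is accessed; `fake_gismo` is a hypothetical stand-in built with `SimpleNamespace`, not a real Gismo object.

```python
from types import SimpleNamespace


def post_article(g, i):
    dic = g.corpus[i]
    authors = ", ".join(dic["authors"])
    return f"{dic['title']} By {authors} ({dic['venue']}, {dic['year']})"


# Hypothetical stand-in: a plain list plays the role of the corpus.
fake_gismo = SimpleNamespace(corpus=[{
    "type": "inproceedings",
    "authors": ["Arnon Rosenthal"],
    "title": "The Future of Classic Data Administration: Objects + Databases + CASE",
    "venue": "SWEE",
    "year": "1998",
}])

rendered = post_article(fake_gismo, 0)
print(rendered)
```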

As the dataset is big, we lower the precision of the computation to speed up things a little bit.

[10]:
gismo.parameters.n_iter = 2

Machine Learning (and Covid-19) query#

We perform the query Machine Learning. The returned True indicates that some of the query features were found among the corpus’ features.

[11]:
gismo.rank("Machine Learning")
[11]:
True

What are the best articles on Machine Learning?

[12]:
gismo.get_documents_by_rank()
[12]:
['The Changing Landscape of Machine Learning: A Comparative Analysis of Centralized Machine Learning, Distributed Machine Learning and Federated Machine Learning. By Dishita Naik, Nitin Naik (UKCI, 2023)',
 'Resilient Machine Learning for Networked Cyber Physical Systems: A Survey for Machine Learning Security to Securing Machine Learning for CPS. By Felix O. Olowononi, Danda B. Rawat, Chunmei Liu (CoRR, 2021)',
 'Resilient Machine Learning for Networked Cyber Physical Systems: A Survey for Machine Learning Security to Securing Machine Learning for CPS. By Felix O. Olowononi, Danda B. Rawat, Chunmei Liu (IEEE Commun. Surv. Tutorials, 2021)',
 'The Machine Learning Machine: A Tangible User Interface for Teaching Machine Learning. By Magnus Høholt Kaspersen, Karl-Emil Kjær Bilstrup, Marianne Graves Petersen (TEI, 2021)',
 'A Machine Learning-oriented Survey on Tiny Machine Learning. By Luigi Capogrosso, Federico Cunico, Dong Seon Cheng, Franco Fummi, Marco Cristani (CoRR, 2023)',
 'A Machine Learning-Oriented Survey on Tiny Machine Learning. By Luigi Capogrosso, Federico Cunico, Dong Seon Cheng, Franco Fummi, Marco Cristani (IEEE Access, 2024)',
 'Quantum Machine Learning: An Interplay Between Quantum Computing and Machine Learning. By Jun Qi 0002, Chao-Han Yang, Samuel Yen-Chi Chen, Pin-Yu Chen (CoRR, 2024)',
 'PIMA Diabetes Prediction Using Machine Learning and Quantum Machine Learning Techniques. By Dixit Vimal (ITU K, 2024)',
 'A Machine Learning Tutorial for Operational Meteorology, Part I: Traditional Machine Learning. By Randy J. Chase, David R. Harrison, Amanda Burke, Gary M. Lackmann, Amy McGovern (CoRR, 2022)',
 'Breast Tumor Classification using Machine Learning: Breast Tumor Classification using Machine Learning. By Salman Siddiqui, Mohd Usman Mallick, Ankur Varshney (EAI Endorsed Trans. Context aware Syst. Appl., 2023)',
 'When Physics Meets Machine Learning: A Survey of Physics-Informed Machine Learning. By Chuizheng Meng, Sungyong Seo, Defu Cao, Sam Griesemer, Yan Liu 0002 (CoRR, 2022)',
 'The Application of Machine Learning Algorithms in Predicting the Usage of IoT-based Cleaning Dispensers: Machine Learning Algorithms in Predicting the Usage if IoT-based Dispensers. By Tobechi Obinwanne, Chibuzor Udokwu, Patrick Brandtner (ICEEG, 2023)',
 'The Idealized Machine Learning Pipeline (IMLP) for Advancing Reproducibility in Machine Learning. By Yantong Zheng, Victoria Stodden (ACM-REP, 2024)',
 'Quantum Machine Learning: A Hands-on Tutorial for Machine Learning Practitioners and Researchers. By Yuxuan Du, Xinbiao Wang, Naixu Guo, Zhan Yu, Yang Qian, Kaining Zhang, Min-Hsiu Hsieh, Patrick Rebentrost, Dacheng Tao (CoRR, 2025)',
 'Machine Learning for Health ( ML4H ) 2019 : What Makes Machine Learning in Medicine Different? By Adrian V. Dalca, Matthew B. A. McDermott, Emily Alsentzer, Samuel G. Finlayson, Michael Oberst, Fabian Falck, Corey Chivers, Andrew Beam, Tristan Naumann, Brett K. Beaulieu-Jones (ML4H@NeurIPS, 2019)',
 'Resilient Machine Learning (rML) Ensemble Against Adversarial Machine Learning Attacks. By Likai Yao, Cihan Tunc, Pratik Satam, Salim Hariri (DDDAS, 2020)',
 'Predicting Machine Learning Pipeline Runtimes in the Context of Automated Machine Learning. By Felix Mohr, Marcel Wever, Alexander Tornede, Eyke Hüllermeier (IEEE Trans. Pattern Anal. Mach. Intell., 2021)',
 'Poisoning Attacks Against Machine Learning: Can Machine Learning Be Trustworthy? By Alina Oprea, Anoop Singhal, Apostol Vassilev 0001 (Computer, 2022)',
 'Machine Learning for Testing Machine-Learning Hardware: A Virtuous Cycle. By Arjun Chaudhuri, Jonti Talukdar, Krishnendu Chakrabarty (ICCAD, 2022)',
 'Critical Tools for Machine Learning: Situating, Figuring, Diffracting, Fabulating Machine Learning Systems Design. By Goda Klumbyte, Claude Draude, Alex S. Taylor (CHItaly, 2021)',
 'Adversarial Machine Learning: Difficulties in Applying Machine Learning to Existing Cybersecurity Systems. By Nick Rahimi, Jordan Maynor, Bidyut Gupta (CATA, 2020)',
 'Machine Learning Unplugged - Development and Evaluation of a Workshop About Machine Learning. By Elisaweta Ossovski, Michael Brinkmeier (ISSEP, 2019)',
 'Physics Informed Machine Learning of SPH: Machine Learning Lagrangian Turbulence. By Michael Woodward, Yifeng Tian, Criston Hyett, Chris Fryer, Daniel Livescu, Mikhail G. Stepanov, Michael Chertkov (CoRR, 2021)',
 'Hacking Machine Learning: Towards The Comprehensive Taxonomy of Attacks Against Machine Learning Systems. By Jerzy Surma (ICIAI, 2020)',
 'Machine Learning Lineage for Trustworthy Machine Learning Systems: Information Framework for MLOps Pipelines. By Mikko Raatikainen, Charalampos Souris, Jukka J. Remes, Vlad Stirbu (IEEE Softw., 2025)',
 'Informed Machine Learning - Towards a Taxonomy of Explicit Integration of Knowledge into Machine Learning. By Laura von Rüden, Sebastian Mayer, Jochen Garcke, Christian Bauckhage, Jannis Schücker (CoRR, 2019)',
 'Deep Fast Machine Learning Utils: A Python Library for Streamlined Machine Learning Prototyping. By Fabi Prezja (CoRR, 2024)',
 'How Developers Iterate on Machine Learning Workflows - A Survey of the Applied Machine Learning Literature. By Doris Xin, Litian Ma, Shuchen Song, Aditya G. Parameswaran (CoRR, 2018)',
 'Machine Learning in Antenna Design: An Overview on Machine Learning Concept and Algorithms. By Hilal M. El Misilmani, Tarek Naous (HPCS, 2019)',
 'Can Machine Learning Model with Static Features be Fooled: an Adversarial Machine Learning Approach. By Rahim Taheri, Reza Javidan, Mohammad Shojafar, Vinod P 0001, Mauro Conti (CoRR, 2019)',
 'Machine Learning for All! - Introducing Machine Learning in Middle and High School. By Ramon Mayor Martins, Christiane Gresse von Wangenheim, Marcelo Fernando Rauber, Jean Carlo Hauck (Int. J. Artif. Intell. Educ., 2024)',
 'Ensemble Machine Learning Methods to Predict the Balancing of Ayurvedic Constituents in the Human Body Ensemble Machine Learning Methods to Predict. By Vani Rajasekar, Sathya Krishnamoorthi, Muzafer Saracevic, Dzenis Pepic, Mahir Zajmovic, Haris Zogic (Comput. Sci., 2022)',
 'Can machine learning model with static features be fooled: an adversarial machine learning approach. By Rahim Taheri, Reza Javidan, Mohammad Shojafar, P. Vinod 0001, Mauro Conti (Clust. Comput., 2020)',
 'Contrasting Explain-ML with Interpretability Machine Learning Tools in Light of Interactive Machine Learning Principles. By Bárbara Gabrielle C. O. Lopes, Liziane Santos Soares, Raquel Oliveira Prates, Marcos André Gonçalves (J. Interact. Syst., 2022)',
 'A Comparative Analysis Between Quantum Machine Learning and Machine Learning on EEG Dataset. By Leopoldo Angrisani, Egidio De Benedetto, Alessandro Di Bernardo, Roberto Prevete, Annarita Tedesco (MetroXRAINE, 2024)',
 'Securing Machine Learning in the Cloud: A Systematic Review of Cloud Machine Learning Security. By Adnan Qayyum, Aneeqa Ijaz, Muhammad Usama, Waleed Iqbal, Junaid Qadir 0001, Yehia Elkhatib, Ala I. Al-Fuqaha (Frontiers Big Data, 2020)',
 'Verbalized Machine Learning: Revisiting Machine Learning with Language Models. By Tim Z. Xiao, Robert Bamler, Bernhard Schölkopf, Weiyang Liu (CoRR, 2024)',
 'Detecting Refactoring Commits in Machine Learning Python Projects: A Machine Learning-Based Approach. By Shayan Noei, Heng Li 0007, Ying Zou 0001 (CoRR, 2024)',
 'Arabic Offensive Language Detection Using Machine Learning and Ensemble Machine Learning Approaches. By Fatemah Husain (CoRR, 2020)',
 'Declarative Machine Learning Systems: The future of machine learning will depend on it being in the hands of the rest of us. By Piero Molino, Christopher Ré (ACM Queue, 2021)']

OK, this seems to go everywhere. Maybe we can narrow the results with a more specific request.

[13]:
gismo.rank("Machine Learning and covid-19")
[13]:
True
[14]:
gismo.get_documents_by_rank()
[14]:
['Ergonomics of Virtual Learning During COVID-19. By Lu Yuan, Alison Garaudy (AHFE (11), 2021)',
 'University Virtual Learning in Covid Times. By Verónica Marín-Díaz, Eloísa Reche, Javier Martín (Technol. Knowl. Learn., 2022)',
 'Mobile Learning for COVID-19 Prevention. By Zhiyi Wang (EAI Endorsed Trans. e Learn., 2024)',
 'Design Issues in e-Learning during the COVID-19 Pandemic. By Alexandra Hosszu, Cosima Rughinis (CSCS, 2021)',
 'Campus Traffic and e-Learning during COVID-19 Pandemic. By Thomas Favale, Francesca Soro, Martino Trevisan, Idilio Drago, Marco Mellia (CoRR, 2020)',
 'Campus traffic and e-Learning during COVID-19 pandemic. By Thomas Favale, Francesca Soro, Martino Trevisan, Idilio Drago, Marco Mellia (Comput. Networks, 2020)',
 'DCML: Deep contrastive mutual learning for COVID-19 recognition. By Hongbin Zhang 0004, Weinan Liang, Chuanxiu Li, Qipeng Xiong, Haowei Shi, Lang Hu, Guangli Li (Biomed. Signal Process. Control., 2022)',
 'The Deaf Experience in Remote Learning during COVID-19. By Yosra Bouzid, Mohamed Jemni (ICTA, 2021)',
 'Interpretable Sequence Learning for Covid-19 Forecasting. By Sercan Ömer Arik, Chun-Liang Li, Jinsung Yoon, Rajarishi Sinha, Arkady Epshteyn, Long T. Le, Vikas Menon, Shashank Singh 0005, Leyou Zhang, Martin Nikoltchev, Yash Sonthalia, Hootan Nakhost, Elli Kanal, Tomas Pfister (NeurIPS, 2020)',
 'Interpretable Sequence Learning for COVID-19 Forecasting. By Sercan Ömer Arik, Chun-Liang Li, Jinsung Yoon, Rajarishi Sinha, Arkady Epshteyn, Long T. Le, Vikas Menon, Shashank Singh 0005, Leyou Zhang, Nate Yoder, Martin Nikoltchev, Yash Sonthalia, Hootan Nakhost, Elli Kanal, Tomas Pfister (CoRR, 2020)',
 'DeCoP: Deep Learning for COVID-19 Prediction of Survival. By Yao Deng, Shigang Liu, Alireza Jolfaei, Hongbing Cheng, Ziyuan Wang, Xi Zheng 0001 (IEEE Trans. Mol. Biol. Multi Scale Commun., 2022)',
 'Differences Between Neurodivergent and Neurotypical Learning During Covid-19: Towards the E-Tivities Satisfaction Scale. By Jonathan Bishop, Kamal Bechkoum (CSCI, 2022)',
 'D-Learning and COVID-19 Crisis: Appraisal of Reactions and Means of Perpetuity. By Jalal Ismaili, Karim El Moutaouakil (SN Comput. Sci., 2023)',
 'Engineering Experiential Learning During the COVID-19 Pandemic. By Nael Barakat, Aws AlShalash, Mohammad Biswas, Shih-Feng Chou, Tahsin Khajah (ICL (2), 2021)',
 'A Survey on Deep Learning in COVID-19 Diagnosis. By Xue Han, Zuojin Hu, Shuihua Wang, Yudong Zhang 0001 (J. Imaging, 2023)',
 'The Study on the Efficiency of Smart Learning in the COVID-19. By Seong-Kyu Kim, Mi-Jung Lee, Eun-Sill Jang, Young-Eun Lee (J. Multim. Inf. Syst., 2022)',
 'Dual Teaching: Simultaneous Remote and In-Person Learning During COVID. By Hunter M. Williams, Malcolm Gibran Haynes, Joseph Kim (SIGITE, 2021)',
 'An Analysis of the Effectiveness of Emergency Distance Learning under COVID-19. By Ngo Tung Son, Bui Ngoc Anh, Kieu Quoc Tuan, Son Ba Nguyen, Son Hoang Nguyen, Jafreezal Jaafar (CCRIS, 2020)',
 'M-learning During COVID-19: A Systematic Literature Review. By Esmhan Jafer, Hossana Twinomurinzi (AFRICATEK, 2022)',
 'Automated Machine Learning for COVID-19 Forecasting. By Jaco Tetteroo, Mitra Baratchi, Holger H. Hoos (IEEE Access, 2022)',
 'A Data Augmented Approach to Transfer Learning for Covid-19 Detection. By Shagufta Henna, Aparna Reji (CoRR, 2021)',
 'A Closer Look at Spatial-Slice Features Learning for COVID-19 Detection. By Chih-Chung Hsu, Chia-Ming Lee, Yang Fan Chiang, Yi-Shiuan Chou, Chih-Yu Jiang, Shen-Chieh Tai, Chi-Han Tsai (CVPR Workshops, 2024)',
 'A Closer Look at Spatial-Slice Features Learning for COVID-19 Detection. By Chih-Chung Hsu, Chia-Ming Lee, Yang Fan Chiang, Yi-Shiuan Chou, Chih-Yu Jiang, Shen-Chieh Tai, Chi-Han Tsai (CoRR, 2024)',
 'Academic Procrastination and Online Learning During the COVID-19 Pandemic. By Jørgen Melgaard, Rubina Monir, Lester Allan Lasrado, Asle Fagerstrøm (CENTERIS/ProjMAN/HCist, 2021)',
 'A comprehensive review of federated learning for COVID-19 detection. By Sadaf Naz, Khoa Tran Phan, Yi-Ping Phoebe Chen (Int. J. Intell. Syst., 2022)',
 'Challenges of Online Learning During the COVID-19: What Can We Learn on Twitter? By Wei Quan (ARTIIS, 2021)',
 'M-learning in the COVID-19 era: physical vs digital class. By Vasiliki Matzavela, Efthimios Alepis (Educ. Inf. Technol., 2021)',
 'Federated Learning for COVID-19 on Heterogeneous CXR Images with Noise. By Mengqing Ding, Juan Li 0011, Changyan Yi, Jun Cai 0001 (ICC, 2023)',
 'Curriculum Contrastive Learning for COVID-19 FAQ Retrieval. By Leilei Zhang, Junfei Liu (BIBM, 2022)',
 'Design Teaching and Learning in Covid-19 Times: An International Experience. By Paulo Ferreira, Filipa Oliveira Antunes, Haroldo Gallo, Marcos Tognon, Heloisa Mendes Pereira (TECH-EDU, 2020)',
 'Educational Transformation: An Evaluation of Online Learning Due To COVID-19. By Rizky Firmansyah, Dhika Maha Putri, Mochammad Galih Satriyo Wicaksono, Sheila Febriani Putri, Ahmad Arif Widianto, Mohd Rizal Palil (Int. J. Emerg. Technol. Learn., 2021)',
 'Online Learning Before, During and After COVID-19: Observations Over 20 Years. By Natalie Wieland, Liz Kollias (Int. J. Adv. Corp. Learn., 2020)',
 'Using Deep Learning for COVID-19 Control: Implementing a Convolutional Neural Network in a Facemask Detection Application. By Caolan Deery, Kevin Meehan (SmartNets, 2021)',
 'LitCovid ensemble learning for COVID-19 multi-label classification. By Jinghang Gu, Emmanuele Chersoni, Xing Wang, Chu-Ren Huang, Longhua Qian, Guodong Zhou (Database J. Biol. Databases Curation, 2022)',
 'A Survey on Deep Learning and Machine Learning for COVID-19 Detection. By Mohamed M. Dessouky, Sahar F. Sabbeh, Boushra Alshehri (ICFNDS, 2021)']

Sounds nice. How are the top 10 articles related? Note: as the graph structure is really sparse on the document side (about 10 features per document), it is best to deactivate query distortion, which is intended for longer documents.

[15]:
gismo.parameters.distortion = 0.0
gismo.get_documents_by_cluster(k=10)
 F: 0.48. R: 0.00. S: 0.79.
- F: 0.48. R: 0.00. S: 0.79.
-- F: 0.48. R: 0.00. S: 0.77.
--- F: 0.66. R: 0.00. S: 0.52.
---- Ergonomics of Virtual Learning During COVID-19. (R: 0.00; S: 0.62)
---- University Virtual Learning in Covid Times. (R: 0.00; S: 0.34)
--- F: 0.73. R: 0.00. S: 0.74.
---- Design Issues in e-Learning during the COVID-19 Pandemic. (R: 0.00; S: 0.66)
---- F: 1.00. R: 0.00. S: 0.71.
----- Campus Traffic and e-Learning during COVID-19 Pandemic. (R: 0.00; S: 0.71)
----- Campus traffic and e-Learning during COVID-19 pandemic. (R: 0.00; S: 0.71)
--- DCML: Deep contrastive mutual learning for COVID-19 recognition. (R: 0.00; S: 0.60)
-- Mobile Learning for COVID-19 Prevention. (R: 0.00; S: 0.55)
-- F: 1.00. R: 0.00. S: 0.53.
--- Interpretable Sequence Learning for Covid-19 Forecasting. (R: 0.00; S: 0.53)
--- Interpretable Sequence Learning for COVID-19 Forecasting. (R: 0.00; S: 0.53)
- The Deaf Experience in Remote Learning during COVID-19. (R: 0.00; S: 0.50)
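The printed tree can be read back into a flat structure if needed. The sketch below relies only on two things visible in the output above: the number of leading dashes gives the depth, and lines for articles end with an `(R: …; S: …)` suffix. The sample is the last five lines of the printout, copied verbatim; the regular expression and variable names are illustrative assumptions.

```python
import re

# Last five lines of the printout above, copied verbatim.
printout = """\
-- Mobile Learning for COVID-19 Prevention. (R: 0.00; S: 0.55)
-- F: 1.00. R: 0.00. S: 0.53.
--- Interpretable Sequence Learning for Covid-19 Forecasting. (R: 0.00; S: 0.53)
--- Interpretable Sequence Learning for COVID-19 Forecasting. (R: 0.00; S: 0.53)
- The Deaf Experience in Remote Learning during COVID-19. (R: 0.00; S: 0.50)
"""

# Leading dashes, then the label, then the "(R: ...; S: ...)" suffix.
LEAF = re.compile(r"^(-*) ?(.*) \(R: ([0-9.]+); S: ([0-9.]+)\)$")

leaves = []
for line in printout.splitlines():
    match = LEAF.match(line)
    if match:  # lines for internal nodes ("F: ... R: ... S: ...") do not match
        dashes, label, r_value, s_value = match.groups()
        leaves.append((len(dashes), label, float(s_value)))

for depth, label, s_value in leaves:
    print(depth, s_value, label)
```

Articles that share a parent line, such as the two versions of the same paper at depth 3, belong to the same cluster.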

Now, let’s look at the main keywords.

[16]:
gismo.get_features_by_rank(20)
[16]:
['covid',
 '19',
 'covid 19',
 'learning covid',
 'machine',
 'pandemic',
 'machine learning',
 '19 pandemic',
 'online learning',
 'online',
 '19 detection',
 'chest',
 'student',
 'deep learning',
 'ray',
 'chest ray',
 'classification',
 'prediction',
 'case',
 'ct']

Let’s organize them.

[17]:
# On the feature side, the graph is more dense so we can use query distortion
gismo.get_features_by_cluster(distortion=1)
 F: 0.25. R: 0.06. S: 0.97.
- F: 0.47. R: 0.05. S: 0.97.
-- F: 0.58. R: 0.05. S: 0.97.
--- F: 0.96. R: 0.04. S: 0.97.
---- covid (R: 0.01; S: 1.00)
---- 19 (R: 0.01; S: 0.99)
---- covid 19 (R: 0.01; S: 0.99)
---- learning covid (R: 0.01; S: 0.97)
--- F: 0.98. R: 0.01. S: 0.58.
---- pandemic (R: 0.00; S: 0.58)
---- 19 pandemic (R: 0.00; S: 0.57)
-- F: 0.94. R: 0.00. S: 0.45.
--- online learning (R: 0.00; S: 0.45)
--- online (R: 0.00; S: 0.48)
- F: 0.99. R: 0.01. S: 0.27.
-- machine (R: 0.00; S: 0.27)
-- machine learning (R: 0.00; S: 0.27)

Rough, very broad analysis:

  • One big keyword cluster about Coronavirus / Covid-19, pandemic, online learning;

  • Machine Learning as a separate small cluster.

[18]:
np.dot(gismo.embedding.query_projection("Machine learning")[0], gismo.embedding.y)
[18]:
<Compressed Sparse Row sparse matrix of dtype 'float64'
        with 121940 stored elements and shape (1, 7513218)>

About 122,000 articles have an explicit link to machine learning.

[19]:
np.dot(gismo.embedding.query_projection("Covid-19")[0], gismo.embedding.y)
[19]:
<Compressed Sparse Row sparse matrix of dtype 'float64'
        with 20024 stored elements and shape (1, 7513218)>

About 20,000 articles have an explicit link to covid-19.
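The two cells above count the stored elements of a row with one column per document. The idea behind that count can be sketched in pure Python on a toy index, assuming that a document gets a non-zero entry exactly when it contains at least one feature of the query (which holds when all weights are positive). The index, the data and the function name `linked_documents` are hypothetical.

```python
# Toy inverted index: feature -> ids of the documents that contain it.
index = {
    "machine": {0, 1, 4},
    "learning": {0, 1, 2},
    "machine learning": {0, 1},
    "covid": {2, 3},
}


def linked_documents(query_features, index):
    """Documents that share at least one feature with the query."""
    docs = set()
    for feature in query_features:
        docs |= index.get(feature, set())
    return docs


n_linked = len(linked_documents(["machine", "learning", "machine learning"], index))
print(n_linked)
```

Note that a document counts as linked even if it contains only one of the query features, so the count is larger than the number of documents containing the exact bi-gram.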

Authors query#

Instead of looking at words, we can explore authors and their collaborations.

We just have to rewire the corpus to output a string of authors.

[20]:
def to_authors_text(dic):
    return " ".join([a.replace(' ', '_') for a in dic['authors']])
corpus = Corpus(source, to_text=to_authors_text)
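Here is what `to_authors_text` produces on one record, as a plain Python sketch that does not need Gismo. The author list is taken from one of the articles shown later in this tutorial; the record itself is reduced to the only field the function reads.

```python
def to_authors_text(dic):
    return " ".join([a.replace(' ', '_') for a in dic['authors']])


# Minimal record: only the 'authors' field is used by to_authors_text.
record = {"authors": ["Lennart Gulikers", "Marc Lelarge", "Laurent Massoulié"]}

text = to_authors_text(record)
tokens = text.split(" ")
print(text)
print(tokens)
```

Replacing the inner spaces with underscores is what lets a simple split on spaces recover exactly one token per author.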

We can build a new embedding on top of this modified corpus. We tell the vectorizer to be stupid: don’t preprocess, and words are separated by spaces.

This will take a few minutes (you can save the embedding for later if you want).

[21]:
vectorizer = CountVectorizer(dtype=float,
                            preprocessor=lambda x:x, tokenizer=lambda x: x.split(' '))
# Reuse a previously saved embedding if it can be loaded; otherwise build and save it.
try:
    a_embedding = Embedding.load(filename="dblp_aut_embedding", path=path)
except Exception:
    a_embedding = Embedding(vectorizer=vectorizer)
    a_embedding.fit_transform(corpus)
    a_embedding.dump(filename="dblp_aut_embedding", path=path)
[22]:
a_embedding.x
[22]:
<Compressed Sparse Row sparse matrix of dtype 'float64'
        with 25662979 stored elements and shape (7513218, 3820206)>

We now have about 3,800,000 authors to explore. Let’s reload gismo and try to play.

[23]:
gismo = Gismo(corpus, a_embedding)
gismo.post_documents_item = post_article
gismo.post_features_item = lambda g, i: g.embedding.features[i].replace("_", " ")
[24]:
gismo.post_documents_cluster = partial(post_documents_cluster_print, post_item=post_meta)
gismo.post_features_cluster = post_features_cluster_print

Laurent Massoulié query#

[25]:
gismo.rank("Laurent_Massoulié")
[25]:
True

What are the most central articles of Laurent Massoulié in terms of collaboration?

[26]:
gismo.get_documents_by_rank(k=10)
[26]:
['Aligning Embeddings and Geometric Random Graphs: Informational Results and Computational Approaches for the Procrustes-Wasserstein Problem. By Mathieu Even, Luca Ganassali, Jakob Maier, Laurent Massoulié (NeurIPS, 2024)',
 'Aligning Embeddings and Geometric Random Graphs: Informational Results and Computational Approaches for the Procrustes-Wasserstein Problem. By Mathieu Even, Luca Ganassali, Jakob Maier, Laurent Massoulié (CoRR, 2024)',
 'Scalable Local Area Service Discovery. By Richard Black, Heimir Sverrisson, Laurent Massoulié (ICC, 2007)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (ITCS, 2017)',
 'An Impossibility Result for Reconstruction in a Degree-Corrected Planted-Partition Model. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015)',
 'A spectral method for community detection in moderately-sparse degree-corrected stochastic block models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2016)',
 'Optimal Content Placement for Peer-to-Peer Video-on-Demand Systems. By Bo Tan 0002, Laurent Massoulié (IEEE/ACM Trans. Netw., 2013)',
 'Optimal content placement for peer-to-peer video-on-demand systems. By Bo Tan 0002, Laurent Massoulié (INFOCOM, 2011)',
 'Brief announcement: adaptive content placement for peer-to-peer video-on-demand systems. By Bo Tan 0002, Laurent Massoulié (PODC, 2010)']

We see lots of duplicates. This is not surprising, as many articles are published first as a research report, then as a conference paper, and finally as a journal article. Luckily, Gismo can cover for you.

[27]:
gismo.get_documents_by_coverage(k=10)
[27]:
['Aligning Embeddings and Geometric Random Graphs: Informational Results and Computational Approaches for the Procrustes-Wasserstein Problem. By Mathieu Even, Luca Ganassali, Jakob Maier, Laurent Massoulié (NeurIPS, 2024)',
 'Aligning Embeddings and Geometric Random Graphs: Informational Results and Computational Approaches for the Procrustes-Wasserstein Problem. By Mathieu Even, Luca Ganassali, Jakob Maier, Laurent Massoulié (CoRR, 2024)',
 'Scalable Local Area Service Discovery. By Richard Black, Heimir Sverrisson, Laurent Massoulié (ICC, 2007)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (ITCS, 2017)',
 'An Impossibility Result for Reconstruction in a Degree-Corrected Planted-Partition Model. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015)',
 'A spectral method for community detection in moderately-sparse degree-corrected stochastic block models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2016)',
 'Optimal Content Placement for Peer-to-Peer Video-on-Demand Systems. By Bo Tan 0002, Laurent Massoulié (IEEE/ACM Trans. Netw., 2013)',
 'Optimal content placement for peer-to-peer video-on-demand systems. By Bo Tan 0002, Laurent Massoulié (INFOCOM, 2011)',
 'Brief announcement: adaptive content placement for peer-to-peer video-on-demand systems. By Bo Tan 0002, Laurent Massoulié (PODC, 2010)']

Hmm, this is not working well. The reason here is query distortion. Query distortion is a gismo feature that modulates the clustering with the query. Sadly, when features are authors, the underlying graph has a very specific structure (highly sparse and redundant) that makes query distortion too effective. The solution is to deactivate it.

[28]:
gismo.parameters.distortion = 0
gismo.get_documents_by_coverage(k=10)
[28]:
['Aligning Embeddings and Geometric Random Graphs: Informational Results and Computational Approaches for the Procrustes-Wasserstein Problem. By Mathieu Even, Luca Ganassali, Jakob Maier, Laurent Massoulié (NeurIPS, 2024)',
 'Scalable Local Area Service Discovery. By Richard Black, Heimir Sverrisson, Laurent Massoulié (ICC, 2007)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (ITCS, 2017)',
 'Optimal Content Placement for Peer-to-Peer Video-on-Demand Systems. By Bo Tan 0002, Laurent Massoulié (IEEE/ACM Trans. Netw., 2013)',
 'Adaptive Matching for Expert Systems with Uncertain Task Types. By Virag Shah, Lennart Gulikers, Laurent Massoulié, Milan Vojnovic (Oper. Res., 2020)',
 'Asynchrony and Acceleration in Gossip Algorithms. By Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2020)',
 'Asymmetric tree correlation testing for graph alignment. By Jakob Maier, Laurent Massoulié (ITW, 2023)',
 'From tree matching to sparse graph alignment. By Luca Ganassali, Laurent Massoulié (CoRR, 2020)',
 'Concentration of Non-Isotropic Random Tensors with Applications to Learning and Empirical Risk Minimization. By Mathieu Even, Laurent Massoulié (COLT, 2021)',
 'Aligning Embeddings and Geometric Random Graphs: Informational Results and Computational Approaches for the Procrustes-Wasserstein Problem. By Mathieu Even, Luca Ganassali, Jakob Maier, Laurent Massoulié (CoRR, 2024)']

Much better. Far fewer duplicates and more diversity in the results. Let’s observe the communities.

[29]:
gismo.get_documents_by_cluster(k=20, resolution=.9)
 F: 0.38. R: 0.06. S: 0.87.
- F: 0.42. R: 0.06. S: 0.87.
-- F: 0.43. R: 0.04. S: 0.83.
--- F: 0.50. R: 0.03. S: 0.77.
---- F: 0.59. R: 0.02. S: 0.68.
----- F: 0.70. R: 0.01. S: 0.58.
------ F: 1.00. R: 0.01. S: 0.49.
------- Mathieu Even, Luca Ganassali, Jakob Maier, Laurent Massoulié (NeurIPS, 2024) (R: 0.00; S: 0.49)
------- Mathieu Even, Luca Ganassali, Jakob Maier, Laurent Massoulié (CoRR, 2024) (R: 0.00; S: 0.49)
------ Jakob Maier, Laurent Massoulié (ITW, 2023) (R: 0.00; S: 0.59)
----- F: 1.00. R: 0.01. S: 0.65.
------ Luca Ganassali, Laurent Massoulié (CoRR, 2020) (R: 0.00; S: 0.65)
------ Luca Ganassali, Laurent Massoulié (COLT, 2020) (R: 0.00; S: 0.65)
---- F: 0.80. R: 0.01. S: 0.65.
----- F: 1.00. R: 0.01. S: 0.60.
------ Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2020) (R: 0.00; S: 0.60)
------ Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (IEEE Trans. Autom. Control., 2025) (R: 0.00; S: 0.60)
------ Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2021) (R: 0.00; S: 0.60)
----- Mathieu Even, Laurent Massoulié (COLT, 2021) (R: 0.00; S: 0.69)
--- F: 1.00. R: 0.01. S: 0.60.
---- Bo Tan 0002, Laurent Massoulié (IEEE/ACM Trans. Netw., 2013) (R: 0.00; S: 0.60)
---- Bo Tan 0002, Laurent Massoulié (INFOCOM, 2011) (R: 0.00; S: 0.60)
---- Bo Tan 0002, Laurent Massoulié (PODC, 2010) (R: 0.00; S: 0.60)
-- F: 0.61. R: 0.02. S: 0.59.
--- F: 1.00. R: 0.01. S: 0.58.
---- Lennart Gulikers, Marc Lelarge, Laurent Massoulié (ITCS, 2017) (R: 0.00; S: 0.58)
---- Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015) (R: 0.00; S: 0.58)
---- Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015) (R: 0.00; S: 0.58)
---- Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2016) (R: 0.00; S: 0.58)
--- F: 1.00. R: 0.01. S: 0.47.
---- Virag Shah, Lennart Gulikers, Laurent Massoulié, Milan Vojnovic (Oper. Res., 2020) (R: 0.00; S: 0.47)
---- Virag Shah, Lennart Gulikers, Laurent Massoulié, Milan Vojnovic (Allerton, 2017) (R: 0.00; S: 0.47)
---- Virag Shah, Lennart Gulikers, Laurent Massoulié, Milan Vojnovic (CoRR, 2017) (R: 0.00; S: 0.47)
- Richard Black, Heimir Sverrisson, Laurent Massoulié (ICC, 2007) (R: 0.00; S: 0.46)

OK! We see that the articles are organized by writing communities. Also note how Gismo managed to organize a hierarchical grouping of the communities.

Now, let’s look in terms of authors. This is actually the interesting part when studying collaborations.

[30]:
gismo.get_features_by_rank()
[30]:
['Laurent Massoulié',
 'Marc Lelarge',
 'Mathieu Even',
 'Hadrien Hendrikx',
 'Peter B. Key',
 'Stratis Ioannidis',
 'Nidhi Hegde 0001',
 'Francis R. Bach',
 'Luca Ganassali',
 'Anne-Marie Kermarrec',
 'Romain Cosson',
 'Ayalvadi J. Ganesh',
 'Lennart Gulikers',
 'Milan Vojnovic',
 'Dan-Cristian Tomozei',
 'Laurent Viennot',
 'Kevin Scaman',
 'Amin Karbasi',
 'Augustin Chaintreau',
 'Mathieu Leconte']

We see many authors that were not present in the articles listed above. This is an important observation: central articles (with respect to a query) are not necessarily written by central authors!

Let’s organize them into communities.

[31]:
gismo.get_features_by_cluster(resolution=.6)
 F: 0.01. R: 0.21. S: 0.55.
- F: 0.01. R: 0.20. S: 0.56.
-- F: 0.01. R: 0.20. S: 0.55.
--- F: 0.03. R: 0.17. S: 0.52.
---- F: 0.05. R: 0.15. S: 0.50.
----- F: 0.09. R: 0.15. S: 0.50.
------ F: 0.23. R: 0.12. S: 0.52.
------- F: 0.26. R: 0.11. S: 0.57.
-------- Laurent Massoulié (R: 0.10; S: 1.00)
-------- Mathieu Even (R: 0.01; S: 0.28)
------- Hadrien Hendrikx (R: 0.01; S: 0.21)
------ F: 0.13. R: 0.02. S: 0.28.
------- Marc Lelarge (R: 0.01; S: 0.15)
------- Luca Ganassali (R: 0.01; S: 0.18)
------- Lennart Gulikers (R: 0.00; S: 0.20)
------ Francis R. Bach (R: 0.01; S: 0.06)
------ Milan Vojnovic (R: 0.00; S: 0.05)
----- Kevin Scaman (R: 0.00; S: 0.08)
---- F: 0.07. R: 0.02. S: 0.18.
----- F: 0.11. R: 0.01. S: 0.17.
------ Peter B. Key (R: 0.01; S: 0.13)
------ Ayalvadi J. Ganesh (R: 0.01; S: 0.13)
----- Anne-Marie Kermarrec (R: 0.01; S: 0.07)
--- F: 0.01. R: 0.02. S: 0.17.
---- F: 0.06. R: 0.01. S: 0.09.
----- Stratis Ioannidis (R: 0.01; S: 0.09)
----- Augustin Chaintreau (R: 0.00; S: 0.05)
---- Nidhi Hegde 0001 (R: 0.01; S: 0.15)
---- Amin Karbasi (R: 0.00; S: 0.03)
--- Mathieu Leconte (R: 0.00; S: 0.08)
-- F: 0.07. R: 0.01. S: 0.13.
--- Romain Cosson (R: 0.01; S: 0.12)
--- Laurent Viennot (R: 0.00; S: 0.05)
- Dan-Cristian Tomozei (R: 0.00; S: 0.07)

Jim Roberts query#

[32]:
gismo.rank("James_W._Roberts")
[32]:
True

Let’s get a covering set of articles.

[33]:
gismo.get_documents_by_coverage(k=10)
[33]:
['Integrated Admission Control for Streaming and Elastic Traffic. By Nabil Benameur, Slim Ben Fredj, Frank Delcoigne, Sara Oueslati-Boulahia, James W. Roberts (QofIS, 2001)',
 'Statistical bandwidth sharing: a study of congestion at flow level. By Slim Ben Fredj, Thomas Bonald, Alexandre Proutière, G. Régnié, James W. Roberts (SIGCOMM, 2001)',
 'An In-Camera Data Stream Processing System for Defect Detection in Web Inspection Tasks. By S. Hossain Hajimowlana, Roberto Muscedere, Graham A. Jullien, James W. Roberts (Real Time Imaging, 1999)',
 "Modifications of Thomae's Function and Differentiability. By Kevin Beanland, James W. Roberts, Craig Stevenson (Am. Math. Mon., 2009)",
 'A Traffic Control Framework for High Speed Data Transmission. By James W. Roberts, Brahim Bensaou, Y. Canetti (Modelling and Evaluation of ATM Networks, 1993)',
 'A CMOS model for computer-aided circuit analysis and design. By James W. Roberts, Savvas G. Chamberlain (IEEE J. Solid State Circuits, 1989)',
 'Burstiness Bounds Based Muliplexing Schemes for VBR Video Connections in the B-ISDN. By Maher Hamdi, James W. Roberts (International Zurich Seminar on Digital Communications, 1996)',
 'Impact of "Trunk Reservation" on Elastic Flow Routing. By Sara Oueslati-Boulahia, James W. Roberts (NETWORKING, 2000)',
 'Swing: Traffic capacity of a simple WDM ring network. By Thomas Bonald, Sara Oueslati, James W. Roberts, Charlotte Roger (ITC, 2009)',
 'Congestion at flow level and the impact of user behaviour. By Thomas Bonald, James W. Roberts (Comput. Networks, 2003)']

Who are the associated authors?

[34]:
gismo.get_features_by_rank(k=10)
[34]:
['James W. Roberts',
 'Thomas Bonald',
 'Maher Hamdi',
 'Sara Oueslati',
 'Sara Oueslati-Boulahia',
 'Ali Ibrahim',
 'Alexandre Proutière',
 'Jorma T. Virtamo',
 'Slim Ben Fredj',
 'Jussi Kangasharju']

Let’s organize them.

[35]:
gismo.get_features_by_cluster(k=10, resolution=.4)
 F: 0.01. R: 0.24. S: 0.51.
- F: 0.01. R: 0.22. S: 0.50.
-- F: 0.11. R: 0.20. S: 0.49.
--- F: 0.18. R: 0.19. S: 0.60.
---- James W. Roberts (R: 0.14; S: 1.00)
---- Thomas Bonald (R: 0.05; S: 0.22)
---- Sara Oueslati (R: 0.01; S: 0.20)
--- F: 0.57. R: 0.01. S: 0.31.
---- Sara Oueslati-Boulahia (R: 0.01; S: 0.26)
---- Slim Ben Fredj (R: 0.01; S: 0.29)
-- Alexandre Proutière (R: 0.01; S: 0.03)
-- Jorma T. Virtamo (R: 0.01; S: 0.04)
-- Jussi Kangasharju (R: 0.00; S: 0.03)
- Maher Hamdi (R: 0.01; S: 0.11)
- Ali Ibrahim (R: 0.01; S: 0.06)

Combined queries#

We can input multiple authors.

[36]:
gismo.rank("Laurent_Massoulié and James_W._Roberts")
[36]:
True

Let’s get a covering set of articles.

[37]:
gismo.get_documents_by_coverage(k=10)
[37]:
['Integrated Admission Control for Streaming and Elastic Traffic. By Nabil Benameur, Slim Ben Fredj, Frank Delcoigne, Sara Oueslati-Boulahia, James W. Roberts (QofIS, 2001)',
 'Statistical bandwidth sharing: a study of congestion at flow level. By Slim Ben Fredj, Thomas Bonald, Alexandre Proutière, G. Régnié, James W. Roberts (SIGCOMM, 2001)',
 'Defect detection in web inspection using fuzzy fusion of texture features. By S. Hossain Hajimowlana, Roberto Muscedere, Graham A. Jullien, James W. Roberts (ISCAS, 2000)',
 "Modifications of Thomae's Function and Differentiability. By Kevin Beanland, James W. Roberts, Craig Stevenson (Am. Math. Mon., 2009)",
 'A Traffic Control Framework for High Speed Data Transmission. By James W. Roberts, Brahim Bensaou, Y. Canetti (Modelling and Evaluation of ATM Networks, 1993)',
 'A CMOS model for computer-aided circuit analysis and design. By James W. Roberts, Savvas G. Chamberlain (IEEE J. Solid State Circuits, 1989)',
 'Burstiness Bounds Based Muliplexing Schemes for VBR Video Connections in the B-ISDN. By Maher Hamdi, James W. Roberts (International Zurich Seminar on Digital Communications, 1996)',
 'Impact of "Trunk Reservation" on Elastic Flow Routing. By Sara Oueslati-Boulahia, James W. Roberts (NETWORKING, 2000)',
 'Swing: Traffic capacity of a simple WDM ring network. By Thomas Bonald, Sara Oueslati, James W. Roberts, Charlotte Roger (ITC, 2009)',
 'Internet and the Erlang formula. By Thomas Bonald, James W. Roberts (Comput. Commun. Rev., 2012)']

Note that we only get articles by Roberts here, yet the selection has slightly changed: the third and tenth entries differ from those returned by the single-author query.

Now, let’s look at the main authors.

[38]:
gismo.get_features_by_rank()
[38]:
['James W. Roberts',
 'Laurent Massoulié',
 'Thomas Bonald',
 'Marc Lelarge',
 'Maher Hamdi',
 'Sara Oueslati',
 'Nidhi Hegde 0001',
 'Mathieu Even',
 'Sara Oueslati-Boulahia',
 'Alexandre Proutière',
 'Hadrien Hendrikx',
 'Peter B. Key',
 'Stratis Ioannidis',
 'Ali Ibrahim',
 'Jorma T. Virtamo']

We see a mix of the co-authors of both researchers. How are they organized?

[39]:
gismo.get_features_by_cluster(resolution=.4)
 F: 0.01. R: 0.20. S: 0.55.
- F: 0.01. R: 0.13. S: 0.52.
-- F: 0.02. R: 0.12. S: 0.52.
--- F: 0.18. R: 0.11. S: 0.51.
---- James W. Roberts (R: 0.07; S: 0.92)
---- Thomas Bonald (R: 0.02; S: 0.21)
---- Sara Oueslati (R: 0.01; S: 0.18)
---- Sara Oueslati-Boulahia (R: 0.00; S: 0.24)
--- Maher Hamdi (R: 0.01; S: 0.10)
--- Nidhi Hegde 0001 (R: 0.00; S: 0.08)
--- Alexandre Proutière (R: 0.00; S: 0.04)
--- Ali Ibrahim (R: 0.00; S: 0.05)
-- Jorma T. Virtamo (R: 0.00; S: 0.04)
- F: 0.02. R: 0.07. S: 0.21.
-- F: 0.19. R: 0.06. S: 0.20.
--- Laurent Massoulié (R: 0.05; S: 0.37)
--- Mathieu Even (R: 0.00; S: 0.11)
--- Hadrien Hendrikx (R: 0.00; S: 0.08)
-- Marc Lelarge (R: 0.01; S: 0.06)
-- Peter B. Key (R: 0.00; S: 0.05)
-- Stratis Ioannidis (R: 0.00; S: 0.03)

Cross-gismo#

Gismo can combine two embeddings to create one hybrid gismo. This is called a cross-gismo (XGismo). This feature can be used to analyze authors with respect to the words they use (and vice versa).

[40]:
from gismo.gismo import XGismo
gismo = XGismo(x_embedding=a_embedding, y_embedding=embedding)
gismo.diteration.n_iter = 2  # fewer iterations, to speed up the computation a little
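As a rough intuition for what such a hybrid links together, here is a conceptual sketch in plain Python. It is not Gismo's implementation: the toy corpus, the helper `author_word_weights` and the uniform weighting are all assumptions made for illustration. The idea shown is only that authors and words can be related through the articles they share, so that authors end up described by words.

```python
# Hypothetical toy corpus: each article has authors and title words.
ARTICLES = [
    {"authors": ["Alice", "Bob"], "words": ["p2p", "streaming"]},
    {"authors": ["Alice"], "words": ["p2p", "pagerank"]},
    {"authors": ["Carol"], "words": ["stabilization", "byzantine"]},
]


def author_word_weights(articles):
    """Compose author->article and article->word links into author->word weights.

    Assumption: each author spreads a unit weight evenly over their articles,
    and each article spreads its share evenly over its words.
    """
    n_articles = {}
    for article in articles:
        for author in article["authors"]:
            n_articles[author] = n_articles.get(author, 0) + 1

    weights = {}
    for article in articles:
        for author in article["authors"]:
            share = 1 / (n_articles[author] * len(article["words"]))
            row = weights.setdefault(author, {})
            for word in article["words"]:
                row[word] = row.get(word, 0) + share
    return weights


weights = author_word_weights(ARTICLES)
for author, row in weights.items():
    print(author, row)
```

In this toy setting, Alice is mostly associated with `p2p` because the word appears in both of her articles.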

Note that XGismo does not use the underlying corpus, so we can now close the source (the source keeps the file dblp.data open).

[41]:
source.close()
[42]:
gismo.post_documents_item = lambda g, i: g.corpus[i].replace("_", " ")
gismo.post_features_cluster = post_features_cluster_print
gismo.post_documents_cluster = post_documents_cluster_print

Let’s try a query.

[43]:
gismo.rank("self-stabilization")
[43]:
True

What are the associated keywords?

[44]:
gismo.get_features_by_rank(k=10)
[44]:
['stabilization',
 'self',
 'self stabilization',
 'stabilize',
 'self stabilize',
 'distribute',
 'sensor',
 'robust',
 'nonlinear',
 'adaptive']

How are keywords structured?

[45]:
gismo.get_features_by_cluster(k=20, resolution=.8)
 F: 0.26. R: 0.02. S: 0.81.
- F: 0.52. R: 0.02. S: 0.81.
-- F: 0.63. R: 0.02. S: 0.81.
--- F: 0.80. R: 0.02. S: 0.81.
---- F: 0.81. R: 0.02. S: 0.81.
----- F: 0.93. R: 0.01. S: 0.82.
------ stabilization (R: 0.00; S: 0.80)
------ self stabilization (R: 0.00; S: 0.82)
----- F: 0.81. R: 0.01. S: 0.69.
------ self (R: 0.00; S: 0.76)
------ stabilize (R: 0.00; S: 0.72)
------ self stabilize (R: 0.00; S: 0.68)
------ distribute (R: 0.00; S: 0.71)
------ optimal (R: 0.00; S: 0.66)
------ distributed (R: 0.00; S: 0.62)
----- stabilizing (R: 0.00; S: 0.70)
---- F: 0.95. R: 0.00. S: 0.69.
----- sensor (R: 0.00; S: 0.69)
----- wireless (R: 0.00; S: 0.66)
---- fault (R: 0.00; S: 0.75)
--- F: 0.92. R: 0.00. S: 0.47.
---- byzantine (R: 0.00; S: 0.46)
---- robot (R: 0.00; S: 0.44)
---- mobile (R: 0.00; S: 0.53)
-- problem (R: 0.00; S: 0.65)
- F: 0.54. R: 0.00. S: 0.32.
-- F: 0.63. R: 0.00. S: 0.51.
--- robust (R: 0.00; S: 0.43)
--- adaptive (R: 0.00; S: 0.49)
-- nonlinear (R: 0.00; S: 0.08)
-- linear (R: 0.00; S: 0.25)

Who are the associated researchers?

[46]:
gismo.get_documents_by_rank(k=10)
[46]:
['Ted Herman',
 'Shlomi Dolev',
 'Sébastien Tixeuil',
 'Sukumar Ghosh',
 'George Varghese',
 'Shay Kutten',
 'Toshimitsu Masuzawa',
 'Stéphane Devismes',
 'Stefan Schmid 0001',
 'Swan Dubois']

How are they structured?

[47]:
gismo.get_documents_by_cluster(k=10, resolution=.9)
 F: 0.71. R: 0.05. S: 0.83.
- F: 0.91. R: 0.05. S: 0.83.
-- F: 0.93. R: 0.03. S: 0.83.
--- F: 0.97. R: 0.01. S: 0.81.
---- Ted Herman (R: 0.01; S: 0.81)
---- George Varghese (R: 0.00; S: 0.80)
--- F: 0.94. R: 0.02. S: 0.82.
---- F: 0.98. R: 0.01. S: 0.81.
----- Sébastien Tixeuil (R: 0.01; S: 0.82)
----- Sukumar Ghosh (R: 0.01; S: 0.80)
---- Swan Dubois (R: 0.00; S: 0.81)
-- F: 0.95. R: 0.02. S: 0.78.
--- F: 0.96. R: 0.02. S: 0.79.
---- Shlomi Dolev (R: 0.01; S: 0.78)
---- Shay Kutten (R: 0.00; S: 0.82)
---- Stéphane Devismes (R: 0.00; S: 0.77)
--- Toshimitsu Masuzawa (R: 0.00; S: 0.70)
- Stefan Schmid 0001 (R: 0.00; S: 0.63)

We can also query researchers. Just use underscores in the query and add y=False to indicate that the input consists of documents (here, researchers) rather than features (keywords).

[48]:
gismo.rank("Sébastien_Tixeuil and Fabien_Mathieu", y=False)
[48]:
True
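The query string used above can be derived from display names. The helpers below are hypothetical conveniences, not part of Gismo. They only mirror the convention visible in this tutorial: spaces inside a name become underscores, and names are joined with " and " as in the queries above.

```python
def author_token(name):
    """Turn a display name into the single-token form used in the queries."""
    return name.replace(" ", "_")


def display_name(token):
    """Inverse of author_token, like the post-processing lambda set earlier."""
    return token.replace("_", " ")


def author_query(*names):
    """Join several author tokens in the same way as the queries above."""
    return " and ".join(author_token(name) for name in names)


query = author_query("Sébastien Tixeuil", "Fabien Mathieu")
print(query)  # Sébastien_Tixeuil and Fabien_Mathieu
```

The resulting string is what would be passed to `gismo.rank(query, y=False)`.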

What are the associated keywords?

[49]:
gismo.get_features_by_rank(k=10)
[49]:
['p2p',
 'stabilization',
 'byzantine',
 'kleinberg',
 'p2p networks',
 'live streaming',
 'refresh',
 'self stabilization',
 'self',
 'gathering']

Ranking by coverage instead can surface other keywords of interest.

[50]:
gismo.get_features_by_coverage(k=10)
[50]:
['p2p',
 'stabilization',
 'fun',
 'preference',
 'pagerank',
 'byzantine',
 'self stabilization',
 'self',
 'gathering',
 'stabilize']
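To see exactly which keywords coverage brings in, the two lists can be compared directly. This is a small sketch in plain Python. The lists are copied from the two outputs above, and `only_in` is a hypothetical helper.

```python
# Keyword lists copied from the rank and coverage outputs above.
by_rank = [
    "p2p", "stabilization", "byzantine", "kleinberg", "p2p networks",
    "live streaming", "refresh", "self stabilization", "self", "gathering",
]
by_coverage = [
    "p2p", "stabilization", "fun", "preference", "pagerank",
    "byzantine", "self stabilization", "self", "gathering", "stabilize",
]


def only_in(first, second):
    """Items of `first` that are absent from `second`, in their original order."""
    seen = set(second)
    return [item for item in first if item not in seen]


gained = only_in(by_coverage, by_rank)
dropped = only_in(by_rank, by_coverage)
print("gained with coverage:", gained)
print("dropped from rank:", dropped)
```

Here coverage trades several keywords from the P2P side of the ranked list for keywords such as `preference` and `pagerank`.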

How are keywords structured?

[51]:
gismo.get_features_by_cluster(k=20, resolution=.7)
 F: 0.12. R: 0.20. S: 0.67.
- F: 0.80. R: 0.11. S: 0.44.
-- F: 0.84. R: 0.09. S: 0.44.
--- p2p (R: 0.02; S: 0.43)
--- kleinberg (R: 0.01; S: 0.44)
--- p2p networks (R: 0.01; S: 0.42)
--- live streaming (R: 0.01; S: 0.40)
--- refresh (R: 0.01; S: 0.40)
--- live (R: 0.01; S: 0.43)
--- old (R: 0.01; S: 0.40)
--- streaming (R: 0.01; S: 0.43)
--- acyclic (R: 0.01; S: 0.43)
-- preference (R: 0.01; S: 0.37)
-- pagerank (R: 0.01; S: 0.39)
- F: 0.56. R: 0.09. S: 0.59.
-- F: 0.81. R: 0.08. S: 0.59.
--- stabilization (R: 0.01; S: 0.56)
--- byzantine (R: 0.01; S: 0.51)
--- self stabilization (R: 0.01; S: 0.58)
--- self (R: 0.01; S: 0.59)
--- gathering (R: 0.01; S: 0.49)
--- stabilize (R: 0.01; S: 0.56)
--- self stabilize (R: 0.01; S: 0.56)
--- asynchronous (R: 0.01; S: 0.52)
-- fun (R: 0.01; S: 0.49)

Who are the associated researchers?

[52]:
gismo.get_documents_by_rank(k=10)
[52]:
['Sébastien Tixeuil',
 'Fabien Mathieu',
 'Shlomi Dolev',
 'Fukuhito Ooshita',
 'Toshimitsu Masuzawa',
 'Michel Raynal',
 'Mieczyslaw A. Klopotek',
 'Stéphane Devismes',
 'Céline Comte',
 'Edmond Bianco']

How are they structured?

[53]:
gismo.get_documents_by_cluster(k=10, resolution=.8)
 F: 0.00. R: 0.01. S: 0.69.
- F: 0.07. R: 0.01. S: 0.69.
-- F: 0.70. R: 0.00. S: 0.50.
--- F: 0.84. R: 0.00. S: 0.53.
---- Sébastien Tixeuil (R: 0.00; S: 0.54)
---- Fukuhito Ooshita (R: 0.00; S: 0.48)
--- F: 0.93. R: 0.00. S: 0.41.
---- Shlomi Dolev (R: 0.00; S: 0.41)
---- Toshimitsu Masuzawa (R: 0.00; S: 0.43)
---- Stéphane Devismes (R: 0.00; S: 0.39)
--- Michel Raynal (R: 0.00; S: 0.37)
-- F: 0.38. R: 0.00. S: 0.66.
--- Fabien Mathieu (R: 0.00; S: 0.71)
--- Mieczyslaw A. Klopotek (R: 0.00; S: 0.29)
--- Céline Comte (R: 0.00; S: 0.28)
- Edmond Bianco (R: 0.00; S: 0.03)