DBLP exploration#

This tutorial shows how to explore DBLP with Gismo.

If you have never used Gismo before, you may want to start with the Toy example tutorial or the ACM tutorial.

Note

In previous Gismo versions, DBLP was handled by Gismo itself. While we kept that code in the source for legacy purposes, it should be considered deprecated.

For DBLP access, we now recommend using the LDB interface from Gismap.

pip install gismap

Recommended requirements to execute this notebook (after installing gismap):

  • Fast Internet connection (you will need to download a few hundred MB)

  • 4 GB of free space

  • 4 GB of RAM (8 GB or more recommended)

  • Decent CPU (can take more than one hour on slow CPUs)

Here, documents are articles from DBLP. The features used to represent an article will vary along the tutorial.

Initialisation#

First, we load the required packages.

[1]:
import numpy as np
import spacy
from gismo import Corpus, Embedding, CountVectorizer, cosine_similarity, Gismo
from pathlib import Path
from functools import partial

from gismo.post_processing import (
    post_features_cluster_print,
    post_documents_cluster_print,
)

Then, we check that LDB works correctly.

[2]:
from gismap import LDB

LDB.search_author("Fabien Mathieu")
[2]:
[LDBAuthor(name='Fabien Mathieu', key='66/2077')]
[3]:
LDB.author_publications("66/2077")[:4]
[3]:
[LDBPublication(authors=[LDBAuthor(name='Heger Arfaoui', key='116/4885'), LDBAuthor(name='Pierre Fraigniaud', key='74/3005'), LDBAuthor(name='David Ilcinkas', key='64/5947'), LDBAuthor(name='Fabien Mathieu', key='66/2077')], title='Distributedly Testing Cycle-Freeness.', venue='WG', type='conference', year=2014, key='conf/wg/ArfaouiFIM14'),
 LDBPublication(authors=[LDBAuthor(name='Yacine Boufkhad', key='75/5742'), LDBAuthor(name='Fabien Mathieu', key='66/2077'), LDBAuthor(name='Fabien de Montgolfier', key='57/6313'), LDBAuthor(name='Diego Perino', key='03/3645'), LDBAuthor(name='Laurent Viennot', key='v/LaurentViennot')], title='Fine Tuning of a Distributed VoD System.', venue='ICCCN', type='conference', year=2009, key='conf/icccn/BoufkhadMMPV09'),
 LDBPublication(authors=[LDBAuthor(name='Fabien Mathieu', key='66/2077')], title='Upper Bounds for Stabilization in Acyclic Preference-Based Systems.', venue='SSS', type='conference', year=2007, key='conf/sss/Mathieu07'),
 LDBPublication(authors=[LDBAuthor(name='François Durand', key='38/11269'), LDBAuthor(name='Fabien Mathieu', key='66/2077'), LDBAuthor(name='Ludovic Noirie', key='17/428')], title='SVVAMP: Simulator of Various Voting Algorithms in Manipulating Populations.', venue='AAAI', type='conference', year=2016, key='conf/aaai/DurandMN16')]

Here is how publications are represented internally:

[4]:
LDB.publis[6666666]
[4]:
('journals/mcs/ZhangZGF19',
 'A mass-conservative characteristic splitting mixed finite element method for convection-dominated Sobolev equation.',
 'journal',
 [2521603, 3466362, 131057, 2517743],
 'https://doi.org/10.1016/J.MATCOM.2018.12.016',
 ['journals/mcs'],
 '180-191',
 'Math. Comput. Simul.',
 2019)

Each article is a tuple with key, title, type, authors (as indices), url, streams, pages, venue, year. We build a corpus that will tell Gismo that the content of an article is its title.
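For instance, the tuple shown above can be unpacked field by field (the variable names below are ours, not part of Gismo's API):

```python
# A publication tuple: (key, title, type, authors, url, streams, pages, venue, year).
pub = (
    "journals/mcs/ZhangZGF19",
    "A mass-conservative characteristic splitting mixed finite element method "
    "for convection-dominated Sobolev equation.",
    "journal",
    [2521603, 3466362, 131057, 2517743],  # author indices
    "https://doi.org/10.1016/J.MATCOM.2018.12.016",
    ["journals/mcs"],
    "180-191",
    "Math. Comput. Simul.",
    2019,
)
key, title, kind, authors, url, streams, pages, venue, year = pub
# The corpus below only uses the title, i.e. pub[1].
```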

[5]:
corpus = Corpus(LDB.publis, to_text=lambda p: p[1])

We build an embedding on top of that corpus.

  • We set min_df=30 to exclude rare features;

  • We set max_df=.02 to exclude anything present in more than 2% of the corpus;

  • We use spacy to lemmatize and remove some stopwords; remove the preprocessor=... argument if you want to skip this step (it is time-consuming);

  • A few manually selected stopwords to fine-tune things.

  • We set ngram_range=(1, 2) to include bi-grams in the embedding.

This will take from a few minutes (without spacy) up to a few hours (with spacy enabled). You can save the embedding for later use if you want.
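If you skip spacy, you can still approximate part of the cleaning with a plain-Python preprocessor. This is only a minimal sketch using the same stopword list as below, without lemmatization:

```python
# Naive preprocessor: lowercase and drop a few stopwords (no lemmatization).
STOPWORDS = {"a", "about", "an", "and", "for", "from", "in", "of", "on", "the", "with"}

def light_preprocess(txt):
    return " ".join(w for w in txt.lower().split() if w not in STOPWORDS)

light_preprocess("A Survey of the Machine Learning Landscape")
# 'survey machine learning landscape'
```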

[6]:
from pathlib import Path

path = Path.home() / "temp"
path.exists()
[6]:
True
[7]:
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
keep = {"ADJ", "NOUN", "NUM", "PROPN", "SYM", "VERB"}
vectorizer = CountVectorizer(
    min_df=30,
    max_df=0.02,
    ngram_range=(1, 2),
    dtype=float,
    preprocessor=lambda txt: " ".join(
        [w.lemma_.lower() for w in nlp(txt) if w.pos_ in keep and not w.is_stop]
    ),
    stop_words=[
        "a",
        "about",
        "an",
        "and",
        "for",
        "from",
        "in",
        "of",
        "on",
        "the",
        "with",
    ],
)

try:
    embedding = Embedding.load(filename="dblp_embedding", path=path)
except Exception:
    embedding = Embedding(vectorizer=vectorizer)
    embedding.fit_transform(corpus)
    embedding.dump(filename="dblp_embedding", path=path)
[8]:
embedding.x
[8]:
<Compressed Sparse Row sparse matrix of dtype 'float64'
        with 86356357 stored elements and shape (8236135, 262705)>

We see from embedding.x that the embedding links about 8,200,000 documents to 263,000 features. On average, each document is linked to about 10 features.
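These back-of-the-envelope figures come straight from the matrix repr above:

```python
# Figures read from the repr of embedding.x.
stored, docs, features = 86_356_357, 8_236_135, 262_705
avg_features_per_doc = stored / docs
round(avg_features_per_doc, 1)
# 10.5
```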

Now, we initiate the gismo object and customize post-processors to ease the display.

[9]:
gismo = Gismo(corpus, embedding)
[10]:
def post_article(g, i):
    tup = g.corpus[i]
    authors = ", ".join([LDB.author_by_index(k).name for k in tup[3]])
    return f"{tup[1]} By {authors} ({tup[7]}, {tup[8]})"


gismo.post_documents_item = post_article


def post_title(g, i):
    return g.corpus[i][1]


def post_meta(g, i):
    tup = g.corpus[i]
    authors = ", ".join([LDB.author_by_index(k).name for k in tup[3]])
    return f"{authors} ({tup[7]}, {tup[8]})"


gismo.post_documents_cluster = partial(
    post_documents_cluster_print, post_item=post_title
)
gismo.post_features_cluster = post_features_cluster_print

As the dataset is big, we lower the precision of the computation to speed things up a bit.

[11]:
gismo.parameters.n_iter = 2

Machine Learning (and Covid-19) query#

We perform the query Machine learning. The returned True tells us that some of the query features were found among the corpus’ features.

[12]:
gismo.rank("Machine Learning")
[12]:
True

What are the best articles on Machine Learning?

[13]:
gismo.get_documents_by_rank()
[13]:
['The Changing Landscape of Machine Learning: A Comparative Analysis of Centralized Machine Learning, Distributed Machine Learning and Federated Machine Learning. By Dishita Naik, Nitin Naik (UKCI, 2023)',
 'Resilient Machine Learning for Networked Cyber Physical Systems: A Survey for Machine Learning Security to Securing Machine Learning for CPS. By Felix O. Olowononi, Danda B. Rawat, Chunmei Liu (CoRR, 2021)',
 'Resilient Machine Learning for Networked Cyber Physical Systems: A Survey for Machine Learning Security to Securing Machine Learning for CPS. By Felix O. Olowononi, Danda B. Rawat, Chunmei Liu (IEEE Commun. Surv. Tutorials, 2021)',
 'The Machine Learning Machine: A Tangible User Interface for Teaching Machine Learning. By Magnus Høholt Kaspersen, Karl-Emil Kjær Bilstrup, Marianne Graves Petersen (TEI, 2021)',
 'Exploring fairness and privacy in machine learning. (Exploring fairness and privacy in machine learning). By Carlos Pinzón (None, 2023)',
 'A Machine Learning-oriented Survey on Tiny Machine Learning. By Luigi Capogrosso, Federico Cunico, Dong Seon Cheng, Franco Fummi, Marco Cristani (CoRR, 2023)',
 'A Machine Learning-Oriented Survey on Tiny Machine Learning. By Luigi Capogrosso, Federico Cunico, Dong Seon Cheng, Franco Fummi, Marco Cristani (IEEE Access, 2024)',
 'Quantum Machine Learning: An Interplay Between Quantum Computing and Machine Learning. By Jun Qi, Chao-Han Huck Yang, Samuel Yen-Chi Chen, Pin-Yu Chen (ISCAS, 2025)',
 'Quantum Machine Learning: An Interplay Between Quantum Computing and Machine Learning. By Jun Qi, Chao-Han Yang, Samuel Yen-Chi Chen, Pin-Yu Chen (CoRR, 2024)',
 'Critical Tools for Machine Learning: Situating, Figuring, Diffracting, Fabulating Machine Learning Systems Design. By Goda Klumbyte, Claude Draude, Alex S. Taylor (CHItaly, 2021)',
 'PIMA Diabetes Prediction Using Machine Learning and Quantum Machine Learning Techniques. By Dixit Vimal (ITU K, 2024)',
 'A Machine Learning Tutorial for Operational Meteorology, Part I: Traditional Machine Learning. By Randy J. Chase, David R. Harrison, Amanda Burke, Gary M. Lackmann, Amy McGovern (CoRR, 2022)',
 'Breast Tumor Classification using Machine Learning: Breast Tumor Classification using Machine Learning. By Salman Siddiqui, Mohd Usman Mallick, Ankur Varshney (EAI Endorsed Trans. Context aware Syst. Appl., 2023)',
 'Verbalized Machine Learning: Revisiting Machine Learning with Language Models. By Tim Z. Xiao, Robert Bamler, Bernhard Schölkopf, Weiyang Liu (Trans. Mach. Learn. Res., 2025)',
 'Verbalized Machine Learning: Revisiting Machine Learning with Language Models. By Tim Z. Xiao, Robert Bamler, Bernhard Schölkopf, Weiyang Liu (CoRR, 2024)',
 'When Physics Meets Machine Learning: A Survey of Physics-Informed Machine Learning. By Chuizheng Meng, Sungyong Seo, Defu Cao, Sam Griesemer, Yan Liu (CoRR, 2022)',
 'The Idealized Machine Learning Pipeline (IMLP) for Advancing Reproducibility in Machine Learning. By Yantong Zheng, Victoria Stodden (ACM-REP, 2024)',
 'Poisoning Attacks Against Machine Learning: Can Machine Learning Be Trustworthy? By Alina Oprea, Anoop Singhal, Apostol Vassilev (Computer, 2022)',
 'Machine Learning for Health ( ML4H ) 2019 : What Makes Machine Learning in Medicine Different? By Adrian V. Dalca, Matthew B. A. McDermott, Emily Alsentzer, Samuel G. Finlayson, Michael Oberst, Fabian Falck, Corey Chivers, Andrew Beam, Tristan Naumann, Brett K. Beaulieu-Jones (ML4H@NeurIPS, 2019)',
 'Resilient Machine Learning (rML) Ensemble Against Adversarial Machine Learning Attacks. By Likai Yao, Cihan Tunc, Pratik Satam, Salim Hariri (DDDAS, 2020)',
 'Interpretable Machine Learning Models via Maximum Boolean Satisfiability. (Interpretable Machine Learning Models via Maximum Boolean Satisfiability). By Hao Hu (None, 2022)',
 'Adversarial Machine Learning: Difficulties in Applying Machine Learning to Existing Cybersecurity Systems. By Shahram Rahimi, Jordan Maynor, Bidyut Gupta (CATA, 2020)',
 'Machine Learning for Testing Machine-Learning Hardware: A Virtuous Cycle. By Arjun Chaudhuri, Jonti Talukdar, Krishnendu Chakrabarty (ICCAD, 2022)',
 'Machine Learning Unplugged - Development and Evaluation of a Workshop About Machine Learning. By Elisaweta Ossovski, Michael Brinkmeier (ISSEP, 2019)',
 'How Developers Iterate on Machine Learning Workflows - A Survey of the Applied Machine Learning Literature. By Doris Xin, Litian Ma, Shuchen Song, Aditya G. Parameswaran (CoRR, 2018)',
 'Quantum Machine Learning: A Hands-on Tutorial for Machine Learning Practitioners and Researchers. By Yuxuan Du, Xinbiao Wang, Naixu Guo, Zhan Yu, Yang Qian, Kaining Zhang, Min-Hsiu Hsieh, Patrick Rebentrost, Dacheng Tao (CoRR, 2025)',
 'Predicting Machine Learning Pipeline Runtimes in the Context of Automated Machine Learning. By Felix Mohr, Marcel Wever, Alexander Tornede, Eyke Hüllermeier (IEEE Trans. Pattern Anal. Mach. Intell., 2021)',
 "Sustainability of Machine Learning-Enabled Systems: The Machine Learning Practitioner's Perspective. By Vincenzo De Martino, Stefano Lambiase, Fabiano Pecorelli, Willem-Jan van den Heuvel, Filomena Ferrucci, Fabio Palomba (CoRR, 2025)",
 'Hacking Machine Learning: Towards The Comprehensive Taxonomy of Attacks Against Machine Learning Systems. By Jerzy Surma (ICIAI, 2020)',
 'Machine Learning Lineage for Trustworthy Machine Learning Systems: Information Framework for MLOps Pipelines. By Mikko Raatikainen, Charalampos Souris, Jukka J. Remes, Vlad Stirbu (IEEE Softw., 2025)',
 'Deep Fast Machine Learning Utils: A Python Library for Streamlined Machine Learning Prototyping. By Fabi Prezja (CoRR, 2024)',
 'The Application of Machine Learning Algorithms in Predicting the Usage of IoT-based Cleaning Dispensers: Machine Learning Algorithms in Predicting the Usage if IoT-based Dispensers. By Tobechi Obinwanne, Chibuzor Udokwu, Patrick Brandtner (ICEEG, 2023)']

OK, this seems to go everywhere. Maybe we can narrow things down with a more specific query.

[14]:
gismo.rank("Machine Learning and covid-19")
[14]:
True
[15]:
gismo.get_documents_by_rank()
[15]:
['Ergonomics of Virtual Learning During COVID-19. By Lu Yuan, Alison Garaudy (AHFE (11), 2021)',
 'University Virtual Learning in Covid Times. By Verónica Marín-Díaz, Eloísa Reche, Javier Martín (Technol. Knowl. Learn., 2022)',
 'Mobile Learning for COVID-19 Prevention. By Zhiyi Wang (EAI Endorsed Trans. e Learn., 2024)',
 'Design Issues in e-Learning during the COVID-19 Pandemic. By Alexandra Hosszu, Cosima Rughinis (CSCS, 2021)',
 'Campus traffic and e-Learning during COVID-19 pandemic. By Thomas Favale, Francesca Soro, Martino Trevisan, Idilio Drago, Marco Mellia (Comput. Networks, 2020)',
 'Campus Traffic and e-Learning during COVID-19 Pandemic. By Thomas Favale, Francesca Soro, Martino Trevisan, Idilio Drago, Marco Mellia (CoRR, 2020)',
 'The Deaf Experience in Remote Learning during COVID-19. By Yosra Bouzid, Mohamed Jemni (ICTA, 2021)',
 'Interpretable Sequence Learning for Covid-19 Forecasting. By Sercan Ö. Arik, Chun-Liang Li, Jinsung Yoon, Rajarishi Sinha, Arkady Epshteyn, Long T. Le, Vikas Menon, Shashank Singh, Leyou Zhang, Martin Nikoltchev, Yash Sonthalia, Hootan Nakhost, Elli Kanal, Tomas Pfister (NeurIPS, 2020)',
 'Interpretable Sequence Learning for COVID-19 Forecasting. By Sercan Ö. Arik, Chun-Liang Li, Jinsung Yoon, Rajarishi Sinha, Arkady Epshteyn, Long T. Le, Vikas Menon, Shashank Singh, Leyou Zhang, Nathanael C. Yoder, Martin Nikoltchev, Yash Sonthalia, Hootan Nakhost, Elli Kanal, Tomas Pfister (CoRR, 2020)',
 'DeCoP: Deep Learning for COVID-19 Prediction of Survival. By Yao Deng, Shigang Liu, Alireza Jolfaei, Hongbing Cheng, Ziyuan Wang, Xi Zheng (IEEE Trans. Mol. Biol. Multi Scale Commun., 2022)',
 'Dual Teaching: Simultaneous Remote and In-Person Learning During COVID. By Hunter M. Williams, Malcolm Gibran Haynes, Joseph Kim (SIGITE, 2021)',
 'D-Learning and COVID-19 Crisis: Appraisal of Reactions and Means of Perpetuity. By Jalal Ismaili, Karim El Moutaouakil (SN Comput. Sci., 2023)',
 'Engineering Experiential Learning During the COVID-19 Pandemic. By Nael Barakat, Aws AlShalash, Mohammad Biswas, Shih-Feng Chou, Tahsin Khajah (ICL (2), 2021)',
 'A Survey on Deep Learning in COVID-19 Diagnosis. By Xue Han, Zuojin Hu, Shuihua Wang, Yu-Dong Zhang (J. Imaging, 2023)',
 'Differences Between Neurodivergent and Neurotypical Learning During Covid-19: Towards the E-Tivities Satisfaction Scale. By Jonathan Bishop, Kamal Bechkoum (CSCI, 2022)',
 'The Study on the Efficiency of Smart Learning in the COVID-19. By Seong-Kyu Kim, Mi-Jung Lee, Eun-Sill Jang, Young-Eun Lee (J. Multim. Inf. Syst., 2022)',
 'An Analysis of the Effectiveness of Emergency Distance Learning under COVID-19. By Ngo Tung Son, Bui Ngoc Anh, Kieu Quoc Tuan, Son Ba Nguyen, Son Hoang Nguyen, Jafreezal Jaafar (CCRIS, 2020)',
 'DCML: Deep contrastive mutual learning for COVID-19 recognition. By Hongbin Zhang, Weinan Liang, Chuanxiu Li, Qipeng Xiong, Haowei Shi, Lang Hu, Guangli Li (Biomed. Signal Process. Control., 2022)',
 'M-learning During COVID-19: A Systematic Literature Review. By Esmhan Jafer, Hossana Twinomurinzi (AFRICATEK, 2022)',
 'Automated Machine Learning for COVID-19 Forecasting. By Jaco Tetteroo, Mitra Baratchi, Holger H. Hoos (IEEE Access, 2022)',
 'Urgent Digital Change - Learning from the COVID-19 Pandemic. By Aisha Zahoor Malik, Christopher H. Gyldenkærne, Christine Flagstad Bech, Troels Mønsted, Jesper Simonsen (InfraHealth, 2021)',
 'A Data Augmented Approach to Transfer Learning for Covid-19 Detection. By Shagufta Henna, Aparna Reji (CoRR, 2021)',
 'Challenges of Online Learning During the COVID-19: What Can We Learn on Twitter? By Wei Quan (ARTIIS, 2021)',
 'A comprehensive review of federated learning for COVID-19 detection. By Sadaf Naz, Khoa Tran Phan, Yi-Ping Phoebe Chen (Int. J. Intell. Syst., 2022)',
 'M-learning in the COVID-19 era: physical vs digital class. By Vasiliki Matzavela, Efthimios Alepis (Educ. Inf. Technol., 2021)',
 'Curriculum Contrastive Learning for COVID-19 FAQ Retrieval. By Leilei Zhang, Junfei Liu (BIBM, 2022)',
 'Federated Learning for COVID-19 on Heterogeneous CXR Images with Noise. By Mengqing Ding, Juan Li, Changyan Yi, Jun Cai (ICC, 2023)',
 'Design Teaching and Learning in Covid-19 Times: An International Experience. By Paulo Ferreira, Filipa Oliveira Antunes, Haroldo Gallo, Marcos Tognon, Heloisa Mendes Pereira (TECH-EDU, 2020)',
 'Educational Transformation: An Evaluation of Online Learning Due To COVID-19. By Rizky Firmansyah, Dhika Maha Putri, Mochammad Galih Satriyo Wicaksono, Sheila Febriani Putri, Ahmad Arif Widianto, Mohd Rizal Palil (Int. J. Emerg. Technol. Learn., 2021)',
 'Online Learning Before, During and After COVID-19: Observations Over 20 Years. By Natalie Wieland, Liz Kollias (Int. J. Adv. Corp. Learn., 2020)',
 'Using Deep Learning for COVID-19 Control: Implementing a Convolutional Neural Network in a Facemask Detection Application. By Caolan Deery, Kevin Meehan (SmartNets, 2021)',
 'LitCovid ensemble learning for COVID-19 multi-label classification. By Jinghang Gu, Emmanuele Chersoni, Xing Wang, Chu-Ren Huang, Longhua Qian, Guodong Zhou (Database J. Biol. Databases Curation, 2022)',
 'A Survey on Deep Learning and Machine Learning for COVID-19 Detection. By Mohamed M. Dessouky, Sahar F. Sabbeh, Boushra Alshehri (ICFNDS, 2021)',
 'Boosting Deep Transfer Learning For Covid-19 Classification. By Fouzia Altaf, Syed M. S. Islam, Naeem Khalid Janjua, Naveed Akhtar (ICIP, 2021)',
 'Boosting Deep Transfer Learning for COVID-19 Classification. By Fouzia Altaf, Syed M. S. Islam, Naeem Khalid Janjua, Naveed Akhtar (CoRR, 2021)',
 'Academic Procrastination and Online Learning During the COVID-19 Pandemic. By Jørgen Melgaard, Rubina Monir, Lester Allan Lasrado, Asle Fagerstrøm (CENTERIS/ProjMAN/HCist, 2021)']

Sounds nice. How are the top-10 articles related? Note: as the graph structure is really sparse on the document side (about 10 features per document), it is best to deactivate the query distortion, which is intended for longer documents.

[16]:
gismo.parameters.distortion = 0.0
gismo.get_documents_by_cluster(k=10)
 F: 0.46. R: 0.00. S: 0.79.
- F: 0.46. R: 0.00. S: 0.78.
-- F: 0.65. R: 0.00. S: 0.52.
--- Ergonomics of Virtual Learning During COVID-19. (R: 0.00; S: 0.61)
--- University Virtual Learning in Covid Times. (R: 0.00; S: 0.34)
-- Mobile Learning for COVID-19 Prevention. (R: 0.00; S: 0.55)
-- F: 0.72. R: 0.00. S: 0.74.
--- Design Issues in e-Learning during the COVID-19 Pandemic. (R: 0.00; S: 0.66)
--- F: 1.00. R: 0.00. S: 0.71.
---- Campus traffic and e-Learning during COVID-19 pandemic. (R: 0.00; S: 0.71)
---- Campus Traffic and e-Learning during COVID-19 Pandemic. (R: 0.00; S: 0.71)
-- F: 1.00. R: 0.00. S: 0.53.
--- Interpretable Sequence Learning for Covid-19 Forecasting. (R: 0.00; S: 0.53)
--- Interpretable Sequence Learning for COVID-19 Forecasting. (R: 0.00; S: 0.53)
- The Deaf Experience in Remote Learning during COVID-19. (R: 0.00; S: 0.51)
- DeCoP: Deep Learning for COVID-19 Prediction of Survival. (R: 0.00; S: 0.52)

Now, let’s look at the main keywords.

[17]:
gismo.get_features_by_rank(20)
[17]:
['covid',
 'covid 19',
 '19',
 'learning covid',
 'machine',
 'machine learning',
 'pandemic',
 '19 pandemic',
 'online learning',
 'online',
 '19 detection',
 'chest',
 'student',
 'deep learning',
 'ray',
 'chest ray',
 'prediction',
 'classification',
 'case',
 'ct']

Let’s organize them.

[18]:
# On the feature side, the graph is more dense so we can use query distortion
gismo.get_features_by_cluster(distortion=1)
 F: 0.26. R: 0.06. S: 0.97.
- F: 0.47. R: 0.05. S: 0.97.
-- F: 0.57. R: 0.05. S: 0.97.
--- F: 0.96. R: 0.04. S: 0.97.
---- covid (R: 0.01; S: 1.00)
---- covid 19 (R: 0.01; S: 0.99)
---- 19 (R: 0.01; S: 0.99)
---- learning covid (R: 0.01; S: 0.97)
--- F: 0.98. R: 0.01. S: 0.57.
---- pandemic (R: 0.00; S: 0.58)
---- 19 pandemic (R: 0.00; S: 0.56)
-- F: 0.94. R: 0.00. S: 0.45.
--- online learning (R: 0.00; S: 0.44)
--- online (R: 0.00; S: 0.47)
- F: 1.00. R: 0.01. S: 0.29.
-- machine (R: 0.00; S: 0.29)
-- machine learning (R: 0.00; S: 0.29)
- 19 detection (R: 0.00; S: 0.28)

Rough, very broad analysis:

  • One big keyword cluster about Coronavirus / Covid-19, pandemic, online learning;

  • Machine Learning as a separate small cluster.

[19]:
np.dot(gismo.embedding.query_projection("Machine learning")[0], gismo.embedding.y)
[19]:
<Compressed Sparse Row sparse matrix of dtype 'float64'
        with 138918 stored elements and shape (1, 8236135)>

139,000 articles with an explicit link to machine learning.
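If you prefer to read this count programmatically rather than from the repr, the nnz attribute of a scipy sparse matrix gives the number of stored elements (a toy sketch, assuming scipy is installed):

```python
from scipy.sparse import csr_matrix

# Toy sparse row: only two of the four entries are stored.
m = csr_matrix([[0.0, 1.5, 0.0, 2.0]])
m.nnz  # number of explicitly stored elements -> 2
```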

[20]:
np.dot(gismo.embedding.query_projection("Covid-19")[0], gismo.embedding.y)
[20]:
<Compressed Sparse Row sparse matrix of dtype 'float64'
        with 23617 stored elements and shape (1, 8236135)>

23,600 articles with an explicit link to covid-19.

Authors query#

Instead of looking at words, we can explore authors and their collaborations.

We just have to rewire the corpus so that the content of an article is its list of authors.

[21]:
def to_authors_text(tup):
    return tup[3]


corpus = Corpus(LDB.publis, to_text=to_authors_text)

We can build a new embedding on top of this modified corpus. We tell the vectorizer to do nothing fancy: no preprocessing, and each author index is taken as a token.

This will take a few minutes (you can save the embedding for later if you want).
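To see what this "do-nothing" vectorizer amounts to, here is a toy reimplementation in plain Python (for illustration only; the real work is done by CountVectorizer):

```python
from collections import Counter

# Each document is already a list of author indices: no preprocessing needed.
docs = [[10, 42], [42, 42, 7]]
vocab = sorted({a for doc in docs for a in doc})  # [7, 10, 42]
counts = [Counter(doc) for doc in docs]
# One row per document, one column per author index.
matrix = [[c[a] for a in vocab] for c in counts]
# matrix == [[0, 1, 1], [1, 0, 2]]
```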

[22]:
vectorizer = CountVectorizer(
    dtype=float, preprocessor=lambda x: x, tokenizer=lambda x: x
)
try:
    a_embedding = Embedding.load(filename="dblp_aut_embedding", path=path)
except Exception:
    a_embedding = Embedding(vectorizer=vectorizer)
    a_embedding.fit_transform(corpus)
    a_embedding.dump(filename="dblp_aut_embedding", path=path)
[23]:
a_embedding.x
[23]:
<Compressed Sparse Row sparse matrix of dtype 'float64'
        with 28298209 stored elements and shape (8236135, 3994027)>

We now have about 4,000,000 authors to explore. Let’s reload gismo and start playing.

[24]:
gismo = Gismo(corpus, a_embedding)
gismo.post_documents_item = post_article
gismo.post_features_item = lambda g, i: LDB.author_by_index(i).name
[25]:
gismo.post_documents_cluster = partial(
    post_documents_cluster_print, post_item=post_meta
)
gismo.post_features_cluster = post_features_cluster_print

Laurent Massoulié query#

[26]:
def name_to_index(name):
    return [LDB.keys[LDB.search_author(name)[0].key]]
[27]:
gismo.rank(name_to_index("Laurent Massoulié"))
[27]:
True

What are the most central articles of Laurent Massoulié in terms of collaboration?

[28]:
gismo.get_documents_by_rank(k=10)
[28]:
['Aligning Embeddings and Geometric Random Graphs: Informational Results and Computational Approaches for the Procrustes-Wasserstein Problem. By Mathieu Even, Luca Ganassali, Jakob Maier, Laurent Massoulié (CoRR, 2024)',
 'Aligning Embeddings and Geometric Random Graphs: Informational Results and Computational Approaches for the Procrustes-Wasserstein Problem. By Mathieu Even, Luca Ganassali, Jakob Maier, Laurent Massoulié (NeurIPS, 2024)',
 'Scalable Local Area Service Discovery. By Richard Black, Heimir Sverrisson, Laurent Massoulié (ICC, 2007)',
 'Asymmetric tree correlation testing for graph alignment. By Jakob Maier, Laurent Massoulié (ITW, 2023)',
 'Asymmetric graph alignment and the phase transition for asymmetric tree correlation testing. By Jakob Maier, Laurent Massoulié (CoRR, 2025)',
 'Decentralized Optimization with Heterogeneous Delays: a Continuous-Time Approach. By Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2021)',
 'Asynchrony and Acceleration in Gossip Algorithms. By Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2020)',
 'Asynchronous Speedup in Decentralized Optimization. By Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (IEEE Trans. Autom. Control., 2025)',
 'An Impossibility Result for Reconstruction in a Degree-Corrected Planted-Partition Model. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (ITCS, 2017)']

We see lots of duplicates. This is not surprising, as many articles are published first as a research report, then as a conference paper, and last as a journal article. Luckily, Gismo can cover for you.
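As a point of comparison, a naive manual deduplication could key on the normalized title (a rough sketch of ours, not Gismo's coverage mechanism, which also accounts for diversity):

```python
def dedup_by_title(results):
    """Keep the first occurrence of each (case-insensitive) title."""
    seen, kept = set(), []
    for entry in results:
        # Entries follow the "Title. By Authors (venue, year)" format used above.
        title = entry.split(" By ")[0].strip(" .").lower()
        if title not in seen:
            seen.add(title)
            kept.append(entry)
    return kept

results = [
    "Aligning Embeddings and Geometric Random Graphs: Informational Results and "
    "Computational Approaches for the Procrustes-Wasserstein Problem. "
    "By Mathieu Even, Luca Ganassali, Jakob Maier, Laurent Massoulié (CoRR, 2024)",
    "Aligning Embeddings and Geometric Random Graphs: Informational Results and "
    "Computational Approaches for the Procrustes-Wasserstein Problem. "
    "By Mathieu Even, Luca Ganassali, Jakob Maier, Laurent Massoulié (NeurIPS, 2024)",
    "Scalable Local Area Service Discovery. "
    "By Richard Black, Heimir Sverrisson, Laurent Massoulié (ICC, 2007)",
]
dedup_by_title(results)  # drops the NeurIPS duplicate
```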

[29]:
gismo.get_documents_by_coverage(k=10)
[29]:
['Aligning Embeddings and Geometric Random Graphs: Informational Results and Computational Approaches for the Procrustes-Wasserstein Problem. By Mathieu Even, Luca Ganassali, Jakob Maier, Laurent Massoulié (CoRR, 2024)',
 'Aligning Embeddings and Geometric Random Graphs: Informational Results and Computational Approaches for the Procrustes-Wasserstein Problem. By Mathieu Even, Luca Ganassali, Jakob Maier, Laurent Massoulié (NeurIPS, 2024)',
 'Scalable Local Area Service Discovery. By Richard Black, Heimir Sverrisson, Laurent Massoulié (ICC, 2007)',
 'Asymmetric tree correlation testing for graph alignment. By Jakob Maier, Laurent Massoulié (ITW, 2023)',
 'Asymmetric graph alignment and the phase transition for asymmetric tree correlation testing. By Jakob Maier, Laurent Massoulié (CoRR, 2025)',
 'Decentralized Optimization with Heterogeneous Delays: a Continuous-Time Approach. By Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2021)',
 'Asynchrony and Acceleration in Gossip Algorithms. By Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2020)',
 'Asynchronous Speedup in Decentralized Optimization. By Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (IEEE Trans. Autom. Control., 2025)',
 'An Impossibility Result for Reconstruction in a Degree-Corrected Planted-Partition Model. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (ITCS, 2017)']

Hmm, not working well. The reason here is query distortion, a Gismo feature that modulates the clustering with the query. Sadly, when features are authors, the underlying graph has a very specific structure (highly sparse and redundant) that makes query distortion too effective. The solution is to deactivate it.

[30]:
gismo.parameters.distortion = 0
gismo.get_documents_by_coverage(k=10)
[30]:
['Aligning Embeddings and Geometric Random Graphs: Informational Results and Computational Approaches for the Procrustes-Wasserstein Problem. By Mathieu Even, Luca Ganassali, Jakob Maier, Laurent Massoulié (CoRR, 2024)',
 'Scalable Local Area Service Discovery. By Richard Black, Heimir Sverrisson, Laurent Massoulié (ICC, 2007)',
 'An Impossibility Result for Reconstruction in a Degree-Corrected Planted-Partition Model. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015)',
 'Collective Tree Exploration via Potential Function Method. By Romain Cosson, Laurent Massoulié (ITCS, 2024)',
 'Asymmetric tree correlation testing for graph alignment. By Jakob Maier, Laurent Massoulié (ITW, 2023)',
 'Decentralized Optimization with Heterogeneous Delays: a Continuous-Time Approach. By Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2021)',
 'Concentration of Non-Isotropic Random Tensors with Applications to Learning and Empirical Risk Minimization. By Mathieu Even, Laurent Massoulié (COLT, 2021)',
 'From tree matching to sparse graph alignment. By Luca Ganassali, Laurent Massoulié (COLT, 2020)',
 'On Sample Optimality in Personalized Collaborative and Federated Learning. By Mathieu Even, Laurent Massoulié, Kevin Scaman (NeurIPS, 2022)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (ITCS, 2017)']

Much better: no duplicates and more diversity in the results. Let’s observe the communities.

[31]:
gismo.get_documents_by_cluster(k=20, resolution=0.9)
 F: 0.38. R: 0.06. S: 0.87.
- F: 0.39. R: 0.06. S: 0.86.
-- F: 0.47. R: 0.05. S: 0.81.
--- F: 0.53. R: 0.04. S: 0.79.
---- F: 0.54. R: 0.02. S: 0.70.
----- F: 0.70. R: 0.01. S: 0.60.
------ F: 1.00. R: 0.01. S: 0.49.
------- Mathieu Even, Luca Ganassali, Jakob Maier, Laurent Massoulié (CoRR, 2024) (R: 0.00; S: 0.49)
------- Mathieu Even, Luca Ganassali, Jakob Maier, Laurent Massoulié (NeurIPS, 2024) (R: 0.00; S: 0.49)
------ F: 1.00. R: 0.01. S: 0.60.
------- Jakob Maier, Laurent Massoulié (ITW, 2023) (R: 0.00; S: 0.60)
------- Jakob Maier, Laurent Massoulié (CoRR, 2025) (R: 0.00; S: 0.60)
----- F: 1.00. R: 0.01. S: 0.65.
------ Luca Ganassali, Laurent Massoulié (COLT, 2020) (R: 0.00; S: 0.65)
------ Luca Ganassali, Laurent Massoulié (CoRR, 2020) (R: 0.00; S: 0.65)
---- F: 0.80. R: 0.02. S: 0.69.
----- F: 1.00. R: 0.01. S: 0.60.
------ Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2021) (R: 0.00; S: 0.60)
------ Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2020) (R: 0.00; S: 0.60)
------ Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (IEEE Trans. Autom. Control., 2025) (R: 0.00; S: 0.60)
----- F: 1.00. R: 0.01. S: 0.69.
------ Mathieu Even, Laurent Massoulié (COLT, 2021) (R: 0.00; S: 0.69)
------ Mathieu Even, Laurent Massoulié (CoRR, 2025) (R: 0.00; S: 0.69)
------ Mathieu Even, Laurent Massoulié (CoRR, 2021) (R: 0.00; S: 0.69)
----- F: 1.00. R: 0.01. S: 0.58.
------ Mathieu Even, Laurent Massoulié, Kevin Scaman (NeurIPS, 2022) (R: 0.00; S: 0.58)
------ Kevin Scaman, Mathieu Even, Laurent Massoulié (CoRR, 2023) (R: 0.00; S: 0.58)
--- Romain Cosson, Laurent Massoulié (ITCS, 2024) (R: 0.00; S: 0.66)
-- F: 1.00. R: 0.01. S: 0.58.
--- Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015) (R: 0.00; S: 0.58)
--- Lennart Gulikers, Marc Lelarge, Laurent Massoulié (ITCS, 2017) (R: 0.00; S: 0.58)
--- Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2016) (R: 0.00; S: 0.58)
--- Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015) (R: 0.00; S: 0.58)
- Richard Black, Heimir Sverrisson, Laurent Massoulié (ICC, 2007) (R: 0.00; S: 0.46)

OK! We see that the articles are organized by writing communities. Also note how Gismo managed to organize a hierarchical grouping of the communities.

Now, let’s look in terms of authors. This is actually the interesting part when studying collaborations.

[32]:
gismo.get_features_by_rank()
[32]:
['Laurent Massoulié',
 'Marc Lelarge',
 'Mathieu Even',
 'Hadrien Hendrikx',
 'Peter B. Key',
 'Stratis Ioannidis',
 'Nidhi Hegde',
 'Romain Cosson',
 'Francis R. Bach',
 'Luca Ganassali',
 'Anne-Marie Kermarrec',
 'Don Towsley',
 'Ayalvadi Ganesh',
 'Kevin Scaman',
 'Lennart Gulikers',
 'Milan Vojnovic',
 'Dan-Cristian Tomozei',
 'Laurent Viennot',
 'Amin Karbasi']

We see many authors that were not present in the articles listed above. This is an important observation: central articles (with respect to a query) are not necessarily written by central authors!

Let’s organize them into communities.

[33]:
gismo.get_features_by_cluster(resolution=0.6)
 F: 0.01. R: 0.20. S: 0.53.
- F: 0.01. R: 0.20. S: 0.53.
-- F: 0.02. R: 0.20. S: 0.54.
--- F: 0.02. R: 0.19. S: 0.52.
---- F: 0.07. R: 0.15. S: 0.48.
----- F: 0.21. R: 0.12. S: 0.52.
------ F: 0.24. R: 0.11. S: 0.56.
------- Laurent Massoulié (R: 0.10; S: 1.00)
------- Mathieu Even (R: 0.01; S: 0.26)
------ Hadrien Hendrikx (R: 0.01; S: 0.20)
----- F: 0.12. R: 0.02. S: 0.25.
------ Marc Lelarge (R: 0.01; S: 0.14)
------ Luca Ganassali (R: 0.01; S: 0.16)
------ Lennart Gulikers (R: 0.00; S: 0.18)
----- Francis R. Bach (R: 0.01; S: 0.05)
----- Kevin Scaman (R: 0.00; S: 0.10)
----- Milan Vojnovic (R: 0.00; S: 0.05)
---- F: 0.04. R: 0.02. S: 0.16.
----- F: 0.08. R: 0.01. S: 0.14.
------ Peter B. Key (R: 0.01; S: 0.12)
------ Ayalvadi Ganesh (R: 0.01; S: 0.09)
----- Anne-Marie Kermarrec (R: 0.01; S: 0.07)
---- F: 0.03. R: 0.01. S: 0.08.
----- Stratis Ioannidis (R: 0.01; S: 0.09)
----- Amin Karbasi (R: 0.00; S: 0.03)
---- Nidhi Hegde (R: 0.01; S: 0.14)
--- F: 0.06. R: 0.01. S: 0.14.
---- Romain Cosson (R: 0.01; S: 0.13)
---- Laurent Viennot (R: 0.00; S: 0.05)
-- Dan-Cristian Tomozei (R: 0.00; S: 0.07)
- Don Towsley (R: 0.01; S: 0.03)

Jim Roberts query#

[34]:
gismo.rank(name_to_index("James W. Roberts"))
[34]:
True

Let’s have a covering set of articles.

[35]:
gismo.get_documents_by_coverage(k=10)
[35]:
['Integrated Admission Control for Streaming and Elastic Traffic. By Nabil Benameur, Slim Ben Fredj, Frank Delcoigne, Sara Oueslati, James W. Roberts (QofIS, 2001)',
 'Statistical bandwidth sharing: a study of congestion at flow level. By Slim Ben Fredj, Thomas Bonald, Alexandre Proutière, G. Régnié, James W. Roberts (SIGCOMM, 2001)',
 'An In-Camera Data Stream Processing System for Defect Detection in Web Inspection Tasks. By S. Hossain Hajimowlana, Roberto Muscedere, Graham A. Jullien, James W. Roberts (Real Time Imaging, 1999)',
 "Modifications of Thomae's Function and Differentiability. By Kevin Beanland, James W. Roberts, Craig Stevenson (Am. Math. Mon., 2009)",
 'A Traffic Control Framework for High Speed Data Transmission. By James W. Roberts, Brahim Bensaou, Y. Canetti (Modelling and Evaluation of ATM Networks, 1993)',
 'Broadband Network Teletraffic - Performance Evaluation and Design of Broadband Multiservice Networks: Final Report of Action COST 242 By James W. Roberts, Ugo Mocci, Jorma T. Virtamo (Lecture Notes in Computer Science, 1996)',
 'Swing: Traffic capacity of a simple WDM ring network. By Thomas Bonald, Sara Oueslati, James W. Roberts, Charlotte Roger (ITC, 2009)',
 'Impact of  By Sara Oueslati, James W. Roberts (NETWORKING, 2000)',
 'Comment on  By James W. Roberts (Comput. Commun. Rev., 2020)',
 'A CMOS model for computer-aided circuit analysis and design. By James W. Roberts, Savvas G. Chamberlain (IEEE J. Solid State Circuits, 1989)']

Who are the associated authors?

[36]:
gismo.get_features_by_rank(k=10)
[36]:
['James W. Roberts',
 'Thomas Bonald',
 'Sara Oueslati',
 'Maher Hamdi',
 'Jorma T. Virtamo',
 'Ali Ibrahim',
 'Alexandre Proutière',
 'Slim Ben Fredj',
 'Jussi Kangasharju',
 'Keith W. Ross']

Let’s organize them.

[37]:
gismo.get_features_by_cluster(k=10, resolution=0.4)
 F: 0.01. R: 0.24. S: 0.54.
- F: 0.01. R: 0.23. S: 0.54.
-- F: 0.04. R: 0.21. S: 0.53.
--- F: 0.20. R: 0.20. S: 0.53.
---- James W. Roberts (R: 0.14; S: 1.00)
---- Thomas Bonald (R: 0.04; S: 0.22)
---- Sara Oueslati (R: 0.02; S: 0.29)
---- Slim Ben Fredj (R: 0.01; S: 0.29)
--- Alexandre Proutière (R: 0.01; S: 0.03)
-- Maher Hamdi (R: 0.01; S: 0.11)
-- Jorma T. Virtamo (R: 0.01; S: 0.06)
-- Ali Ibrahim (R: 0.01; S: 0.05)
- F: 0.06. R: 0.01. S: 0.03.
-- Jussi Kangasharju (R: 0.00; S: 0.03)
-- Keith W. Ross (R: 0.00; S: 0.02)

Combined queries#

We can input multiple authors.

[38]:
gismo.rank(name_to_index("Laurent_Massoulié") + name_to_index("James W. Roberts"))
[38]:
True

Let’s have a covering set of articles.

[39]:
gismo.get_documents_by_coverage(k=10)
[39]:
['Integrated Admission Control for Streaming and Elastic Traffic. By Nabil Benameur, Slim Ben Fredj, Frank Delcoigne, Sara Oueslati, James W. Roberts (QofIS, 2001)',
 'Statistical bandwidth sharing: a study of congestion at flow level. By Slim Ben Fredj, Thomas Bonald, Alexandre Proutière, G. Régnié, James W. Roberts (SIGCOMM, 2001)',
 'An In-Camera Data Stream Processing System for Defect Detection in Web Inspection Tasks. By S. Hossain Hajimowlana, Roberto Muscedere, Graham A. Jullien, James W. Roberts (Real Time Imaging, 1999)',
 "Modifications of Thomae's Function and Differentiability. By Kevin Beanland, James W. Roberts, Craig Stevenson (Am. Math. Mon., 2009)",
 'A Traffic Control Framework for High Speed Data Transmission. By James W. Roberts, Brahim Bensaou, Y. Canetti (Modelling and Evaluation of ATM Networks, 1993)',
 'Broadband Network Teletraffic - Performance Evaluation and Design of Broadband Multiservice Networks: Final Report of Action COST 242 By James W. Roberts, Ugo Mocci, Jorma T. Virtamo (Lecture Notes in Computer Science, 1996)',
 'Swing: Traffic capacity of a simple WDM ring network. By Thomas Bonald, Sara Oueslati, James W. Roberts, Charlotte Roger (ITC, 2009)',
 'Comparing Flow-Aware and Flow-Oblivious Adaptive Routing. By Sara Oueslati, James W. Roberts (CISS, 2006)',
 'A survey on statistical bandwidth sharing. By James W. Roberts (Comput. Networks, 2004)',
 'A CMOS model for computer-aided circuit analysis and design. By James W. Roberts, Savvas G. Chamberlain (IEEE J. Solid State Circuits, 1989)']

Note that we still only get articles by Roberts, yet the returned list has slightly changed.

Now, let’s look at the main authors.

[40]:
gismo.get_features_by_rank()
[40]:
['James W. Roberts',
 'Laurent Massoulié',
 'Thomas Bonald',
 'Sara Oueslati',
 'Marc Lelarge',
 'Maher Hamdi',
 'Nidhi Hegde',
 'Mathieu Even',
 'Jorma T. Virtamo',
 'Alexandre Proutière',
 'Hadrien Hendrikx',
 'Peter B. Key',
 'Stratis Ioannidis',
 'Ali Ibrahim']

We see a mix of co-authors of both researchers. How are they organized?

[41]:
gismo.get_features_by_cluster(resolution=0.4)
 F: 0.02. R: 0.19. S: 0.57.
- F: 0.03. R: 0.19. S: 0.57.
-- F: 0.03. R: 0.13. S: 0.57.
--- F: 0.20. R: 0.10. S: 0.65.
---- James W. Roberts (R: 0.07; S: 0.93)
---- Thomas Bonald (R: 0.02; S: 0.21)
---- Sara Oueslati (R: 0.01; S: 0.27)
--- Maher Hamdi (R: 0.01; S: 0.10)
--- Nidhi Hegde (R: 0.00; S: 0.07)
--- Jorma T. Virtamo (R: 0.00; S: 0.05)
--- Alexandre Proutière (R: 0.00; S: 0.04)
--- Ali Ibrahim (R: 0.00; S: 0.05)
-- F: 0.17. R: 0.05. S: 0.19.
--- Laurent Massoulié (R: 0.05; S: 0.37)
--- Mathieu Even (R: 0.00; S: 0.10)
--- Hadrien Hendrikx (R: 0.00; S: 0.07)
-- Marc Lelarge (R: 0.01; S: 0.05)
-- Peter B. Key (R: 0.00; S: 0.05)
- Stratis Ioannidis (R: 0.00; S: 0.03)

Cross-gismo#

Gismo can combine two embeddings to create one hybrid gismo, called a cross-gismo (XGismo). This feature can be used to analyze authors with respect to the words they use (and vice versa).

[42]:
from gismo.gismo import XGismo

gismo = XGismo(x_embedding=a_embedding, y_embedding=embedding)
gismo.diteration.n_iter = 2  # to speed up a little bit computation time

Note that XGismo does not use the underlying corpus, so we can now close the source (the source keeps the file dblp.data open).
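The intuition behind a cross-embedding can be sketched with toy numpy matrices: chaining an author-document matrix with a document-word matrix yields an author-word matrix. This is an illustrative assumption about the principle, not XGismo's actual implementation:

```python
import numpy as np

# Rows of A: authors; columns: documents (who wrote what).
A = np.array([[1, 1, 0],   # author 0 wrote docs 0 and 1
              [0, 1, 1]])  # author 1 wrote docs 1 and 2
# Rows of W: documents; columns: words (term counts).
W = np.array([[2, 0],      # doc 0 uses word 0 twice
              [1, 1],      # doc 1 uses each word once
              [0, 3]])     # doc 2 uses word 1 three times
# The product relates authors directly to the words they use.
AW = A @ W  # -> [[3, 1], [1, 4]]
```

XGismo builds on this kind of chaining (with appropriate normalizations) so that queries on words can rank authors, and conversely.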

[43]:
gismo.post_documents_item = lambda g, i: LDB.author_by_index(i).name
gismo.post_features_cluster = post_features_cluster_print
gismo.post_documents_cluster = post_documents_cluster_print

Let’s try a request.

[44]:
gismo.rank("self-stabilization")
[44]:
True

What are the associated keywords?

[45]:
gismo.get_features_by_rank(k=10)
[45]:
['stabilization',
 'self',
 'self stabilization',
 'stabilize',
 'self stabilize',
 'distribute',
 'fault',
 'sensor',
 'distributed',
 'nonlinear']

How are keywords structured?

[46]:
gismo.get_features_by_cluster(k=20, resolution=0.8)
 F: 0.11. R: 0.02. S: 0.80.
- F: 0.81. R: 0.02. S: 0.80.
-- F: 0.94. R: 0.01. S: 0.81.
--- stabilization (R: 0.00; S: 0.80)
--- self (R: 0.00; S: 0.76)
--- self stabilization (R: 0.00; S: 0.82)
--- stabilize (R: 0.00; S: 0.73)
--- self stabilize (R: 0.00; S: 0.70)
--- stabilizing (R: 0.00; S: 0.75)
-- F: 0.89. R: 0.00. S: 0.75.
--- distribute (R: 0.00; S: 0.67)
--- distributed (R: 0.00; S: 0.77)
-- fault (R: 0.00; S: 0.71)
-- F: 0.94. R: 0.00. S: 0.70.
--- sensor (R: 0.00; S: 0.70)
--- wireless (R: 0.00; S: 0.68)
-- F: 0.96. R: 0.00. S: 0.51.
--- byzantine (R: 0.00; S: 0.50)
--- mobile (R: 0.00; S: 0.55)
--- asynchronous (R: 0.00; S: 0.56)
-- optimal (R: 0.00; S: 0.67)
-- robot (R: 0.00; S: 0.46)
- F: 0.58. R: 0.00. S: 0.19.
-- F: 0.60. R: 0.00. S: 0.12.
--- F: 0.73. R: 0.00. S: 0.09.
---- nonlinear (R: 0.00; S: 0.08)
---- delay (R: 0.00; S: 0.09)
--- linear (R: 0.00; S: 0.24)
-- adaptive (R: 0.00; S: 0.48)

Who are the associated researchers?

[47]:
gismo.get_documents_by_rank(k=10)
[47]:
['Ted Herman',
 'Shlomi Dolev',
 'Sébastien Tixeuil',
 'Sukumar Ghosh',
 'George Varghese',
 'Shay Kutten',
 'Stéphane Devismes',
 'Toshimitsu Masuzawa',
 'Stefan Schmid',
 'Swan Dubois']

How are they structured?

[48]:
gismo.get_documents_by_cluster(k=10, resolution=0.9)
 F: 0.72. R: 0.05. S: 0.83.
- F: 0.94. R: 0.05. S: 0.83.
-- F: 0.96. R: 0.01. S: 0.81.
--- Ted Herman (R: 0.01; S: 0.81)
--- George Varghese (R: 0.00; S: 0.80)
-- F: 0.95. R: 0.02. S: 0.77.
--- Shlomi Dolev (R: 0.01; S: 0.78)
--- Stéphane Devismes (R: 0.00; S: 0.77)
--- Toshimitsu Masuzawa (R: 0.00; S: 0.71)
-- F: 0.98. R: 0.01. S: 0.81.
--- Sébastien Tixeuil (R: 0.01; S: 0.82)
--- Sukumar Ghosh (R: 0.00; S: 0.80)
-- Shay Kutten (R: 0.00; S: 0.84)
-- Swan Dubois (R: 0.00; S: 0.82)
- Stefan Schmid (R: 0.00; S: 0.64)

We can also query researchers. Just use underscores instead of spaces in author names and pass y=False to indicate that the input consists of documents (here, authors).
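The underscore convention keeps a multi-word author name as a single query token (queries are split on whitespace). A trivial sketch of the idea, using a hypothetical helper (``name_to_index`` in this notebook was defined earlier and may work differently):

```python
def to_query_token(name):
    """Turn a multi-word author name into a single whitespace-free token."""
    return name.replace(" ", "_")

to_query_token("Sébastien Tixeuil")  # -> 'Sébastien_Tixeuil'
```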

[49]:
gismo.rank(
    name_to_index("Sébastien_Tixeuil") + name_to_index("Fabien_Mathieu"), y=False
)
[49]:
True

What are the associated keywords?

[50]:
gismo.get_features_by_rank(k=10)
[50]:
['p2p',
 'grid',
 'stabilization',
 'byzantine',
 'reloaded',
 'refresh',
 'self',
 'self stabilization',
 'fun',
 'live streaming']

Using coverage instead of plain ranking can yield other keywords of interest.
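The difference is that coverage favors diversity: instead of taking the k highest-ranked items, it greedily picks items that cover aspects not yet covered. A minimal greedy max-coverage sketch on a toy binary matrix (an illustrative assumption; Gismo's coverage heuristic differs in its details):

```python
import numpy as np

# cover[i, j] == 1 if item i covers topic j.
cover = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
])

def greedy_cover(cover, k):
    """Pick k items, each maximizing the number of newly covered topics."""
    covered = np.zeros(cover.shape[1], dtype=bool)
    picked = []
    for _ in range(k):
        gains = (cover.astype(bool) & ~covered).sum(axis=1)
        gains[picked] = -1  # never pick the same item twice
        i = int(gains.argmax())
        picked.append(i)
        covered |= cover[i].astype(bool)
    return picked

greedy_cover(cover, 2)  # -> [1, 2]: item 1 covers 3 topics, item 2 adds topic 3
```

Note how item 0, despite covering two topics, is skipped: its topics are already covered by item 1. This mirrors why ``get_features_by_coverage`` can surface keywords absent from ``get_features_by_rank``.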

[51]:
gismo.get_features_by_coverage(k=10)
[51]:
['p2p',
 'grid',
 'preference',
 'stabilization',
 'reloaded',
 'refresh',
 'fun',
 'p2p networks',
 'acyclic',
 'streaming']

How are keywords structured?

[52]:
gismo.get_features_by_cluster(k=20, resolution=0.7)
 F: 0.52. R: 0.19. S: 0.67.
- F: 0.60. R: 0.17. S: 0.67.
-- F: 0.81. R: 0.02. S: 0.40.
--- p2p (R: 0.02; S: 0.38)
--- streaming (R: 0.01; S: 0.37)
-- F: 0.82. R: 0.07. S: 0.58.
--- stabilization (R: 0.01; S: 0.55)
--- byzantine (R: 0.01; S: 0.52)
--- self (R: 0.01; S: 0.59)
--- self stabilization (R: 0.01; S: 0.56)
--- stabilize (R: 0.01; S: 0.58)
--- self stabilize (R: 0.01; S: 0.57)
--- gathering (R: 0.01; S: 0.50)
-- reloaded (R: 0.01; S: 0.39)
-- F: 0.90. R: 0.04. S: 0.35.
--- refresh (R: 0.01; S: 0.33)
--- live streaming (R: 0.01; S: 0.35)
--- pagerank (R: 0.01; S: 0.33)
--- old (R: 0.01; S: 0.34)
--- live (R: 0.01; S: 0.38)
-- fun (R: 0.01; S: 0.50)
-- p2p networks (R: 0.01; S: 0.38)
-- acyclic (R: 0.01; S: 0.38)
- grid (R: 0.01; S: 0.50)
- preference (R: 0.01; S: 0.31)

Who are the associated researchers?

[53]:
gismo.get_documents_by_rank(k=10)
[53]:
['Sébastien Tixeuil',
 'Fabien Mathieu',
 'Shlomi Dolev',
 'Michel Raynal',
 'Maria Potop-Butucaru',
 'Athanasios Mazarakis',
 'Toshimitsu Masuzawa',
 'Stéphane Devismes',
 'Fukuhito Ooshita',
 'Elad Michael Schiller']

How are they structured?

[54]:
gismo.get_documents_by_cluster(k=10, resolution=0.8)
 F: 0.07. R: 0.00. S: 0.62.
- F: 0.57. R: 0.00. S: 0.48.
-- F: 0.86. R: 0.00. S: 0.48.
--- Sébastien Tixeuil (R: 0.00; S: 0.53)
--- Shlomi Dolev (R: 0.00; S: 0.41)
--- Maria Potop-Butucaru (R: 0.00; S: 0.48)
--- Toshimitsu Masuzawa (R: 0.00; S: 0.44)
--- Stéphane Devismes (R: 0.00; S: 0.41)
--- Fukuhito Ooshita (R: 0.00; S: 0.48)
--- Elad Michael Schiller (R: 0.00; S: 0.36)
-- Michel Raynal (R: 0.00; S: 0.36)
- F: 0.30. R: 0.00. S: 0.64.
-- Fabien Mathieu (R: 0.00; S: 0.71)
-- Athanasios Mazarakis (R: 0.00; S: 0.19)