Landmarks Tutorial#

In this notebook, we will use the landmarks submodule of Gismo to give an interactive description of ACM topics and researchers of the NPA laboratory (https://www-npa.lip6.fr/).

This notebook can be used as a blueprint to analyze other groups of people under the scope of a topic classification.

Before starting this tutorial, it is recommended to have looked at the ACM and DBLP tutorials.

NPA Researchers#

In this section, we bind the NPA researchers with their DBLP id.

List of DBLP researchers#

First, we open the DBLP database (see the DBLP Tutorial to get your copy of the database).

[1]:
from pathlib import Path

path = Path("../../../../../Datasets/DBLP")

from gismo.filesource import FileSource
source = FileSource(filename="dblp", path=path)

source is a list-like object whose entries are dicts that describe articles.

[2]:
source[1234567]
[2]:
{'type': 'article',
 'authors': ['Hans Solli-Sæther', 'Petter Gottschalk'],
 'title': 'The Modeling Process for Stage Models.',
 'year': '2010',
 'venue': 'J. Organ. Comput. Electron. Commer.'}

Let’s extract the set of authors. Spaces in each author name are replaced with underscores, so that each author becomes a single token for later processing.

[3]:
dblp_authors = {a.replace(" ", "_") for paper in source for a in paper['authors']}
[4]:
"Fabien_Mathieu" in dblp_authors
[4]:
True
[5]:
"Fabin_Mathieu" in dblp_authors
[5]:
False

List of NPA Members#

First, we get a copy of the NPA webpage that lists its researchers and feed it to BeautifulSoup.

[6]:
import requests
from bs4 import BeautifulSoup as bs
soup = bs(requests.get('https://www.lip6.fr/recherche/team_membres.php?id=760').text, 'html.parser')
[7]:
def parse_people_str(txt):
    # The two name parts are separated by a non-breaking space; swap their order
    sp = '\xa0'
    n, s = txt.split(sp)
    return f"{s} {n}"
people = [parse_people_str(td('a')[1].text) for td in soup.table('td') if td('a')]
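As an illustration, here is a toy run of the same parsing logic on a made-up entry (the exact page layout, surname first then given name, is an assumption here):

```python
# Toy re-implementation of parse_people_str, for illustration only.
# Assumption: the page displays "SURNAME\xa0Firstname" (non-breaking space);
# we swap the two parts to build a display name.
def parse_people_str(txt):
    first_part, second_part = txt.split('\xa0')
    return f"{second_part} {first_part}"

print(parse_people_str("MATHIEU\xa0Fabien"))  # -> Fabien MATHIEU
```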

We make a function to convert a researcher name into a researcher dict. Each dict will have a name (display name) and a dblp (DBLP id) entry.

[8]:
from bof.fuzz import Process

p = Process()
p.fit(list(dblp_authors))
[9]:
def name2dict(name, dblp_authors, manual=None):
    """
    Name to dict conversion

    Parameters
    ----------
    name: str
        The display name to convert
    dblp_authors: set
        The set of DBLP author ids
    manual: dict, optional
        Manual associations between names and ids

    Returns
    -------
    dict
        A dict shaped like {'name': "John Doe", 'dblp': "John_Doe"}
    """
    if manual is None:
        manual = {}
    # Manual association
    if name in manual:
        return {'name': name, 'dblp': manual[name]}
    # Attempt direct transformation
    dblp = name.replace(" ", "_")
    # If the result exists in DBLP, return it
    if dblp in dblp_authors:
        return {'name': name, 'dblp': dblp}
    # Last chance: use bof to guess the right id
    print(f"No direct dblp entry found for {name}, start fuzzy search")
    candidate = p.extractOne(name.lower().replace(" ", "_"))
    if candidate and candidate[1] > 60:
        print(f"Found candidate: {candidate}")
        dblp = candidate[0]
        return {'name': name, 'dblp': dblp}
    # If all failed, return an empty id
    return {'name': name, 'dblp': ""}
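A stdlib-only alternative to the bof fuzzy fallback can be sketched with difflib.get_close_matches (the toy author set below is made up):

```python
from difflib import get_close_matches

# Hypothetical mini author set standing in for dblp_authors
toy_authors = ["fabien_mathieu", "fabien_de_montgolfier", "laurent_viennot"]

def fuzzy_dblp(name, authors, cutoff=0.6):
    """Return the closest author id, or an empty string if nothing is close."""
    key = name.lower().replace(" ", "_")
    matches = get_close_matches(key, authors, n=1, cutoff=cutoff)
    return matches[0] if matches else ""

print(fuzzy_dblp("Fabin Mathieu", toy_authors))  # typo still matches
print(fuzzy_dblp("Someone Else", toy_authors))   # no match, prints an empty line
```

Note that difflib's ratio is not on the same scale as bof's score, so the cutoffs are not directly comparable.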

The manual override below was actually populated by executing the next cell and iterating until everything was OK.

[10]:
manual = {"Giovanni Pau": "Giovanni_Pau_0001"}

The actual construction of the list of NPA researchers.

[11]:
npa = [name2dict(name, dblp_authors, manual) for name in people]
npa = [c for c in npa if c['dblp'] and "Guillaume" not in c['name']]
No direct dblp entry found for Capucine Barré, start fuzzy search
No direct dblp entry found for Ufuk Bombar, start fuzzy search
No direct dblp entry found for Mateus Da Silva Gilbert, start fuzzy search
Found candidate: ('Mateus_da_Silva_Gilbert', 100.0)
No direct dblp entry found for Lorenzo Di Filippo, start fuzzy search
No direct dblp entry found for Stefan Galkiewicz, start fuzzy search
No direct dblp entry found for Dimitrios Kefalas, start fuzzy search
Found candidate: ('Dimitris_Kefalas', 69.1358024691358)
No direct dblp entry found for Guillaume Nibert, start fuzzy search
Found candidate: ('Guillaume_Gibert', 63.75)
No direct dblp entry found for Elif Ebru Ohri, start fuzzy search
No direct dblp entry found for Giuliano Prestes Fittipaldi, start fuzzy search
Found candidate: ('Giuliano_Fittipaldi', 61.016949152542374)
No direct dblp entry found for Saied Kazemi, start fuzzy search
Found candidate: ('Saied_Kazeminejad', 68.1159420289855)
No direct dblp entry found for Émilie Mespoulhes, start fuzzy search
No direct dblp entry found for Hassane Rahich, start fuzzy search
No direct dblp entry found for Frédéric Vaissade, start fuzzy search
No direct dblp entry found for Taha Mohsen Zaidi, start fuzzy search
No direct dblp entry found for Thi-Mai-Trang Nguyen, start fuzzy search
Found candidate: ('Thi-Trang_Nguyen', 73.80952380952381)
No direct dblp entry found for José-Marcos Nogueira, start fuzzy search
Found candidate: ('Marcos_Nogueira', 73.80952380952381)
No direct dblp entry found for Andrea Richa, start fuzzy search
Found candidate: ('Andrea_Richaud', 83.92857142857143)
No direct dblp entry found for Téo Lohrer, start fuzzy search
No direct dblp entry found for James Kurose, start fuzzy search
Found candidate: ('James_F._Kurose', 61.76470588235294)
No direct dblp entry found for Ufuk Bombar, start fuzzy search
No direct dblp entry found for Liuba Shira, start fuzzy search
Found candidate: ('Liuba_Shrira', 60.714285714285715)
[12]:
npa
[12]:
[{'name': 'Sébastien Baey', 'dblp': 'Sébastien_Baey'},
 {'name': 'Bruno Baynat', 'dblp': 'Bruno_Baynat'},
 {'name': 'Quentin Bramas', 'dblp': 'Quentin_Bramas'},
 {'name': 'Binh-Minh Bui-Xuan', 'dblp': 'Binh-Minh_Bui-Xuan'},
 {'name': 'Marcelo Dias de Amorim', 'dblp': 'Marcelo_Dias_de_Amorim'},
 {'name': 'Serge Fdida', 'dblp': 'Serge_Fdida'},
 {'name': 'Anne Fladenmuller', 'dblp': 'Anne_Fladenmuller'},
 {'name': 'Francesca Fossati', 'dblp': 'Francesca_Fossati'},
 {'name': 'Olivier Fourmaux', 'dblp': 'Olivier_Fourmaux'},
 {'name': 'Timur Friedman', 'dblp': 'Timur_Friedman'},
 {'name': 'Brigitte Kervella', 'dblp': 'Brigitte_Kervella'},
 {'name': 'Naceur Malouch', 'dblp': 'Naceur_Malouch'},
 {'name': 'Maria Potop-Butucaru', 'dblp': 'Maria_Potop-Butucaru'},
 {'name': 'Guy Pujolle', 'dblp': 'Guy_Pujolle'},
 {'name': 'Kim Loan Thai', 'dblp': 'Kim_Loan_Thai'},
 {'name': 'Sébastien Tixeuil', 'dblp': 'Sébastien_Tixeuil'},
 {'name': 'Bilel Zaghdoudi', 'dblp': 'Bilel_Zaghdoudi'},
 {'name': 'Solayman Ayoubi', 'dblp': 'Solayman_Ayoubi'},
 {'name': 'Lionel Beltrando', 'dblp': 'Lionel_Beltrando'},
 {'name': 'Mateus Da Silva Gilbert', 'dblp': 'Mateus_da_Silva_Gilbert'},
 {'name': 'Cherifa Hamroun', 'dblp': 'Cherifa_Hamroun'},
 {'name': 'Dimitrios Kefalas', 'dblp': 'Dimitris_Kefalas'},
 {'name': 'Mohamed Amine Legheraba', 'dblp': 'Mohamed_Amine_Legheraba'},
 {'name': 'Alexandre Pham', 'dblp': 'Alexandre_Pham'},
 {'name': 'Giuliano Prestes Fittipaldi', 'dblp': 'Giuliano_Fittipaldi'},
 {'name': 'Hugo Rimlinger', 'dblp': 'Hugo_Rimlinger'},
 {'name': 'Alexandros Stoltidis', 'dblp': 'Alexandros_Stoltidis'},
 {'name': 'Massinissa Tighilt', 'dblp': 'Massinissa_Tighilt'},
 {'name': 'Theodoros Tsourdinis', 'dblp': 'Theodoros_Tsourdinis'},
 {'name': 'Florent Krasnopol', 'dblp': 'Florent_Krasnopol'},
 {'name': 'Saied Kazemi', 'dblp': 'Saied_Kazeminejad'},
 {'name': 'Elena Nardi', 'dblp': 'Elena_Nardi'},
 {'name': 'Nicolas Peugnet', 'dblp': 'Nicolas_Peugnet'},
 {'name': 'Anastasios Giovanidis', 'dblp': 'Anastasios_Giovanidis'},
 {'name': 'Fabien Mathieu', 'dblp': 'Fabien_Mathieu'},
 {'name': 'Thi-Mai-Trang Nguyen', 'dblp': 'Thi-Trang_Nguyen'},
 {'name': 'Ciro Scognamiglio', 'dblp': 'Ciro_Scognamiglio'},
 {'name': 'Lélia Blin', 'dblp': 'Lélia_Blin'},
 {'name': 'José-Marcos Nogueira', 'dblp': 'Marcos_Nogueira'},
 {'name': 'Andrea Richa', 'dblp': 'Andrea_Richaud'},
 {'name': 'Miguel Elias Mitre Campista',
  'dblp': 'Miguel_Elias_Mitre_Campista'},
 {'name': 'James Kurose', 'dblp': 'James_F._Kurose'},
 {'name': 'Ciro Scognamiglio', 'dblp': 'Ciro_Scognamiglio'},
 {'name': 'Liuba Shira', 'dblp': 'Liuba_Shrira'},
 {'name': 'Maurice Herlihy', 'dblp': 'Maurice_Herlihy'},
 {'name': 'Dipankar Raychaudhuri', 'dblp': 'Dipankar_Raychaudhuri'},
 {'name': 'Boris Chan Yip Hon', 'dblp': 'Boris_Chan_Yip_Hon'}]

DBLP Gismo#

In this section, we use Landmarks to construct a small XGismo focused on the NPA researchers. In detail:

  • We construct a large Gismo between articles and researchers, exactly like in the DBLP tutorial;

  • We use landmarks to extract a (much smaller) list of articles based on collaboration proximity;

  • We build an XGismo between researchers and keywords from this smaller source.

Construction of a full Gismo on authors#

This part is similar to the one from the DBLP tutorial.

[13]:
from gismo import Corpus, Embedding, CountVectorizer, Gismo
vectorizer_author = CountVectorizer(dtype=float, preprocessor=lambda x:x, tokenizer=lambda x: x.split(' '))
[14]:
def to_authors_text(dic):
    return " ".join([a.replace(' ', '_') for a in dic['authors']])
corpus = Corpus(source, to_text=to_authors_text)
[15]:
try:
    embedding = Embedding.load(filename="dblp_aut_embedding", path=path)
except Exception:
    embedding = Embedding(vectorizer=vectorizer_author)
    embedding.fit_transform(corpus)
    embedding.dump(filename="dblp_aut_embedding", path=path)
C:\Users\loufa\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\feature_extraction\text.py:517: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
  warnings.warn(
[16]:
gismo = Gismo(corpus, embedding)

Given the size of the dataset, processing a query can take about one second.

[17]:
gismo.rank("Fabien_Mathieu")
[17]:
True
[18]:
from gismo.post_processing import post_features_cluster_print
gismo.post_features_cluster = post_features_cluster_print
gismo.get_features_by_cluster()
 F: 0.02. R: 0.26. S: 0.72.
- F: 0.02. R: 0.25. S: 0.71.
-- F: 0.03. R: 0.24. S: 0.70.
--- F: 0.07. R: 0.22. S: 0.67.
---- F: 0.07. R: 0.19. S: 0.60.
----- F: 0.37. R: 0.18. S: 0.55.
------ Fabien_Mathieu (R: 0.11; S: 1.00)
------ F: 0.61. R: 0.03. S: 0.37.
------- François_Durand (R: 0.02; S: 0.44)
------- Ludovic_Noirie (R: 0.01; S: 0.32)
------- Emma_Caizergues (R: 0.01; S: 0.26)
------ F: 0.61. R: 0.03. S: 0.39.
------- Laurent_Viennot (R: 0.02; S: 0.44)
------- F: 0.70. R: 0.02. S: 0.34.
-------- Diego_Perino (R: 0.01; S: 0.39)
-------- Yacine_Boufkhad (R: 0.01; S: 0.28)
----- F: 0.33. R: 0.01. S: 0.28.
------ The_Dang_Huynh (R: 0.01; S: 0.27)
------ Dohy_Hong (R: 0.00; S: 0.17)
---- F: 0.25. R: 0.03. S: 0.36.
----- F: 0.77. R: 0.02. S: 0.34.
------ Julien_Reynier (R: 0.01; S: 0.35)
------ Fabien_de_Montgolfier (R: 0.01; S: 0.36)
------ Anh-Tuan_Gai (R: 0.00; S: 0.28)
----- Gheorghe_Postelnicu (R: 0.00; S: 0.17)
--- F: 0.40. R: 0.02. S: 0.32.
---- F: 0.58. R: 0.02. S: 0.31.
----- Céline_Comte (R: 0.01; S: 0.31)
----- Thomas_Bonald (R: 0.01; S: 0.23)
---- Anne_Bouillard (R: 0.00; S: 0.20)
--- Nidhi_Hegde_0001 (R: 0.00; S: 0.20)
-- F: 1.00. R: 0.01. S: 0.23.
--- Ilkka_Norros (R: 0.01; S: 0.23)
--- François_Baccelli (R: 0.00; S: 0.23)
- Mohamed_Bouklit (R: 0.01; S: 0.18)

Using landmarks to shrink a source#

To reduce the size of the dataset, we make landmarks out of the researchers, and we credit each entry with a budget of 2,000 articles.

[19]:
from gismo.landmarks import Landmarks
npa_landmarks_full = Landmarks(source=npa, to_text=lambda x: x['dblp'],
                                 x_density=2000)
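The reduction that x_density drives can be sketched as follows: each landmark keeps its top-ranked articles (its budget), and the reduced source is the union of these neighborhoods. This toy version, with made-up rankings, illustrates the principle and is not Gismo's actual implementation:

```python
# Toy sketch of the source-reduction principle (not Gismo's implementation).
# Each landmark contributes its top-`budget` document ids (best first);
# the reduced source is the union of these neighborhoods.
ranked_docs = {  # hypothetical per-landmark rankings
    "alice": [3, 1, 4, 1, 5],
    "bob": [2, 7, 1, 8, 2],
}

def reduce_source(rankings, budget):
    kept = set()
    for docs in rankings.values():
        kept.update(docs[:budget])
    return sorted(kept)

print(reduce_source(ranked_docs, budget=3))  # -> [1, 2, 3, 4, 7]
```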

We launch the computation of the source. This takes a couple of minutes, as a ranking diffusion needs to be performed for each researcher.

[20]:
import logging
logging.basicConfig()
log = logging.getLogger("Gismo")
log.setLevel(level=logging.DEBUG)
[21]:
reduced_source = npa_landmarks_full.get_reduced_source(gismo)
INFO:Gismo:Start computation of 47 landmarks.
DEBUG:Gismo:Processing Sébastien_Baey.
DEBUG:Gismo:Landmarks of Sébastien_Baey computed.
DEBUG:Gismo:Processing Bruno_Baynat.
DEBUG:Gismo:Landmarks of Bruno_Baynat computed.
DEBUG:Gismo:Processing Quentin_Bramas.
DEBUG:Gismo:Landmarks of Quentin_Bramas computed.
DEBUG:Gismo:Processing Binh-Minh_Bui-Xuan.
DEBUG:Gismo:Landmarks of Binh-Minh_Bui-Xuan computed.
DEBUG:Gismo:Processing Marcelo_Dias_de_Amorim.
DEBUG:Gismo:Landmarks of Marcelo_Dias_de_Amorim computed.
DEBUG:Gismo:Processing Serge_Fdida.
DEBUG:Gismo:Landmarks of Serge_Fdida computed.
DEBUG:Gismo:Processing Anne_Fladenmuller.
DEBUG:Gismo:Landmarks of Anne_Fladenmuller computed.
DEBUG:Gismo:Processing Francesca_Fossati.
DEBUG:Gismo:Landmarks of Francesca_Fossati computed.
DEBUG:Gismo:Processing Olivier_Fourmaux.
DEBUG:Gismo:Landmarks of Olivier_Fourmaux computed.
DEBUG:Gismo:Processing Timur_Friedman.
DEBUG:Gismo:Landmarks of Timur_Friedman computed.
DEBUG:Gismo:Processing Brigitte_Kervella.
DEBUG:Gismo:Landmarks of Brigitte_Kervella computed.
DEBUG:Gismo:Processing Naceur_Malouch.
DEBUG:Gismo:Landmarks of Naceur_Malouch computed.
DEBUG:Gismo:Processing Maria_Potop-Butucaru.
DEBUG:Gismo:Landmarks of Maria_Potop-Butucaru computed.
DEBUG:Gismo:Processing Guy_Pujolle.
DEBUG:Gismo:Landmarks of Guy_Pujolle computed.
DEBUG:Gismo:Processing Kim_Loan_Thai.
DEBUG:Gismo:Landmarks of Kim_Loan_Thai computed.
DEBUG:Gismo:Processing Sébastien_Tixeuil.
DEBUG:Gismo:Landmarks of Sébastien_Tixeuil computed.
DEBUG:Gismo:Processing Bilel_Zaghdoudi.
DEBUG:Gismo:Landmarks of Bilel_Zaghdoudi computed.
DEBUG:Gismo:Processing Solayman_Ayoubi.
DEBUG:Gismo:Landmarks of Solayman_Ayoubi computed.
DEBUG:Gismo:Processing Lionel_Beltrando.
DEBUG:Gismo:Landmarks of Lionel_Beltrando computed.
DEBUG:Gismo:Processing Mateus_da_Silva_Gilbert.
DEBUG:Gismo:Landmarks of Mateus_da_Silva_Gilbert computed.
DEBUG:Gismo:Processing Cherifa_Hamroun.
DEBUG:Gismo:Landmarks of Cherifa_Hamroun computed.
DEBUG:Gismo:Processing Dimitris_Kefalas.
DEBUG:Gismo:Landmarks of Dimitris_Kefalas computed.
DEBUG:Gismo:Processing Mohamed_Amine_Legheraba.
DEBUG:Gismo:Landmarks of Mohamed_Amine_Legheraba computed.
DEBUG:Gismo:Processing Alexandre_Pham.
DEBUG:Gismo:Landmarks of Alexandre_Pham computed.
DEBUG:Gismo:Processing Giuliano_Fittipaldi.
DEBUG:Gismo:Landmarks of Giuliano_Fittipaldi computed.
DEBUG:Gismo:Processing Hugo_Rimlinger.
DEBUG:Gismo:Landmarks of Hugo_Rimlinger computed.
DEBUG:Gismo:Processing Alexandros_Stoltidis.
DEBUG:Gismo:Landmarks of Alexandros_Stoltidis computed.
DEBUG:Gismo:Processing Massinissa_Tighilt.
DEBUG:Gismo:Landmarks of Massinissa_Tighilt computed.
DEBUG:Gismo:Processing Theodoros_Tsourdinis.
DEBUG:Gismo:Landmarks of Theodoros_Tsourdinis computed.
DEBUG:Gismo:Processing Florent_Krasnopol.
DEBUG:Gismo:Landmarks of Florent_Krasnopol computed.
DEBUG:Gismo:Processing Saied_Kazeminejad.
DEBUG:Gismo:Landmarks of Saied_Kazeminejad computed.
DEBUG:Gismo:Processing Elena_Nardi.
DEBUG:Gismo:Landmarks of Elena_Nardi computed.
DEBUG:Gismo:Processing Nicolas_Peugnet.
DEBUG:Gismo:Landmarks of Nicolas_Peugnet computed.
DEBUG:Gismo:Processing Anastasios_Giovanidis.
DEBUG:Gismo:Landmarks of Anastasios_Giovanidis computed.
DEBUG:Gismo:Processing Fabien_Mathieu.
DEBUG:Gismo:Landmarks of Fabien_Mathieu computed.
DEBUG:Gismo:Processing Thi-Trang_Nguyen.
DEBUG:Gismo:Landmarks of Thi-Trang_Nguyen computed.
DEBUG:Gismo:Processing Ciro_Scognamiglio.
DEBUG:Gismo:Landmarks of Ciro_Scognamiglio computed.
DEBUG:Gismo:Processing Lélia_Blin.
DEBUG:Gismo:Landmarks of Lélia_Blin computed.
DEBUG:Gismo:Processing Marcos_Nogueira.
DEBUG:Gismo:Landmarks of Marcos_Nogueira computed.
DEBUG:Gismo:Processing Andrea_Richaud.
DEBUG:Gismo:Landmarks of Andrea_Richaud computed.
DEBUG:Gismo:Processing Miguel_Elias_Mitre_Campista.
DEBUG:Gismo:Landmarks of Miguel_Elias_Mitre_Campista computed.
DEBUG:Gismo:Processing James_F._Kurose.
DEBUG:Gismo:Landmarks of James_F._Kurose computed.
DEBUG:Gismo:Processing Ciro_Scognamiglio.
DEBUG:Gismo:Landmarks of Ciro_Scognamiglio computed.
DEBUG:Gismo:Processing Liuba_Shrira.
DEBUG:Gismo:Landmarks of Liuba_Shrira computed.
DEBUG:Gismo:Processing Maurice_Herlihy.
DEBUG:Gismo:Landmarks of Maurice_Herlihy computed.
DEBUG:Gismo:Processing Dipankar_Raychaudhuri.
DEBUG:Gismo:Landmarks of Dipankar_Raychaudhuri computed.
DEBUG:Gismo:Processing Boris_Chan_Yip_Hon.
DEBUG:Gismo:Landmarks of Boris_Chan_Yip_Hon computed.
INFO:Gismo:All landmarks are built.
[22]:
print(f"Source length went down from {len(source)} to {len(reduced_source)}.")
Source length went down from 7513218 to 50350.

Instead of 7,500,000 all-purpose articles, we now have 50,000 articles lying in the neighborhood of the considered researchers. We can now close the file source, as we won’t need further access to the original data.

[23]:
source.close()

Building a small XGismo#

Author Embedding#

Author embedding takes a couple of seconds instead of a couple of minutes.

[24]:
reduced_corpus = Corpus(reduced_source, to_text=to_authors_text)
reduced_author_embedding = Embedding(vectorizer=vectorizer_author)
reduced_author_embedding.fit_transform(reduced_corpus)
C:\Users\loufa\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\feature_extraction\text.py:517: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
  warnings.warn(

Sanity Check#

We can rebuild a small author Gismo. This part is merely a sanity check to verify that the reduction didn’t change things too much in the vicinity of the NPA.

[25]:
reduced_gismo = Gismo(reduced_corpus, reduced_author_embedding)

Ranking is nearly instant.

[26]:
reduced_gismo.rank("Fabien_Mathieu")
[26]:
True

The results are almost identical to what was returned by the full Gismo.

[27]:
from gismo.post_processing import post_features_cluster_print
reduced_gismo.post_features_cluster = post_features_cluster_print
reduced_gismo.get_features_by_cluster()
 F: 0.02. R: 0.25. S: 0.72.
- F: 0.02. R: 0.25. S: 0.70.
-- F: 0.03. R: 0.24. S: 0.70.
--- F: 0.06. R: 0.22. S: 0.67.
---- F: 0.07. R: 0.19. S: 0.60.
----- F: 0.37. R: 0.18. S: 0.55.
------ Fabien_Mathieu (R: 0.11; S: 1.00)
------ F: 0.61. R: 0.03. S: 0.37.
------- François_Durand (R: 0.02; S: 0.44)
------- Ludovic_Noirie (R: 0.01; S: 0.32)
------- Emma_Caizergues (R: 0.01; S: 0.27)
------ F: 0.60. R: 0.03. S: 0.39.
------- Laurent_Viennot (R: 0.01; S: 0.44)
------- F: 0.68. R: 0.02. S: 0.33.
-------- Diego_Perino (R: 0.01; S: 0.39)
-------- Yacine_Boufkhad (R: 0.01; S: 0.27)
----- F: 0.33. R: 0.01. S: 0.28.
------ The_Dang_Huynh (R: 0.01; S: 0.27)
------ Dohy_Hong (R: 0.00; S: 0.17)
---- F: 0.26. R: 0.03. S: 0.35.
----- F: 0.77. R: 0.02. S: 0.34.
------ Julien_Reynier (R: 0.01; S: 0.35)
------ Fabien_de_Montgolfier (R: 0.01; S: 0.35)
------ Anh-Tuan_Gai (R: 0.01; S: 0.27)
----- Gheorghe_Postelnicu (R: 0.00; S: 0.17)
--- F: 0.58. R: 0.02. S: 0.31.
---- Céline_Comte (R: 0.01; S: 0.31)
---- Thomas_Bonald (R: 0.00; S: 0.23)
--- Nidhi_Hegde_0001 (R: 0.00; S: 0.20)
-- F: 1.00. R: 0.01. S: 0.23.
--- Ilkka_Norros (R: 0.01; S: 0.23)
--- François_Baccelli (R: 0.00; S: 0.23)
- Mohamed_Bouklit (R: 0.01; S: 0.18)

Word Embedding#

Now we build the word embedding, with the spacy add-on. This takes a couple of minutes instead of a couple of hours.

[28]:
import spacy
# Initialize spacy 'en' model, keeping only tagger component needed for lemmatization
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
# Who cares about DET and such?
keep = {'ADJ', 'NOUN', 'NUM', 'PROPN', 'SYM', 'VERB'}

preprocessor=lambda txt: " ".join([token.lemma_.lower() for token in nlp(txt)
                                   if token.pos_ in keep and not token.is_stop])
vectorizer_text = CountVectorizer(dtype=float, min_df=5, max_df=.02, ngram_range=(1, 3), preprocessor=preprocessor)
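To get a feel for what such a preprocessor produces, here is a naive, spacy-free approximation (lowercasing plus a tiny hardcoded stop-list instead of real POS tagging and lemmatization, so its output differs from spacy's):

```python
# Naive approximation of the title preprocessor, for illustration only:
# real POS filtering and lemmatization are replaced by a tiny stop-word list.
toy_stop = {"the", "of", "for", "a", "an", "in", "and", "on"}

def naive_preprocess(txt):
    return " ".join(w.lower() for w in txt.split() if w.lower() not in toy_stop)

print(naive_preprocess("The Modeling Process for Stage Models"))
# -> modeling process stage models
```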
[29]:
reduced_corpus.to_text = lambda e: e['title']
reduced_word_embedding = Embedding(vectorizer=vectorizer_text)
reduced_word_embedding.fit_transform(reduced_corpus)

Gathering pieces together#

We can combine the reduced embeddings to build a XGismo between authors and words.

[30]:
from gismo.gismo import XGismo
xgismo = XGismo(x_embedding=reduced_author_embedding, y_embedding=reduced_word_embedding)

We can save this for later use.

[31]:
xgismo.dump(filename="reduced_npa_xgismo", path=path, overwrite=True)

The file should be about 32 MB, whereas a full-size DBLP XGismo is about 4 GB. What about speed and quality of results?

[32]:
xgismo.rank("Anne_Bouillard", y=False)
[32]:
True
[33]:
xgismo.get_documents_by_rank()
[33]:
['Anne_Bouillard',
 'Bruno_Gaujal',
 'François_Baccelli',
 'Paul_Nikolaus',
 'Jens_B._Schmitt',
 'Ana_Busic',
 'Eric_Thierry',
 'Ke_Feng',
 'Giovanni_Stea',
 'Zhen_Liu_0001',
 'Seyed_Mohammadhossein_Tabatabaee',
 'Aurore_Junier',
 'Jean-Yves_Le_Boudec',
 'Christelle_Rovetta',
 'Jean_Mairesse',
 'Oded_Goldreich_0001',
 'Yves_Dallery',
 'Donald_F._Towsley',
 'Leandros_Tassiulas',
 'Nissim_Francez']

Let’s try some more elaborate display.

[34]:
from gismo.post_processing import post_documents_cluster_print, post_features_cluster_print
xgismo.parameters.distortion = 1.0
xgismo.post_documents_cluster = post_documents_cluster_print
xgismo.post_features_cluster = post_features_cluster_print
xgismo.get_documents_by_cluster()
 F: 0.06. R: 0.09. S: 0.82.
- F: 0.60. R: 0.06. S: 0.76.
-- F: 0.69. R: 0.06. S: 0.76.
--- F: 0.85. R: 0.06. S: 0.75.
---- F: 0.92. R: 0.05. S: 0.75.
----- Anne_Bouillard (R: 0.03; S: 0.79)
----- Paul_Nikolaus (R: 0.00; S: 0.73)
----- Jens_B._Schmitt (R: 0.00; S: 0.73)
----- Eric_Thierry (R: 0.00; S: 0.75)
----- Ke_Feng (R: 0.00; S: 0.70)
---- Bruno_Gaujal (R: 0.01; S: 0.75)
--- Aurore_Junier (R: 0.00; S: 0.66)
-- François_Baccelli (R: 0.00; S: 0.59)
-- Nissim_Francez (R: 0.00; S: 0.49)
- F: 0.17. R: 0.02. S: 0.43.
-- F: 0.55. R: 0.01. S: 0.26.
--- F: 0.55. R: 0.01. S: 0.25.
---- Ana_Busic (R: 0.00; S: 0.33)
---- Zhen_Liu_0001 (R: 0.00; S: 0.33)
---- Christelle_Rovetta (R: 0.00; S: 0.21)
---- Donald_F._Towsley (R: 0.00; S: 0.29)
--- Yves_Dallery (R: 0.00; S: 0.24)
-- F: 0.28. R: 0.01. S: 0.36.
--- F: 0.59. R: 0.01. S: 0.35.
---- Giovanni_Stea (R: 0.00; S: 0.33)
---- F: 1.00. R: 0.01. S: 0.30.
----- Seyed_Mohammadhossein_Tabatabaee (R: 0.00; S: 0.30)
----- Jean-Yves_Le_Boudec (R: 0.00; S: 0.30)
--- Oded_Goldreich_0001 (R: 0.00; S: 0.27)
-- Leandros_Tassiulas (R: 0.00; S: 0.31)
- Jean_Mairesse (R: 0.00; S: 0.27)
[35]:
xgismo.get_features_by_cluster(target_k=1.5, resolution=.5, distortion=.5)
 F: 0.32. R: 0.20. S: 0.94.
- F: 0.50. R: 0.19. S: 0.94.
-- F: 0.76. R: 0.18. S: 0.93.
--- network calculus (R: 0.05; S: 0.91)
--- calculus (R: 0.05; S: 0.92)
--- bad case (R: 0.01; S: 0.91)
--- bad (R: 0.01; S: 0.91)
--- free choice (R: 0.01; S: 0.82)
--- stochastic network (R: 0.01; S: 0.90)
--- multiplexing (R: 0.01; S: 0.89)
--- closed queue (R: 0.01; S: 0.85)
-- stochastic (R: 0.01; S: 0.59)
- monotonicity (R: 0.02; S: 0.36)

Rebuild landmarks#

NPA landmarks#

We can rebuild NPA landmarks on the XGismo.

[36]:
npa_landmarks = Landmarks(source=npa, to_text=lambda x: x['dblp'],
                           rank = lambda g, q: g.rank(q, y=False))
npa_landmarks.fit(xgismo)
INFO:Gismo:Start computation of 47 landmarks.
DEBUG:Gismo:Processing Sébastien_Baey.
DEBUG:Gismo:Landmarks of Sébastien_Baey computed.
DEBUG:Gismo:Processing Bruno_Baynat.
DEBUG:Gismo:Landmarks of Bruno_Baynat computed.
DEBUG:Gismo:Processing Quentin_Bramas.
DEBUG:Gismo:Landmarks of Quentin_Bramas computed.
DEBUG:Gismo:Processing Binh-Minh_Bui-Xuan.
DEBUG:Gismo:Landmarks of Binh-Minh_Bui-Xuan computed.
DEBUG:Gismo:Processing Marcelo_Dias_de_Amorim.
DEBUG:Gismo:Landmarks of Marcelo_Dias_de_Amorim computed.
DEBUG:Gismo:Processing Serge_Fdida.
DEBUG:Gismo:Landmarks of Serge_Fdida computed.
DEBUG:Gismo:Processing Anne_Fladenmuller.
DEBUG:Gismo:Landmarks of Anne_Fladenmuller computed.
DEBUG:Gismo:Processing Francesca_Fossati.
DEBUG:Gismo:Landmarks of Francesca_Fossati computed.
DEBUG:Gismo:Processing Olivier_Fourmaux.
DEBUG:Gismo:Landmarks of Olivier_Fourmaux computed.
DEBUG:Gismo:Processing Timur_Friedman.
DEBUG:Gismo:Landmarks of Timur_Friedman computed.
DEBUG:Gismo:Processing Brigitte_Kervella.
DEBUG:Gismo:Landmarks of Brigitte_Kervella computed.
DEBUG:Gismo:Processing Naceur_Malouch.
DEBUG:Gismo:Landmarks of Naceur_Malouch computed.
DEBUG:Gismo:Processing Maria_Potop-Butucaru.
DEBUG:Gismo:Landmarks of Maria_Potop-Butucaru computed.
DEBUG:Gismo:Processing Guy_Pujolle.
DEBUG:Gismo:Landmarks of Guy_Pujolle computed.
DEBUG:Gismo:Processing Kim_Loan_Thai.
DEBUG:Gismo:Landmarks of Kim_Loan_Thai computed.
DEBUG:Gismo:Processing Sébastien_Tixeuil.
DEBUG:Gismo:Landmarks of Sébastien_Tixeuil computed.
DEBUG:Gismo:Processing Bilel_Zaghdoudi.
DEBUG:Gismo:Landmarks of Bilel_Zaghdoudi computed.
DEBUG:Gismo:Processing Solayman_Ayoubi.
DEBUG:Gismo:Landmarks of Solayman_Ayoubi computed.
DEBUG:Gismo:Processing Lionel_Beltrando.
DEBUG:Gismo:Landmarks of Lionel_Beltrando computed.
DEBUG:Gismo:Processing Mateus_da_Silva_Gilbert.
DEBUG:Gismo:Landmarks of Mateus_da_Silva_Gilbert computed.
DEBUG:Gismo:Processing Cherifa_Hamroun.
DEBUG:Gismo:Landmarks of Cherifa_Hamroun computed.
DEBUG:Gismo:Processing Dimitris_Kefalas.
DEBUG:Gismo:Landmarks of Dimitris_Kefalas computed.
DEBUG:Gismo:Processing Mohamed_Amine_Legheraba.
DEBUG:Gismo:Landmarks of Mohamed_Amine_Legheraba computed.
DEBUG:Gismo:Processing Alexandre_Pham.
DEBUG:Gismo:Landmarks of Alexandre_Pham computed.
DEBUG:Gismo:Processing Giuliano_Fittipaldi.
DEBUG:Gismo:Landmarks of Giuliano_Fittipaldi computed.
DEBUG:Gismo:Processing Hugo_Rimlinger.
DEBUG:Gismo:Landmarks of Hugo_Rimlinger computed.
DEBUG:Gismo:Processing Alexandros_Stoltidis.
DEBUG:Gismo:Landmarks of Alexandros_Stoltidis computed.
DEBUG:Gismo:Processing Massinissa_Tighilt.
DEBUG:Gismo:Landmarks of Massinissa_Tighilt computed.
DEBUG:Gismo:Processing Theodoros_Tsourdinis.
DEBUG:Gismo:Landmarks of Theodoros_Tsourdinis computed.
DEBUG:Gismo:Processing Florent_Krasnopol.
DEBUG:Gismo:Landmarks of Florent_Krasnopol computed.
DEBUG:Gismo:Processing Saied_Kazeminejad.
DEBUG:Gismo:Landmarks of Saied_Kazeminejad computed.
DEBUG:Gismo:Processing Elena_Nardi.
DEBUG:Gismo:Landmarks of Elena_Nardi computed.
DEBUG:Gismo:Processing Nicolas_Peugnet.
DEBUG:Gismo:Landmarks of Nicolas_Peugnet computed.
DEBUG:Gismo:Processing Anastasios_Giovanidis.
DEBUG:Gismo:Landmarks of Anastasios_Giovanidis computed.
DEBUG:Gismo:Processing Fabien_Mathieu.
DEBUG:Gismo:Landmarks of Fabien_Mathieu computed.
DEBUG:Gismo:Processing Thi-Trang_Nguyen.
DEBUG:Gismo:Landmarks of Thi-Trang_Nguyen computed.
DEBUG:Gismo:Processing Ciro_Scognamiglio.
DEBUG:Gismo:Landmarks of Ciro_Scognamiglio computed.
DEBUG:Gismo:Processing Lélia_Blin.
DEBUG:Gismo:Landmarks of Lélia_Blin computed.
DEBUG:Gismo:Processing Marcos_Nogueira.
DEBUG:Gismo:Landmarks of Marcos_Nogueira computed.
DEBUG:Gismo:Processing Andrea_Richaud.
DEBUG:Gismo:Landmarks of Andrea_Richaud computed.
DEBUG:Gismo:Processing Miguel_Elias_Mitre_Campista.
DEBUG:Gismo:Landmarks of Miguel_Elias_Mitre_Campista computed.
DEBUG:Gismo:Processing James_F._Kurose.
DEBUG:Gismo:Landmarks of James_F._Kurose computed.
DEBUG:Gismo:Processing Ciro_Scognamiglio.
DEBUG:Gismo:Landmarks of Ciro_Scognamiglio computed.
DEBUG:Gismo:Processing Liuba_Shrira.
DEBUG:Gismo:Landmarks of Liuba_Shrira computed.
DEBUG:Gismo:Processing Maurice_Herlihy.
DEBUG:Gismo:Landmarks of Maurice_Herlihy computed.
DEBUG:Gismo:Processing Dipankar_Raychaudhuri.
DEBUG:Gismo:Landmarks of Dipankar_Raychaudhuri computed.
DEBUG:Gismo:Processing Boris_Chan_Yip_Hon.
DEBUG:Gismo:Landmarks of Boris_Chan_Yip_Hon computed.
INFO:Gismo:All landmarks are built.

We can extract the NPA researchers that are the most similar to a given researcher (not necessarily from NPA).

[37]:
xgismo.rank("Anne_Bouillard", y=False)
npa_landmarks.post_item = lambda l, i: l[i]['name']
npa_landmarks.get_landmarks_by_rank(xgismo)
[37]:
['Fabien Mathieu',
 'Bruno Baynat',
 'Florent Krasnopol',
 'James Kurose',
 'Nicolas Peugnet',
 'Mohamed Amine Legheraba',
 'Giuliano Prestes Fittipaldi',
 'Massinissa Tighilt',
 'Timur Friedman',
 'Anastasios Giovanidis',
 'Guy Pujolle',
 'Marcelo Dias de Amorim',
 'Maurice Herlihy',
 'Quentin Bramas',
 'Serge Fdida',
 'Maria Potop-Butucaru',
 'Sébastien Tixeuil',
 'Lélia Blin',
 'Binh-Minh Bui-Xuan',
 'Anne Fladenmuller',
 'Sébastien Baey',
 'Naceur Malouch',
 'Dipankar Raychaudhuri',
 'Kim Loan Thai']
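The underlying idea, ranking landmarks by their similarity with the query's projection, can be sketched with plain cosine similarity (toy vectors, nothing from Gismo):

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical landmark vectors in a tiny 3-dimensional feature space
landmark_vectors = {
    "alice": [1.0, 0.0, 0.0],
    "bob": [0.6, 0.8, 0.0],
    "carol": [0.0, 0.0, 1.0],
}
query = [0.9, 0.1, 0.0]

ranking = sorted(landmark_vectors, key=lambda n: -cosine(query, landmark_vectors[n]))
print(ranking)  # -> ['alice', 'bob', 'carol']
```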

We can also organize the results in clusters.

[38]:
xgismo.rank("Anne_Bouillard", y=False)
from gismo.post_processing import post_landmarks_cluster_print
npa_landmarks.post_cluster = post_landmarks_cluster_print
npa_landmarks.get_landmarks_by_cluster(xgismo, balance=.5, target_k=1.2)
 F: 0.05.
- F: 0.06.
-- F: 0.08.
--- F: 0.10.
---- F: 0.15.
----- F: 0.33.
------ Fabien Mathieu
------ F: 0.46.
------- Bruno Baynat
------- F: 0.52.
-------- F: 0.52.
--------- James Kurose
--------- Guy Pujolle
-------- F: 0.64.
--------- Timur Friedman
--------- Marcelo Dias de Amorim
------- Anastasios Giovanidis
------- Quentin Bramas
------ Maurice Herlihy
----- Giuliano Prestes Fittipaldi
---- Nicolas Peugnet
--- Massinissa Tighilt
-- Florent Krasnopol
- Mohamed Amine Legheraba

ACM landmarks#

We can build other landmarks using the ACM categories. This will enable us to describe things in terms of ACM categories.

[39]:
from gismo.datasets.acm import get_acm, flatten_acm
acm = flatten_acm(get_acm(), min_size=10)
[40]:
acm_landmarks = Landmarks(acm, to_text=lambda e: e['query'])
[41]:
log.setLevel(logging.INFO)
acm_landmarks.fit(xgismo)
INFO:Gismo:Start computation of 113 landmarks.
INFO:Gismo:All landmarks are built.
[42]:
xgismo.rank("Fabien_Mathieu", y=False)
acm_landmarks.post_item = lambda l, i: l[i]['name']
acm_landmarks.get_landmarks_by_rank(xgismo, balance=.5, target_k=1.2)
[42]:
['Machine learning algorithms',
 'Discrete mathematics',
 'Graph theory',
 'Models of computation',
 'Computational complexity and cryptography',
 'Theory of computation',
 'Symbolic and algebraic algorithms',
 'Mathematics of computing',
 'Symbolic and algebraic manipulation',
 'Power and energy',
 'Architectures',
 'Software system structures',
 'Software organization and properties',
 'Design and analysis of algorithms']
[43]:
xgismo.rank("combinatorics")
acm_landmarks.post_cluster = post_landmarks_cluster_print
acm_landmarks.get_landmarks_by_cluster(xgismo, balance=.5, target_k=1.5)
 F: 0.36.
- F: 0.99.
-- Discrete mathematics
-- Mathematics of computing
-- Graph theory
-- Visualization
-- Simulation types and techniques
-- Modeling and simulation
- F: 0.41.
-- F: 0.96.
--- Symbolic and algebraic algorithms
--- Symbolic and algebraic manipulation
--- Mathematical analysis
-- F: 0.75.
--- Models of computation
--- Computational complexity and cryptography

Note that we fully ignore the original ACM category hierarchy. Instead, Gismo builds its own hierarchy tailored to the query.

Combining landmarks#

Through the post-processing methods, we can combine multiple landmarks. For example, the following method associates NPA researchers and keywords with a tree of ACM categories.

[44]:
from gismo.common import auto_k
import numpy as np
def post_cluster_acm(l, cluster, depth=0, kw_size=.3, mts_size=.5):
    tk_kw = 1/kw_size
    tk_mts = 1/mts_size
    n = l.x_direction.shape[0]

    kws_view = cluster.vector[0, n:]
    k = auto_k(data=kws_view.data, max_k=100, target=tk_kw)
    keywords = [xgismo.embedding.features[i]
                for i in kws_view.indices[np.argsort(-kws_view.data)[:k]]]

    if len(cluster.children) > 0:
        print(f"|{'-'*depth}")
        for c in cluster.children:
            post_cluster_acm(l, c, depth=depth+1)
    else:
        domain = l[cluster.indice]['name']
        researchers = ", ".join(npa_landmarks.get_landmarks_by_rank(cluster,
                                                                    target_k=tk_mts,
                                                                    distortion=0.5))
        print(f"|{'-'*depth} {domain} ({researchers}) ({', '.join(keywords)})")
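
The keyword selection above boils down to a top-k extraction on a sparse vector: sort the nonzero weights in decreasing order and keep the best indices. Here is a self-contained sketch of that argsort trick, with toy features and weights (unrelated to the actual embedding), mimicking the `indices`/`data` layout of a sparse row:

```python
import numpy as np

# Toy sparse vector in CSR-like form: nonzero positions and their weights.
features = ["walk", "complexity", "combinatoric", "type", "algebra"]
indices = np.array([0, 1, 2, 4])         # positions of nonzero entries
data = np.array([0.1, 0.7, 0.9, 0.3])    # corresponding weights

# Same trick as in post_cluster_acm: argsort the negated weights to get a
# decreasing order, keep the k best (the tutorial lets auto_k choose k).
k = 3
top = [features[i] for i in indices[np.argsort(-data)[:k]]]
print(top)  # → ['combinatoric', 'complexity', 'algebra']
```

In the tutorial, `auto_k` replaces the fixed `k`, adapting the number of keywords to how fast the weights decay.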

[45]:
xgismo.rank("combinatorics")
acm_landmarks.post_cluster = post_cluster_acm
acm_landmarks.get_landmarks_by_cluster(xgismo, target_k=1.5)
|
|-
|-- Discrete mathematics (Fabien Mathieu, Sébastien Tixeuil, Andrea Richa) (combinatoric, walk, type, complexity)
|-- Mathematics of computing (Fabien Mathieu) (combinatoric, type, complexity, walk, algorithms)
|-- Graph theory (Fabien Mathieu) (complexity, combinatoric, walk, degree, algorithms, random, enumeration, type)
|-- Visualization (Fabien Mathieu, Elena Nardi) (analytic, type, walk, complexity, algorithms)
|-- Simulation types and techniques (Fabien Mathieu, Elena Nardi) (type, analytic, complexity, datum, risk, algorithms)
|-- Modeling and simulation (Fabien Mathieu, Elena Nardi) (type, analytic, complexity, datum, risk, algorithms)
|-
|--
|--- Symbolic and algebraic algorithms (Fabien Mathieu, Binh-Minh Bui-Xuan, Florent Krasnopol, Bruno Baynat) (symbolic, algebra, combinatoric, type, calculus, complexity, algorithms)
|--- Symbolic and algebraic manipulation (Fabien Mathieu, Binh-Minh Bui-Xuan, Florent Krasnopol) (symbolic, algebra, combinatoric, type, calculus, complexity, algorithms, functions)
|--- Mathematical analysis (Fabien Mathieu, Binh-Minh Bui-Xuan, Florent Krasnopol) (type, complexity, differential, algorithms, functions, constraint, symbolic, calculus)
|--
|--- Models of computation (Fabien Mathieu, Elena Nardi, Binh-Minh Bui-Xuan, Maurice Herlihy, Maria Potop-Butucaru) (complexity, type, algorithms, calculus, functions)
|--- Computational complexity and cryptography (Elena Nardi, Binh-Minh Bui-Xuan, Maria Potop-Butucaru, Maurice Herlihy, Liuba Shira) (complexity, algorithms, functions, type, random)

Conversely, one can associate ACM categories and keywords to a tree of NPA researchers.

[46]:
def post_cluster_npa(l, cluster, depth=0, kw_size=.3, acm_size=.5):
    # Turn the relative display sizes into target_k values for auto_k.
    tk_kw = 1/kw_size
    tk_acm = 1/acm_size
    n = l.x_direction.shape[0]

    # Keyword part of the cluster vector (features start at index n).
    kws_view = cluster.vector[0, n:]
    k = auto_k(data=kws_view.data, max_k=100, target=tk_kw)
    keywords = [xgismo.embedding.features[i]
                for i in kws_view.indices[np.argsort(-kws_view.data)[:k]]]

    if len(cluster.children) > 0:
        # Internal node: print a ruler and recurse into children.
        print(f"|{'-'*depth}")
        for c in cluster.children:
            post_cluster_npa(l, c, depth=depth+1)
    else:
        # Leaf: print the researcher, their main ACM domains, and keywords.
        researcher = l[cluster.indice]['name']
        domains = ", ".join(acm_landmarks.get_landmarks_by_rank(
            cluster, target_k=tk_acm, distortion=0.5))
        print(f"|{'-'*depth} {researcher} ({domains}) ({', '.join(keywords)})")
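
Both post-processors share the same recursive rendering pattern: an internal cluster prints a bare `|---` ruler and recurses, a leaf prints its payload with a depth-dependent prefix. The skeleton can be sketched in isolation with a hypothetical `Node` class (not part of Gismo; it only keeps a name and children):

```python
# Minimal stand-in for a Gismo cluster: just a name and a list of children.
class Node:
    def __init__(self, name=None, children=()):
        self.name = name
        self.children = list(children)

def render(node, depth=0, out=None):
    """Collect the '|---'-prefixed lines the post-processors print."""
    if out is None:
        out = []
    if node.children:
        out.append(f"|{'-'*depth}")          # ruler for an internal node
        for child in node.children:
            render(child, depth=depth+1, out=out)
    else:
        out.append(f"|{'-'*depth} {node.name}")  # leaf payload
    return out

tree = Node(children=[Node("Discrete mathematics"),
                      Node(children=[Node("Graph theory"),
                                     Node("Visualization")])])
print("\n".join(render(tree)))
```

Swapping the leaf payload for a researcher (or a domain) plus its associated keywords yields exactly the trees printed above.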
[47]:
xgismo.rank("Anne_Bouillard", y=False)
npa_landmarks.post_cluster = post_cluster_npa
npa_landmarks.get_landmarks_by_cluster(xgismo, target_k=1.4)
|
|-
|--
|---
|----
|----- Fabien Mathieu (Symbolic and algebraic algorithms, Symbolic and algebraic manipulation, Models of computation, Computational complexity and cryptography, Discrete mathematics, Graph theory, Mathematical analysis, Mathematics of computing) (network calculus, calculus, space, aggregate)
|----- Bruno Baynat (Symbolic and algebraic algorithms, Models of computation, Symbolic and algebraic manipulation, Computational complexity and cryptography) (closed, queue, markovian, burstiness, queue network, modeling, closed queue, trade)
|----- James Kurose (Models of computation) (end, delay, end end, comparison, routing)
|---- Massinissa Tighilt (Models of computation, Symbolic and algebraic algorithms, Computational complexity and cryptography, Symbolic and algebraic manipulation) (periodic, computation)
|--- Florent Krasnopol (Logic, Compilers, Knowledge representation and reasoning) (generation, order, calculus, automatic)
|--- Giuliano Prestes Fittipaldi (Hardware test) (pattern, impact)
|-- Nicolas Peugnet (Software verification and validation) (case, architecture)
|- Mohamed Amine Legheraba (Computational complexity and cryptography, Models of computation) (sampling)

That’s all for this tutorial!