Building a Gismo for the Covid Dataset

This tutorial shows how to build a Gismo from a Covid Dataset. The Gismo is the base object that is used to analyze and summarize the dataset (see for example the Covid Summarizer tutorial).

Retrieving the Covid-19 Dataset Zip archive

The dataset can be downloaded from the Kaggle website by clicking the Download button once you are registered. You’ll get a zip file.

We assume you downloaded the archive and that it is available in some directory (adjust the parameters below according to your own settings).

[1]:
from pathlib import Path
DATASET_DIR = Path("../../../../../../Datasets/covid")
ARCHIVE = Path("CORD-19-research-challenge.zip")
(DATASET_DIR / ARCHIVE).exists()
[1]:
True
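
Before loading anything, it can help to peek at what the archive contains. The following is a minimal sketch using only the standard library; the helper name peek_archive is hypothetical and not part of sisu, and nothing is assumed about the internal layout of the archive.

```python
import zipfile
from pathlib import Path

def peek_archive(archive_path, n=5):
    """Return the number of members of a zip archive and its first n member names."""
    with zipfile.ZipFile(archive_path) as zf:
        names = zf.namelist()
    return len(names), names[:n]

# Hypothetical usage with the variables defined above:
# total, first_names = peek_archive(DATASET_DIR / ARCHIVE)
```

Only the archive index is read here, so the call should stay cheap even for a large file.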

Loading the corpus from zip

sisu provides a simple interface to load the archive in the form of a list of dictionaries. For testing purposes, you can specify the number of documents you want to retrieve. Here we retrieve the first 1000 documents.

[2]:
from sisu.datasets.covid import load_from_zip
source = load_from_zip(file=ARCHIVE, data_path=DATASET_DIR, max_docs=1000)
[3]:
len(source)
[3]:
1000

Each entry contains by default 5 keys (this can be tuned):

- title
- abstract
- content
- id
- lang

For example, the first 10 titles:

[4]:
[e['title'] for e in source[:10]]
[4]:
['The RNA pseudoknots in foot-and-mouth disease virus are dispensable for genome replication but essential for the production of infectious virus. 2 3',
 'Analysis Title: Regaining perspective on SARS-CoV-2 molecular tracing and its implications',
 'Healthcare-resource-adjusted vulnerabilities towards the 2019-nCoV epidemic across China',
 'Real-time, MinION-based, amplicon sequencing for lineage typing of infectious bronchitis virus from upper respiratory samples',
 'A Combined Evidence Approach to Prioritize Nipah Virus Inhibitors',
 'Assessing spread risk of Wuhan novel coronavirus within and beyond China, January-April 2020: a travel network-based modelling study',
 'TWIRLS, an automated topic-wise inference method based on massive literature, suggests a possible mechanism via ACE2 for the pathological changes in the human host after coronavirus infection',
 'Title: Viruses are a dominant driver of protein adaptation in mammals',
 'The impact of regular school closure on seasonal influenza epidemics: a data-driven spatial transmission model for Belgium',
 'Carbon Nanocarriers Deliver siRNA to Intact Plant Cells for Efficient Gene']

Statistics on language used:

[5]:
from collections import Counter
Counter([e['lang'] for e in source])
[5]:
Counter({'en': 998, 'fr': 2})

Converting the archive into a FileSource

Loading the whole file can be time- and memory-consuming.

[6]:
import time
start = time.perf_counter()
source = load_from_zip(file=ARCHIVE, data_path=DATASET_DIR)
print(f"Time to load {len(source)} articles: {(time.perf_counter()-start):.2f} seconds.")
Time to load 33375 articles: 188.65 seconds.

If you only need the source for linear browsing or for accessing a few elements, it can be worthwhile to save it as a FileSource. A FileSource basically behaves like a list, except that the data stays stored on the hard drive.

[7]:
from gismo.filesource import FileSource, create_file_source
create_file_source(source=source, filename="covid", path=DATASET_DIR)

After the FileSource has been created, it can be used instead of the in-memory list.

[8]:
from gismo.filesource import FileSource
del source
start = time.perf_counter()
source = FileSource(filename="covid", path=DATASET_DIR)
print(f"Time to load {len(source)} articles: {(time.perf_counter()-start):.2f} seconds.")
Time to load 33375 articles: 0.03 seconds.

The main difference is that you cannot use slice indexing, only plain integer indices or iteration.

[9]:
[source[i]['title'] for i in range(10)] # use range to avoid slice indexing
[9]:
['The RNA pseudoknots in foot-and-mouth disease virus are dispensable for genome replication but essential for the production of infectious virus. 2 3',
 'Analysis Title: Regaining perspective on SARS-CoV-2 molecular tracing and its implications',
 'Healthcare-resource-adjusted vulnerabilities towards the 2019-nCoV epidemic across China',
 'Real-time, MinION-based, amplicon sequencing for lineage typing of infectious bronchitis virus from upper respiratory samples',
 'A Combined Evidence Approach to Prioritize Nipah Virus Inhibitors',
 'Assessing spread risk of Wuhan novel coronavirus within and beyond China, January-April 2020: a travel network-based modelling study',
 'TWIRLS, an automated topic-wise inference method based on massive literature, suggests a possible mechanism via ACE2 for the pathological changes in the human host after coronavirus infection',
 'Title: Viruses are a dominant driver of protein adaptation in mammals',
 'The impact of regular school closure on seasonal influenza epidemics: a data-driven spatial transmission model for Belgium',
 'Carbon Nanocarriers Deliver siRNA to Intact Plant Cells for Efficient Gene']
[10]:
Counter([e['lang'] for e in source])
[10]:
Counter({'en': 32590, 'fr': 372, 'xx': 51, 'es': 295, 'de': 67})
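
Since a FileSource can be iterated but not sliced, itertools.islice is another way to take the first few items. The sketch below is self-contained: NoSliceList is a hypothetical stand-in that mimics the restriction described above, and first_titles is an illustrative helper, not part of gismo.

```python
from itertools import islice

class NoSliceList:
    """Stand-in for a FileSource: supports len, integer index and iteration, but not slices."""
    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, i):
        if isinstance(i, slice):
            raise TypeError("slice indexing is not supported")
        return self._items[i]

    def __iter__(self):
        return iter(self._items)

def first_titles(source, n=10):
    """Return the titles of the first n entries without slicing the source."""
    return [e['title'] for e in islice(source, n)]

demo = NoSliceList({'title': f"Article {i}"} for i in range(100))
print(first_titles(demo, 3))  # ['Article 0', 'Article 1', 'Article 2']
```

With the real source, first_titles(source) should give the same list as the range-based comprehension, assuming the FileSource iterates in index order.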

Building a corpus

A gismo corpus is essentially a list with instructions about how to convert items of the list to a text that will be used for the embedding (the text does not have to be comprehensible for humans).

For example, we will build a list of English articles with non-trivial title, abstract and content. Note that we can close the source afterwards to avoid keeping an open file.

[12]:
english_source = [d for d in source if
                  len(d['abstract']) > 140 and
                  len(d['title']) > 20 and
                  len(d['content']) > 200 and
                  d['lang']=='en']
source.close()
len(english_source)
[12]:
23114

Now we can associate the source and a text function (we use a text sanitizer that will extract content and do some cleaning).

[13]:
from gismo.corpus import Corpus
from sisu.preprocessing.tokenizer import to_text_sanitized

corpus = Corpus(source=english_source, to_text=to_text_sanitized)
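
The text function can be any callable that maps an item of the source to a string. As a minimal sketch, the hypothetical function below (title_and_abstract is not part of sisu) ignores the content and only keeps the title and abstract, using the keys listed earlier.

```python
def title_and_abstract(item):
    """Hypothetical text function: join title and abstract, lower-cased."""
    return f"{item['title']} {item['abstract']}".lower()

# A made-up entry with the five default keys, for illustration only.
sample = {'title': "A Title", 'abstract': "Some Abstract.",
          'content': "Full text.", 'id': "x", 'lang': "en"}
print(title_and_abstract(sample))  # a title some abstract.

# Hypothetical usage with the objects defined above:
# corpus = Corpus(source=english_source, to_text=title_and_abstract)
```

Such a function would yield a lighter embedding than to_text_sanitized, at the cost of ignoring the body of the articles.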

Building the embedding

From the corpus, one can create the Gismo dual embedding of documents into words and words into documents. We will use some stopwords to avoid cluttering the embedding with common words that do not bring much information.

[32]:
from sisu.preprocessing.language import EN_STOP_WORDS
covid_stop_words = ['preprint', 'copyright', 'holder', 'reuse', 'doi', 'reads', 'fig', 'figure']

from sklearn.feature_extraction.text import CountVectorizer
from gismo.embedding import Embedding
vectorizer = CountVectorizer(min_df=5, dtype=float, stop_words=EN_STOP_WORDS+covid_stop_words)
embedding = Embedding(vectorizer)
embedding.fit_transform(corpus)
[15]:
embedding.x
[15]:
<23114x92278 sparse matrix of type '<class 'numpy.float64'>'
        with 18690530 stored elements in Compressed Sparse Row format>

The embedding graph relates 23,114 articles to 92,278 words through a bipartite graph of 18,690,530 relationships.

Average number of unique words per document:

[16]:
18690530/23114
[16]:
808.6237777970061

Average number of documents where a random word appears:

[17]:
18690530/92278
[17]:
202.5458939292139
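
These averages can also be derived from the matrix itself rather than from hard-coded numbers, since SciPy sparse matrices expose shape and nnz attributes. The sketch below uses a stand-in object carrying the values reported above so that it runs without the dataset; bipartite_stats is a hypothetical helper.

```python
from types import SimpleNamespace

def bipartite_stats(x):
    """Summarize a documents-by-words matrix that exposes .shape and .nnz."""
    n_docs, n_words = x.shape
    return {"links": x.nnz,
            "words_per_doc": x.nnz / n_docs,
            "docs_per_word": x.nnz / n_words}

# Stand-in with the dimensions reported above; with the real data, pass embedding.x.
x_like = SimpleNamespace(shape=(23114, 92278), nnz=18690530)
stats = bipartite_stats(x_like)
print(f"{stats['words_per_doc']:.1f} words per document, "
      f"{stats['docs_per_word']:.1f} documents per word")
# 808.6 words per document, 202.5 documents per word
```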

The Gismo

A Gismo is simply the combination of a corpus and an embedding.

[33]:
from gismo.gismo import Gismo
gismo = Gismo(corpus, embedding)

Small example: for a given query, propose the titles of relevant articles.

[20]:
gismo.post_documents_item = lambda g, i: g.corpus[i]['title']
def propose_titles(query):
    success = gismo.rank(query)
    if success:
        return gismo.get_documents_by_rank()
    else:
        print(f"Not found anything about: {query}!")
[21]:
propose_titles("flklkfl")
Not found anything about: flklkfl!
[22]:
propose_titles("pangolin")
[22]:
['Pangolin homology associated with 2019-nCoV',
 'Probable Pangolin Origin of SARS-CoV-2 Associated with the COVID-19 Outbreak',
 'Evidence of recombination in coronaviruses implicating pangolin origins of nCoV- 2019',
 'Evidence of the Recombinant Origin and Ongoing Mutations in Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2)',
 'Spike protein recognition of mammalian ACE2 predicts the host range and an optimized ACE2 for SARS-CoV-2 infection',
 'Viral Metagenomics Revealed Sendai Virus and Coronavirus Infection of Malayan Pangolins (Manis javanica)',
 'Emergence of SARS-CoV-2 through Recombination and Strong Purifying Selection Short Title: Recombination and origin of SARS-CoV-2 One Sentence Summary: Extensive Recombination and Strong Purifying Selection among coronaviruses from different hosts facilitate the emergence of SARS-CoV-2',
 'Mutations, Recombination and Insertion in the Evolution of 2019-nCoV',
 'SARS-CoV-2, an evolutionary perspective of interaction with human ACE2 reveals undiscovered amino acids necessary for complex stability',
 'Journal Pre-proof Predicting the angiotensin converting enzyme 2 (ACE2) utilizing capability as the receptor of SARS-CoV-2 Predicting the angiotensin converting enzyme 2 (ACE2) utilizing capability as the receptor of SARS-CoV-2 2',
 'CRISPR-based surveillance for COVID-19 using genomically-comprehensive machine learning design',
 'ROLE OF CHANGES IN SARS-COV-2 SPIKE PROTEIN IN THE INTERACTION WITH THE HUMAN ACE2 RECEPTOR: AN IN SILICO ANALYSIS',
 'Systematic Comparison of Two Animal-to-Human Transmitted Human Coronaviruses: SARS-CoV-2 and SARS-CoV',
 'Animal origins of SARS coronavirus: possible links with the international trade in small carnivores',
 'Strong evolutionary convergence of receptor-binding protein spike between COVID-19 and SARS-related coronaviruses',
 'The epidemic of 2019-novel-coronavirus (2019-nCoV) pneumonia and insights for emerging infectious diseases in the future',
 'Potential Factors Influencing Repeated SARS Outbreaks in China',
 'The bushmeat and food security nexus: A global account of the contributions, conundrums and ethical collisions',
 'Epidemic Situation of Novel Coronavirus Pneumonia in China mainland',
 'Genome Detective Coronavirus Typing Tool for rapid identification and characterization of novel coronavirus genomes',
 'Protein structure and sequence re-analysis of 2019-nCoV genome does not indicate snakes as its intermediate host or the unique similarity between its spike protein insertions and HIV-1',
 'Digital Surveillance: A Novel Approach to Monitoring the Illegal Wildlife Trade',
 'From sequence to enzyme mechanism using multi-label machine learning',
 'Reasons for the increase in emerging and re-emerging viral infectious diseases',
 'Wildlife trade, consumption and conservation awareness in southwest China']
[23]:
propose_titles("platypus")
[23]:
['Widespread Divergence of the CEACAM/PSG Genes in Vertebrates and Humans Suggests Sensitivity to Selection',
 'Coevolution of activating and inhibitory receptors within mammalian carcinoembryonic antigen families',
 'Phylogenetic Distribution of CMP-Neu5Ac Hydroxylase (CMAH), the Enzyme Synthetizing the Proinflammatory Human Xenoantigen Neu5Gc',
 'Immunoglobulin heavy chain diversity in Pteropid bats: evidence for a diverse and highly specific antigen binding repertoire',
 'Evolutionary Dynamics of the Interferon-Induced Transmembrane Gene Family in Vertebrates',
 'Evolution of vertebrate interferon inducible transmembrane proteins',
 'A Comprehensive Phylogenetic and Structural Analysis of the Carcinoembryonic Antigen (CEA) Gene Family',
 'Chiropteran types I and II interferon genes inferred from genome sequencing traces by a statistical gene-family assembler',
 'A novel fast vector method for genetic sequence comparison OPEN',
 'Alignment-free method for DNA sequence clustering using Fuzzy integral similarity',
 'Emodin Inhibits EBV Reactivation and Represses NPC Tumorigenesis',
 'Crystallographic Analysis of the Reaction Cycle of 2′,3′-Cyclic Nucleotide 3′-Phosphodiesterase, a Unique Member of the 2H Phosphoesterase Family',
 'Evolutionary aspects of lipoxygenases and genetic diversity of human leukotriene signaling',
 'A new method to analyze protein sequence similarity using Dynamic Time Warping',
 'Effective one-pot multienzyme (OPME) synthesis of monotreme milk oligosaccharides and other sialosides containing a 4-O- acetyl sialic acid † HHS Public Access Author manuscript',
 'A Poisson model of sequence comparison and its application to coronavirus phylogeny',
 'Post-Glycosylation Modification of Sialic Acid and Its Role in Virus Pathogenesis',
 'Similar ratios of introns to intergenic sequence across animal 1 genomes',
 'Genomic organization and adaptive evolution of IGHC genes in marine mammals',
 'Contrasting Patterns in Mammal-Bacteria Coevolution: Bartonella and Leptospira in Bats and Rodents',
 'Peptide presentation by bat MHC class I provides new insight into the antiviral immunity of bats',
 'Multiomics analysis of the giant triton snail salivary gland, a crown- of-thorns starfish predator OPEN',
 'A novel immunity system for bacterial nucleic acid degrading toxins and its recruitment in various eukaryotic and DNA viral systems',
 'The Value of the Tree of Life "Nothing makes sense except in light of evolution"',
 '34 Animals Hazardous to Humans']
[29]:
propose_titles("marseille")
[29]:
['Respiratory viruses within homeless shelters in Marseille, France',
 'Epidemiology of respiratory pathogen carriage in the homeless population within two shelters in Marseille, France, 2015e2017: cross sectional 1-day surveys',
 'Incidence of Hajj-associated febrile cough episodes among French pilgrims: a prospective cohort study on the influence of statin use and risk factors',
 'Hydroxychloroquine and azithromycin as a treatment of COVID-19: results of an open- label non-randomized clinical trial',
 'Infectious disease symptoms and microbial carriage among French medical students travelling abroad: A prospective study',
 'Acquisition of respiratory viruses and presence of respiratory symptoms in French pilgrims during the 2016 Hajj: A prospective cohort study',
 'The VIZIER project: Preparedness against pathogenic RNA viruses',
 "French Hajj pilgrims' experience with pneumococcal infection and vaccination: A knowledge, attitudes and practice (KAP) evaluation",
 'Journal Pre-proof SARS-CoV-2: fear versus data SARS-CoV-2: fear versus data',
 'Novel Virus Influenza A (H1N1sw) in South-Eastern',
 'Bacterial respiratory carriage in French Hajj pilgrims and the effect of pneumococcal vaccine and other individual preventive measures: A prospective cohort survey',
 'Detection of Plant DNA in the Bronchoalveolar Lavage of Patients with Ventilator-Associated Pneumonia',
 'The dynamics and interactions of respiratory pathogen carriage among French pilgrims during the 2018 Hajj',
 'Screening Pneumonia Patients for Mimivirus 1',
 'Studies of nonhuman primates: key sources of data on zoonoses and microbiota',
 'Studies of nonhuman primates: key sources of data on zoonoses and microbiota',
 'Survey of laboratory-acquired infections around the world in biosafety level 3 and 4 laboratories',
 'Filovirus Research in Gabon and Equatorial Africa: The Experience of a Research Center in the Heart of Africa',
 'The impact of high-speed railway on tourism spatial structures between two adjoining metropolitan cities in China: Beijing and Tianjin',
 'Travel and migration associated infectious diseases morbidity in Europe, 2008',
 'Highly infectious diseases in the Mediterranean Sea area: Inventory of isolation capabilities and recommendations for appropriate isolation',
 'Highly infectious diseases in the Mediterranean Sea area: Inventory of isolation capabilities and recommendations for appropriate isolation',
 'Natural Killer Cells Promote Early CD8 T Cell Responses against Cytomegalovirus',
 'The effect of chlortracycline treatment on enteric bacteria in pigs SoA1.3 Delsol AA a Analysis of Helicobacter pylori resistance to antimicrobial agents in Polish children SoA4.2',
 'The history of the plague and the research on the causative agent Yersinia pestis',
 'Antimicrobial resistance in anaerobic bacteria. Experiences in Europe and North America (Symposium arranged with ESGARAB) S11 Antimicrobial susceptibility patterns in different European countries Pathogenesis and prevention of nosocomial infections-new aspects (Symposium arranged with ESGNI) S14 Ventilator-associated pneumonia',
 '2018 international meeting of the Global Virus Network',
 'Meeting report: 29th International Conference on Antiviral Research',
 "Travelers' Actual and Subjective Knowledge about Risk for Ebola Virus Disease",
 'Asymptomatic Middle East Respiratory Syndrome Coronavirus (MERS-CoV) infection: Extent and implications for infection control: A systematic review']
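
Note that propose_titles prints a message and implicitly returns None when the query is not found, and that the result lists can be long. A possible variant, sketched below, always returns a list and optionally truncates it. It relies only on the rank and get_documents_by_rank calls used above; the function name, the k parameter and the StubGismo class are hypothetical and only serve to make the sketch runnable on its own.

```python
def propose_titles_or_empty(gismo, query, k=None):
    """Return at most k ranked titles for query, or an empty list if nothing matches."""
    if not gismo.rank(query):
        return []
    titles = gismo.get_documents_by_rank()
    return titles if k is None else titles[:k]

class StubGismo:
    """Tiny stand-in imitating the two Gismo calls used above (illustration only)."""
    def __init__(self, index):
        self._index = index
        self._last = []

    def rank(self, query):
        self._last = self._index.get(query, [])
        return bool(self._last)

    def get_documents_by_rank(self):
        return list(self._last)

stub = StubGismo({"pangolin": ["Title A", "Title B", "Title C"]})
print(propose_titles_or_empty(stub, "pangolin", k=2))  # ['Title A', 'Title B']
print(propose_titles_or_empty(stub, "flklkfl"))        # []
```

Returning an empty list keeps the return type uniform, which tends to be easier to handle when the function is called from other code.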

A Gismo can serve many purposes, which are presented in other tutorials. Note that you can save your Gismo for later use.

[34]:
gismo.save(filename="covid_gismo", path=DATASET_DIR, compress=True, erase=True)