Summarizer tutorial

This tutorial shows the basics of using sisu’s flat summarizer.

Loading a Gismo

The summarizer uses a Gismo. For this tutorial, we will use a Gismo of a corpus of articles on Covid. You can use the following tutorial to build such a Gismo:

https://balouf.github.io/sisu/tutorials/Gismo%20Covid.html

The following loads the Gismo in memory.

[1]:
from pathlib import Path
data_folder = Path('../../../../Datasets/covid')
data_folder.exists()
[2]:
from gismo.gismo import Gismo
gismo = Gismo(filename="covid_gismo", path=data_folder)

Creating a summarizer instance

You create a summarizer with the Summarizer class, which feeds on a Gismo.

Summarizer also has hyper-parameters you may want to play with (or not!).

The general rule for the hyper-parameters is as follows:

- All parameters have default values that will be used if you don’t specify anything.
- When creating an instance, you can specify some parameters. They will override the default parameters for the instance.
- Default instance parameters can be accessed and changed through the attributes of the parameters attribute.
- When building a summary, you can specify runtime parameters. They will override instance parameters (without changing them) for the build.
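The precedence rule above (defaults, then instance parameters, then runtime parameters) can be illustrated with a minimal sketch. The class ToySummarizer, its method effective_parameters and the default values below are hypothetical stand-ins written for this illustration; they are not sisu's implementation or API.

```python
# Hypothetical stand-in, NOT sisu's code: it only illustrates the precedence
# defaults < instance parameters < runtime parameters.
# The default values below are arbitrary and chosen for the example.
DEFAULTS = {"order": "rank", "num_sentences": 5, "max_chars": None}


class ToySummarizer:
    def __init__(self, **instance_parameters):
        # Instance parameters override the defaults for this instance.
        self.parameters = {**DEFAULTS, **instance_parameters}

    def effective_parameters(self, **runtime_parameters):
        # Runtime parameters override instance parameters for one call,
        # without modifying self.parameters.
        return {**self.parameters, **runtime_parameters}


toy = ToySummarizer(num_sentences=4)
run = toy.effective_parameters(order="cosine")
print(run["num_sentences"], run["order"])  # instance value, then runtime value
print(toy.parameters["order"])             # instance parameters are unchanged
```

The runtime override is built as a fresh dictionary, which is why the instance parameters keep their values after the call.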

The hyper-parameters are described here:

https://balouf.github.io/sisu/reference/summarizer.html#sisu.summarizer.default_summarizer_parameters

The following creates a summarizer that aims to produce 4 sentences from the content of the articles.

[3]:
from sisu.summarizer import Summarizer
summa = Summarizer(gismo, num_sentences=4, text_getter=lambda e: e['content'])

The parameters are:

[4]:
summa.parameters()
[4]:
{'order': 'rank',
 'text_getter': <function __main__.<lambda>(e)>,
 'sentence_tester': <function sisu.preprocessing.tokenizer.is_relevant_sentence(sentence: str, min_num_words: int = 6, max_num_words: int = 60) -> bool>,
 'itf': True,
 'post_processing': <function sisu.summarizer.<lambda>(summa, i)>,
 'sentence_gismo_parameters': {'post': False, 'resolution': 0.99},
 'num_documents': None,
 'num_query': None,
 'num_sentences': 4,
 'max_chars': None}

Let's test it with a query about pangolins.

[5]:
q = "pangolin"
print(summa(q))
['The amino acid residues change in S-protein of SARS-CoV-2 was further analyzed with SARS-CoV, pangolin and bat CoVs including pangolin/Guandong/1/2019, pangolin/Guangdong/lung08, and bat/Yunnan/RaTG13 (Figure 2) .', 'Figure 2b and Table S1 describe that all key amino acid residues of RBD (except two positions) are completely homologues between SARS-CoV-2 (Wuhan-Hu-1_MN908947) and pangolin CoVs (pangolin/Guandong/1/2019 and pangolin/Guangdong/lung08), supporting our postulation of recombination event in S-protein gene.', 'On the other hand, Beta-CoVs from pangolin sources (pangolin/Guandong/1/2019 and .', 'The Malayan pangolin (Manis javanica), a representative mammal species of the order Pholidota, is one of the only eight pangolin species worldwide.']

Ordering

The default order is rank. How do the other two, coverage and cosine, compare?

[6]:
print(summa(q, order="coverage"))
print(summa(q, order='cosine'))
['The amino acid residues change in S-protein of SARS-CoV-2 was further analyzed with SARS-CoV, pangolin and bat CoVs including pangolin/Guandong/1/2019, pangolin/Guangdong/lung08, and bat/Yunnan/RaTG13 (Figure 2) .', 'Figure 2b and Table S1 describe that all key amino acid residues of RBD (except two positions) are completely homologues between SARS-CoV-2 (Wuhan-Hu-1_MN908947) and pangolin CoVs (pangolin/Guandong/1/2019 and pangolin/Guangdong/lung08), supporting our postulation of recombination event in S-protein gene.', 'The Malayan pangolin (Manis javanica), a representative mammal species of the order Pholidota, is one of the only eight pangolin species worldwide.', 'These results indicate that the Malayan pangolin might carry a novel CoV (here named Pangolin-CoV) that is similar to SARS-CoV-2.']
['The two genomes were merged using the easymerge.pl subcommand from VirMAP to create the final pangolin-associated coronavirus (Pangolin-CoV) genome.', 'These results indicate that Pangolin-CoV could have pathogenic potential similar to that of SARS-CoV-2.', 'These new findings suggest further research to investigate pangolin as a SARS-CoV-2 reservoir.', 'This finding further supports the hypothesis that pangolin is involved in SARS-CoV-2 evolution.']

Spacy post-processing

Notice how the first summary contains an incomplete sentence without a verb? To remove such sentences automatically, we can add NLP post-processing (note that the number of returned sentences may then be less than 4).

[7]:
from sisu.summarizer import PostNLP
import spacy
nlp = spacy.load("en_core_web_sm")

print(summa(q, post_processing=PostNLP(nlp)))
['The amino acid residues change in S-protein of SARS-CoV-2 was further analyzed with SARS-CoV, pangolin and bat CoVs including pangolin/Guandong/1/2019, pangolin/Guangdong/lung08, and bat/Yunnan/RaTG13 (Figure 2) .', 'Figure 2b and Table S1 describe that all key amino acid residues of RBD (except two positions) are completely homologues between SARS-CoV-2 (Wuhan-Hu-1_MN908947) and pangolin CoVs (pangolin/Guandong/1/2019 and pangolin/Guangdong/lung08), supporting our postulation of recombination event in S-protein gene.', 'The Malayan pangolin (Manis javanica), a representative mammal species of the order Pholidota, is one of the only eight pangolin species worldwide.']

If you have the module neuralcoref installed on your system, you can activate co-reference resolution in the NLP post-processing. The following is a proof of concept of how co-reference resolution works.

[8]:
import neuralcoref
neuralcoref.add_to_pipe(nlp)

summa.sentences_[0]['sentence'] = "My taylor is rich."
summa.sentences_[1]['sentence'] = "She has a dog."
summa.sentences_[2]['sentence'] = "She lives uptown."
post_nlp = PostNLP(nlp, coref=True)
for i in range(3):
    print(post_nlp(summa, i))
My taylor is rich.
My taylor has a dog.
My taylor lives uptown.
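The parameter dump above shows that post_processing is a function of (summa, i), and the PostNLP instance is called the same way and returns the text of sentence i. Assuming any callable with that signature is accepted (an assumption inferred from this tutorial, not checked against the sisu reference), a lightweight post-processor could be sketched as follows. The name tidy_spacing and the stand-in object fake_summa are hypothetical.

```python
import re
from types import SimpleNamespace


def tidy_spacing(summa, i):
    """Return sentence i with stray whitespace before punctuation removed."""
    text = summa.sentences_[i]["sentence"]
    # Drop whitespace that precedes a punctuation mark, e.g. "(Figure 2) ." -> "(Figure 2)."
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s{2,}", " ", text).strip()


# Stand-in exposing only the sentences_ attribute used in this tutorial;
# it is not a real Summarizer.
fake_summa = SimpleNamespace(sentences_=[{"sentence": "See details (Figure 2) ."}])
print(tidy_spacing(fake_summa, 0))
```

If the assumption holds, such a function would be passed at runtime like PostNLP, for example summa(q, post_processing=tidy_spacing).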

Character limit

You can also set a target character budget with max_chars. This will override num_sentences. Let's try that on a new query.

[9]:
txt = " ".join(summa("hydroxychloroquine", max_chars=3000))
[10]:
txt
[10]:
'The cause of failure for hydroxychloroquine treatment should be investigated by testing the isolated SARS-CoV-2 strains of the non-respondents and analyzing their genome, and by analyzing the host factors that may be associated with the metabolism of hydroxychloroquine. When comparing the effect of hydroxychloroquine treatment as a single drug and the effect of hydroxychloroquine and azithromyc in combination, the proportion of patients that had negative PCR results in nasopharyngeal samples was significantly different between the two groups at days 3-4-5 and 6 post-inclusion (Table 3)  Cultures. Lessons learnt from chloroquine/ hydroxychloroquine use in HIV infection. The peak of the chromatogram at 1.05 min of retention corresponds to hydroxychloroquine metabolite. Equally important, chloroquine and hydroxychloroquine are generically produced, very inexpensive, and could be made available worldwide. Effect of hydroxychloroquine on viral load. Hydroxychloroquine (17 μM; HCQ) was purchased from Sanofi-Synthelabo. Role of Chloroquine and Hydroxychloroquine (HCQ). Chloroquine and Hydroxychloroquine and Emerging Viruses. Mean hydroxychloroquine serum concentration was 0.46 µg/ml+0.2 (N=20). Hydroxychloroquine differs from chloroquine by the presence of a hydroxyl group at the end of the side chain: the N-ethyl substituent is \uf062-hydroxylated. Our preliminary results also suggest a synergistic effect of the combination of hydroxychloroquine and azithromycin. The serum concentration of this metabolite is deduced from UV absorption, as for hydroxychloroquine concentration. Recurrence occurred in 11.1% of patients without Hydroxychloroquine treatment. Decrease in the maternal-fetal transmission by hydroxychloroquine 1. Hydroxychloroquine was provided by the National Pharmacy of France on nominative demand. 
These hits are gemcitabine, gefitinib and vibarabine (FLUAV); gemcitabine, pirlindole dibucaine, fluoxetine and dalbavancin (EV1); gemcitabine, imatinib, ivermectin, lopinavir, lovastatin, ezetimibe, fluoxetine, BCX4430, chloroquine and hydroxychloroquine (ZIKV); chloroquine and mycophenolic acid (CHIKV); chloroquine, mycophenolic acid, dibucaine and itraconazole (RRV); as well as 5-azacitidine, gemcitabine, trifluridine and vidarabine (HSV-1). The antiretroviral effects of chloroquine/hydroxychloroquine may though become visible in anatomical sanctuaries of those individuals treated with PI-containing antiretroviral regimens. Both chloroquine/hydroxychloroquine and auranofin can influence these transitions by exerting a pro-apoptotic effect, the efficacy of which is graphically exemplified by the intensity of the blue color in the corresponding rectangles. In this light, chloroquine/hydroxychloroquine should have an anti-reservoir potential. Considering both concentrations provides an estimation of initial serum hydroxychloroquine concentration. In vivo effects of chloroquine/hydroxychloroquine: preclinical models.'
[11]:
len(txt)
[11]:
2962
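One way to picture a character budget is as a greedy selection: keep adding sentences in ranking order while the joined text still fits. The sketch below illustrates that idea only; the function take_within_budget is hypothetical, and whether sisu counts the joining spaces or stops at the first sentence that does not fit is an assumption, not something stated in this tutorial.

```python
def take_within_budget(ranked_sentences, max_chars):
    """Keep sentences, in ranking order, while the joined text fits max_chars."""
    kept, used = [], 0
    for sentence in ranked_sentences:
        # Count one extra character for the space that joins two sentences.
        extra = len(sentence) + (1 if kept else 0)
        if used + extra > max_chars:
            break  # assumption: stop at the first sentence that overflows
        kept.append(sentence)
        used += extra
    return kept


ranked = ["Alpha beta.", "Gamma delta epsilon.", "Zeta."]
kept = take_within_budget(ranked, max_chars=32)
print(kept, len(" ".join(kept)))
```

Under this scheme the result can be shorter than the budget, which is consistent with the 2962 characters obtained above for max_chars=3000.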