Summarizer
-
class
sisu.summarizer.
PostNLP
(nlp, coref=False) Post-processor for the
Summarizer
that leverages a spaCy NLP engine.
Discard sentences with no verb.
Remove adverbs and punctuation that start a sentence (e.g. “However, we …” -> “We …”).
Optionally, if the engine supports co-references, resolve them.
- Parameters
nlp (callable) – A spaCy nlp engine.
coref (
bool
) – Resolve co-references if the nlp engine supports it.
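The first two rules can be sketched without spaCy, assuming each sentence has already been tagged with Universal POS labels (the kind of label spaCy exposes as token.pos_). The function name post_process_tagged and the (text, tag) input format are hypothetical, and counting auxiliaries as verbs is an assumption; this is an illustration of the rules, not the PostNLP implementation.

```python
def post_process_tagged(tokens):
    """Apply the two rules to one sentence given as (text, pos) pairs.

    Returns the cleaned sentence, or '' if the sentence is discarded.
    """
    # Rule 1: discard sentences with no verb (assumption: AUX counts as a verb).
    if not any(pos in ("VERB", "AUX") for _, pos in tokens):
        return ""
    # Rule 2: skip adverbs and punctuation that start the sentence.
    start = 0
    while start < len(tokens) and tokens[start][1] in ("ADV", "PUNCT"):
        start += 1
    kept = tokens[start:]
    if not kept:
        return ""
    out = ""
    for text, pos in kept:
        if not out:
            # Capitalize the new first word.
            out = text[0].upper() + text[1:]
        elif pos == "PUNCT":
            # Attach punctuation directly to the previous word.
            out += text
        else:
            out += " " + text
    return out


tagged = [("However", "ADV"), (",", "PUNCT"), ("we", "PRON"),
          ("agree", "VERB"), (".", "PUNCT")]
print(post_process_tagged(tagged))
```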
-
class
sisu.summarizer.
Summarizer
(gismo, **kwargs) Summarizer class.
- Parameters
gismo (
Gismo
) – Gismo of the documents to analyze.
kwargs (
dict
) – Parameters of the summarizer (see default_summarizer_parameters
for details).
-
sentences_
Selected sentences. Each sentence is a dictionary with the following keys:
index: position of the sentence in the returned list
sentence: the actual sentence
relevant: a boolean that tells if the sentence is eligible for being part of the summary
sanitized: for relevant sentences, a simplified version to be fed to the embedding
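As an illustration of this layout, the sketch below builds such dictionaries from a list of raw sentences. The helpers is_eligible and sanitize are hypothetical stand-ins with arbitrary thresholds; they are not the package's own tester or sanitizer.

```python
import re


def is_eligible(sentence, min_len=20, max_len=200):
    """Hypothetical tester: reject very short, very long, or URL-bearing sentences."""
    return min_len <= len(sentence) <= max_len and "http" not in sentence


def sanitize(sentence):
    """Hypothetical simplification: keep letters and single spaces only."""
    letters_only = re.sub(r"[^A-Za-z ]", "", sentence)
    return re.sub(r"\s+", " ", letters_only).strip()


def make_entries(sentences):
    """Build one dictionary per sentence with the four documented keys."""
    entries = []
    for i, s in enumerate(sentences):
        relevant = is_eligible(s)
        entries.append({
            "index": i,
            "sentence": s,
            "relevant": relevant,
            # Only relevant sentences get a sanitized version.
            "sanitized": sanitize(s) if relevant else "",
        })
    return entries


entries = make_entries([
    "Too short.",
    "Bats host many coronaviruses, including SARS-like ones.",
])
```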
-
order_
Proposed incomplete ordering of the
sentences_
- Type
-
parameters
Handler of parameters.
- Type
Examples
The package contains a data folder with a toy gismo with articles related to Covid-19. We load it.
>>> gismo = Gismo(filename="toy_gismo", path="data")
Then we build a summarizer out of it. We tell it to fetch the sentences from the content of the articles.
>>> summa = Summarizer(gismo, text_getter = lambda d: d['content'])
Ask for a summary on bat with a maximal budget of 500 characters, using pure TF-IDF sentence embedding.
>>> summa('bat', max_chars=500, itf=False)
['By comparing the amino acid sequence of 2019-nCoV S-protein (GenBank Accession: MN908947.3) with Bat SARS-like coronavirus isolate bat-SL-CoVZC45 and Bat SARS-like coronavirus isolate Bat-SL-CoVZXC21, the latter two were shown to share 89.1% and 88.6% sequence identity to 2019-nCoV S-protein (supplementary figure 1) .', 'Within our bat-hemoplasma network, genotype sharing was restricted to five host communities, 380 whereas six genotypes were each restricted to a single bat species (Fig. 5A ).']
Now a summary based on the cosine ordering, using the content of abstracts and pure TF-IDF sentence embedding.
>>> summa('bat', max_chars=500, order='cosine', text_getter = lambda d: d['abstract'])
['Bat dipeptidyl peptidase 4 (DPP4) sequences were closely related to 38 those of human and non-human primates but distinct from dromedary DPP4 sequence.', 'The multiple sequence alignment data correlated with already published reports on SARS-CoV-2 indicated that it is closely related to Bat-Severe Acute Respiratory Syndrome like coronavirus (Bat CoV SARS-like) and wellstudied Human SARS.', '(i.e., hemoplasmas) across a species-rich 40 bat community in Belize over two years.']
Now 4 sentences using a coverage ordering.
>>> summa('bat', num_sentences=4, order='coverage')
['By comparing the amino acid sequence of 2019-nCoV S-protein (GenBank Accession: MN908947.3) with Bat SARS-like coronavirus isolate bat-SL-CoVZC45 and Bat SARS-like coronavirus isolate Bat-SL-CoVZXC21, the latter two were shown to share 89.1% and 88.6% sequence identity to 2019-nCoV S-protein (supplementary figure 1) .', 'However, we have not done the IDPs analysis for ORF10 from the Bat-SL-CoVZC45 strain since we have taken different strain of Bat CoV (reviewed strain HKU3-1) in our study.', 'To test the dependence of the hemoplasma 290 phylogeny upon the bat phylogeny and thus assess evidence of evolutionary codivergence, we 291 applied the Procrustes Approach to Cophylogeny (PACo) using distance matrices and the paco 292 We used hemoplasma genotype assignments to create a network, with each node representing a 299 bat species and edges representing shared genotypes among bat species pairs.', 'However, these phylogenetic patterns in prevalence were decoupled from those describing bat 526 species centrality in sharing hemoplasmas, such that genotype sharing was generally restricted 527 by bat phylogeny.']
As you can see, some of the returned sentences start with “However, ”. A bit of NLP post-processing can take care of those.
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> post_nlp = PostNLP(nlp)
>>> summa('bat', num_sentences=4, order='coverage', post_processing=post_nlp)
['By comparing the amino acid sequence of 2019-nCoV S-protein (GenBank Accession: MN908947.3) with Bat SARS-like coronavirus isolate bat-SL-CoVZC45 and Bat SARS-like coronavirus isolate Bat-SL-CoVZXC21, the latter two were shown to share 89.1% and 88.6% sequence identity to 2019-nCoV S-protein (supplementary figure 1) .', 'We have not done the IDPs analysis for ORF10 from the Bat-SL-CoVZC45 strain since we have taken different strain of Bat CoV (reviewed strain HKU3-1) in our study.', 'To test the dependence of the hemoplasma 290 phylogeny upon the bat phylogeny and thus assess evidence of evolutionary codivergence, we 291 applied the Procrustes Approach to Cophylogeny (PACo) using distance matrices and the paco 292 We used hemoplasma genotype assignments to create a network, with each node representing a 299 bat species and edges representing shared genotypes among bat species pairs.', 'These phylogenetic patterns in prevalence were decoupled from those describing bat 526 species centrality in sharing hemoplasmas, such that genotype sharing was generally restricted 527 by bat phylogeny.']
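When loading a spaCy model is not desirable, a lighter post-processor can be sketched with a regular expression. The parameter documentation gives the post_processing signature as (Summarizer, int) -> str; the function strip_leading_connector, the connector list, and the fake_summa stand-in below are hypothetical and only mimic the sentences_ attribute.

```python
import re
from types import SimpleNamespace

# Hypothetical list of sentence-initial connectors to strip.
LEADING = re.compile(r"^(However|Moreover|Therefore|Thus),\s+")


def strip_leading_connector(summarizer, i):
    """Return sentence i of summarizer.sentences_ without a leading connector."""
    sentence = summarizer.sentences_[i]["sentence"]
    stripped = LEADING.sub("", sentence)
    # Restore the capital letter of the new first word.
    return stripped[:1].upper() + stripped[1:]


# Stand-in object exposing only sentences_ (not a real Summarizer).
fake_summa = SimpleNamespace(sentences_=[
    {"index": 0, "sentence": "However, we have not done the IDPs analysis.",
     "relevant": True, "sanitized": ""},
    {"index": 1, "sentence": "Bats host many viruses.",
     "relevant": True, "sanitized": ""},
])
print(strip_leading_connector(fake_summa, 0))
```

Assuming the signature matches, such a function could be passed as post_processing=strip_leading_connector in place of post_nlp.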
-
build_coverage_order
(k) Populate
order_
with a covering order with target number of sentences k. The actual number of indices is stretched by the sentence Gismo stretch factor.
- Parameters
k (
int
) – Number of optimal covering sentences.
- Returns
Covering order.
- Return type
-
build_sentence_gismo
(itf=None, s_g_p=None) Creates the Gismo of sentences (
sentence_gismo_
)
-
build_sentence_source
(num_documents=None, getter=None, tester=None) Creates the corpus of sentences (
sentences_
)
- Parameters
num_documents (
int
, optional) – Number of documents to select (if not set, Gismo will automatically decide).
getter (callable) – Extraction of text from corpus item. If not specified, the to_text of the
Corpus
will be used.
tester (callable) – Function that estimates if a sentence is eligible to be part of the summary.
- Returns
- Return type
-
rank_documents
(query, num_query=None) Performs a Gismo query at document level. If the query fails, builds a generic query instead. The
gismo
and query_
attributes are updated.
-
summarize
(query='', **kwargs) Performs a full run of all summary-related operations:
Rank a query at document level, falling back to a generic query if the query fails;
Extract sentences from the top documents;
Order sentences by one of the three methods proposed (rank, coverage, or cosine);
Apply post-processing and return the list of selected sentences.
Note that calling a
Summarizer
will call its summarize()
method.
- Parameters
query (
str
) – Query to run.
kwargs (
dict
) – Runtime specific parameters (see default_summarizer_parameters
for possible arguments).
- Returns
Summary.
- Return type
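The relationship between calling the object and summarize(), and the way runtime keyword arguments can override defaults for a single call, can be sketched as follows. MiniSummarizer is a hypothetical toy class, not the sisu implementation: its ranking, extraction and ordering steps are crude stubs that only echo the pipeline described above, and the per-call override behaviour is an assumption.

```python
DEFAULTS = {"order": "rank", "num_sentences": None, "max_chars": None}


class MiniSummarizer:
    def __init__(self, documents, **kwargs):
        self.documents = documents
        # Construction-time parameters override the defaults.
        self.parameters = {**DEFAULTS, **kwargs}

    def summarize(self, query="", **kwargs):
        # Runtime parameters apply to this call only.
        p = {**self.parameters, **kwargs}
        q = query.lower()
        # 1. Rank documents: keep those mentioning the query,
        #    or fall back to all documents if none does.
        docs = [d for d in self.documents if q in d.lower()] or self.documents
        # 2. Extract sentences with a naive split on periods.
        sentences = [s.strip() + "." for d in docs for s in d.split(".") if s.strip()]
        # 3. Order: sentences mentioning the query first (stable sort).
        sentences.sort(key=lambda s: q not in s.lower())
        # 4. Post-process (identity here) and truncate.
        n = p["num_sentences"] or len(sentences)
        return sentences[:n]

    def __call__(self, query="", **kwargs):
        # Calling the object is the same as calling summarize().
        return self.summarize(query, **kwargs)


summa_sketch = MiniSummarizer(["Bats fly. Cats purr.", "Dogs bark."], num_sentences=2)
print(summa_sketch("bats", num_sentences=1))
```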
-
sisu.summarizer.
cosine_order
(projection, sentences, query) Order relevant sentences by cosine similarity to the query.
- Parameters
projection (callable) – A function that converts a text into a tuple whose first element is an embedding (typically a Gismo
query_projection()
).
sentences (
list
of dict
) – Sentences as output by extract_sentences().
query (
str
) – Target query
- Returns
Ordered list of indexes of relevant sentences, sorted by cosine similarity
- Return type
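A minimal sketch of this ordering, assuming a toy bag-of-words projection in place of a Gismo query_projection(). The names toy_projection, cosine and cosine_order_sketch are hypothetical, as is the success flag returned as second tuple element; only the contract described above (the projection returns a tuple whose first element is the embedding, sentences are dictionaries with relevant and sanitized keys) comes from this documentation.

```python
import math
from collections import Counter


def toy_projection(text):
    """Return (embedding, success); the embedding is a word-count mapping."""
    counts = Counter(text.lower().split())
    return counts, bool(counts)


def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cosine_order_sketch(projection, sentences, query):
    """Indexes of relevant sentences, most similar to the query first."""
    q = projection(query)[0]
    scored = [(cosine(projection(s["sanitized"])[0], q), s["index"])
              for s in sentences if s["relevant"]]
    # Highest similarity first; ties are broken by original position.
    return [i for _, i in sorted(scored, key=lambda t: (-t[0], t[1]))]


toy_sentences = [
    {"index": 0, "sentence": "Cats purr loudly.", "relevant": True,
     "sanitized": "cats purr loudly"},
    {"index": 1, "sentence": "See [3].", "relevant": False, "sanitized": ""},
    {"index": 2, "sentence": "Bat virus.", "relevant": True,
     "sanitized": "bat virus"},
    {"index": 3, "sentence": "The bat community hosts a virus.", "relevant": True,
     "sanitized": "the bat community hosts a virus"},
]
print(cosine_order_sketch(toy_projection, toy_sentences, "bat virus"))
```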
-
sisu.summarizer.
default_summarizer_parameters
= {'itf': True, 'max_chars': None, 'num_documents': None, 'num_query': None, 'num_sentences': None, 'order': 'rank', 'post_processing': <function <lambda>>, 'sentence_gismo_parameters': {'post': False, 'resolution': 0.99}, 'sentence_tester': <function is_relevant_sentence>, 'text_getter': None}
List of parameters for the summarizer with their default values.
- Parameters
order (
str
) – Sorting function.
text_getter (callable) – Extraction of text from corpus item. If not specified, the to_text of the
Corpus
will be used.
sentence_tester (callable) – Function that estimates if a sentence is eligible to be part of the summary.
itf (
bool
) – Use of ITF normalization in the sentence-level Gismo.
post_processing (callable) – Post-processing transformation. Signature is (
Summarizer
, int
) -> str
sentence_gismo_parameters (
dict
) – Tuning of sentence-level Gismo. post MUST be set to False.
num_documents (
int
or None) – Number of documents to pre-select.
num_query (
int
or None) – Number of features to use in generic query.
num_sentences (
int
or None) – Number of sentences to return.
max_chars (
int
or None) – Maximal number of characters to return.
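How num_sentences and max_chars might bound a summary can be sketched as below. The stopping rule (stop at the first sentence that would exceed either limit) is an assumption made for illustration, and select_within_budget is a hypothetical helper, not a sisu function.

```python
def select_within_budget(ordered_sentences, num_sentences=None, max_chars=None):
    """Take sentences in order until a count or character limit is reached."""
    selected, used = [], 0
    for s in ordered_sentences:
        # Stop once the requested number of sentences is reached.
        if num_sentences is not None and len(selected) >= num_sentences:
            break
        # Stop before exceeding the character budget (assumed rule).
        if max_chars is not None and used + len(s) > max_chars:
            break
        selected.append(s)
        used += len(s)
    return selected


ranked = ["a" * 10, "b" * 20, "c" * 5]
print(select_within_budget(ranked, max_chars=30))
```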
-
sisu.summarizer.
extract_sentences
(source, indices, getter=None, tester=None) Pick up the entries of the source corresponding to indices and build a list of sentences out of that.
Each sentence is a dictionary with the following keys:
index: position of the sentence in the returned list
sentence: the actual sentence
relevant: a boolean that tells if the sentence is eligible for being part of the summary
sanitized: for relevant sentences, a simplified version to be fed to the embedding
- Parameters
- Returns
- Return type
list of dict
Examples
>>> doc1 = ("This is a short sentence! This is a sentence with reference to the url http://www.ix.com! "
...         "This sentence is not too short and not too long, without URL and without citation. "
...         "I have many things to say in that sentence, to the point "
...         "I do not know if I will stop anytime soon but don't let it stop "
...         "you from reading this meaninless garbage and this goes on and "
...         "this goes on and this goes on and this goes on and this goes on and "
...         "this goes on and this goes on and this goes on and this goes on "
...         "and this goes on and this goes on and this goes on and this goes "
...         "on and this goes on and this goes on and this goes on and this goes "
...         "on and this goes on and that is all.")
>>> doc2 = ("This is a a sentence with some citations [3, 7]. "
...         "This sentence is not too short and not too long, without URL and without citation. "
...         "Note that the previous sentence is already present in doc1. "
...         "The enzyme cytidine monophospho-N-acetylneuraminic acid hydroxylase (CMAH) catalyzes "
...         "the synthesis of Neu5Gc by hydroxylation of Neu5Ac (Schauer et al. 1968).")
>>> extract_sentences([doc1, doc2], [1, 0])
[{'index': 0, 'sentence': 'This is a a sentence with some citations [3, 7].', 'relevant': False, 'sanitized': ''},
 {'index': 1, 'sentence': 'This sentence is not too short and not too long, without URL and without citation.', 'relevant': True, 'sanitized': 'This sentence is not too short and not too long without URL and without citation'},
 {'index': 2, 'sentence': 'Note that the previous sentence is already present in doc1.', 'relevant': True, 'sanitized': 'Note that the previous sentence is already present in doc'},
 {'index': 3, 'sentence': 'The enzyme cytidine monophospho-N-acetylneuraminic acid hydroxylase (CMAH) catalyzes the synthesis of Neu5Gc by hydroxylation of Neu5Ac (Schauer et al. 1968).', 'relevant': False, 'sanitized': ''},
 {'index': 4, 'sentence': 'This is a short sentence!', 'relevant': False, 'sanitized': ''},
 {'index': 5, 'sentence': 'This is a sentence with reference to the url http://www.ix.com!', 'relevant': False, 'sanitized': ''},
 {'index': 6, 'sentence': 'This sentence is not too short and not too long, without URL and without citation.', 'relevant': False, 'sanitized': ''},
 {'index': 7, 'sentence': "I have many things to say in that sentence...", 'relevant': False, 'sanitized': ''}]