Summarizer

class sisu.summarizer.PostNLP(nlp, coref=False)[source]

Post-processor for the Summarizer that leverages a spaCy NLP engine.

  • Discard sentences with no verb.

  • Remove adverbs and punctuation that start a sentence (e.g. “However, we …” -> “We …”).

  • Optionally, if the engine supports co-references, resolve them.

Parameters
  • nlp (callable) – A spaCy nlp engine.

  • coref (bool) – Resolve co-references if the nlp engine supports it.
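
The second rule can be approximated without spaCy. The sketch below is not the PostNLP implementation: it uses a hypothetical hard-coded list of leading adverbs (LEADING_ADVERBS) and a hypothetical helper (strip_leading_adverb), whereas PostNLP relies on the analysis provided by the spaCy engine.

```python
import re

# Hypothetical word list, for illustration only.
LEADING_ADVERBS = ("however", "moreover", "furthermore", "therefore", "thus")


def strip_leading_adverb(sentence):
    """Drop one leading adverb and the punctuation after it, then re-capitalize."""
    pattern = r"^(?:%s)\b[\s,;:]*" % "|".join(LEADING_ADVERBS)
    stripped = re.sub(pattern, "", sentence, count=1, flags=re.IGNORECASE)
    if stripped != sentence and stripped:
        stripped = stripped[0].upper() + stripped[1:]
    return stripped


print(strip_leading_adverb("However, we have not done the analysis."))
# We have not done the analysis.
```

A fixed word list misses most adverbs and cannot tell an adverb from a homonym, which is why a part-of-speech tagger is the more robust choice.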

class sisu.summarizer.Summarizer(gismo, **kwargs)[source]

Summarizer class.

Parameters
query_

Query used to summarize.

Type

str

sentences_

Selected sentences. Each sentence is a dictionary with the following keys:

  • index: position of the sentence in the returned list

  • sentence: the actual sentence

  • relevant: a boolean that tells if the sentence is eligible for being part of the summary

  • sanitized: for relevant sentences, a simplified version to be fed to the embedding

Type

list of dict

order_

Proposed incomplete ordering of the sentences_

Type

numpy.ndarray

sentence_gismo_

Gismo running at sentence level.

Type

Gismo

parameters

Handler of parameters.

Type

Parameters
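
To make the relation between sentences_ and order_ concrete, here is a minimal sketch on hand-written toy data. The helper pick_within_budget is hypothetical, and the way it applies a character budget is an assumption, not a description of what the Summarizer does internally. A plain list stands in for the numpy.ndarray.

```python
def pick_within_budget(sentences, order, max_chars):
    """Walk `order` and keep relevant sentences while the budget allows it."""
    summary, used = [], 0
    for i in order:
        entry = sentences[i]
        if not entry["relevant"]:
            continue  # non-relevant sentences never enter the summary
        if used + len(entry["sentence"]) > max_chars:
            break  # assumed policy: stop at the first sentence that does not fit
        summary.append(entry["sentence"])
        used += len(entry["sentence"])
    return summary


# Toy data following the documented keys of sentences_.
sentences = [
    {"index": 0, "sentence": "Bats host many coronaviruses.", "relevant": True,
     "sanitized": "Bats host many coronaviruses"},
    {"index": 1, "sentence": "See [3, 7].", "relevant": False, "sanitized": ""},
    {"index": 2, "sentence": "Genotype sharing follows bat phylogeny.", "relevant": True,
     "sanitized": "Genotype sharing follows bat phylogeny"},
]
order = [2, 1, 0]  # the library stores a numpy.ndarray here

print(pick_within_budget(sentences, order, max_chars=50))
# ['Genotype sharing follows bat phylogeny.']
```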

Examples

The package contains a data folder with a toy gismo with articles related to Covid-19. We load it.

>>> gismo = Gismo(filename="toy_gismo", path="data")

Then we build a summarizer out of it. We tell it to fetch the sentences from the content of the articles.

>>> summa = Summarizer(gismo, text_getter = lambda d: d['content'])

Ask for a summary on bat with a maximal budget of 500 characters, using pure TF-IDF sentence embedding.

>>> summa('bat', max_chars=500, itf=False) 
['By comparing the amino acid sequence of 2019-nCoV S-protein (GenBank Accession: MN908947.3) with
  Bat SARS-like coronavirus isolate bat-SL-CoVZC45 and Bat SARS-like coronavirus isolate Bat-SL-CoVZXC21,
  the latter two were shown to share 89.1% and 88.6% sequence identity to 2019-nCoV S-protein
  (supplementary figure 1) .',
 'Within our bat-hemoplasma network, genotype sharing was restricted to five host communities,
  380 whereas six genotypes were each restricted to a single bat species (Fig. 5A ).']

Now a summary based on the cosine ordering, using the content of abstracts and pure TF-IDF sentence embedding.

>>> summa('bat', max_chars=500, order='cosine', text_getter = lambda d: d['abstract']) 
['Bat dipeptidyl peptidase 4 (DPP4) sequences were closely related to 38 those of human and non-human
  primates but distinct from dromedary DPP4 sequence.',
 'The multiple sequence alignment data correlated with already published reports on SARS-CoV-2
  indicated that it is closely related to Bat-Severe Acute Respiratory Syndrome like coronavirus
  (Bat CoV SARS-like) and wellstudied Human SARS.',
 '(i.e., hemoplasmas) across a species-rich 40 bat community in Belize over two years.']

Now 4 sentences using a coverage ordering.

>>> summa('bat', num_sentences=4, order='coverage') 
['By comparing the amino acid sequence of 2019-nCoV S-protein (GenBank Accession: MN908947.3)
  with Bat SARS-like coronavirus isolate bat-SL-CoVZC45 and Bat SARS-like coronavirus isolate
  Bat-SL-CoVZXC21, the latter two were shown to share 89.1% and 88.6% sequence identity
  to 2019-nCoV S-protein (supplementary figure 1) .',
 'However, we have not done the IDPs analysis for ORF10 from the Bat-SL-CoVZC45 strain since we
  have taken different strain of Bat CoV (reviewed strain HKU3-1) in our study.',
 'To test the dependence of the hemoplasma 290 phylogeny upon the bat phylogeny and thus assess
  evidence of evolutionary codivergence, we 291 applied the Procrustes Approach to Cophylogeny
  (PACo) using distance matrices and the paco 292 We used hemoplasma genotype assignments to
  create a network, with each node representing a 299 bat species and edges representing shared
  genotypes among bat species pairs.',
 'However, these phylogenetic patterns in prevalence were decoupled from those describing bat
  526 species centrality in sharing hemoplasmas, such that genotype sharing was generally
  restricted 527 by bat phylogeny.']

As you can see, some of the answers start with “However, ”. A bit of NLP post-processing can take care of those.

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> post_nlp = PostNLP(nlp)
>>> summa('bat', num_sentences=4, order='coverage', post_processing=post_nlp) 
['By comparing the amino acid sequence of 2019-nCoV S-protein (GenBank Accession: MN908947.3)
  with Bat SARS-like coronavirus isolate bat-SL-CoVZC45 and Bat SARS-like coronavirus isolate
  Bat-SL-CoVZXC21, the latter two were shown to share 89.1% and 88.6% sequence identity
  to 2019-nCoV S-protein (supplementary figure 1) .',
 'We have not done the IDPs analysis for ORF10 from the Bat-SL-CoVZC45 strain since we
  have taken different strain of Bat CoV (reviewed strain HKU3-1) in our study.',
 'To test the dependence of the hemoplasma 290 phylogeny upon the bat phylogeny and thus assess
  evidence of evolutionary codivergence, we 291 applied the Procrustes Approach to Cophylogeny
  (PACo) using distance matrices and the paco 292 We used hemoplasma genotype assignments to
  create a network, with each node representing a 299 bat species and edges representing shared
  genotypes among bat species pairs.',
 'These phylogenetic patterns in prevalence were decoupled from those describing bat
  526 species centrality in sharing hemoplasmas, such that genotype sharing was generally
  restricted 527 by bat phylogeny.']

build_coverage_order(k)[source]

Populate order_ with a covering order with target number of sentences k. The actual number of indices is stretched by the sentence Gismo stretch factor.

Parameters

k (int) – Number of optimal covering sentences.

Returns

Covering order.

Return type

numpy.ndarray
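
The general idea of a covering order can be illustrated with a generic greedy heuristic. This is not the algorithm used by Gismo: the function greedy_coverage_order and the representation of sentences as word sets are assumptions made purely for illustration.

```python
def greedy_coverage_order(word_sets, k):
    """Return up to k indices, each one adding the most not-yet-covered words."""
    covered, order = set(), []
    remaining = set(range(len(word_sets)))
    while remaining and len(order) < k:
        # Ties are broken by the smallest index to keep the result deterministic.
        best = max(sorted(remaining), key=lambda i: len(word_sets[i] - covered))
        order.append(best)
        covered |= word_sets[best]
        remaining.discard(best)
    return order


word_sets = [
    {"bat", "virus", "protein"},
    {"bat", "virus"},
    {"network", "genotype", "species", "host"},
]
print(greedy_coverage_order(word_sets, k=2))
# [2, 0]
```

The second sentence is picked last because it brings nothing that the first one does not already cover, which is the behaviour one expects from a coverage-driven order as opposed to a pure ranking.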

build_sentence_gismo(itf=None, s_g_p=None)[source]

Creates the Gismo of sentences (sentence_gismo_)

Parameters
  • itf (bool, optional) – Applies TF-IDTF embedding. If False, TF-IDF embedding is used.

  • s_g_p (dict) – Parameters for the sentence Gismo.

Returns

Return type

None

build_sentence_source(num_documents=None, getter=None, tester=None)[source]

Creates the corpus of sentences (sentences_)

Parameters
  • num_documents (int, optional) – Number of documents to select (if not, Gismo will automatically decide).

  • getter (callable) – Extraction of text from corpus item. If not specified, the to_text of the Corpus will be used.

  • tester (callable) – Function that estimates if a sentence is eligible to be part of the summary.

Returns

Return type

None

rank_documents(query, num_query=None)[source]

Perform a Gismo query at document level. If the query fails, build a generic query instead. The gismo and query_ attributes are updated.

Parameters
  • query (str) – Input text

  • num_query (int) – Number of words of the generic query, if any

Returns

Return type

None

summarize(query='', **kwargs)[source]

Performs a full run of all summary-related operations:

  • Rank a query at document level, falling back to a generic query if the query fails;

  • Extract sentences from the top documents;

  • Order sentences by one of the three proposed methods: rank, coverage, or cosine;

  • Apply post-processing and return the list of selected sentences.

Note that calling a Summarizer will call its summarize() method.

Parameters
Returns

Summary.

Return type

list of str
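
The note about calling a Summarizer corresponds to the standard Python __call__ protocol. A minimal sketch with a hypothetical stand-in class follows (ToySummarizer is not part of sisu, and its summarize() only echoes its arguments):

```python
class ToySummarizer:
    """Hypothetical stand-in showing how a call can delegate to summarize()."""

    def summarize(self, query="", **kwargs):
        # A real summarizer would rank, extract, order and post-process here.
        return [f"query={query!r}", f"options={sorted(kwargs)}"]

    def __call__(self, *args, **kwargs):
        # Calling the instance is just another way to invoke summarize().
        return self.summarize(*args, **kwargs)


summa = ToySummarizer()
assert summa("bat", max_chars=500) == summa.summarize("bat", max_chars=500)
```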

sisu.summarizer.cosine_order(projection, sentences, query)[source]

Order relevant sentences by cosine similarity to the query.

Parameters
  • projection (callable) – A function that converts a text into a tuple whose first element is an embedding (typically a Gismo query_projection()).

  • sentences (list of dict) – Sentences as output by extract_sentences().

  • query (str) – Target query

Returns

Ordered list of indexes of relevant sentences, sorted by cosine similarity

Return type

list of int
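
A self-contained sketch of the same idea follows. It is not the sisu implementation: toy_projection is a hypothetical bag-of-words stand-in for a Gismo query_projection(), toy_cosine_order is a hypothetical name, and sorting by decreasing similarity is an assumption.

```python
import math
from collections import Counter


def toy_projection(text):
    """Hypothetical projection: returns (bag-of-words embedding, success flag)."""
    counts = Counter(text.lower().split())
    return counts, bool(counts)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def toy_cosine_order(projection, sentences, query):
    """Indexes of relevant sentences, most similar to the query first."""
    q = projection(query)[0]
    relevant = [s for s in sentences if s["relevant"]]
    scores = {s["index"]: cosine(projection(s["sanitized"])[0], q) for s in relevant}
    return sorted(scores, key=scores.get, reverse=True)


sentences = [
    {"index": 0, "sentence": "Viruses evolve.", "relevant": True, "sanitized": "viruses evolve"},
    {"index": 1, "sentence": "[1]", "relevant": False, "sanitized": ""},
    {"index": 2, "sentence": "The bat hosts a bat virus.", "relevant": True,
     "sanitized": "the bat hosts a bat virus"},
]
print(toy_cosine_order(toy_projection, sentences, "bat virus"))
# [2, 0]
```

Only the first element of the projection output is used, which matches the documented contract of the projection argument.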

sisu.summarizer.default_summarizer_parameters = {'itf': True, 'max_chars': None, 'num_documents': None, 'num_query': None, 'num_sentences': None, 'order': 'rank', 'post_processing': <function <lambda>>, 'sentence_gismo_parameters': {'post': False, 'resolution': 0.99}, 'sentence_tester': <function is_relevant_sentence>, 'text_getter': None}

List of parameters for the summarizer with their default values.

Parameters
  • order (str) – Sorting function.

  • text_getter (callable) – Extraction of text from corpus item. If not specified, the to_text of the Corpus will be used.

  • sentence_tester (callable) – Function that estimates if a sentence is eligible to be part of the summary

  • itf (bool) – Use of ITF normalization in the sentence-level Gismo

  • post_processing (callable) – Post-processing transformation. Signature is (Summarizer, int) -> str

  • sentence_gismo_parameters (dict) – Tuning of sentence-level gismo. post MUST be set to False.

  • num_documents (int or None) – Number of documents to pre-select

  • num_query (int or None) – Number of features to use in generic query

  • num_sentences (int or None) – Number of sentences to return

  • max_chars (int or None) – Maximal number of characters to return
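
Since the defaults are listed as a plain dictionary, per-call overrides can be pictured as a dictionary merge. The sketch below copies a subset of the defaults listed above. The resolve_parameters helper is hypothetical, and rejecting unknown keys is an assumption, not necessarily how the Parameters handler behaves.

```python
# Subset of the documented defaults (callables omitted for brevity).
DEFAULTS = {"itf": True, "max_chars": None, "num_documents": None,
            "num_query": None, "num_sentences": None, "order": "rank"}


def resolve_parameters(**overrides):
    """Merge keyword overrides on top of the defaults, rejecting unknown keys."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown parameter(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


params = resolve_parameters(order="coverage", num_sentences=4)
print(params["order"], params["num_sentences"], params["itf"])
# coverage 4 True
```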

sisu.summarizer.extract_sentences(source, indices, getter=None, tester=None)[source]

Pick up the entries of the source corresponding to indices and build a list of sentences out of that.

Each sentence is a dictionary with the following keys:

  • index: position of the sentence in the returned list

  • sentence: the actual sentence

  • relevant: a boolean that tells if the sentence is eligible for being part of the summary

  • sanitized: for relevant sentences, a simplified version to be fed to the embedding

Parameters
  • source (list) – list of objects

  • indices (iterable of int) – Indexes of the source items to select

  • getter (callable, optional) – Tells how to convert a source entry into text.

  • tester (callable, optional) – Tells if the sentence is eligible for being part of the summary.

Returns

Return type

list of dict

Examples

>>> doc1 = ("This is a short sentence! This is a sentence with reference to the url http://www.ix.com! "
...        "This sentence is not too short and not too long, without URL and without citation. "
...        "I have many things to say in that sentence, to the point "
...        "I do not know if I will stop anytime soon but don't let it stop "
...        "you from reading this meaninless garbage and this goes on and "
...        "this goes on and this goes on and this goes on and this goes on and "
...        "this goes on and  this goes on and  this goes on and  this goes on "
...        "and  this goes on and  this goes on and  this goes on and  this goes "
...        "on and  this goes on and  this goes on and  this goes on and  this goes "
...        "on and  this goes on and that is all.")
>>> doc2 = ("This is a a sentence with some citations [3, 7]. "
...         "This sentence is not too short and not too long, without URL and without citation. "
...         "Note that the previous sentence is already present in doc1. "
...         "The enzyme cytidine monophospho-N-acetylneuraminic acid hydroxylase (CMAH) catalyzes "
...         "the synthesis of Neu5Gc by hydroxylation of Neu5Ac (Schauer et al. 1968).")
>>> extract_sentences([doc1, doc2], [1, 0]) 
[{'index': 0, 'sentence': 'This is a a sentence with some citations [3, 7].', 'relevant': False, 'sanitized': ''},
 {'index': 1, 'sentence': 'This sentence is not too short and not too long, without URL and without citation.',
  'relevant': True, 'sanitized': 'This sentence is not too short and not too long without URL and without citation'},
 {'index': 2, 'sentence': 'Note that the previous sentence is already present in doc1.',
  'relevant': True, 'sanitized': 'Note that the previous sentence is already present in doc'},
 {'index': 3, 'sentence': 'The enzyme cytidine monophospho-N-acetylneuraminic acid hydroxylase (CMAH) catalyzes
                           the synthesis of Neu5Gc by hydroxylation of Neu5Ac (Schauer et al. 1968).',
  'relevant': False, 'sanitized': ''},
 {'index': 4, 'sentence': 'This is a short sentence!', 'relevant': False, 'sanitized': ''},
 {'index': 5, 'sentence': 'This is a sentence with reference to the url http://www.ix.com!',
  'relevant': False, 'sanitized': ''},
 {'index': 6, 'sentence': 'This sentence is not too short and not too long, without URL and without citation.',
  'relevant': False, 'sanitized': ''},
 {'index': 7, 'sentence': "I have many things to say in that sentence...",
  'relevant': False, 'sanitized': ''}]
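
The tester argument makes the eligibility rule pluggable. Below is a sketch of a custom tester. The signature (a sentence string in, a boolean out), the length thresholds, and the regular expressions are all assumptions chosen to mimic the behaviour visible in the example above; they are not the code of is_relevant_sentence. The rejection of the duplicate at index 6 is not reproduced here.

```python
import re

URL = re.compile(r"https?://\S+")
BRACKET_CITATION = re.compile(r"\[\d+(?:\s*,\s*\d+)*\]")
AUTHOR_YEAR = re.compile(r"\bet al\.")


def simple_tester(sentence, min_len=40, max_len=300):
    """Hypothetical eligibility test: medium length, no URL, no citation."""
    if not min_len <= len(sentence) <= max_len:
        return False
    return not (URL.search(sentence)
                or BRACKET_CITATION.search(sentence)
                or AUTHOR_YEAR.search(sentence))


print(simple_tester("This is a short sentence!"))
# False
```

If the assumed signature holds, such a function could be passed as the tester argument of extract_sentences() or as the sentence_tester parameter of the Summarizer.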