Reference

PIT

sisu.pit.gismo_wrapper.COVID19_TEXT_GETTERS = {'abstract': <function get_abstract>, 'content': <function get_content>, 'title': <function get_title>}

Getters for the COVID-19 dataset. Note: these are slated to move to a COVID submodule.

sisu.pit.gismo_wrapper.RE_EMAIL = re.compile('https?://[a-zA-Z.:0-9]+|www.[a-zA-Z.:0-9]+.*|www.|.org|[a-zA-Z.:0-9]+/')

Regular expression for URL detection. Despite the name, the pattern matches URLs and URL fragments, not email addresses.

sisu.pit.gismo_wrapper.RE_NOISE = re.compile('[,.:;()0-9+=%\\[\\]_]')

Regular expression for noise characters: punctuation, digits, brackets, and the symbols +, =, % and _.

sisu.pit.gismo_wrapper.RE_REFERENCE = re.compile('\\[\\d+,\\s\\d+\\]|\\[\\d+\\]|\\(\\d+,\\s\\d+\\)|\\(\\d+\\)')

Regular expression for numeric citations such as [3], [3, 17], (3) or (3, 17).
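
To see what these three patterns actually match, the sketch below recompiles them verbatim with the standard `re` module, so it runs without `sisu` installed. The local names only mirror the module constants, and the probe strings are made up for illustration.

```python
import re

# Patterns copied verbatim from the constants documented above.
RE_EMAIL = re.compile('https?://[a-zA-Z.:0-9]+|www.[a-zA-Z.:0-9]+.*|www.|.org|[a-zA-Z.:0-9]+/')
RE_NOISE = re.compile('[,.:;()0-9+=%\\[\\]_]')
RE_REFERENCE = re.compile('\\[\\d+,\\s\\d+\\]|\\[\\d+\\]|\\(\\d+,\\s\\d+\\)|\\(\\d+\\)')

# RE_EMAIL finds a URL...
url_match = RE_EMAIL.search("See https://www.ens.fr for details")
# ...but finds nothing in a plain email address.
email_match = RE_EMAIL.search("santa@northpole.com")

# RE_NOISE strips punctuation, digits and a few symbols.
denoised = RE_NOISE.sub("", "a, b: 42%")

# RE_REFERENCE picks up bracketed or parenthesized numeric citations.
citations = RE_REFERENCE.findall("as shown in [3] and (4, 5)")

print(url_match.group(), email_match, repr(denoised), citations)
```

Note that the dots in `www.` and `.org` are unescaped, so they match any character. This makes the URL pattern broader than it may look.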

sisu.pit.gismo_wrapper.document_to_text(document: dict, text_getters=None) → str

Warning: this function is not working and should probably be removed.

Convert a document (e.g. from the COVID-19 dataset) to a string.

Parameters
  • document (dict) – A simplified COVID-19 document.

  • text_getters (dict, optional) – Documented as a dict, but the code uses a list.

Returns

The str representing the input document.

Return type

str

Examples

No example is available.

sisu.pit.gismo_wrapper.initialize_embedding(documents: list, stop_words: Optional[list] = None, max_ngram: int = 1, min_df: float = 0.02, max_df: float = 0.85, document_to_text=<function simplified_document_to_string>, preprocessor=None) → gismo.embedding.Embedding

Initializes an embedding and fits it on the documents.

Parameters
  • documents – A list of dict representing documents with strings in the values.

  • stop_words – A list of words to ignore in the vocabulary.

  • max_ngram – the maximum length of ngrams to take into account (e.g. 2 if bigrams in vocabulary).

  • min_df – Minimum frequency of a word to be considered in the vocabulary. If an int, the word must be contained in at least min_df documents.

  • max_df – Maximum frequency of a word to be considered in the vocabulary.

  • document_to_text – Callback(Document) -> str.

  • preprocessor

Returns

The embedding fitted on the documents.

Return type

Embedding
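
The effect of min_df and max_df can be sketched in pure Python. The sketch assumes the usual document-frequency convention (a float is a proportion of the documents, an int is an absolute document count). The helper name `document_frequency_filter` is hypothetical and is not part of `sisu` or `gismo`.

```python
from collections import Counter

def document_frequency_filter(texts, min_df=0.02, max_df=0.85):
    """Return the sorted words whose document frequency lies within the bounds.

    Assumption: a float threshold is a proportion of the corpus size,
    an int threshold is an absolute number of documents.
    """
    n = len(texts)
    df = Counter()
    for text in texts:
        # Count each word at most once per document.
        df.update(set(text.lower().split()))

    def to_count(threshold):
        return threshold if isinstance(threshold, int) else threshold * n

    low, high = to_count(min_df), to_count(max_df)
    return sorted(word for word, count in df.items() if low <= count <= high)

corpus = ["the cat sat", "the dog sat", "the bird flew", "the cat flew"]
vocabulary = document_frequency_filter(corpus, min_df=2, max_df=0.85)
print(vocabulary)
```

Here "the" appears in every document and exceeds max_df, while "dog" and "bird" appear only once and fall below min_df.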

sisu.pit.gismo_wrapper.make_gismo(documents: list, alpha: float = 0.2, other_embedding: Optional[gismo.embedding.Embedding] = None, is_documents_embedding: bool = False, document_to_text=<function simplified_document_to_string>) → gismo.gismo.Gismo

Make a Gismo object from a list of documents.

Parameters
  • documents – A list of documents with strings in the values.

  • alpha – A float in [0, 1] indicating the damping factor used in the D-iteration used by Gismo.

  • other_embedding – Embedding already fitted on a corpus.

  • document_to_text – Callback(Document) -> str.

Returns

A Gismo object made from the given documents and embedding.

sisu.pit.gismo_wrapper.old_make_gismo(documents: list, alpha: float = 0.2, other_embedding: Optional[gismo.embedding.Embedding] = None, is_documents_embedding: bool = False, document_to_text=<function simplified_document_to_string>) → gismo.gismo.Gismo

Make a Gismo object from a list of documents.

Parameters
  • documents – A list of documents with strings in the values.

  • alpha – A float in [0, 1] indicating the damping factor used in the D-iteration used by Gismo.

  • other_embedding – Embedding already fitted on a corpus.

  • document_to_text – Callback(Document) -> str.

Returns

A Gismo object made from the given documents and embedding.

sisu.pit.gismo_wrapper.sanitize_text(text: str) → str

Sanitize a text. This is done to improve the Embedding quality.

Parameters

text (str) – Text to clean.

Returns

The corresponding sanitized str instance.

Return type

str

Examples

>>> sanitize_text("This is a mail: santa@northpole.com!")
'This is a mail santa@northpolecom!'
>>> sanitize_text("This is a !*[ url: https://www.ens.fr!")
'This is a !* url /'
>>> sanitize_text("This are references [3, 17].")
'This are references  '

sisu.pit.gismo_wrapper.simplified_document_to_string(doc: dict) → str

Transforms a dict into a string made of its values.

Parameters

doc (dict) – A dict representing a document of “depth one”, all the values are strings.

Returns

Concatenation of doc values.

Return type

str

Examples

>>> from sisu.pit.preprocessing.sentences import toy_article
>>> simplified_document_to_string(toy_article)
"Predator In the jungle, no-one hears you far cry. And vice-versa. They say to make a long abstract, with the number 42 in it, so here I am. There is no-one in the trees. Is there? Predators don't like to lose."

sisu.pit.building_summary.is_relevant_sentence(sentence: str, min_num_words: int = 6, max_num_words: int = 60) → bool

Ignore sentences that are too short, too long, that contain a URL or a citation.

Parameters
  • sentence (str) – Sentence to analyze.

  • min_num_words (int, optional) – Minimal number of words.

  • max_num_words (int, optional) – Maximal number of words.

Returns

True if the sentence is relevant, False otherwise.

Return type

bool

Examples

>>> is_relevant_sentence("This is a short sentence!")
False
>>> is_relevant_sentence("This is a sentence with reference to the url http://www.ix.com!")
False
>>> is_relevant_sentence("This is a a sentence with some citations [3, 7]!")
False
>>> is_relevant_sentence("I have many things to say in that sentence, to the point "
...                      "I do not know if I will stop anytime soon but don't let it stop"
...                      "you from reading this meaninless garbage and this goes on and "
...                      "this goes on and this goes on and this goes on and this goes on and "
...                      " this goes on and  this goes on and  this goes on and  this goes on "
...                      "and  this goes on and  this goes on and  this goes on and  this goes "
...                      "on and  this goes on and  this goes on and  this goes on and  this goes "
...                      "on and  this goes on and ")
False
>>> is_relevant_sentence("This sentence is not too short and not too long, without URL and without citation.")
True
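
The filtering rule described above can be approximated as follows. This is a minimal sketch, not the library's implementation: the helper name `looks_relevant`, the two simplified patterns, and the inclusive word-count bounds are all assumptions.

```python
import re

# Assumed, simplified patterns; the library's own regular expressions may differ.
URL_PATTERN = re.compile(r"https?://|www\.")
CITATION_PATTERN = re.compile(r"\[\d+(,\s*\d+)*\]|\(\d+(,\s*\d+)*\)")

def looks_relevant(sentence, min_num_words=6, max_num_words=60):
    """Reject sentences that are too short, too long, or contain a URL or citation."""
    num_words = len(sentence.split())
    # Assumption: both bounds are inclusive.
    if not min_num_words <= num_words <= max_num_words:
        return False
    if URL_PATTERN.search(sentence) or CITATION_PATTERN.search(sentence):
        return False
    return True

print(looks_relevant("This is a short sentence!"))
print(looks_relevant("This sentence is not too short and not too long, "
                     "without URL and without citation."))
```

Checking the cheap word count first avoids running the regular expressions on sentences that are rejected anyway.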

sisu.pit.building_summary.make_query(sentence: str, language='en') → str

Builds a query from a sentence.

Parameters
  • sentence (str) – A string from which we want to build a query.

  • language (str) – Language of the sentence (e.g. 'en' or 'fr').

Returns

A string corresponding to the query.

Return type

str

Examples

>>> make_query("Life is something nice!")
'life nice'
>>> make_query("La vie est belle !", language='fr')
'vie belle'
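
The examples above suggest that the query keeps the lowercased content words of the sentence and drops punctuation and stop words. A rough sketch of that idea follows. The helper name `sketch_query` and the tiny stop-word lists are assumptions made for illustration; the source does not say how the library tokenizes or which stop words it uses.

```python
import re

# Tiny illustrative stop-word lists (assumptions, far from complete).
STOP_WORDS = {
    "en": {"is", "a", "the", "something", "of", "and"},
    "fr": {"la", "le", "est", "un", "une", "et"},
}

def sketch_query(sentence, language="en"):
    """Lowercase the sentence, keep word tokens, and drop stop words."""
    tokens = re.findall(r"\w+", sentence.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS[language])

print(sketch_query("Life is something nice!"))
print(sketch_query("La vie est belle !", language="fr"))
```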

sisu.pit.building_summary.make_tree(documents: list, query: str = '', depth: int = 1, trees: Optional[list] = None, documents_gismo: Optional[gismo.gismo.Gismo] = None, num_documents: Optional[int] = None, num_sentences: Optional[int] = None, embedding: Optional[gismo.embedding.Embedding] = None, used_sentences: Optional[set] = None) → list

Builds a hierarchical summary.

Parameters
  • documents (list of dict) – A list of dict corresponding to documents, only the values of the “content” key will be summarized.

  • query (str, optional) – Textual query to focus the summary on one subject.

  • depth (int, optional) – An int giving the depth of the summary (depth one is a sequential summary).

  • trees (list, optional) – A list of dict being completed, needed for the recursion.

  • documents_gismo (Gismo, optional) – Pre-existing Gismo.

  • num_documents (int, optional) – Number of top documents to take into account for the summary.

  • num_sentences (int, optional) – Number of sentences wanted in the summary.

  • embedding (Embedding, optional) – An Embedding fitted on a bigger corpus than documents.

  • used_sentences (set, optional) – A set of “forbidden” sentences. Will be updated in place.

Returns

A list of dict corresponding to the hierarchical summary

Return type

list of dict

Examples

>>> from gismo.datasets.reuters import get_reuters_news
>>> make_tree(get_reuters_news(), query="Orange", num_documents=10, num_sentences=3, depth=2) 
[{'text': 'But some analysts still believe Orange is overvalued.',
  'current_keywords': ['orange', 'one', 'is', 'at', 'on', 'in', 'and', 'its', 'shares', 'has', 'analysts', 'of', 'market', 'believe', 'overvalued'],
  'url': None,
  'children': [{'text': 'Trading sources said China was staying out of the market, and that Indian meal was currently overvalued by a good $20 a tonne.',
                'current_keywords': ['orange', 'overvalued', 'analysts', 'that', 'and', 'are', 'compared', 'believe', 'market', 'but', 'some', 'still', 'of', 'said', 'we'],
                'url': None, 'children': []},
               {'text': 'Since the purchase, widely seen by analysts as overvalued, Quaker has struggled with the line of ready-to-drink teas and juices.',
                'current_keywords': ['orange', 'overvalued', 'analysts', 'that', 'and', 'are', 'compared', 'believe', 'market', 'but', 'some', 'still', 'of', 'said', 'we'],
                'url': None, 'children': []},
               {'text': '"No question that if the dollar continues to be overvalued and continues to be strong, we\'ll see some price erosion later in the year."',
                'current_keywords': ['orange', 'overvalued', 'analysts', 'that', 'and', 'are', 'compared', 'believe', 'market', 'but', 'some', 'still', 'of', 'said', 'we'],
                'url': None, 'children': []}]},
 {'text': 'Orange shares were 2.5p higher at 188p on Friday.',
  'current_keywords': ['orange', 'one', 'is', 'at', 'on', 'in', 'and', 'its', 'shares', 'has', 'analysts', 'of', 'market', 'believe', 'overvalued'],
  'url': None,
  'children': [{'text': 'Orange, Calif.-based Bergen is the largest U.S. distributor of generic drugs, while Miami-based Ivax is a generic drug manufacturing giant.',
                'current_keywords': ['orange', 'higher', 'shares', 'friday', 'on', 'at', 'and', 'in', 'its', 'of', 'percent', 'one', 'mobile', 'to', 'market'],
                'url': None, 'children': []},
               {'text': 'One-2-One and Orange ORA.L, which offer only digital services, are due to release their connection figures next week.',
                'current_keywords': ['orange', 'higher', 'shares', 'friday', 'on', 'at', 'and', 'in', 'its', 'of', 'percent', 'one', 'mobile', 'to', 'market'],
                'url': None, 'children': []},
               {'text': "Dodd noted that BT's plans to raise the price of calls to Orange and One 2 One handsets would be beneficial.",
                'current_keywords': ['orange', 'higher', 'shares', 'friday', 'on', 'at', 'and', 'in', 'its', 'of', 'percent', 'one', 'mobile', 'to', 'market'],
                'url': None, 'children': []}]},
 {'text': 'Orange already has a full roaming agreement in Germany and a partial one in France, centred on Paris.',
  'current_keywords': ['orange', 'one', 'is', 'at', 'on', 'in', 'and', 'its', 'shares', 'has', 'analysts', 'of', 'market', 'believe', 'overvalued'],
  'url': None,
  'children': [{'text': 'Orange says its offer of roaming services between the UK and other countries is part of its aim to provide customers with the best value for money.',
                'current_keywords': ['orange', 'roaming', 'partial', 'centred', 'paris', 'france', 'germany', 'agreement', 'full', 'on', 'and', 'in', 'of', 'for', 'with'],
                'url': None, 'children': []},
               {'text': 'As with all roaming agreements, the financial details of the Swiss deal remain a trade secret.',
                'current_keywords': ['orange', 'roaming', 'partial', 'centred', 'paris', 'france', 'germany', 'agreement', 'full', 'on', 'and', 'in', 'of', 'for', 'with'],
                'url': None, 'children': []},
               {'text': '"We look forward in 1997 to continuing to move ahead and to extending our international service through new roaming agreements and the introduction of dual band handsets."',
                'current_keywords': ['orange', 'roaming', 'partial', 'centred', 'paris', 'france', 'germany', 'agreement', 'full', 'on', 'and', 'in', 'of', 'for', 'with'],
                'url': None, 'children': []}]}]
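
The returned structure is easier to read as an indented outline. The helper below is a hypothetical sketch (it is not part of `sisu`). It relies only on the 'text' and 'children' keys visible in the output above and is demonstrated on a hand-made toy tree.

```python
def outline(trees, indent=0):
    """Flatten a hierarchical summary into indented bullet lines."""
    lines = []
    for node in trees:
        lines.append("  " * indent + "- " + node["text"])
        # Recurse into the children, one indentation level deeper.
        lines.extend(outline(node["children"], indent + 1))
    return lines

# Toy tree with the same keys as the make_tree output (made-up content).
toy_tree = [
    {"text": "Parent sentence.", "current_keywords": ["parent"], "url": None,
     "children": [
         {"text": "Child sentence.", "current_keywords": ["child"],
          "url": None, "children": []},
     ]},
]

print("\n".join(outline(toy_tree)))
```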

sisu.pit.building_summary.summarize(documents, query='', num_documents=None, num_sentences=None, ratio=0.05, embedding=None, num_keywords: int = 15, size_generic_query: int = 5, used_sentences: Optional[set] = None, get_content=<function <lambda>>) → tuple

Produces a list of sentences and a list of keywords.

Parameters
  • documents (list) – A list of documents.

  • query (str, optional) – Textual query to focus the summary on one subject.

  • num_documents (int, optional) – Number of top documents to take into account for the summary.

  • num_sentences (int, optional) – Number of sentences wanted in the summary. Overrides ratio.

  • ratio (float in (0, 1], optional) – Length of the summary as a proportion of the length of the num_documents kept.

  • embedding (Embedding, optional) – An Embedding fitted on a bigger corpus than documents.

  • num_keywords (int, optional) – Number of keywords returned.

  • size_generic_query (int, optional) – Size of the generic query.

  • used_sentences (set, optional) – A set of “forbidden” sentences. Will be updated in place.

  • get_content (callable, optional) – A function that allows the retrieval of a document’s content.

Returns

A list of the summary sentences and a list of keywords.

Return type

tuple

Examples

>>> from gismo.datasets.reuters import get_reuters_news
>>> summarize(get_reuters_news(), num_documents=10, num_sentences=4) 
(['Gum arabic has a history dating back to ancient times.',
  'Hungry nomads pluck gum arabic as they pass with grazing goats and cattle.',
  'For impoverished sub-Saharan states producing the bulk of world demand, gum arabic simply means export currency.',
  "After years of war-induced poverty, gum arabic is offering drought-stricken Chad's rural poor a lifeline to the production plants of the world's food and beverage giants."],
  ['norilsk', 'icewine', 'amiel', 'gum', 'arabic', 'her', 'tibet', 'chad', 'deng', 'oil', 'grapes', 'she', 'his', 'czechs', 'chechnya'])
>>> summarize(get_reuters_news(), query="Ericsson", num_documents=10, num_sentences=5) 
(['The restraints are few in areas such as consumer products, while in sectors such as banking, distribution and insurance, foreign firms are kept on a very tight leash.',
  'These latest wins follow a recent $350 million contract win with Telefon AB L.M.',
  'Pocket is the first from the high-priced 1996 auction known to have filed for bankruptcy protection.',
  '"That is, assuming the deal is done right," she added.',
  '"Generally speaking, the easiest place to make a profit tends to be in the consumer industry, usually fairly small-scale operations," said Anne Stevenson-Yang, director of China operations for the U.S.-China Business Council.'],
  ['ericsson', 'sweden', 'motorola', 'telecommuncation', 'communciation', 'bolstering', 'priced', 'sectors', 'makers', 'equipment', 'schaumberg', 'lm', 'done', 'manufacturing', 'consumer'])
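
The interplay between num_sentences and ratio can be sketched as below. This is an illustration under stated assumptions, not the library's code. The helper name `resolve_num_sentences` is hypothetical, and the source does not specify how "length" is measured or how the ratio is rounded: here length is counted in sentences, the product is truncated, and at least one sentence is kept.

```python
def resolve_num_sentences(total_sentences, num_sentences=None, ratio=0.05):
    """Pick the summary size: an explicit count overrides the ratio."""
    if num_sentences is not None:
        # Never ask for more sentences than are available.
        return min(num_sentences, total_sentences)
    # Assumption: truncate the proportion and keep at least one sentence.
    return max(1, int(ratio * total_sentences))

print(resolve_num_sentences(200, num_sentences=4))
print(resolve_num_sentences(10, ratio=0.5))
```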