Datasets

Covid

Module for loading the COVID-19 Kaggle challenge datasets.

sisu.datasets.covid.covid_zip_iterator(file='CORD-19-research-challenge.zip', data_path='.', getters=None, language=True)

Parameters

file (str or Path, optional) – Zip file of a covid archive.
data_path (str or Path, optional) – Location of the archive.
getters (dict, optional) – Recipes to build values for the formatted document: each key names an output field, each value is a function of the raw document.
language (bool, optional) – If True, add a lang entry giving the language of the document.

Returns

Formatted articles.

Return type

iterator

Examples

>>> for doc in covid_zip_iterator("covid_sample.zip", data_path='data', getters={'title': get_title}, language=False):
...    print(doc)
{'title': 'Community frailty response service: the ED at your front door'}
{'title': 'COVID-19-Pneumonie'}
{'title': '"Multi-faceted" COVID-19: Russian experience'}
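
The lazy, one-document-at-a-time behaviour can be pictured with the standard library alone. The sketch below is not the sisu implementation: it builds a tiny in-memory zip and yields one formatted dict per JSON member. The names `iter_json_members` and `title_of` are hypothetical, and treating every `.json` member as an article is an assumption.

```python
import io
import json
import zipfile


def title_of(document):
    # Hypothetical getter: reads the title from CORD-19 style metadata.
    return document["metadata"]["title"]


def iter_json_members(archive, getters):
    """Lazily yield one formatted dict per .json member of a zip archive."""
    with zipfile.ZipFile(archive) as zf:
        for name in sorted(zf.namelist()):
            if not name.endswith(".json"):
                continue  # skip anything that is not a JSON file
            with zf.open(name) as fp:
                document = json.load(fp)
            # Assumption: each getter maps the raw document to one output field.
            yield {key: getter(document) for key, getter in getters.items()}


# Build a two-article archive in memory so the sketch needs no data file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("sample/a.json", json.dumps({"metadata": {"title": "First"}}))
    zf.writestr("sample/b.json", json.dumps({"metadata": {"title": "Second"}}))
    zf.writestr("sample/readme.txt", "not an article")
buffer.seek(0)

titles = list(iter_json_members(buffer, {"title": title_of}))
print(titles)
```

Because the function is a generator, nothing is read from the archive until the caller starts iterating, which keeps memory use flat on large archives.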

sisu.datasets.covid.default_getters = {'abstract': <function get_abstract>, 'content': <function get_content>, 'id': <function get_id>, 'title': <function get_title>}

Default fields to build from a covid JSON.

sisu.datasets.covid.format_covid_json(document, getters=None, language=True)

Parameters

document (dict) – A covid document in its original format.
getters (dict, optional) – Recipes to build values for the formatted document: each key names an output field, each value is a function of the raw document.
language (bool, optional) – If True, add a lang entry giving the language of the document.

Returns

Formatted document

Return type

dict

Examples

We will use a few JSON samples embedded in the package.

>>> import json
>>> from pathlib import Path
>>> data_dir = Path("data/covid_sample")
>>> articles = []
>>> for f in data_dir.rglob('*.json'):
...     with open(f) as fp:
...         articles.append(format_covid_json(json.load(fp)))
>>> for art in sorted(articles, key=lambda e: e['title']): 
...     print(art)
{'title': '"Multi-faceted" COVID-19: Russian experience',
 'abstract': '',
 'content': 'Editor. According to current live statistics at the time of editing this letter,...',
 'id': '0000028b5cc154f68b8a269f6578f21e31f62977',
 'lang': 'en'}
{'title': 'COVID-19-Pneumonie',
 'abstract': '',
 'content': '. der Entwicklung einer schweren Pneumonie im Vordergrund, die in der Regel prognostisch...',
 'id': '0009745a11d206af9e405e00677c51b01251dba7',
 'lang': 'de'}
{'title': 'Community frailty response service: the ED at your front door',
 'abstract': 'We describe the expansion and adaptation of a frailty response team to assess older people in
their usual place of residence. The team had commenced a weekend service to a limited area in February 2020.
As a consequence of demand related to the COVID-19 pandemic, we expanded it and adapted...',
 'content': "INTRODUCTION. A large proportion of short-stay admissions in older adults may be avoidable...",
 'id': '000680e3114af4aa10e8f208cd162a61195f4465',
 'lang': 'en'}
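
One way to read the getters argument is as a mapping from output field to a function of the raw document. The sketch below illustrates that reading with a custom getter; `format_document` and `get_sections` are hypothetical names, and the real function may do more (for instance, the lang entry is left out here).

```python
def format_document(document, getters):
    """Apply each getter to the raw document and collect the results."""
    return {field: getter(document) for field, getter in getters.items()}


def get_sections(document):
    # Hypothetical custom getter: ordered list of section names.
    return [paragraph["section"] for paragraph in document["body_text"]]


# Same structure as toy_covid_article, trimmed to the fields used here.
article = {
    "body_text": [
        {"section": "introduction", "text": "There is no-one in the trees. Is there?"},
        {"section": "conclusion", "text": "Predators don't like to lose."},
    ],
    "metadata": {"title": "Predator"},
    "paper_id": "thisismyid",
}

getters = {
    "id": lambda doc: doc["paper_id"],
    "sections": get_sections,
}
formatted = format_document(article, getters)
print(formatted)
```

Only the keys present in the mapping appear in the result, which matches the examples where passing a single getter yields single-field documents.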

sisu.datasets.covid.get_abstract(document: dict) → str

Get the abstract of a document with the same structure as the JSON in the COVID-19 dataset.

Parameters: document (dict) – A dict representing a document.
Returns: Abstract of the document.
Return type: str

Examples

>>> get_abstract(toy_covid_article) 
'In the jungle, no-one hears you far cry. And vice-versa.
They say to make a long abstract, with the number 42 in it, so here I am.'

sisu.datasets.covid.get_content(document: dict) → str

Get the content of a document with the same structure as the JSON in the COVID-19 dataset.

Parameters: document (dict) – A dict representing a document.
Returns: Content of the document.
Return type: str
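
The toy example for this function shows each paragraph prefixed by its section name, with paragraphs separated by a space. A minimal sketch that reproduces that layout follows; `content_of` is a hypothetical stand-in, not the library's code, and real articles may need extra handling (missing sections, empty bodies).

```python
def content_of(document):
    # Prefix each paragraph with its section name, then join with spaces.
    return " ".join(
        f"{paragraph['section']}. {paragraph['text']}"
        for paragraph in document["body_text"]
    )


# Body of the toy article documented in this module.
article = {
    "body_text": [
        {"section": "introduction", "text": "There is no-one in the trees. Is there?"},
        {"section": "conclusion", "text": "Predators don't like to lose."},
    ],
}

content = content_of(article)
print(content)
```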

Examples

>>> get_content(toy_covid_article)
"introduction. There is no-one in the trees. Is there? conclusion. Predators don't like to lose."

sisu.datasets.covid.get_id(document)

Parameters: document (dict) – A covid article.
Returns: Unique ID of the article.
Return type: str

Examples

>>> get_id(toy_covid_article)
'thisismyid'

sisu.datasets.covid.get_title(document: dict) → str

Get the title of a document with the same structure as the JSON in the COVID-19 dataset.

Parameters: document (dict) – A dict representing a document.
Returns: The title of the document.
Return type: str

Examples

>>> get_title(toy_covid_article)
'Predator'
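
In the toy article, the title lives under `metadata` and the identifier under `paper_id`. The sketch below shows both lookups with hypothetical helpers `title_of` and `id_of`; falling back to an empty title when metadata is missing is a choice made for this sketch, not documented behaviour.

```python
def title_of(document):
    # Tolerate a missing 'metadata' or 'title' key by returning ''.
    return document.get("metadata", {}).get("title", "")


def id_of(document):
    # The identifier is required, so a missing key raises KeyError.
    return document["paper_id"]


article = {"metadata": {"title": "Predator"}, "paper_id": "thisismyid"}

title = title_of(article)
identifier = id_of(article)
untitled = title_of({"paper_id": "no-metadata"})
print(title, identifier, repr(untitled))
```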

sisu.datasets.covid.load_from_zip(file='CORD-19-research-challenge.zip', data_path='.', getters=None, language=True, max_docs=None)

Parameters

file (str or Path, optional) – Zip file of a covid archive.
data_path (str or Path, optional) – Location of the archive.
getters (dict, optional) – Recipes to build values for the formatted document: each key names an output field, each value is a function of the raw document.
language (bool, optional) – If True, add a lang entry giving the language of the document.
max_docs (int, optional) – Maximum number of documents to return.

Returns

Formatted articles.

Return type

list of dict

Examples

>>> load_from_zip("covid_sample.zip", data_path='data', getters={'title': get_title}, language=False) 
[{'title': 'Community frailty response service: the ED at your front door'},
 {'title': 'COVID-19-Pneumonie'},
 {'title': '"Multi-faceted" COVID-19: Russian experience'}]
>>> load_from_zip("covid_sample.zip", data_path='data', getters={'id': get_id}, language=False, max_docs=1) 
[{'id': '000680e3114af4aa10e8f208cd162a61195f4465'}]
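
Capping the number of documents can be expressed as truncating an iterator before turning it into a list. The sketch below shows that pattern with `itertools.islice`; `article_stream` and `take` are hypothetical stand-ins, and whether load_from_zip is built this way is an assumption.

```python
from itertools import islice

produced = []  # records how many documents the stream actually built


def article_stream():
    # Hypothetical stand-in for an iterator of formatted articles.
    for index in range(1000):
        produced.append(index)
        yield {"id": f"doc-{index}"}


def take(iterator, max_docs=None):
    """Materialize at most max_docs items; None means no limit."""
    return list(islice(iterator, max_docs))


first_two = take(article_stream(), max_docs=2)
print(first_two)
print(len(produced))
```

With this pattern the stream stops as soon as the cap is reached, so a small max_docs avoids parsing the rest of a large archive.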

sisu.datasets.covid.toy_covid_article = {'abstract': [{'text': 'In the jungle, no-one hears you far cry. And vice-versa.'}, {'text': 'They say to make a long abstract, with the number 42 in it, so here I am.'}], 'body_text': [{'section': 'introduction', 'text': 'There is no-one in the trees. Is there?'}, {'section': 'conclusion', 'text': "Predators don't like to lose."}], 'metadata': {'title': 'Predator'}, 'paper_id': 'thisismyid'}

A toy document with the same structure as the JSON in the COVID-19 dataset.