Datasets
Covid
Module for loading the COVID-19 Kaggle challenge datasets.
- 
sisu.datasets.covid.covid_zip_iterator(file='CORD-19-research-challenge.zip', data_path='.', getters=None, language=True)
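- Parameters
- file (str or Path, optional) – Zip file of a covid archive.
- data_path (str or Path, optional) – Location of the archive.
- getters (dict, optional) – Recipes to build values for the formatted document.
- language (bool, optional) – Add a lang entry that tells the language.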
- Returns
- Formatted articles. 
- Return type
- iterator 
- Examples
    >>> for doc in covid_zip_iterator("covid_sample.zip", data_path='data', getters={'title': get_title}, language=False):
    ...     print(doc)
    {'title': 'Community frailty response service: the ED at your front door'}
    {'title': 'COVID-19-Pneumonie'}
    {'title': '"Multi-faceted" COVID-19: Russian experience'}
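The lazy, one-article-at-a-time behaviour of an iterator over a zip archive can be sketched with the standard library alone. This is a minimal illustration and not the sisu implementation: `iter_zip_json`, its `formatter` argument, and the in-memory sample archive are hypothetical, and the sketch assumes that every `.json` member of the archive holds exactly one article.

```python
import io
import json
import zipfile


def iter_zip_json(archive, formatter=lambda document: document):
    """Yield one formatted article per JSON member of a zip archive (hypothetical helper)."""
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if not name.endswith(".json"):
                continue  # skip directories and non-JSON members
            with zf.open(name) as fp:
                document = json.load(fp)
            # Only one document is decoded at a time, so memory use stays flat.
            yield formatter(document)


# Build a tiny archive in memory so the sketch runs without any data files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("docs/a.json", json.dumps({"metadata": {"title": "Predator"}}))
    zf.writestr("docs/readme.txt", "not an article")

titles = list(iter_zip_json(buffer, lambda d: {"title": d["metadata"]["title"]}))
print(titles)  # [{'title': 'Predator'}]
```

Because the function is a generator, nothing is read from the archive until the caller starts iterating.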
- 
sisu.datasets.covid.default_getters = {'abstract': <function get_abstract>, 'content': <function get_content>, 'id': <function get_id>, 'title': <function get_title>}
- Default fields to build from a covid json. 
- 
sisu.datasets.covid.format_covid_json(document, getters=None, language=True)
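- Parameters
- document (dict) – A dict representing a document.
- getters (dict, optional) – Recipes to build values for the formatted document.
- language (bool, optional) – Add a lang entry that tells the language.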
- Returns
- Formatted document 
- Return type
- dict
- Examples
- We will use a few JSON samples embedded in the package.
    >>> import json
    >>> from pathlib import Path
    >>> data_dir = Path("data/covid_sample")
    >>> articles = []
    >>> for f in data_dir.rglob('*.json'):
    ...     with open(f) as fp:
    ...         articles.append(format_covid_json(json.load(fp)))
    >>> for art in sorted(articles, key=lambda e: e['title']):
    ...     print(art)
    {'title': '"Multi-faceted" COVID-19: Russian experience', 'abstract': '', 'content': 'Editor. According to current live statistics at the time of editing this letter,...', 'id': '0000028b5cc154f68b8a269f6578f21e31f62977', 'lang': 'en'}
    {'title': 'COVID-19-Pneumonie', 'abstract': '', 'content': '. der Entwicklung einer schweren Pneumonie im Vordergrund, die in der Regel prognostisch...', 'id': '0009745a11d206af9e405e00677c51b01251dba7', 'lang': 'de'}
    {'title': 'Community frailty response service: the ED at your front door', 'abstract': 'We describe the expansion and adaptation of a frailty response team to assess older people in their usual place of residence. The team had commenced a weekend service to a limited area in February 2020. As a consequence of demand related to the COVID-19 pandemic, we expanded it and adapted...', 'content': "INTRODUCTION. A large proportion of short-stay admissions in older adults may be avoidable...", 'id': '000680e3114af4aa10e8f208cd162a61195f4465', 'lang': 'en'}
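A getters dict maps each output field to a function that takes the raw document and returns the value for that field. The sketch below shows one plausible way such a dict could be applied. `format_with_getters` is a hypothetical helper, reading the id from `paper_id` is an assumption based on the toy article, and the `lang` entry is left out because this page does not say how the language is detected.

```python
# A trimmed-down article with the same nesting as the COVID-19 JSON files.
toy_article = {
    "paper_id": "thisismyid",
    "metadata": {"title": "Predator"},
}


def format_with_getters(document, getters):
    """Return a new dict with one entry per getter (hypothetical helper)."""
    return {field: getter(document) for field, getter in getters.items()}


# Each getter is an ordinary callable, so custom fields are easy to add.
getters = {
    "id": lambda doc: doc["paper_id"],  # assumption: the id lives under 'paper_id'
    "title": lambda doc: doc["metadata"]["title"],
}

formatted = format_with_getters(toy_article, getters)
print(formatted)  # {'id': 'thisismyid', 'title': 'Predator'}
```

Passing a smaller dict, such as one holding only the `title` entry, yields documents restricted to that field, which matches the shape of the outputs in the examples on this page.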
- 
sisu.datasets.covid.get_abstract(document: dict) → str
- Get the abstract of a document with the same structure as the JSON in the COVID-19 dataset.
- Parameters
- document (dict) – A dict representing a document.
- Returns
- Abstract of the document. 
- Return type
- str
- Examples
    >>> get_abstract(toy_covid_article)
    'In the jungle, no-one hears you far cry. And vice-versa. They say to make a long abstract, with the number 42 in it, so here I am.'
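The example output is consistent with joining the `text` of each abstract paragraph with a single space. The following sketch reproduces that result on the toy article. `join_abstract` is a hypothetical name, and the actual implementation may handle missing or empty abstracts differently.

```python
# Abstract portion of the toy article documented on this page.
toy_article = {
    "abstract": [
        {"text": "In the jungle, no-one hears you far cry. And vice-versa."},
        {"text": "They say to make a long abstract, with the number 42 in it, so here I am."},
    ],
}


def join_abstract(document):
    """Concatenate abstract paragraphs with single spaces (hypothetical helper)."""
    return " ".join(paragraph["text"] for paragraph in document["abstract"])


abstract = join_abstract(toy_article)
print(abstract)
```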
- 
sisu.datasets.covid.get_content(document: dict) → str
- Get the content of a document with the same structure as the JSON in the COVID-19 dataset.
- Parameters
- document (dict) – A dict representing a document.
- Returns
- Content of the document. 
- Return type
- str
- Examples
    >>> get_content(toy_covid_article)
    "introduction. There is no-one in the trees. Is there? conclusion. Predators don't like to lose."
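On the toy article, the content string looks like each body paragraph prefixed by its section name and a period, with the pieces joined by spaces. The sketch below reproduces that output. `join_body` is a hypothetical name, and it does not attempt to merge consecutive paragraphs that share a section, which the real function may well do.

```python
# Body portion of the toy article documented on this page.
toy_article = {
    "body_text": [
        {"section": "introduction", "text": "There is no-one in the trees. Is there?"},
        {"section": "conclusion", "text": "Predators don't like to lose."},
    ],
}


def join_body(document):
    """Prefix each paragraph with its section name, then join with spaces (hypothetical helper)."""
    return " ".join(
        f"{part['section']}. {part['text']}" for part in document["body_text"]
    )


content = join_body(toy_article)
print(content)
```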
- 
sisu.datasets.covid.get_title(document: dict) → str
- Get the title of a document with the same structure as the JSON in the COVID-19 dataset.
- Parameters
- document (dict) – A dict representing a document.
- Returns
- The title of the document. 
- Return type
- str
- Examples
    >>> get_title(toy_covid_article)
    'Predator'
- 
sisu.datasets.covid.load_from_zip(file='CORD-19-research-challenge.zip', data_path='.', getters=None, language=True, max_docs=None)
- Parameters
- file (str or Path, optional) – Zip file of a covid archive.
- data_path (str or Path, optional) – Location of the archive.
- getters (dict, optional) – Recipes to build values for the formatted document.
- language (bool, optional) – Add a lang entry that tells the language.
- max_docs (int, optional) – Maximum number of documents to return.
- Returns
- Return type
- list of dict 
- Examples
    >>> load_from_zip("covid_sample.zip", data_path='data', getters={'title': get_title}, language=False)
    [{'title': 'Community frailty response service: the ED at your front door'}, {'title': 'COVID-19-Pneumonie'}, {'title': '"Multi-faceted" COVID-19: Russian experience'}]
    >>> load_from_zip("covid_sample.zip", data_path='data', getters={'id': get_id}, language=False, max_docs=1)
    [{'id': '000680e3114af4aa10e8f208cd162a61195f4465'}]
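The `max_docs` behaviour shown above, where `None` means no limit, is what `itertools.islice` gives when it is applied to a document iterator. The sketch below illustrates the idea under that assumption. `load_documents` and `fake_articles` are hypothetical stand-ins, and nothing on this page confirms that `load_from_zip` is built this way.

```python
from itertools import islice


def load_documents(iterator, max_docs=None):
    """Materialise at most max_docs items from an iterator (hypothetical helper)."""
    # islice(iterator, None) consumes everything, so None means "no limit".
    return list(islice(iterator, max_docs))


def fake_articles():
    """Stand-in for a lazy article iterator."""
    for i in range(5):
        yield {"id": f"doc-{i}"}


first = load_documents(fake_articles(), max_docs=1)
everything = load_documents(fake_articles())
print(first)            # [{'id': 'doc-0'}]
print(len(everything))  # 5
```

Truncating the iterator rather than slicing a finished list means that documents beyond `max_docs` are never read or parsed.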
- 
sisu.datasets.covid.toy_covid_article = {'abstract': [{'text': 'In the jungle, no-one hears you far cry. And vice-versa.'}, {'text': 'They say to make a long abstract, with the number 42 in it, so here I am.'}], 'body_text': [{'section': 'introduction', 'text': 'There is no-one in the trees. Is there?'}, {'section': 'conclusion', 'text': "Predators don't like to lose."}], 'metadata': {'title': 'Predator'}, 'paper_id': 'thisismyid'}
- A toy document with the same structure as the json in the COVID-19 dataset.