Datasets
Covid
Module for loading the COVID-19 Kaggle challenge datasets.
-
sisu.datasets.covid.covid_zip_iterator(file='CORD-19-research-challenge.zip', data_path='.', getters=None, language=True)
- Parameters
file (str or Path, optional) – Zip file of a covid archive.
data_path (str or Path, optional) – Location of the archive.
getters (dict, optional) – Recipes to build values for the formatted document.
language (bool, optional) – Add a lang entry that tells the language.
- Returns
Formatted articles.
- Return type
iterator
Examples
>>> for doc in covid_zip_iterator("covid_sample.zip", data_path='data', getters={'title': get_title}, language=False):
...     print(doc)
{'title': 'Community frailty response service: the ED at your front door'}
{'title': 'COVID-19-Pneumonie'}
{'title': '"Multi-faceted" COVID-19: Russian experience'}
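The traversal itself is not documented here, so the following is only a minimal sketch of how an iterator over a zip archive of JSON documents could work, using the standard zipfile and json modules. The helper name iter_zip_json and the in-memory archive are assumptions made for illustration; they are not part of sisu.

```python
import io
import json
import zipfile


def iter_zip_json(zip_source, getters):
    """Yield one formatted dict per JSON member of a zip archive (hypothetical helper)."""
    with zipfile.ZipFile(zip_source) as archive:
        for name in sorted(archive.namelist()):
            if not name.endswith(".json"):
                continue  # skip directories and non-JSON members
            with archive.open(name) as member:
                document = json.load(member)
            # Build each output field by calling its getter on the raw document.
            yield {field: getter(document) for field, getter in getters.items()}


# Build a tiny in-memory archive so the sketch runs without any data file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.json", json.dumps({"metadata": {"title": "First"}}))
    archive.writestr("b.json", json.dumps({"metadata": {"title": "Second"}}))
    archive.writestr("readme.txt", "not a document")
buffer.seek(0)

titles = list(iter_zip_json(buffer, {"title": lambda doc: doc["metadata"]["title"]}))
print(titles)
```

Because the function is a generator, documents are read and formatted one at a time, which is presumably why an iterator is offered alongside a list-returning loader.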
-
sisu.datasets.covid.default_getters = {'abstract': <function get_abstract>, 'content': <function get_content>, 'id': <function get_id>, 'title': <function get_title>}
Default fields to build from a covid json.
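A getters dict maps an output field name to a function that takes the raw document. The sketch below illustrates that pattern with a custom getter. Both format_document and get_section_names are hypothetical stand-ins written for this example, and the way sisu applies getters internally is assumed rather than confirmed.

```python
def format_document(document, getters):
    """Apply every getter to the raw document (hypothetical stand-in)."""
    return {field: getter(document) for field, getter in getters.items()}


def get_section_names(document):
    """Hypothetical custom getter: list the section names of the body."""
    return [paragraph["section"] for paragraph in document["body_text"]]


# A raw document with the same layout as toy_covid_article.
raw = {
    "paper_id": "thisismyid",
    "metadata": {"title": "Predator"},
    "body_text": [
        {"section": "introduction", "text": "There is no-one in the trees. Is there?"},
        {"section": "conclusion", "text": "Predators don't like to lose."},
    ],
}

custom_getters = {
    "title": lambda doc: doc["metadata"]["title"],
    "sections": get_section_names,
}
formatted = format_document(raw, custom_getters)
print(formatted)
```

Passing a smaller or different dict than the defaults is how the examples on this page restrict the output to a single field such as 'title' or 'id'.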
-
sisu.datasets.covid.format_covid_json(document, getters=None, language=True)
- Parameters
document (dict) – A dict representing a document.
getters (dict, optional) – Recipes to build values for the formatted document.
language (bool, optional) – Add a lang entry that tells the language.
- Returns
Formatted document
- Return type
dict
Examples
We will use a few json samples embedded in the package.
>>> import json
>>> from pathlib import Path
>>> data_dir = Path("data/covid_sample")
>>> articles = []
>>> for f in data_dir.rglob('*.json'):
...     with open(f) as fp:
...         articles.append(format_covid_json(json.load(fp)))
>>> for art in sorted(articles, key=lambda e: e['title']):
...     print(art)
{'title': '"Multi-faceted" COVID-19: Russian experience', 'abstract': '', 'content': 'Editor. According to current live statistics at the time of editing this letter,...', 'id': '0000028b5cc154f68b8a269f6578f21e31f62977', 'lang': 'en'}
{'title': 'COVID-19-Pneumonie', 'abstract': '', 'content': '. der Entwicklung einer schweren Pneumonie im Vordergrund, die in der Regel prognostisch...', 'id': '0009745a11d206af9e405e00677c51b01251dba7', 'lang': 'de'}
{'title': 'Community frailty response service: the ED at your front door', 'abstract': 'We describe the expansion and adaptation of a frailty response team to assess older people in their usual place of residence. The team had commenced a weekend service to a limited area in February 2020. As a consequence of demand related to the COVID-19 pandemic, we expanded it and adapted...', 'content': "INTRODUCTION. A large proportion of short-stay admissions in older adults may be avoidable...", 'id': '000680e3114af4aa10e8f208cd162a61195f4465', 'lang': 'en'}
-
sisu.datasets.covid.get_abstract(document: dict) → str
Get the abstract of a document with the same structure as the json in the COVID-19 dataset.
- Parameters
document (dict) – A dict representing a document.
- Returns
Abstract of the document.
- Return type
str
Examples
>>> get_abstract(toy_covid_article)
'In the jungle, no-one hears you far cry. And vice-versa. They say to make a long abstract, with the number 42 in it, so here I am.'
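The example output is consistent with joining the 'text' of each abstract paragraph with a single space. The sketch below reproduces that behaviour under this assumption. The name join_abstract is hypothetical and the actual implementation of get_abstract may differ.

```python
def join_abstract(document):
    """Join abstract paragraphs with a space (behaviour inferred from the example, not confirmed)."""
    return " ".join(paragraph["text"] for paragraph in document.get("abstract", []))


# Same abstract layout as toy_covid_article.
toy = {
    "abstract": [
        {"text": "In the jungle, no-one hears you far cry. And vice-versa."},
        {"text": "They say to make a long abstract, with the number 42 in it, so here I am."},
    ]
}

abstract = join_abstract(toy)
empty = join_abstract({})  # a document without an abstract gives an empty string
print(abstract)
```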
-
sisu.datasets.covid.get_content(document: dict) → str
Get the content of a document with the same structure as the json in the COVID-19 dataset.
- Parameters
document (dict) – A dict representing a document.
- Returns
Content of the document.
- Return type
str
Examples
>>> get_content(toy_covid_article)
"introduction. There is no-one in the trees. Is there? conclusion. Predators don't like to lose."
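Judging from the example output, each body paragraph appears to be rendered as its section name, a period, and its text, with paragraphs separated by a space. The sketch below implements that reading. The name join_content is hypothetical and the real get_content may handle edge cases differently.

```python
def join_content(document):
    """Render each paragraph as '<section>. <text>' (behaviour inferred from the example, not confirmed)."""
    return " ".join(
        f"{paragraph['section']}. {paragraph['text']}"
        for paragraph in document.get("body_text", [])
    )


# Same body layout as toy_covid_article.
toy = {
    "body_text": [
        {"section": "introduction", "text": "There is no-one in the trees. Is there?"},
        {"section": "conclusion", "text": "Predators don't like to lose."},
    ]
}

content = join_content(toy)
print(content)
```

Under this reading, a paragraph with an empty section name would start with '. ', which matches the shape of the German sample shown in the format_covid_json example.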
-
sisu.datasets.covid.get_title(document: dict) → str
Get the title of a document with the same structure as the json in the COVID-19 dataset.
- Parameters
document (dict) – A dict representing a document.
- Returns
The title of the document.
- Return type
str
Examples
>>> get_title(toy_covid_article)
'Predator'
-
sisu.datasets.covid.load_from_zip(file='CORD-19-research-challenge.zip', data_path='.', getters=None, language=True, max_docs=None)
- Parameters
file (str or Path, optional) – Zip file of a covid archive.
data_path (str or Path, optional) – Location of the archive.
getters (dict, optional) – Recipes to build values for the formatted document.
language (bool, optional) – Add a lang entry that tells the language.
max_docs (int, optional) – Maximum number of documents to return.
- Returns
- Return type
list of dict
Examples
>>> load_from_zip("covid_sample.zip", data_path='data', getters={'title': get_title}, language=False)
[{'title': 'Community frailty response service: the ED at your front door'}, {'title': 'COVID-19-Pneumonie'}, {'title': '"Multi-faceted" COVID-19: Russian experience'}]
>>> load_from_zip("covid_sample.zip", data_path='data', getters={'id': get_id}, language=False, max_docs=1)
[{'id': '000680e3114af4aa10e8f208cd162a61195f4465'}]
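One plausible way to cap the number of returned documents is to slice a document iterator before turning it into a list. The sketch below shows this with itertools.islice. The helper take and the generator fake_documents are hypothetical, and whether load_from_zip is actually built on covid_zip_iterator in this way is an assumption.

```python
from itertools import islice


def take(documents, max_docs=None):
    """Materialise at most max_docs items from an iterator (hypothetical helper)."""
    # islice with a stop of None consumes the whole iterator.
    return list(islice(documents, max_docs))


def fake_documents():
    """Stand-in for a document iterator; yields five small dicts."""
    for i in range(5):
        yield {"id": f"doc-{i}"}


first = take(fake_documents(), max_docs=1)
everything = take(fake_documents())
print(first, len(everything))
```

Slicing the iterator rather than the finished list means documents beyond the limit are never read or formatted.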
-
sisu.datasets.covid.toy_covid_article = {'abstract': [{'text': 'In the jungle, no-one hears you far cry. And vice-versa.'}, {'text': 'They say to make a long abstract, with the number 42 in it, so here I am.'}], 'body_text': [{'section': 'introduction', 'text': 'There is no-one in the trees. Is there?'}, {'section': 'conclusion', 'text': "Predators don't like to lose."}], 'metadata': {'title': 'Predator'}, 'paper_id': 'thisismyid'}
A toy document with the same structure as the json in the COVID-19 dataset.