The files below are the data files used throughout the Jupyter notebooks for “Methods in Medical Informatics”. All data files are organized by chapter.

  • sample.txt – this file contains the article “A machine learning algorithm to increase COVID-19 inpatient diagnostic capacity” represented in XML (Source)
  • Test_directory – a directory containing two separate text files
    • test1.txt – text file containing the abstract from the article “COVID-19: what has been learned and to be learned about the novel coronavirus disease”
    • test2.txt – this file is a copy of “test1.txt”
  • mim2gene.txt – this is a text file that details the links between the genes in OMIM and other gene identifiers
  • d2020.bin – a binary file that lists the current Medical Subject Headings (MeSH) as of 2020
  • sample.bin – a binary file that contains a single example string
  • us.gif – an image file which contains an image of the United States
  • neo1.jpg – a JPEG file displaying a diagram of different neoplasm subtypes
  • loc_states.txt – a text file which contains the longitude and latitude for the geographic centers of all 50 states
  • d2020.bin – a binary file which contains tens of thousands of MeSH terms
  • stop.txt – a text file with a list of stopwords
  • titles.txt – a text file that contains a list of 100 titles of journal articles
  • cancer_gene_titles.txt – a text file which contains a list of cancer-related journal article titles
  • text.txt – a text file that contains a sample of a journal article
  • paradise.txt – the epic poem “Paradise Lost” in text format
  • treasure.txt – the novel “Treasure Island” in text format
  • d2020.bin – a binary file which contains tens of thousands of MeSH terms
  • each10.txt – this is a text file which contains an electronic version of the ICD
  • icdo3.txt – this is a text file which contains an electronic version of the ICD-O
  • cancer_gene_titles.txt – a text file which contains a list of cancer-related journal article titles
  • neocl.xml – the Neoplasm Classification nomenclature formatted using XML
  • neocl.lst – this file contains a list of candidate neoplasm classification terms
  • neocl.xml – the Neoplasm Classification nomenclature formatted using XML
  • doublets.txt – this is a text file containing numerous medical term doublets