The Carolina Health Informatics Program (CHIP) has developed a few online training modules, titled "An Introduction to Data Science through a health care lens," to expose learners to the field of data science. These online modules are open to anyone who is interested and require no prior training or knowledge in data science. If you complete the entire set of modules (the entire “short course”) and pass a simple final assessment, you will receive a certificate of completion.

Introduction to Data Science Curriculum

Text Mining

Data Mining

Text Preprocessing is an important step for natural language processing (NLP). It transforms text into a more digestible form so that machine learning algorithms can perform better. This module will teach various text preprocessing techniques. Begin Module
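To make the idea of text preprocessing concrete, here is a minimal sketch in Python. It is not taken from the module itself. The language, the tiny stop-word list, the `preprocess` helper, and the sample sentence are all assumptions made for illustration.

```python
import string

# Tiny illustrative stop-word list (an assumption); real projects use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "was", "with", "is", "in"}


def preprocess(text):
    """Lowercase the text, strip punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


# Made-up sentence, not real patient data.
note = "The patient was admitted with chest pain, and shortness of breath."
print(preprocess(note))
```

Each step (lowercasing, punctuation removal, tokenization, stop-word removal) is one common preprocessing choice; which steps are appropriate depends on the downstream task.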
Exploratory analysis is an initial approach to analyzing data sets. It commonly involves summarizing the main characteristics of a data set, often with data visualizations. This module will teach you how to perform exploratory analysis for text data. Note: If you encounter an error in the optional section, please copy and paste the code below into the code cell with the error. Begin Module
Text data is often rich with both information and meaning. However, text data is also often complex, which can make analysis difficult. This module will introduce you to part-of-speech tagging, named entity recognition, and relation extraction. These techniques will allow you to both understand the structure of your textual data and derive meaning from it. Begin Module
Feature representation is a way to present your data so a machine or computer can understand it and perform an analysis. This module will investigate feature representation for text data. You will also explore generating different types of feature representations and comparing how well they perform. Begin Module
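One simple feature representation for text is a bag-of-words count vector. The sketch below is a hedged illustration in Python, not the module's own material. The two sample documents and the `bag_of_words` helper are hypothetical.

```python
from collections import Counter

# Made-up example documents, already lowercased and free of punctuation.
docs = [
    "chest pain and shortness of breath",
    "no chest pain reported",
]

# Vocabulary: the sorted set of unique tokens across all documents.
vocabulary = sorted({tok for doc in docs for tok in doc.split()})


def bag_of_words(doc, vocab):
    """Return one count per vocabulary term, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[tok] for tok in vocab]


vectors = [bag_of_words(doc, vocabulary) for doc in docs]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Every document becomes a vector of the same length, which is what lets a machine learning algorithm compare documents numerically. Other representations, such as TF-IDF weights, reweight these raw counts.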
One of the most powerful uses of data is making predictions about the future. In this module, we will explore how to use text data to make predictions. Specifically, you will learn about two common machine learning algorithms, logistic regression and k-nearest neighbors. Begin Module
Preparing data is an important step in any data mining project. In this module you will learn how to upload a CSV file and how to deal with missing or improbable data. Begin Module
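As a rough illustration of handling missing or improbable values, the Python sketch below reads a small CSV and flags unusable entries. The language, the column names, the plausible ranges, and the `parse_number` helper are all assumptions for this example; the module itself may use different tools and rules.

```python
import csv
import io

# Hypothetical CSV content standing in for an uploaded file (not real patient data).
raw_csv = """patient_id,age,heart_rate
1,54,72
2,,88
3,230,65
4,41,
"""


def parse_number(value, low, high):
    """Return the value as a float, or None if it is blank, non-numeric,
    or outside the assumed plausible range [low, high]."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return number if low <= number <= high else None


rows = []
for record in csv.DictReader(io.StringIO(raw_csv)):
    rows.append({
        "patient_id": record["patient_id"],
        "age": parse_number(record["age"], 0, 120),               # assumed range
        "heart_rate": parse_number(record["heart_rate"], 20, 250),  # assumed range
    })

# One simple strategy: keep only rows where every field is usable.
complete_rows = [r for r in rows if None not in r.values()]
print(complete_rows)
```

Dropping incomplete rows is only one option. Depending on the analysis, imputing a value or keeping the row with the field marked as missing may be more appropriate.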
Univariate analysis allows you to deeply analyze a single variable. This module will teach you the skills to perform univariate analysis including variable types, summary statistics, and univariate data visualization. Along the way, you’ll learn by analyzing specific variables from real patient data! Begin Module
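For a sense of what univariate analysis involves, here is a small Python sketch that computes summary statistics and a text histogram for a single numeric variable. The values are made up and the choice of Python is an assumption; the module itself works with real patient data and may use other tools.

```python
import statistics
from collections import Counter

# Made-up ages, not real patient data.
ages = [34, 45, 45, 52, 61, 67, 70, 88]

summary = {
    "n": len(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    "stdev": statistics.stdev(ages),  # sample standard deviation
    "min": min(ages),
    "max": max(ages),
}
for name, value in summary.items():
    print(f"{name}: {value}")

# A crude text histogram: count values per decade-wide bin.
bins = Counter((age // 10) * 10 for age in ages)
for start in sorted(bins):
    print(f"{start}-{start + 9}: {'#' * bins[start]}")
```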
Bivariate analysis is a statistical method that helps us see how our variables relate to one another. In this module, you’ll learn different bivariate analysis techniques and how to apply those techniques in R. Begin Module
Feature selection is the process of selecting a subset of variables for the purpose of building a machine learning model. Reducing the number of features can improve model performance, make models easier to understand, and reduce the time required to run a model. In this module, you will learn filter, wrapper, and embedded feature selection methods. Begin Module
Predictive analysis is a powerful tool which allows us to make future predictions from data. This module will pull together the previous four data mining modules to teach advanced techniques such as machine learning, logistic regression, and decision trees. Along the way, you’ll learn by predicting mortality from real ICU patient data!  Begin Module