Thursday, 20 September 2012

Data Preparation and Cleaning for Analytics

Books on data mining (my own included) usually focus on the statistical and machine learning algorithms used to make predictions, find associations, and so on. Real-world data miners, however, spend most of their time preparing and cleaning the data. This potentially overwhelming task is easier if you can learn from the experience of hundreds of other data miners and break the task down into a standard set of steps and procedures. Learn how in Dr. Robert Nisbet's course "Data Preparation and Cleaning for Analytics."

"Data Preparation and Cleaning for Analytics" covers joining and merging tables, recoding data, detecting outliers, dealing with missing data, deriving new variables, and more. The course culminates in a data mining project in which you will bring data through the cleaning and preparation stages to the point where you implement a data mining model.

Who Should Take This Course:
Anyone involved in the specification or preparation of data mining or predictive modeling applications.

Course Program:
Lesson 1 - Introduction
  • Introduction to the course elements
  • Introduction to the major elements of a data mining project
  • The iterative nature of data mining
  • Perform several common data description analyses
  • Submit a data description report

Lesson 2 - Data Integration, Cleaning, Standardization
  • Metadata analysis of multiple data sources
  • Merge multiple tables/files with the same structure
  • Join multiple tables/files with different structures
  • Data lookup operation
  • Assemble the Customer Analytic Record (CAR)
  • "Dirty data" analysis and deletion
  • Data recoding
  • Outlier analysis and deletion
  • Missing data imputation by multiple regression or decision tree
  • Data standardization and normalization
  • Reverse Pivoting
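
To give a flavor of two Lesson 2 topics, outlier analysis and data standardization, here is a minimal Python sketch. It is not taken from the course materials: the interquartile-range (IQR) rule, the function names, and the `incomes` data are all assumptions chosen for illustration, and other outlier criteria may suit your data better.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def standardize(values):
    """Rescale values to mean 0 and sample standard deviation 1 (z-scores)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical data: annual incomes in thousands, with one suspicious entry.
incomes = [32, 35, 36, 38, 40, 41, 43, 45, 47, 250]

outliers = iqr_outliers(incomes)                      # flags 250
cleaned = [v for v in incomes if v not in outliers]   # drop flagged values
z_scores = standardize(cleaned)                       # standardize the rest
```

Whether a flagged value should be deleted, capped, or kept is a judgment call that depends on the data and the modeling goal; the sketch simply drops it.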

Lesson 3 - Operations on Variables
  • Assign variable weights
  • Balance data sets with rare target values
  • Create data abstractions for categorical variables
  • Create temporal abstraction (lag) variables
  • Perform a data de-duplication operation
  • Perform a data filtering operation
  • Perform a simple random sampling operation
  • Perform a stratified random sampling operation
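
As an illustration of the stratified random sampling item above, the following Python sketch draws the same fraction from each stratum so that a rare target value keeps its share of the sample. The function name, the `churned` field, and the record layout are hypothetical and not part of the course.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction of records from each stratum defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)
    sample = []
    for group in strata.values():
        # Take at least one record per stratum so rare values are not lost.
        n = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, n))
    return sample

# Hypothetical data: 100 customers, 10% of whom churned (the rare target value).
records = [{"id": i, "churned": i % 10 == 0} for i in range(100)]
sample = stratified_sample(records, "churned", 0.2)
```

A simple random sample of the same size could, by chance, contain very few churned customers; stratifying by the target avoids that.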

Lesson 4 - Operations on Variables, cont.
  • Perform a data binning operation for continuous variables
  • Understand how to use data bins
  • Create "dummy" variables for categorical variables
  • Derive new continuous variables for data mining
  • Derive new categorical variables for data mining
  • Perform feature selection using simple correlation coefficients
  • Perform feature selection using various advanced methods
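
Two of the Lesson 4 operations, binning a continuous variable and creating "dummy" variables for a categorical one, can be sketched in a few lines of Python. This is only one possible approach: equal-width bins are an assumption (equal-frequency bins are another common choice), and the function names and sample data are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1 using equal-width bins.

    Assumes the values are not all identical (otherwise the bin width is zero).
    """
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - low) / width), n_bins - 1) for v in values]

def dummy_variables(values):
    """Turn a categorical variable into one 0/1 indicator per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

# Hypothetical data
ages = [18, 25, 31, 47, 52, 68]
regions = ["north", "south", "north", "east"]

age_bins = equal_width_bins(ages, 5)
region_dummies = dummy_variables(regions)
```

Some modeling methods, such as linear regression with an intercept, typically need one dummy column dropped to avoid redundancy; the sketch keeps all of them for clarity.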

Dr. Robert Nisbet has over 35 years’ experience in analytics and modeling as a college professor, researcher, and data miner in telecommunications, retail, membership clubs (AAA), insurance, and banking. He is the lead author of the "Handbook of Statistical Analysis and Data Mining Applications." He is also skilled in the use of Extract-Transform-Load (ETL) tools for building dependent data marts designed for management reporting and data mining.

You will be able to ask questions and exchange comments with Dr. Robert Nisbet via a private discussion board throughout the course. The course takes place online in a series of 4 weekly lessons and assignments, and requires about 15 hours/week. Participate at your own convenience; there are no set times when you must be online. You have the flexibility to work a bit every day, if that is your preference, or to concentrate your work in just a couple of days.

For Indian participants, registration for the courses is accepted at special prices in Indian Rupees through the partner organization, the Center for eLearning and Training (C-eLT), Pune.

For India registration and pricing, call: 020 66009116

