Monday, 4 February 2013

Wrangling and Munging Data with SQL and R

"I didn't have time to write a short letter, so here's a long one" (Mark Twain). Such is the case with data in organizations - its natural state is sprawling, complex, often messy and overweight. Statistical analysis (including much data mining) needs trim, well-organized data.

Most big data are stored in relational database management systems in a set of related tables. Statistical analysis needs that data in a single table. Learn how to use SQL (structured query language) programming language to pull data from various tables in a database and assemble it in a format amenable to statistical analysis in R.

The purpose of this course is to teach you how to extract data from a relational database using SQL, and merge it into a single file in R, so that you can perform statistical operations. Join Kathryn Vasilaky in her online course "Wrangling and Munging with SQL and R". For more details please visit at

Who Should Take This Course:
This course is meant for beginners in data science who need to learn how to extract data from their relational databases.  It can also be a good kickstarter course for individuals who have fluency in SPSS, SAS, or STATA. Such data scientists might not have had to work with relational databases directly before, or they may not be familiar with the syntax and logic of a more object oriented language like R.

Course Program:

Course outline: The course is structured as follows
Session 1: Thinking Ahead, How to Organize your Data?
  • Discussion and visuals of data models how data are stored in Postgres(SQL), i.e. entity relationship schemes, vs. R, flat files or data frames.
    • Unique identifiers, and primary vs. foreign keys
  • Defining your statistical objectives. Writing pseudo code to manipulate your data into its necessary form.
  • Questions we’ll ask: Will I need all the data (panel), or should I collapse it on a particular variable (cross section)?
  • Restoring a database in Postgres (this is repeated in Week Two?)

Session 2:  Basic SQL Procedures and Functions
  • Using commands: Select, Where, Like, Order By
  • Using functions, Count, Avg, Sum
  • Using group-by and sortCopying tables to csv

Session 3:  Bringing SQL into R
  • Joins (Inner, Outer, and multiple joins) in SQL
  • Using the ODBC driver to run SQL queries in R.
  • Reading in csv files
  • Joins vs merges?
  • Formatting dates
  • Subsetting data and merging it back in

Session 4:  Working in R
The Reshape2 function in R to reshape data long or wide in R
Plyr function in R for grouping data
  • Performing a few basic statistical calculations in R
  • Data summaries
  • Linear regression
  • T-test
  • Basic Graphs with Ggplot

Dr. Kathryn Vasilaky has worked at the World Bank, the Gates Foundation, the Federal Reserve, TroopSwap (a tech start-up in Washington DC), and as a soloist dancer with the San Francisco Ballet. She has a wealth of experience extracting and preparing data for analysis and business decision-making: macroeconomic data, commodity data, data resulting from experiments, social network data, e-commerce data, and other web data.

This course takes place over the internet at the Institute for 4 weeks. During each course week, you participate at times of your own choosing - there are no set times when you must be online. The course typically requires 15 hours per week. Course participants will be given access to a private discussion board so that they
will be able to ask questions and exchange comments with instructor, Dr. Iain Pardoe. The class discussions led by the instructor, you can post questions, seek clarification, and interact with your fellow students and the instructor.

For Indian participants accepts registration for its courses at special prices in Indian Rupees through its partner, the Center for eLearning and Training (C-eLT), Pune.

For India Registration and pricing, please visit us at

Call: 020 66009116


No comments: