"I didn't
have time to write a short letter, so here's a long one" (Mark Twain).
Such is the case with data in organizations - its natural state is sprawling,
complex, often messy and overweight. Statistical analysis (including much data
mining) needs trim, well-organized data.
Most big data are stored in relational database management systems in a set of related tables. Statistical analysis needs that data in a single table. Learn how to use SQL (structured query language) programming language to pull data from various tables in a database and assemble it in a format amenable to statistical analysis in R.
The purpose of this course is to teach you how to extract data from a relational database using SQL, and merge it into a single file in R, so that you can perform statistical operations. Join Kathryn Vasilaky in her online course "Wrangling and Munging with SQL and R". For more details please visit at http://www.statistics.com/data-munging/.
Who Should Take This Course:
This course is meant for beginners in data
science who need to learn how to extract data from their relational
databases. It can also be a good
kickstarter course for individuals who have fluency in SPSS, SAS, or STATA.
Such data scientists might not have had to work with relational databases
directly before, or they may not be familiar with the syntax and logic of a
more object oriented language like R.
Course Program:
Course outline: The course
is structured as follows
Session 1: Thinking Ahead, How to
Organize your Data?
- Discussion
and visuals of data models how data are stored in Postgres(SQL), i.e.
entity relationship schemes, vs. R, flat files or data frames.
- Unique identifiers, and primary vs. foreign keys
- Defining
your statistical objectives. Writing pseudo code to manipulate your data
into its necessary form.
- Questions
we’ll ask: Will I need all the data (panel), or should I collapse it on a
particular variable (cross section)?
- Restoring
a database in Postgres (this is repeated in Week Two?)
Session 2: Basic SQL Procedures and Functions
- Using
commands: Select, Where, Like, Order By
- Using
functions, Count, Avg, Sum
- Using
group-by and sortCopying tables to csv
Session 3: Bringing SQL into R
- Joins
(Inner, Outer, and multiple joins) in SQL
- Using
the ODBC driver to run SQL queries in R.
- Reading
in csv files
- Joins
vs merges?
- Formatting
dates
- Subsetting
data and merging it back in
Session 4: Working in R
The Reshape2 function in R to reshape data long or wide in R
Plyr function in R for grouping data
Plyr function in R for grouping data
- Performing
a few basic statistical calculations in R
- Data
summaries
- Linear
regression
- T-test
- Basic
Graphs with Ggplot
Dr. Kathryn Vasilaky has worked at the World Bank, the Gates Foundation, the Federal Reserve, TroopSwap (a tech start-up in Washington DC), and as a soloist dancer with the San Francisco Ballet. She has a wealth of experience extracting and preparing data for analysis and business decision-making: macroeconomic data, commodity data, data resulting from experiments, social network data, e-commerce data, and other web data.
This course takes place over the internet at the Institute for 4 weeks. During each course week, you participate at times of your own choosing - there are no set times when you must be online. The course typically requires 15 hours per week. Course participants will be given access to a private discussion board so that they will be able to ask questions and exchange comments with instructor, Dr. Iain Pardoe. The class discussions led by the instructor, you can post questions, seek clarification, and interact with your fellow students and the instructor.
For Indian participants statistics.com accepts registration for its courses
at special prices in Indian Rupees through its partner, the Center for
eLearning and Training (C-eLT), Pune.
Call: 020 66009116
Websites:
No comments:
Post a Comment