Chapter 8 Data recipes

Raw vs processed data: when transforming raw data into processed data, it is important to record every step of the process (the "cookbook").

tidying: structuring datasets to facilitate analysis

Tidy data is a standard way of mapping the meaning of a dataset to its structure.

An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observation unit forms a table

One way of organizing variables is by their role in the analysis: are values fixed by the design of the data collection, or are they measured during the course of the experiment? … Fixed variables should come first, followed by measured variables, each ordered so that related variables are contiguous.

problems to be faced:

* Column headers are values
* Multiple variables are stored in one column
* Variables are stored in both rows and columns
* Multiple types of observational units are stored in the same table
* A single observational unit is stored in multiple tables
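The first problem (column headers are values, not variable names) can be sketched with the tidyr package — an assumption here, since the chapter names no package; the data is a made-up fragment in the style of the classic religion-by-sex counts example:

```r
library(tidyr)

# Wide table where the headers "male" and "female" are really
# values of a sex variable (illustrative data, not from the chapter)
counts <- data.frame(religion = c("Agnostic", "Atheist"),
                     male     = c(27, 12),
                     female   = c(34, 27))

# Gather the value-carrying headers into a proper variable
tidy <- pivot_longer(counts, cols = c(male, female),
                     names_to = "sex", values_to = "count")
tidy
```

After the reshape, each row is one observation (religion, sex, count) and each variable is one column.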

8.1 codebook

Must contain:

* A “Study design” section: a description of the methods used to collect the data.
* A description of each variable to be used, including its units.

8.2 instruction list

A script with no parameters that takes the raw data as input and produces the processed/tidy data. If not all of the processing can be done through the script, there should be instructions for any additional steps.
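A minimal sketch of such a script; the file names and the cleaning steps are assumptions for illustration (a stand-in raw file is generated first so the sketch is self-contained):

```r
# Demo raw file (stand-in for the real raw data)
write.csv(data.frame(Name = c("a", NA, "b"), Value = c(1, 2, 3)),
          "raw_data.csv", row.names = FALSE)

# run_analysis.R -- no parameters: reads raw data, writes processed/tidy data
raw  <- read.csv("raw_data.csv")
tidy <- raw[complete.cases(raw), ]   # example step: drop incomplete rows
names(tidy) <- tolower(names(tidy))  # example step: standardize column names
write.csv(tidy, "tidy_data.csv", row.names = FALSE)
```

Anyone with the raw file and this script can regenerate the tidy data exactly.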

8.3 XML

Extensible Markup Language, used to store structured data and extensively used in web scraping. It is composed of two parts:

* The markup: the labels that compose the structure
* The content: the actual values stored

As with HTML, it works with tags: usually there is a starting tag and an ending tag, and tags can hold attributes.
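A sketch of extracting content and attributes, using the xml2 package (one of several options; the breakfast-menu snippet is an assumed example):

```r
library(xml2)

# A small XML document: markup (tags, attributes) plus content
doc <- read_xml('
<breakfast_menu>
  <food><name price="$5.95">Belgian Waffles</name></food>
  <food><name price="$7.95">French Toast</name></food>
</breakfast_menu>')

nodes  <- xml_find_all(doc, "//name")  # XPath: every <name> tag
foods  <- xml_text(nodes)              # the content between the tags
prices <- xml_attr(nodes, "price")     # the attribute on each start tag
foods
prices
```

XPath expressions like `//name` are how scrapers pull specific values out of the tag structure.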

8.4 JSON

JavaScript Object Notation is a way of storing data in a structured manner, used extensively in APIs. R data sets can be transformed to and from the JSON format.
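A sketch with the jsonlite package (a common choice, assumed here); the data-frame conversion reproduces the iris output shown below, and the commented-out API call is the kind that yields the gist field names:

```r
library(jsonlite)

# Data frame -> JSON text
myjson <- toJSON(head(iris, 3), pretty = TRUE)
cat(myjson)

# JSON text -> data frame again
iris2 <- fromJSON(myjson)

# fromJSON also accepts a URL directly, e.g. for the GitHub API
# (needs a network connection):
# jsonData <- fromJSON("https://api.github.com/gists"); names(jsonData)
```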

##  [1] "url"          "forks_url"    "commits_url"  "id"           "node_id"     
##  [6] "git_pull_url" "git_push_url" "html_url"     "files"        "public"      
## [11] "created_at"   "updated_at"   "description"  "comments"     "user"        
## [16] "comments_url" "owner"        "truncated"
## [
##   {
##     "Sepal.Length": 5.1,
##     "Sepal.Width": 3.5,
##     "Petal.Length": 1.4,
##     "Petal.Width": 0.2,
##     "Species": "setosa"
##   },
##   {
##     "Sepal.Length": 4.9,
##     "Sepal.Width": 3,
##     "Petal.Length": 1.4,
##     "Petal.Width": 0.2,
##     "Species": "setosa"
##   },
##   {
##     "Sepal.Length": 4.7,
##     "Sepal.Width": 3.2,
##     "Petal.Length": 1.3,
##     "Petal.Width": 0.2,
##     "Species": "setosa"
##   }
## ]

8.5 data table

It is analogous to the data frame structure but tends to be more optimized. The tables command (don’t confuse it with table) displays the currently loaded data.tables.

## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## [1] "data.table" "data.frame"
##    NAME NROW NCOL MB                                                      COLS
## 1:   DT  150    5  0 Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
##    KEY
## 1:    
## Total: 0MB
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Area
## 1:          5.1         3.5          1.4         0.2  setosa       0.28
## 2:          4.9         3.0          1.4         0.2  setosa       0.28
## 3:          4.7         3.2          1.3         0.2  setosa       0.26

“Expressions” add new columns. If a data table is assigned to a new variable, the two become intertwined, so any change in one will affect the other; it is better to create a copy through the copy function.
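A sketch of both points with the data.table package; the Petal.Area expression is an assumption chosen to match the output above:

```r
library(data.table)

DT <- as.data.table(iris)

# ":=" adds a column by reference -- an expression, not a copy
DT[, Petal.Area := Petal.Length * Petal.Width]

# Plain assignment does NOT copy: DT2 and DT point to the same data...
DT2 <- DT
# ...but copy() gives a truly independent table
DT3 <- copy(DT)

DT2[, Petal.Area := NULL]     # removes the column from DT as well
"Petal.Area" %in% names(DT)   # FALSE: DT changed along with DT2
"Petal.Area" %in% names(DT3)  # TRUE: the copy is unaffected
```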



## SWIRL practice


```r
library(swirl)
install_from_swirl("Getting and Cleaning Data")
```