Chapter 8 Data recipes
Raw vs processed data: when transforming raw data into processed data, it is important to remember that every step of the process must be recorded (cookbook).
Tidying: structuring datasets to facilitate analysis.
Tidy data is a standard way of mapping the meaning of a dataset to its structure.
An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
One way of organizing variables is by their role in the analysis: are values fixed by the design of the data collection, or are they measured during the course of the experiment? … Fixed variables should come first, followed by measured variables, each ordered so that related variables are contiguous
Problems to be faced:
* Column headers are values
* Multiple variables are stored in one column
* Variables are stored in both rows and columns
* Multiple types of observational units are stored in the same table
* A single observational unit is stored in multiple tables
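The first problem in the list, column headers that are really values, can be sketched as follows. This is a minimal illustration, assuming the tidyr package is installed; the data frame `messy` and its column names are made up for the example.

```r
library(tidyr)

# Hypothetical messy table: the year is stored in the column headers
messy <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(11, 22),
  check.names = FALSE
)

# Move the headers into a "year" variable so each row is one observation
tidy <- pivot_longer(messy, cols = -country,
                     names_to = "year", values_to = "cases")
tidy
```

After the reshaping, each variable (country, year, cases) has its own column and each country-year pair has its own row.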
8.1 Codebook
Must contain:
* A “Study design” section: a description of the methods used to collect the data.
* A description of each and every variable to be used, including its units.
Instruction list: a script with no parameters that takes the raw data as input and produces the processed/tidy data. If it is not possible to carry out the whole process through the script, there should be instructions for any additional steps.
8.2 Downloading files
Let’s face it: the internet is the gateway to the knowledge of the world, and you will likely obtain most of your datasets by downloading them from the internet.
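# Download the page to the working directory and record when it was retrieved
download.file("https://google.com", "google.html")
dateDownloaded <- date()
# Create the data directory if it does not exist yet
if (!file.exists("../data")) {
  dir.create("../data")
}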
Related topics: working with files and directories, curl, URL.
8.3 XML
Extensible Markup Language (XML) is used to store structured data and is used extensively in web scraping. It is composed of two parts:
* The markup: the labels that make up the structure
* The content: the actual values stored
Like HTML, it is structured with tags: usually there is a start tag and an end tag, and tags can hold attributes.
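To make the markup/content split concrete, here is a small sketch that parses an inline XML string. It assumes the xml2 package is installed; the `<menu>` document and its tag names are invented for the illustration and do not come from any real data source.

```r
library(xml2)

# Hypothetical XML document: tags are the markup, the text inside is the content
doc <- read_xml('
<menu>
  <food id="1"><name>Waffles</name><price>5.95</price></food>
  <food id="2"><name>Toast</name><price>4.50</price></food>
</menu>')

# Select nodes with XPath, then pull out attributes and text content
foods      <- xml_find_all(doc, "//food")
ids        <- xml_attr(foods, "id")
food_names <- xml_text(xml_find_all(doc, "//food/name"))
prices     <- as.numeric(xml_text(xml_find_all(doc, "//food/price")))

menu <- data.frame(id = ids, name = food_names, price = prices)
menu
```

Here `id` is an attribute held by the `food` tag, while the names and prices are content stored between a start tag and an end tag.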
8.4 JSON
JavaScript Object Notation (JSON) is a way of storing data in a structured manner, used extensively in APIs. Data sets can also be transformed into JSON format.
library(jsonlite)
# Read JSON from the GitHub API into R objects
repos <- fromJSON("https://api.github.com/users/jsduenass/repos")
gists <- fromJSON("https://api.github.com/users/jsduenass/gists")
names(gists)
## [1] "url" "forks_url" "commits_url" "id" "node_id"
## [6] "git_pull_url" "git_push_url" "html_url" "files" "public"
## [11] "created_at" "updated_at" "description" "comments" "user"
## [16] "comments_url" "owner" "truncated"
## [
## {
## "Sepal.Length": 5.1,
## "Sepal.Width": 3.5,
## "Petal.Length": 1.4,
## "Petal.Width": 0.2,
## "Species": "setosa"
## },
## {
## "Sepal.Length": 4.9,
## "Sepal.Width": 3,
## "Petal.Length": 1.4,
## "Petal.Width": 0.2,
## "Species": "setosa"
## },
## {
## "Sepal.Length": 4.7,
## "Sepal.Width": 3.2,
## "Petal.Length": 1.3,
## "Petal.Width": 0.2,
## "Species": "setosa"
## }
## ]
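The JSON shown above matches the first three rows of the built-in `iris` data set. The code that produced it is not shown in these notes, so the following is only a plausible reconstruction, assuming jsonlite's `toJSON` with `pretty = TRUE`; the variable names `iris_json` and `back` are made up here.

```r
library(jsonlite)

# Convert the first three rows of iris into a JSON array of objects
iris_json <- toJSON(head(iris, 3), pretty = TRUE)
iris_json

# Round trip: parse the JSON text back into a data frame
back <- fromJSON(iris_json)
str(back)
```

Note that the round trip does not restore everything: `Species` comes back as a character column rather than a factor, because JSON has no factor type.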
8.5 Data table
A data.table is analogous to a data frame, but it tends to be more optimized. The tables() function (not to be confused with table()) lists the data.tables currently loaded in memory.
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## [1] "data.table" "data.frame"
## NAME NROW NCOL MB COLS
## 1: DT 150 5 0 Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
## KEY
## 1:
## Total: 0MB
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Area
## 1: 5.1 3.5 1.4 0.2 setosa 0.28
## 2: 4.9 3.0 1.4 0.2 setosa 0.28
## 3: 4.7 3.2 1.3 0.2 setosa 0.26
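The code behind the output above is not shown in these notes. A sketch that would produce similar output, assuming the data.table package is installed and that `Petal.Area` was computed as `Petal.Length * Petal.Width` (which agrees with the three printed values), is:

```r
library(data.table)

# Build a data.table from the built-in iris data frame
DT <- data.table(iris)
class(DT)   # a data.table is also a data.frame

# List the data.tables currently in memory
tables()

# Add a new column by reference with := (assumed definition of Petal.Area)
DT[, Petal.Area := Petal.Length * Petal.Width]
head(DT, 3)
```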
“Expressions” add new columns. If a data.table is assigned to a new variable, the two become intertwined: both names refer to the same object, so any change made through one will affect the other. It is better to create an independent copy with the copy() function.
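The difference between plain assignment and copy() can be seen in a short sketch. The table and the variable names `alias` and `snapshot` are made up for the illustration; it assumes the data.table package is installed.

```r
library(data.table)

DT <- data.table(x = 1:3)

alias    <- DT        # plain assignment: both names point to the same table
snapshot <- copy(DT)  # copy(): an independent table

# Adding a column by reference changes DT and alias, but not snapshot
DT[, y := x * 2]

names(alias)     # includes the new column y
names(snapshot)  # still only x
```

Taking the copy before modifying the table is what keeps `snapshot` unchanged.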
## SWIRL practice
```r
library(swirl)
install_from_swirl("Getting and Cleaning Data")
print("i should not show up")