Chapter 9 Data Formats

9.1 MySQL database

Access database information is stored in the form of tables with relations. As a data scientist you’ll be expected to know how to collect data from a database.

Install MySQL

NOTE: remember to close the connection to the database

library(sqldf)

## Loading required package: gsubfn

## Loading required package: proto

## Loading required package: RSQLite

9.1.1 SQL commands

SELECT:

Wildcards: can be pass to extract everything

LIMIT: control the number of results, analogous to `head in a data frame.

ORDER BY: can order variables in ascending or descending order.

_WHERE:__ limit to a condition

9.2 read data from HDF5

Hierarchical Dataset Format (HDF) used for large data sets check for recent info. write data from the data sets package

know more at

9.3 web scrapping

Know a little more about the http protocol and GET POST PUT commands can be access by the httr

how Netflix reverse engineer Hollywood

connection <- url("https://en.wikipedia.org/wiki/Web_scraping")
raw_htm<-readLines(connection)

## Warning in readLines(connection): incomplete final line found on 'https://
## en.wikipedia.org/wiki/Web_scraping'

close(connection)

library(httr)
library(XML)

url<- "https://en.wikipedia.org/wiki/Web_scraping"

html<- GET(url)
class(html)

## [1] "response"

scrap <- content(html,as="text")
class(scrap)

## [1] "character"

parsed<- htmlParse(scrap, asText = TRUE)
# xpathApply(parsed)


names(html)

##  [1] "url"         "status_code" "headers"     "all_headers" "cookies"    
##  [6] "content"     "date"        "times"       "request"     "handle"

html$cookies

##                       domain  flag path secure          expiration
## 1 #HttpOnly_en.wikipedia.org FALSE    /   TRUE 2021-05-18 19:00:00
## 2   #HttpOnly_.wikipedia.org  TRUE    /   TRUE 2021-05-18 19:00:00
## 3             .wikipedia.org  TRUE    /   TRUE                <NA>
##                     name                        value
## 1        WMF-Last-Access                  17-Apr-2021
## 2 WMF-Last-Access-Global                  17-Apr-2021
## 3                  GeoIP CO:DC:Bogot__:4.64:-74.07:v4

google <- handle("https://www.google.com/")
pg1<- GET(handle=google, path=" search")

9.4 APIs

twitter

(https://developer.twitter.com/apps)

authentication keys tokens when working in open source and public code projects

9.5 Catch all

When dealing with data there are many potential sources which makes unpractical to make an extensive list that cover all of them. Th key takeaway is that there probably exists a R package that would help you extract data for any source you can imagine. ?connections

read images and GIS (Geographic Information System) and read music files with tuneR and seewave.

9.6 QUIZ

Register an application with the Github API here https://github.com/settings/applications. Access the API to get information on your instructors repositories (hint: this is the url you want “https://api.github.com/users/jtleek/repos”). Use this data to find the time that the data sharing repo was created. What time was it created?

This tutorial may be useful (https://github.com/hadley/httr/blob/master/demo/oauth2-github.r). You may also need to run the code in the base R package and not R studio.

library(jsonlite)
repos<-fromJSON("https://api.github.com/users/jtleek/repos")
repos$name

##  [1] "2018"                    "ads2020"                
##  [3] "advdatasci"              "advdatasci-project"     
##  [5] "advdatasci-swirl"        "advdatasci15"           
##  [7] "advdatasci16"            "advdatasci_swirl"       
##  [9] "ballgown"                "big_course"             
## [11] "bookdown-start"          "books"                  
## [13] "capitalIn21stCenturyinR" "careerplanning"         
## [15] "coc"                     "courses"                
## [17] "COVID-19"                "crsra"                  
## [19] "cshlcg-labs"             "data"                   
## [21] "dataanalysis"            "datascientist"          
## [23] "datasharing"             "datawomenontwitter"     
## [25] "day1"                    "derfinder"              
## [27] "derfinder-1"             "DSM"                    
## [29] "EDA-Project"             "escalatr"

id<-grep( "datasharing",repos$name)

repos$created_at[id]

## [1] "2013-11-07T13:25:07Z"

The sqldf package allows for execution of SQL commands on R data frames. We will use the sqldf package to practice the queries we might send with the dbSendQuery command in RMySQL.

Download the American Community Survey data and load it into an R object called

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv

library(sqldf)

url<-"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
fpath<- "../data/M3_week2_quiz_data/AmericanCommunitySurvey.csv "

if(is.null(fpath)){
    download.file(url,fpath)
}

acs<-read.table(fpath)

Which of the following commands will select only the data for the probability weights pwgtp1 with ages less than 50?