Chapter 9 Data Formats
9.1 MySQL database
Access database information is stored in the form of tables with relations. As a data scientist you’ll be expected to know how to collect data from a database.
Install MySQL
NOTE: remember to close the connection to the database
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
9.1.1 SQL commands
SELECT:
Wildcards: can be pass to extract everything
LIMIT: control the number of results, analogous to `head in a data frame.
ORDER BY: can order variables in ascending or descending order.
_WHERE:__ limit to a condition
9.2 read data from HDF5
Hierarchical Dataset Format (HDF) used for large data sets check for recent info. write data from the data sets package
9.3 web scrapping
Know a little more about the http protocol and GET
POST
PUT
commands can be access by the httr
how Netflix reverse engineer Hollywood
## Warning in readLines(connection): incomplete final line found on 'https://
## en.wikipedia.org/wiki/Web_scraping'
library(httr)
library(XML)
url<- "https://en.wikipedia.org/wiki/Web_scraping"
html<- GET(url)
class(html)
## [1] "response"
## [1] "character"
## [1] "url" "status_code" "headers" "all_headers" "cookies"
## [6] "content" "date" "times" "request" "handle"
## domain flag path secure expiration
## 1 #HttpOnly_en.wikipedia.org FALSE / TRUE 2021-05-18 19:00:00
## 2 #HttpOnly_.wikipedia.org TRUE / TRUE 2021-05-18 19:00:00
## 3 .wikipedia.org TRUE / TRUE <NA>
## name value
## 1 WMF-Last-Access 17-Apr-2021
## 2 WMF-Last-Access-Global 17-Apr-2021
## 3 GeoIP CO:DC:Bogot__:4.64:-74.07:v4
9.4 APIs
(https://developer.twitter.com/apps)
authentication keys tokens when working in open source and public code projects
9.5 Catch all
When dealing with data there are many potential sources which makes unpractical to make an extensive list that cover all of them. Th key takeaway is that there probably exists a R package that would help you extract data for any source you can imagine. ?connections
read images and GIS (Geographic Information System) and read music files with tuneR
and seewave
.
9.6 QUIZ
- Register an application with the Github API here https://github.com/settings/applications. Access the API to get information on your instructors repositories (hint: this is the url you want “https://api.github.com/users/jtleek/repos”). Use this data to find the time that the data sharing repo was created. What time was it created?
This tutorial may be useful (https://github.com/hadley/httr/blob/master/demo/oauth2-github.r). You may also need to run the code in the base R package and not R studio.
## [1] "2018" "ads2020"
## [3] "advdatasci" "advdatasci-project"
## [5] "advdatasci-swirl" "advdatasci15"
## [7] "advdatasci16" "advdatasci_swirl"
## [9] "ballgown" "big_course"
## [11] "bookdown-start" "books"
## [13] "capitalIn21stCenturyinR" "careerplanning"
## [15] "coc" "courses"
## [17] "COVID-19" "crsra"
## [19] "cshlcg-labs" "data"
## [21] "dataanalysis" "datascientist"
## [23] "datasharing" "datawomenontwitter"
## [25] "day1" "derfinder"
## [27] "derfinder-1" "DSM"
## [29] "EDA-Project" "escalatr"
## [1] "2013-11-07T13:25:07Z"
- The sqldf package allows for execution of SQL commands on R data frames. We will use the sqldf package to practice the queries we might send with the dbSendQuery command in RMySQL.
Download the American Community Survey data and load it into an R object called
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv
library(sqldf)
url<-"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
fpath<- "../data/M3_week2_quiz_data/AmericanCommunitySurvey.csv "
if(is.null(fpath)){
download.file(url,fpath)
}
acs<-read.table(fpath)
Which of the following commands will select only the data for the probability weights pwgtp1 with ages less than 50?