Chapter 4 Miscellaneous
4.1 Missing data: NaN, NA and NULL
The world tends to be messy, and missing data tends to be the norm rather than the exception. R deals with this by defining the special values NaN and NA. NaN ("not a number") is more specific and is used when the missing value is a numeric one; NA, on the other hand, is more general and covers any missing value. Since NA has the broader definition, NaN values are considered to be covered by it: any NaN is simultaneously NA as well. The functions is.na() and is.nan() are used to identify these values; a numeric NaN is recognized as TRUE by both. Note that in the example below the vector is a factor, so the assigned NaN is coerced to a plain NA (with a warning), which is why is.nan() ends up FALSE everywhere.
answers <- c("good", "bad", "really good", "really bad", "bad", "really bad", "good")
satisfaction <- factor(answers, levels = c("really bad", "bad", "good", "really good"))
satisfaction[3] <- NA
satisfaction[4] <- NaN  # NaN is not a valid factor level: stored as NA, with a warning
is.na(satisfaction)
is.nan(as.numeric(satisfaction))  # the NaN was coerced to NA, so nothing is NaN here
## [1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Another important value related to missing data is NULL. While the distinction might cause confusion, I'd like to refer to the following saying to address the difference:
NA represents the absence of presence while null represents the presence of absence
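To see the difference in practice, note that NULL simply disappears when placed inside a vector, while NA holds its slot open:
x <- c(1, NA, 3, NULL)
length(x)       # 3: the NULL vanished, the NA kept its position
is.null(NULL)   # TRUE: the null object
is.na(x)        # FALSE TRUE FALSE: the second slot is present but unknown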
NA values can create a lot of trouble: many operations are not defined over missing data, and when applied they can render a whole dataset of unusable results. This is a known issue, so many functions support a mechanism for dealing with missing values; the na.rm parameter removes the missing values before the function is evaluated.
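The call that produced the output below is not included in the source; a vector like the following (an assumption, chosen to reproduce the printed results) illustrates the behavior, with the two calls interleaved with their output:
x <- c(1, 2, NaN, 4, 6)
mean(x)  # any NA/NaN propagates to the result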
## [1] NaN
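mean(x, na.rm = TRUE)  # drop the missing value first: (1 + 2 + 4 + 6) / 4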
## [1] 3.25
Similar tricks and workarounds can be used to remove the missing values from a dataset; they exploit logical indexing to drop any element that is NA.
data <- c(1:4, NA, 6:10, NaN, NA, 1, NA, 3:6)
missing <- is.na(data)        # TRUE wherever data is NA or NaN
clean_data <- data[!missing]  # keep only the defined entries
data2 <- c(3:7, 6:10, NA, 3:1, NaN, NA, 1, 5)
good <- complete.cases(data, data2)  # TRUE where both vectors are defined
data[good]
data2[good]
## [1] 1 2 3 4 6 7 8 9 10 1 5 6
## [1] 3 4 5 6 6 7 8 9 10 2 1 5
When working with data frames, complete.cases() can be useful for handling missing values: it returns a logical vector whose values are TRUE or FALSE depending on whether all the elements of a row are defined and none is NA. na.omit() is an alternative way to accomplish the same result; instead of returning a logical vector, it directly returns the rows that have no missing values.
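The data frame behind the output below is not shown in the source; a definition like this one, reconstructed from the printed values, reproduces it together with the calls interleaved below:
df <- data.frame(a = 1:4,
                 b = 2:5,
                 c = c(NA, 11, 12, 13),
                 d = c(21, 22, NA, 24))
df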
## a b c d
## 1 1 2 NA 21
## 2 2 3 11 22
## 3 3 4 12 NA
## 4 4 5 13 24
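complete.cases(df)  # only rows 2 and 4 are fully defined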
## [1] FALSE TRUE FALSE TRUE
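df[complete.cases(df), ]  # logical indexing keeps the complete rows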
## a b c d
## 2 2 3 11 22
## 4 4 5 13 24
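na.omit(df)  # same result, returned directly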
## a b c d
## 2 2 3 11 22
## 4 4 5 13 24
There are many functions in the na.*() family, which provide a rich range of responses when encountering missing values: for example, na.fail() signals an error if the object contains any NA value, while na.contiguous() finds the longest contiguous stretch of non-missing values.
4.2 functions
When programming you always want to break big problems into smaller ones, and functions create the perfect container to hold each of these little pieces. R, as mentioned previously, is a programming language oriented towards statistical analysis, so it already has many useful functions baked in. For example, sample() draws a random sample of the specified size; by default there is no replacement, but this behavior can easily be modified.
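The calls behind the output below are missing from the source, and since sample() is random the exact call cannot be recovered; something like the following (the factorial() call is my guess for the second result) fits the output:
sample(1:5, 2)  # two values drawn without replacement; results vary by seed
factorial(5)    # another handy built-in: 5! = 120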
## [1] 1 3
## [1] 120
4.3 plot
Another useful tool that R offers, and is well known for, is the ability to draw diagrams. Plots are a fundamental way of conveying meaningful information from a pile of data, and they play a key role in R's philosophy of reproducible research.
library('ggplot2')
x <- c(-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1)
y <- x^3
qplot(x, y)
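Note: in recent versions of ggplot2 (3.4.0 and later) qplot() is deprecated; an equivalent ggplot() call would be:
ggplot(data.frame(x, y), aes(x, y)) + geom_point()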
Another great phrase:
bias can easily occur without anyone noticing it in the short run.
— Grolemund (2014)
4.3.1 practical exercise
Wickham and Grolemund (2017) show some practical plotting examples using the tidyverse, a collection of very useful packages for data manipulation. If you want to know more about packages you can head to the corresponding section.
We are going to use some practical data provided by the EPA (Environmental Protection Agency): the mpg data frame, which is part of the tidyverse and is used as a go-to introductory dataset.
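The exercise code itself does not appear in the source; a minimal plot in the spirit of Wickham and Grolemund (2017), using mpg, would be:
library(ggplot2)
# engine displacement vs. highway fuel efficiency, one point per car model
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))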
4.4 R Packages
One of the advantages of open source is that the community can create packages which extend the functionality of the core R implementation. This allows special niche use cases to be developed and satisfied by the people in that circle without being crammed into the base program everybody uses, and it gives you the flexibility to customize and pick the tools you need while tapping into the work done by a large community. There are many ways to access these packages: the largest collection is provided via the CRAN repository, but Git-based packages are also supported. In order to use any external package it is necessary not only to install it, but also to load it into the R environment.
- CRAN based package: CRAN (the Comprehensive R Archive Network) is a collection that holds documentation and code related to R. It stores many R packages, which have the advantage that they can be installed directly via the install.packages(" ") command, making it extremely easy to incorporate new packages into a project. For example, in order to install the tidyverse package collection the following command can be used:
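install.packages("tidyverse")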
Once installed, the library() command can be used to load the package into R. Notice that install.packages() takes its argument between quotes but library() does not:
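library(tidyverse)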
Note: in order to install some packages (the ones that ship C/C++/Fortran source code instead of a pre-built binary), Rtools is needed to build them, and the Rtools version must be compatible with the R version.
- Git based package: packages hosted on Git services such as GitHub can be installed with a helper package; for example, remotes::install_github("user/repo") installs a package straight from its GitHub repository.
Note: packages need to be loaded in every new R session. The following commands show the loaded packages, and in case you need to unload a package the detach() command can be helpful.
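The commands themselves are missing from the source; these standard ones match the description:
search()                                  # lists the currently attached packages
detach("package:ggplot2", unload = TRUE)  # unloads a previously loaded package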
Usually when dealing with a new package you want to know what is inside it: which datasets, functions, or other functionality it provides. The help operator ? can be used to explore the package documentation and get a better idea of what it contains. The following example explores the datasets package, a built-in package that stores some useful datasets.
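The exact call is not shown in the source; library(help = "datasets") prints the package description along with the dataset list below:
library(help = "datasets")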
The following is a list with some of these datasets:
AirPassengers Monthly Airline Passenger Numbers 1949-1960
BJsales Sales Data with Leading Indicator
BOD Biochemical Oxygen Demand
CO2 Carbon Dioxide Uptake in Grass Plants
ChickWeight Weight versus age of chicks on different diets
DNase Elisa assay of DNase
EuStockMarkets Daily Closing Prices of Major European Stock Indices, 1991-1998
We can take a look at some of these datasets.
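The calls are not included in the source; head() and tail(), interleaved with the output below, reproduce it, starting with the European stock indices:
head(EuStockMarkets)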
## DAX SMI CAC FTSE
## [1,] 1628.75 1678.1 1772.8 2443.6
## [2,] 1613.63 1688.5 1750.5 2460.2
## [3,] 1606.51 1678.6 1718.0 2448.2
## [4,] 1621.04 1684.1 1708.1 2470.4
## [5,] 1618.16 1686.6 1723.1 2484.7
## [6,] 1610.61 1671.6 1714.3 2466.8
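head(AirPassengers)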
## [1] 112 118 132 129 121 135
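tail(ChickWeight)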
## weight Time Chick Diet
## 573 155 12 50 4
## 574 175 14 50 4
## 575 205 16 50 4
## 576 234 18 50 4
## 577 264 20 50 4
## 578 264 21 50 4
4.4.1 create your own package
There are many things to consider when thinking about building a package.
Why are packages important?
- Code organization
- Consistent documentation
- Code distribution
Why am I creating a package? For me it is about code distribution: I am not planning on creating code useful to all mankind (not yet). I want to create an example package that covers the basics of package creation, allows me to access some custom scripts across different computers, and has all the needed dependencies configured.
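The loading call is absent from the source; attaching devtools pulls in usethis as a dependency, which produces the message below:
library(devtools)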
## Loading required package: usethis
here::dr_here() can be used to diagnose how the here package resolves the project root; to make the package itself permanently available, use the install() function from devtools.
Aim for a minimum viable product: it's best to get a project started and improve it through iteration.
Never use library() or require() inside an R package!
4.5 Read info with read.table
Data can be produced in different forms; it can be structured or unstructured. Regardless of the format, being able to interact with this data and organize it in a useful manner is an important step, so that it can be fed into the subsequent analysis. This process of organizing the data is known as pipelining (to learn more about this process feel free to check the Data collection module).
Some common formats for data storage are CSV (Comma-Separated Values) files, and R has built-in functions for reading this type of file: read.csv() and read.table(). Here is an example of some student data read from a CSV file. These functions have parameters that help specify the behavior we want; for example, one can set which symbol is used as the separator, or whether the file has a header row. For more info on the other supported arguments please read the documentation. One thing to be aware of when dealing with a large dataset is RAM: if the dataset is larger than the available memory, it will create trouble.
initial <- read.csv("../data/student_info.csv", sep = ";", header = TRUE, nrows = 100)
classes <- sapply(initial, class)  # infer each column's class from the sample rows
students <- read.table("../data/student_info.csv", sep = ";", header = TRUE,
                       colClasses = classes)
print(students)
## Name.and.Lastname ID email.adress
## 1 Samuel Jackson 10124585611 samuelJ@mail.com
## 2 Robert Downey 4525111559 rdownironman@mail.c
## 3 Douglas Adams 14884674721 zaphodbeeblebrox@mail.com
## 4 Stephen Wolfram 74682868914 wolframalpha@mail.com
## 5 Cleve Moler 17463589721 chiefmathematician@mail.com
## 6 Matt Parker 18457956247 parkersquare@mail.com
## 7 Brady Haran 17694858874 numberphile@mail.com
## 8 Emily Graslie 17973595287 brainscoop@mial.com
## 9 Derek Muller 17954965491 veritasium@mail.com
## 10 Destin Sandlin 17895782879 smarter@mail.com
## 11 Freddy Vega 17795795697 fredier@mail.com
## 12 Stanislaw Lem 19748659471 golem@mail.com
## 13 Isaac Asimov 13589844557 robots@mail.com
## 14 Susan Calvin 47958446526 susan@R_MM_Inc.gov
## 15 R Daneel Olivaw 10001110101 dolivaw@mail.com
## 16 Elijah Baley 97565841047 lijahbaley@mail.com
Here is a second example, which displays the last elements of a file.
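The reading call is not shown; assuming a deck-of-cards CSV in the style of Grolemund (2014) (the file path is a guess), the output below comes from:
deck <- read.csv("../data/deck.csv", header = TRUE)
tail(deck, 5)  # the last five rows of the 52-card deck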
## face suit value
## 48 five hearts 5
## 49 four hearts 4
## 50 three hearts 3
## 51 two hearts 2
## 52 ace hearts 1
When working with large datasets, specifying the data types via the colClasses parameter can reduce the time needed to read the data, since R no longer has to infer each column's class.
4.6 week assignment
As part of this week's assignment it was optional to do the following swirl practices. swirl (Statistics with Interactive R Learning) is an R package developed to let students learn by interacting directly with the command interface. In order to use it one needs to install the swirl package.
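The installation steps are not listed in the source; the standard ones are:
install.packages("swirl")  # install once
library(swirl)             # load in each session
swirl()                    # start the interactive lessons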
References
Grolemund, Garrett. 2014. Hands-on Programming with R. O’Reilly. https://rstudio-education.github.io/hopr/index.html.
Wickham, H., and G. Grolemund. 2017. R for Data Science. O’Reilly Media. https://r4ds.had.co.nz/.