Lab 2 Data manipulation

2.1 Objectives

After this section you should be able to:

Load, explore and manipulate data in R

2.2 Introduction

One of the main uses of R is data manipulation and plotting. This is similar to what many of us do in a regular spreadsheet editor such as Excel or Google Sheets.

We will use the following packages. You can read the manual of each one for details.

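#Install packages (only needed once)
#install.packages("ggplot2")
#install.packages("plyr")
#install.packages("dplyr")
#install.packages("RColorBrewer")
#install.packages("car")

#Load the packages (plyr before dplyr, so that plyr does not mask dplyr functions)
library("ggplot2")
library("plyr")
library("dplyr")
library(RColorBrewer)
library(car)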

#Manuals
#vignette("dplyr")
#?ggplot2
#?plyr

2.3 Load data

There are many ways to load data. In the following chapters we will use a diverse set of functions to read the data from files. Some of them are:

read.table() #general to any type of table
read.csv() #specific for comma separated tables
read.delim() #specific for tab delimited tables

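Some of the important options of these functions are:

read.table(file = "location/of/your/file.txt", sep = "\t", header = TRUE) #header = TRUE or FALSE

Here sep is the separator, which can be a comma, a tab, etc., and header says whether the first line of the file contains the column names. You can see more details using: ?read.table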
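As a minimal sketch of how these options work together, the example below writes a small, made-up table to a temporary file and reads it back. The data frame example_df and the temporary file are assumptions for illustration only; they are not part of the lab data.

```r
# Hypothetical example table, written to a temporary tab-separated file
tmp_file <- tempfile(fileext = ".txt")
example_df <- data.frame(id = 1:3, weight = c(42, 51, 59))
write.table(example_df, tmp_file, sep = "\t", row.names = FALSE, quote = FALSE)

# sep must match the file; header = TRUE uses the first line as column names
from_table <- read.table(file = tmp_file, sep = "\t", header = TRUE)

# read.delim() assumes a tab separator and a header line by default
from_delim <- read.delim(tmp_file)

from_table
```

Both calls should return the same three-row table, which is why read.delim() is a convenient shortcut for tab-delimited files.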

In this case we will use data that is already available in R. The package datasets provides a handy collection of datasets to analyze.

We will use the ChickWeight dataset. It records the weight of chickens at different ages and on different diets.

This will allow us to visualize the data and to run some statistical tests.

# The datasets package is included with R, so there is nothing to install
# For a full list of these datasets, type library(help = "datasets")
# Load the library and dataset
library(datasets)
data(ChickWeight) #What happens in the Environment section of RStudio?

2.4 Data exploration

It is important to understand the data before heading into the analysis. We will go over some techniques for this.

# To see the table, you can click on it in the Environment pane or run this...
#View(ChickWeight)

# As you can see this is a table; to be safe, we convert it to a plain data.frame
ChickWeight<-as.data.frame(ChickWeight)

To see only the beginning, we can use the head function:

head(ChickWeight)

##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1

What does the argument n do?

head(ChickWeight,n = 20)

##    weight Time Chick Diet
## 1      42    0     1    1
## 2      51    2     1    1
## 3      59    4     1    1
## 4      64    6     1    1
## 5      76    8     1    1
## 6      93   10     1    1
## 7     106   12     1    1
## 8     125   14     1    1
## 9     149   16     1    1
## 10    171   18     1    1
## 11    199   20     1    1
## 12    205   21     1    1
## 13     40    0     2    1
## 14     49    2     2    1
## 15     58    4     2    1
## 16     72    6     2    1
## 17     84    8     2    1
## 18    103   10     2    1
## 19    122   12     2    1
## 20    138   14     2    1
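The n argument sets how many rows are returned (the default is 6). As a small sketch, the related base R function tail(), which is not otherwise used in this lab, takes the same argument and shows the end of the table instead. The object cw is just a working copy made for this example.

```r
library(datasets)
cw <- as.data.frame(ChickWeight)  # working copy used only in this example

first_rows <- head(cw, n = 3)  # first 3 rows instead of the default 6
last_rows  <- tail(cw, n = 3)  # last 3 rows of the table

first_rows
last_rows
```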

What is the structure of the data.frame?

str(ChickWeight)

## 'data.frame':	578 obs. of  4 variables:
##  $ weight: num  42 51 59 64 76 93 106 125 149 171 ...
##  $ Time  : num  0 2 4 6 8 10 12 14 16 18 ...
##  $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
##  $ Diet  : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "formula")=Class 'formula'  language weight ~ Time | Chick
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "outer")=Class 'formula'  language ~Diet
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "labels")=List of 2
##   ..$ x: chr "Time"
##   ..$ y: chr "Body weight"
##  - attr(*, "units")=List of 2
##   ..$ x: chr "(days)"
##   ..$ y: chr "(gm)"

With the $ operator we can explore the columns

class(ChickWeight$weight)

## [1] "numeric"
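To check every column at once instead of one at a time, one possible sketch uses sapply(). Keeping only the first class of each column is a simplification chosen here, because Chick is an ordered factor and therefore has two classes.

```r
library(datasets)
cw <- as.data.frame(ChickWeight)  # working copy used only in this example

# class() can return several values, so keep the first one for each column
col_classes <- sapply(cw, function(column) class(column)[1])
col_classes
```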

We can see the dimensions of the table. For example, how many rows does it have?

nrow(ChickWeight)

## [1] 578

How many columns?

ncol(ChickWeight)

## [1] 4

The names of the columns:

names(ChickWeight)

## [1] "weight" "Time"   "Chick"  "Diet"
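The number of rows and columns can also be obtained in a single call with dim(), a base R function. A minimal sketch, using a working copy named cw for illustration:

```r
library(datasets)
cw <- as.data.frame(ChickWeight)  # working copy used only in this example

table_dims <- dim(cw)  # c(number of rows, number of columns)
table_dims[1]          # same value as nrow(cw)
table_dims[2]          # same value as ncol(cw)
```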

With [] we can access individual elements:

names(ChickWeight)[3]

## [1] "Chick"

We can see the levels of a factor

levels(ChickWeight$Diet)[1:3]

## [1] "1" "2" "3"

What is the difference if we just print the column?

ChickWeight$Diet[1:3]

## [1] 1 1 1
## Levels: 1 2 3 4

Can we see the levels of a numeric vector? No: levels are only defined for factors. This is a reminder that the data type is important.

levels(ChickWeight$weight) #returns NULL, a numeric vector has no levels
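A quick way to check beforehand whether a column has levels is to test its type. The sketch below relies only on the base R functions is.factor(), nlevels() and is.null(); cw is a working copy made for the example.

```r
library(datasets)
cw <- as.data.frame(ChickWeight)  # working copy used only in this example

diet_is_factor   <- is.factor(cw$Diet)          # Diet is a factor
weight_is_factor <- is.factor(cw$weight)        # weight is numeric, not a factor
n_diets          <- nlevels(cw$Diet)            # number of distinct diet levels
weight_no_levels <- is.null(levels(cw$weight))  # levels() gives NULL for numerics

c(diet_is_factor, weight_is_factor, weight_no_levels)
n_diets
```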

We can now get some basic statistics:

mean(ChickWeight$weight)

## [1] 121.8183

summary(ChickWeight$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    35.0    63.0   103.0   121.8   163.8   373.0

summary(ChickWeight)

##      weight           Time           Chick     Diet   
##  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
##  Median :103.0   Median :10.00   20     : 12   3:120  
##  Mean   :121.8   Mean   :10.72   10     : 12   4:118  
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
##  Max.   :373.0   Max.   :21.00   19     : 12          
##                                  (Other):506

To see exactly what this is doing, go to the help page: ?summary

To save this summary table we can create an object with just the result of the summary

chick_sumary<-summary(ChickWeight)
chick_sumary

##      weight           Time           Chick     Diet   
##  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
##  Median :103.0   Median :10.00   20     : 12   3:120  
##  Mean   :121.8   Mean   :10.72   10     : 12   4:118  
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
##  Max.   :373.0   Max.   :21.00   19     : 12          
##                                  (Other):506

class(chick_sumary)

## [1] "table"

We can change the data type and assign the result to a different object

chick_sumary_df<-as.data.frame(chick_sumary)

This is not that useful, as you can see if you inspect the data using View(chick_sumary_df). This is because the format is complicated, so it is better to just save the table. We will see other ways to save data in R in future chapters. You can see more details using: ?write.table

write.table(chick_sumary, "mydata.txt", sep="\t", row.names = FALSE, col.names = TRUE)
#this is clearly not perfect, but for the important part, the numeric and integer columns, we have the statistics
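If only the numeric columns matter, an alternative sketch (not the approach used above) builds the summary column by column with sapply(), which gives a plain matrix that converts cleanly to a data frame. The object name num_summary and the temporary output file are assumptions made for illustration.

```r
library(datasets)
cw <- as.data.frame(ChickWeight)  # working copy used only in this example

# One summary() per numeric column; the statistic names become the row names
num_summary <- as.data.frame(sapply(cw[, c("weight", "Time")], summary))
num_summary

# Save to a temporary file; col.names = NA leaves a blank header cell
# above the row names so the columns stay aligned
out_file <- tempfile(fileext = ".txt")
write.table(num_summary, out_file, sep = "\t", row.names = TRUE, col.names = NA)
```

The saved file can then be read back with read.delim(out_file, row.names = 1).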

2.5 Subsetting

Subsetting means extracting part of the data. There are many different ways to do this. One important notion for tables and data frames is that dimensions go as follows: data[row,column]

#we can see specific columns and rows
ChickWeight[1,1:3] #row 1, column 1:3

##   weight Time Chick
## 1     42    0     1

ChickWeight[1:3,1] #rows 1:3, column 1

## [1] 42 51 59

ChickWeight[1,1] #row1, col1

## [1] 42
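Columns can also be selected by name, which tends to be easier to read than positions. The sketch below additionally uses drop = FALSE, a standard argument of [ for data frames, to keep a single column as a data frame rather than a vector. The object names are chosen only for this example.

```r
library(datasets)
cw <- as.data.frame(ChickWeight)  # working copy used only in this example

by_name  <- cw[1:3, "weight"]                # a plain numeric vector
two_cols <- cw[1:3, c("weight", "Time")]     # still a data frame
one_col  <- cw[1:3, "weight", drop = FALSE]  # one column, kept as a data frame

class(by_name)
class(one_col)
```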

For example, suppose we want only the data from the chickens on diet 4:

head(ChickWeight[ChickWeight$Diet==4,])

##     weight Time Chick Diet
## 461     42    0    41    4
## 462     51    2    41    4
## 463     66    4    41    4
## 464     85    6    41    4
## 465    103    8    41    4
## 466    124   10    41    4

Why == and not =?

Remember that in R, = is an assignment operator, like <-, while == is for comparison.

head(ChickWeight$Diet==4)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

Let's explore the class:

class(ChickWeight$Diet==4)

## [1] "logical"
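Because TRUE counts as 1 and FALSE as 0, a logical vector can be summed to count the matching rows. A small sketch; the variable name is_diet4 is introduced here only for illustration.

```r
library(datasets)
cw <- as.data.frame(ChickWeight)  # working copy used only in this example

is_diet4 <- cw$Diet == 4  # one TRUE/FALSE value per row
length(is_diet4)          # same as the number of rows in the table
sum(is_diet4)             # how many rows belong to diet 4
```

The count agrees with the Diet column of summary(ChickWeight), which reports 118 rows for diet 4.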

So, when we run ChickWeight[ChickWeight$Diet==4,], R is just showing the rows of ChickWeight for which ChickWeight$Diet==4 is TRUE. The which() function returns the positions of those rows:

head(which(ChickWeight$Diet==4))

## [1] 461 462 463 464 465 466

head(ChickWeight[ChickWeight$Diet==4,])

##     weight Time Chick Diet
## 461     42    0    41    4
## 462     51    2    41    4
## 463     66    4    41    4
## 464     85    6    41    4
## 465    103    8    41    4
## 466    124   10    41    4

And for more conditions, we can use AND (&) to combine them.

head(ChickWeight[ChickWeight$Diet==4 & ChickWeight$Time>6,])

##     weight Time Chick Diet
## 465    103    8    41    4
## 466    124   10    41    4
## 467    155   12    41    4
## 468    153   14    41    4
## 469    175   16    41    4
## 470    184   18    41    4

Another option is OR (|).

Remember that R evaluates & before |, unless parentheses say otherwise. So

condition-A & condition-B | condition-C

is read as (condition-A & condition-B) | condition-C, which is not the same as

condition-A & (condition-B | condition-C)
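A tiny sketch with made-up logical values shows that the two groupings can give different answers. The names a, b and c_cond are placeholders for any three conditions.

```r
# Placeholder conditions, chosen so that the two groupings disagree
a      <- FALSE
b      <- TRUE
c_cond <- TRUE

no_parens   <- a & b | c_cond    # evaluated as (a & b) | c_cond
with_parens <- a & (b | c_cond)  # parentheses force the OR to be evaluated first

no_parens    # TRUE
with_parens  # FALSE
```

When in doubt, adding parentheses makes the intended grouping explicit.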

head(ChickWeight[ChickWeight$Diet==4 & ChickWeight$Time>6 & ChickWeight$Time<20,])

##     weight Time Chick Diet
## 465    103    8    41    4
## 466    124   10    41    4
## 467    155   12    41    4
## 468    153   14    41    4
## 469    175   16    41    4
## 470    184   18    41    4

And if we just want the weights of these…

ChickWeight$weight[ChickWeight$Diet==4 & ChickWeight$Time>6 & ChickWeight$Time<20,]

Why does this give an error?

Because we only have one dimension now, not two. ChickWeight$weight is a one-dimensional object, so we have to use [ ], not [ , ].

head(ChickWeight$weight[ChickWeight$Diet==4 & ChickWeight$Time>6 & ChickWeight$Time<20])

## [1] 103 124 155 153 175 184
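The same selection can also be written with the base R function subset(), which is not used elsewhere in this lab and is shown here only as an alternative sketch. Inside subset() the column names can be written without the ChickWeight$ prefix.

```r
library(datasets)
cw <- as.data.frame(ChickWeight)  # working copy used only in this example

# Rows where all three conditions hold, keeping only the weight column
selected <- subset(cw, Diet == 4 & Time > 6 & Time < 20, select = weight)

# The bracket version from the text, for comparison
with_brackets <- cw$weight[cw$Diet == 4 & cw$Time > 6 & cw$Time < 20]

head(selected$weight)
```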

2.6 Activity:

This activity integrates knowledge from the previous chapter.

1. Remove the first and last row of the ChickWeight data frame
2. Create a vector with the second column from the data frame