Lab 2 Data manipulation
2.1 Objectives
After this section you should be able to:
- Load, explore and manipulate data in R
2.2 Introduction
One of the main uses of R is data manipulation and plotting. This is similar to what many of us do in a regular spreadsheet editor such as Excel or Google Sheets.
We will use the following packages. You can read the manual of each of them for details.
2.3 Load data
There are many ways to load data. In the following chapters we will use a diverse set of functions to read the data from files. Some of them are:
read.table() #general to any type of table
read.csv() #specific for comma separated tables
read.delim() #specific for tab delimited tables
Some of the important options of these functions are:
read.table(file = "location/of/your/file.txt", sep = ".", header = TRUE) # header can be TRUE or FALSE
Where the separator can be a comma, dot, etc. You can see more details using: ?read.table
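As a minimal sketch of how these options fit together, the example below writes a small table to a temporary file and reads it back. The file, the column names, and the values are hypothetical and only serve as an illustration.

```r
# Hypothetical example data, written to a temporary file
example_file <- tempfile(fileext = ".csv")
example_data <- data.frame(chick = c("a", "b", "c"), weight = c(42, 51, 59))
write.table(example_data, file = example_file, sep = ",", row.names = FALSE)

# header = TRUE tells read.table() that the first line holds the column names
loaded <- read.table(file = example_file, sep = ",", header = TRUE)
print(loaded)
```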
In this case we will use data that is already available in R. The package datasets provides a handy collection of data sets to analyze.
We will use the ChickWeight dataset. It records the weight of chickens at different ages and under different diets.
This will allow us to visualize the data and to run some statistical tests.
2.4 Data exploration
It is important to understand the data before heading into the analysis. We will go over some techniques for this.
# To see the table, you can click on the environment part or run this...
#View(ChickWeight)
# As you can see, this is a table; just in case, we convert it to a data.frame
ChickWeight <- as.data.frame(ChickWeight)
To see only the beginning, we can use the head function:
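The call is probably as simple as the sketch below. The object name first_rows is only introduced here for illustration.

```r
# ChickWeight ships with the datasets package, so it is available without loading a file
first_rows <- head(ChickWeight)  # by default, head() returns the first 6 rows
print(first_rows)
```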
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
What is n doing?
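The argument n sets how many rows head() returns. A sketch, assuming the output that follows was produced with n = 20:

```r
# n controls the number of rows shown; here the first 20 instead of the default 6
first_twenty <- head(ChickWeight, n = 20)
print(first_twenty)
```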
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
## 7 106 12 1 1
## 8 125 14 1 1
## 9 149 16 1 1
## 10 171 18 1 1
## 11 199 20 1 1
## 12 205 21 1 1
## 13 40 0 2 1
## 14 49 2 2 1
## 15 58 4 2 1
## 16 72 6 2 1
## 17 84 8 2 1
## 18 103 10 2 1
## 19 122 12 2 1
## 20 138 14 2 1
What is the structure of the data.frame?
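The function str() gives a compact overview of an object. The sketch below repeats the conversion to a plain data.frame so that it can be run on its own.

```r
# Convert to a plain data.frame, then display its structure:
# one line per column, with its type and first values
ChickWeight <- as.data.frame(ChickWeight)
str(ChickWeight)
```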
## 'data.frame': 578 obs. of 4 variables:
## $ weight: num 42 51 59 64 76 93 106 125 149 171 ...
## $ Time : num 0 2 4 6 8 10 12 14 16 18 ...
## $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
## $ Diet : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "formula")=Class 'formula' language weight ~ Time | Chick
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "outer")=Class 'formula' language ~Diet
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "labels")=List of 2
## ..$ x: chr "Time"
## ..$ y: chr "Body weight"
## - attr(*, "units")=List of 2
## ..$ x: chr "(days)"
## ..$ y: chr "(gm)"
With the $ operator we can explore the columns
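For example, a sketch that asks for the class of the weight column (the name weight_class is only illustrative):

```r
# data$column extracts a single column as a vector
weight_class <- class(ChickWeight$weight)
print(weight_class)
```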
## [1] "numeric"
We can also check the dimensions of the table. For example, how many rows does it have?
## [1] 578
How many columns?
## [1] 4
The names of columns
## [1] "weight" "Time" "Chick" "Diet"
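The three results above can be reproduced with the calls sketched below. The variable names are only introduced for illustration, and dim() is added as a shortcut that returns rows and columns together.

```r
n_rows <- nrow(ChickWeight)         # number of rows
n_cols <- ncol(ChickWeight)         # number of columns
col_names <- colnames(ChickWeight)  # column names as a character vector

print(n_rows)
print(n_cols)
print(col_names)
print(dim(ChickWeight))             # rows and columns in one call
```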
With the [] we can access the individual elements
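A minimal sketch, assuming the result shown next comes from indexing the vector of column names:

```r
# Square brackets pick elements by position; here the third column name
third_name <- colnames(ChickWeight)[3]
print(third_name)
```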
## [1] "Chick"
We can see the levels of a factor
## [1] "1" "2" "3"
What is the difference if we just print the column?
## [1] 1 1 1
## Levels: 1 2 3 4
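Both results above can be reproduced with the sketch below, assuming the first three elements were requested in each case. levels() returns the labels of the factor as character strings, while printing the column shows the stored values followed by the complete set of levels.

```r
# The first three level labels of the Diet factor
diet_levels <- levels(ChickWeight$Diet)[1:3]
print(diet_levels)

# The first three values of the Diet column; printing also lists all levels
first_diets <- ChickWeight$Diet[1:3]
print(first_diets)
```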
Can we see the levels of a numeric vector? This is a reminder that the data type is important.
levels(ChickWeight$weight) # returns NULL: a numeric vector has no levels
We can now get some basic statistics:
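The three outputs that follow are consistent with the calls sketched here: the mean of one column, the summary of one column, and the summary of the whole data frame. The variable name mean_weight is only illustrative.

```r
# Mean of a single numeric column
mean_weight <- mean(ChickWeight$weight)
print(mean_weight)

# Minimum, quartiles, median, mean and maximum of one column
print(summary(ChickWeight$weight))

# The same idea applied to every column; factors are summarized as counts
print(summary(ChickWeight))
```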
## [1] 121.8183
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.0 63.0 103.0 121.8 163.8 373.0
## weight Time Chick Diet
## Min. : 35.0 Min. : 0.00 13 : 12 1:220
## 1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120
## Median :103.0 Median :10.00 20 : 12 3:120
## Mean :121.8 Mean :10.72 10 : 12 4:118
## 3rd Qu.:163.8 3rd Qu.:16.00 17 : 12
## Max. :373.0 Max. :21.00 19 : 12
## (Other):506
To see exactly what this function is doing, just go to the help page:
?summary
To save this summary table we can create an object with just the result of the summary
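A minimal sketch, where the object name chick_summary is an assumption made for this example:

```r
# Store the result of summary() in an object instead of only printing it
chick_summary <- summary(ChickWeight)
print(chick_summary)

# The stored object is a table, not a data.frame
print(class(chick_summary))
```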
## weight Time Chick Diet
## Min. : 35.0 Min. : 0.00 13 : 12 1:220
## 1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120
## Median :103.0 Median :10.00 20 : 12 3:120
## Mean :121.8 Mean :10.72 10 : 12 4:118
## 3rd Qu.:163.8 3rd Qu.:16.00 17 : 12
## Max. :373.0 Max. :21.00 19 : 12
## (Other):506
## [1] "table"
We can change the data type and assign the result to a different object.
This is not that useful, as you can see if you inspect the data using View(chick_sumary_df). The converted object has a complicated format, so it is better to just save the table.
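The conversion and the saving step could look like the sketch below. The name chick_summary and the temporary output file are assumptions for this example, and chick_sumary_df follows the spelling used in the text above.

```r
chick_summary <- summary(ChickWeight)

# Convert the summary table to a data.frame (long format, one row per table cell)
chick_sumary_df <- as.data.frame(chick_summary)
print(head(chick_sumary_df))

# Save the summary table itself as a tab-delimited text file
out_file <- tempfile(fileext = ".txt")  # replace with your own path
write.table(chick_summary, file = out_file, sep = "\t", quote = FALSE)
```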
We will see other ways to save data in R in the future chapters.
You can see more details using: ?write.table
2.5 Subsetting
Subsetting means extracting part of the data. There are many different ways to do this. One important notion for tables and data frames is that dimensions go as follows:
data[row,column]
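The three results that follow are consistent with the selections sketched here (the variable names are only illustrative): one row with several columns, several rows of one column, and a single cell.

```r
# Row 1, columns 1 to 3: the result is still a data.frame
first_row <- ChickWeight[1, 1:3]
print(first_row)

# Rows 1 to 3 of column 1: the result is a plain vector
first_weights <- ChickWeight[1:3, 1]
print(first_weights)

# Row 1, column 1: a single value
first_cell <- ChickWeight[1, 1]
print(first_cell)
```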
## weight Time Chick
## 1 42 0 1
## [1] 42 51 59
## [1] 42
For example, suppose we only want the data from the chickens on diet 4:
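A sketch of this selection, wrapped in head() on the assumption that only the first rows are shown:

```r
# Keep the rows where Diet equals 4, and all columns (nothing after the comma)
diet4 <- ChickWeight[ChickWeight$Diet == 4, ]
print(head(diet4))
```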
## weight Time Chick Diet
## 461 42 0 41 4
## 462 51 2 41 4
## 463 66 4 41 4
## 464 85 6 41 4
## 465 103 8 41 4
## 466 124 10 41 4
Why == and not =?
Remember that in R, = is an assignment, like <-, while == is a comparison.
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
Let's explore the class:
## [1] "logical"
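The two results above, the logical vector and its class, can be reproduced with a sketch like this one (is_diet4 is a name introduced only for illustration):

```r
# The comparison returns one TRUE or FALSE per row
is_diet4 <- ChickWeight$Diet == 4
print(head(is_diet4))

# The result is a logical vector
print(class(is_diet4))
```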
So, when we do ChickWeight[ChickWeight$Diet==4,], R is just showing the rows of ChickWeight for which ChickWeight$Diet==4 is TRUE
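The function which() makes this explicit by returning the positions of the TRUE values. A sketch, assuming the next two outputs come from calls like these:

```r
# Positions (row numbers) where the condition is TRUE
diet4_rows <- which(ChickWeight$Diet == 4)
print(head(diet4_rows))

# Using those positions as row indices gives the same subset as the logical vector
print(head(ChickWeight[diet4_rows, ]))
```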
## [1] 461 462 463 464 465 466
## weight Time Chick Diet
## 461 42 0 41 4
## 462 51 2 41 4
## 463 66 4 41 4
## 464 85 6 41 4
## 465 103 8 41 4
## 466 124 10 41 4
To combine several conditions, we can use AND (&).
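A sketch, assuming the rows shown next were selected with diet 4 and a time later than day 6:

```r
# Both conditions must be TRUE for a row to be kept
diet4_late <- ChickWeight[ChickWeight$Diet == 4 & ChickWeight$Time > 6, ]
print(head(diet4_late))
```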
## weight Time Chick Diet
## 465 103 8 41 4
## 466 124 10 41 4
## 467 155 12 41 4
## 468 153 14 41 4
## 469 175 16 41 4
## 470 184 18 41 4
Another option is OR (|).
Remember that the order of evaluation matters. R evaluates & before |, so
condition-A & condition-B | condition-C
is read as (condition-A & condition-B) | condition-C, which is not the same as
condition-A & (condition-B | condition-C)
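A small sketch of the difference, using hypothetical logical values. The last call assumes that the table shown next was produced with the three-condition filter used later in this section.

```r
# Hypothetical values chosen so that the two groupings disagree
cond_a <- FALSE
cond_b <- TRUE
cond_c <- TRUE

no_parentheses   <- cond_a & cond_b | cond_c    # evaluated as (cond_a & cond_b) | cond_c
with_parentheses <- cond_a & (cond_b | cond_c)
print(no_parentheses)
print(with_parentheses)

# Three conditions joined with AND: diet 4, after day 6 and before day 20
time_window <- ChickWeight[ChickWeight$Diet == 4 &
                           ChickWeight$Time > 6 &
                           ChickWeight$Time < 20, ]
print(head(time_window))
```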
## weight Time Chick Diet
## 465 103 8 41 4
## 466 124 10 41 4
## 467 155 12 41 4
## 468 153 14 41 4
## 469 175 16 41 4
## 470 184 18 41 4
And if we just want the weights of these…
ChickWeight$weight[ChickWeight$Diet==4 & ChickWeight$Time>6 & ChickWeight$Time<20,]
Why does this give an error?
Because we only have one dimension now, not two. ChickWeight$weight is a one-dimensional object, so we have to use [ ], not [ , ].
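The corrected call drops the comma. In this sketch the condition is stored first, in a variable named keep (a name introduced only for readability), and head() is used on the assumption that only the first values are shown.

```r
# One logical value per row of the data frame
keep <- ChickWeight$Diet == 4 & ChickWeight$Time > 6 & ChickWeight$Time < 20

# A vector has a single dimension, so there is no comma inside the brackets
selected_weights <- ChickWeight$weight[keep]
print(head(selected_weights))
```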
## [1] 103 124 155 153 175 184
2.6 Activity:
This activity integrates knowledge from the previous chapter.
1. Remove the first and last row of the ChickWeight data frame
2. Create a vector with the second column from the data frame