Chapter 7 Creating a Subset of Your Data
From the data sets you’ve worked with so far, you can tell that they largely have these structural characteristics:
- Each variable has its own column.
- Each observation has its own row.
- Each value has its own cell.
For specific goals of analyses, We often need to create a subset of the data by selecting columns (variables), or rows (observations), or both.
7.1 Selecting Columns (Variables)
For your group project, if you know your analyses will be only limited to a few variables, you can create a smaller data set that contains only these variables. It is not a requirement to do so, but it can relieve the heavy processing burden of working with a large dataset on R. You will also have a cleaner view of the variables you are working with.
There are several ways to select columns:
- Use
[ ]
to index what are the columns you would like to select.
Generic: newdata <- fulldata[, c("var1", "var2", "var3", ...)]
For the week
data frame we created earlier, if we only want to keep two variables, day
and temp
, we can use the following code. Note the ,
in the command.
<- week[, c("day", "temp")]
week_temp week_temp
- Use
subset()
command withselect =
argument that specifies the variables to be included.
Generic: newdata <- subset(fulldata, select = c(var1, var2, var3, ...))
<- subset(week, select = c(day, temp))
week_temp week_temp
- Use
select()
command, which is part of thedplyr
, which is included in thetidyverse
package. Generic:newdata <- select(fulldata, var1, var2, var3, ...)
library(tidyverse) # install the package first if you have not already done so
<- select(week, day, temp) week_temp
The above three methods should give you the same data frame:
week_temp
day | temp |
---|---|
Sun | 32.3 |
Mon | 38.7 |
Tue | NA |
Wed | 40.1 |
Thur | 37.6 |
Fri | 33.5 |
Sat | 31.7 |
7.2 Selecting Rows (Observations)
Using the covid
data, let’s say we are only interested in participants whose attitude toward the vaccine fell on the favorable side of our response scale (i.e., above the mid-point on the 1-5 point scale). We can create a subset of respondents that meet this criteria.
Let’s call it covid_fav
. Below are three distinct ways to select these rows.
- Use
[ ]
to specify the logical conditions.
Generic: newdata <- fulldata[variable meeting conditions, ]
Applying this to the covid
data frame:
<- covid[covid$att > 3, ] # row selection based on logical condition, keep all columns
covid_fav covid_fav
- Use the
subset()
command.
Generic: newdata <- subset(fulldata, variable meeting certain conditions)
Applied to the covid
data frame:
<- subset(covid, att > 3)
covid_fav covid_fav
- Use the
filter()
command that is part of thetidyverse
anddplyr
packages.
Generic: newdata <- filter(fulldata, variables meeting certain conditions)
Applied to the covid
data frame:
<- filter(covid, att > 3)
covid_fav covid_fav