Chapter 7 Creating a Subset of Your Data

From the data sets you’ve worked with so far, you can tell that they largely have these structural characteristics:

Each variable has its own column.
Each observation has its own row.
Each value has its own cell.

For a specific analysis, we often need to create a subset of the data by selecting columns (variables), rows (observations), or both.

7.1 Selecting Columns (Variables)

For your group project, if you know your analyses will be limited to a few variables, you can create a smaller data set that contains only those variables. Doing so is not required, but it can reduce the processing burden of working with a large data set in R. It also gives you a cleaner view of the variables you are working with.

There are several ways to select columns:

Use [ ] to index the columns you would like to select.

Generic: newdata <- fulldata[, c("var1", "var2", "var3", ...)]

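For the week data frame we created earlier, if we only want to keep two variables, day and temp, we can use the following code. Note the comma inside the brackets: leaving the position before it empty keeps all rows, and the vector after it names the columns to keep.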

week_temp <- week[, c("day", "temp")] 
week_temp
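
To try the bracket syntax on its own, here is a minimal sketch. It rebuilds a small stand-in for the week data frame from the day and temp values that appear in the week_temp output in this section; the rain column is a hypothetical extra variable, added only so that there is something to leave out. The sketch also shows what happens when a single column is selected this way.

```r
# Stand-in for the `week` data frame; `rain` is a made-up column for illustration
week <- data.frame(
  day  = c("Sun", "Mon", "Tue", "Wed", "Thur", "Fri", "Sat"),
  temp = c(32.3, 38.7, NA, 40.1, 37.6, 33.5, 31.7),
  rain = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Empty row position (before the comma) keeps every row
week_temp <- week[, c("day", "temp")]
print(names(week_temp))

# Selecting a single column this way returns a plain vector, not a data frame
temp_vec <- week[, "temp"]
print(is.data.frame(temp_vec))

# drop = FALSE keeps the one-column result as a data frame
temp_df <- week[, "temp", drop = FALSE]
print(is.data.frame(temp_df))
```

If you need a one-column data frame rather than a vector, drop = FALSE is the usual way to ask for it with [ ].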

Use the subset() command with the select argument, which specifies the variables to include.

Generic: newdata <- subset(fulldata, select = c(var1, var2, var3, ...))

week_temp <- subset(week, select = c(day, temp))
week_temp

Use the select() command from the dplyr package, which is included in the tidyverse.

Generic: newdata <- select(fulldata, var1, var2, var3, ...)

library(tidyverse) # install the package first if you have not already done so
week_temp <- select(week, day, temp)

The above three methods should give you the same data frame:

week_temp

day	temp
Sun	32.3
Mon	38.7
Tue	NA
Wed	40.1
Thur	37.6
Fri	33.5
Sat	31.7
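
If you would like to check that the methods agree rather than compare the printed output by eye, a sketch along these lines may help. It assumes a stand-in week data frame built from the values above, with a hypothetical rain column added for illustration. The dplyr comparison runs only if that package is installed.

```r
# Stand-in for the `week` data frame; `rain` is a made-up column for illustration
week <- data.frame(
  day  = c("Sun", "Mon", "Tue", "Wed", "Thur", "Fri", "Sat"),
  temp = c(32.3, 38.7, NA, 40.1, 37.6, 33.5, 31.7),
  rain = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

by_bracket <- week[, c("day", "temp")]
by_subset  <- subset(week, select = c(day, temp))

# all.equal() returns TRUE when the two data frames match
print(all.equal(by_bracket, by_subset))

# Compare against dplyr::select() only when dplyr is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  by_select <- dplyr::select(week, day, temp)
  print(all.equal(by_bracket, by_select))
}
```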

7.2 Selecting Rows (Observations)

Using the covid data, let’s say we are only interested in participants whose attitude toward the vaccine fell on the favorable side of our response scale (i.e., above the midpoint of the 1-5 scale, so an att value greater than 3). We can create a subset of respondents who meet this criterion.

Let’s call it covid_fav. Below are three distinct ways to select these rows.

Use [ ] to specify the logical conditions.

Generic: newdata <- fulldata[variable meeting conditions, ]

Applying this to the covid data frame:

covid_fav <- covid[covid$att > 3, ]  # row selection based on logical condition, keep all columns
covid_fav

Use the subset() command.

Generic: newdata <- subset(fulldata, variable meeting certain conditions)

Applied to the covid data frame:

covid_fav <- subset(covid, att > 3)
covid_fav
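
One difference between these first two approaches is worth knowing. When the condition evaluates to NA (for example, a respondent with a missing att value), [ ] returns a row filled with NA, whereas subset() leaves that row out. The sketch below uses a small made-up data frame, covid_demo, with hypothetical id and att values; it is not the real covid data.

```r
# Made-up data for illustration only; respondent 3 has a missing attitude score
covid_demo <- data.frame(id = 1:5, att = c(5, 2, NA, 4, 3))

# [ ] keeps a placeholder row of NAs where the condition is NA
fav_bracket <- covid_demo[covid_demo$att > 3, ]
print(nrow(fav_bracket))  # 3: two matching rows plus one all-NA row

# subset() treats an NA condition as "do not keep"
fav_subset <- subset(covid_demo, att > 3)
print(nrow(fav_subset))   # 2

# Wrapping the condition in which() makes [ ] behave the same way
fav_which <- covid_demo[which(covid_demo$att > 3), ]
print(nrow(fav_which))    # 2
```

If your variable has missing values, it may be safer to use subset() or to wrap the condition in which() when indexing with [ ].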

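Use the filter() command from the dplyr package, which is part of the tidyverse. Load the tidyverse (or dplyr) first; without it, filter() refers to a different function in base R's stats package.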

Generic: newdata <- filter(fulldata, variables meeting certain conditions)

Applied to the covid data frame:

covid_fav <- filter(covid, att > 3)
covid_fav
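
Row and column selection can also be combined in a single step, which covers the "or both" case mentioned at the start of the chapter. The following is a minimal sketch using base R only. The covid_demo data frame and its id, age, and att columns are made up for illustration and do not reflect the actual covid data.

```r
# Made-up data for illustration only
covid_demo <- data.frame(
  id  = 1:5,
  age = c(34, 51, 27, 45, 62),
  att = c(5, 2, NA, 4, 3)
)

# subset(): the second argument picks rows, select = picks columns
fav_small <- subset(covid_demo, att > 3, select = c(id, att))
print(fav_small)

# [ ]: rows before the comma, columns after it
# which() drops rows where the condition is NA
fav_small2 <- covid_demo[which(covid_demo$att > 3), c("id", "att")]
print(fav_small2)
```

With dplyr loaded, a comparable result could be obtained by applying filter() first and then select() to its output.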