How To Filter Multiple Values In R

Subset Data Frame Rows in R

This tutorial describes how to subset or extract data frame rows based on certain criteria.

In this tutorial, you will learn the following R functions from the dplyr package:

slice(): Extract rows by position
filter(): Extract rows that meet a certain logical criterion. For example, iris %>% filter(Sepal.Length > 6).
filter_all(), filter_if() and filter_at(): Filter rows within a selection of variables. These functions replicate the logical criteria over all variables or a selection of variables.
sample_n(): Randomly select n rows
sample_frac(): Randomly select a fraction of rows
top_n(): Select top n rows ordered by a variable

We will also show you how to remove rows with missing values in a given column.

Contents:

Required packages
Demo dataset
Extract rows by position
Filter rows by logical criteria
- Logical comparisons
- Extract rows based on logical criteria
Filter rows within a selection of variables
Remove missing values
Select random rows from a data frame
Select top n rows ordered by a variable
Summary

Required packages

Load the tidyverse packages, which include dplyr:

                  library(tidyverse)

Demo dataset

We'll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.

                  my_data <- as_tibble(iris)
                  my_data

                  ## # A tibble: 150 x 5
                  ##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
                  ##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
                  ## 1          5.1         3.5          1.4         0.2 setosa
                  ## 2          4.9         3            1.4         0.2 setosa
                  ## 3          4.7         3.2          1.3         0.2 setosa
                  ## 4          4.6         3.1          1.5         0.2 setosa
                  ## 5          5           3.6          1.4         0.2 setosa
                  ## 6          5.4         3.9          1.7         0.4 setosa
                  ## # ... with 144 more rows
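The slice() function listed in the introduction extracts rows by position. The following is a minimal sketch on the iris tibble; the variable names first_rows and picked_rows are chosen here for illustration only.

```r
library(dplyr)

my_data <- as_tibble(iris)

# Rows 1 to 6, selected by their position in the data frame
first_rows <- my_data %>% slice(1:6)

# Any vector of positions works, here the first row of each species block
picked_rows <- my_data %>% slice(c(1, 51, 101))

first_rows
picked_rows
```

Unlike filter(), slice() looks only at row positions and never at the values stored in the columns.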

Filter rows by logical criteria

Key R function: filter() [dplyr package]. Used to filter rows that meet some logical criteria.

Before continuing, we introduce logical comparisons and operators, which are important to know for filtering data.

Logical comparisons

The "logical" comparison operators available in R are:

Logical comparisons
- <: for less than
- >: for greater than
- <=: for less than or equal to
- >=: for greater than or equal to
- ==: for equal to each other
- !=: not equal to each other
- %in%: group membership. For example, "value %in% c(2, 3)" means that value can take 2 or 3.
- is.na(): is NA
- !is.na(): is not NA.
Logical operators
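- |: means or. For example, value == 2 | value == 3 means that the value equals 2 or (|) 3. value %in% c(2, 3) is a shortcut equivalent to value == 2 | value == 3. Note that value == 2|3 does not work as intended: R evaluates it as (value == 2) | 3, which is always TRUE.
- &: means and. For example, sex == "female" & age > 25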

The most frequent error made by beginners in R is to use = instead of == when testing for equality. Remember that, when you are testing for equality, you should always use == (not =).
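The operators above can be combined inside filter(). The sketch below, which assumes the iris tibble from the demo section, keeps rows matching several values of one column with %in%, shows the spelled-out | form, and combines two conditions with &. The object names are illustrative.

```r
library(dplyr)

my_data <- as_tibble(iris)

# Keep rows whose Species is one of several values
two_species <- my_data %>% filter(Species %in% c("setosa", "virginica"))

# Same result with |: each side must be a complete comparison
two_species_or <- my_data %>% filter(Species == "setosa" | Species == "virginica")

# Combine two conditions with &
long_setosa <- my_data %>% filter(Species == "setosa" & Sepal.Length > 5.5)

# Pitfall: this is parsed as (x == 2) | 3, so every element is TRUE
always_true <- c(1, 5, 9) == 2 | 3
```

For more than two or three values, %in% is usually easier to read and maintain than a chain of == comparisons joined by |.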

Filter rows within a selection of variables

This section presents three functions - filter_all(), filter_if() and filter_at() - to filter rows within a selection of variables.

These functions replicate the logical criteria over all variables or a selection of variables.

Create a new demo data set from my_data by removing the grouping column "Species":

                  my_data2 <- my_data %>% select(-Species)

Select rows where all variables are greater than 2.4:

                  my_data2 %>% filter_all(all_vars(. > 2.4))

                  ## # A tibble: 3 x 4
                  ##   Sepal.Length Sepal.Width Petal.Length Petal.Width
                  ##          <dbl>       <dbl>        <dbl>       <dbl>
                  ## 1          6.3         3.3          6           2.5
                  ## 2          7.2         3.6          6.1         2.5
                  ## 3          6.7         3.3          5.7         2.5

Select rows where any of the variables is greater than 2.4:

                  my_data2 %>% filter_all(any_vars(.> 2.4))

                  ## # A tibble: 150 x 4
                  ##   Sepal.Length Sepal.Width Petal.Length Petal.Width
                  ##          <dbl>       <dbl>        <dbl>       <dbl>
                  ## 1          5.1         3.5          1.4         0.2
                  ## 2          4.9         3            1.4         0.2
                  ## 3          4.7         3.2          1.3         0.2
                  ## 4          4.6         3.1          1.5         0.2
                  ## 5          5           3.6          1.4         0.2
                  ## 6          5.4         3.9          1.7         0.4
                  ## # ... with 144 more rows

Vary the selection of columns on which to apply the filtering criteria. filter_at() takes a vars() specification. The following R code applies the filtering criteria to the columns Sepal.Length and Sepal.Width:

                  my_data2 %>% filter_at(vars(starts_with("Sepal")), any_vars(. > 2.4))

                  ## # A tibble: 150 x 4
                  ##   Sepal.Length Sepal.Width Petal.Length Petal.Width
                  ##          <dbl>       <dbl>        <dbl>       <dbl>
                  ## 1          5.1         3.5          1.4         0.2
                  ## 2          4.9         3            1.4         0.2
                  ## 3          4.7         3.2          1.3         0.2
                  ## 4          4.6         3.1          1.5         0.2
                  ## 5          5           3.6          1.4         0.2
                  ## 6          5.4         3.9          1.7         0.4
                  ## # ... with 144 more rows
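In recent dplyr releases (1.0.4 and later), the scoped verbs filter_all(), filter_if() and filter_at() are superseded by if_all() and if_any() used inside filter(). The sketch below is one possible way to express the same two filters; it assumes such a dplyr version is installed, and the result names are illustrative.

```r
library(dplyr)

my_data2 <- as_tibble(iris) %>% select(-Species)

# Rows where every column is greater than 2.4
all_above <- my_data2 %>%
  filter(if_all(everything(), ~ .x > 2.4))

# Rows where at least one "Sepal" column is greater than 2.4
any_sepal_above <- my_data2 %>%
  filter(if_any(starts_with("Sepal"), ~ .x > 2.4))
```

The column selection (everything(), starts_with()) plays the role of vars(), while the choice between if_all() and if_any() replaces all_vars() and any_vars().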

Remove missing values

We start by creating a data frame with missing values. In R, NA (Not Available) is used to represent missing values:

                  # Create a data frame with missing data
                  # (tibble() replaces the deprecated data_frame())
                  friends_data <- tibble(
                    name = c("A", "B", "C", "D"),
                    age = c(27, 25, 29, 26),
                    height = c(180, NA, NA, 169),
                    married = c("yes", "yes", "no", "no")
                  )
                  # Print
                  friends_data

                  ## # A tibble: 4 x 4
                  ##   name    age height married
                  ##   <chr> <dbl>  <dbl> <chr>
                  ## 1 A        27    180 yes
                  ## 2 B        25     NA yes
                  ## 3 C        29     NA no
                  ## 4 D        26    169 no

Extract rows where height is NA:

                  friends_data %>% filter(is.na(height))

                  ## # A tibble: 2 x 4
                  ##   name    age height married
                  ##   <chr> <dbl>  <dbl> <chr>
                  ## 1 B        25     NA yes
                  ## 2 C        29     NA no

Exclude (drop) rows where height is NA:

                  friends_data %>% filter(!is.na(height))

                  ## # A tibble: 2 x 4
                  ##   name    age height married
                  ##   <chr> <dbl>  <dbl> <chr>
                  ## 1 A        27    180 yes
                  ## 2 D        26    169 no

In the R code above, !is.na() means that "we don't want" NAs.
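As an alternative to filter(!is.na(...)), the tidyr package (loaded with the tidyverse) provides drop_na(). The following sketch rebuilds the same small data frame so that it runs on its own; the result names are illustrative.

```r
library(dplyr)
library(tidyr)

friends_data <- tibble(
  name = c("A", "B", "C", "D"),
  age = c(27, 25, 29, 26),
  height = c(180, NA, NA, 169),
  married = c("yes", "yes", "no", "no")
)

# Drop rows where height is missing
known_height <- friends_data %>% drop_na(height)

# With no column given, drop rows that have an NA in any column
complete_rows <- friends_data %>% drop_na()
```

drop_na(height) gives the same rows as filter(!is.na(height)); calling drop_na() without arguments is convenient when every column must be complete.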

Select random rows from a data frame

It's possible to select either n random rows with the function sample_n() or a random fraction of rows with sample_frac(). We first use the function set.seed() to initialize the random number generator. This is important so that users can reproduce the analysis.

                  set.seed(1234)
                  # Extract 5 random rows without replacement
                  my_data %>% sample_n(5, replace = FALSE)

                  # Extract 5% of rows, randomly without replacement
                  my_data %>% sample_frac(0.05, replace = FALSE)
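Since dplyr 1.0.0, sample_n() and sample_frac() are superseded by slice_sample(), which covers both cases through its n and prop arguments. The sketch below assumes such a dplyr version; the result names are illustrative.

```r
library(dplyr)

my_data <- as_tibble(iris)

set.seed(1234)

# 5 random rows, without replacement by default
five_rows <- my_data %>% slice_sample(n = 5)

# About 5% of the rows, without replacement
five_percent <- my_data %>% slice_sample(prop = 0.05)
```

Calling set.seed() with the same value before sampling makes the selected rows repeatable from one run to the next.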

Select top n rows ordered by a variable

Select the top 5 rows ordered by Sepal.Length

                  my_data %>% top_n(5, Sepal.Length)

                  ## # A tibble: 5 x 5
                  ##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
                  ##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
                  ## 1          7.7         3.8          6.7         2.2 virginica
                  ## 2          7.7         2.6          6.9         2.3 virginica
                  ## 3          7.7         2.8          6.7         2   virginica
                  ## 4          7.9         3.8          6.4         2   virginica
                  ## 5          7.7         3            6.1         2.3 virginica

Group by the column Species and select the top 5 of each group ordered by Sepal.Length:

                  my_data %>%
                    group_by(Species) %>%
                    top_n(5, Sepal.Length)

                  ## # A tibble: 16 x 5
                  ## # Groups:   Species [3]
                  ##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
                  ##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
                  ## 1          5.8         4            1.2         0.2 setosa
                  ## 2          5.7         4.4          1.5         0.4 setosa
                  ## 3          5.7         3.8          1.7         0.3 setosa
                  ## 4          5.5         4.2          1.4         0.2 setosa
                  ## 5          5.5         3.5          1.3         0.2 setosa
                  ## 6          7           3.2          4.7         1.4 versicolor
                  ## # ... with 10 more rows
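The grouped result above has 16 rows rather than 15 because tied values are all kept. In dplyr 1.0.0 and later, top_n() is superseded by slice_max() (and slice_min()), whose with_ties argument makes this behavior explicit. A minimal sketch, assuming such a dplyr version and using illustrative result names:

```r
library(dplyr)

my_data <- as_tibble(iris)

# Top 5 per group, keeping ties (the default), so a group may return more than 5 rows
top5_with_ties <- my_data %>%
  group_by(Species) %>%
  slice_max(Sepal.Length, n = 5)

# Exactly 5 rows per group: ties are cut off
top5_exact <- my_data %>%
  group_by(Species) %>%
  slice_max(Sepal.Length, n = 5, with_ties = FALSE)
```

slice_max() also returns the rows sorted from largest to smallest within each group, whereas top_n() keeps the original row order.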

Summary

In this tutorial, we introduced how to filter the rows of a data frame using the dplyr package:

Filter rows by logical criteria: my_data %>% filter(Sepal.Length > 7)
Select n random rows: my_data %>% sample_n(10)
Select a random fraction of rows: my_data %>% sample_frac(0.1)
Select top n rows by values: my_data %>% top_n(10, Sepal.Length)
