Tidy datasets with the dplyr package

A package from the tidyverse toolbox

Posted by Bin Ma on May 10, 2019 · 1 min read

Overview

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

  • mutate() adds new variables that are functions of existing variables.
  • select() picks variables based on their names.
  • filter() picks cases based on their values.
  • summarise() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows.
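
As a rough sketch of how these verbs chain together, the snippet below applies all five to a small hand-made data frame. The `trips` data and its column names are invented for illustration; only the dplyr functions themselves come from the package.

```r
library(dplyr)

# Hypothetical toy data, invented for this example
trips <- data.frame(
  dest     = c("BOS", "BOS", "ORD", "ORD", "LAX"),
  distance = c(187, 187, 733, 719, 2475),
  air_time = c(40, 45, 120, 115, 330)
)

result <- trips %>%
  mutate(speed = distance / air_time * 60) %>%  # new variable from existing ones (miles per hour)
  select(dest, distance, speed) %>%             # keep columns by name
  filter(distance < 1000) %>%                   # keep rows by value
  group_by(dest) %>%
  summarise(mean_speed = mean(speed)) %>%       # one summary row per destination
  arrange(desc(mean_speed))                     # highest mean speed first

print(result)
```

Each step takes a data frame and returns a data frame, which is what lets the verbs be piped one after another.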

dplyr is designed to abstract over how the data is stored. As well as working with local data frames, you can work with remote database tables using exactly the same R code. Install the dbplyr package, then read vignette("databases", package = "dbplyr").

mutate
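
A minimal sketch of mutate(), assuming a small hand-made data frame whose column names mirror the flights columns used later in this post. The `trips` data, and the derived column names `gain` and `speed`, are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data; values are invented
trips <- data.frame(
  dep_delay = c(2, 4, -6),
  arr_delay = c(11, 20, -25),
  distance  = c(1400, 1416, 1089),
  air_time  = c(227, 227, 160)
)

trips2 <- mutate(trips,
  gain  = dep_delay - arr_delay,    # delay made up (or lost) in the air
  speed = distance / air_time * 60  # miles per hour
)

print(trips2)
```

mutate() appends the new columns to the end of the data frame and leaves the existing ones untouched.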

select

  • select(flights, year, month, day)
  • select(flights, year:day)
  • select(flights, -(year:day))

There are a number of helper functions you can use within select():

  • starts_with("abc"): matches names that begin with "abc".
  • ends_with("xyz"): matches names that end with "xyz".
  • contains("ijk"): matches names that contain "ijk".
  • matches("(.)\\1"): selects variables whose names match a regular expression. This one matches any variable whose name contains a repeated character.
  • num_range("x", 1:3): matches x1, x2 and x3.
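
To see what each helper picks up, the sketch below runs them against a one-row data frame. The data frame `df` and its column names are hypothetical, chosen so that each helper matches something.

```r
library(dplyr)

# Hypothetical data frame; column names are chosen to exercise each helper
df <- data.frame(
  abc_total = 1, total_xyz = 2, mijkn = 3,
  x1 = 4, x2 = 5, x3 = 6, llama = 7
)

by_prefix <- names(select(df, starts_with("abc")))    # "abc_total"
by_suffix <- names(select(df, ends_with("xyz")))      # "total_xyz"
by_substr <- names(select(df, contains("ijk")))       # "mijkn"
by_regex  <- names(select(df, matches("(.)\\1")))     # "llama" (the doubled "l")
by_range  <- names(select(df, num_range("x", 1:3)))   # "x1" "x2" "x3"

print(list(by_prefix, by_suffix, by_substr, by_regex, by_range))
```

Note that the backslash in the regular expression has to be doubled inside an R string.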

filter

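library(dplyr)
library(nycflights13)  # assumed source of the flights data frame

# Keep rows with distance below 5; the result stays grouped by dest
delays <- flights %>%
  group_by(dest) %>%
  filter(distance < 5)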
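
In the pipeline above the condition is evaluated row by row, so group_by(dest) does not change which rows are kept. Grouping starts to matter once the condition uses a per-group value. A small sketch on hypothetical toy data (the `trips` data frame is invented for illustration):

```r
library(dplyr)

# Hypothetical toy data
trips <- data.frame(
  dest     = c("BOS", "BOS", "ORD", "LAX"),
  distance = c(187, 187, 733, 2475)
)

# Row-wise condition: each row is tested on its own value
short <- trips %>%
  filter(distance < 1000)

# Group-wise condition: n() is the number of rows in each dest group
busy <- trips %>%
  group_by(dest) %>%
  filter(n() > 1) %>%
  ungroup()

print(short)
print(busy)
```

Here `short` keeps the three trips under 1000 miles, while `busy` keeps only the destinations that appear more than once.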

summarise

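# Per-destination flight count, mean distance and mean arrival delay
delays <- flights %>%
  group_by(dest) %>%
  summarise(
    count = n(),
    dist  = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  # Drop destinations with 20 or fewer flights, and Honolulu (HNL)
  filter(count > 20, dest != "HNL")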

arrange

arrange(flights, year, month, day)
arrange(flights, desc(dep_delay))
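
One detail worth checking on a small example is how arrange() treats missing values: they are placed at the end whether the sort is ascending or descending. A minimal sketch, assuming a made-up `trips` data frame:

```r
library(dplyr)

# Hypothetical toy data with one missing delay
trips <- data.frame(
  dest      = c("BOS", "ORD", "LAX", "MIA"),
  dep_delay = c(5, NA, 30, -2)
)

asc <- arrange(trips, dep_delay)        # smallest delay first, NA last
dsc <- arrange(trips, desc(dep_delay))  # largest delay first, NA still last

print(asc)
print(dsc)
```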