dplyr - Data Manipulation

dplyr is a 'grammar of data manipulation'. The key functions it exports are:

mutate() - adds new variables that are functions of existing variables.
select() - picks variables based on their names.
filter() - picks cases based on their values.
summarise() - reduces multiple variables to a single summary.
arrange() - changes the ordering of rows.

In this section we'll be using the the 'Parking Bay Arrivals/Departures' data. This contains 21 information on parking bay arrivals and departures in Melbourne. It consists of 13,503,655 rows of 21 columns. This data has been imported into the melb_parking tibble.

filter()

Filter allows you to subset data based on values. The first argument is the name of the data frame. The rest of the arguments are logical expressions on variables within that data frame.

Let's filter out the durations were equal to 0:

> melb_parking %>% filter(DurationSeconds == 3600)
# A tibble: 116,733 x 21
  ParkingEventId DeviceId                  ArrivalTime                DepartureTime DurationSeconds StreetMarker SignPlateId
           <int>    <int>                        <chr>                        <chr>           <int>        <chr>       <int>
1           9869    11785 17/12/2014 01:12:58 PM +0000 17/12/2014 02:12:49 PM +0000            3600       11802E         216
2          24113     8195 24/09/2014 07:30:00 PM +0000 24/09/2014 08:30:00 PM +0000            3600         287W          24
3          24142     8195 25/09/2014 07:30:00 PM +0000 25/09/2014 08:30:00 PM +0000            3600         287W          24
4          11472     2129 22/01/2014 01:41:06 PM +0000 22/01/2014 02:41:14 PM +0000            3600        C3104         145
# ... with 1.167e+05 more rows, and 14 more variables: Sign <chr>, Area <int>, AreaName <chr>, StreetId <int>, StreetName <chr>,
#   `BetweenStreet1 Id` <int>, `BetweenStreet1 Description` <chr>, `BetweenStreet2 Id` <int>, `BetweenStreet2 Description` <chr>,
#   SideOfStreet <int>, SideCode <chr>, SideName <chr>, BayID <int>, InViolation <chr>

If more than one argument is specified, they are treated as an and:

> melb_parking %>% filter(DurationSeconds > 60, DeviceId == 8844)
# A tibble: 3,881 x 21
  ParkingEventId DeviceId                  ArrivalTime                DepartureTime DurationSeconds StreetMarker SignPlateId
           <int>    <int>                        <chr>                        <chr>           <int>        <chr>       <int>
1          13148     8844 07/08/2014 10:26:45 AM +0000 07/08/2014 10:40:11 AM +0000             840        2605S          82
2           7059     8844 02/01/2014 09:26:11 AM +0000 02/01/2014 10:28:31 AM +0000            3720        2605S          82
3          12159     8844 02/07/2014 09:05:37 AM +0000 02/07/2014 09:17:38 AM +0000             720        2605S          82
4          11400     8844 05/06/2014 06:51:36 PM +0000 05/06/2014 06:54:04 PM +0000             180        2605S          82
# ... with 3,877 more rows, and 14 more variables: Sign <chr>, Area <int>, AreaName <chr>, StreetId <int>, StreetName <chr>,
#   `BetweenStreet1 Id` <int>, `BetweenStreet1 Description` <chr>, `BetweenStreet2 Id` <int>, `BetweenStreet2 Description` <chr>,
#   SideOfStreet <int>, SideCode <chr>, SideName <chr>, BayID <int>, InViolation <chr>

The other normal logical operators can be used: !, &, | and xor(x,y):

> melb_parking %>% filter(DeviceId == 8844 | DeviceId == 24113)
# A tibble: 4,078 x 21
  ParkingEventId DeviceId                  ArrivalTime                DepartureTime DurationSeconds StreetMarker SignPlateId
           <int>    <int>                        <chr>                        <chr>           <int>        <chr>       <int>
1          13148     8844 07/08/2014 10:26:45 AM +0000 07/08/2014 10:40:11 AM +0000             840        2605S          82
2           7059     8844 02/01/2014 09:26:11 AM +0000 02/01/2014 10:28:31 AM +0000            3720        2605S          82
3          12159     8844 02/07/2014 09:05:37 AM +0000 02/07/2014 09:17:38 AM +0000             720        2605S          82
4          11400     8844 05/06/2014 06:51:36 PM +0000 05/06/2014 06:54:04 PM +0000             180        2605S          82
...

NA values

By default, filter() only includes rows where the condition is TRUE. It excludes both FALSE and NA. If missing values should be preserved, they need to be asked for explicitly: filter(DeviceID == 8844 | is.na(DeviceID)).

arrange()

Arrange keeps all of the rows, but changes the order.

> melb_parking %>% arrange(DurationSeconds)
# A tibble: 13,503,655 x 21
  ParkingEventId DeviceId                  ArrivalTime                DepartureTime DurationSeconds StreetMarker SignPlateId
           <int>    <int>                        <chr>                        <chr>           <int>        <chr>       <int>
1           9405     8844 29/03/2014 07:30:00 PM +0000 29/03/2014 07:30:15 PM +0000               0        2605S          27
2          17927     8549 11/04/2014 07:30:00 PM +0000 11/04/2014 07:30:35 PM +0000               0         437W          27
3          19585     8489 14/05/2014 07:30:00 PM +0000 14/05/2014 07:30:21 PM +0000               0         346E          24

By default it arranges in ascending order. desc() can be used to reverse this.

> melb_parking %>% arrange(desc(DurationSeconds))
# A tibble: 13,503,655 x 21
  ParkingEventId DeviceId                  ArrivalTime                DepartureTime DurationSeconds StreetMarker SignPlateId
           <int>    <int>                        <chr>                        <chr>           <int>        <chr>       <int>
1           9471    11139 05/12/2014 06:30:00 PM +0000 02/04/2015 01:03:01 PM +0000        10175580       12446N         196
2           1363    12879 13/09/2014 12:30:00 PM +0000 15/09/2014 09:35:57 AM +0000          162300        5530E         206
...

NA values are always sorted at the end, whether ascending or descending.

If the .by_group boolean parameter is TRUE, it will sort first by grouping variable. Only applies to group_by() data frames.

select()

The select() functions filters the columns based on their names:

> melb_parking %>% select(DeviceId, StreetMarker)
# A tibble: 13,503,655 x 2
  DeviceId StreetMarker
     <int>        <chr>
1     2129        C3104
2     2128        C3172
3     2128        C3172
4    11785       11802E
# ... with 1.35e+07 more rows

The colon allows the specification of a range:

> melb_parking %>% select(DeviceId, Sign:StreetName)
# A tibble: 13,503,655 x 6
  DeviceId                    Sign  Area AreaName StreetId       StreetName
     <int>                   <chr> <int>    <chr>    <int>            <chr>
1     2129 2P MTR M-SAT 7:30-20:30    37  Supreme      894  LONSDALE STREET
2     2128 2P MTR M-SAT 7:30-20:30    20   County      894  LONSDALE STREET
3     2128 2P MTR M-SAT 7:30-20:30    20   County      894  LONSDALE STREET
4    11785                    <NA>     7 Jolimont      869 LANSDOWNE STREET
# ... with 1.35e+07 more rows

The minus sign negates the match:

> melb_parking %>% select(-(Sign:StreetName))
# A tibble: 13,503,655 x 16
  ParkingEventId DeviceId                  ArrivalTime                DepartureTime DurationSeconds StreetMarker SignPlateId
           <int>    <int>                        <chr>                        <chr>           <int>        <chr>       <int>
1          11582     2129 30/01/2014 08:52:24 AM +0000 30/01/2014 08:59:49 AM +0000             420        C3104         145
2          18963     2128 11/01/2014 01:29:10 PM +0000 11/01/2014 08:30:00 PM +0000           25260        C3172         145
3          18898     2128 08/01/2014 03:04:29 PM +0000 08/01/2014 03:06:42 PM +0000             120        C3172         145
4           9853    11785 16/12/2014 06:30:00 PM +0000 16/12/2014 08:09:16 PM +0000            5940       11802E           0
# ... with 1.35e+07 more rows, and 9 more variables

There are other helper functions that can be used:

starts_with() and ends_with() - matches variables that begin or end with a string.
contains() - matches variables that contain a string.
matches() - selects variables that match a regular expression.
`num_range("x", 1:3) - this would match variables 'x1', 'x2' and 'x3'.
everything() - all variables.

rename()

The rename() function can be used to rename variables.

mutate()

The mutate() function adds new columns that are functions of the current columns. For example, lets generate DateTime objects for the Arrival and Departure times rather than character vectors. We'll also select only these newly created columns.

> melb_parking %>%
+ mutate(
+ Arrival = dmy_hms(ArrivalTime),
+ Departure = dmy_hms(DepartureTime)
+ ) %>% select(Arrival, Departure)
# A tibble: 13,503,655 x 2
              Arrival           Departure
               <dttm>              <dttm>
1 2014-01-30 08:52:24 2014-01-30 08:59:49
2 2014-01-11 13:29:10 2014-01-11 20:30:00
3 2014-01-08 15:04:29 2014-01-08 15:06:42

Instead of using select(), transmute() will only return the new variables.

Creation Functions

The key property for the creation function is that they must be vectorised: it takes a vector of values of length x, and returns a vector of lenth x. The arithmetic operators are all vectorised. Some other useful functions:

lead() or lag() refer to the leading (next) or lagging (previous) value.
cumsum(), cumprod(), cummin() and cummax() provide cumulative functions.
Logical comparisons.
min_rank() which ranks the values least to most as 1st, 2nd, 3rd etc. Use desc(x) to reverse this order.

summarise() & group_by()

The summarise() function collapses the data frame into a single row.

> melb_parking %>% summarise(mean_duration = mean(DurationSeconds, na.rm = TRUE))
# A tibble: 1 x 1
  mean_duration
          <dbl>
1       2536.94

The na.rm = TRUE is required because the aggregation functions obey the law of missing values: if there's any missing value in the input, the output will be a missing value.

This function is most useful if it's paired with the group_by() function. This changes the unit of analysis from the complete dataset to the individua groups. We then use arrange() to order these by highest average to lowest average.

> melb_parking %>%
+  group_by(StreetName) %>%
+  summarise(AvgDuration = mean(DurationSeconds, na.rm = TRUE)) %>%
+  arrange(desc(AvgDuration))
# A tibble: 106 x 2
                 StreetName AvgDuration
                      <chr>       <dbl>
 1 Lt DRYBURGH STREET SOUTH   14851.903
 2              EADES PLACE   11006.816
 3            HOWARD STREET    6926.569
 4          PRINCESS STREET    6779.371
 5           AUGUSTA AVENUE    6216.761
 6             LEVEN AVENUE    6041.347
 7              WATTLE ROAD    5996.189
 8          CHETWYND STREET    5975.700
 9              PEEL STREET    5745.130
10             DODDS STREET    5532.255
# ... with 96 more rows

It's reccommended to include a count so that we can determine if we're making assumptions based on a very small data set.

> melb_parking %>%
+  group_by(StreetName) %>%
+  summarise(AvgDuration = mean(DurationSeconds, na.rm = TRUE), n = n()) %>%
+  arrange(desc(AvgDuration))
# A tibble: 106 x 3
                 StreetName AvgDuration     n
                      <chr>       <dbl> <int>
 1 Lt DRYBURGH STREET SOUTH   14851.903  2067
 2              EADES PLACE   11006.816 24818
 3            HOWARD STREET    6926.569 38675
 4          PRINCESS STREET    6779.371 10043
 5           AUGUSTA AVENUE    6216.761  3013
 6             LEVEN AVENUE    6041.347  3236
 7              WATTLE ROAD    5996.189  3149
 8          CHETWYND STREET    5975.700 52375
 9              PEEL STREET    5745.130 17645
10             DODDS STREET    5532.255 53434
# ... with 96 more rows

You can use logical subsetting within the summary().

> melb_parking %>%
+  group_by(StreetName) %>%
+  summarise(AvgDuration = mean(DurationSeconds[DurationSeconds < 6000], na.rm = TRUE), n = n()) %>%
+  arrange(desc(AvgDuration))
# A tibble: 106 x 3
         StreetName AvgDuration      n
              <chr>       <dbl>  <int>
 1     RODEN STREET    3600.000      6
 2      EADES PLACE    2605.600  24818
 3 PARLIAMENT PLACE    2523.985  33855
 4   AUGUSTA AVENUE    2347.680   3013
 5  GISBORNE STREET    2344.039  41903
 6    COBDEN STREET    2338.563  93898
 7 NICHOLSON STREET    2253.261  35257
 8     CAPEL STREET    2197.245 110241
 9   FARADAY STREET    2192.639  93693
10      WATTLE ROAD    2170.836   3149
# ... with 96 more rows

Other useful summary functions:

mean() and median()
sd() (standard deviation), IQR() (inter-quartile range), and mad() (median absolute deviation) are measures of the spread of values.
min(), quantile(x, probs), max() are measures of rank.
- e.g quantile(x, .25) will find a value that is greater than 25% of values but less than the remaining 75%.
first(), nth(x, n), and last() are positional. They also allow the setting of a default value if the position doesn't exit.A
n(), n_distinct() count the values.

There is also a separate count() verb that can be used. The a weight wt parameter can be added to the count() which multiplies each count by that value.

> melb_parking %>% count(StreetName)
# A tibble: 106 x 2
           StreetName      n
                <chr>  <int>
 1   A'BECKETT STREET 124388
 2      ALBERT STREET 278396
 3     ANTHONY STREET   6410
 4      AQUITANIA WAY  27535
 5     AUGUSTA AVENUE   3013
 6        AURORA LANE  30736
 7     BALSTON STREET  13608
 8 BATMANS HILL DRIVE  72718
 9      BATMAN STREET  43866
10        BOND STREET   9658
# ... with 96 more rows

One Table Verbs