dplyr - Data Manipulation
dplyr is a 'grammar of data manipulation'. The key functions it exports are:
- mutate() - adds new variables that are functions of existing variables.
- select() - picks variables based on their names.
- filter() - picks cases based on their values.
- summarise() - reduces multiple variables to a single summary.
- arrange() - changes the ordering of rows.
In this section we'll be using the the 'Parking Bay Arrivals/Departures' data. This contains 21 information on parking bay arrivals and departures in Melbourne. It consists of 13,503,655 rows of 21 columns. This data has been imported into the melb_parking tibble.
filter()
Filter allows you to subset data based on values. The first argument is the name of the data frame. The rest of the arguments are logical expressions on variables within that data frame.
Let's filter out the durations were equal to 0:
> melb_parking %>% filter(DurationSeconds == 3600)
# A tibble: 116,733 x 21
ParkingEventId DeviceId ArrivalTime DepartureTime DurationSeconds StreetMarker SignPlateId
<int> <int> <chr> <chr> <int> <chr> <int>
1 9869 11785 17/12/2014 01:12:58 PM +0000 17/12/2014 02:12:49 PM +0000 3600 11802E 216
2 24113 8195 24/09/2014 07:30:00 PM +0000 24/09/2014 08:30:00 PM +0000 3600 287W 24
3 24142 8195 25/09/2014 07:30:00 PM +0000 25/09/2014 08:30:00 PM +0000 3600 287W 24
4 11472 2129 22/01/2014 01:41:06 PM +0000 22/01/2014 02:41:14 PM +0000 3600 C3104 145
# ... with 1.167e+05 more rows, and 14 more variables: Sign <chr>, Area <int>, AreaName <chr>, StreetId <int>, StreetName <chr>,
# `BetweenStreet1 Id` <int>, `BetweenStreet1 Description` <chr>, `BetweenStreet2 Id` <int>, `BetweenStreet2 Description` <chr>,
# SideOfStreet <int>, SideCode <chr>, SideName <chr>, BayID <int>, InViolation <chr>
If more than one argument is specified, they are treated as an and:
> melb_parking %>% filter(DurationSeconds > 60, DeviceId == 8844)
# A tibble: 3,881 x 21
ParkingEventId DeviceId ArrivalTime DepartureTime DurationSeconds StreetMarker SignPlateId
<int> <int> <chr> <chr> <int> <chr> <int>
1 13148 8844 07/08/2014 10:26:45 AM +0000 07/08/2014 10:40:11 AM +0000 840 2605S 82
2 7059 8844 02/01/2014 09:26:11 AM +0000 02/01/2014 10:28:31 AM +0000 3720 2605S 82
3 12159 8844 02/07/2014 09:05:37 AM +0000 02/07/2014 09:17:38 AM +0000 720 2605S 82
4 11400 8844 05/06/2014 06:51:36 PM +0000 05/06/2014 06:54:04 PM +0000 180 2605S 82
# ... with 3,877 more rows, and 14 more variables: Sign <chr>, Area <int>, AreaName <chr>, StreetId <int>, StreetName <chr>,
# `BetweenStreet1 Id` <int>, `BetweenStreet1 Description` <chr>, `BetweenStreet2 Id` <int>, `BetweenStreet2 Description` <chr>,
# SideOfStreet <int>, SideCode <chr>, SideName <chr>, BayID <int>, InViolation <chr>
The other normal logical operators can be used: !, &, | and xor(x,y):
> melb_parking %>% filter(DeviceId == 8844 | DeviceId == 24113)
# A tibble: 4,078 x 21
ParkingEventId DeviceId ArrivalTime DepartureTime DurationSeconds StreetMarker SignPlateId
<int> <int> <chr> <chr> <int> <chr> <int>
1 13148 8844 07/08/2014 10:26:45 AM +0000 07/08/2014 10:40:11 AM +0000 840 2605S 82
2 7059 8844 02/01/2014 09:26:11 AM +0000 02/01/2014 10:28:31 AM +0000 3720 2605S 82
3 12159 8844 02/07/2014 09:05:37 AM +0000 02/07/2014 09:17:38 AM +0000 720 2605S 82
4 11400 8844 05/06/2014 06:51:36 PM +0000 05/06/2014 06:54:04 PM +0000 180 2605S 82
...
NA values
By default, filter() only includes rows where the condition is TRUE. It excludes both FALSE and NA. If missing values should be preserved, they need to be asked for explicitly: filter(DeviceID == 8844 | is.na(DeviceID)).
arrange()
Arrange keeps all of the rows, but changes the order.
> melb_parking %>% arrange(DurationSeconds)
# A tibble: 13,503,655 x 21
ParkingEventId DeviceId ArrivalTime DepartureTime DurationSeconds StreetMarker SignPlateId
<int> <int> <chr> <chr> <int> <chr> <int>
1 9405 8844 29/03/2014 07:30:00 PM +0000 29/03/2014 07:30:15 PM +0000 0 2605S 27
2 17927 8549 11/04/2014 07:30:00 PM +0000 11/04/2014 07:30:35 PM +0000 0 437W 27
3 19585 8489 14/05/2014 07:30:00 PM +0000 14/05/2014 07:30:21 PM +0000 0 346E 24
By default it arranges in ascending order. desc() can be used to reverse this.
> melb_parking %>% arrange(desc(DurationSeconds))
# A tibble: 13,503,655 x 21
ParkingEventId DeviceId ArrivalTime DepartureTime DurationSeconds StreetMarker SignPlateId
<int> <int> <chr> <chr> <int> <chr> <int>
1 9471 11139 05/12/2014 06:30:00 PM +0000 02/04/2015 01:03:01 PM +0000 10175580 12446N 196
2 1363 12879 13/09/2014 12:30:00 PM +0000 15/09/2014 09:35:57 AM +0000 162300 5530E 206
...
NA values are always sorted at the end, whether ascending or descending.
If the .by_group boolean parameter is TRUE, it will sort first by grouping variable. Only applies to group_by() data frames.
select()
The select() functions filters the columns based on their names:
> melb_parking %>% select(DeviceId, StreetMarker)
# A tibble: 13,503,655 x 2
DeviceId StreetMarker
<int> <chr>
1 2129 C3104
2 2128 C3172
3 2128 C3172
4 11785 11802E
# ... with 1.35e+07 more rows
The colon allows the specification of a range:
> melb_parking %>% select(DeviceId, Sign:StreetName)
# A tibble: 13,503,655 x 6
DeviceId Sign Area AreaName StreetId StreetName
<int> <chr> <int> <chr> <int> <chr>
1 2129 2P MTR M-SAT 7:30-20:30 37 Supreme 894 LONSDALE STREET
2 2128 2P MTR M-SAT 7:30-20:30 20 County 894 LONSDALE STREET
3 2128 2P MTR M-SAT 7:30-20:30 20 County 894 LONSDALE STREET
4 11785 <NA> 7 Jolimont 869 LANSDOWNE STREET
# ... with 1.35e+07 more rows
The minus sign negates the match:
> melb_parking %>% select(-(Sign:StreetName))
# A tibble: 13,503,655 x 16
ParkingEventId DeviceId ArrivalTime DepartureTime DurationSeconds StreetMarker SignPlateId
<int> <int> <chr> <chr> <int> <chr> <int>
1 11582 2129 30/01/2014 08:52:24 AM +0000 30/01/2014 08:59:49 AM +0000 420 C3104 145
2 18963 2128 11/01/2014 01:29:10 PM +0000 11/01/2014 08:30:00 PM +0000 25260 C3172 145
3 18898 2128 08/01/2014 03:04:29 PM +0000 08/01/2014 03:06:42 PM +0000 120 C3172 145
4 9853 11785 16/12/2014 06:30:00 PM +0000 16/12/2014 08:09:16 PM +0000 5940 11802E 0
# ... with 1.35e+07 more rows, and 9 more variables
There are other helper functions that can be used:
starts_with()andends_with()- matches variables that begin or end with a string.contains()- matches variables that contain a string.matches()- selects variables that match a regular expression.- `num_range("x", 1:3) - this would match variables 'x1', 'x2' and 'x3'.
everything()- all variables.
rename()
The rename() function can be used to rename variables.
mutate()
The mutate() function adds new columns that are functions of the current columns. For example, lets generate DateTime objects for the Arrival and Departure times rather than character vectors. We'll also select only these newly created columns.
> melb_parking %>%
+ mutate(
+ Arrival = dmy_hms(ArrivalTime),
+ Departure = dmy_hms(DepartureTime)
+ ) %>% select(Arrival, Departure)
# A tibble: 13,503,655 x 2
Arrival Departure
<dttm> <dttm>
1 2014-01-30 08:52:24 2014-01-30 08:59:49
2 2014-01-11 13:29:10 2014-01-11 20:30:00
3 2014-01-08 15:04:29 2014-01-08 15:06:42
Instead of using select(), transmute() will only return the new variables.
Creation Functions
The key property for the creation function is that they must be vectorised: it takes a vector of values of length x, and returns a vector of lenth x. The arithmetic operators are all vectorised. Some other useful functions:
lead()orlag()refer to the leading (next) or lagging (previous) value.cumsum(),cumprod(),cummin()andcummax()provide cumulative functions.- Logical comparisons.
min_rank()which ranks the values least to most as 1st, 2nd, 3rd etc. Usedesc(x)to reverse this order.
summarise() & group_by()
The summarise() function collapses the data frame into a single row.
> melb_parking %>% summarise(mean_duration = mean(DurationSeconds, na.rm = TRUE))
# A tibble: 1 x 1
mean_duration
<dbl>
1 2536.94
The na.rm = TRUE is required because the aggregation functions obey the law of missing values: if there's any missing value in the input, the output will be a missing value.
This function is most useful if it's paired with the group_by() function. This changes the unit of analysis from the complete dataset to the individua groups. We then use arrange() to order these by highest average to lowest average.
> melb_parking %>%
+ group_by(StreetName) %>%
+ summarise(AvgDuration = mean(DurationSeconds, na.rm = TRUE)) %>%
+ arrange(desc(AvgDuration))
# A tibble: 106 x 2
StreetName AvgDuration
<chr> <dbl>
1 Lt DRYBURGH STREET SOUTH 14851.903
2 EADES PLACE 11006.816
3 HOWARD STREET 6926.569
4 PRINCESS STREET 6779.371
5 AUGUSTA AVENUE 6216.761
6 LEVEN AVENUE 6041.347
7 WATTLE ROAD 5996.189
8 CHETWYND STREET 5975.700
9 PEEL STREET 5745.130
10 DODDS STREET 5532.255
# ... with 96 more rows
It's reccommended to include a count so that we can determine if we're making assumptions based on a very small data set.
> melb_parking %>%
+ group_by(StreetName) %>%
+ summarise(AvgDuration = mean(DurationSeconds, na.rm = TRUE), n = n()) %>%
+ arrange(desc(AvgDuration))
# A tibble: 106 x 3
StreetName AvgDuration n
<chr> <dbl> <int>
1 Lt DRYBURGH STREET SOUTH 14851.903 2067
2 EADES PLACE 11006.816 24818
3 HOWARD STREET 6926.569 38675
4 PRINCESS STREET 6779.371 10043
5 AUGUSTA AVENUE 6216.761 3013
6 LEVEN AVENUE 6041.347 3236
7 WATTLE ROAD 5996.189 3149
8 CHETWYND STREET 5975.700 52375
9 PEEL STREET 5745.130 17645
10 DODDS STREET 5532.255 53434
# ... with 96 more rows
You can use logical subsetting within the summary().
> melb_parking %>%
+ group_by(StreetName) %>%
+ summarise(AvgDuration = mean(DurationSeconds[DurationSeconds < 6000], na.rm = TRUE), n = n()) %>%
+ arrange(desc(AvgDuration))
# A tibble: 106 x 3
StreetName AvgDuration n
<chr> <dbl> <int>
1 RODEN STREET 3600.000 6
2 EADES PLACE 2605.600 24818
3 PARLIAMENT PLACE 2523.985 33855
4 AUGUSTA AVENUE 2347.680 3013
5 GISBORNE STREET 2344.039 41903
6 COBDEN STREET 2338.563 93898
7 NICHOLSON STREET 2253.261 35257
8 CAPEL STREET 2197.245 110241
9 FARADAY STREET 2192.639 93693
10 WATTLE ROAD 2170.836 3149
# ... with 96 more rows
Other useful summary functions:
mean()andmedian()sd()(standard deviation),IQR()(inter-quartile range), andmad()(median absolute deviation) are measures of the spread of values.min(),quantile(x, probs),max()are measures of rank.- e.g
quantile(x, .25)will find a value that is greater than 25% of values but less than the remaining 75%.
- e.g
first(),nth(x, n), andlast()are positional. They also allow the setting of a default value if the position doesn't exit.An(),n_distinct()count the values.
There is also a separate count() verb that can be used. The a weight wt parameter can be added to the count() which multiplies each count by that value.
> melb_parking %>% count(StreetName)
# A tibble: 106 x 2
StreetName n
<chr> <int>
1 A'BECKETT STREET 124388
2 ALBERT STREET 278396
3 ANTHONY STREET 6410
4 AQUITANIA WAY 27535
5 AUGUSTA AVENUE 3013
6 AURORA LANE 30736
7 BALSTON STREET 13608
8 BATMANS HILL DRIVE 72718
9 BATMAN STREET 43866
10 BOND STREET 9658
# ... with 96 more rows