Functional programming with purrr
Mapping: distribute a large-scale data problem into smaller pieces and assign each piece to a different computing node.
Reduce: combine the results from each node into a final result.
Example: count the frequency of words in a large corpus.
Each node counts the frequency of words in its assigned piece of the corpus (map).
Then, the counts from each node are combined to get the total counts (reduce).
Hadoop and Spark are widely used platforms for distributed computing that implement the MapReduce paradigm.
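The word-count example above can be sketched on a single machine, with list elements standing in for computing nodes. This is only an illustration of the map and reduce steps, not how Hadoop or Spark are implemented; the corpus `pieces` and the helper `merge_counts` are hypothetical names introduced here.

```r
library(purrr)

# Hypothetical corpus, already split into pieces (one piece per "node").
pieces <- list(
  c("the", "cat", "sat"),
  c("the", "dog", "sat", "down"),
  c("the", "end")
)

# Map: each "node" counts the words in its own piece.
partial_counts <- map(pieces, table)

# Reduce: merge two sets of counts by summing the counts of matching words.
merge_counts <- function(a, b) {
  words <- union(names(a), names(b))
  total <- setNames(numeric(length(words)), words)
  total[names(a)] <- total[names(a)] + as.numeric(a)
  total[names(b)] <- total[names(b)] + as.numeric(b)
  total
}

word_counts <- reduce(partial_counts, merge_counts)
word_counts[["the"]]
```

Because `merge_counts` takes two partial results and returns one of the same shape, it can be applied pairwise in any grouping, which is what makes the reduce step easy to distribute.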
map is a verb for applying a function to each element of a list or vector.
base::apply is a basic example of map that applies a function to the margins of an array or matrix.
The base::lapply function returns lists (hence the “l” in lapply).
The purrr package from tidyverse provides a more consistent and user-friendly set of map functions.
purrr::map
[[1]]
[1] 1
[[2]]
[1] 25
[[3]]
[1] 49
[[4]]
[1] 81
For a four-element input x, map(x, f) is equivalent to list(f(x[[1]]), f(x[[2]]), f(x[[3]]), f(x[[4]])).
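A minimal sketch of this equivalence follows. The input `x` and the function `f` are assumptions chosen for illustration (squaring 1, 5, 7, 9 happens to give the values printed above); the original call is not shown on the slide.

```r
library(purrr)

x <- c(1, 5, 7, 9)        # hypothetical input
f <- function(v) v^2      # hypothetical function

via_map <- map(x, f)
by_hand <- list(f(x[[1]]), f(x[[2]]), f(x[[3]]), f(x[[4]]))

identical(via_map, by_hand)        # map builds the same list, element by element
identical(via_map, lapply(x, f))   # base::lapply gives the same result here
```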
purrr::map
ADVR Ch. 9
purrr::map
Typed variants of map return an atomic vector…
map_lgl: of logicals
map_int: of integers
map_dbl: of doubles
map_chr: of characters
purrr provides shortcuts for anonymous functions:
[[1]]
[1] -0.3282237 0.7277602 -1.3256443 0.9296780 -0.2383811
[[2]]
[1] 2.079525 2.261968 4.012060 1.308766 2.579172
[[3]]
[1] 2.6821515 0.4448052 3.0792411 1.9976932 2.7107138
[[4]]
[1] 2.681011 4.252758 2.464766 5.616796 4.489707
[[5]]
[1] 4.439455 2.946883 4.587113 6.491458 4.724296
map will pass along any additional arguments (...) to the function .f.
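The following sketch puts the typed variants, the anonymous-function shortcut, and argument passing side by side. The data in `x` is hypothetical.

```r
library(purrr)

x <- list(c(1, 2, NA), c(4, 5, 6))   # hypothetical data

# Typed variant: returns a double vector instead of a list.
# na.rm = TRUE is not used by map_dbl itself; it is passed along to mean().
means <- map_dbl(x, mean, na.rm = TRUE)

# The same computation with an explicit anonymous function
# and with purrr's formula shortcut (.x is the current element).
means_fn      <- map_dbl(x, function(v) mean(v, na.rm = TRUE))
means_formula <- map_dbl(x, ~ mean(.x, na.rm = TRUE))

# Other typed variants.
has_na  <- map_lgl(x, anyNA)
sizes   <- map_int(x, length)
classes <- map_chr(x, ~ class(.x))
```

Passing `na.rm = TRUE` through `...` keeps the call short; the anonymous-function forms make it explicit which argument varies per element.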
ADVR Ch. 9
map2: iterate over two inputs.
pmap: iterate over multiple inputs.
imap: iterate with an index.
There’s also:
- walk: no useful return value (the input is returned invisibly); just walk through the input.
- modify: output the same type as input.
walk also has walk2, pwalk, and iwalk; modify also has modify2 and imodify (purrr does not provide a pmodify).
Note: modify does not modify in place.
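A short sketch of both points, using a hypothetical data frame `df`: modify keeps the container type but returns a modified copy, and walk is called only for its side effect.

```r
library(purrr)

df <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))   # hypothetical data

# modify returns a data frame (same type as the input)...
doubled <- modify(df, ~ .x * 2)
is.data.frame(doubled)

# ...but df itself is untouched; assign the result back to keep it.
df$x

# walk is used for the side effect (printing here) and
# returns its input invisibly.
out <- walk(1:3, function(i) cat("item", i, "\n"))
```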
walk
These functions are mainly for side effects, for example:
write.csv
ggsave
imap
imap can be seen as map2 with the second input being an index or name of the items in the first input.
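A minimal sketch of that correspondence, with hypothetical inputs `scores` (named) and `letters3` (unnamed):

```r
library(purrr)

scores <- c(alice = 90, bob = 72)   # hypothetical named vector

# With a named input, .y is the name of each element.
via_imap <- imap_chr(scores, ~ paste0(.y, ": ", .x))
via_map2 <- map2_chr(scores, names(scores), ~ paste0(.y, ": ", .x))
identical(via_imap, via_map2)

# With an unnamed input, .y is the position of each element.
letters3 <- c("a", "b", "c")
idx_imap <- imap_chr(letters3, ~ paste0(.y, "=", .x))
idx_map2 <- map2_chr(letters3, seq_along(letters3), ~ paste0(.y, "=", .x))
```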
Replace imap call with map2.
Exercise 9.4.6 Q2 from ADVR.
Rewrite the following code to use iwalk() instead of walk2(). What are the advantages and disadvantages?
pmap
Ideal for working with data frames or lists of parameters.
params <- tibble::tribble(
  ~ n, ~ min, ~ max,
  1L, 0, 1,
  2L, 10, 100,
  3L, 100, 1000
)
pmap(params, runif)
[[1]]
[1] 0.2340386
[[2]]
[1] 81.52564 58.25198
[[3]]
[1] 136.9632 550.6070 642.7421
Note: column names match the argument names of runif.
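To illustrate that matching is by name rather than by position, here is a sketch with a plain named list whose elements are deliberately out of order relative to runif's arguments. The list `params_shuffled` is hypothetical.

```r
library(purrr)

# Element order (max, n, min) differs from runif(n, min, max);
# pmap still matches each element to the argument of the same name.
params_shuffled <- list(max = c(1, 100), n = c(1L, 2L), min = c(0, 10))

set.seed(1)
draws <- pmap(params_shuffled, runif)
lengths(draws)
```

If the list were unnamed, the elements would be matched to runif's arguments by position instead.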
reduce
reduce(1:4, f) is equivalent to f(f(f(1, 2), 3), 4).
The result of each call to f is passed as the first argument to the next call to f.
reduce
ADVR Ch. 9
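One way to see the nesting is to reduce with a function that builds a string describing its own call. The helper `show_call` is a hypothetical name used only for this sketch.

```r
library(purrr)

# Hypothetical helper: returns a text description of the call it received.
show_call <- function(acc, x) paste0("f(", acc, ", ", x, ")")

nested <- reduce(1:4, show_call)
nested
# "f(f(f(1, 2), 3), 4)"

# With a real binary function, the same nesting computes ((1 + 2) + 3) + 4.
total <- reduce(1:4, `+`)
total
# 10
```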
reduce
Write code to find the intersection of a list of vectors.
reduce
reduce with arguments
ADVR Ch. 9
accumulate
A variant of reduce that returns the intermediate results as well.
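A brief sketch comparing the two on the same input: the last element of the accumulate result is what reduce returns.

```r
library(purrr)

running <- accumulate(1:4, `+`)   # running totals: 1, 3, 6, 10
final   <- reduce(1:4, `+`)       # only the last total: 10

running[length(running)] == final
```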
Count the number of times a specific string appears in a corpus.
Practice using map2_dbl and accumulate.
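The predicate functionals described next can be tried on a small example first. The vector `x` and the predicate `big` are hypothetical.

```r
library(purrr)

x <- c(3, 8, 12, 5)          # hypothetical data
big <- function(v) v > 6     # hypothetical predicate

all_big   <- every(x, big)          # FALSE: 3 and 5 fail
any_big   <- some(x, big)           # TRUE: 8 and 12 pass
no_big    <- none(x, big)           # FALSE
first_big <- detect(x, big)         # 8
first_idx <- detect_index(x, big)   # 2
kept      <- keep(x, big)           # 8, 12
dropped   <- discard(x, big)        # 3, 5
zeroed    <- modify_if(x, big, ~ 0) # 3, 0, 0, 5
```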
every: all elements satisfy a condition.
some: at least one element satisfies a condition.
detect: return the first element that satisfies a condition.
detect_index: return the index of the first element that satisfies a condition.
none: no elements satisfy a condition.
keep: keep elements that satisfy a condition.
discard: discard elements that satisfy a condition.
modify_if: modify elements that satisfy a condition.
Function factories are functions that make functions.
The enclosing environment of the manufactured function is an execution environment of the function factory.
make_power <- function(n) {
  function(x) {
    x ^ n
  }
}
square <- make_power(2)
rlang::env_print(square)
<environment: 0x7f7dd44570f0>
Parent: <environment: global>
Bindings:
• n: <lazy>
It has a binding for n, which was passed in as an argument to make_power.
What does n: <lazy> mean?
What???
Because function arguments are evaluated lazily, n is evaluated when the manufactured function is first called, not when it is created.
So if the argument was a variable in the calling environment (in this case, global) and that variable’s value is changed before the first call, the function will use the new value.
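A sketch of this behaviour, and of the usual remedy of calling force() inside the factory. The variable `n_val` and the factory `make_power_forced` are hypothetical names for this illustration.

```r
make_power <- function(n) {
  function(x) x^n
}

n_val <- 2
square <- make_power(n_val)
n_val <- 3                 # changed before square() is first called
lazy_result <- square(2)   # n is evaluated now, so this is 2^3 = 8

# force() evaluates n when the factory runs, fixing its value.
make_power_forced <- function(n) {
  force(n)
  function(x) x^n
}

n_val <- 2
square2 <- make_power_forced(n_val)
n_val <- 3
forced_result <- square2(2)   # 2^2 = 4
```

Once n has been evaluated, its value is fixed, so later changes to `n_val` no longer affect `square`.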
Roughly speaking, we want a similar number of observations in each bin. In this case, the variability in the data should be used to determine the bin width for each group.
binwidth: The width of the bins. Can be specified as a numeric value or as a function that calculates width from unscaled x. Here, “unscaled x” refers to the original x values in the data, before application of any scale transformation. When specifying a function along with a grouping structure, the function will be called once per group.
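The code that follows uses binwidth_bins and df, whose definitions are not shown here. A minimal sketch, assuming the factory divides the range of each group into n equal bins (as in the ADVR version of this example) and that df holds normal draws with three different standard deviations; the simulated values will not reproduce the printed table exactly.

```r
# Assumed definition: a factory returning a function that computes
# the bin width needed to split the range of x into n bins.
binwidth_bins <- function(n) {
  force(n)
  function(x) (max(x) - min(x)) / n
}

# Assumed data: 100 draws for each of three standard deviations.
set.seed(1)
sd_vals <- c(1, 5, 15)
df <- data.frame(
  x  = rnorm(3 * 100, sd = sd_vals),
  sd = rep(sd_vals, 100)
)

f <- binwidth_bins(20)
widths <- tapply(df$x, df$sd, f)   # one bin width per group
widths
```

Because f is a function of x alone, it can be supplied wherever a per-group bin width function is accepted, such as the binwidth argument described above.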
library(dplyr)
f <- binwidth_bins(20)
df %>%
  group_by(sd) %>%
  select(x) %>%
  summarise(binwidth = f(x))
# A tibble: 3 × 2
     sd binwidth
  <dbl>    <dbl>
1     1    0.217
2     5    1.33
3    15    3.40
n = 20 indicates the total number of bins you want to divide the data into.
   user  system elapsed
  1.500   0.009   1.514
   user  system elapsed
  0.019   0.000   0.019
The values of the Fibonacci sequence for \(n \leq 30\) are cached after the first call, so subsequent calls are much faster.
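The code behind these timings is not shown, so the following is only a hand-rolled sketch of the caching idea using a function factory. The names `make_memo_fib` and `cache` are hypothetical, and the original may well have used a memoisation package instead.

```r
# Factory that returns a Fibonacci function with a private cache
# stored in the factory's execution environment.
make_memo_fib <- function() {
  cache <- c(1, 1)   # fib(1) and fib(2)
  fib <- function(n) {
    # Return the cached value if it has already been computed.
    if (n <= length(cache) && !is.na(cache[n])) {
      return(cache[n])
    }
    value <- fib(n - 1) + fib(n - 2)
    cache[n] <<- value   # store in the enclosing environment
    value
  }
  fib
}

fib <- make_memo_fib()
first_call  <- system.time(fib(30))   # fills the cache up to n = 30
second_call <- system.time(fib(30))   # answered directly from the cache
fib(30)
```

The cache lives in the enclosing environment of the manufactured function, so it persists between calls without being visible in the global environment.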