class: ur-title, center, middle, title-slide .title[ # BST430 Lecture 02-A ] .subtitle[ ## Data types in R ] .author[ ### Seong-Hwan Jun, based on the course by Andrew McDavid ] .institute[ ### U of Rochester ] .date[ ### 2021-09-01 (updated: 2025-08-29) ] --- class: middle # Why should you care about data types? --- ## Example: Cat lovers A survey asked respondents their name and number of cats. The instructions said to enter the number of cats as a numerical value. ```r cat_lovers = read_csv("data/cat-lovers.csv") ``` ```r cat_lovers ``` ``` ## # A tibble: 60 × 3 ## name number_of_cats handedness ## <chr> <chr> <chr> ## 1 Bernice Warren 0 left ## 2 Woodrow Stone 0 left ## 3 Willie Bass 1 left ## 4 Tyrone Estrada 3 left ## # ℹ 56 more rows ``` --- ## Oh why won't you work?! ```r cat_lovers %>% summarise(mean_cats = mean(number_of_cats)) ``` ``` ## Warning: There was 1 warning in `summarise()`. ## ℹ In argument: `mean_cats = mean(number_of_cats)`. ## Caused by warning in `mean.default()`: ## ! argument is not numeric or logical: returning NA ``` ``` ## # A tibble: 1 × 1 ## mean_cats ## <dbl> ## 1 NA ``` --- ```r ?mean ``` <img src="l02/img/mean-help.png" width="75%" style="display: block; margin: auto;" /> --- ## Oh why won't you still work??!! ```r cat_lovers %>% summarise(mean_cats = mean(number_of_cats, na.rm = TRUE)) ``` ``` ## Warning: There was 1 warning in `summarise()`. ## ℹ In argument: `mean_cats = mean(number_of_cats, na.rm = TRUE)`. ## Caused by warning in `mean.default()`: ## ! argument is not numeric or logical: returning NA ``` ``` ## # A tibble: 1 × 1 ## mean_cats ## <dbl> ## 1 NA ``` --- ## Take a breath and look at your data .question[ What is the type of the `number_of_cats` variable? ] ```r glimpse(cat_lovers) ``` ``` ## Rows: 60 ## Columns: 3 ## $ name <chr> "Bernice Warren", "Woodrow Stone", "Will… ## $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", … ## $ handedness <chr> "left", "left", "left", "left", "left", … ``` -- .large[.center[💡!]] --- ## Let's take another look .small[
] --- ## Sometimes you might need to babysit your respondents .midi[ ```r cat_lovers %>% mutate(number_of_cats = case_when( name == "Ginger Clark" ~ 2, name == "Doug Bass" ~ 3, TRUE ~ as.numeric(number_of_cats) )) %>% summarise(mean_cats = mean(number_of_cats)) ``` ``` ## Warning: There was 1 warning in `mutate()`. ## ℹ In argument: `number_of_cats = case_when(...)`. ## Caused by warning: ## ! NAs introduced by coercion ``` ``` ## # A tibble: 1 × 1 ## mean_cats ## <dbl> ## 1 0.833 ``` ] --- ## Always you need to respect data types ```r cat_lovers %>% mutate( number_of_cats = case_when( name == "Ginger Clark" ~ "2", name == "Doug Bass" ~ "3", TRUE ~ number_of_cats ), number_of_cats = as.numeric(number_of_cats) ) %>% summarise(mean_cats = mean(number_of_cats)) ``` ``` ## # A tibble: 1 × 1 ## mean_cats ## <dbl> ## 1 0.833 ``` <!-- ??? --> <!-- This generates a warning for unknown (case_when specific) reasons --> --- ## Now that we know what we're doing... ```r *cat_lovers = cat_lovers %>% mutate( number_of_cats = case_when( name == "Ginger Clark" ~ "2", name == "Doug Bass" ~ "3", TRUE ~ number_of_cats ), number_of_cats = as.numeric(number_of_cats) ) ``` --- ## Moral of the story - If your data does not behave how you expect it to, type coercion upon reading in the data might be the reason. - Go in and investigate your data, apply the fix, *save your data*, live happily ever after. --- class: middle .hand[.light-blue[now that we have a good motivation for]] .hand[.light-blue[learning about data types in R]] <br> .large[ .hand[.light-blue[let's learn about data types in R!]] ] --- class: middle # Data types --- ## Atomic data types in R These are the fundamental building blocks (**atoms**) of all R vectors (and all data in R is a vector!) - **logical** - **integer** numbers - **double** (real) numbers - **complex** numbers - **character** - and some more, but we won't be discussing these yet. --- ## Logical & character .pull-left[ **logical** - boolean values `TRUE` and `FALSE` ```r typeof(TRUE) ``` ``` ## [1] "logical" ``` ] .pull-right[ **character** - character strings ```r typeof("hello") ``` ``` ## [1] "character" ``` ] --- ## Double & integer .pull-left[ **double** - floating point numerical values (default numerical type) ```r typeof(1.335) ``` ``` ## [1] "double" ``` ```r typeof(7) ``` ``` ## [1] "double" ``` ] .pull-right[ **integer** - integer numerical values (indicated with an `L`) ```r typeof(7L) ``` ``` ## [1] "integer" ``` ```r typeof(1:3) ``` ``` ## [1] "integer" ``` ] --- ## Complex numbers R also natively supports complex numbers, which are their own type: .pull-left[ ```r roots_of_unity = c(1+0i, -1+0i, 0+1i, 0-1i) typeof(roots_of_unity) ``` ``` ## [1] "complex" ``` ```r roots_of_unity^2 ``` ``` ## [1] 1+0i 1+0i -1+0i -1+0i ``` ] .pull-right[ ```r roots_of_unity^4 ``` ``` ## [1] 1+0i 1+0i 1+0i 1+0i ``` ```r Re(roots_of_unity) ``` ``` ## [1] 1 -1 0 0 ``` ```r Im(roots_of_unity) ``` ``` ## [1] 0 0 1 -1 ``` ] --- ## Lists **Lists** are 1d objects that can contain any combination of R objects .pull-left[ .small[ ```r mylist = list("A", 1:4, c(TRUE, FALSE)) mylist ``` ``` ## [[1]] ## [1] "A" ## ## [[2]] ## [1] 1 2 3 4 ## ## [[3]] ## [1] TRUE FALSE ``` ]] .pull-right[ ```r *str(mylist) ``` ``` ## List of 3 ## $ : chr "A" ## $ : int [1:4] 1 2 3 4 ## $ : logi [1:2] TRUE FALSE ``` ] --- # `str` is our friend .font130[ * shows the *str*ucture of the data * `str` is *nearly* a synonym for `dplyr::glimpse` * It shows detailed information on the composition of object. * You should reach for it first when you are trying to understand the low-level properties of an R object. ] --- ## Concatenation Vectors can be constructed and **concatenated** using the `c()` function. .pull-left[.small[ ```r digits = c(1, 2, 3) hello = c("Hello", "World!") greet = c(c("hi", "hello"), c("bye", "jello")) ``` ]] .pull-right[ ```r str(digits) ``` ``` ## num [1:3] 1 2 3 ``` ```r str(hello) ``` ``` ## chr [1:2] "Hello" "World!" ``` ```r str(greet) ``` ``` ## chr [1:4] "hi" "hello" "bye" "jello" ``` ] --- ## Vector length Get the number of entries in a vector with `length(x)`. .pull-left[ ```r x = c(1, 2, 3) y = character(2) empty_dbl = numeric(0) empty_chr = character(0) ``` ] .pull-right[ ```r length(x) ``` ``` ## [1] 3 ``` ```r length(y) ``` ``` ## [1] 2 ``` ```r length(empty_dbl) ``` ``` ## [1] 0 ``` ```r length(empty_chr) ``` ``` ## [1] 0 ``` ] --- ## Concatenation Lists can also be concatenated using the `c()` function. .pull-left[ ```r list1 = list(1, 2, 3) list2 = list( c("Hi!", "I'm a vector", "nested inside", "a list")) cat12 = c(list1, list2) ``` ] .pull-right[ ```r str(cat12) ``` ``` ## List of 4 ## $ : num 1 ## $ : num 2 ## $ : num 3 ## $ : chr [1:4] "Hi!" "I'm a vector" "nested inside" "a list" ``` ] Compare to `list(list1, list2)` which would nest the two lists rather than join them. --- ## length(c(x, y)) = length(x) + length(y) ```r length(list1) ``` ``` ## [1] 3 ``` ```r length(list2) ``` ``` ## [1] 1 ``` ```r length(c(list1, list2)) ``` ``` ## [1] 4 ``` .question[ What would `length(c(list1, list2))` be when `list1 = list(c(1, 2, 3))`? ] --- ## Named lists We often want to name the elements of a list (can also do this with vectors). This can make reading and accessing the list more straight forward. .small[ ```r myotherlist = list(A = "hello", B = 1:4, "knock knock" = "who's there?") str(myotherlist) ``` ``` ## List of 3 ## $ A : chr "hello" ## $ B : int [1:4] 1 2 3 4 ## $ knock knock: chr "who's there?" ``` ```r names(myotherlist) ``` ``` ## [1] "A" "B" "knock knock" ``` ```r myotherlist$B ``` ``` ## [1] 1 2 3 4 ``` ] --- ## unlisting Really, this should be called unnesting. Often, it is [a code smell](https://en.wikipedia.org/wiki/Code_smell) in R, and indicates an issue with how the code was designed ```r str(myotherlist) ``` ``` ## List of 3 ## $ A : chr "hello" ## $ B : int [1:4] 1 2 3 4 ## $ knock knock: chr "who's there?" ``` ```r unlist(myotherlist, recursive = TRUE) ``` ``` ## A B1 B2 B3 ## "hello" "1" "2" "3" ## B4 knock knock ## "4" "who's there?" ``` .question[What just happened to the integers in `myotherlist$B`?] --- ## Converting between types .hand[with intention...] .pull-left[ ```r x = 1:3 x ``` ``` ## [1] 1 2 3 ``` ```r typeof(x) ``` ``` ## [1] "integer" ``` ] -- .pull-right[ ```r y = as.character(x) y ``` ``` ## [1] "1" "2" "3" ``` ```r typeof(y) ``` ``` ## [1] "character" ``` ] --- ## Converting between types .hand[with intention...] .pull-left[ ```r x = c(TRUE, FALSE) x ``` ``` ## [1] TRUE FALSE ``` ```r typeof(x) ``` ``` ## [1] "logical" ``` ] -- .pull-right[ ```r y = as.numeric(x) y ``` ``` ## [1] 1 0 ``` ```r typeof(y) ``` ``` ## [1] "double" ``` ] --- ## Converting between types .hand[without intention...] R will happily convert between various types without complaint when different types of data are concatenated in a vector, and that's not always a great thing! .pull-left[ ```r c(1, "Hello") ``` ``` ## [1] "1" "Hello" ``` ```r c(FALSE, 3L) ``` ``` ## [1] 0 3 ``` ] -- .pull-right[ ```r c(1.2, 3L) ``` ``` ## [1] 1.2 3.0 ``` ```r c(2L, "two") ``` ``` ## [1] "2" "two" ``` ] --- ## Explicit vs. implicit coercion Let's give formal names to what we've seen so far: -- - **Explicit coercion** is when you call a function like `as.logical()`, `as.numeric()`, `as.integer()`, `as.double()`, or `as.character()` -- - **Implicit coercion** happens when you use a vector in a specific context that expects a certain type of vector <!-- --- --> --- class: middle # Special values --- ## Special values - `NA`: Not available - `NaN`: Not a number - `Inf`: Positive infinity - `-Inf`: Negative infinity -- .pull-left[ ```r pi / 0 ``` ``` ## [1] Inf ``` ```r 0 / 0 ``` ``` ## [1] NaN ``` ] .pull-right[ ```r 1/0 - 1/0 ``` ``` ## [1] NaN ``` ```r 1/0 + 1/0 ``` ``` ## [1] Inf ``` ] --- ## `NA`s are special ❄️s ```r x = c(1, 2, 3, 4, NA) ``` ```r mean(x) ``` ``` ## [1] NA ``` ```r mean(x, na.rm = TRUE) ``` ``` ## [1] 2.5 ``` ```r summary(x) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 1.00 1.75 2.50 2.50 3.25 4.00 1 ``` --- ## `NA`s are, by default, logical R uses `NA` to represent missing values in its data structures. ```r typeof(NA) ``` ``` ## [1] "logical" ``` .footnote[There are also `NA\_integer\_`, `NA\_real\_` and `NA\_character\_`, occasionally needed to avoid warnings or errors about unplanned coercions.] --- ## Mental model for `NA`s - Unlike `NaN`, `NA`s are genuinely unknown values - But that doesn't mean they can't function in a logical way - Let's think about why `NA`s are logical... -- .question[ Why do the following give different answers? ] .pull-left[ ```r # TRUE or NA TRUE | NA ``` ``` ## [1] TRUE ``` ] .pull-right[ ```r # FALSE or NA FALSE | NA ``` ``` ## [1] NA ``` ] `\(\rightarrow\)` See next slide for answers... --- - `NA` is unknown, so it could be `TRUE` or `FALSE` .pull-left[ .midi[ - `TRUE | NA` ```r TRUE | TRUE # if NA was TRUE ``` ``` ## [1] TRUE ``` ```r TRUE | FALSE # if NA was FALSE ``` ``` ## [1] TRUE ``` ] ] .pull-right[ .midi[ - `FALSE | NA` ```r FALSE | TRUE # if NA was TRUE ``` ``` ## [1] TRUE ``` ```r FALSE | FALSE # if NA was FALSE ``` ``` ## [1] FALSE ``` ] ] - Doesn't make sense for mathematical operations - Makes sense in the context of missing data --- ## Vectors vs. lists .pull-left[ .small[ ```r x = c(8,4,7) ``` ] .small[ ```r x[1] ``` ``` ## [1] 8 ``` ] .small[ ```r x[[1]] ``` ``` ## [1] 8 ``` ] ] -- .pull-right[ .small[ ```r y = list(8,4,7) ``` ] .small[ ```r y[2] ``` ``` ## [[1]] ## [1] 4 ``` ] .small[ ```r y[[2]] ``` ``` ## [1] 4 ``` ] ] -- <br> <!--**Note:** When using tidyverse code you'll rarely need to refer to elements using square brackets, but it's good to be aware of this syntax, especially since you might encounter it when searching for help online.--> --- ## Vectors vs lists: the punchline * Plain vectors must be flat and "atomic"--comprised of only a single base R type: `logical`, `integer`, `numeric`, `complex` or `character`. * Lists can be arbitrarily nested and contain any R object. * Both have length. * Both can be named. --- class: middle # R Classes and attributes --- ## types are elemental .pull-left[ **R elements** * ```r typeof(1) ``` ``` ## [1] "double" ``` * ```r typeof("A") ``` ``` ## [1] "character" ``` * ```r typeof(list(1)) ``` ``` ## [1] "list" ``` ] .pull-right[ **Meatspace elements** * hydrogen <img src = "l02/img/320px-Hydrogen_discharge_tube.jpg" width = "48%"> * carbon <img src = "l02/img/Graphite-and-diamond-with-scale.jpg" width = "48%"> * uranium <img src = "l02/img/600px-HEUraniumC.jpg" width = "48%"> ] ??? These types can either be atomic (integer, character, numeric, boolean) or generic (lists). --- class: code70 ## Attributes are add-on properties .pull-left[ **R attributes** * ```r attributes(1) ``` ``` ## NULL ``` * .scroll-box-10[ ```r attributes(starwars) ``` ``` ## $names ## [1] "name" "height" "mass" "hair_color" ## [5] "skin_color" "eye_color" "birth_year" "sex" ## [9] "gender" "homeworld" "species" "films" ## [13] "vehicles" "starships" ## ## $row.names ## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ## [21] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 ## [41] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ## [61] 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 ## [81] 81 82 83 84 85 86 87 ## ## $class ## [1] "tbl_df" "tbl" "data.frame" ``` ]] .pull-right[ **Meatspace attributes** * temperature * pH * pressure ] --- ## classes are compounds `class` is an `attribute` that signifies a **compound type** (made up of multiple elements or compounds). `class` is a special attribute. It affects what flavor of a function is applied. We'll discuss this in greater detail in a few weeks. --- class: code70 ## Classes .pull-left[ **R compounds** * ```r class(starwars) ``` ``` ## [1] "tbl_df" "tbl" "data.frame" ``` * ```r class(1) ``` ``` ## [1] "numeric" ``` * ```r class(ggplot(starwars, aes(x = weight))+ geom_histogram()) ``` ``` ## [1] "gg" "ggplot" ``` ] .pull-right[ **Meatspace compounds** * `\(H_2O\)` <img src = "l02/img/640px-Ice_Block.jpg" width = "50%"> * NaCl <img src = "l02/img/640px-Salt_Farmers.jpg" width = "50%"> * `\(C_8H_{10}N_4O_2\)` <img src = "l02/img/640px-A_small_cup_of_coffee.jpeg" width = "50%"> ] --- class: center, middle # Factors --- ## Factors Factor objects are how R stores data for categorical variables (fixed numbers of discrete values). ```r x = factor(c("BS", "MS", "PhD", "MS")) ``` ```r attributes(x) ``` ``` ## $levels ## [1] "BS" "MS" "PhD" ## ## $class ## [1] "factor" ``` ```r typeof(x) ``` ``` ## [1] "integer" ``` --- ## Read data in as character strings ```r str(cat_lovers) ``` ``` ## tibble [60 × 3] (S3: tbl_df/tbl/data.frame) ## $ name : chr [1:60] "Bernice Warren" "Woodrow Stone" "Willie Bass" "Tyrone Estrada" ... ## $ number_of_cats: num [1:60] 0 0 1 3 3 2 1 1 0 0 ... ## $ handedness : chr [1:60] "left" "left" "left" "left" ... ``` --- ## But coerce when plotting ```r ggplot(cat_lovers, mapping = aes(x = handedness)) + geom_bar() ``` <img src="l02a-data-types_files/figure-html/unnamed-chunk-63-1.png" width="60%" style="display: block; margin: auto;" /> --- # Acknowledgments This lecture contains materials adapted from [Mine Çetinkaya-Rundel and colleagues](https://www2.stat.duke.edu/courses/Spring18/Sta199/slides/lec-slides/05b-coding-style-data-types.html#1) and [data science in a box](https://rstudio-education.github.io/datascience-box/course-materials/slides/u2-d10-data-types/u2-d10-data-types.html#1)