class: ur-title, center, middle, title-slide

.title[
# BST430 Lecture 04
]
.subtitle[
## ggplot (i)
]
.author[
### Seong-Hwan Jun, based on the notes of Andrew McDavid and Tanzy Love
]
.institute[
### U of Rochester
]
.date[
### 2021-08-29 (updated: 2025-09-18)
]

---

## Agenda

- Exploratory data analysis
- Data visualization
- Visualizing Star Wars
- Aesthetics
- Faceting
- Identifying variables
- Visualizing numerical data
- Visualizing categorical data

---
class: center, middle

# Exploratory data analysis

---

## What is EDA?

- Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics.
- Often, this is visual. That's what we're focusing on today.
- But we might also calculate summary statistics and perform data wrangling/manipulation/transformation at (or before) this stage of the analysis.

---
class: center, middle

# Data visualization

---

## Data visualization

> *"The simple graph has brought more information to the data analyst’s mind than any other device." — John Tukey*

- Data visualization is the creation and study of the visual representation of data.
- There are many tools for visualizing data (R is one of them), and many approaches/systems within R for making data visualizations (**ggplot2** is one of them, and that's the one we're going to use).

---

## ggplot2, part of the tidyverse

- ggplot2 is a data visualization package

<!-- - To use ggplot2 functions, first load tidyverse -->

```r
library(tidyverse)
```

- In ggplot2 the structure of the code for plots can often be summarized as

```r
ggplot + geom_xxx
```

where geoms (geometric objects) describe the type of plot produced.
Example:

.small[
```r
ggplot(data = [dataset], mapping = aes(x = [x-variable], y = [y-variable])) +
   geom_xxx() +
   other options
```
]

---

## About ggplot2

- ggplot2 is the name of the package
- The `gg` in "ggplot2" stands for Grammar of Graphics
- Inspired by the book **The Grammar of Graphics** by Leland Wilkinson
- The main idea is to build plots in a structured manner by layering components.

---

## Layering

- Start with the data that you want to plot.
- Specify the aesthetics; these are typically variables (columns) in your data.
- Specify the geometric objects: points, lines, bars, boxplots, etc.
- Add scales, facets, themes, color scales, labels, etc.

---

## Layering

```r
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>,
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
```

- `ggplot()` is the main function in ggplot2
- Each geom supports a different set of mappings.
- For a full reference of the geoms and mappings, see http://ggplot2.tidyverse.org/
- Note that a `+` at the end of a line tells R that more components follow; the plot specification ends at the first line that does not end with `+`.

---

## Back in the days of base R graphics...

- Plots are built in an ad-hoc manner.
- No fixed data set or aesthetics: data to be plotted are passed in externally.
- Polishing a figure and generating multi-panel figures (faceted plots) take more work.
- Different plot types most often require different syntax and code (poor reproducibility).
- One of the lab questions will ask you to produce a plot using base R graphics and ggplot2.
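
---

## Base R vs. ggplot2: a sketch

A minimal sketch of the contrast described on the previous slide, not a definitive comparison. It assumes the built-in `mtcars` data (not otherwise used in this lecture) and its columns `wt`, `mpg`, and `cyl`; the object names `cyl_values`, `cars_subset`, and `p` are illustrative only.

```r
library(ggplot2)

# Assumption: the built-in `mtcars` data stand in for a real data set.
cyl_values <- sort(unique(mtcars$cyl))  # one panel per number of cylinders

# Base R: vectors are passed in directly and each panel is drawn by hand.
old_par <- par(mfrow = c(1, length(cyl_values)))
for (cyl_value in cyl_values) {
  cars_subset <- mtcars[mtcars$cyl == cyl_value, ]
  plot(cars_subset$wt, cars_subset$mpg,
       xlim = range(mtcars$wt), ylim = range(mtcars$mpg),
       xlab = "wt", ylab = "mpg",
       main = paste("cyl =", cyl_value))
}
par(old_par)  # restore the previous panel layout

# ggplot2: the same figure as data + aesthetics + geom + facet layers.
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl)
p
```

In the base R version the panel layout, shared axis limits, and titles are all managed manually; in the ggplot2 version they follow from the single `facet_wrap()` layer.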

---
class: center, middle

# Visualizing Star Wars

---

## Dataset terminology

.scroll-box-14[
```r
starwars
```

```
## # A tibble: 87 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
##  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
```
]

<div class="question">
What does each row represent? What does each column represent?
</div>

---

## Luke Skywalker

---

## What's in the Star Wars data?
Take a `glimpse` at the data:

.scroll-box-14[
```r
glimpse(starwars)
```

```
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
## $ films      <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
```
]

---

## What's in the Star Wars data?

Run the following **in the Console** to view the help

```r
?starwars
```

<div class="question">
How many rows and columns does this dataset have? What does each row represent? What does each column represent?
</div>

<div class="question">
Make a prediction: What relationship do you expect to see between height and mass?
</div>

---

## Danged computer, stop doing what I tell you to do.

<div class = "middle">

<div class="question">
What will happen if you enter this into the Console?
</div>

```r
glimpse(Starwars)
```

--

```r
glimpse(Starwars)
```

```
## Error: object 'Starwars' not found
```

--

.alert[R is case sensitive!]

</div>

---

## Mass vs. height

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point()
```

```
## Warning: Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).
```

<!-- -->

---

## What's that warning?

--

- Not all characters have height and mass information (hence 28 of them are not plotted)

```
## Warning: Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).
```

- Going forward I'll suppress the warning to save room on slides, but it's important to note it
- **Warnings** mean the function's author thinks something unexpected is happening; sometimes adding an explicit argument (e.g., `na.rm = TRUE` in `geom_point()`) makes them go away
- **Errors** stop R from running the code

---

## Mass vs. height

<div class="question">
How would you describe this relationship? What other variables would help us understand data points that don't follow the overall trend? Who is the not so tall but really chubby character?
</div>

<!-- -->

---

## Jabba!

```r
starwars$name[starwars$mass > 1000 & !is.na(starwars$mass)]
```

```
## [1] "Jabba Desilijic Tiure"
```

<img src="l04/img/jabbaplot.png" width="768" />

---

## Mass vs. height: your turn

- Load `dplyr` and the `starwars` data.
- Plot mass vs. height without Jabba.

---

## Additional variables

The scatter plot shows two (continuous) variables. We can display additional variables with

- aesthetics (shape, colour, size), or
- faceting (small multiples displaying different subsets)

---
class: center, middle

# Aesthetics

---

## Aesthetics options

.alert[Aesthetics] are visual characteristics of a geom (here, `geom_point`) that can be **mapped to data**:

- `color`
- `size`
- `shape`
- `alpha` (transparency)

--

- ...and some less commonly used ones: `fill`, `stroke`

--

- ...and `group`. This one is most useful when you have multiple observations from the same group (e.g., repeated measurements from an individual that you want to connect in a line plot).

---

## Mass vs. height + gender

```r
starwars_wo_jabba <- starwars[!grepl("Jabba", starwars$name), ]
ggplot(data = starwars_wo_jabba, mapping = aes(x = height, y = mass, color = gender)) +
  geom_point()
```

<!-- -->

---

## Aesthetics summary

aesthetics    | discrete                 | continuous
------------- | ------------------------ | ------------
color         | rainbow of colors        | gradient
size          | discrete steps           | point area scales with the value
shape         | different shape for each | shouldn't (and doesn't) work

---
class: center, middle

# Faceting

---

## Faceting options

- Smaller plots that display different subsets of the data
- Useful for exploring conditional relationships and large data

---

## Mass vs. height by gender

.small[
```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  facet_grid(. ~ gender) +
  geom_point()
```

<!-- -->
]

---

## Dive further...

<div class="question">
In the next few slides, describe what each plot displays. Think about how the code relates to the output.
</div>

---

## Code 1

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  facet_grid(gender ~ .)
```

---

## Plot 1

<!-- -->

---

## Code 2

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  facet_grid(. ~ gender)
```

---

## Plot 2

<!-- -->

---

## Code 3

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  facet_wrap(~ eye_color)
```

---

## Plot 3

<!-- -->

---

## Facet summary

- `facet_grid()`: 2d grid, `rows ~ cols`; use `.` as a placeholder when faceting by only one variable.
- `facet_wrap()`: 1d ribbon wrapped into 2d

---
class: center, middle

# Identifying variables

---

## Number of variables involved

* Univariate data analysis - distribution of a single variable
* Bivariate data analysis - relationship between two variables
* Multivariate data analysis - relationship between many variables at once, usually focusing on the relationship between two while conditioning on others

---

## Types of variables

- **Numerical variables** can be classified as **continuous** or **discrete** based on whether the variable can take on an infinite number of values within a range or only non-negative whole numbers, respectively.
- If the variable is **categorical**, we can determine if it is **ordinal** based on whether or not the levels have a natural ordering.

---
class: center, middle

# Visualizing numerical data

---

## Describing shapes of numerical distributions

* shape:
    * skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
    * modality: unimodal, bimodal, multimodal, uniform
* center: mean (`mean`), median (`median`), mode (not always useful)
* spread: range (`range`), standard deviation (`sd`), inter-quartile range (`IQR`)
* unusual observations

---

## Histograms

.small[
```r
ggplot(data = starwars, mapping = aes(x = height)) +
  geom_histogram(binwidth = 10)
```

<!-- -->
]

---

## Density plots

.small[
```r
ggplot(data = starwars, mapping = aes(x = height)) +
  geom_density()
```

<!-- -->
]

---

## Side-by-side box plots

.small[
```r
ggplot(data = starwars, mapping = aes(y = height, x = gender)) +
  geom_boxplot()
```

<!-- -->
]

---

## Side-by-side <s>box</s> violin plots

.small[
```r
ggplot(data = starwars, mapping = aes(y = height, x = gender)) +
  geom_violin()
```

<!-- -->
]

---
class: center, middle

# Visualizing categorical data

---

## Bar plots

.small[
```r
ggplot(data = starwars, mapping = aes(x = gender)) +
  geom_bar()
```

<!-- -->
]

---

## Segmented bar plots, counts

.small[
```r
ggplot(data = starwars, mapping = aes(x = gender, fill = hair_color)) +
  geom_bar()
```

<!-- -->
]

---

## Segmented bar plots, proportions

.small[
```r
ggplot(data = starwars, mapping = aes(x = gender, fill = hair_color)) +
  geom_bar(position = "fill") +
  labs(y = "proportion")
```

<!-- -->
]

---

## Which bar plot is more appropriate?

<div class="question">
Which of the previous two bar plots is a more useful representation for visualizing the relationship between gender and hair color?
</div>

---

# Acknowledgments

These materials are adapted from [Mine Çetinkaya-Rundel and colleagues](https://github.com/Sta199-S18/website/blob/master/static/slides/lec-slides/02a-fund-data-viz.Rmd).

---

# Quizzz

- Accept the quiz: <https://classroom.github.com/a/1DSPY_t2>.
- Clone the repo. Create a `quiz1.qmd` file in the repo.
- Answer Exercise 1.2.5 questions 1-8, 10 from R4DS: <https://r4ds.hadley.nz/data-visualize.html#exercises>
- Commit and push `quiz1.qmd` by the end of class.