Lab04 - Exploratory Statistics

Setup
- Install package
- Load packages
Summary statistics
Constructing plots
Saving ggplot graphics

Setup

Install package

install.packages("tidyverse")

If the install fails, then run

install.packages(c("dplyr","tidyr","ggplot2"))

Load packages

The installation only needs to be done once. But we will need to load the packages in every R session where we want to use them. To load the packages, use

library("dplyr")
library("tidyr")
library("ggplot2")

alternatively, you can load the entire (not very big) tidyverse.

library("tidyverse")

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4.9000     ✔ readr     2.1.5     
## ✔ forcats   1.0.0          ✔ stringr   1.5.1     
## ✔ ggplot2   3.5.1          ✔ tibble    3.2.1     
## ✔ lubridate 1.9.3          ✔ tidyr     1.3.1     
## ✔ purrr     1.0.2          
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Change theme to a better one

theme_set(theme_bw()) # comment this line to see what default plots look like

Summary statistics

# Create data
y <- c(3, 4.5, 7, 8, 1, -3, 4, 10, 8)

# Sample size
length(y)

## [1] 9

Measures of location

Many summary statistics provide information about the location of the data. The most common measures of location provide information about the center of the data, e.g. mean, median, and mode.

# Measures of center
mean(y)

## [1] 4.722222

median(y)

## [1] 4.5

The mode() function does not provide the typical mode, i.e. the most common number. We can use R to get us this information

# Mode
sort(table(y), decreasing = TRUE)

## y
##   8  -3   1   3   4 4.5   7  10 
##   2   1   1   1   1   1   1   1

An extremely versatile measure of location is the quantile() function. The \(p\) sample quantile (with \(0 < p < 1\)) of \(N\) data points is the number such that \(p*N\) data points are below the number while \((1-p)*N\) data points are above the number. There are possibly many numbers that could satisfy this requirement and sometimes no numbers can satisfy this (loose) definition. For this class, these nuances are not important. Instead, you should think about the \(p\) sample quantile as the number such that \(p\) proportion of the data are below the number and \(1-p\) proportion of the data are above the number.

Percentiles are simply the \(p\) quantile multiplied by 100 and represented as a percentage. For example, the 0.05 quantile is the same as the 5%-tile.

Quartiles are structured so that the data are broken up into 4 groups. Thus the 1st quartile is the 0.25 quantile, the 2nd quartile is the 0.50 quantile, and the 3rd quartile is the 0.75 quantile.

# Quantile
quantile(y, probs = 0.05) # 0.05 sample quantile and 5%-tile

##   5% 
## -1.4

When exploring data, the extremes can also be extremely useful to determine whether our data are within the range we are expecting.

# Min and max
min(y)

## [1] -3

max(y)

## [1] 10

Measures of spread

In addition to measures of location, we can calculate measures of spread. Common measures of spread are sample variance, sample standard deviation, range, and interquartile range.

# Measures of spread
var(y)

## [1] 16.44444

sd(y)

## [1] 4.055175

range(y)                                 # gives c(min(y), max(y))

## [1] -3 10

diff(range(y))                           # range

## [1] 13

diff(quantile(y, probs = c(0.25, 0.75))) # interquartile range

## 75% 
##   5

Summary

You can also get a quick 6-number summary in R using the summary() function.

# Summary
summary(y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -3.000   3.000   4.500   4.722   8.000  10.000

Constructing plots

The main purpose of the lab today is to construct plots using the ggplot2 R package. In order to construct these plots, we need to construct an appropriate data.frame and we will use dplyr to help us construct that data.frame.

Let’s use the built-in R data set airquality. Before we start plotting let’s take a quick look at the data.

# Summary statistics
dim(airquality)

## [1] 153   6

head(airquality)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

tail(airquality)

##     Ozone Solar.R Wind Temp Month Day
## 148    14      20 16.6   63     9  25
## 149    30     193  6.9   70     9  26
## 150    NA     145 13.2   77     9  27
## 151    14     191 14.3   75     9  28
## 152    18     131  8.0   76     9  29
## 153    20     223 11.5   68     9  30

summary(airquality)

##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
##

For built in datasets, we can get more information by going to the help file.

?airquality

One issue with this dataset is that the Month/Day columns don’t really provide us with a Date. Let’s create a new column that creates a real Date.

airquality <- airquality %>%
  dplyr::mutate(Date = as.Date(paste("1973",Month,Day,sep="/")))

If you deal with dates a lot, you should check out the lubridate package.

Histogram

All ggplot2 graphics require a data.frame containing the data and this data.frame is always the first argument to a ggplot call. After this, we specify some aesthetics using the aes() function. Then, we tell ggplot2 what kind of graphics to construct.

ggplot(airquality,     # data.frame containing the data
       aes(x=Ozone)) + # a column name from the data.frame
  geom_histogram()     # create a histogram

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).

If you want to avoid the message, you can specify the number of bins to use.

ggplot(airquality, aes(x=Ozone)) + 
  geom_histogram(bins = 40)

## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).

If you want plot on the density scale (so that you can compare to a pdf), use the following:

ggplot(airquality, aes(x=Ozone)) + 
  geom_histogram(aes(y=..density..), bins = 40)

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).

Histogram activity

Create a histogram of solar radiation on the density scale with 50 bins.

Click for solution

ggplot(airquality, aes(x=Solar.R)) + 
  geom_histogram(aes(y=..density..), bins = 50)

## Warning: Removed 7 rows containing non-finite outside the scale range
## (`stat_bin()`).

Boxplots

The syntax for boxplots is similar except that the variable you are interest in is the y aesthetic.

ggplot(airquality,     
       aes(x=1,y=Ozone)) + 
  geom_boxplot()

## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Comparing boxplots

ggplot(airquality,     
       aes(x=Month, y=Ozone, group=Month)) + 
  geom_boxplot()

## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Boxplot activity

Create boxplots of wind speed by month. Bonus: See if you can google to find out how to swap the axes, i.e. have Month on the y-axis and Wind on the x-axis.

Click for solution

ggplot(airquality,     
       aes(x=Month, y=Wind, group=Month)) + 
  geom_boxplot(outlier.shape = NA, color='grey') +  
  coord_flip()

Flipping the axes makes the comparisons vertical and therefore, I think, easier to interpret.

Scatterplot

At this point we can construct individual graphs for our 4 different response variables: Ozone, Solar.R, Wind, and Temp. Perhaps we want to understand the temporal variability for Ozone. We can use a scatterplot of Ozone vs Date.

ggplot(airquality, aes(x = Date, y = Ozone)) +
  geom_point()

## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

or if we wanted a line plot

ggplot(airquality, aes(x = Date, y = Ozone)) +
  geom_line()

Notice that the line is disconnected wherever we have missing data.

Perhaps we want to understand the relationship between solar radiation and ozone.

ggplot(airquality, aes(x = Solar.R, y = Ozone)) +
  geom_point()

## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).

Scatterplot activity

Create a scatterplot of wind speed versus temperature.

Click for solution

ggplot(airquality, aes(x = Temp, y = Wind)) +
  geom_point()

Boxplots with scatterplots

Scatterplots don’t look so good when there are data points that overlap. For example, when plotting Ozone vs Month the points may overlap due to Month only having 5 values in the data set.

ggplot(airquality,     
       aes(x=Month, y=Ozone, group=Month)) + 
  geom_point()

## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

So, instead we will typically jitter the points a bit to remove the overlap, e.g.

ggplot(airquality,     
       aes(x=Month, y=Ozone, group=Month)) + 
  geom_jitter()

## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

Now, we can combine the boxplots we discussed earlier with scatterplots or jittered scatterplots, e.g.

ggplot(airquality,     
       aes(x=Month, y=Ozone, group=Month)) + 
  geom_boxplot(color='grey',                 # make the boxes not so obvious
               outlier.shape = NA) +         # remove outliers, 
  geom_point()                               # because they get plotted here

## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

ggplot(airquality,     
       aes(x=Month, y=Ozone, group=Month)) + 
  geom_boxplot(color='grey',                 # make the boxes not so obvious
               outlier.shape = NA) +         # remove outliers, 
  geom_jitter()                              # because they get plotted here

## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

Boxplot with scatterplot activity

Create a scatterplot of wind speed by month and add a boxplot for each month in the background.

Click for solution

ggplot(airquality,     
       aes(x=Month, y=Wind, group=Month)) + 
  geom_boxplot(outlier.shape = NA, color='grey') +         
  geom_jitter() + 
  coord_flip()

Flipping the axes makes the comparisons vertical and therefore, I think, easier to interpret.

Converting data.frame from wide to long

If we want to put all the response variables on the same plot, we can color them. In order to do this, we will need to organize our data.frame into long format.

airquality_long <- airquality %>%
  dplyr::select(-Month, -Day) %>%              # Remove these columns
  tidyr::pivot_longer(-Date,
                      names_to  = "response",
                      values_to = "value")

Take a look at the resulting data.frame.

dim(airquality)

## [1] 153   7

dim(airquality_long)

## [1] 612   3

head(airquality_long)

## # A tibble: 6 × 3
##   Date       response value
##   <date>     <chr>    <dbl>
## 1 1973-05-01 Ozone     41  
## 2 1973-05-01 Solar.R  190  
## 3 1973-05-01 Wind       7.4
## 4 1973-05-01 Temp      67  
## 5 1973-05-02 Ozone     36  
## 6 1973-05-02 Solar.R  118

summary(airquality_long)

##       Date              response             value       
##  Min.   :1973-05-01   Length:612         Min.   :  1.00  
##  1st Qu.:1973-06-08   Class :character   1st Qu.: 13.00  
##  Median :1973-07-16   Mode  :character   Median : 66.00  
##  Mean   :1973-07-16                      Mean   : 80.06  
##  3rd Qu.:1973-08-23                      3rd Qu.: 91.00  
##  Max.   :1973-09-30                      Max.   :334.00  
##                                          NA's   :44

table(airquality_long$response)

## 
##   Ozone Solar.R    Temp    Wind 
##     153     153     153     153

ggplot(airquality_long, 
       aes(x = Date, y = value, 
           linetype = response,
           color = response, 
           group = response)) +
  geom_line()

Notice that the legend is automatically created. This is not something that is done in base R graphics.

Honestly, this doesn’t look very good, so it is better to facet the plot.

Faceted scatterplots

Facets are often a better way of representing multiple variables.

ggplot(airquality_long, aes(Date, value)) +
  geom_point() + 
  facet_wrap(~response)

## Warning: Removed 44 rows containing missing values or values outside the scale range
## (`geom_point()`).

Since the axes are quite different for the different responses, we can allow them to vary in the different facets.

ggplot(airquality_long, aes(Date, value)) +
  geom_line() + 
  facet_wrap(~response,scales="free_y")

Alternatively, we can use facet_grid which is more useful when you have more variables you want to facet by.

ggplot(airquality_long, aes(Date, value)) +
  geom_line() + 
  facet_wrap(.~response,scales="free_y")

ggplot(airquality_long, aes(Date, value)) +
  geom_line() + 
  facet_wrap(response ~ .,scales="free_y")

Converting data.frame from long to wide

If we only had the long version of the data.frame, we can reconstruct the wide version by using the following

airquality2 <- airquality_long %>%
  tidyr::pivot_wider(
    names_from  = response, 
    values_from = value)

Customizing ggplot2 graphics

Sometimes it is helpful to save the plot as an R object so that it can be updated in the future. To save the plot, just use the assignment operator, e.g.

g <- ggplot(airquality2,
       aes(x = Temp, y = Wind)) +
  geom_point()

g # Then you can see the plot by just typing the object name

We would like this plot to be a bit more informative, so we will add some informative labels.

g <- g +
  labs(x = "Temperature (F)",
       y = "Wind speed (mph)",
       title = "New York (May-September 1973)")

g

As you have seen before, we can also change the theme. I prefer the simple “bw” theme, but here is the default theme.

g <- g + theme_classic()
g

We can add a regression line.

g <- g + geom_smooth(method="lm")
g

## `geom_smooth()` using formula = 'y ~ x'

Alternatively, you can combine all the steps

ggplot(airquality2,
       aes(x = Temp, y = Wind)) +
  geom_point() +
  geom_smooth(method = "lm") + 
  labs(x = "Temperature (F)",
       y = "Wind speed (mph)",
       title = "New York (May-September 1973)")

## `geom_smooth()` using formula = 'y ~ x'

Plot creation activity

Use the cars dataset to construct and customize a figure displaying the relationship between the stopping distance and speed of a car.

Click for solution

ggplot(cars,
       aes(x=speed, y=dist)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(x = "Speed (mph)",
       y = "Stopping distance (ft)",
       title = "Stopping distance as a function of speed (1920s)")

## `geom_smooth()` using formula = 'y ~ x'

Saving ggplot graphics

If you want to save the plot, use the ggsave function, e.g.

ggsave(filename = "plot.png", 
       plot     = g, 
       width    = 5, 
       height   = 4)

Lab04 - Exploratory Statistics

Jarad Niemi

2024-09-16

Setup

Install package

Load packages

Summary statistics

Measures of location

Measures of spread

Summary

Constructing plots

Histogram

Histogram activity

Boxplots

Boxplot activity

Scatterplot

Scatterplot activity

Boxplots with scatterplots

Boxplot with scatterplot activity

Converting data.frame from wide to long

Faceted scatterplots

Converting data.frame from long to wide

Customizing ggplot2 graphics

Plot creation activity

Saving ggplot graphics