How To: Visualizing Nebraska's record-shattering volleyball match
An R tutorial putting the Cornhuskers women's volleyball team's recent match into perspective
Selling out an arena is nothing new for the Nebraska women’s volleyball team — they’ve done it more than 300 times in a row.
Now they can say selling out one of the biggest college football stadiums in the country isn’t either.
The Cornhuskers volleyball team made history this past Thursday playing in front of a record-breaking 92,003 at Nebraska’s Memorial Stadium.
The match surpassed the previous women’s sporting attendance mark of 91,648 set during the FC Barcelona-Wolfsburg Champions League match last season. For reference, last year’s Super Bowl packed only 70,000 into SoFi Stadium in Los Angeles.
Nebraska’s volleyball team is perennially ranked as one of the best in the nation and has led the NCAA in attendance in each of the past nine seasons. They regularly sell out their home gym, the Bob Devaney Sports Center in Lincoln, Neb. Nebraska has 306 consecutive sell outs, and matches involving the Huskers account for 13 of the 14 largest NCAA regular season crowds.
But even those raucous home crowds couldn’t have prepared the players for one of the most epic spectacles in any sport, men’s or women’s.
The team even got to take part in one of the best traditions in sports, the tunnel walk.
Today’s newsletter will provide a walk through of how to create the pretty simple bar chart above that shows the scale of Thursday’s match.
If you’re not interested in how the sausage is made feel free to skip the rest of this edition of Between the Pipes. For the R-heads out there, let’s dive in.
Note: This tutorial assumes you have a baseline knowledge of the R programming language and have downloaded R onto your machine. If you’re just getting started on your journey I recommend the P8105 website, among many other great resources on the internet, to get you up to speed.
Setup
First, we need to load the necessary packages into our R session. I also like to do a bit of setup work up front to create a customized ggplot2 theme, and set the aspect ratio for any figures we’d be generating later on. You can alter any of the font and background color settings in the theme_custom
function below.
library(tidyverse)
library(rvest)
library(teamcolors)
library(ggchicklet)
library(janitor)
library(ggtext)
library(glue)
library(ggimage)
library(scales)
library(prismatic)
# don't forget to set your working directory and install these packages on your machine
# Custom ggplot theme (inspired by Owen Phillips at the F5 substack blog)
theme_custom <- function () {
theme_minimal(base_size=11, base_family="Outfit") %+replace%
theme(
panel.grid.minor = element_blank(),
plot.background = element_rect(fill = 'floralwhite', color = "floralwhite")
)
}
# create aspect ration to use throughout
asp_ratio <- 1.618
Data Preparation
Before getting to the fun stuff we need to first retrieve the data. I found the following article on the NCAA’s website had a table listing the top 30 most attended women's college volleyball games. Fortunately, this is a pretty standard html table so we can use the rvest
package to scrape its contents and put it into a tidy dataframe.
From there we can do a bit of data munging to ensure we exclude NCAA tournament games from our analysis since those are typically played in much larger arenas than the on campus venues in which regular season matches take place.
## -------------------- Get data -----------------------
# Attendance record article on NCAA website
url <- "https://www.ncaa.com/news/volleyball-women/article/2023-08-30/womens-college-volleyball-all-time-attendance-records"
# use Rvest to scrape the data table on this website
ncaa_records <- url %>%
read_html %>%
html_elements('table') %>%
html_table() %>%
.[1] %>%
as.data.frame() %>%
clean_names()
# fill tournament_round column with 'Regular Season'
ncaa_records$tournament_round <- sub("^$", "Regular season", ncaa_records$tournament_round)
# exclude non-tournament games
ncaa_records <- ncaa_records %>%
filter(tournament_round == 'Regular season')
Create dataframe for the plot
We now have a table of data to work with, but need to add a few more columns that will help us make our plot shine in the end.
We want to define who the home team was, their colors and logos to display on the y-axis. We do this by leveraging the mutate and case_when function from the tidyverse to specify what colors and logos go with what team. We can also add a column outlining the match details (eg. date, teams involved).
A couple other steps will help ensure the attendance column is read in as a numeric value rather than a character vector, and to create html tags for the geom_richtext function to read when we plot the logos on the y-axis.
## ---------------------- Data preparation --------------------------
# define home team, colors, and logos
ncaa_records2 <- ncaa_records %>%
mutate(home = case_when(city == 'Lincoln' ~ "Nebraska",
city == 'Madison' ~ 'Wisconsin',
city == 'Omaha' ~ 'Creighton'
),
logo = case_when(home == 'Nebraska' ~ "https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/158.png",
home == 'Wisconsin' ~ "https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/275.png",
home == 'Creighton' ~ "https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/156.png"
),
team_color = case_when(home == 'Nebraska' ~ '#E41C38',
home == 'Wisconsin' ~ '#C5050C',
home == 'Creighton' ~ '#005CA9',
TRUE ~ "gray70"
)
)
# add a new column providing match details
ncaa_records_plot <- ncaa_records2 %>%
mutate(match = str_c(winner,opponent,date, sep = "-"))
# turn attendance values into numerics
ncaa_records_plot$att <- as.numeric(gsub(",","",ncaa_records_plot$att))
# create html tags for geom_richtext to read
ncaa_records_plot <- ncaa_records_plot %>%
mutate(logo_label = glue::glue("<img src='{logo}' width='30'/>"))
The Plot
Finally, we can create the visualization using the fantastic ggplot2
package. Instead of making traditional bar charts I opted to use the ggchicklet
package from Bob Rudis, which creates more stylized bars.
# Make Chicklets plot
ncaa_records_plot %>%
mutate(match = fct_reorder(match, att)) |>
ggplot(aes(x = match, y = att, fill = team_color)) +
geom_chicklet() + #same as geom_col or geom_bar
geom_richtext(
aes(y = -6, label = logo_label, hjust = 1),
label.size = 0, fill = NA
) +
geom_text(aes(label = scales::comma(att)), fontface = "bold",
family = "Outfit", color = "white", hjust = 1.2) +
geom_text(aes(label = match),
family = "Outfit", y=50000) +
coord_cartesian(clip = "off") +
scale_fill_identity(guide = "none") +
scale_y_continuous(labels = comma_format()) +
theme_custom() +
coord_flip() +
theme(panel.grid.major.x = element_blank(),
panel.grid.major.y = element_blank(),
plot.title = element_text(face = 'bold',
size = 20,
hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
plot.title.position = 'plot',
axis.text.y=element_blank(),
plot.margin = margin(15, 30, 15, 15)
) +
theme(legend.position = "none") +
labs(x = "",
y = "Attendance",
title = "Nebraska Breaks Womens Attendance Record",
subtitle = paste0("Top regular season home attendance records for NCAA women's volleyball since 2007. NCAA tournament games not included."),
caption = "* Denotes home team\nData:NCAA.com | Plot: Your Name Here")
# save image in working directory
ggsave("Nebraska Attendance.png", dpi = 300, width = 10.5, height = 6.5)
The full code
The below can also be found on my Github.
##### Nebraska Volleyball #####
##### By: Stephan Teodosescu #####
##### September 2023 #####
library(tidyverse)
library(rvest)
library(teamcolors)
library(ggchicklet)
library(janitor)
library(ggtext)
library(glue)
library(ggimage)
library(scales)
library(prismatic)
# don't forget to set your working directory and install these packages on your machine
# Custom ggplot theme (inspired by Owen Phillips at the F5 substack blog)
theme_custom <- function () {
theme_minimal(base_size=11, base_family="Outfit") %+replace%
theme(
panel.grid.minor = element_blank(),
plot.background = element_rect(fill = 'floralwhite', color = "floralwhite")
)
}
# create aspect ration to use throughout
asp_ratio <- 1.618
# Function for plot with logo generation
add_logo <- function(plot_path, logo_path, logo_position, logo_scale = 10){
# Requires magick R Package https://github.com/ropensci/magick
# Useful error message for logo position
if (!logo_position %in% c("top right", "top left", "bottom right", "bottom left")) {
stop("Error Message: Uh oh! Logo Position not recognized\n Try: logo_positon = 'top left', 'top right', 'bottom left', or 'bottom right'")
}
# read in raw images
plot <- magick::image_read(plot_path)
logo_raw <- magick::image_read(logo_path)
# get dimensions of plot for scaling
plot_height <- magick::image_info(plot)$height
plot_width <- magick::image_info(plot)$width
# default scale to 1/10th width of plot
# Can change with logo_scale
logo <- magick::image_scale(logo_raw, as.character(plot_width/logo_scale))
# Get width of logo
logo_width <- magick::image_info(logo)$width
logo_height <- magick::image_info(logo)$height
# Set position of logo
# Position starts at 0,0 at top left
# Using 0.01 for 1% - aesthetic padding
if (logo_position == "top right") {
x_pos = plot_width - logo_width - 0.01 * plot_width
y_pos = 0.01 * plot_height
} else if (logo_position == "top left") {
x_pos = 0.01 * plot_width
y_pos = 0.01 * plot_height
} else if (logo_position == "bottom right") {
x_pos = plot_width - logo_width - 0.01 * plot_width
y_pos = plot_height - logo_height - 0.01 * plot_height
} else if (logo_position == "bottom left") {
x_pos = 0.01 * plot_width
y_pos = plot_height - logo_height - 0.01 * plot_height
}
# Compose the actual overlay
magick::image_composite(plot, logo, offset = paste0("+", x_pos, "+", y_pos))
}
## -------------------- Get data -------------------
# Attendance record article on NCAA website
url <- "https://www.ncaa.com/news/volleyball-women/article/2023-08-30/womens-college-volleyball-all-time-attendance-records"
# use Rvest to scrape the data table on this website
ncaa_records <- url %>%
read_html %>%
html_elements('table') %>%
html_table() %>%
.[1] %>%
as.data.frame() %>%
clean_names()
# fill tournament_round column with 'Regular Season'
ncaa_records$tournament_round <- sub("^$", "Regular season", ncaa_records$tournament_round)
# exclude non-tournament games
ncaa_records <- ncaa_records %>%
filter(tournament_round == 'Regular season')
## ---------------------- Data preparation --------------------------
# define home team, colors, and logos
ncaa_records2 <- ncaa_records %>%
mutate(home = case_when(city == 'Lincoln' ~ "Nebraska",
city == 'Madison' ~ 'Wisconsin',
city == 'Omaha' ~ 'Creighton'
),
logo = case_when(home == 'Nebraska' ~ "https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/158.png",
home == 'Wisconsin' ~ "https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/275.png",
home == 'Creighton' ~ "https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/156.png"
),
team_color = case_when(home == 'Nebraska' ~ '#E41C38',
home == 'Wisconsin' ~ '#C5050C',
home == 'Creighton' ~ '#005CA9',
TRUE ~ "gray70"
)
)
# add a new column providing match details
ncaa_records_plot <- ncaa_records2 %>%
mutate(match = str_c(winner,opponent,date, sep = "-"))
# turn attendance values into numerics
ncaa_records_plot$att <- as.numeric(gsub(",","",ncaa_records_plot$att))
# create html tags for geom_richtext to read
ncaa_records_plot <- ncaa_records_plot %>%
mutate(logo_label = glue::glue("<img src='{logo}' width='30'/>"))
## -------------------- Make Plot ----------------------------
ncaa_records_plot %>%
mutate(match = fct_reorder(match, att)) |>
ggplot(aes(x = match, y = att, fill = team_color)) +
geom_chicklet() + #same as geom_col or geom_bar
geom_richtext(
aes(y = -6, label = logo_label, hjust = 1),
label.size = 0, fill = NA
) +
geom_text(aes(label = scales::comma(att)), fontface = "bold",
family = "Outfit", color = "white", hjust = 1.2) +
geom_text(aes(label = match),
family = "Outfit", y=50000) +
coord_cartesian(clip = "off") +
scale_fill_identity(guide = "none") +
scale_y_continuous(labels = comma_format()) +
theme_custom() +
coord_flip() +
theme(panel.grid.major.x = element_blank(),
panel.grid.major.y = element_blank(),
plot.title = element_text(face = 'bold',
size = 20,
hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
plot.title.position = 'plot',
axis.text.y=element_blank(),
plot.margin = margin(15, 30, 15, 15)
) +
theme(legend.position = "none") +
labs(x = "",
y = "Attendance",
title = "Nebraska Breaks Womens Attendance Record",
subtitle = paste0("Top regular season home attendance records for NCAA women's volleyball since 2007. NCAA tournament games not included."),
caption = "* Denotes home team\nData:NCAA.com | Plot: Your Name Here")
# save image in working directory
ggsave("Nebraska Attendance.png", dpi = 300, width = 10.5, height = 6.5)