A few days ago, a fellow Denver Nuggets fan posted the following tweets:
Being the nerd that I am, I saw this as an excellent excuse to practice my web scraping skills with
rvest and the R
tidyverse analysis packages. I get a lot of practice with data analysis with R in my day job as an actuary, but I don’t have an excuse to do web scraping very often. I’ve toyed around with web scraping basketball data in the past, but the resulting analysis usually devolves into a series of unanswerable rabbit holes pretty quickly…so this very straightforward analysis was a welcome change of pace.
In addition to showing the analysis and code here on my WordPress site (p.s., sorry for any ugly code formatting…I’ve tried to limit the issues, but WordPress seems hell-bent on mangling my
magrittr pipe operators…), I’ve also posted the code over on my GitHub in case you want to download the code and play with it yourself.
Given the straightforward nature of this analysis, there’s not a whole lot of setup needed. Note that I’m using the brand new (seriously…it was released yesterday)
tidyverse package to easily load all the Hadley goodies like
tidyr, etc. Praise be to Hadley!
require(tidyverse) require(magrittr) # why isn't magritter laoded within `tidyverse`? i can't use %<>% 😦 require(rvest) require(ggthemes) require(pander)
location_of_output = "/Users/kylewurtz/Dropbox/R/NBA Average Age/Plots" # update this if necessary!
Since the analysis here is relatively straightforward, I’ll give some background on the structure of this workflow. I guess I’m a bit self-conscious about the structure since it seems like overkill for a simple analysis like this, but I’m a firm believer in starting with a checklist/template to ensure consistency and help prevent me from making silly mistakes. So even in a simple case like this, I’m using my go-to analysis format that consists of the following four sections:
- Overview: This is where I briefly explain the purpose of the analysis.
- Setup: This is where I perform overhead tasks like loading packages, reading in data files, creating helper functions, etc.
- Work: This is where the magic happens, and it often contains several sub-sections (and occasionally a somewhat out-of-place rambling about the structure of my analysis template…).
- Conclusion: This is where I summarize the results of the analysis and explain any required next steps. Someone should be able to just read my Overview and Conclusion sections and understand the purpose and outcome of the analysis.
With that said, this analysis contains three short subsections within the Work section. First, we’ll scrape the data from Basketball Reference. Then, we’ll present the results graphically. And finally, we’ll present the results in a tabular form for the people who like numbers more than pretty shapes and colors.
Scraping the Data
Admittedly, using web scraping for this example is somewhat overkill. I mean, Basketball Reference has a page that contains all the data we need and you can download it as a
.csv with a couple clicks of the mouse…there’s really no need for web scraping in this case. But part of the allure of this mini project was to practice some web scraping, so I don’t really care if it’s overkill…I’m using
Besides, I started this project thinking I was going to need to loop through each of the teams’ pages and pull from their per game stats tables. That would have been tedious to do manually, so I figured web scraping would be the best solution. But after some troubles with the web scraping (turns out Basketball Reference is a pain to scrape from in some cases…I think maybe they’re populating some tables on the front end, which confuses
rvest or maybe happens after
rvest pulls the
.html? I don’t know…it’s over my head, but simple CSS selectors weren’t working…) I took a step back and realized the players totals page had all the info I needed. At that point, it was easier to just do the web scraping than download the
player_totals = read_html("http://www.basketball-reference.com/leagues/NBA_2016_totals.html") player_totals %<>% html_node("#totals_stats") %>% html_table()
Before doing any analysis, we’ll also do a little cleanup on what we get from the
html_table command. The table is already in a very nice format for us to work with, but there are some annoyances with the structure of the Basketball Reference data that we’ll have to deal with. See the comments/code below for more info.
# glimpse(player_totals) # convert to tibble player_totals %<>% as_tibble() # remove row breaks player_totals %<>% filter(Rk != "Rk") # remove "TOT" values for players who spent time on multiple teams player_totals %<>% filter(Tm != "TOT")
Plotting Average Ages
Now that we have the data in a solid format, let’s create a nice plot that visualizes the average age of each NBA team! Things get a tad complicated because we’re not just interested in the weighted average age (using minutes played as the weights) – we’re also interested in the simple average age and the difference between the two metrics. There are a lot of ways to visualize this, but I elected to use a bar plot with the x-axis representing the teams in ascending order of weighted average age, the y-axis representing the weighted average age, and the color of the bars representing the simple average age. This provides a fairly elegant way for us to identify teams that break the color gradient (meaning there’s a large difference between the weighted and simple average ages).
avg_age_plot = player_totals %>% group_by(Tm) %>% mutate( Age = as.numeric(Age), MP = as.numeric(MP) ) %>% summarize( wtd_mean_age = weighted.mean(Age, MP), mean_age = mean(Age) ) %>% arrange(wtd_mean_age) %>% mutate(Tm = factor(Tm, levels = .[])) %>% ggplot(., aes(x = Tm, y = wtd_mean_age, fill = mean_age)) + geom_bar(stat = "identity") + coord_cartesian(ylim = c(20, 32)) + theme_fivethirtyeight() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + ggtitle("Average Age of NBA Teams") + theme(axis.title = element_text(), axis.title.x = element_blank(), axis.title.y = element_text(margin = margin(0, 15, 0, 0))) + ylab('Avg. Age (Wtd. by MP)') + scale_fill_continuous("Avg. Age (Not Weighted)", low = "#56B1F7", high = "#132B43") ggsave(avg_age_plot, filename = file.path(location_of_output, "avg_age_plot.png"), width = 12, height = 7) avg_age_plot
It’s pretty clear that the Nuggets are one of the youngest teams in the league, regardless of whether you weight age with minutes played. Of course, this is only based on last year’s data, and next year’s experience may be materially different. If Gallo and Chandler miraculously manage to stay healthy the entire season and the Nuggets are in playoff contention, I’d expect them to move to the right a bit on this graph. With a couple injuries to the feeble veterans or an impressive rookie campaign from one or more of the first round picks, however, the Nuggets might even move a bit to the left…
A few other things stuck out to me on this graph:
- Minnesota appears to be the one young team whose metrics don’t match well. They’re younger than the Nuggets when weighting age with minutes played, but they appear to be significantly older when not accounting for minutes played. This is probably because they had some old dogs on the roster (Kevin Garnett, Andre Miller, and Nikola Pekovic come to mind) that didn’t get much playing time for one reason or another.
- San Antonio really jumps out. They’re one of the oldest teams regardless of what metric you use (no surprise there), but their bar is super dark…which means they’re by far the oldest team when using a simple average. This is probably due to a similar effect as the one described for Minnesota (in this case, Matt Bonner, Rasual Butler, and Andre Miller – again…#outlieralert – are the culprits).
- For some reason, I’ve been thinking of Golden State as a pretty young team, but both metrics point out that they’re really not that young…in fact, they’re older than most teams in the league! Fortunately for them, most of their stars are still pretty young (even adding Durant), but it’ll be interesting to see how they fare as their older role players (looking at you, Iguodala and Bogut) start to deteriorate.
Listing Average Ages and Youngest Ranks
We’ll also present the data in tabular form in case there’s anything deceptive about the graphics/color shading in the plot above.
player_totals %>% group_by(Tm) %>% mutate( Age = as.numeric(Age), MP = as.numeric(MP) ) %>% summarize( wtd_mean_age = weighted.mean(Age, MP), mean_age = mean(Age) ) %>% mutate( rnk_wtd = min_rank(wtd_mean_age), rnk_not_wtd = min_rank(mean_age), wtd_mean_age = round(wtd_mean_age, 1), mean_age = round(mean_age, 1) ) %>% arrange(rnk_wtd) %>% pander()
Nothing strikes me as unusual or materially different after looking at the numbers underlying the plot above. I’m sure I could dig deeper, but this was supposed to be a quick analysis so I’ll wrap up here.
In short, the answer to Joel’s question appears to be: yes, regardless of whether you weight age with minutes played, the Nuggets are one of the younger teams in the league. Of course, that may change in the upcoming season, but there’s a good chance in my opinion that there will be some offsetting effects (adding the older Chandler will be offset by the introduction of some rookies to the rotation and inevitable injuries to the older guys) and the Nuggets will again be one of the younger teams in the league.