Don't Blame the Data

You can disagree with me, but don't blame the data

The Agony and the Ecstasy of my first open source contribution

My first open source contribution wasn't all smooth sailing, but it sure was thoroughly rewarding


For the last year or so, I’ve had this desire to contribute to an open source R package, but like a lot of people, I found the thought of tackling the task frightening.

While I work in a really dynamic and close team every day, and in the world of remote repositories (Git), I’ve had really limited exposure to collaborative working in these remote repositories. We tend to work on projects largely on our own, so the concepts of pull requests (PRs), merging, forking… well, it was all a bit daunting.

This post is a glimpse at that journey. What it won’t be is an exhaustive step-by-step guide to every touch point; think of it as a medium-to-high-level summary.

Life is not meant to be easy, my child; but take courage: it can be delightful. - George Bernard Shaw

With George’s words in mind, I thought it was time to push myself and jump in.

Scrolling through Twitter (as one does when nursing a newborn), I came across a tweet about a package I’ve used in a few analyses on Don’t Blame the Data that said that the package was now live on CRAN (a great achievement).

This naturally led me to the repository on GitHub, where I noticed there were open “Issues”, one of which asked for a function to create a ladder for any round.

The fitzRoy Package

The fitzRoy package, created by James Day, is a package designed to help R users extract and analyse Australian Football League (AFL) data for both the men’s and women’s competitions:

The goal of fitzRoy is to provide a set of functions that allows for users to easily get access to AFL data from sources such as afltables.com and footywire.com. There are also tools for processing and cleaning that data.

While I certainly haven’t done any extensive analysis on this point, I would guess that a large proportion of all AFL data analytics projects are completed with the help of this package.

Jumping right in

So rather than think about how good it would be to contribute, why not just get in touch with James and offer to address the open issue…

James was super easy to deal with, and boy was he helpful (and patient with this bumbling fool).

Then came the time to write the function. Well sort of write the function. Fortunately, I had already written this function for a linear regression model I built for predicting the attendance of AFL home and away games here. The function was aptly named return_ladder()… I’m a Data Scientist, not a poet.

The function was modified somewhat, though, to take advantage of the get_match_results() function in the package, which returns the starting data frame for return_ladder(). When writing the function, I wanted to address the requirement that the ladder be returned for any round, and for it to be returned for seasons even earlier than 2011, which another API already offered.

With that in mind, the function takes three arguments, each of which can be specified or left blank:

  • match_results_df - A data frame extracted using get_match_results(),
  • season_round - The round of the season the user wants the ladder for,
  • season - The season the ladder is required for.

If these are all left blank, the function will return the ladder for every round of every season since the 1897 season.
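To make the idea concrete, here is a minimal base R sketch of how a cumulative ladder can be built from match results. It is not the code that went into fitzRoy: the function name ladder_sketch(), the toy data frame and its column names are assumptions for illustration only. The scores are reconstructed from the 1897 round 1 and round 2 ladders shown later in this post, and the home/away assignment is arbitrary.

```r
# Toy match results (hypothetical column names, arbitrary home/away split).
results <- data.frame(
  Season       = 1897,
  Round.Number = rep(1:2, each = 4),
  Home.Team    = c("Fitzroy", "Collingwood", "Essendon", "Melbourne",
                   "Fitzroy", "Melbourne", "Collingwood", "Sydney"),
  Away.Team    = c("Carlton", "St Kilda", "Geelong", "Sydney",
                   "St Kilda", "Geelong", "Essendon", "Carlton"),
  Home.Points  = c(49, 41, 47, 44, 66, 64, 50, 40),
  Away.Points  = c(16, 16, 24, 27, 26, 19, 30, 36),
  stringsAsFactors = FALSE
)

ladder_sketch <- function(results, season = NULL, season_round = NULL) {
  # One row per team per match: stack the home view on top of the away view.
  long <- rbind(
    data.frame(Season = results$Season, Round.Number = results$Round.Number,
               Team = results$Home.Team,
               For = results$Home.Points, Against = results$Away.Points,
               stringsAsFactors = FALSE),
    data.frame(Season = results$Season, Round.Number = results$Round.Number,
               Team = results$Away.Team,
               For = results$Away.Points, Against = results$Home.Points,
               stringsAsFactors = FALSE)
  )

  # 4 points for a win, 2 for a draw, 0 for a loss.
  long$Pts <- ifelse(long$For > long$Against, 4,
                     ifelse(long$For == long$Against, 2, 0))

  # Running totals per team within each season.
  long <- long[order(long$Season, long$Team, long$Round.Number), ]
  team_grp <- interaction(long$Season, long$Team, drop = TRUE)
  long$Season.Points <- ave(long$Pts, team_grp, FUN = cumsum)
  long$Score.For     <- ave(long$For, team_grp, FUN = cumsum)
  long$Score.Against <- ave(long$Against, team_grp, FUN = cumsum)
  long$Percentage    <- long$Score.For / long$Score.Against

  # Rank within each season and round: points first, then percentage.
  long <- long[order(long$Season, long$Round.Number,
                     -long$Season.Points, -long$Percentage), ]
  round_grp <- interaction(long$Season, long$Round.Number, drop = TRUE)
  long$Ladder.Position <- ave(seq_len(nrow(long)), round_grp, FUN = seq_along)

  # Optional filters: leaving both NULL keeps every round of every season.
  if (!is.null(season)) long <- long[long$Season %in% season, ]
  if (!is.null(season_round)) long <- long[long$Round.Number %in% season_round, ]

  rownames(long) <- NULL
  long[, c("Season", "Team", "Round.Number", "Season.Points", "Score.For",
           "Score.Against", "Percentage", "Ladder.Position")]
}

ladder_sketch(results, season_round = 2)
```

The sketch ignores byes, so a team that does not play in a round simply has no row for that round; a real implementation would need to carry its totals forward.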

Having the function written was one thing; it also needed roxygen notes, which are returned to the user in the function’s help docs. Hadley’s R Packages book does a good job of explaining these.
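For anyone who hasn’t seen them, roxygen notes are specially formatted comments that sit directly above the function. The sketch below shows the general shape on a small hypothetical helper, ladder_percentage(), which is not part of fitzRoy; running devtools::document() in a package turns comments like these into the help file.

```r
#' Calculate a ladder percentage
#'
#' A hypothetical helper, used here only to show the shape of a roxygen block.
#'
#' @param score_for Numeric vector of points scored.
#' @param score_against Numeric vector of points conceded.
#'
#' @return A numeric vector of score_for divided by score_against.
#' @export
#'
#' @examples
#' ladder_percentage(49, 16)
ladder_percentage <- function(score_for, score_against) {
  score_for / score_against
}
```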


I’m ready to be a contributor

I’ve written the function and the help docs, and have checked the package using devtools::check() to make sure I haven’t made any mistakes that would cause the package to fail its build… Nothing looks alarming (well there are some warnings about No visible binding for global variable or something but I’m sure there’s nothing to worry about), and all looks good to me.
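For context, R CMD check reports that message as a NOTE, and it usually appears when a function refers to data frame columns by bare name, so the static checks can’t tell them apart from undefined global variables. The sketch below is an assumption about the general pattern, not the code from my PR; both function names are hypothetical.

```r
scores <- data.frame(Score.For = c(49, 16), Score.Against = c(16, 49))

# Bare column names inside transform() look like undefined globals to the
# static checks, which is the kind of code that typically triggers the note.
add_percentage_bare <- function(df) {
  transform(df, Percentage = Score.For / Score.Against)
}

# Indexing the columns explicitly gives the same values without bare names.
add_percentage_explicit <- function(df) {
  df$Percentage <- df[["Score.For"]] / df[["Score.Against"]]
  df
}

add_percentage_explicit(scores)
```

Other common remedies are declaring the names with utils::globalVariables() or, in dplyr pipelines, using the .data pronoun.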

My local changes were committed and a PR was made. I’m ready to be a contributor, and then bam! Failed codecov!! What is that?! An email to James and I’m told it’s because there were no tests written. Ok cool, I’ll write some tests… WHAT ARE TESTS?! HOW DO I WRITE THESE TESTS?! I found this post to be really helpful, as well as the testing chapter in Hadley’s R Packages book.
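In case it helps another first-timer, a test file is just a script of expectations about what a function should return. Below is a minimal sketch using the testthat package. The function under test, ladder_points(), is a hypothetical stand-in rather than anything from fitzRoy; in a package, a file like this would live under tests/testthat/ and call the real function instead.

```r
library(testthat)

# Hypothetical function under test: premiership points for a single match.
ladder_points <- function(score_for, score_against) {
  ifelse(score_for > score_against, 4,
         ifelse(score_for == score_against, 2, 0))
}

test_that("ladder_points awards 4, 2 or 0 points", {
  expect_equal(ladder_points(49, 16), 4)
  expect_equal(ladder_points(30, 30), 2)
  expect_equal(ladder_points(16, 49), 0)
})

test_that("ladder_points is vectorised", {
  expect_length(ladder_points(c(49, 16), c(16, 49)), 2)
})
```

Coverage tools such as codecov report how much of the package’s code is exercised by tests like these, which is why a PR with a new function and no tests can fail that check.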

Once these tests were written, I commit my changes, I’m ready to be a contributor, and then bam! Changes have been made to the master that I haven’t got in my PR… ok so I need to merge master into my PR branch - easy (for some maybe, I have no idea). A bit of googling, seems pretty easy, but after typing git merge origin/master, I get this editor pop up in the terminal.

My initial thoughts? What the is this?!

Bit of googling, ok, it’s the Vim editor. Easy. Write a commit message and then all should be good… WAIT?! How do I get out of this screen?! Bit more googling and after typing :wq, we’re ready to rock.

What. Am. I. Doing?!!

Ok so things were looking good. I’d committed my changes, all checks passed, happy days.

You know that line I had earlier about well there are some warnings about ‘No visible binding for global variable’ or something but I’m sure there’s nothing to worry about?? Well that was nagging away at me, because as James had advised, these would cause issues when trying to include the update on CRAN. So I fixed those, and also updated the Men’s vignette. It’s at this point that I’m a bit hazy on what I did, but all I know is that I must have spun myself into a Git web…

The master branch of my forked repo was two commits behind my branch Ladder, which was five commits ahead of origin/master. What. Am. I. Doing?! Trial and error, error and trial. After much heartache (I can’t stress enough how much heartache), eventually, I got myself all sorted, created another PR and… SUCCESS!!!

Finally I can say I have successfully made my first contribution to an open source project. I hope that users of this package find the function useful and as with everything, can find improvements to make it even better.


A quick look at the function

The code below gives a glimpse into how the function can be used.

#----- Install and Load Package -----#
# devtools::install_github("jimmyday12/fitzRoy")

library(fitzRoy)
library(tidyverse)
library(kableExtra)

# get a data frame of AFL data using get_match_results
afl_data <- get_match_results()

Return the ladder for all teams, for all rounds since 1897

# apply the return_ladder function
ladder <- return_ladder(match_results_df = afl_data)
head(ladder, 16) %>%
  kable(format = "html", escape = FALSE) %>%
  kable_styling("striped") %>%
  scroll_box(width = "750px", height = "600px")

Season  Team             Round.Number  Season.Points  Score.For  Score.Against  Percentage  Ladder.Position
1897    Fitzroy                     1              4         49             16   3.0625000                1
1897    Collingwood                 1              4         41             16   2.5625000                2
1897    Essendon                    1              4         47             24   1.9583333                3
1897    Melbourne                   1              4         44             27   1.6296296                4
1897    Sydney                      1              0         27             44   0.6136364                5
1897    Geelong                     1              0         24             47   0.5106383                6
1897    St Kilda                    1              0         16             41   0.3902439                7
1897    Carlton                     1              0         16             49   0.3265306                8
1897    Fitzroy                     2              8        115             42   2.7380952                1
1897    Melbourne                   2              8        108             46   2.3478261                2
1897    Collingwood                 2              8         91             46   1.9782609                3
1897    Essendon                    2              4         77             74   1.0405405                4
1897    Sydney                      2              4         67             80   0.8375000                5
1897    Carlton                     2              0         52             89   0.5842697                6
1897    St Kilda                    2              0         42            107   0.3925234                7
1897    Geelong                     2              0         43            111   0.3873874                8

Return the ladder for round 1 for all teams since 1897

# what if we want the ladder for a specific round?
ladder_round_1 <- return_ladder(match_results_df = afl_data, season_round = 1)
tail(ladder_round_1, 18) %>%
  kable(format = "html", escape = FALSE) %>%
  kable_styling("striped") %>%
  scroll_box(width = "750px", height = "600px")

Season  Team             Round.Number  Season.Points  Score.For  Score.Against  Percentage  Ladder.Position
2019    GWS                         1              4        112             40   2.8000000                1
2019    Fremantle                   1              4        141             59   2.3898305                2
2019    Brisbane Lions              1              4        102             58   1.7586207                3
2019    Hawthorn                    1              4         87             55   1.5818182                4
2019    Richmond                    1              4         97             64   1.5156250                5
2019    Port Adelaide               1              4         87             61   1.4262295                6
2019    Footscray                   1              4         82             65   1.2615385                7
2019    Geelong                     1              4         72             65   1.1076923                8
2019    St Kilda                    1              4         85             84   1.0119048                9
2019    Gold Coast                  1              0         84             85   0.9882353               10
2019    Collingwood                 1              0         65             72   0.9027778               11
2019    Sydney                      1              0         65             82   0.7926829               12
2019    Melbourne                   1              0         61             87   0.7011494               13
2019    Carlton                     1              0         64             97   0.6597938               14
2019    Adelaide                    1              0         55             87   0.6321839               15
2019    West Coast                  1              0         58            102   0.5686275               16
2019    North Melbourne             1              0         59            141   0.4184397               17
2019    Essendon                    1              0         40            112   0.3571429               18

Return the ladder for every round of the 2018 season

# finally, for every round in just one season
ladder_2018 <- return_ladder(match_results_df = afl_data, season = 2018)
head(ladder_2018, 18) %>%
  kable(format = "html", escape = FALSE) %>%
  kable_styling("striped") %>%
  scroll_box(width = "750px", height = "600px")

Season  Team             Round.Number  Season.Points  Score.For  Score.Against  Percentage  Ladder.Position
2018    GWS                         1              4        133             51   2.6078431                1
2018    Port Adelaide               1              4        110             60   1.8333333                2
2018    Hawthorn                    1              4        101             67   1.5074627                3
2018    Gold Coast                  1              4         55             39   1.4102564                4
2018    Sydney                      1              4        115             86   1.3372093                5
2018    St Kilda                    1              4        107             82   1.3048780                6
2018    Richmond                    1              4        121             95   1.2736842                7
2018    Essendon                    1              4         99             87   1.1379310                8
2018    Geelong                     1              4         97             94   1.0319149                9
2018    Melbourne                   1              0         94             97   0.9690722               10
2018    Adelaide                    1              0         87             99   0.8787879               11
2018    Carlton                     1              0         95            121   0.7851240               12
2018    Brisbane Lions              1              0         82            107   0.7663551               13
2018    West Coast                  1              0         86            115   0.7478261               14
2018    North Melbourne             1              0         39             55   0.7090909               15
2018    Collingwood                 1              0         67            101   0.6633663               16
2018    Fremantle                   1              0         60            110   0.5454545               17
2018    Footscray                   1              0         51            133   0.3834586               18

I will be writing a follow up post analysing the AFL ladder through history to really test the function out!

Stay tuned.
