The Agony and the Ecstasy of my first open source contribution

For the last year or so, I’ve had this desire to contribute to an open source R package, but like a lot of people, I found the thought of tackling the task frightening.

While I work in a really dynamic and close team every day, and in the world of remote repositories (Git), I’ve had really limited exposure to collaborative working in these remote repositories… We tend more to work on projects largely on our own, so the concepts of pull requests (PRs), merging, forking… well it was all a bit daunting.

This post is a glimpse at that journey.

What this post won’t be is an exhaustive step-by-step guide of every touch point, rather a medium-high level summary.

Life is not meant to be easy, my child; but take courage: it can be delightful. - George Bernard Shaw

With George’s words in mind, I thought it was time to push myself to jump in.

Scrolling through Twitter (as one does when nursing a newborn), I came across a tweet about a package I’ve used in a few analyses on Don’t Blame the Data that said that the package was now live on CRAN (a great achievement).

This naturally led me to the repository on GitHub, at which point I noticed there were open “Issues”, one of which asked for a function to create a ladder for any round.

The fitzRoy Package

The fitzRoy package, created by James Day, is a package designed to help R users extract and analyse Australian Football League (AFL) data for both the men’s and women’s competitions:

The goal of fitzRoy is to provide a set of functions that allows for users to easily get access to AFL data from sources such as afltables.com and footywire.com. There are also tools for processing and cleaning that data.

While I certainly haven’t done any extensive analysis on this point, I would guess that a large proportion of all AFL data analytics projects are completed with the help of this package.

Jumping right in

So rather than think about how good it would be to contribute, why not just get in touch with James and offer to address the open issue…

James was super easy to deal with, and boy was he helpful (and patient with this bumbling fool).

Then came the time to write the function. Well sort of write the function. Fortunately, I had already written this function for a linear regression model I built for predicting the attendance of AFL home and away games here. The function was aptly named return_ladder()… I’m a Data Scientist, not a poet.

The function was modified somewhat, though, to take advantage of the get_match_results() function in the package, which returns the starting data frame for return_ladder(). When writing the function, I wanted to address the requirement that the ladder be returned for any round, and for it to be returned for seasons even earlier than 2011, which another API already offered.

With that in mind, the function takes three arguments, all of which are optional:

  • match_results_df - A data frame extracted using get_match_results(),
  • season_round - The round of the season the user wants the ladder for,
  • season - The season the ladder is required for.

If these are all left blank, the function will return the ladder for every round of every season since the 1897 season.
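To make the idea of optional arguments concrete, here is a minimal sketch of a ladder function run on a toy data frame. It is not the code that went into fitzRoy: the name toy_ladder(), the input column names and the four made-up matches are all assumptions for illustration. Unlike the real function, it builds a single ladder as at the chosen round rather than one ladder per round, and it assumes the usual AFL scoring of 4 points for a win and 2 for a draw.

```r
# Four made-up matches; these column names are illustrative, not fitzRoy's schema
matches <- data.frame(
  Season      = c(2018, 2018, 2018, 2018),
  Round       = c(1, 1, 2, 2),
  Home        = c("A", "C", "A", "B"),
  Away        = c("B", "D", "C", "D"),
  Home.Points = c(100, 80, 70, 90),
  Away.Points = c(60, 80, 90, 50),
  stringsAsFactors = FALSE
)

# Hypothetical helper: both filters are optional, NULL means "use everything".
# Simplification: it assumes the rows left after filtering belong to one season.
toy_ladder <- function(df, season = NULL, season_round = NULL) {
  if (!is.null(season)) df <- df[df$Season == season, ]
  if (!is.null(season_round)) df <- df[df$Round <= season_round, ]

  # Reshape to one row per team per match
  long <- data.frame(
    Team          = c(df$Home, df$Away),
    Score.For     = c(df$Home.Points, df$Away.Points),
    Score.Against = c(df$Away.Points, df$Home.Points),
    stringsAsFactors = FALSE
  )

  # 4 points for a win, 2 for a draw, 0 for a loss
  long$Season.Points <- ifelse(
    long$Score.For > long$Score.Against, 4,
    ifelse(long$Score.For == long$Score.Against, 2, 0)
  )

  # Total up each team's points and scores
  ladder <- aggregate(
    cbind(Season.Points, Score.For, Score.Against) ~ Team,
    data = long, FUN = sum
  )
  ladder$Percentage <- ladder$Score.For / ladder$Score.Against

  # Rank by points, then by percentage
  ladder <- ladder[order(-ladder$Season.Points, -ladder$Percentage), ]
  ladder$Ladder.Position <- seq_len(nrow(ladder))
  rownames(ladder) <- NULL
  ladder
}

toy_ladder(matches)                    # ladder after all rounds
toy_ladder(matches, season_round = 1)  # ladder after round 1 only
```

Defaulting the filters to NULL is one simple way to let a caller leave an argument blank; the real function may well handle this differently.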

Having the function written was one thing; it also needed roxygen notes, which are returned to the user in the function’s help docs. Hadley’s R Packages book does a good job of explaining these.
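For anyone who hasn't seen roxygen notes before, they are specially marked comments (starting with #') that sit directly above the function. The sketch below documents a tiny made-up helper, ladder_percentage(), which is not part of fitzRoy; only the tags themselves (@param, @return, @examples, @export) are standard roxygen2.

```r
#' Compute a team's percentage
#'
#' Divides the points a team has scored by the points scored against it.
#' (Hypothetical helper, used here only to show the roxygen layout.)
#'
#' @param score_for Numeric vector of points scored by each team.
#' @param score_against Numeric vector of points conceded by each team.
#'
#' @return A numeric vector the same length as the inputs.
#'
#' @examples
#' ladder_percentage(49, 16)
#'
#' @export
ladder_percentage <- function(score_for, score_against) {
  score_for / score_against
}
```

Running devtools::document() turns comments like these into the help files that users see when they type ?ladder_percentage.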


I’m ready to be a contributor

I’ve written the function and the help docs, and have checked the package using devtools::check() to make sure I haven’t made any mistakes that would cause the package to fail its build… Nothing looks alarming (well there are some warnings about No visible binding for global variable or something but I’m sure there’s nothing to worry about), and it all looks good to me.
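For anyone who hasn't met that note before, it usually appears when a function refers to a data frame column as a bare name, which the checker reads as a variable that was never defined. The sketch below uses a made-up data frame and made-up function names to show one pattern that triggers the note and two common ways around it; it is not the actual fix applied in fitzRoy.

```r
# Toy data, for illustration only
toy <- data.frame(
  Team       = c("A", "B"),
  Percentage = c(1.2, 0.8),
  stringsAsFactors = FALSE
)

# Bare column name inside subset(): inside a package, the checker sees
# `Percentage` as an undefined global variable and raises the note
above_par_nse <- function(df) {
  subset(df, Percentage > 1)
}

# Option 1: refer to the column explicitly, so there is nothing to flag
above_par <- function(df) {
  df[df$Percentage > 1, , drop = FALSE]
}

# Option 2: keep the bare names but declare them once, somewhere in the
# package's R/ directory:
# utils::globalVariables("Percentage")

above_par_nse(toy)
above_par(toy)
```

Both versions return the same rows; the difference only matters to the package checks.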

My local changes were committed and a PR was made. I’m ready to be a contributor, and then bam! Failed codecov!! What is that?! An email to James and I’m told it’s because there were no tests written. Ok cool, I’ll write some tests… WHAT ARE TESTS?! HOW DO I WRITE THESE TESTS?! I found this post to be really helpful, as well as the testing chapter of Hadley’s R Packages book.
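For anyone asking the same question, a test is just a small piece of code that calls your function with known inputs and checks the answer. Here is a minimal sketch using the testthat package, assuming it is installed. The helper toy_percentage() is made up for the example; these are not the tests that went into the PR.

```r
library(testthat)

# Hypothetical helper standing in for a real package function
toy_percentage <- function(score_for, score_against) {
  score_for / score_against
}

test_that("toy_percentage divides points for by points against", {
  expect_equal(toy_percentage(49, 16), 3.0625)
  expect_equal(toy_percentage(c(10, 20), c(20, 10)), c(0.5, 2))
})

test_that("toy_percentage returns one value per team", {
  expect_length(toy_percentage(c(10, 20, 30), c(5, 5, 5)), 3)
})
```

In a package, files like this live under tests/testthat/ and run with devtools::test(). Coverage tools such as codecov then report how much of the package's code those tests actually exercise.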

Once these tests were written, I committed my changes. I’m ready to be a contributor, and then bam! Changes have been made to the master that I haven’t got in my PR… ok so I need to merge the master into my PR - easy (for some maybe, I have no idea). A bit of googling, seems pretty easy, but after typing git merge origin/master, an editor pops up in the terminal.

My initial thoughts? What the is this?!

Bit of googling, ok, it’s the Vim editor. Easy. Write a commit message and then all should be good… WAIT?! How do I get out of this screen?! Bit more googling and after typing :wq, we’re ready to rock.

What. Am. I. Doing?!!

Ok so things were looking good. I’d committed my changes, all checks passed, happy days.

You know that line I had earlier about “well there are some warnings about No visible binding for global variable or something but I’m sure there’s nothing to worry about”? Well, that was nagging away at me because, as James had advised, these would cause issues when trying to include the update on CRAN. So I fixed those, and also updated the Men’s vignette. It’s at this point that I’m a bit hazy on what I did; all I know is that I must have spun myself into a Git web…

The master of my forked repo was two commits behind my branch Ladder, which was five commits ahead of origin/master. What. Am. I. Doing?! Trial and error, error and trial. After much heartache (I can’t stress enough how much heartache), eventually, I got myself all sorted, created another PR and… SUCCESS!!!

Finally I can say I have successfully made my first contribution to an open source project. I hope that users of this package find the function useful and as with everything, can find improvements to make it even better.


A quick look at the function

The below code gives a glimpse into how the function can be used.

#----- Install and Load Package -----#
# devtools::install_github("jimmyday12/fitzRoy")

library(fitzRoy)
library(tidyverse)
library(kableExtra)

# get a data frame of AFL data using get_match_results
afl_data <- get_match_results()

Return the ladder for all teams, for all rounds since 1897

# apply the return_ladder function
ladder <- return_ladder(match_results_df = afl_data)
head(ladder, 16) %>% 
  kable(format = "html", escape = FALSE) %>%
  kable_styling("striped") %>% 
  scroll_box(width = "750px", height = "600px")
Season  Team             Round.Number  Season.Points  Score.For  Score.Against  Percentage  Ladder.Position
1897    Fitzroy                     1              4         49             16   3.0625000                1
1897    Collingwood                 1              4         41             16   2.5625000                2
1897    Essendon                    1              4         47             24   1.9583333                3
1897    Melbourne                   1              4         44             27   1.6296296                4
1897    Sydney                      1              0         27             44   0.6136364                5
1897    Geelong                     1              0         24             47   0.5106383                6
1897    St Kilda                    1              0         16             41   0.3902439                7
1897    Carlton                     1              0         16             49   0.3265306                8
1897    Fitzroy                     2              8        115             42   2.7380952                1
1897    Melbourne                   2              8        108             46   2.3478261                2
1897    Collingwood                 2              8         91             46   1.9782609                3
1897    Essendon                    2              4         77             74   1.0405405                4
1897    Sydney                      2              4         67             80   0.8375000                5
1897    Carlton                     2              0         52             89   0.5842697                6
1897    St Kilda                    2              0         42            107   0.3925234                7
1897    Geelong                     2              0         43            111   0.3873874                8

Return the ladder for round 1 for all teams since 1897

# what if we want the ladder for a specific round?
ladder_round_1 <- return_ladder(match_results_df = afl_data, season_round = 1)
tail(ladder_round_1, 18) %>% 
  kable(format = "html", escape = FALSE) %>%
  kable_styling("striped") %>% 
  scroll_box(width = "750px", height = "600px")
Season  Team             Round.Number  Season.Points  Score.For  Score.Against  Percentage  Ladder.Position
2019    GWS                         1              4        112             40   2.8000000                1
2019    Fremantle                   1              4        141             59   2.3898305                2
2019    Brisbane Lions              1              4        102             58   1.7586207                3
2019    Hawthorn                    1              4         87             55   1.5818182                4
2019    Richmond                    1              4         97             64   1.5156250                5
2019    Port Adelaide               1              4         87             61   1.4262295                6
2019    Footscray                   1              4         82             65   1.2615385                7
2019    Geelong                     1              4         72             65   1.1076923                8
2019    St Kilda                    1              4         85             84   1.0119048                9
2019    Gold Coast                  1              0         84             85   0.9882353               10
2019    Collingwood                 1              0         65             72   0.9027778               11
2019    Sydney                      1              0         65             82   0.7926829               12
2019    Melbourne                   1              0         61             87   0.7011494               13
2019    Carlton                     1              0         64             97   0.6597938               14
2019    Adelaide                    1              0         55             87   0.6321839               15
2019    West Coast                  1              0         58            102   0.5686275               16
2019    North Melbourne             1              0         59            141   0.4184397               17
2019    Essendon                    1              0         40            112   0.3571429               18

Return the ladder for every round of the 2018 season

# finally, for every round in just one season
ladder_2018 <- return_ladder(match_results_df = afl_data, season = 2018)
head(ladder_2018, 18) %>% 
  kable(format = "html", escape = FALSE) %>%
  kable_styling("striped") %>% 
  scroll_box(width = "750px", height = "600px")
Season  Team             Round.Number  Season.Points  Score.For  Score.Against  Percentage  Ladder.Position
2018    GWS                         1              4        133             51   2.6078431                1
2018    Port Adelaide               1              4        110             60   1.8333333                2
2018    Hawthorn                    1              4        101             67   1.5074627                3
2018    Gold Coast                  1              4         55             39   1.4102564                4
2018    Sydney                      1              4        115             86   1.3372093                5
2018    St Kilda                    1              4        107             82   1.3048780                6
2018    Richmond                    1              4        121             95   1.2736842                7
2018    Essendon                    1              4         99             87   1.1379310                8
2018    Geelong                     1              4         97             94   1.0319149                9
2018    Melbourne                   1              0         94             97   0.9690722               10
2018    Adelaide                    1              0         87             99   0.8787879               11
2018    Carlton                     1              0         95            121   0.7851240               12
2018    Brisbane Lions              1              0         82            107   0.7663551               13
2018    West Coast                  1              0         86            115   0.7478261               14
2018    North Melbourne             1              0         39             55   0.7090909               15
2018    Collingwood                 1              0         67            101   0.6633663               16
2018    Fremantle                   1              0         60            110   0.5454545               17
2018    Footscray                   1              0         51            133   0.3834586               18

I will be writing a follow up post analysing the AFL ladder through history to really test the function out!

Stay tuned.

Jason Zivkovic
Data Scientist

A sports mad Data Scientist just having some fun.
