An In Depth Analysis of FIFA 19

This post was originally released on the Kaggle FIFA 19 Complete Player dataset that was kindly collected by Karan Gadiya. Many thanks Karan. The link to my original kernel is here

The full code base can be found here

Introduction

From Wikipedia:

FIFA 19 is a football simulation video game developed by EA Vancouver as part of Electronic Arts’ FIFA series. Announced on 6 June 2018 for its E3 2018 press conference, it was released on 28 September 2018 for PlayStation 3, PlayStation 4, Xbox 360, Xbox One, Nintendo Switch, and Microsoft Windows. It is the 26th installment in the FIFA series. As with FIFA 18, Cristiano Ronaldo appears as the cover athlete of the regular edition. https://en.wikipedia.org/wiki/FIFA_19

The game features a number of different playing modes; however, Career mode as a manager holds the most appeal for me.

The following analysis will be tailored toward having the best chance at success in that mode for anyone interested.

Some things I want to analyse in this paper:

  • High level Exploratory Data Analysis
  • Which features are highly correlated with a player’s overall rating by player position
  • Analyse the differences between a player’s current rating and their potential rating
  • Find out which teams have the highest potential
  • Find out the youngest teams / oldest teams
  • Use k-means clustering to try to find “bargains”; i.e. if there is someone with the same skills/potential, can they be found at a lower price?

Feature engineering

# Load libraries
library(tidyverse)
library(scales)
library(ggthemes)
library(kableExtra)
library(plotly)

options(scipen = 999)

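# Drop the unnamed row-index column: readr calls it X1 in older versions and ...1 from readr 2.0 onwards
fifa_data <- read_csv("https://raw.githubusercontent.com/JaseZiv/FIFA19-Analysis/master/data/data.csv") %>% select(-any_of(c("X1", "...1")))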

fifa_data <- fifa_data %>%
  mutate(ValueMultiplier = ifelse(str_detect(Value, "K"), 1000, ifelse(str_detect(Value, "M"), 1000000, 1))) %>%
  mutate(ValueNumeric_pounds = as.numeric(str_extract(Value, "[[:digit:]]+\\.*[[:digit:]]*")) * ValueMultiplier) %>%
  mutate(Position = ifelse(is.na(Position), "Unknown", Position))


fifa_data <- fifa_data %>%
  mutate(WageMultiplier = ifelse(str_detect(Wage, "K"), 1000, ifelse(str_detect(Wage, "M"), 1000000, 1))) %>%
  mutate(WageNumeric_pounds = as.numeric(str_extract(Wage, "[[:digit:]]+\\.*[[:digit:]]*")) * WageMultiplier)


positions <- unique(fifa_data$Position)

gk <- "GK"
defs <- positions[str_detect(positions, "B$")]
mids <- positions[str_detect(positions, "M$")]
f1 <- positions[str_detect(positions, "F$")]
f2 <- positions[str_detect(positions, "S$")]
f3 <- positions[str_detect(positions, "T$")]
f4 <- positions[str_detect(positions, "W$")]
fwds <- c(f1, f2, f3, f4)

fifa_data <- fifa_data %>% 
  mutate(PositionGroup = ifelse(Position %in% gk, "GK", ifelse(Position %in% defs, "DEF", ifelse(Position %in% mids, "MID", ifelse(Position %in% fwds, "FWD", "Unknown")))))

fifa_data <- fifa_data %>%
  mutate(AgeGroup = ifelse(Age <= 20, "20 and under", ifelse(Age > 20 & Age <=25, "21 to 25", ifelse(Age > 25 & Age <= 30, "26 to 30", ifelse(Age > 30 & Age <= 35, "31 to 35", "Over 35")))))

1. Player Valuation: The raw data has player valuations as a character string, with a designation at the end specifying whether the value is in thousands or millions. Regex is used to create a numeric variable called ValueNumeric_pounds (despite the name, the figures are quoted in euros).

2. Player Wage: See Point 1 above. The same transformation has been applied to player Wage.

3. Player Positions: There are 28 different positions in FIFA 19. To make the analysis less granular, I have decided to create four groupings: GK, DEF, MID and FWD.

4. Player Age: I have decided to group players into age buckets, in 5 year increments other than for players 20 years and younger, and players over 35.
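As a minimal sketch of the regex step in points 1 and 2, the snippet below wraps the same idea in a small helper. The helper name parse_money and the example strings are made up for illustration; they are not part of the original code base.

```r
library(stringr)

# Hypothetical helper: convert strings such as "€110.5M" or "€60K" to numbers
parse_money <- function(x) {
  # Multiplier from the trailing unit; anything without K or M is taken as-is
  multiplier <- ifelse(str_detect(x, "K$"), 1e3,
                       ifelse(str_detect(x, "M$"), 1e6, 1))
  # First run of digits, optionally followed by a decimal part
  as.numeric(str_extract(x, "[0-9]+\\.?[0-9]*")) * multiplier
}

# Made-up example strings in the same style as the Value and Wage columns
example_values <- c("€110.5M", "€60K", "€0")
parsed <- parse_money(example_values)
print(parsed)
```

Keeping the parsing in one function means the Value and Wage columns can share the same logic rather than repeating the mutate steps.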


Overall Ratings

Player ratings are approximately normally distributed in FIFA 19, with a mean of 66.2387 and a standard deviation of 6.9089.

Age vs Overall Rating

Athletes in all domains have no doubt been looking for the elixir of life since the dawn of time, but to no avail unfortunately. When it comes to a player’s overall rating, it appears as though ratings grow on average until around 30 years of age, level off for a couple of years, and then start the inevitable decline at around 34 years.

When this relationship is explored by the major position groups, we can see that defender ratings tend to begin their decline earliest, at around 33 years of age, while the decline starts somewhere closer to 35 for both attackers and midfielders.

Player Valuations

Player valuations show a heavy positive skew, driven by the superstars of the game - Messi, Neymar, De Bruyne, etc.

Age vs Valuations

It is intuitive to think that players get better with age and experience, and that their valuations would reflect this relationship.

Plotting the relationship shows that valuations tend to increase until players reach their early 30s, begin declining between the ages of 31 and 35, and then decline rapidly for players older than 35. This is in line with the findings when the overall rating was plotted as a function of the player’s age.

Position vs Valuations

As expected, Forwards and Midfielders are going to cost you more than Defenders and Goalkeepers.

Specifically, it’s left and right forwards and left and right attacking-midfielders that are the most expensive positions.

Player Rating and Valuations

There is a strong positive relationship between a player’s rating and valuation, with a Spearman correlation of 0.9081. The Spearman method was used to calculate the correlation because of the presence of large outliers.
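To illustrate why the choice of method matters, here is a small sketch on made-up numbers (not the FIFA data). Value rises steeply and monotonically with rating, so the rank-based Spearman correlation is exactly 1, while the Pearson correlation is dragged down by the non-linear shape and the one extreme value.

```r
# Toy data (made up): value grows steeply with rating, with one extreme outlier
rating <- c(60, 65, 70, 75, 80, 85, 90, 94)
value  <- c(0.2, 0.5, 1.5, 5, 15, 40, 80, 200) * 1e6

# Pearson measures linear association on the raw values
pearson <- cor(rating, value, method = "pearson")

# Spearman measures association between the ranks, so it ignores the size of the outlier
spearman <- cor(rating, value, method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```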

Correlations with a player’s Overall rating

The below correlation matrix data displays the correlations between the Overall rating and other key attribute variables for all players except goalkeepers. Both the Pearson and Spearman correlation methods were used to display the differences.

The Spearman correlation method is more robust to extreme outliers, which is why player Value has the highest Spearman correlation (0.92). Reactions, Composure, Special and (surprisingly) Wage round out the top 5 correlated variables with Overall rating using the Spearman method. I say surprisingly because Soccernomics (Kuper & Szymanski, 2014) stated the best predictor of team success was the team’s wage bill.

Using the Pearson method, we get different results. The top 5 correlated variables with the Overall rating are Reactions, Composure, Special, ShortPassing and BallControl. Neither Value nor Wage appears in this list because both contain large outliers, to which the Pearson method is sensitive.

Feature              Spearman   Pearson
ValueNumeric_pounds  0.9160253  0.6346473
Reactions            0.8430347  0.8477221
Composure            0.7928322  0.8017716
Special              0.7808008  0.7959002
WageNumeric_pounds   0.7786888  0.5755838
BallControl          0.7327561  0.7178017
ShortPassing         0.7209597  0.7226152
Potential            0.6148446  0.6506847
ShotPower            0.5928283  0.5629406
LongPassing          0.5878514  0.5853744

Which positions are skilled in which attributes

The heat map below displays the median attributes for each position available. The analysis was done on players other than goalkeepers, and also only on players with an overall rating of 75 or higher.

It shows that centre backs are high on strength and are typically strong in aggression. Right and Left Wingers and Left Forwards are very agile, while Left and Right Midfielders have great acceleration.


Helping me manage a team

The next section is devoted to analysis that will assist those wanting to manage a team. I will analyse which players to target if you want to manage a team in rebuilding mode, which players to target on a low budget if managing lower division or lower-budget teams, and which teams to manage if you want an attacking team or a defensive team.

Young Player Analysis

If you want to be a manager who gives youngsters a chance and watches them blossom, then the players listed below might be ones to target.

The violet bars indicate the difference between the player’s potential and their current overall rating. J. von Moos, a striker (17 years old and costs €280K) and D. Campbell, a central midfielder (also 17 years old and only costs €60K) are the two players with the most room to grow, with both players having a differential of 26. R. Griffiths, also a striker, but 18 years old, has a potential of 84, but will cost you €575K. I know who I’d be choosing between Griffiths and the previously mentioned J. von Moos, who also has a potential rating of 84.

If it’s a real bargain you’re looking for, the 16 year old central midfielder B. Mumba will only cost you €190K, but has a potential rating of 80… Juicy!

When does potential end?

This now raises an interesting topic… is there an age that players finally realise all of their potential, or is there room to grow throughout their careers?

It appears as though a player’s potential and their overall rating converge at approximately 30 years of age.
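A rough sketch of how this convergence could be checked is to average the gap between Potential and Overall at each age. The toy rows below are made up for illustration; with the real data the same pipeline would start from fifa_data.

```r
library(dplyr)

# Toy data (made up): younger players have more room to grow
toy_players <- tibble(
  Age       = c(18, 18, 24, 24, 30, 30),
  Overall   = c(60, 64, 72, 76, 80, 78),
  Potential = c(80, 82, 78, 80, 80, 78)
)

# Average difference between potential and current rating at each age
gap_by_age <- toy_players %>%
  group_by(Age) %>%
  summarise(MeanGap = mean(Potential - Overall), .groups = "drop")

print(gap_by_age)
```

The age at which MeanGap reaches zero is the point where potential and overall rating meet.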

Free Valuation Players

Are there some real bargains to be had? It would appear that there are! The plot below shows all players that have a free valuation, with each player’s age against their current overall rating.

L. Paredes! Depending on what you’re looking for, the top-left and top-right quadrants are where the players to target are. The best player to target is L. Paredes, a fairly young CM at 24 with an overall rating of 80 (and even better still a potential of 85).

P. Mahlambi, a 20 year old South African CAM with a potential of 84, is another big target, as are the 18 year old RCB B. Méndez and 19 year old CAM I. Hagi.


Which Team do I want to manage

This section is designed to help future managers decide what team to manage, based on various factors.

I will include an analysis on the youngest and oldest teams, the most talented teams, the great attacking teams, most expensive teams and teams with the highest player valuations.

Team Age

If you’re the type of manager who likes to take charge of younger teams, then some of the European teams might be for you. FC Nordsjælland (Danish) is the youngest team on average in FIFA 19, with an average age of 20.3 years old, followed by FC Groningen (Dutch) at 21.4 years old and Bohemian FC (Irish) at 21.5 years old. Interestingly, Dutch powerhouses PSV and Ajax both make the list of the 20 youngest teams. They’re certainly the type of teams I’d be looking towards.

At the other end of the spectrum, if managing older teams is your thing, then South America is the place for you.

Team Overall Talent

If you’re trying to pick the most talented team, how you choose to interpret “talent” can have an impact on which team you finally settle for.

When looking at the average overall ratings for each team, four of the five highest average rated teams come from Serie A (Juventus, Napoli, Inter and Milan), with Real Madrid the only non-Italian side in the top 5. FC Barcelona, PSG, Roma, Man Utd and, surprisingly, SL Benfica round out the top 10.

Now if you wish to define “talent” as the highest number of superstars, then your selection will be slightly different.

I have chosen to define superstars as those players whose rating is 85 or above. There are 110 players in the game with this designation.

Looking at the top three from the previous plot, we can see that it’s a much different story. Juventus, the highest average overall rated team, only have the third-highest number of superstars, replaced at the top by Real Madrid and Manchester City. The story is even more pronounced for Napoli, who went from second in the previous measurement to 11th here, with Inter also dropping from 3rd to 8th.

The Galacticos of Real Madrid proudly sit atop this perch with 12 players on the list having an overall rating of 85 or over.

Interestingly, only the top leagues are represented in the list of clubs with more than one superstar.
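Counting superstars per club comes down to a filter and a count. The sketch below uses made-up players and clubs, with the 85-or-above threshold taken from the definition above.

```r
library(dplyr)

# Toy data (made up for illustration)
toy_players <- tibble(
  Name    = c("A", "B", "C", "D", "E"),
  Club    = c("Club X", "Club X", "Club Y", "Club Y", "Club Z"),
  Overall = c(91, 85, 88, 79, 84)
)

# Keep players rated 85 or above, then count them per club (largest first)
superstars_by_club <- toy_players %>%
  filter(Overall >= 85) %>%
  count(Club, name = "Superstars", sort = TRUE)

print(superstars_by_club)
```

Clubs with no superstars drop out of the result entirely, which is why only a subset of clubs appears in a ranking like this.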

Deadly Teams Up Front

How does one determine the most attacking teams? The method used in this analysis takes an average rating of the five main attributes related to goals - Finishing, LongShots, Penalties, ShotPower and Positioning. The attributes are added together and an average rating is calculated. To determine the teams with the highest attacking rating, only midfielders and attackers are considered.
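The calculation described above can be sketched as follows. The toy players are made up, and the exact steps are my reading of the description (a per-player average of the five attributes, then a per-club mean) rather than a copy of the original code; the column names follow those used elsewhere in this post.

```r
library(dplyr)

# Toy data (made up for illustration)
toy_players <- tibble(
  Club          = c("Club X", "Club X", "Club Y", "Club Y"),
  PositionGroup = c("FWD", "DEF", "MID", "FWD"),
  Finishing     = c(80, 30, 70, 90),
  LongShots     = c(80, 30, 70, 80),
  Penalties     = c(80, 30, 70, 70),
  ShotPower     = c(80, 30, 70, 80),
  Positioning   = c(80, 30, 70, 80)
)

team_attack <- toy_players %>%
  # Only midfielders and attackers are considered
  filter(PositionGroup %in% c("MID", "FWD")) %>%
  # Per-player average of the five goal-related attributes
  mutate(AttackingRating = (Finishing + LongShots + Penalties + ShotPower + Positioning) / 5) %>%
  group_by(Club) %>%
  summarise(NumberOfAttackers = n(),
            TeamAttackingRating = mean(AttackingRating),
            .groups = "drop") %>%
  arrange(desc(TeamAttackingRating))

print(team_attack)
```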

The average ratings of the most dangerous teams in front of goal are displayed below. Inter (12 midfielders and attackers) and Juventus (13 midfielders and attackers) are the most dangerous teams in front of goal, clearly ahead of the third most dangerous club, Napoli (13 midfielders and attackers). Interestingly, a few other leagues are represented in the 20 most dangerous teams up front - Portuguese, Turkish and Greek, in addition to the usual suspects in England, Italy, Germany, Spain and France.

Club                    NumberOfAttackers  TeamAttackingRating
Inter                   12                 74.43333
Juventus                13                 74.03077
Napoli                  13                 71.72308
Manchester United       19                 71.50526
Sporting CP             17                 71.15294
FC Porto                16                 71.05000
FC Barcelona            18                 71.03333
Real Madrid             17                 71.01176
SL Benfica              16                 70.87500
FC Bayern München       17                 70.82353
SC Braga                15                 70.78667
Roma                    13                 70.60000
Manchester City         16                 70.37500
Milan                   14                 70.27143
Olympique de Marseille  12                 70.13333
FC Schalke 04           16                 69.12500
Valencia CF             17                 69.01176
Beşiktaş JK             14                 68.75714
Paris Saint-Germain     17                 68.51765
SV Werder Bremen        17                 68.45882

Team Wage Bills and Value for Money

Real Madrid and Barcelona blow the other teams away when it comes to wage bills, with weekly wage bills sitting at just over €5M and €4.83M respectively. Man City is a distant third with a weekly wage bill of €3.7M.

Surprisingly, Inter Milan, one of the most highly rated teams in terms of overall talent (see above), comes in at number 16 on the list - talk about value for money! Even more strangely, Premier League sides Everton and West Ham have higher wage bills than Inter! What’s worse for those two EPL clubs is that Milan, Napoli and Inter were three of the top five overall rated clubs in the game. These three clubs therefore offer great value for money: each overall rating point costs Inter only €20K, as opposed to Everton’s €27K. Of course, both figures are dwarfed by the two Spanish giants, who each spend over €60K per point.

At the other end of the spectrum, there are some clubs that are exceptionally efficient with their wage bills. Highlighted clubs are those that have an average overall rating over 70. Spartak Moscow are the leaders, with each rating point only costing €369 - what a difference from Real Madrid!
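One way to arrive at a cost per rating point is to divide a club’s weekly wage bill by its average overall rating. That formula is an assumption on my part (it is consistent with the figures quoted above), and the toy clubs and wages below are made up.

```r
library(dplyr)

# Toy data (made up for illustration)
toy_players <- tibble(
  Club               = c("Club X", "Club X", "Club Y", "Club Y"),
  Overall            = c(90, 80, 70, 80),
  WageNumeric_pounds = c(300000, 200000, 10000, 20000)
)

club_value <- toy_players %>%
  group_by(Club) %>%
  summarise(WageBill   = sum(WageNumeric_pounds),
            AvgOverall = mean(Overall),
            .groups = "drop") %>%
  # Assumed definition: weekly wage bill per point of average overall rating
  mutate(CostPerPoint = WageBill / AvgOverall) %>%
  arrange(CostPerPoint)

print(club_value)
```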


K-Means Clustering

Trying to find players who are similar can be a very difficult task. There are close to 100 variables that you might need to analyse to determine “similarity” between players. Couple that with comparing multiple players and it’s almost impossible.

Enter K-Means!

K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells https://en.wikipedia.org/wiki/K-means_clustering.

The steps used are outlined below:

  1. Filter out Goalkeepers and any players without a position listed
  2. Select only numeric variables, mainly the attribute variables. Player value, wages and overall rating have been omitted from the data so that these variables don’t sway our groupings, allowing for the clusters to contain like-for-like players based off their skillsets.
  3. Scale the data - without scaling, some variables, especially Special, will skew the clusters toward that variable.
  4. Run K-Means, looping through a number of cluster centres to find the optimal number (k). This can be done visually, by looking for where the elbow of the plot “bends”.
  5. Set the number of centres (8 in this example)
  6. Re-run the model with K = 8
  7. Assign the cluster group back to the main data
# Get data ready
data_cluster <- fifa_data %>%
  filter(PositionGroup != "Unknown") %>%
  filter(PositionGroup != "GK") %>%
  mutate(RoomToGrow = Potential - Overall) %>%
  select_if(is.numeric) %>%
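  # AttackingRating is created in code not shown in this post, so any_of() drops it only if it is present
  select(-ID, -`Jersey Number`, -any_of("AttackingRating"), -starts_with("Value"), -starts_with("Wage"), -starts_with("GK"), -Overall)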


scaled_data <- scale(data_cluster)

set.seed(109)
# Initialize total within sum of squares error: wss
wss <- 0
# For 1 to 30 cluster centers
for (j in 1:30) {
  km.out <- kmeans(scaled_data, centers = j, nstart = 20)
  # Save total within sum of squares to wss variable
  wss[j] <- km.out$tot.withinss
}


# create a DF to use in a ggplot visualisation
wss_df <- data.frame(num_cluster = 1:30, wgss = wss)

# plot to determine optimal k
ggplot(data = wss_df, aes(x=num_cluster, y= wgss)) + 
  geom_line(color = "lightgrey", size = 2) + 
  geom_point(color = "green", size = 4) +
  theme_fivethirtyeight() +
  annotate("curve", x = 14, xend = 8, y = 300000, yend = 290500, arrow = arrow(length = unit(0.2, "cm")), size = 1, colour = "purple") +
  annotate("text", label = "k = 8\noptimal level", x = 14, y = 290000, colour = "purple") +
  labs(title = "Using Eight Clusters To Group Players", subtitle = "Selecting the point where the elbow 'bends', or where the slope of \nthe Within groups sum of squares levels out")

# Set k equal to the number of clusters corresponding to the elbow location
k <- 8

# Create the final k-means model on the scaled player data: wisc.km
wisc.km <- kmeans(scale(data_cluster), centers = k, nstart = 20)

# add the cluster group back to the original DF for all players other than GK and Unknown
cluster_data <- fifa_data %>%
  filter(PositionGroup != "Unknown") %>%
  filter(PositionGroup != "GK") %>%
  mutate(Cluster = wisc.km$cluster)

K-means has split the data into the following 8 clusters as outlined below.

Going by the table below, clusters 1 and 2 appear to represent the more defensive-midfielder types (made up mainly of DEFs and MIDs). Clusters 4 and 5 look to be grouping attacking-minded midfielders (made up mainly of MIDs and FWDs).

Clusters 3 and 7 are for the Defenders and Cluster 8 is for Forwards, while Cluster 6 is a smaller, mixed group. Note that k-means numbers its clusters arbitrarily, so the labels can change from run to run.

##    
##      DEF  FWD  MID
##   1 1137   43 1802
##   2 1498   39 1282
##   3 1443    7  136
##   4   16  742 1668
##   5   20  989 1001
##   6  143  269  614
##   7 1608    3  185
##   8    1 1326  150
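A cross-tabulation like the one above can be produced with base R’s table(). The vectors below are made up; with the real data the call would presumably be something like table(cluster_data$Cluster, cluster_data$PositionGroup), though the exact call is not shown in this post.

```r
# Toy cluster assignments and position groups (made up for illustration)
cluster        <- c(1, 1, 2, 2, 2, 3)
position_group <- c("DEF", "MID", "FWD", "FWD", "MID", "DEF")

# Rows are clusters, columns are position groups
cluster_counts <- table(cluster, position_group)

print(cluster_counts)
```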

Plotting the 20 highest rated players in each cluster shows us which players are similar in terms of their skill sets.

I am going to build a Shiny app based on this clustering analysis that lets users enter a player they wish to find a similar replacement for, and returns a list of the most similar players and their attributes.

For example, selecting I. Perišić from Inter as someone to replicate, the below results will appear.

The function takes three inputs: the player name, how many players you want returned, and how close to the player’s value you want the results to be (i.e. if you want all players within 10% of the player’s value, use 0.1).

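# cluster_analysis is not defined in the code shown above; this definition is an
# assumption inferred from the columns of the output table below
cluster_analysis <- cluster_data %>%
  select(ID, Name, Club, Age, PositionGroup, Overall, Cluster, ValueNumeric_pounds)

return_similar_players <- function(player, num_results, return_within_fraction) {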
  
  cluster_filter <- cluster_analysis$Cluster[cluster_analysis$Name == player]
  player_value <- cluster_analysis$ValueNumeric_pounds[cluster_analysis$Name == player]
  
  cluster_analysis %>%
    filter(Cluster == cluster_filter,
           ValueNumeric_pounds >= (player_value * (1- return_within_fraction)) & ValueNumeric_pounds <= (player_value * (1 + return_within_fraction))) %>%
    head(num_results)

}


return_similar_players("I. Perišić", 100, .05) %>% 
  kable(format = "html", escape = F) %>%
  kable_styling("striped", full_width = F)

ID      Name              Club                            Age  PositionGroup  Overall  Cluster  ValueNumeric_pounds
189332  Jordi Alba        FC Barcelona                    29   DEF            87       6        38000000
191043  Alex Sandro       Juventus                        27   DEF            86       6        36500000
184087  T. Alderweireld   Tottenham Hotspur               29   DEF            86       6        39000000
197445  D. Alaba          FC Bayern München               26   DEF            85       6        38000000
189513  Parejo            Valencia CF                     29   MID            85       6        37000000
187961  Paulinho          Guangzhou Evergrande Taobao FC  29   MID            85       6        37000000
184941  A. Sánchez        Manchester United               29   FWD            85       6        37500000
184267  Y. Brahimi        FC Porto                        28   MID            85       6        39000000
181458  I. Perišić        Inter                           29   MID            85       6        37500000
179844  Diego Costa       Atlético Madrid                 29   FWD            85       6        38500000
165153  K. Benzema        Real Madrid                     30   FWD            85       6        37000000
205498  Jorginho          Chelsea                         26   MID            84       6        38000000
204970  F. Thauvin        Olympique de Marseille          25   MID            84       6        39000000
200104  H. Son            Tottenham Hotspur               25   MID            84       6        37000000
212523  Anderson Talisca  Guangzhou Evergrande Taobao FC  24   MID            83       6        36500000

This code will be able to be re-used on FIFA 20 data when it comes out - feel free to use this as a basis for any further analysis!

Would love your thoughts on the analysis.

Jason Zivkovic
Data Scientist

A sports mad Data Scientist just having some fun.
