Fantasy football is a popular game played by millions of people around the world. Contestants draft a team of NFL players and are then awarded points based on their players' performance each week. While contestants have opportunities to improve their teams throughout the season, the draft is by far the biggest factor in a fantasy team's success. There is a wide array of draft strategies, with no consensus about which one is best. We attempt to find the best way to draft a fantasy football team using data science techniques.
We have identified two datasets to inform our analysis. The first contains player statistics and, more importantly, average draft position (ADP) for the years 2010-2019. It was compiled by FantasyFootballCalculator.com. The second contains fantasy statistics for players going back to 1970 and comes from FantasyFootballDataPros.com. We will use these datasets to compare players' statistical output with their average fantasy draft position. These insights may be used to inform more successful draft strategies.
For example, is it important to draft a running back in the first two rounds? In what round should a contestant draft a quarterback? We plan to go through each position and each round, seeking to maximize the draft value of both. There are six starting positions in fantasy football: quarterback, two running backs, two wide receivers, and tight end. Most leagues also include a defense and a kicker, although our datasets do not include these positions. We will find out in which rounds to target each position. However, it is slightly more complicated than this. For example, if I am picking but the top five running backs are gone, should I wait until round two to draft a running back? We will filter the data to exclude top players when seeking to answer these questions. At the end, all of these insights will be compiled into a single draft guide.
Completing this project will require collaboration and hard work from both team members. We will meet weekly over Zoom to discuss progress and plan work for the next week. We set up a private GitHub repository to enable version control and easy sharing of each other's work. Where necessary, we will work on the project simultaneously using Teletype for Atom. This plan will allow us to work together while still maintaining compliance with social distancing guidelines.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# This dataset contains each player's detailed fantasy statistics going
# back to 1970, though only 2010 and beyond are used. Each observation in the final dataframe
# is a player's performance in a given year.
# Load first data frame
df = pd.read_csv("data/yearly/2010.csv", index_col=0)
df["Year"] = "2010"
df["Rank"] = df["FantasyPoints"].rank(ascending=False)
df["PosRank"] = df.groupby("Pos")["Rank"].rank()
path = "data/yearly/2019.csv"
# Get data from the CSV files for 2011 through 2019
for year in range(2011, 2020):
    newpath = path.replace("2019", str(year))
    new_df = pd.read_csv(newpath, index_col=0)
    new_df["Year"] = str(year)
    new_df["Rank"] = new_df["FantasyPoints"].rank(ascending=False)
    new_df["PosRank"] = new_df.groupby("Pos")["Rank"].rank()
    # Outer merge on the shared columns; Year differs between seasons, so this appends rows
    df = df.merge(new_df, how="outer")
#create ID
df["ID"] = df["Player"] + " " + df["Year"]
df = df.set_index("ID")
#changing column names and changing team names
df.rename(columns = {'Player':'Name', 'Tm':'Team', 'Pos':'Position'}, inplace = True)
df.loc[df.Team == "KAN", ["Team"]] = "KC"
df.loc[df.Team == "OAK", ["Team"]] = "LV"
df.loc[df.Team == "GNB", ["Team"]] = "GB"
df.loc[df.Team == "NWE", ["Team"]] = "NE"
df.loc[df.Team == "STL", ["Team"]] = "LAR"
df.loc[df.Team == "SDG", ["Team"]] = "LAC"
df.loc[df.Team == "TAM", ["Team"]] = "TB"
df.loc[df.Team == "NOR", ["Team"]] = "NO"
df.loc[df.Team == "SFO", ["Team"]] = "SF"
df
path2 = "data/adp/2010.csv"
adp = pd.read_csv(path2, index_col=0)
adp["Year"] = "2010"
adp["ID"] = adp["Name"] + " " + adp["Year"]
adp = adp.set_index("ID")
for year in range(2011, 2020):
    newpath = path2.replace("2010", str(year))
    new_adp = pd.read_csv(newpath, index_col=0)
    new_adp["Year"] = str(year)
    new_adp["ID"] = new_adp["Name"] + " " + new_adp["Year"]
    new_adp = new_adp.set_index("ID")
    adp = pd.concat([adp, new_adp])
adp
#merge the two dataframes
data = df.merge(adp, on = ['ID', 'Name', 'Position', 'Team', 'Year'], how = 'outer')
data
This cell loads the weekly data, which is required for the season simulations.
#SETS UP DATA FRAME WITH PROPER COLUMNS
path = "data/weekly/2010/week1.csv"
df = pd.read_csv(path)
df["Year"] = "2010"
df["Week"] = "1"
# Week is stored as a string, so compare against "1" to drop the rows that
# would otherwise be duplicated when week 1 of 2010 is reloaded in the loop
df = df.loc[df.Week != "1"]
# GETS DATA FROM THE CSV FILES AND ADDS THEM TO THE DATAFRAME
for year in range(2010, 2020):
    for week in range(1, 18):
        newpath = path.replace("2010", str(year))
        newpath = newpath.replace("week1", "week" + str(week))
        new_df = pd.read_csv(newpath)
        new_df["Year"] = str(year)
        new_df["Week"] = str(week)
        df = pd.concat([df, new_df])
#creates ID
df["ID"] = df["Player"] + " " + df["Year"] + ' ' + df["Week"]
weekly = df.set_index("ID")
display(weekly)
In order to justify our research, it is important to first check how well the average fantasy player drafts their teams. If fantasy contestants can draft effectively without the insights generated by data science techniques, then there is no need to perform the analysis. We will examine how closely Average Draft Position (ADP) and fantasy production are correlated.
ax = sns.regplot(x=data['Overall'], y=data['FantasyPoints'], line_kws={'color':'red'})
ax.set_title("Average Draft Position vs Fantasy Production")
correlation = data['Overall'].corr(data['FantasyPoints'])
print("Correlation coefficient: ", correlation)
As you can see from the above graph, there is a tremendous amount of variance. Players drafted in early rounds may miss time with injury or underperform, while players drafted in later rounds may have breakout years. The correlation coefficient is a measly -0.409. Note that we are expecting a negative correlation coefficient because we expect low ADP players (players who were drafted earlier) to outperform high ADP players (players who were drafted later). While this number does show a correlation, it can absolutely be improved upon. Let's go a little deeper with this analysis.
# Each round is treated as 10 picks: round i+1 covers ADP values in (10*i, 10*i + 10]
for i in range(0, 16):
    low = 10 * i
    high = low + 10
    temp = data[(data.Overall > low) & (data.Overall <= high)]
    correlation = temp['Overall'].corr(temp['FantasyPoints'])
    print("Round ", i+1, "correlation coefficient: ", correlation)
Notice that as a draft progresses to later rounds, the correlation gets weaker. This makes logical sense. In the early rounds of a draft, star players and consistent performers are taken. Drafting a team becomes much more difficult in the later rounds, when these players are no longer available. Contestants may have to take a chance on an unproven rookie, or a previously inconsistent player. This means that the most important factor in drafting a good team is finding productive players in later rounds. This is where we will focus our analysis.
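The sign convention and the round-by-round breakdown can be checked on a toy example. The sketch below uses plain Python rather than pandas. The pick numbers, point totals, and three-team league size are all made up for illustration, and `pearson` and `round_of` are hypothetical helper names, not part of the project code.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def round_of(pick, num_teams):
    """Draft round (1-based) that an integer overall pick number falls in."""
    return (pick - 1) // num_teams + 1

# Hypothetical data: overall pick number and season fantasy points
picks = [1, 2, 3, 4, 5, 6]
points = [300, 250, 280, 150, 200, 90]

overall_r = pearson(picks, points)
print("overall:", round(overall_r, 3))

# Group the picks into rounds for an assumed three-team league
num_teams = 3
by_round = {}
for pick, pts in zip(picks, points):
    by_round.setdefault(round_of(pick, num_teams), []).append((pick, pts))

for rnd, pairs in sorted(by_round.items()):
    xs, ys = zip(*pairs)
    print("round", rnd, round(pearson(xs, ys), 3))
```

Because earlier picks tend to score more, the coefficient comes out negative; a value closer to -1 would mean drafters order players almost perfectly.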
If you are plenty familiar with why we need to break down fantasy player value by position, feel free to skip this part and head straight to the analysis!
Suppose we play a simplified version of fantasy football in which each team consists of 9 players regardless of what position they play. As in normal fantasy football, the season is split into 17 weeks, and the goal of each week is to outscore the opponent you are playing that week. A typical matchup would look like this:
Team 1 | | | Team 2 | | |
---|---|---|---|---|---|
Slot | Player | Points | Points | Player | Slot |
Slot 1 | Player 1 | 24 | 33 | Player 10 | Slot 1 |
Slot 2 | Player 2 | 11 | 9 | Player 11 | Slot 2 |
Slot 3 | Player 3 | 9 | 2 | Player 12 | Slot 3 |
Slot 4 | Player 4 | 18 | 5 | Player 13 | Slot 4 |
Slot 5 | Player 5 | 0 | 15 | Player 14 | Slot 5 |
Slot 6 | Player 6 | 14 | 0 | Player 15 | Slot 6 |
Slot 7 | Player 7 | 15 | 28 | Player 16 | Slot 7 |
Slot 8 | Player 8 | 11 | 16 | Player 17 | Slot 8 |
Slot 9 | Player 9 | 20 | 11 | Player 18 | Slot 9 |
Total | | 122 | 119 | | Total |
In this example, determining any player's value is easy because it is directly determined by how many points he scored that week. That means that if you could select any 9 players to be on your fantasy team, you would select the 9 players expected to score the most on average, thereby giving you the highest chance of winning any given week. Nearly every league has some sort of structure before the season, usually a snake draft, that lets teams take turns selecting players and avoids duplicates. Under this set of circumstances, the best draft pick one could make would always be the remaining player who is projected to score the most fantasy points. Let's have a quick look at what kinds of players should be targeted in this simplified league by taking the top scorers over the last ten years:
top_10 = data.sort_values(by="FantasyPoints", ascending = False).head(10)
display(top_10[["Position","FantasyPoints"]])
sns.countplot(data=top_10, x= "Position")
plt.title("Top 10 scoring players 2010-2019")
plt.show()
Surprise! Almost all of the top fantasy producers are quarterbacks or running backs! This means that all we have to do is draft quarterbacks first, then move to running backs, and so on down the line, right? Well, that is an excellent strategy for a positionless league, but since that structure is pretty mundane and lacking in difficulty, most leagues spice things up by placing positional restrictions on your lineup. A common matchup might look something like this:
Team 1 | | | Team 2 | | |
---|---|---|---|---|---|
Position | Player | Points | Points | Player | Position |
QB | Player 1 | 24 | 33 | Player 10 | QB |
RB1 | Player 2 | 18 | 14 | Player 11 | RB1 |
RB2 | Player 3 | 13 | 11 | Player 12 | RB2 |
WR1 | Player 4 | 18 | 5 | Player 13 | WR1 |
WR2 | Player 5 | 0 | 21 | Player 14 | WR2 |
TE | Player 6 | 7 | 10 | Player 15 | TE |
FLEX(RB,WR,TE) | Player 7 | 15 | 12 | Player 16 | FLEX(RB,WR,TE) |
D/ST | Player 8 | 8 | 3 | Player 17 | D/ST |
K | Player 9 | 10 | 10 | Player 18 | K |
Bench | | | Bench | | |
Any | Bench 1 | Not Counted | Not Counted | Bench 7 | Any |
Any | Bench 2 | Not Counted | Not Counted | Bench 8 | Any |
Any | Bench 3 | Not Counted | Not Counted | Bench 9 | Any |
Any | Bench 4 | Not Counted | Not Counted | Bench 10 | Any |
Any | Bench 5 | Not Counted | Not Counted | Bench 11 | Any |
Any | Bench 6 | Not Counted | Not Counted | Bench 12 | Any |
As you can see, our strategy of drafting all quarterbacks and running backs will no longer work here because we would not even be able to field a complete team! This means that instead of only focusing on getting the highest-scoring players, we need to make sure our team has good performers at each position in the lineup if we want to be successful. The tricky part, as our previous graph hinted, is that what a good performer looks like differs from position to position. Here is the average points per week for all players who play QB, RB, WR, and TE. We do not have data for defenses and kickers, so we omit them.
averages = data.groupby("Position").FantasyPoints.mean() / 17
print(averages[["QB",'RB','WR','TE']])
There is still a problem here! These point values are really low. Surely the average QB scores more than 6.5 points per week, and this data makes it seem like wide receivers score more points than running backs! To help resolve this problem, let's look more carefully at the distribution of players at each position.
d = data.loc[data["Position"].isin(["QB", "RB", "WR", "TE"])]
sns.set(rc={'figure.figsize': (11.7, 8.27)})
# Horizontal violin plot of season-long fantasy points for each position
with sns.axes_style("whitegrid"):
    sns.violinplot(data=d, y="Position", x="FantasyPoints", inner="box",
                   saturation=1, bw=.1)
    plt.title("Scoring Distribution of Each Position From 2010-2019")
plt.show()
As you can clearly see, the averages are being heavily affected by the majority of players who hardly score at all! As those familiar with fantasy football know, virtually every fantasy player is capable of drafting better than blindly picking players at random, and those who lack knowledge will use auto-drafting mechanisms provided by the hosting website. As a result, we are going to need to adjust our value to better compare players against those whom we are likely to face in actual competition.
To prove just how good owners are at extracting the most valuable players, we will examine how many of the top-scoring players each week and throughout a season are owned. In a perfectly managed fantasy league, all of the top performers over weekly and yearly time spans would be owned and started, meaning that in a 12-team league each of the top 12 QBs would be owned and started. Since at least 2 WRs start on each team, we should expect the top 12 * 2 = 24 WRs to be owned. While we do not have data on waiver wire transactions or weekly roster decisions, we can still use draft data to determine whether owners are sufficiently good at finding valuable players during the draft. Here is a breakdown of the success of fantasy owners at acquiring starting-level talent.
N = 12 # 12 team league
# creates a value for if players have been drafted according to ADP data
data["Drafted"] = data.Overall >= 0
#Gets players who are top 12 at their position in a year or are top 24 if RB or WR.
starter = data.loc[((data.PosRank <= N) & data.Position.isin(["QB", "TE"])) |
                   ((data.PosRank <= 2 * N) & data.Position.isin(["RB", "WR"]))]
#creates a new data frame with information about how many starting caliber players
#were drafted in a given year at each position
starter = pd.DataFrame(starter.groupby(["Year","Position"]).Drafted.sum())
starter = starter.reset_index()
display(starter.head())
#plot of how many starters have been drafted
sns.lineplot(data=starter, x='Year', y='Drafted', hue='Position')
plt.title("Starting Caliber Players Drafted by Year and Position")
plt.ylim(0,25)
plt.show()
#summary of plot
print("Here is the average number of starters drafted by position:")
print(starter.groupby("Position").Drafted.mean())
#calculates the rate at which starters are drafted by dividing by the total number of starters
starter["draft_rate"] = starter.apply(lambda row: row.Drafted / N if row.Position in ["QB", "TE"]
                                      else row.Drafted / (2 * N), axis=1)
#plot of draft rate by position
sns.lineplot(data=starter, x='Year', y='draft_rate', hue='Position')
plt.title("Draft Rate of Starters by Year and Position")
plt.ylim(0,1)
plt.show()
#summary of plot
print("Here is the average proportion of starters drafted by position:")
print(starter.groupby("Position").draft_rate.mean())
print("average starter draft rate", starter.groupby("Position").draft_rate.mean().mean())
#plot of average draft rate by year
sns.lineplot(data=starter, x='Year', y='draft_rate')
plt.title('Plot of Average Draft Rate by Year')
plt.ylim(0,1)
plt.show()
print("Here is a breakdown of draft rate by year")
print(starter.groupby("Year").draft_rate.mean())
As you can see here, the positions with one starter, QB and TE, are pretty close to the upper limit of 12, and WR and RB are pretty close to 24. This means that even though football seasons are full of unforeseeable circumstances such as injuries, almost all of the best players are taken during the draft. This is true for all the positions above as well as for each year in the sample of 2010-2019.
Because we have shown that fantasy owners are able to evaluate and acquire the top talent at each position, we can safely adjust our definition of player value to compare each player to the best among his peers. Specifically, we can use the distance between a player and the expected value of the starters at his position. This expected value is a measure of central tendency and can be calculated in two ways: the average or the median. We plan on using both to see which is better, although we suspect that the median will be more effective due to its resistance to outliers. As an example, in a 12-team league a QB's value in a given week might be +5 if he outperforms the average of the top 12 QB performances from that week by 5 points. Because we are building a drafting model that acts on predicted value for all of its players, and because we are drafting players for a whole year, our value function is the expected points of a player minus the expected value of points for the starters at that position. We divide this number by 16, the number of games each team plays in the NFL season, to make it more interpretable.
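As a rough illustration of this value function, here is a minimal sketch in plain Python. The season totals are invented, and `starter_baseline` and `weekly_value` are hypothetical helper names chosen for this example; the project's own implementation may differ.

```python
from statistics import mean, median

def starter_baseline(season_points, num_starters, method="median"):
    """Expected season points of a starter: mean or median of the top scorers."""
    top = sorted(season_points, reverse=True)[:num_starters]
    return median(top) if method == "median" else mean(top)

def weekly_value(player_points, baseline, games=16):
    """Points per game above (or below) the starter baseline."""
    return (player_points - baseline) / games

# Hypothetical season totals for every QB in one year (12-team league, 1 QB slot)
qb_points = [400, 350, 330, 320, 310, 300, 290, 280, 270, 260, 250, 240, 100, 20, 0]

median_base = starter_baseline(qb_points, 12, "median")
mean_base = starter_baseline(qb_points, 12, "mean")
print(median_base, mean_base)            # 295.0 300

# A QB projected for 375 points is worth 5 points per game over the median starter
print(weekly_value(375, median_base))    # 5.0
```

Only the top 12 totals enter the baseline, so the many low-scoring backups at the end of the list no longer drag the comparison point down.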
Now that we have an understanding of how fantasy football works and why it is important to analyze each position separately, we can finally get to the analysis! In this section, we will use machine learning to predict players' fantasy football production.
nextlist = []
for row in data.index:
    # IDs end in a four-digit year, so next season's ID is the name plus (year + 1)
    newIdx = row[:-4] + str(int(row[-4:]) + 1)
    if newIdx in data.index:
        nextpts = data.loc[newIdx, 'FantasyPoints']
        if isinstance(nextpts, float):
            nextlist.append(nextpts)
        else:
            #There is a small issue with the data, where some IDs occur multiple times, such as when a
            #player is traded in the middle of the season. This is the solution. If nextpts is a Series,
            #we take the sum of the values in that series while ignoring NaN.
            nextlist.append(np.nansum(nextpts))
    else:
        nextlist.append(0)
data['nextYrPts'] = nextlist
data[data['Position'] == 'QB'].head()
The first step is to create a new column called 'nextYrPts' which you can see in the dataframe above. This column contains the number of fantasy football points scored in that player's next season. For example, for the ID 'Michael Vick 2010', the column 'nextYrPts' holds the number of fantasy points scored by Michael Vick in 2011. This column is the Y variable in our machine learning models. This makes sense because we are trying to predict the upcoming season's fantasy output based on stats from last season.
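The same "look up next season" idea can be shown on a tiny example without pandas. This is only a sketch: the rows are invented, and keying a dictionary on `(name, year)` is an alternative to the string-based IDs used above, not the notebook's actual method.

```python
from collections import defaultdict

# Hypothetical rows: (player, year, fantasy points).
# A player traded mid-season can appear twice in the same year.
rows = [
    ("Player A", 2010, 250.0),
    ("Player A", 2011, 100.0),
    ("Player A", 2011, 80.0),
    ("Player B", 2011, 120.0),
]

# Sum points per (player, year) so duplicate rows collapse into one season total
season_total = defaultdict(float)
for name, year, pts in rows:
    season_total[(name, year)] += pts

# Next-year points for each row; players with no following season get 0
next_year_points = [season_total.get((name, year + 1), 0.0) for name, year, _ in rows]
print(next_year_points)  # [180.0, 0.0, 0.0, 0.0]
```

As in the notebook, a missing following season is recorded as 0, which means retired or unsigned players are treated as scoring nothing.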
import sklearn
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn import linear_model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score
from sklearn.linear_model import SGDRegressor
We need to separate the data by position and also into training and testing data. This was accomplished by using boolean masks to create new dataframes for each position, with each position being separated into training and testing dataframes. The training data is the years from 2010-2015 and the testing data is the years 2016-2018.
The X variables are the features that we chose for each position. The Y variable is the number of points scored next year.
From there, a dictionary was created from the training data with the features chosen for a given position. This dictionary is then transformed into a vector with dummy variables. Then, all of the variables are standardized. This is important because each feature has a different range. We do not want generally large features such as passing yards to outweigh smaller features such as interceptions. The features are standardized into Z-Scores, or the number of standard deviations away from the mean. Then, a model was chosen and fit to the training data.
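To make the standardization step concrete, here is a minimal sketch in plain Python of what z-scoring does to two features on very different scales. The numbers are invented, and `fit_scaler` and `z_scores` are hypothetical helper names. The population standard deviation is used because that is what scikit-learn's `StandardScaler` computes.

```python
from statistics import mean, pstdev

def fit_scaler(train_column):
    """Learn the mean and population standard deviation from training data."""
    return mean(train_column), pstdev(train_column)

def z_scores(column, mu, sigma):
    """Express each value as a number of standard deviations from the mean."""
    return [(x - mu) / sigma for x in column]

# Hypothetical training features on very different scales
train_yards = [4000, 3500, 3000, 2500]
train_ints = [16, 12, 10, 6]

yards_mu, yards_sigma = fit_scaler(train_yards)
ints_mu, ints_sigma = fit_scaler(train_ints)

yards_z = z_scores(train_yards, yards_mu, yards_sigma)
ints_z = z_scores(train_ints, ints_mu, ints_sigma)
print([round(z, 2) for z in yards_z])
print([round(z, 2) for z in ints_z])

# A new (test) row is scaled with the statistics learned from the training data
new_yards_z = z_scores([3250], yards_mu, yards_sigma)
```

After scaling, both columns have mean 0 and standard deviation 1, so a distance-based model such as K-Nearest-Neighbors no longer lets passing yards dominate interceptions.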
The next step is to use this model to make predictions of the testing data. We follow the same procedures that we did with the training data. We create a dictionary, turn it into a vector, and standardize it. Then we get the predicted fantasy points value. We repeat this process for every row in the testing dataframe using a for loop.
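Finally, we calculate the r-squared score. This score represents the proportion of the variance in the actual next-year points of the testing data that is explained by the model's predictions. A score of 1 means every prediction is exactly right, a score of 0 is no better than always predicting the mean, and the score can be negative for a model that does worse than that.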
A more detailed discussion of the features and model chosen for each position will follow each section.
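For reference, the r-squared score can be computed by hand as one minus the ratio of the squared prediction errors to the squared deviations from the mean. The sketch below follows the same definition as `r2_score`; the `r_squared` helper name and the numbers are made up for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [100, 200, 300]  # hypothetical next-year points

print(r_squared(actual, [100, 200, 300]))  # 1.0  (perfect predictions)
print(r_squared(actual, [200, 200, 200]))  # 0.0  (always predicting the mean)
print(r_squared(actual, [300, 200, 100]))  # -3.0 (worse than predicting the mean)
```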
#for quarterbacks
onlyQbs = data[(data["Position"] == 'QB') & (data["Year"] != '2019')
& (data["Year"] != '2018') & (data["Year"] != '2017') & (data["Year"] != '2016')]
onlyQbs = onlyQbs.fillna(0)
qbFeatures = ['Age', 'GS', 'Cmp', 'PassingAtt', 'Int', 'RushingAtt', 'RushingYds', 'RushingTD',
'PassingYds', 'PassingTD']
X_train_dict = onlyQbs[qbFeatures].to_dict(orient="records")
y_train = onlyQbs["nextYrPts"]
# Dummy encoding
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
# Standardization
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
# Nearest Neighbors Regression
model = KNeighborsRegressor(n_neighbors=35, weights = 'distance')
model.fit(X_train_sc, y_train)
predictions = []
# The vectorizer and scaler fit on the training data above are reused for the test data
onlyQbsTest = data[(data["Position"] == 'QB') & ((data["Year"] == '2016') | (data["Year"] == '2017')
                   | (data["Year"] == '2018'))]
onlyQbsTest = onlyQbsTest.fillna(0)
for row in onlyQbsTest.index:
    if isinstance(onlyQbsTest.loc[row].Age, float):
        X_new_dict = onlyQbsTest.loc[row, qbFeatures].to_dict()
    else:
        #There is a small issue with the data, where some IDs occur multiple times, such as when a
        #player is traded in the middle of the season. This is the solution. We create the dictionary manually,
        #taking the sum of the Series as the value for each key-value pairing in the dictionary.
        #This workaround occurs in all subsequent models too.
        temp = onlyQbsTest.loc[row, qbFeatures]
        X_new_dict = {}
        for feature in qbFeatures:
            X_new_dict[feature] = np.nansum(temp[feature])
    X_new = vec.transform(X_new_dict)
    X_new_sc = scaler.transform(X_new)
    predictions.append(model.predict(X_new_sc)[0])
print(r2_score(onlyQbsTest['nextYrPts'], predictions))
onlyQbsTest['projScore'] = predictions
The features we chose for quarterbacks are as follows: age, games started, completions, passing attempts, interceptions, rushing attempts, rushing yards, rushing touchdowns, passing yards, and passing touchdowns. We felt that other potential features such as fumbles have too much variance from year to year and would make the model unnecessarily complicated. The highest r-squared score came when we used K-Nearest-Neighbors (using regression because of course this is a regression problem rather than a classification problem). The number of neighbors was set to 35 and the influence of each neighbor on the final prediction was weighted by the distance from the test data point. The r-squared score, as displayed above, was about 0.4723.
#for running backs
onlyRbs = data[(data["Position"] == 'RB') & (data["Year"] != "2019")
& (data["Year"] != '2018') & (data["Year"] != '2017') & (data["Year"] != '2016')]
onlyRbs = onlyRbs.fillna(0)
rbFeatures = ['Age', 'GS','RushingAtt', 'RushingYds', 'RushingTD',
'Tgt', 'Rec', 'ReceivingYds', 'ReceivingTD']
X_train_dict = onlyRbs[rbFeatures].to_dict(orient="records")
y_train = onlyRbs["nextYrPts"]
# Dummy encoding
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
# Standardization
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
# Ridge Regression
model = linear_model.Ridge(alpha=0.001)
model.fit(X_train_sc, y_train)
predictions = []
# The vectorizer and scaler fit on the training data above are reused for the test data
onlyRbsTest = data[(data["Position"] == 'RB') & ((data["Year"] == '2016') | (data["Year"] == '2017')
                   | (data["Year"] == '2018'))]
onlyRbsTest = onlyRbsTest.fillna(0)
for row in onlyRbsTest.index:
    if isinstance(onlyRbsTest.loc[row].Age, float):
        X_new_dict = onlyRbsTest.loc[row, rbFeatures].to_dict()
    else:
        # Duplicate IDs (e.g. mid-season trades): sum each feature across the rows
        temp = onlyRbsTest.loc[row, rbFeatures]
        X_new_dict = {}
        for feature in rbFeatures:
            X_new_dict[feature] = np.nansum(temp[feature])
    X_new = vec.transform(X_new_dict)
    X_new_sc = scaler.transform(X_new)
    predictions.append(model.predict(X_new_sc)[0])
print(r2_score(onlyRbsTest['nextYrPts'], predictions))
onlyRbsTest['projScore'] = predictions
The features we chose for running backs are as follows: age, games started, rushing attempts, rushing yards, rushing touchdowns, targets, receptions, receiving yards, and receiving touchdowns. We felt that other potential features such as fumbles have too much variance from year to year and would make the model unnecessarily complicated. The highest r-squared score came when we used Ridge Regression. The alpha term regularizes the model, making it smoother and therefore less prone to overfitting. The r-squared score, as displayed above, was about 0.4658.
Note that we chose to group wide receivers and tight ends together into a category called pass catchers. This decision was because both positions only obtain fantasy points through their receiving statistics. In the NFL, tight ends perform other functions such as blocking, but these contributions are not quantifiable in fantasy football.
#for pass catchers
onlyWRsTEs = data[((data["Position"] == 'WR') | (data["Position"] == 'TE')) & (data["Year"] != "2019")
& (data["Year"] != '2018') & (data["Year"] != '2017') & (data["Year"] != '2016')]
onlyWRsTEs = onlyWRsTEs.fillna(0)
wrteFeatures = ['Age', 'GS','Tgt', 'Rec', 'ReceivingYds', 'ReceivingTD']
X_train_dict = onlyWRsTEs[wrteFeatures].to_dict(orient="records")
y_train = onlyWRsTEs["nextYrPts"]
# Dummy encoding
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
# Standardization
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
# Ridge Regression
model = linear_model.Ridge(alpha=0.001)
model.fit(X_train_sc, y_train)
predictions = []
# The vectorizer and scaler fit on the training data above are reused for the test data
onlyWRsTEsTest = data[((data["Position"] == 'WR') | (data["Position"] == 'TE')) &
                      ((data["Year"] == '2016') | (data["Year"] == '2017') | (data["Year"] == '2018'))]
onlyWRsTEsTest = onlyWRsTEsTest.fillna(0)
for row in onlyWRsTEsTest.index:
    if isinstance(onlyWRsTEsTest.loc[row].Age, float):
        X_new_dict = onlyWRsTEsTest.loc[row, wrteFeatures].to_dict()
    else:
        # Duplicate IDs (e.g. mid-season trades): sum each feature across the rows
        temp = onlyWRsTEsTest.loc[row, wrteFeatures]
        X_new_dict = {}
        for feature in wrteFeatures:
            X_new_dict[feature] = np.nansum(temp[feature])
    X_new = vec.transform(X_new_dict)
    X_new_sc = scaler.transform(X_new)
    predictions.append(model.predict(X_new_sc)[0])
print(r2_score(onlyWRsTEsTest['nextYrPts'], predictions))
onlyWRsTEsTest['projScore'] = predictions
The features we chose for pass catchers are as follows: age, games started, targets, receptions, receiving yards, and receiving touchdowns. We felt that other potential features such as fumbles have too much variance from year to year and would make the model unnecessarily complicated. The highest r-squared score came when we used Ridge Regression. The alpha term regularizes the model, making it smoother and therefore less prone to overfitting. The r-squared score, as displayed above, was about 0.5137.
projData = pd.concat([onlyQbsTest, onlyRbsTest, onlyWRsTEsTest])
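# Use display() so that both slices are shown, not just the last expression in the cell
display(projData[139:144])
display(projData[71:76])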
The r-squared scores we got above may seem low; however, the r-squared scores alone are misleading. It is not necessary that we project scores with a high degree of accuracy. Rather, we are just looking to be in the ballpark. Take the above slices of the dataframe for some examples. Our model projected that Jake Rudock would score 33.3 points in the 2018 season. Instead, he scored 0. This 33.3 point discrepancy will lower the r-squared, but again, we do not need a high degree of accuracy. Our drafting algorithm will not draft Rudock whether he is projected at 33.3 points or 0 points; there are too many far better options. It is only important that the model projects players like Russell Wilson and Cam Newton to perform better than Rudock. In other words, the projected score matters far less than the order of the projected scores.
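One way to measure whether the order is right, independent of the exact point values, is a rank correlation. The sketch below computes Spearman's rank correlation in plain Python. It is offered as an illustration only: the notebook itself reports r-squared, the numbers are invented, and the simple formula used here assumes there are no tied values.

```python
def ranks(values):
    """Rank values from 1 (largest) downward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation via the sum of squared rank differences (no ties)."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical projections and outcomes for four players
projected = [310, 280, 150, 33]
actual = [250, 330, 120, 0]

rho = spearman(projected, actual)
print(round(rho, 3))  # 0.8
```

The last player is projected 33 points too high, yet he is still ranked last in both lists, so that miss costs nothing here. Only the swap between the top two players lowers the score.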
Using the scoring predictions made on the test data, we aim to maximise the amount of value we can extract from the draft. Given that our definition of value relied on comparing each player relative to his peers, our model for drafting players should too. Our model uses two central ideas to select the best players
Based on these principles, we designed a draft algorithm that we believe makes good draft choices. It works by projecting which players are likely to be drafted before its next pick and then comparing what it thinks are the best players currently available at each position to what it thinks will be the best available the next time it picks. It chooses the predicted best player from the position that has the biggest gap. If it has consecutive picks, it will look ahead two picks on the first of the consecutive selections. We have also given the models smart roster constraints so that the draft produces balanced rosters.
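The core of this idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the code in DraftSimulator.py: it assumes a snake draft, assumes the other teams simply take the players with the lowest ADP, and leaves out the roster constraints and the two-pick look-ahead. All names and numbers are hypothetical.

```python
def picks_before_next_turn(slot, round_number, num_teams):
    """Selections made by other teams before this team picks again in a snake draft.
    slot is the team's 1-based position in the first round."""
    if round_number % 2 == 1:       # odd rounds run 1..N
        return 2 * (num_teams - slot)
    return 2 * (slot - 1)           # even rounds run N..1

def best_by_position(players):
    """Map each position to (name, projected points) of its best player."""
    best = {}
    for name, pos, proj, _adp in players:
        if pos not in best or proj > best[pos][1]:
            best[pos] = (name, proj)
    return best

def choose_pick(available, picks_before_next):
    """Take the best player at the position with the largest expected drop-off."""
    by_adp = sorted(available, key=lambda p: p[3])
    likely_gone = {p[0] for p in by_adp[:picks_before_next]}
    remaining = [p for p in available if p[0] not in likely_gone]
    now = best_by_position(available)
    later = best_by_position(remaining)

    def gap(pos):
        return now[pos][1] - (later[pos][1] if pos in later else 0.0)

    return now[max(now, key=gap)][0]

# Hypothetical board: (name, position, projected points, ADP)
available = [
    ("QB 1", "QB", 320.0, 5.0), ("QB 2", "QB", 310.0, 9.0),
    ("RB 1", "RB", 250.0, 3.0), ("RB 2", "RB", 180.0, 12.0),
    ("WR 1", "WR", 230.0, 4.0), ("WR 2", "WR", 215.0, 8.0),
]

wait = picks_before_next_turn(slot=3, round_number=1, num_teams=4)
print(wait)                          # 2
print(choose_pick(available, wait))  # RB 1
```

Here the quarterbacks have the highest projections, but waiting costs nothing at that position, while the running back drop-off is 70 points, so the sketch takes the running back. A return value of 0 from `picks_before_next_turn` corresponds to the consecutive picks at the turn of a snake draft.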
In order to test our algorithm and value predictions, we designed a couple of other drafting algorithms:
Perfect: the same as our algorithm but it will use actual fantasy data from that year instead of projections
Smart_ADP: Will draft the best available player according to ADP data. It makes an exception when it needs to fill out its starting lineup before moving on to its bench.
These algorithms compete against ours in simulated fantasy drafts. The resulting teams are then run through a season simulation in which the optimal lineup of each team is extracted each week and fractional wins are determined based on how many rivals any given team outscored that week out of the entire league. At the end of the regular season, the four winningest teams are matched up for a head to head postseason in which an ultimate champion is determined.
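The two scoring ideas in this simulation, picking the optimal lineup after the fact and awarding fractional wins, can be sketched as follows. This is a minimal illustration under assumed rules (one QB, two RBs, two WRs, one TE, one FLEX, ties counting as not outscoring), with hypothetical function names and made-up scores. The actual logic lives in DraftSimulator.py and may differ.

```python
def optimal_lineup_points(roster, slots, flex_positions=("RB", "WR", "TE"), num_flex=1):
    """Best possible weekly total from (position, points) pairs under the slot limits."""
    by_pos = {}
    for pos, pts in roster:
        by_pos.setdefault(pos, []).append(pts)
    total, leftovers = 0.0, []
    for pos, scores in by_pos.items():
        scores.sort(reverse=True)
        count = slots.get(pos, 0)
        total += sum(scores[:count])          # fill the fixed slots with the top scorers
        if pos in flex_positions:
            leftovers.extend(scores[count:])  # the rest compete for the FLEX slot
    leftovers.sort(reverse=True)
    return total + sum(leftovers[:num_flex])

def fractional_wins(weekly_scores):
    """Fraction of rivals each team outscored in one week."""
    rivals = len(weekly_scores) - 1
    return {
        team: sum(score > other for name, other in weekly_scores.items() if name != team) / rivals
        for team, score in weekly_scores.items()
    }

# Hypothetical roster for one week
roster = [("QB", 24), ("QB", 30), ("RB", 18), ("RB", 13), ("RB", 15),
          ("WR", 18), ("WR", 0), ("WR", 9), ("TE", 7)]
slots = {"QB": 1, "RB": 2, "WR": 2, "TE": 1}
print(optimal_lineup_points(roster, slots))  # 110.0

# Hypothetical weekly totals for a four-team league
week = {"Team 1": 122, "Team 2": 119, "Team 3": 98, "Team 4": 131}
wins = fractional_wins(week)
print(wins["Team 4"], wins["Team 3"])        # 1.0 0.0
```

Scoring every team against the whole league each week removes the luck of the schedule: a team with the second-highest score always earns most of a win, even if a head-to-head schedule would have matched it against the top scorer.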
Our first test is placing our predictive algorithm against 9 Smart_ADP drafters and running a simulation from years 2016-2018. The code for the drafting algorithms and the league simulation is found in our repository in DraftSimulator.py. Here we will simply import it and use the simulation function to run league simulations over a given time interval with each team getting a chance to draft in each slot. (This may take quite a while)
from DraftSimulator import *
smart = {
    "Predictive": 1,
    "Smart_ADP": 9
}
results = full_sim(smart,2016,2018,1,"standard",positions, projData, weekly)
With our simulation complete we can analyse the data.
#CREATES AND CLEAN A NEW DATAFRAME ORGANISED BY TEAM NAME AND POINTS SCORED
frame = pd.DataFrame(results.groupby("Name").Points.mean())
frame= frame.reset_index()
display(frame)
#GRAPHS THE AVERAGE POINTS SCORED OF EACH TEAM
frame["Predictive"] = frame.Name == "predictive1"
sns.barplot(data=frame,x="Name",y="Points", hue="Predictive")
plt.xlabel("Team")
plt.ylabel("Average Points Scored")
plt.title("Average Points Scored by Team from 2016-2018")
plt.show()
#CREATES AND CLEANS A NEW DATAFRAME ORGANISED BY TEAM NAME AND TOTAL WINS
results["Wins"] = results.Rank == 0
# Summing the boolean column counts the seasons each team finished first
frame = results.groupby("Name").Wins.sum()
frame = frame.reset_index()
#GRAPHS THE TOTAL WINS OF EACH TEAM
frame["Predictive"] = frame.Name == "predictive1"
sns.barplot(data=frame,x="Name", y="Wins", hue="Predictive")
plt.xlabel("Team")
plt.ylabel("Total Wins")
plt.title("Total wins in 30 simulated seasons")
plt.show()
Wow! As you can see, our algorithm outperforms its competitors significantly. Not only does it regularly outscore the others, it also translates that scoring into championships in 17 of the 30 seasons. Next, we compare our results to those from running the perfect algorithm against the same drafters.
perf = {
    "Perfect": 1,
    "Smart_ADP": 9
}
results = full_sim(perf,2016,2018,1,"standard",positions, projData, weekly)
#CREATES AND CLEAN A NEW DATAFRAME ORGANISED BY TEAM NAME AND POINTS SCORED
frame = pd.DataFrame(results.groupby("Name").Points.mean())
frame= frame.reset_index()
display(frame)
#GRAPHS THE AVERAGE POINTS SCORED OF EACH TEAM
frame["Perfect"] = frame.Name == "perfect1"
sns.barplot(data=frame,x="Name",y="Points", hue="Perfect")
plt.xlabel("Team")
plt.ylabel("Average Points Scored")
plt.title("Average Points Scored by Team from 2016-2018")
plt.show()
#CREATES AND CLEANS A NEW DATAFRAME ORGANISED BY TEAM NAME AND TOTAL WINS
results["Wins"] = results.Rank == 0
# Summing the boolean column counts the seasons each team finished first
frame = results.groupby("Name").Wins.sum()
frame = frame.reset_index()
#GRAPHS THE TOTAL WINS OF EACH TEAM
frame["Perfect"] = frame.Name == "perfect1"
sns.barplot(data=frame,x="Name", y="Wins", hue="Perfect")
plt.xlabel("Team")
plt.ylabel("Total Wins")
plt.title("Total wins in 30 simulated seasons")
plt.show()
The perfect algorithm performs better than ours, but that is what happens when you have the answers to the test ahead of time! What we can learn here is that our machine learning predictions, when combined with a solid drafting strategy, are strong enough to handily outperform algorithms that mimic human behavior. There is still room for improvement, since the perfect model was able to win significantly more often and score more points overall. To conclude, while no human can ever reliably draft perfectly, using the principles of value discussed above together with accurate predictions of player value, we can get closer than you might expect.