Examining Fantasy Football Draft Trends¶

By Gabe Harris and Jacob Geisberg¶

Our website ¶

Why Fantasy Football?¶

Fantasy football is a popular game played by millions of people around the world. Contestants draft a team of NFL football players and are then awarded points based on their players' performance each week. While there are opportunities for contestants to improve their teams throughout the season, the draft is by far the biggest factor in a fantasy team's success. There is a wide array of draft strategies, with no consensus about which one is best. We attempt to find the best way to draft a fantasy football team using data science techniques.

We have identified two datasets that will inform our analysis. The first contains player statistics and, more importantly, average draft position for the years 2010-2019. This dataset was compiled by FantasyFootballCalculator.com. The second contains fantasy statistics for players going back to 1970. This dataset is from FantasyFootballDataPros.com. We will use these datasets to compare players' statistical output with their average fantasy draft position. These insights may be used to inform more successful draft strategies.

For example, is it important to draft a running back in the first two rounds? In what round should a contestant draft a quarterback? We plan to go through each position and each round, seeking to maximize the draft value of both. There are six starting spots in fantasy football: quarterback, two running backs, two wide receivers, and tight end. Most leagues also include defense and kicker, although our datasets do not include these positions. We will find out in which rounds to target each position. However, it is slightly more complicated than this. For example, if I am picking, but the top five running backs are gone, should I wait until round two to draft a running back? We will filter the data to exclude top players when seeking to answer these questions. At the end, all of these insights will be compiled into a single draft guide.

Collaboration Plan¶

Completing this project will require collaboration and hard work from both team members. We will meet weekly over Zoom to discuss progress and plan work for the next week. We set up a private GitHub repository to enable version control and easy sharing of each other's work. Where necessary, we will work on the project simultaneously using Teletype for Atom. This plan will allow us to work together while still maintaining compliance with social distancing guidelines.

Data ETL¶

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


# This data set contains detailed fantasy statistics for each player going
# back to 1970, though only 2010 and beyond are used. Each observation in the final dataframe
# is a player's performance in a given year.

# Load first data frame
df = pd.read_csv("data/yearly/2010.csv", index_col=0)
df["Year"] = "2010"
df["Rank"] = df["FantasyPoints"].rank(ascending=False)
df["PosRank"] = df.groupby("Pos")["Rank"].rank()


path = "data/yearly/2019.csv"

# get data from csv files from 2011 to 2019
for year in range(2011,2020):
    newpath = path.replace("2019", str(year))
    new_df = pd.read_csv(newpath, index_col=0)
    new_df["Year"] = str(year)
    new_df["Rank"] = new_df["FantasyPoints"].rank(ascending=False)
    new_df["PosRank"] = new_df.groupby("Pos")["Rank"].rank()
    df = df.merge(new_df,how="outer")

#create ID
df["ID"] = df["Player"] + " " + df["Year"]
df = df.set_index("ID")

#changing column names and changing team names
df.rename(columns = {'Player':'Name', 'Tm':'Team', 'Pos':'Position'}, inplace = True)
df.loc[df.Team == "KAN", ["Team"]] = "KC"
df.loc[df.Team == "OAK", ["Team"]] = "LV"
df.loc[df.Team == "GNB", ["Team"]] = "GB"
df.loc[df.Team == "NWE", ["Team"]] = "NE"
df.loc[df.Team == "STL", ["Team"]] = "LAR"
df.loc[df.Team == "SDG", ["Team"]] = "LAC"
df.loc[df.Team == "TAM", ["Team"]] = "TB"
df.loc[df.Team == "NOR", ["Team"]] = "NO"
df.loc[df.Team == "SFO", ["Team"]] = "SF"
df

path2 = "data/adp/2010.csv"
adp = pd.read_csv(path2, index_col=0)
adp["Year"] = "2010"
adp["ID"] = adp["Name"] + " " + adp["Year"]
adp = adp.set_index("ID")

for year in range(2011,2020):
    newpath = path2.replace("2010", str(year))
    new_adp = pd.read_csv(newpath, index_col=0)   
    new_adp["Year"] = str(year)
    new_adp["ID"] = new_adp["Name"] + " " + new_adp["Year"]
    new_adp = new_adp.set_index("ID")
    adp = pd.concat([adp, new_adp])

adp

#merge the two dataframes
data = df.merge(adp, on = ['ID', 'Name', 'Position', 'Team', 'Year'], how = 'outer')
data

This cell loads the weekly data, which is required for the season simulations.

#SETS UP DATA FRAME WITH PROPER COLUMNS
path = "data/weekly/2010/week1.csv"
df = pd.read_csv(path)
df["Year"] = "2010"
df["Week"] = '1'
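# Week is stored as a string, so compare against '1'. This empties the frame
# (keeping its columns) so that week 1 of 2010 is not loaded twice by the loop below.
df = df.loc[df.Week != '1']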


# GETS DATA FROM THE CSV FILES AND ADDS THEM TO THE DATAFRAME
for year in range(2010,2020):
    for week in range(1,18):
        newpath = path.replace("2010", str(year))
        newpath = newpath.replace("week1", "week"+ str(week))
        new_df = pd.read_csv(newpath)
        new_df["Year"] = str(year)
        new_df["Week"] = str(week)
        df = pd.concat([df,new_df])

#creates ID
df["ID"] = df["Player"] + " " + df["Year"] + ' ' + df["Week"]
weekly = df.set_index("ID")
display(weekly)

How good are fantasy football contestants at drafting their teams?¶

In order to justify our research, it is important to first check how well the average fantasy player drafts their teams. If fantasy contestants can draft effectively without the insights generated by data science techniques, then there is no need to perform the analysis. We will examine how closely Average Draft Position (ADP) and fantasy production are correlated.

# pass x and y by keyword; recent seaborn releases no longer accept them positionally
ax = sns.regplot(x=data['Overall'], y=data['FantasyPoints'], line_kws={'color':'red'})
ax.set_title("Average Draft Position vs Fantasy Production")

correlation = data['Overall'].corr(data['FantasyPoints'])
print("Correlation coefficient: ", correlation)

Correlation coefficient:  -0.4091022498935128

As you can see from the above graph, there is a tremendous amount of variance. Players drafted in early rounds may miss time with injury or underperform, while players drafted in later rounds may have breakout years. The correlation coefficient is a measly -0.409. Note that we are expecting a negative correlation coefficient because we expect low ADP players (players who were drafted earlier) to outperform high ADP players (players who were drafted later). While this number does show a correlation, it can absolutely be improved upon. Let's go a little deeper with this analysis.

for i in range(0,16):
    low = 10 * i
    high = low+10
    # keep only the picks that fall inside this round: low < Overall <= high
    temp = data[(data.Overall > low) & (data.Overall <= high)]
    
    correlation = temp['Overall'].corr(temp['FantasyPoints'])
    print("Round ", i+1, "correlation coefficient: ", correlation)

Round  1 correlation coefficient:  -0.37345873440939875
Round  2 correlation coefficient:  -0.32522488766476665
Round  3 correlation coefficient:  -0.2923599009834545
Round  4 correlation coefficient:  -0.26111465946652523
Round  5 correlation coefficient:  -0.2378034817365984
Round  6 correlation coefficient:  -0.21037826215452796
Round  7 correlation coefficient:  -0.18120549092331045
Round  8 correlation coefficient:  -0.16837739590725606
Round  9 correlation coefficient:  -0.17124651858532655
Round  10 correlation coefficient:  -0.18083275888974845
Round  11 correlation coefficient:  -0.14054298936802287
Round  12 correlation coefficient:  -0.08534819613451754
Round  13 correlation coefficient:  0.03552463654305286
Round  14 correlation coefficient:  0.018297062713047263
Round  15 correlation coefficient:  0.0377254625326922
Round  16 correlation coefficient:  -0.20013534528794819

Notice that as a draft progresses to later rounds, the correlation gets weaker. This makes logical sense. In the early rounds of a draft, star players and consistent performers are taken. Drafting a team becomes much more difficult in the later rounds, when these players are no longer available. Contestants may have to take a chance on an unproven rookie, or a previously inconsistent player. This means that the most important factor in drafting a good team is finding productive players in later rounds. This is where we will focus our analysis.

Why analyze by position?¶

If you are plenty familiar with why we need to break down fantasy player value by position, feel free to skip this part and head straight to the analysis!

Suppose we play a simplified version of fantasy football in which each team consists of 9 players regardless of what position they play. As in normal fantasy football, the season is split into 17 weeks, and the goal of each week is to outscore the opponent you are playing that week. A typical matchup would look like this:

	Team 1			Team 2
Slot	Player	Points	Points	Player	Slot
Slot 1	Player 1	24	33	Player 10	Slot 1
Slot 2	Player 2	11	9	Player 11	Slot 2
Slot 3	Player 3	9	2	Player 12	Slot 3
Slot 4	Player 4	18	5	Player 13	Slot 4
Slot 5	Player 5	0	15	Player 14	Slot 5
Slot 6	Player 6	14	0	Player 15	Slot 6
Slot 7	Player 7	15	28	Player 16	Slot 7
Slot 8	Player 8	11	16	Player 17	Slot 8
Slot 9	Player 9	20	11	Player 18	Slot 9
	Total	122	119	Total

In this example, determining any player's value is easy because it is directly determined by how many points he scored that week. That means that if you could select any 9 players for your fantasy team, you would select the 9 players expected to score the most on average, giving you the highest chance of winning any given week. Nearly every league holds some sort of structured draft before the season, usually a snake draft, so that teams take turns selecting players and no player ends up on two teams. Under this set of circumstances, the best draft pick is always the remaining player who is projected to score the most fantasy points. Let's have a quick look at what kinds of players should be targeted in this simplified league by taking the top scorers over the last ten years:
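For readers who have not seen a snake draft, here is a minimal sketch of how the pick order works. The function name `snake_order` and the league size are our own choices for illustration; nothing here comes from the datasets.

```python
def snake_order(num_teams, num_rounds):
    """Return the 0-based team index that owns each pick, in draft order."""
    order = []
    for rnd in range(num_rounds):
        teams = list(range(num_teams))
        # Every second round (the 2nd, 4th, ...) runs in reverse order.
        if rnd % 2 == 1:
            teams.reverse()
        order.extend(teams)
    return order

picks = snake_order(num_teams=4, num_rounds=3)
print(picks)  # [0, 1, 2, 3, 3, 2, 1, 0, 0, 1, 2, 3]
```

The team that picks last in one round picks first in the next, which is why some teams get consecutive selections.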

top_10 = data.sort_values(by="FantasyPoints", ascending = False).head(10)
display(top_10[["Position","FantasyPoints"]])
sns.countplot(data=top_10, x= "Position")
plt.title("Top 10 scoring players 2010-2019")
plt.show()

Surprise! Almost all of the top fantasy producers are quarterbacks or running backs! This means that all we have to do is draft quarterbacks first and then move to running backs and down the line, right? Well, that is an excellent strategy for a positionless league, but since that structure is pretty mundane and lacking in difficulty, most leagues spice things up by placing positional restrictions on your lineup. A common matchup might look something like this:

	Team 1			Team 2
Position	Player	Points	Points	Player	Position
QB	Player 1	24	33	Player 10	QB
RB1	Player 2	18	14	Player 11	RB1
RB2	Player 3	13	11	Player 12	RB2
WR1	Player 4	18	5	Player 13	WR1
WR2	Player 5	0	21	Player 14	WR2
TE	Player 6	7	10	Player 15	TE
FLEX(RB,WR,TE)	Player 7	15	12	Player 16	FLEX(RB,WR,TE)
D/ST	Player 8	8	3	Player 17	D/ST
K	Player 9	10	10	Player 18	K
	Bench			Bench

Any	Bench 1	Not Counted	Not Counted	Bench 7	Any
Any	Bench 2	Not Counted	Not Counted	Bench 8	Any
Any	Bench 3	Not Counted	Not Counted	Bench 9	Any
Any	Bench 4	Not Counted	Not Counted	Bench 10	Any
Any	Bench 5	Not Counted	Not Counted	Bench 11	Any
Any	Bench 6	Not Counted	Not Counted	Bench 12	Any

As you can see, our strategy of drafting all quarterbacks and running backs will no longer work here because we would not even be able to field a complete team! This means that instead of only focusing on getting the highest scoring players, we need to make sure our team has good performers at each position in the lineup if we want to be successful. The tricky part, as our previous graph hinted, is that what a good performer looks like can differ from position to position. Here is the average points per week for all players who play QB, RB, WR, and TE. We do not have data for defenses and kickers, so we omit them.

averages = data.groupby("Position").FantasyPoints.mean() / 17
print(averages[["QB",'RB','WR','TE']])

Position
QB    6.593466
RB    4.324490
WR    5.106244
TE    3.374470
Name: FantasyPoints, dtype: float64

There is still a problem here! These point values are really low. Surely the average QB scores more than 6.5 points per week, and this data makes it seem like wide receivers score more points than running backs! To help resolve this problem, let's look more carefully at the distribution of players at each position.

d = data.loc[data["Position"].isin(["QB",'RB','WR','TE'])]

# set the figure size before plotting so that it applies to the violin plot
sns.set(rc={'figure.figsize':(11.7,8.27)})

# horizontal violins: one distribution of season totals per position
with sns.axes_style("whitegrid"):
    sns.violinplot(data=d, y="Position", x="FantasyPoints", inner="box",
                  saturation=1, bw=.1)
    plt.title("Scoring Distribution of Each Position From 2010-2019")
    plt.show()

As you can clearly see, the averages are being heavily affected by the majority of players who hardly score at all! As those familiar with fantasy football know, virtually every fantasy player is capable of drafting better than blindly picking players at random, and those who lack knowledge will use auto-drafting mechanisms provided by the hosting website. As a result, we are going to need to adjust our value to better compare players against those whom we are likely to face in actual competition.
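To make the effect concrete, here is a small sketch with invented season totals for a single position. The numbers and the `n_starters` cutoff are assumptions for illustration only; the point is how far the mean over everyone sits below the mean over the players who would actually start.

```python
import pandas as pd

# Invented season totals: a handful of starters and many fringe players.
points = pd.Series([320, 300, 280, 250, 240, 20, 10, 5, 0, 0, 0, 0])

n_starters = 5  # hypothetical number of starting slots across the league
overall_mean = points.mean()
starter_mean = points.nlargest(n_starters).mean()

print("mean over all players:", overall_mean)            # 118.75
print("mean over top", n_starters, ":", starter_mean)    # 278.0
```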

To show just how good owners are at extracting the most valuable players, we will examine how many of the top scoring players each week and throughout a season are owned. In a perfectly managed fantasy league, all of the top performers over weekly and yearly time spans would be owned and started, meaning that in a 12-team league each of the top 12 QBs would be owned and started. Since there are at least 2 WRs starting on each team, we should expect that the top 12 * 2 = 24 WRs are owned. While we do not have data on waiver-wire transactions or weekly roster decisions, we can still use draft data to determine whether owners are sufficiently good at finding valuable players during the draft. Here is a breakdown of how successful fantasy owners are at acquiring starting-level talent.

N = 12 # 12 team league

# creates a value for if players have been drafted according to ADP data
data["Drafted"] = data.Overall >= 0

#Gets players who are top 12 at their position in a year or are top 24 if RB or WR.
starter = data.loc[(data.PosRank <= N) & (data.Position.isin(["QB","TE"]))| 
                   ((data.PosRank <= 2*N) & (data.Position.isin(["RB","WR"])))]


'''creates a new data frame with information about how many starting caliber players 
were drafted in a given year at each position'''
starter = pd.DataFrame(starter.groupby(["Year","Position"]).Drafted.sum())
starter = starter.reset_index()
display(starter.head())

#plot of how many starters have been drafted
sns.lineplot(data=starter, x='Year', y='Drafted', hue='Position')
plt.title("Starting Caliber Players Drafted by Year and Position")
plt.ylim(0,25)
plt.show()

#summary of plot
print("Here is the average number of starters drafted by postion:")
print(starter.groupby("Position").Drafted.mean())

#calculates the rate at which starters are drafted by dividing by the total number of starters
starter["draft_rate"] = starter.apply(lambda row: row.Drafted / N if row.Position in ["QB","TE"] 
                                       else row.Drafted / (2 *N), axis=1)

#plot of draft rate by postion
sns.lineplot(data=starter, x='Year', y='draft_rate', hue='Position')
plt.title("Draft Rate of Starters by Year and Position")
plt.ylim(0,1)
plt.show()

#summary of plot
print("Here is the average proportion of starters drafted by postion:")
print(starter.groupby("Position").draft_rate.mean())
print("average starter draft rate", starter.groupby("Position").draft_rate.mean().mean())

#plot of average draft rate by year
sns.lineplot(data=starter, x='Year', y='draft_rate')
plt.title('Plot of Average Draft Rate by Year')
plt.ylim(0,1)
plt.show()

print("Here is a breakdown of draft rate by year")
print(starter.groupby("Year").draft_rate.mean())

Here is the average number of starters drafted by postion:
Position
QB    10.5
RB    20.1
TE     9.0
WR    20.2
Name: Drafted, dtype: float64

Here is the average proportion of starters drafted by postion:
Position
QB    0.875000
RB    0.837500
TE    0.750000
WR    0.841667
Name: draft_rate, dtype: float64
average starter draft rate 0.8260416666666666

Here is a breakdown of draft rate by year
Year
2010    0.833333
2011    0.916667
2012    0.708333
2013    0.770833
2014    0.812500
2015    0.833333
2016    0.833333
2017    0.854167
2018    0.875000
2019    0.822917
Name: draft_rate, dtype: float64

As you can see here, the positions with one starter, QB and TE, are pretty close to the upper limit of 12, and WR and RB are pretty close to 24. This means that even though football seasons are full of unforeseeable circumstances such as injuries, almost all of the best players are taken during the draft. This is true for all the positions above as well as each year in the sample of 2010-2019.

Because we have shown that owners are able to evaluate and acquire the top talent at each position, we can safely adjust our definition of player value to compare each player to the best among his peers. Specifically, we can use the distance between a player's points and the expected value of the starters at his position. This expected value is a measure of central tendency and can be calculated in two ways: the average and the median. We plan on using both to see which is better, although we suspect that the median will be more effective due to its resistance to outliers. As an example, in a 12-team league a QB's value in a given week might be +5 if he outperforms the average of the top 12 QB performances from that week by 5 points. Because we are building a drafting model that acts on predicted value for all of its players, and because we are drafting players for a whole year, our value function is the expected points of a player minus the expected value of points for the starters at that position. We divide this number by 16, the number of games each team plays in the NFL season, to make it more interpretable.
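The value function described above can be sketched as follows. This is an illustration under our own assumptions rather than the exact code behind the drafting model: the helper `add_value`, the `STARTERS` table, and the toy numbers are all hypothetical, and the baseline here is taken over season totals.

```python
import pandas as pd

N = 12       # teams in the league, as in the analysis above
GAMES = 16   # regular-season games per NFL team in 2010-2019
STARTERS = {"QB": N, "TE": N, "RB": 2 * N, "WR": 2 * N}

def add_value(df, points_col="FantasyPoints", how="median"):
    """Add a per-game Value column: points minus the starter baseline at that position."""
    out = df.copy()
    baselines = {}
    for pos, grp in out.groupby("Position"):
        top = grp[points_col].nlargest(STARTERS[pos])
        baselines[pos] = top.median() if how == "median" else top.mean()
    out["Value"] = (out[points_col] - out["Position"].map(baselines)) / GAMES
    return out

# Toy data (invented). With fewer players than starter slots, everyone counts as a starter.
toy = pd.DataFrame({
    "Position": ["QB", "QB", "QB", "TE", "TE"],
    "FantasyPoints": [320.0, 288.0, 256.0, 160.0, 96.0],
})
valued = add_value(toy)
print(valued)
```

Passing `how="mean"` switches the baseline from the median to the average, which makes it easy to compare the two choices.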

Predicting Fantasy Value¶

Now that we have an understanding of how fantasy football works and why it is important to analyze each position separately, we can finally get to the analysis! In this section, we will use machine learning to predict players' fantasy football production.

nextlist = []
for row in data.index:
    newIdx = row[:-4] + str(int(row[-4:]) + 1)
    if newIdx in data.index:
        nextpts = data.loc[newIdx, 'FantasyPoints']
        if isinstance(nextpts, float):
            nextlist.append(nextpts)
        else:
            #There is a small issue with the data, where some IDs occur multiple times, such as when a
            #player is traded in the middle of the season. This is the solution. If nextpts is a Series,
            #we take the sum of the values in that series while ignoring NaN.
            nextlist.append(np.nansum(nextpts))
    else:
        nextlist.append(0)
data['nextYrPts'] = nextlist
data[data['Position'] == 'QB'].head()

The first step is to create a new column called 'nextYrPts' which you can see in the dataframe above. This column contains the number of fantasy football points scored in that player's next season. For example, for the ID 'Michael Vick 2010', the column 'nextYrPts' holds the number of fantasy points scored by Michael Vick in 2011. This column is the Y variable in our machine learning models. This makes sense because we are trying to predict the upcoming season's fantasy output based on stats from last season.
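The same column can also be built without a row-by-row loop. The sketch below is one possible alternative on invented data, not the cell we ran: it assumes `Year` is stored as an integer (our dataframe stores it as a string) and it reproduces the loop's two conventions, summing duplicate player-seasons and filling in 0 when there is no following season.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["A", "A", "B", "B", "B"],
    "Year": [2010, 2011, 2010, 2010, 2011],  # B appears twice in 2010 (mid-season trade)
    "FantasyPoints": [100.0, 120.0, 30.0, 20.0, 80.0],
})

# Collapse duplicate player-seasons, then shift each season back one year
# so that the row for year Y can look up the total from year Y + 1.
season_totals = toy.groupby(["Name", "Year"], as_index=False)["FantasyPoints"].sum()
next_year = season_totals.assign(Year=season_totals["Year"] - 1).rename(
    columns={"FantasyPoints": "nextYrPts"}
)

labeled = toy.merge(next_year, on=["Name", "Year"], how="left")
# Players with no following season get 0, matching the loop above.
labeled["nextYrPts"] = labeled["nextYrPts"].fillna(0)
print(labeled)
```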

import sklearn
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn import linear_model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score
from sklearn.linear_model import SGDRegressor

We need to separate the data by position and also into training and testing data. This was accomplished by using boolean masks to create new dataframes for each position, with each position being separated into training and testing dataframes. The training data is the years from 2010-2015 and the testing data is the years 2016-2018.

The X variables are the features that we chose for each position. The Y variable is the number of points scored next year.

From there, a dictionary was created from the training data with the features chosen for a given position. This dictionary is then transformed into a vector with dummy variables. Then, all of the variables are standardized. This is important because each feature has a different range. We do not want generally large features such as passing yards to outweigh smaller features such as interceptions. The features are standardized into Z-Scores, or the number of standard deviations away from the mean. Then, a model was chosen and fit to the training data.

The next step is to use this model to make predictions of the testing data. We follow the same procedures that we did with the training data. We create a dictionary, turn it into a vector, and standardize it. Then we get the predicted fantasy points value. We repeat this process for every row in the testing dataframe using a for loop.
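The steps described above can also be chained with scikit-learn's `make_pipeline`, which fits the vectorizer and scaler on the training data and reuses them at prediction time, so the test rows can be predicted in one call instead of a loop. This is only a sketch and not the cells used in the sections below; the records and targets are invented.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Invented records standing in for one position's feature rows.
train_records = [
    {"Age": 24, "GS": 16, "PassingYds": 4000},
    {"Age": 30, "GS": 10, "PassingYds": 2500},
    {"Age": 27, "GS": 16, "PassingYds": 4500},
    {"Age": 35, "GS": 4, "PassingYds": 900},
    {"Age": 29, "GS": 12, "PassingYds": 3100},
    {"Age": 23, "GS": 8, "PassingYds": 1800},
]
y_train = np.array([280.0, 150.0, 310.0, 40.0, 200.0, 120.0])
test_records = [
    {"Age": 26, "GS": 16, "PassingYds": 4200},
    {"Age": 33, "GS": 6, "PassingYds": 1200},
]

# Vectorize the dicts, standardize to z-scores, then fit the regression.
pipe = make_pipeline(DictVectorizer(sparse=False), StandardScaler(), Ridge(alpha=0.001))
pipe.fit(train_records, y_train)

# Predict every test row at once; the scaler still uses the training mean and std.
predictions = pipe.predict(test_records)
print(predictions)
```

Swapping `Ridge` for `KNeighborsRegressor` in the last pipeline step would give the nearest-neighbors variant with the same preprocessing.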

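Finally, we calculate the r-squared score. This score represents the proportion of the variance in the test set's actual next-year points that is explained by the model's predictions. A score of 1 represents perfect predictions, while a score of 0 means the model does no better than always predicting the mean.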
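For reference, r-squared compares the model's squared errors with those of a baseline that always predicts the mean of the actual values. The small function below is our own illustration of that definition (it computes the same quantity as `r2_score`), with made-up numbers.

```python
def r_squared(y_true, y_pred):
    """1 - (sum of squared errors) / (sum of squared deviations from the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

actual = [100.0, 200.0, 300.0]
print(r_squared(actual, actual))                  # 1.0: perfect predictions
print(r_squared(actual, [200.0, 200.0, 200.0]))   # 0.0: same as always predicting the mean
print(r_squared(actual, [300.0, 200.0, 100.0]))   # -3.0: worse than predicting the mean
```

Note that the score can be negative on test data when the predictions are worse than the mean baseline.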

A more detailed discussion of the features and model chosen for each position will follow each section.

Quarterbacks¶

#for quarterbacks
onlyQbs = data[(data["Position"] == 'QB') & (data["Year"] != '2019') 
               & (data["Year"] != '2018') & (data["Year"] != '2017') & (data["Year"] != '2016')]
onlyQbs = onlyQbs.fillna(0)

qbFeatures = ['Age', 'GS', 'Cmp', 'PassingAtt', 'Int', 'RushingAtt', 'RushingYds', 'RushingTD',
             'PassingYds', 'PassingTD']


X_train_dict = onlyQbs[qbFeatures].to_dict(orient="records")
y_train = onlyQbs["nextYrPts"]

# Dummy encoding
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)

# Standardization
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)

# Nearest Neighbors Regression
model = KNeighborsRegressor(n_neighbors=35, weights = 'distance')
model.fit(X_train_sc, y_train)

KNeighborsRegressor(n_neighbors=35, weights='distance')

predictions = []
vec = DictVectorizer(sparse=False)
scaler.fit(X_train)
vec.fit(X_train_dict)

onlyQbsTest = data[(data["Position"] == 'QB') & ((data["Year"] == '2016') | (data["Year"] == '2017') 
                                                | (data["Year"] == '2018'))]
onlyQbsTest = onlyQbsTest.fillna(0)

for row in onlyQbsTest.index:
    X_new_list = []
    
    if isinstance(onlyQbsTest.loc[row].Age, float):
        
        X_new_dict = onlyQbsTest.loc[row, qbFeatures].to_dict()
        
    else:
        #There is a small issue with the data, where some IDs occur multiple times, such as when a
        #player is traded in the middle of the season. This is the solution. We create the dictionary manually,
        #taking the sum of the Series as the value for each key-value pairing in the dictionary.
        #This workaround occurs in all subsequent models too.
        
        temp = onlyQbsTest.loc[row, qbFeatures]
        X_new_dict = {}
        for feature in qbFeatures:
            X_new_dict[feature] = np.nansum(temp[feature])
    
    X_new = vec.transform(X_new_dict)
    X_new_sc = scaler.transform(X_new)
    predictions.append(model.predict(X_new_sc)[0])

print(r2_score(onlyQbsTest['nextYrPts'], predictions))
onlyQbsTest['projScore'] = predictions

0.47232421516871204

The features we chose for quarterbacks are as follows: age, games started, completions, passing attempts, interceptions, rushing attempts, rushing yards, rushing touchdowns, passing yards, and passing touchdowns. We felt that other potential features such as fumbles have too much variance from year to year and would make the model unnecessarily complicated. The highest r-squared score came when we used K-Nearest-Neighbors (using regression because of course this is a regression problem rather than a classification problem). The number of neighbors was set to 35 and the influence of each neighbor on the final prediction was weighted by the distance from the test data point. The r-squared score, as displayed above, was about 0.4723.

Running Backs¶

#for running backs
onlyRbs = data[(data["Position"] == 'RB') & (data["Year"] != "2019")
              & (data["Year"] != '2018') & (data["Year"] != '2017') & (data["Year"] != '2016')]
onlyRbs = onlyRbs.fillna(0)

rbFeatures = ['Age', 'GS','RushingAtt', 'RushingYds', 'RushingTD',
             'Tgt', 'Rec', 'ReceivingYds', 'ReceivingTD']

X_train_dict = onlyRbs[rbFeatures].to_dict(orient="records")
y_train = onlyRbs["nextYrPts"]

# Dummy encoding
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)

# Standardization
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)

# Ridge Regression
model = linear_model.Ridge(alpha=0.001)
model.fit(X_train_sc, y_train)

Ridge(alpha=0.001)

predictions = []
vec = DictVectorizer(sparse=False)
scaler.fit(X_train)
vec.fit(X_train_dict)

onlyRbsTest = data[(data["Position"] == 'RB') & ((data["Year"] == '2016') | (data["Year"] == '2017') 
                                                | (data["Year"] == '2018'))]
onlyRbsTest = onlyRbsTest.fillna(0)

for row in onlyRbsTest.index:
    X_new_list = []
    
    if isinstance(onlyRbsTest.loc[row].Age, float):
        
        X_new_dict = onlyRbsTest.loc[row, rbFeatures].to_dict()
        
    else:
        temp = onlyRbsTest.loc[row, rbFeatures]
        X_new_dict = {}
        for feature in rbFeatures:
            X_new_dict[feature] = np.nansum(temp[feature])
    
    X_new = vec.transform(X_new_dict)
    X_new_sc = scaler.transform(X_new)
    predictions.append(model.predict(X_new_sc)[0])

print(r2_score(onlyRbsTest['nextYrPts'], predictions))
onlyRbsTest['projScore'] = predictions

0.4658281635739452

The features we chose for running backs are as follows: age, games started, rushing attempts, rushing yards, rushing touchdowns, targets, receptions, receiving yards, and receiving touchdowns. We felt that other potential features such as fumbles have too much variance from year to year and would make the model unnecessarily complicated. The highest r-squared score came when we used Ridge Regression. The alpha term regularizes the model, making it smoother and therefore less prone to overfitting. The r-squared score, as displayed above, was about 0.4658.

Pass Catchers (Wide Receivers and Tight Ends)¶

Note that we chose to group wide receivers and tight ends together into a category called pass catchers. This decision was because both positions only obtain fantasy points through their receiving statistics. In the NFL, tight ends perform other functions such as blocking, but these contributions are not quantifiable in fantasy football.

#for pass catchers
onlyWRsTEs = data[((data["Position"] == 'WR') | (data["Position"] == 'TE')) & (data["Year"] != "2019")
                  & (data["Year"] != '2018') & (data["Year"] != '2017') & (data["Year"] != '2016')]
onlyWRsTEs = onlyWRsTEs.fillna(0)

wrteFeatures = ['Age', 'GS','Tgt', 'Rec', 'ReceivingYds', 'ReceivingTD']

X_train_dict = onlyWRsTEs[wrteFeatures].to_dict(orient="records")
y_train = onlyWRsTEs["nextYrPts"]

# Dummy encoding
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)

# Standardization
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)

# Ridge Regression
model = linear_model.Ridge(alpha=0.001)
model.fit(X_train_sc, y_train)

Ridge(alpha=0.001)

predictions = []
vec = DictVectorizer(sparse=False)
scaler.fit(X_train)
vec.fit(X_train_dict)

onlyWRsTEsTest = data[((data["Position"] == 'WR') | (data["Position"] == 'TE')) & 
                      ((data["Year"] == '2016') | (data["Year"] == '2017') | (data["Year"] == '2018'))]
onlyWRsTEsTest = onlyWRsTEsTest.fillna(0)

for row in onlyWRsTEsTest.index:
    X_new_list = []
    
    if isinstance(onlyWRsTEsTest.loc[row].Age, float):
        
        X_new_dict = onlyWRsTEsTest.loc[row, wrteFeatures].to_dict()
        
    else:
        temp = onlyWRsTEsTest.loc[row, wrteFeatures]
        X_new_dict = {}
        for feature in wrteFeatures:
            X_new_dict[feature] = np.nansum(temp[feature])
    
    X_new = vec.transform(X_new_dict)
    X_new_sc = scaler.transform(X_new)
    predictions.append(model.predict(X_new_sc)[0])

print(r2_score(onlyWRsTEsTest['nextYrPts'], predictions))
onlyWRsTEsTest['projScore'] = predictions

0.5136918284816725

The features we chose for pass catchers are as follows: age, games started, targets, receptions, receiving yards, and receiving touchdowns. We felt that other potential features such as fumbles have too much variance from year to year and would make the model unnecessarily complicated. The highest r-squared score came when we used Ridge Regression. The alpha term regularizes the model, making it smoother and therefore less prone to overfitting. The r-squared score, as displayed above, was about 0.5137.

Results¶

projData = pd.concat([onlyQbsTest, onlyRbsTest, onlyWRsTEsTest])
projData[139:144]

projData[71:76]

The r-squared scores we got above may seem low; however, the r-squared scores alone are misleading. It is not necessary that we project scores with a high degree of accuracy. Rather, we are just looking to be in the ballpark. Take the above slices of the dataframe for some examples. Our model projected that Jake Rudock would score 33.3 points in the 2018 season. Instead, he scored 0. This 33.3 point discrepancy will lower the r-squared, but again, we do not need a high degree of accuracy. Our drafting algorithm will not draft Rudock whether he is projected at 33.3 points or 0 points; there are too many far better options. It is only important that the model projects players like Russell Wilson and Cam Newton to perform better than Rudock. In other words, the projected score matters far less than the order of the projected scores.
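If ordering is what matters, one way to measure it is a rank correlation. The sketch below computes Spearman's rank correlation as the Pearson correlation of the ranks; the five rows are invented, and only the column names `projScore` and `nextYrPts` are taken from our dataframes.

```python
import pandas as pd

# Invented projections and outcomes for five players.
toy = pd.DataFrame({
    "projScore": [310.0, 250.0, 180.0, 90.0, 33.3],
    "nextYrPts": [270.0, 295.0, 150.0, 60.0, 0.0],
})

# Spearman rank correlation: Pearson correlation computed on the ranks.
rank_corr = toy["projScore"].rank().corr(toy["nextYrPts"].rank())
print("rank correlation:", rank_corr)
```

Here only the top two players swap places, so the rank correlation stays close to 1 even though the last projection misses by 33.3 points.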

Drafting Model

Using the scoring predictions made on the test data, we aim to maximise the amount of value we can extract from the draft. Given that our definition of value relied on comparing each player relative to his peers, our model for drafting players should too. Our model uses two central ideas to select the best players:

The best possible player to take at any pick must be the best remaining player at his position.
Drafting a player from a position in which comparable players will still be available later nets you less value than selecting a player who is irreplaceable with a later selection.

Based on these principles, we designed a draft algorithm that we believe makes good draft choices. It works by projecting which players are likely to be drafted before its next pick, and then comparing the player it rates best at each position now with the player it expects to be the best available at that position the next time it picks. It chooses the predicted best player from the position with the biggest gap. If it has consecutive picks, it looks ahead two picks on the first of the consecutive selections. We have also given the models smart roster constraints so that the draft produces balanced rosters.
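The gap comparison described above can be sketched as follows. This is a simplified illustration, not the code in DraftSimulator.py: the Player type, the choose_pick function, and the sample players are hypothetical, roster constraints are left out, and we assume rivals simply take the remaining players with the lowest ADP.

```python
from typing import NamedTuple

class Player(NamedTuple):
    name: str
    position: str
    projected: float  # projected season points
    adp: float        # average draft position (lower = drafted earlier)

def best_by_position(players):
    """Map each position to its highest-projected player."""
    best = {}
    for p in players:
        if p.position not in best or p.projected > best[p.position].projected:
            best[p.position] = p
    return best

def choose_pick(available, picks_until_next_turn):
    """Take the best player at the position whose best option drops most by our next turn."""
    # Assumption: rivals take the remaining players with the lowest ADP
    by_adp = sorted(available, key=lambda p: p.adp)
    likely_left = by_adp[picks_until_next_turn:]

    best_now = best_by_position(available)
    best_later = best_by_position(likely_left)

    def gap(position):
        later = best_later.get(position)
        # If nobody at the position is expected to remain, treat the later value as 0
        return best_now[position].projected - (later.projected if later else 0.0)

    return best_now[max(best_now, key=gap)]

# Hypothetical draft board, invented for illustration
available = [
    Player("RB A", "RB", 250.0, 1.0),
    Player("WR A", "WR", 230.0, 2.0),
    Player("QB A", "QB", 300.0, 3.0),
    Player("RB C", "RB", 235.0, 4.0),
    Player("RB B", "RB", 240.0, 5.0),
    Player("WR B", "WR", 180.0, 6.0),
    Player("QB B", "QB", 290.0, 7.0),
]

print(choose_pick(available, 4).name)  # WR A
```

In this example the quarterback has the highest projection, but a nearly equal quarterback is expected to be available four picks later, so the sketch takes the wide receiver, whose position has the largest drop-off.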

In order to test our algorithm and value predictions, we designed a couple of other drafting algorithms:

Perfect: The same as our algorithm, but it uses actual fantasy data from that year instead of projections.

Smart_ADP: Drafts the best available player according to ADP data. It makes an exception when it needs to fill out its starting lineup before moving on to its bench.
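As a rough sketch of one possible reading of the Smart_ADP rule: while any starting slot is open, only players who can fill one are considered, and otherwise the lowest ADP wins. The STARTERS dictionary follows the lineup described in the introduction; the function name and tuple layout are hypothetical and may differ from DraftSimulator.py.

```python
# Starting lineup: one quarterback, two running backs, two wide receivers, one tight end
STARTERS = {"QB": 1, "RB": 2, "WR": 2, "TE": 1}

def smart_adp_pick(available, roster):
    """available: list of (name, position, adp); roster: positions already drafted."""
    open_positions = {pos for pos, n in STARTERS.items() if roster.count(pos) < n}
    # While any starting slot is open, only consider players who can fill one
    pool = [p for p in available if p[1] in open_positions]
    # Once the starters are set (or nobody fits an open slot), consider everyone
    pool = pool or available
    # Lowest ADP means the player is usually drafted earliest
    return min(pool, key=lambda p: p[2])

board = [("RB X", "RB", 10.0), ("WR Y", "WR", 12.0)]
print(smart_adp_pick(board, ["RB", "RB"])[0])  # WR Y
```

Here both running back slots are already filled, so the drafter passes on the running back with the better ADP and fills a starting wide receiver slot instead.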

These algorithms compete against ours in simulated fantasy drafts. The resulting teams are then run through a season simulation in which the optimal lineup of each team is extracted each week and each team earns a fractional win equal to the share of its rivals that it outscored that week. At the end of the regular season, the four winningest teams are matched up in a head-to-head postseason in which an ultimate champion is determined.
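The fractional-win scoring can be sketched like this. It is a minimal illustration that assumes each team's optimal weekly lineup score has already been computed and that ties count as no win; the function names and sample scores are hypothetical and not taken from DraftSimulator.py.

```python
def fractional_wins(weekly_scores):
    """weekly_scores: dict of team -> optimal-lineup points for one week.
    Returns team -> share of rivals outscored that week."""
    teams = list(weekly_scores)
    rivals = len(teams) - 1
    return {
        t: sum(weekly_scores[t] > weekly_scores[o] for o in teams if o != t) / rivals
        for t in teams
    }

def regular_season(weeks):
    """Add up fractional wins across all weeks."""
    totals = {}
    for week in weeks:
        for team, share in fractional_wins(week).items():
            totals[team] = totals.get(team, 0.0) + share
    return totals

def playoff_teams(totals, n=4):
    """The n teams with the most fractional wins advance to the postseason."""
    return sorted(totals, key=totals.get, reverse=True)[:n]

# Two hypothetical weeks for a five-team league
weeks = [
    {"a": 100, "b": 90, "c": 80, "d": 70, "e": 60},
    {"a": 95, "b": 110, "c": 60, "d": 85, "e": 70},
]
totals = regular_season(weeks)
print(totals)
print(playoff_teams(totals))
```

Scoring every team against the whole league each week, rather than against one scheduled opponent, reduces the effect of schedule luck on the regular-season standings.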

Our first test places our predictive algorithm against 9 Smart_ADP drafters and runs a simulation for the years 2016-2018. The code for the drafting algorithms and the league simulation is found in our repository in DraftSimulator.py. Here we will simply import it and use the simulation function to run league simulations over a given time interval, with each team getting a chance to draft in each slot. (This may take quite a while.)

from DraftSimulator import *
smart = {
    "Predictive": 1,
    "Smart_ADP": 9
}
results = full_sim(smart, 2016, 2018, 1, "standard", positions, projData, weekly)

With our simulation complete we can analyse the data.

#CREATES AND CLEANS A NEW DATAFRAME ORGANISED BY TEAM NAME AND POINTS SCORED
frame = pd.DataFrame(results.groupby("Name").Points.mean())
frame = frame.reset_index()
display(frame)

#GRAPHS THE AVERAGE POINTS SCORED OF EACH TEAM
frame["Predictive"] = frame.Name == "predictive1"
sns.barplot(data=frame,x="Name",y="Points", hue="Predictive")
plt.xlabel("Team")
plt.ylabel("Average Points Scored")
plt.title("Average Points Scored by Team from 2016-2018")
plt.show()

#CREATES AND CLEANS A NEW DATAFRAME ORGANISED BY TEAM NAME AND TOTAL WINS
results["Wins"] = results.Rank == 0
frame = results.groupby("Name").Wins.sum()
frame = frame.reset_index()

#GRAPHS THE TOTAL WINS OF EACH TEAM
frame["Predictive"] = frame.Name == "predictive1"
sns.barplot(data=frame,x="Name", y="Wins", hue="Predictive")
plt.xlabel("Team")
plt.ylabel("Total Wins")
plt.title("Total wins in 30 simulated seasons")
plt.show()

Wow! As you can see, our algorithm significantly outperforms its competitors. Not only does it regularly outscore the others, it also translates that scoring into championships in 17 of the 30 seasons. Next, we compare these results with those from running the perfect algorithm against the same drafters.

perf = {
    "Perfect": 1,
    "Smart_ADP": 9
}

results = full_sim(perf, 2016, 2018, 1, "standard", positions, projData, weekly)

#CREATES AND CLEANS A NEW DATAFRAME ORGANISED BY TEAM NAME AND POINTS SCORED
frame = pd.DataFrame(results.groupby("Name").Points.mean())
frame = frame.reset_index()
display(frame)

#GRAPHS THE AVERAGE POINTS SCORED OF EACH TEAM
frame["Perfect"] = frame.Name == "perfect1"
sns.barplot(data=frame,x="Name",y="Points", hue="Perfect")
plt.xlabel("Team")
plt.ylabel("Average Points Scored")
plt.title("Average Points Scored by Team from 2016-2018")
plt.show()

#CREATES AND CLEANS A NEW DATAFRAME ORGANISED BY TEAM NAME AND TOTAL WINS
results["Wins"] = results.Rank == 0
frame = results.groupby("Name").Wins.sum()
frame = frame.reset_index()

#GRAPHS THE TOTAL WINS OF EACH TEAM
frame["Perfect"] = frame.Name == "perfect1"
sns.barplot(data=frame,x="Name", y="Wins", hue="Perfect")
plt.xlabel("Team")
plt.ylabel("Total Wins")
plt.title("Total wins in 30 simulated seasons")
plt.show()

The perfect algorithm performs better than ours, but that is what happens when you have the answers to the test ahead of time! What we can learn here is that our machine learning predictions, when combined with a solid drafting strategy, are strong enough to handily outperform algorithms that mimic human behavior. There is still room for improvement, since the perfect model won significantly more often and scored more points overall. To conclude, while no human can ever reliably draft perfectly, by using the principles of value discussed above together with accurate predictions of player value, we can get closer than you might expect.

	Name	Team	Position	Age	G	GS	Cmp	Att	Yds	Int	...	PassingAtt	RushingYds	RushingTD	RushingAtt	ReceivingYds	ReceivingTD	FantasyPoints	Year	Rank	PosRank
ID
Arian Foster 2010	Arian Foster	HOU	RB	24.0	16.0	13.0	0.0	0.0	0.0	0.0	...	0.0	1616.0	16.0	327.0	604.0	2.0	392.00	2010	1.0	1.0
Peyton Hillis 2010	Peyton Hillis	CLE	RB	24.0	16.0	14.0	1.0	2.0	13.0	0.0	...	2.0	1177.0	11.0	270.0	477.0	2.0	294.92	2010	7.0	3.0
Adrian Peterson 2010	Adrian Peterson	MIN	RB	25.0	15.0	15.0	0.0	0.0	0.0	0.0	...	0.0	1298.0	12.0	283.0	341.0	1.0	275.90	2010	15.0	6.0
Jamaal Charles 2010	Jamaal Charles	KC	RB	24.0	16.0	6.0	0.0	0.0	0.0	0.0	...	0.0	1467.0	5.0	230.0	468.0	3.0	282.50	2010	10.0	4.0
Chris Johnson 2010	Chris Johnson	TEN	RB	25.0	16.0	16.0	0.0	0.0	0.0	0.0	...	0.0	1364.0	11.0	316.0	245.0	1.0	272.90	2010	17.0	7.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Ray-Ray McCloud 2019	Ray-Ray McCloud	CAR	0	23.0	6.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	-2.00	2019	618.5	61.5
Darrius Shepherd 2019	Darrius Shepherd	GB	WR	24.0	6.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	1.0	0.0	-0.90	2019	614.0	216.0
Jarrett Stidham 2019	Jarrett Stidham	NE	QB	23.0	3.0	0.0	2.0	4.0	14.0	1.0	...	4.0	-2.0	0.0	2.0	0.0	0.0	-1.64	2019	617.0	71.0
Michael Walker 2019	Michael Walker	JAX	WR	23.0	7.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	15.0	0.0	-0.50	2019	612.0	215.0
Corey Clement 2019	Corey Clement	PHI	0	25.0	4.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	-4.00	2019	620.0	63.0

	Overall	Name	Position	Team	Times Drafted	Std. Dev	High	Low	Bye	Year
ID
Chris Johnson 2010	1.4	Chris Johnson	RB	TEN	461	0.7	1	6	4	2010
Adrian Peterson 2010	2.1	Adrian Peterson	RB	MIN	286	0.7	1	4	7	2010
Ray Rice 2010	3.0	Ray Rice	RB	BAL	157	0.7	1	5	7	2010
Maurice Jones-Drew 2010	4.4	Maurice Jones-Drew	RB	JAX	355	1.2	1	8	8	2010
Andre Johnson 2010	5.6	Andre Johnson	WR	HOU	315	1.4	2	9	8	2010
...	...	...	...	...	...	...	...	...	...	...
Mohamed Sanu 2019	170.1	Mohamed Sanu	WR	NE	17	10.6	142	180	5	2019
Matt Prater 2019	171.2	Matt Prater	PK	DET	13	5.2	163	178	5	2019
Buffalo Defense 2019	171.8	Buffalo Defense	DEF	BUF	38	9.8	146	191	11	2019
Brett Maher 2019	172.0	Brett Maher	PK	DAL	32	10.4	145	186	10	2019
Dan Bailey 2019	172.3	Dan Bailey	PK	MIN	18	4.8	166	180	7	2019

	Name	Team	Position	Age	G	GS	Cmp	Att	Yds	Int	...	FantasyPoints	Year	Rank	PosRank	Overall	Times Drafted	Std. Dev	High	Low	Bye
ID
Arian Foster 2010	Arian Foster	HOU	RB	24.0	16.0	13.0	0.0	0.0	0.0	0.0	...	392.00	2010	1.0	1.0	36.3	566.0	7.4	17.0	57.0	8
Peyton Hillis 2010	Peyton Hillis	CLE	RB	24.0	16.0	14.0	1.0	2.0	13.0	0.0	...	294.92	2010	7.0	3.0	160.9	40.0	13.5	128.0	179.0	9
Adrian Peterson 2010	Adrian Peterson	MIN	RB	25.0	15.0	15.0	0.0	0.0	0.0	0.0	...	275.90	2010	15.0	6.0	2.1	286.0	0.7	1.0	4.0	7
Jamaal Charles 2010	Jamaal Charles	KC	RB	24.0	16.0	6.0	0.0	0.0	0.0	0.0	...	282.50	2010	10.0	4.0	27.3	298.0	5.7	12.0	41.0	10
Chris Johnson 2010	Chris Johnson	TEN	RB	25.0	16.0	16.0	0.0	0.0	0.0	0.0	...	272.90	2010	17.0	7.0	1.4	461.0	0.7	1.0	6.0	4
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Mohamed Sanu 2019	Mohamed Sanu	NE	WR	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	2019	NaN	NaN	170.1	17.0	10.6	142.0	180.0	5
Matt Prater 2019	Matt Prater	DET	PK	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	2019	NaN	NaN	171.2	13.0	5.2	163.0	178.0	5
Buffalo Defense 2019	Buffalo Defense	BUF	DEF	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	2019	NaN	NaN	171.8	38.0	9.8	146.0	191.0	11
Brett Maher 2019	Brett Maher	DAL	PK	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	2019	NaN	NaN	172.0	32.0	10.4	145.0	186.0	10
Dan Bailey 2019	Dan Bailey	MIN	PK	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	2019	NaN	NaN	172.3	18.0	4.8	166.0	180.0	7

	Position	FantasyPoints
ID
Christian McCaffrey 2019	RB	469.20
Lamar Jackson 2019	QB	415.68
Patrick Mahomes 2018	QB	415.08
Peyton Manning 2013	QB	409.98
David Johnson 2016	RB	405.80
Aaron Rodgers 2011	QB	397.42
Arian Foster 2010	RB	392.00
Cam Newton 2015	QB	389.08
Drew Brees 2011	QB	387.64
Christian McCaffrey 2018	RB	385.50

	Name	Team	Position	Age	G	GS	Cmp	Att	Yds	Int	...	Rank	PosRank	Overall	Times Drafted	Std. Dev	High	Low	Bye	Drafted	nextYrPts
ID
Michael Vick 2010	Michael Vick	PHI	QB	30.0	12.0	12.0	233.0	372.0	3018.0	6.0	...	3.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	False	233.02
Aaron Rodgers 2010	Aaron Rodgers	GB	QB	27.0	15.0	15.0	312.0	475.0	3922.0	11.0	...	4.0	2.0	6.8	327.0	2.3	1.0	15.0	5	True	397.42
Tom Brady 2010	Tom Brady	NE	QB	33.0	16.0	16.0	324.0	492.0	3900.0	4.0	...	5.0	3.0	25.2	476.0	5.5	10.0	41.0	5	True	366.30
Philip Rivers 2010	Philip Rivers	LAC	QB	29.0	16.0	16.0	357.0	541.0	4710.0	13.0	...	12.0	5.0	42.0	326.0	7.0	23.0	60.0	6	True	252.56
Peyton Manning 2010	Peyton Manning	IND	QB	34.0	16.0	16.0	450.0	679.0	4700.0	17.0	...	9.0	4.0	17.7	463.0	4.3	6.0	27.0	7	True	NaN

	Player	Pos	Tm	PassingYds	PassingTD	Int	PassingAtt	Cmp	RushingAtt	RushingYds	...	Rec	Tgt	ReceivingYds	ReceivingTD	FL	PPRFantasyPoints	StandardFantasyPoints	HalfPPRFantasyPoints	Year	Week
ID
Vince Young 2010 1	Vince Young	QB	TEN	154.0	2.0	0.0	17.0	13.0	7.0	30.0	...	0.0	0.0	0.0	0.0	1.0	15.16	15.16	15.16	2010	1
David Garrard 2010 1	David Garrard	QB	JAX	170.0	3.0	0.0	21.0	16.0	7.0	10.0	...	0.0	0.0	0.0	0.0	0.0	19.80	19.80	19.80	2010	1
Tom Brady 2010 1	Tom Brady	QB	NWE	258.0	3.0	0.0	35.0	25.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	22.32	22.32	22.32	2010	1
Peyton Manning 2010 1	Peyton Manning	QB	IND	433.0	3.0	0.0	57.0	40.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	29.32	29.32	29.32	2010	1
Jay Cutler 2010 1	Jay Cutler	QB	CHI	372.0	2.0	1.0	35.0	23.0	5.0	22.0	...	0.0	0.0	0.0	0.0	1.0	21.08	21.08	21.08	2010	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Tavon Austin 2019 17	Tavon Austin	WR	DAL	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	1.0	2.0	1.0	0.0	0.0	1.10	0.10	0.60	2019	17
Blake Bell 2019 17	Blake Bell	TE	KAN	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	1.0	2.0	1.0	0.0	0.0	1.10	0.10	0.60	2019	17
Jamal Agnew 2019 17	Jamal Agnew	CB	DET	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	1.0	2.0	-2.0	0.0	0.0	0.80	-0.20	0.30	2019	17
Trevor Davis 2019 17	Trevor Davis	WR	MIA	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.00	0.00	0.00	2019	17
Joe Thuney 2019 17	Joe Thuney	OL	NWE	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.00	0.00	0.00	2019	17

	Name	Team	Position	Age	G	GS	Cmp	Att	Yds	Int	...	PosRank	Overall	Times Drafted	Std. Dev	High	Low	Bye	Drafted	nextYrPts	projScore
ID
Kellen Clemens 2017	Kellen Clemens	LAC	QB	34.0	8.0	0.0	6.0	8.0	36.0	1.0	...	71.0	0.0	0.0	0.0	0.0	0.0	0	False	0.00	12.546090
Chad Henne 2017	Chad Henne	JAX	QB	32.0	2.0	0.0	0.0	2.0	0.0	0.0	...	69.0	0.0	0.0	0.0	0.0	0.0	0	False	1.46	12.275391
Jake Rudock 2017	Jake Rudock	DET	QB	24.0	3.0	0.0	3.0	5.0	24.0	1.0	...	70.0	0.0	0.0	0.0	0.0	0.0	0	False	0.00	33.303249
Tyler Bray 2017	Tyler Bray	KC	QB	26.0	1.0	0.0	0.0	1.0	0.0	0.0	...	72.0	0.0	0.0	0.0	0.0	0.0	0	False	0.00	17.546173
Teddy Bridgewater 2017	Teddy Bridgewater	MIN	QB	25.0	1.0	0.0	0.0	2.0	0.0	1.0	...	73.0	0.0	0.0	0.0	0.0	0.0	0	False	7.22	34.325876

	Name	Team	Position	Age	G	GS	Cmp	Att	Yds	Int	...	PosRank	Overall	Times Drafted	Std. Dev	High	Low	Bye	Drafted	nextYrPts	projScore
ID
Russell Wilson 2017	Russell Wilson	SEA	QB	29.0	16.0	16.0	339.0	553.0	3983.0	11.0	...	1.0	63.5	258.0	6.9	45.0	79.0	6	True	297.42	232.414366
Cam Newton 2017	Cam Newton	CAR	QB	28.0	16.0	16.0	291.0	492.0	3302.0	16.0	...	2.0	81.4	235.0	9.7	56.0	104.0	13	True	278.60	231.523128
Tom Brady 2017	Tom Brady	NE	QB	40.0	16.0	16.0	385.0	581.0	4577.0	8.0	...	4.0	27.4	291.0	4.8	14.0	39.0	5	True	281.30	249.139050
Alex Smith 2017	Alex Smith	KC	QB	33.0	15.0	15.0	341.0	505.0	4042.0	5.0	...	3.0	160.7	15.0	13.1	143.0	179.0	10	True	138.00	219.115336
Carson Wentz 2017	Carson Wentz	PHI	QB	25.0	13.0	13.0	265.0	440.0	3296.0	7.0	...	6.0	132.7	156.0	8.9	110.0	151.0	9	True	191.66	190.413394

	Name	Points
0	predictive1	1834.354000
1	smart1	1264.085333
2	smart2	1246.591333
3	smart3	1258.564000
4	smart4	1256.138000
5	smart5	1280.684667
6	smart6	1275.415333
7	smart7	1252.184667
8	smart8	1242.476000
9	smart9	1258.302667

	Name	Points
0	perfect1	1941.798000
1	smart1	1253.644667
2	smart2	1229.267333
3	smart3	1246.337333
4	smart4	1247.502667
5	smart5	1247.406000
6	smart6	1256.812000
7	smart7	1263.151333
8	smart8	1235.497333
9	smart9	1268.640000