Model Building, Part 1

This is the first in a two-part article about building a simple football handicapping model and testing it with the R statistical package. I will not go over installing R, nor is this a tutorial on R. If you have no experience with R, I suggest taking a short tutorial first, but some people work better running through an example like this and looking up functions as they go.

We’ll start with the hypothesis that a team’s total yards translate to a specific number of points scored. This gives us a complete model, just not a good one. It’s about as simple as they come, which suits a “how to model” article like this one, but it will not be accurate enough to be profitable. In part 2, we’ll go over testing our model and I’ll better explain some ideas. Similarly, you can use the predicted points to come up with a Total. That, too, will not be accurate enough to bet on with this model.

One thing to note is that the NFL is probably the most difficult betting market to beat. It’s not unbeatable, but the opportunities and expected edge in this market will be smaller than nearly any other market. MLB would be another very efficient betting market.

We’ll be using the data from repole.com (http://www.repole.com/sun4cast/data.html). You’ll want to download the CSV versions of the “Cumulative statistics” files. We’ll use just one year of data, 2011. The model is a simple linear regression that uses average total yards to predict what the spread should be. Let’s get our hands dirty.

First, let’s read the 2011 CSV file into a data frame we’ll call df:

df <- read.csv("/path/to/data/nfl2011stats.csv")

Now, we’ll do a bit of slicing, dicing, and reformatting of the data to better prepare it for what we’re going to do.

The stats in this file are labeled poorly. Anything “Off” really means the cumulative for the team listed in the “TeamName” column for the row. Anything “Def” means the team in the “Opponent” column. ScoreOff, for example, actually shows all the points for that team (offensive, defensive, and special teams scores are totaled in this column).

Since it doesn’t have the total yards computed already, we’ll change the dataframe by adding a column that is the sum of RushYdsOff and PassYdsOff:

df <- within(df, TotalYdsOff <- RushYdsOff + PassYdsOff)

Let’s do the same for the Defense:

df <- within(df, TotalYdsDef <- RushYdsDef + PassYdsDef)

And since the CSV files show the lines “backwards” (-3 for the loser instead of +3, for instance), we’ll need to fix that quickly:

df <- within(df, Line <- Line * (-1))

Also, the date is currently seen as a character string. Let’s transform it to an actual “Date” structure. This will allow us to filter with a condition like Date >= as.Date("2011-10-04"), which chooses all the dates that are later than or equal to Oct 04, 2011. It’ll be useful if you muck around with this stuff later, but not particularly useful in this article, so you can skip this next line if you really want:

df$Date <- as.Date(df$Date, format = "%m/%d/%Y")
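Here’s a quick sketch of what that date filtering looks like once the column is a real Date. The little “games” data frame is made up for illustration and only assumes the same month/day/year text format used above:

```r
# Toy data standing in for the real file; only the Date column matters here
games <- data.frame(
  TeamName = c("A", "B", "A", "B"),
  Date = c("09/11/2011", "09/11/2011", "10/09/2011", "10/09/2011"),
  stringsAsFactors = FALSE
)

# Convert the text to Date objects
games$Date <- as.Date(games$Date, format = "%m/%d/%Y")

# Keep only games on or after Oct 04, 2011
late_games <- subset(games, Date >= as.Date("2011-10-04"))
print(late_games)
```

Wrapping the cutoff in as.Date() makes it obvious that you’re comparing dates with dates, rather than relying on R to convert the string for you.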

Now let’s compute what we’ll call the ActualLine, which is the difference in the final scores:

df <- within(df, ActualLine <- ScoreDef - ScoreOff)

Notice we calculate this as the Opponent’s score minus the TeamName’s score. This will always display it as we expect to see it. For example, “-7” means TeamName outscored Opponent by 7 points. Likewise, “3” means Opponent outscored TeamName by 3 points.
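If the sign convention feels backwards, a tiny made-up game can help. This is only an illustration with invented scores (27 to 20), using the same column names as the data file:

```r
# One game shown from both sides: "Home" beat "Away" 27-20 (invented numbers)
scores <- data.frame(
  TeamName = c("Home", "Away"),
  Opponent = c("Away", "Home"),
  ScoreOff = c(27, 20),  # points for the team in TeamName
  ScoreDef = c(20, 27)   # points for the team in Opponent
)

# Opponent's score minus the team's own score
scores <- within(scores, ActualLine <- ScoreDef - ScoreOff)
print(scores)
```

The winner’s row comes out as -7 and the loser’s row as 7, which is how a sportsbook would list the two sides.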

Okay. Now it’s time to run a linear regression to see the relationship between total yards and points scored. This is really easy in R. Nearly all the functions that perform a regression (linear or otherwise) take a formula in the “y ~ x” style. With a second predictor, it would look like “y ~ x1 + x2,” for example. Since we’re only using one, we just use the first form.

In our hypothesis, we’re trying to figure out how many yards equals a point. This makes yards the “x” variable, and points are the “y” in the equation. A simple linear regression is run using the lm() function. We’re going to store the output in a variable, “yds_lm_2011.” Here’s what the code looks like:
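To make the two formula styles concrete, here is a small sketch on invented numbers. The “toy” data frame and its values are assumptions for illustration only and have nothing to do with the real 2011 data:

```r
# Invented sample: points scored with rushing and passing yards
toy <- data.frame(
  ScoreOff   = c(14, 20, 21, 27, 31, 17),
  RushYdsOff = c(80, 120, 95, 150, 140, 110),
  PassYdsOff = c(170, 180, 255, 250, 310, 160)
)

# One predictor: y ~ x
one_var <- lm(ScoreOff ~ RushYdsOff, data = toy)

# Two predictors: y ~ x1 + x2
two_var <- lm(ScoreOff ~ RushYdsOff + PassYdsOff, data = toy)

print(coef(one_var))
print(coef(two_var))
```

Each predictor you add to the right-hand side gets its own coefficient, alongside the intercept.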

yds_lm_2011 <- lm(ScoreOff ~ TotalYdsOff, data = df)

If you just call the variable by typing “yds_lm_2011” into the command line and pressing enter, it’ll show you the intercept and coefficient of our equation. If you remember junior high math class, the equation would look like the following:

y = 0.07141x – 2.58814

With this equation, we just need to figure out how many yards a team is going to gain, plug that number in, and it’ll pop out how many points we expect the team to get. For example, if we expect a team to get 356 yards, this equation would predict about 23 points being scored (0.07141 × 356 – 2.58814 ≈ 22.8).
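If you’d rather not copy the numbers by hand, you can pull the intercept and slope straight out of the fitted model with coef(). The sketch below fits a model to invented data (so its coefficients will not match the ones above) and checks that the by-hand arithmetic agrees with predict():

```r
# Invented yards and points, only so there is something to fit
toy <- data.frame(
  TotalYdsOff = c(250, 300, 350, 400, 450),
  ScoreOff    = c(14, 20, 21, 27, 31)
)
toy_lm <- lm(ScoreOff ~ TotalYdsOff, data = toy)

# Named vector holding "(Intercept)" and the "TotalYdsOff" slope
b <- coef(toy_lm)

yards <- 356
# y = intercept + slope * x, done by hand
by_hand <- unname(b["(Intercept)"] + b["TotalYdsOff"] * yards)
# The same prediction, letting R do the work
by_predict <- unname(predict(toy_lm, data.frame(TotalYdsOff = yards)))

print(c(by_hand = by_hand, by_predict = by_predict))
```

Both numbers should match, which is a handy sanity check that you’ve read the intercept and slope the right way around.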

So how are we going to predict how many yards a team will total up? Well, let’s just use the average total yards up to that point in the season. Before I show you some of the code, let’s just walk through exactly what we’ll be doing.

For a given team, we will first grab a subset of our data frame consisting of the team. Then, we’ll add up all the yards for all weeks PRIOR to the week we’re working on. We cannot use data from that week as we would not have known it at the time (you place a bet before the game happens, not after). Then, we divide that running total by the number of games played at that point to get the average. I’ve put that code into a function, so we can call it again and it’ll look more readable:

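calc_avg_yds <- function(teamName, data) {
  # Rows for this team only; assumes they are already in date order
  tmp_df <- subset(data, TeamName == teamName)
  RunningTotalYdsOff <- 0
  v <- c(0) # the average before the first game is always 0
  # Start at the second game; the loop is skipped if there is only one row
  for (i in seq_len(nrow(tmp_df))[-1]) {
    # Add the previous game's yards, then average over the games played so far
    RunningTotalYdsOff <- RunningTotalYdsOff + tmp_df$TotalYdsOff[i - 1]
    AvgTotalYdsOff <- RunningTotalYdsOff / (i - 1)
    v <- c(v, AvgTotalYdsOff)
  }
  v # return the vector
}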
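For what it’s worth, the same “average of all prior games” idea can be written without a loop by leaning on cumsum(). This is just an alternative sketch, and prior_avg is a hypothetical helper name of my own, not something from the data or from R itself:

```r
# Hypothetical helper: average of all games BEFORE each game, 0 for the first
prior_avg <- function(x) {
  n <- length(x)
  if (n == 0) return(numeric(0))
  # Running total of games 1..(i-1) for each game i
  prior_total <- c(0, cumsum(x)[-n])
  # Number of games already played before game i
  games_played <- seq_len(n) - 1
  # Avoid 0/0 for the first game by forcing it to 0
  ifelse(games_played == 0, 0, prior_total / games_played)
}

yds <- c(300, 400, 350, 450)  # invented weekly yardage for one team
print(prior_avg(yds))         # 0 300 350 350
```

Week 3’s value of 350 is the average of 300 and 400, and week 4’s is the average of the first three games, so nothing from the current week leaks into its own prediction.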

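Now that we have the function, we’ll want to call it for every team we have. That’s a simple “for” loop over the sorted, unique team names. (Older versions of R read TeamName in as a factor, so looping over levels(df$TeamName) also worked. Since R 4.0, read.csv() leaves strings as plain character vectors by default, and levels() would return NULL, so the loop would never run.) We’ll also want to store the output in a vector, so we can recombine it with our dataframe. Here’s how the loop looks:

avg_yds_vector <- vector()
for (each in sort(unique(as.character(df$TeamName)))) {
  avg_yds_vector <- c(avg_yds_vector, calc_avg_yds(each, df))
}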

So now we have a vector with the average total yards for each team and each week. Time to combine that with the dataframe, right? Not quite yet. You see, our new vector is ordered primarily by TeamName, and secondarily by Date. Our data frame is ordered primarily by Date, and then by TeamName. So we just need to make sure we reorder the data frame before we bind the two together. We can still do it in one line:

df <- cbind(df[order(df$TeamName), ], AvgTotalYdsOff = avg_yds_vector)
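If the reordering step makes you nervous, base R’s ave() is one way to sidestep it: it applies a function within each group and hands the results back in the original row order. The sketch below uses invented data and a small helper of my own (prior_avg), and it assumes each team’s rows are already in date order:

```r
# Invented data, ordered by week and then by team like the real file
games <- data.frame(
  TeamName    = c("A", "B", "A", "B", "A", "B"),
  Week        = c(1, 1, 2, 2, 3, 3),
  TotalYdsOff = c(300, 250, 400, 350, 350, 450)
)

# Hypothetical helper: average of prior games, 0 before the first game
prior_avg <- function(x) {
  n <- length(x)
  if (n <= 1) return(rep(0, n))
  c(0, cumsum(x)[-n] / seq_len(n - 1))
}

# ave() runs prior_avg per team and keeps the rows where they were
games$AvgTotalYdsOff <- ave(games$TotalYdsOff, games$TeamName, FUN = prior_avg)
print(games)
```

Because the rows never move, there is no chance of binding one team’s averages to another team’s games.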

So now we’ll want to get the AvgTotalYds for the Opponent. Since each game has two rows of data, one for the home team and one for the away team, we’ve already calculated the other team’s AvgTotalYds. We’ll just grab it from that line. I’ll just bind the new vector together again:

foo <- vector()
for (i in 1:nrow(df)) {
  # Find the opponent's row for the same game and take its average yards
  foo <- c(foo, subset(df, TeamName == df[i, ]$Opponent & Date == df[i, ]$Date)$AvgTotalYdsOff)
}
df <- cbind(df, AvgTotalYdsDef = foo)
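Another way to do the same lookup, sketched here on invented data, is to build a “team plus date” key for each row and use match() to find the opponent’s row. The key names (own_key, opp_key) are my own, purely for illustration:

```r
# Two invented games on the same day, each shown from both sides
games <- data.frame(
  TeamName = c("A", "B", "C", "D"),
  Opponent = c("B", "A", "D", "C"),
  Date     = as.Date(rep("2011-09-11", 4)),
  AvgTotalYdsOff = c(300, 250, 410, 380)
)

# A key identifying each row's own team and game date...
own_key <- paste(games$TeamName, games$Date)
# ...and the key of the row we want to look up (the opponent, same date)
opp_key <- paste(games$Opponent, games$Date)

# match() gives the row number of each opponent; index the averages with it
games$AvgTotalYdsDef <- games$AvgTotalYdsOff[match(opp_key, own_key)]
print(games)
```

One nice side effect is that a missing opponent row shows up as an NA in that spot, instead of quietly producing a vector that is too short to bind.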

Now there’s very little left to do before we have our predictions! We’ll just calculate our predicted points for both teams using the linear regression we did earlier, and the difference between the two will be our predicted “fair” line. Now, there is a predict() function which takes a linear model and some data and will do the predictions for us. For the longest time, I couldn’t figure out how to get it to use a different column name for the data (AvgTotalYdsOff instead of TotalYdsOff, for example). At some point I hardcoded it into an equation and moved on, but eventually, I came across how to do it. Instead of hardcoding, you can use the predict function as follows:

predict(yds_lm_2011, list(TotalYdsOff = df$AvgTotalYdsOff))

Now, that only outputs the results to the console, but doesn’t pop it in our data frame. We can do that in one line like so:

df <- cbind(df, PredPtsOff = predict(yds_lm_2011, list(TotalYdsOff = df$AvgTotalYdsOff)))

To do it for the Defense is slightly different. Since the model has its yardage variable defined as TotalYdsOff, that stays the same, even though we’re using AvgTotalYdsDef. It looks like this:

df <- cbind(df, PredPtsDef = predict(yds_lm_2011, list(TotalYdsOff = df$AvgTotalYdsDef)))

Notice that PredPtsOff and AvgTotalYdsOff have changed to the “Def” versions, but TotalYdsOff stays “Off.”

Finally, you’ll want to come up with your predicted line. Any team’s line is always its opponent’s score minus its own score. Thus, if a team is projected to win by 7 points, the line is represented as “-7” as you’d expect to see from any sportsbook. Here’s the code to add that column to the data frame:

df <- within(df, PredLine <- PredPtsDef - PredPtsOff)
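Here’s the whole prediction step in miniature, on invented numbers, so you can see the pieces fit together. The “train” and “upcoming” data frames are assumptions for illustration, and the fitted coefficients will not match the real model:

```r
# Invented training data for a yards-to-points regression
train <- data.frame(
  TotalYdsOff = c(250, 300, 350, 400, 450),
  ScoreOff    = c(14, 20, 21, 27, 31)
)
fit <- lm(ScoreOff ~ TotalYdsOff, data = train)

# One invented game, seen from each team's side
upcoming <- data.frame(
  TeamName       = c("A", "B"),
  AvgTotalYdsOff = c(380, 320),
  AvgTotalYdsDef = c(320, 380)
)

# The new data must use the model's column name, TotalYdsOff, in both cases
upcoming$PredPtsOff <- predict(fit, data.frame(TotalYdsOff = upcoming$AvgTotalYdsOff))
upcoming$PredPtsDef <- predict(fit, data.frame(TotalYdsOff = upcoming$AvgTotalYdsDef))

# Opponent's predicted points minus the team's own
upcoming$PredLine <- upcoming$PredPtsDef - upcoming$PredPtsOff
print(upcoming)
```

Team A averages more yards, so it comes out as the favorite with a negative line, and Team B’s line is the same number with the sign flipped.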

If you want to save your data back into the original csv file, you’d do that like this:

write.csv(df, file = "nfl2011stats.csv", na = "")
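One thing to be aware of: by default write.csv() also writes the row names as an extra first column, which turns into a stray column when you read the file back in. A small sketch of a round trip on invented data, writing to a temporary file so nothing real gets overwritten:

```r
# Invented results with one missing value
out <- data.frame(TeamName = c("A", "B"), PredLine = c(-3.5, NA))

# Temporary file path, so no real data is touched
path <- tempfile(fileext = ".csv")

# na = "" writes blanks for missing values; row.names = FALSE skips the extra column
write.csv(out, file = path, na = "", row.names = FALSE)

# Read it back to confirm the columns survived unchanged
back <- read.csv(path)
print(names(back))
```

Whether you want row.names = FALSE is up to you. I’m only pointing it out because the extra column can be confusing the next time you load the file.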

That’s the end of part 1. As I mentioned at the beginning of the article, part 2 will go over testing this data and some ways to improve the model going forward.