Our data collection started with the CSV file below, which contains all historical game data from the beginning of the NBA through the 2015 regular season. We decided to drop all data from more than 20 years ago, so that we focus on the period in which the current 30 NBA franchises were established and have been a consistent part of the league. We also dropped some of the more inconsequential columns (the season game number and the playoffs flag) and deleted the _iscopy column, which only marks duplicated games. Essentially, we cleaned out the data we would not be incorporating into our model.

In [9]:
import pandas as pd
elo = pd.read_csv('elo_1946-2015.csv')
elo.head()
Out[9]:
gameorder game_id lg_id _iscopy year_id date_game seasongame is_playoffs team_id fran_id ... win_equiv opp_id opp_fran opp_pts opp_elo_i opp_elo_n game_location game_result forecast notes
0 1 194611010TRH NBA 0 1947 11/1/1946 1 0 TRH Huskies ... 40.294830 NYK Knicks 68 1300.0000 1306.7233 H L 0.640065 NaN
1 1 194611010TRH NBA 1 1947 11/1/1946 1 0 NYK Knicks ... 41.705170 TRH Huskies 66 1300.0000 1293.2767 A W 0.359935 NaN
2 2 194611020CHS NBA 0 1947 11/2/1946 1 0 CHS Stags ... 42.012257 NYK Knicks 47 1306.7233 1297.0712 H W 0.631101 NaN
3 2 194611020CHS NBA 1 1947 11/2/1946 2 0 NYK Knicks ... 40.692783 CHS Stags 63 1300.0000 1309.6521 A L 0.368899 NaN
4 3 194611020DTF NBA 0 1947 11/2/1946 1 0 DTF Falcons ... 38.864048 WSC Capitols 50 1300.0000 1320.3811 H L 0.640065 NaN

5 rows × 23 columns
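
The cleaning step described above can be sketched roughly as follows. This is a minimal illustration, not the exact code we ran: the small frame stands in for elo_1946-2015.csv, the 2015 rows and all point totals are made up, and FIRST_SEASON is a hypothetical cutoff chosen only to show the filter.

```python
import pandas as pd

# Tiny stand-in for elo_1946-2015.csv with only the columns this step touches.
# The 2015 rows and the point totals are invented for illustration.
elo = pd.DataFrame({
    "game_id": ["194611010TRH", "194611010TRH", "201504150BOS", "201504150BOS"],
    "_iscopy": [0, 1, 0, 1],
    "year_id": [1947, 1947, 2015, 2015],
    "seasongame": [1, 1, 82, 82],
    "is_playoffs": [0, 0, 0, 0],
    "fran_id": ["Huskies", "Knicks", "Celtics", "Bucks"],
    "pts": [66, 68, 105, 100],
})

FIRST_SEASON = 1996  # hypothetical cutoff; the text only says "20 years"

# Keep recent seasons only, then drop the columns the model does not use.
recent = elo[elo["year_id"] >= FIRST_SEASON]
cleaned = (
    recent
    .drop(columns=["seasongame", "is_playoffs", "_iscopy"])
    .reset_index(drop=True)
)
print(cleaned)
```

Dropping the _iscopy column alone still leaves one row per team for every game. If each game should appear only once, filtering on `_iscopy == 0` before dropping the column would be one way to do it.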

Our model was missing about a year and a half of data points that we needed, so we filled in the games played between the end of the 2015 data and the present day with the data set below, which was obtained by scraping NBA reference. This data set had to be cleaned so that the franchise names match across our data sets. For example, some sources list the Portland Trailblazers as the Portland Trail Blazers or just the Portland Blazers.

In [12]:
recent_data = pd.read_csv('recent_data.csv')
recent_data.head()
Out[12]:
Date Start (ET) Visitor/Neutral PTS Home/Neutral PTS.1
0 Tue, Oct 27, 2015 8:00 pm Detroit Pistons 106 Atlanta Hawks 94
1 Tue, Oct 27, 2015 8:00 pm Cleveland Cavaliers 95 Chicago Bulls 97
2 Tue, Oct 27, 2015 10:30 pm New Orleans Pelicans 95 Golden State Warriors 111
3 Wed, Oct 28, 2015 7:30 pm Philadelphia 76ers 95 Boston Celtics 112
4 Wed, Oct 28, 2015 7:30 pm Chicago Bulls 115 Brooklyn Nets 100
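
One way to make the scraped names line up with the short franchise names in the first data set is an explicit lookup table. The sketch below is illustrative only: NAME_TO_FRAN and to_fran_id are hypothetical names, the table lists just a few teams, and "Trailblazers" as the short name for Portland is an assumption.

```python
import pandas as pd

# Hypothetical lookup from scraped full names to the short names used as fran_id.
# Only a few teams are listed; a real table would cover all 30 franchises.
NAME_TO_FRAN = {
    "Detroit Pistons": "Pistons",
    "Atlanta Hawks": "Hawks",
    "Philadelphia 76ers": "Sixers",
    "Portland Trail Blazers": "Trailblazers",  # short name is an assumption
    "Portland Trailblazers": "Trailblazers",
    "Portland Blazers": "Trailblazers",
}


def to_fran_id(name: str) -> str:
    """Return the short franchise name, failing loudly on unknown spellings."""
    key = " ".join(name.split())  # collapse stray whitespace
    if key not in NAME_TO_FRAN:
        raise KeyError(f"no franchise mapping for {name!r}")
    return NAME_TO_FRAN[key]


scraped = pd.Series(["Detroit Pistons", "Portland  Blazers", "Philadelphia 76ers"])
print(scraped.map(to_fran_id).tolist())
```

Raising on an unknown spelling, rather than passing the name through unchanged, keeps a new variant from silently creating a 31st "team" in later steps.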

By combining the two previous data sets, we built the historical data set. It contains all the games in the years that we are considering, up to the present date. This is the data that we use to calculate our Elo ratings.

In [19]:
historical_data = pd.read_csv('historical_data.csv')
historical_data.head()
Out[19]:
fran_id pts opp_fran opp_pts game_location date
0 Pistons 106 Hawks 94 A 2015-10-27
1 Cavaliers 95 Bulls 97 A 2015-10-27
2 Pelicans 95 Warriors 111 A 2015-10-27
3 Sixers 95 Celtics 112 A 2015-10-28
4 Bulls 115 Nets 100 A 2015-10-28
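
A rough sketch of how the scraped games can be reshaped into the column layout shown above and appended to the older games. It assumes the team names have already been normalized, treats the visitor as fran_id (which is why game_location is "A"), and uses a made-up row as a stand-in for the cleaned older data.

```python
import pandas as pd

# Two scraped rows, with team names already normalized to short names.
recent_data = pd.DataFrame({
    "Date": ["Tue, Oct 27, 2015", "Wed, Oct 28, 2015"],
    "Visitor/Neutral": ["Pistons", "Bulls"],
    "PTS": [106, 115],
    "Home/Neutral": ["Hawks", "Nets"],
    "PTS.1": [94, 100],
})

# Reshape to the historical layout: the visitor becomes fran_id, so location is "A".
recent_games = pd.DataFrame({
    "fran_id": recent_data["Visitor/Neutral"],
    "pts": recent_data["PTS"],
    "opp_fran": recent_data["Home/Neutral"],
    "opp_pts": recent_data["PTS.1"],
    "game_location": "A",
    "date": pd.to_datetime(recent_data["Date"], format="%a, %b %d, %Y"),
})

# Stand-in for the cleaned older games (this row is invented for illustration).
older_games = pd.DataFrame({
    "fran_id": ["Celtics"],
    "pts": [105],
    "opp_fran": ["Bucks"],
    "opp_pts": [100],
    "game_location": ["H"],
    "date": pd.to_datetime(["2015-04-15"]),
})

# Stack both sources and keep the games in chronological order.
historical_data = (
    pd.concat([older_games, recent_games], ignore_index=True)
    .sort_values("date")
    .reset_index(drop=True)
)
print(historical_data)
```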

Another large piece missing from our data was the altitude, latitude, and longitude of each team's location. We use this information as one of the features in our model, and it had to be entered by hand.

In [25]:
teamInfo = pd.read_csv('teamInfo.csv', delimiter='\t')
teamInfo.head()
Out[25]:
fran_id altitude lat lon
0 Bucks 617 43.038902 -87.906471
1 Bulls 594 41.881832 -87.623177
2 Cavaliers 653 41.505493 -81.681290
3 Celtics 141 42.361145 -71.057083
4 Clippers 233 34.052235 -118.243683
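
As an illustration of how this table can feed a feature, the sketch below attaches the venue's altitude and coordinates to each game with a merge. The way the feature is built here is an assumption, not a description of our model: it supposes that every row has game_location "A", so that opp_fran is the home team, and the venue_* column names are hypothetical.

```python
import pandas as pd

# A few rows from the team table above.
teamInfo = pd.DataFrame({
    "fran_id": ["Bucks", "Bulls", "Celtics"],
    "altitude": [617, 594, 141],
    "lat": [43.038902, 41.881832, 42.361145],
    "lon": [-87.906471, -87.623177, -71.057083],
})

# Illustrative games in the historical layout (fran_id is the visitor).
games = pd.DataFrame({
    "fran_id": ["Bulls", "Celtics"],
    "opp_fran": ["Bucks", "Bulls"],
    "game_location": ["A", "A"],
})

# With game_location == "A", opp_fran is the home team, so its row gives the venue.
venue = teamInfo.rename(columns={
    "fran_id": "opp_fran",
    "altitude": "venue_altitude",
    "lat": "venue_lat",
    "lon": "venue_lon",
})

# validate guards against duplicate team rows silently multiplying games.
games_with_venue = games.merge(venue, on="opp_fran", how="left", validate="many_to_one")
print(games_with_venue)
```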

The projected win-loss data set is where we store the projected wins and losses for each team, along with that team's current Elo rating. It will be updated daily and displayed on our website.

In [13]:
ProjectedWL = pd.read_csv('ProjectedWL.csv')
ProjectedWL.head()
Out[13]:
Projected L Projected W elo fran_id
0 15.0 67.0 1770.075218 Warriors
1 19.0 63.0 1719.168956 Spurs
2 26.0 56.0 1645.724021 Cavaliers
3 26.0 56.0 1618.636284 Rockets
4 31.0 51.0 1598.583197 Wizards

The tomorrow data set shows the games that will be played tomorrow and the win/loss probability for each team in the matchup. It will be updated daily and displayed on our website.

In [16]:
tomorrow = pd.read_csv('tomorrow.csv')
tomorrow.head()
Out[16]:
fran_id opp_fran prob opp_prob
0 Warriors Hawks 0.647281 0.352719
1 Pacers Hornets 0.523157 0.476843
2 Heat Cavaliers 0.148315 0.851685
3 Kings Nuggets 0.331493 0.668507
4 Bulls Pistons 0.563439 0.436561
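
For reference, here is a minimal sketch of the standard Elo expected-score formula, which turns two ratings into a win probability. The first two rows of the original Elo table show forecasts of 0.640065 and 0.359935 for two teams rated 1300, which is what this formula gives with a 100-point home bonus. Whether the probabilities in the tomorrow table use the same constants is an assumption, and win_probability is a hypothetical helper name.

```python
def win_probability(elo: float, opp_elo: float, home_advantage: float = 0.0) -> float:
    """Standard Elo expected score for the first team.

    home_advantage is added to the first team's rating; pass a negative
    value when the first team is the visitor.
    """
    diff = elo + home_advantage - opp_elo
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


# Two evenly rated teams, seen from the home side and from the visiting side.
home = win_probability(1300.0, 1300.0, home_advantage=100.0)
away = win_probability(1300.0, 1300.0, home_advantage=-100.0)
print(round(home, 6), round(away, 6))  # 0.640065 0.359935
```

The two values always sum to 1, matching the prob and opp_prob columns above.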

This data set holds all the information for the games remaining in the 2017 season.

In [23]:
upcoming_games = pd.read_csv('upcoming_games.csv')
upcoming_games.head()
Out[23]:
fran_id pts opp_fran opp_pts game_location date
0 Raptors NaN Hawks NaN A 2017-03-10
1 Rockets NaN Bulls NaN A 2017-03-10
2 Magic NaN Hornets NaN A 2017-03-10
3 Nets NaN Mavericks NaN A 2017-03-10
4 Celtics NaN Nuggets NaN A 2017-03-10