Our data collection started with the CSV file below, which contains all historical game data from the beginning of the NBA through the 2015 regular season. We decided to drop all data from more than 20 years ago, focusing on the period in which the current 30 NBA franchises have been established and a consistent part of the league. We also dropped some of the more inconsequential columns (such as the season-game and playoff indicators) and deleted the is_copy column, which only marked duplicated games. Essentially, we cleaned out the data we would not be incorporating into our model.
import pandas as pd
elo = pd.read_csv('elo_1946-2015.csv')
elo.head()
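The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration on a toy frame, not the exact code we ran: the column names (`year_id`, `is_copy`, `seasongame`, `is_playoffs`) and the cutoff constants are assumptions, and the rows are made up so the snippet runs on its own.

```python
import pandas as pd

# Toy stand-in for the raw file. Column names and values are assumptions
# chosen for illustration, not the real contents of the CSV.
raw = pd.DataFrame({
    "year_id": [1990, 1996, 2015, 2015],
    "is_copy": [0, 0, 0, 1],
    "seasongame": [1, 1, 5, 5],
    "is_playoffs": [0, 0, 0, 0],
    "team_id": ["BOS", "CHI", "GSW", "CLE"],
    "pts": [101, 98, 110, 102],
})

LAST_SEASON = 2015   # last season covered by the file
YEARS_KEPT = 20      # how far back we keep games


def clean_elo(df):
    """Keep recent seasons, drop mirrored rows, and remove unused columns."""
    recent = df[df["year_id"] > LAST_SEASON - YEARS_KEPT]
    originals = recent[recent["is_copy"] == 0]
    unused = ["is_copy", "seasongame", "is_playoffs"]
    return originals.drop(columns=unused).reset_index(drop=True)


cleaned = clean_elo(raw)
```

Filtering on the copy flag before dropping the column matters: once the column is gone there is no cheap way to tell a mirrored row from the original.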
We were still missing about a year and a half of games that our model needed, so we filled in the games between 2015 and the present day with the data set below, which was obtained by scraping NBA reference. This data set needed to be cleaned so that the franchise names were the same across all of our data sets. For example, depending on the source, the same franchise is listed as the Portland Trailblazers, the Portland Trail Blazers, or just the Portland Blazers.
recent_data = pd.read_csv('recent_data.csv')
recent_data.head()
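One way to make the franchise names consistent is a small alias table applied to every team column. The sketch below is illustrative only: the `TEAM_ALIASES` dictionary, the `normalize_team` helper, the `home_team` column name, and the choice of "Portland Trail Blazers" as the canonical spelling are all assumptions rather than the exact cleanup we performed.

```python
import pandas as pd

# Hypothetical alias table: each variant spelling maps to one canonical name.
TEAM_ALIASES = {
    "Portland Trailblazers": "Portland Trail Blazers",
    "Portland Blazers": "Portland Trail Blazers",
}


def normalize_team(name):
    """Collapse stray whitespace, then map known variants to the canonical name."""
    tidy = " ".join(name.split())
    return TEAM_ALIASES.get(tidy, tidy)


# Toy rows; the column name is an assumption.
games = pd.DataFrame({
    "home_team": ["Portland Blazers", " Portland  Trailblazers", "Boston Celtics"],
})
games["home_team"] = games["home_team"].map(normalize_team)
```

Names that are not in the alias table pass through unchanged, so the table only needs entries for the spellings that actually differ between sources.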
Combining the two previous data sets gave us the historical data set, which contains every game in the years we are considering, up to the present date. This is the data we use to calculate our Elo ratings.
historical_data = pd.read_csv('historical_data.csv')
historical_data.head()
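Stacking two game logs into one can be sketched as below. The frames, team labels, and column names (`date`, `home_team`, `away_team`) are placeholders invented for the example, and dropping duplicates on those three columns is an assumption about what identifies a game.

```python
import pandas as pd

# Placeholder rows standing in for the older and the scraped data sets.
older = pd.DataFrame({
    "date": ["2015-04-14", "2015-04-15"],
    "home_team": ["Team A", "Team B"],
    "away_team": ["Team C", "Team D"],
})
newer = pd.DataFrame({
    "date": ["2015-04-15", "2015-10-27"],   # first row overlaps with `older`
    "home_team": ["Team B", "Team D"],
    "away_team": ["Team D", "Team A"],
})

combined = (
    pd.concat([older, newer], ignore_index=True)
    .assign(date=lambda df: pd.to_datetime(df["date"]))
    # Assumption: date plus the two teams is enough to identify a game.
    .drop_duplicates(subset=["date", "home_team", "away_team"])
    .sort_values("date")
    .reset_index(drop=True)
)
```

Sorting by date after the merge keeps the games in chronological order, which is what a rating system that updates game by game needs.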
We were also missing the altitude, latitude, and longitude of each team's location. We use this information as one of the features in our model, and it had to be entered by hand.
teamInfo = pd.read_csv('teamInfo.csv', delimiter='\t')
teamInfo.head()
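To turn per-team location data into a per-game feature, the team table can be joined onto the games once for the home side and once for the away side. The sketch below is only one possible approach: the column names, the placeholder teams and values, and the altitude-difference feature itself are assumptions made for illustration.

```python
import pandas as pd

# Placeholder team table; names and numbers are invented for the example.
team_info = pd.DataFrame({
    "team": ["Team A", "Team B"],
    "altitude_m": [1600.0, 10.0],
})
games = pd.DataFrame({
    "home_team": ["Team A", "Team B"],
    "away_team": ["Team B", "Team A"],
})

# Prefixing renames `team` to `home_team` / `away_team`, which become the join keys.
home = team_info.add_prefix("home_")
away = team_info.add_prefix("away_")

features = (
    games
    .merge(home, on="home_team", how="left", validate="many_to_one")
    .merge(away, on="away_team", how="left", validate="many_to_one")
)
# Hypothetical feature: how much higher the home arena is than the visitor's.
features["altitude_diff_m"] = (
    features["home_altitude_m"] - features["away_altitude_m"]
)
```

The `validate="many_to_one"` argument makes pandas raise an error if a team appears twice in the hand-entered table, which is an easy mistake to make when typing values manually.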
The projected win-loss data set is where we store the projected wins and losses for each team, along with that team's current Elo rating. It will be updated daily and displayed on our website.
ProjectedWL = pd.read_csv('ProjectedWL.csv')
ProjectedWL.head()
The tomorrow data set shows the games that will be played tomorrow and the win/loss probability for each team in the matchup. It will be updated daily and displayed on our website.
tomorrow = pd.read_csv('tomorrow.csv')
tomorrow.head()
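For reference, the textbook Elo formula converts a rating gap into a win probability as shown below. This is the standard expected-score expression only; our model also uses other features, so the probabilities in the table above are not necessarily produced by this function alone. The function name is hypothetical.

```python
def elo_win_probability(rating_a, rating_b):
    """Standard Elo expected score for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Two equally rated teams are a coin flip.
even = elo_win_probability(1500, 1500)

# A 100-point edge gives the favorite roughly a 64% chance.
favorite = elo_win_probability(1600, 1500)
underdog = elo_win_probability(1500, 1600)
```

The two probabilities for a matchup always sum to one, so only one of them needs to be computed per game.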
This data set holds the information for all games remaining in the 2017 season.
upcoming_games = pd.read_csv('upcoming_games.csv')
upcoming_games.head()