« Introduction | Main | The Week Score (by position) »

July 25, 2008

Scraping from NFL.com

It appears that obtaining game-by-game data from the NFL is harder than originally thought. First, there seems to be some kind of monopoly with stats.com - so the NFL doesn't provide any kind of open format for stats. Second, there are very few third party providers of NFL stats (unlike MLB stats). Third, the providers that do exist all want money or only provide season-by-season aggregate data. I want game-by-game summaries.

Therefore, we have to "screen scrape" the NFL.com box scores. To do this, we first need to get the unique game_id for all games, which they provide on their "gamecenter" page. Below is the script I used to scrape up all the game_ids. Basically, it's the PHP curl module + some regular expression to pull out the game_id, which we use to create "game" class and entry in our MySQL database.

// Loads the season file to pull the game ids
// author: tom hayden 
include_once "config.lib.php";

$seasons = array("2001","2002","2003","2004","2005","2006","2007");
$games = array();
foreach($seasons AS $s)
{

        // loop through each week
        // assume 17 weeks
        for( $i=1; $i<=17; $i++ )
        {
                echo "loading season: $s \t week: $i \n";
                // generate the url
                $url = "http://www.nfl.com/scores?season=$s&week=Week+$i";

                // use curl to pull down the page
                $slurp = new slurp( $url );

                // extract game ids
                preg_match_all("/.*boxscore\?game_id=(.*)\&disp./",$slurp
                ->output,$matches);

                // feed game id matches into an array
                foreach($matches[1] AS $m)
                {
                        // load into mysql
                        $game = new game();
                        $game->create($m,$i,$s);

                }

                print_r($games);
        }

}

So, we basically end up with a table that looks like the below sample. In total, I ended up with 1,785 rows for all regular season games from 2001-2007.

mysql> select * from game order by rand() desc limit 10;
+-------+------+--------+
| id    | week | season |
+-------+------+--------+
| 17694 |   17 |   2001 | 
| 27150 |   17 |   2004 | 
| 29343 |   11 |   2007 | 
| 26935 |    2 |   2004 | 
| 29315 |    9 |   2007 | 
| 26500 |    1 |   2003 | 
| 27052 |   10 |   2004 | 
| 26464 |   13 |   2003 | 
| 29325 |   10 |   2007 | 
| 18343 |   17 |   2002 | 
+-------+------+--------+
10 rows in set (0.07 sec)

Next, we need to loop through all of the individual game box scores and scrape out all of the individual player game stats (touchdowns, yards, receptions, field goals, etc). This is much more complicated and the code is way to long to post on a blog. In a nutshell, I used a script similar to that above but had a ton of regular expressions and some help from Troy Wolf's class_http table-to-array script to build a massive stats table (see sample of a random John Kitna score below).

mysql> select * from stats where pid='00-0009311' order by rand() limit 1;
+--------+-------+------+------------+-----------+-----------+----------+---------+----------+----------+----------+---------+---------+---------+---------+--------+--------+---------+----------+---------+---------+---------+---------+---------+----------+---------+----------+----------+---------+-------+--------+-------+-------+-------+--------+-------+-------+--------+---------+---------+--------+--------+
| id     | gid   | tid  | pid        | stat_type | pass_cpat | pass_yds | pass_td | pass_int | rush_att | rush_yds | rush_td | rush_lg | rec_rec | rec_yds | rec_td | rec_lg | fum_fum | fum_lost | fum_rec | fum_yds | kick_fg | kick_lg | kick_xp | kick_pts | punt_no | punt_avg | punt_i20 | punt_lg | kr_no | kr_avg | kr_td | kr_lg | pr_no | pr_avg | pr_td | pr_lg | def_ta | def_sck | def_int | def_ff | points |
+--------+-------+------+------------+-----------+-----------+----------+---------+----------+----------+----------+---------+---------+---------+---------+--------+--------+---------+----------+---------+---------+---------+---------+---------+----------+---------+----------+----------+---------+-------+--------+-------+-------+-------+--------+-------+-------+--------+---------+---------+--------+--------+
| 215469 | 28895 | DET  | 00-0009311 | Passing   | 25/40     |      342 |       2 |        1 |        0 |        0 |       0 |       0 |       0 |       0 |      0 |      0 |       0 |        0 |       0 |       0 | 0       |       0 | 0       |        0 |       0 |        0 |        0 |       0 |     0 |      0 |     0 |     0 |     0 |      0 |     0 |     0 | 0      |    0.00 |       0 |      0 |  21.40 | 
+--------+-------+------+------------+-----------+-----------+----------+---------+----------+----------+----------+---------+---------+---------+---------+--------+--------+---------+----------+---------+---------+---------+---------+---------+----------+---------+----------+----------+---------+-------+--------+-------+-------+-------+--------+-------+-------+--------+---------+---------+--------+--------+

Posted by haydenth at July 25, 2008 09:59 PM

Comments

Login to leave a comment. Create a new account.