Project: Corsi Database – Parsing NHL’s Play-by-Play HTML Game Summary Data

The earlier parsing covered the high-level details of a Play-by-Play (PbP) HTML Game Summary, but finding out who is on the ice and who is taking shots requires a more detailed parse of the file. To simplify my effort, I’m using an excellent tool called HtmlAgilityPack, which lets me navigate HTML files with XPath-style queries. With recent versions of Visual Studio, HtmlAgilityPack is available as a NuGet package, so it’s a very simple add-in. Documentation, on the other hand, is harder to come by; I found that trial and error, plus the XPath tutorials at W3Schools, worked best.
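To make the idea of XPath-style navigation concrete, here is a minimal sketch. It is written in Python with the standard library's `xml.etree.ElementTree` rather than HtmlAgilityPack, purely so it runs without extra packages. The table markup, the `event` class name, and the column order are hypothetical stand-ins, not the NHL's actual layout, and ElementTree only accepts well-formed markup, which a real PbP page may not be.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a play-by-play table.
# The real NHL report is larger and is not guaranteed to be well-formed XML.
SAMPLE = """
<table>
  <tr class="heading"><td>#</td><td>Per</td><td>Event</td><td>Description</td></tr>
  <tr class="event"><td>1</td><td>1</td><td>FAC</td><td>Faceoff won at center ice</td></tr>
  <tr class="event"><td>2</td><td>1</td><td>SHOT</td><td>Wrist shot on goal</td></tr>
</table>
"""


def parse_events(markup):
    """Return one dict per event row found in the markup."""
    root = ET.fromstring(markup)
    events = []
    # ".//tr[@class='event']" selects every <tr> at any depth whose
    # class attribute is exactly "event", skipping the heading row.
    for row in root.findall(".//tr[@class='event']"):
        # Collect the text of each direct <td> child, treating empty cells as "".
        cells = [(td.text or "").strip() for td in row.findall("td")]
        events.append({
            "number": int(cells[0]),
            "period": int(cells[1]),
            "type": cells[2],
            "description": cells[3],
        })
    return events


if __name__ == "__main__":
    for event in parse_events(SAMPLE):
        print(event)
```

The same shape carries over to other XPath-capable parsers: one query picks out the rows, then a second, relative query picks out the cells inside each row.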

HtmlAgilityPack

I don’t know how consistent the NHL’s PbP HTML format is, so my goal is to harden the parsing algorithm over 5-10 games. Tonight I wrote the basic parsing algorithm and nearly made it through the first game of the season before it crashed in the second period with a null exception during the player parsing. Still, it’s a good place to stop because I made significant headway: the base code is pretty solid, but the edges need work.
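A null exception like this often comes from a query that matched nothing, followed by code that uses the result anyway. In HtmlAgilityPack, `SelectNodes` and `SelectSingleNode` return null when there is no match. The post does not say which lookup failed, so the sketch below only illustrates the general guard pattern, not the actual fix. It uses Python's standard `xml.etree.ElementTree`, where `find` returns `None` in the same situation. The `type` and `onice` class names and the `PEND` row without players are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows: the second one has no on-ice cell at all,
# which is the kind of edge case that can trip up a parser.
ROWS = """
<table>
  <tr class="event">
    <td class="type">SHOT</td>
    <td class="onice"><span>11</span><span>29</span></td>
  </tr>
  <tr class="event">
    <td class="type">PEND</td>
  </tr>
</table>
"""


def players_on_ice(row):
    """Return the jersey numbers listed in a row, or [] if the cell is absent."""
    cell = row.find("td[@class='onice']")
    if cell is None:
        # find() returns None when nothing matches; bail out instead of crashing.
        return []
    return [
        int(span.text)
        for span in cell.findall("span")
        if span.text and span.text.strip().isdigit()
    ]


def parse_rows(markup):
    """Return (event type, on-ice jersey numbers) for every event row."""
    root = ET.fromstring(markup)
    parsed = []
    for row in root.findall(".//tr[@class='event']"):
        type_cell = row.find("td[@class='type']")
        # Guard the event-type lookup as well, for the same reason.
        event_type = (type_cell.text or "").strip() if type_cell is not None else ""
        parsed.append((event_type, players_on_ice(row)))
    return parsed


if __name__ == "__main__":
    print(parse_rows(ROWS))
```

Returning an empty list for a missing cell keeps the loop going through the rest of the game. Logging those rows as well would make it easier to see which edge cases still need handling.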

Parsing PbP using HtmlAgilityPack

I think one more night will get me through one complete game, and another night after that should give us a polished NHL PbP scraper. Then we’ll refine our database so that it can hold the season data. These first few weeks will be hard because we have to build out our database before we can do any analysis, but we’re making progress.
