XML Import in Ruby

I wrote a Ruby program that parsed a huge XML file shipped to us by a partner. Over several weeks, working a day a week, I coaxed our partner into producing a correct file. Throughout this effort Graphviz was my friend.

I used Nokogiri, the usual XML parser for Ruby, in its incremental mode. I collected statistics on tags and their nesting structure and wrote them to a dot file every ten thousand nodes or so. This let me watch the import run, since Graphviz rerendered with each write.
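
A minimal sketch of such a collector follows. It is not the original program. The class name TagStats, its method names, and the flush_every option are all hypothetical. The callbacks have the shape of SAX-style start and end element events, so a thin handler for an incremental parser could forward to them. The parser hookup is left out so the sketch runs with plain Ruby.

```ruby
# Hypothetical collector; all names and the flush interval are illustrative.
class TagStats
  attr_reader :counts, :edges

  def initialize(dot_path: nil, flush_every: 10_000)
    @dot_path = dot_path        # where to write the graph; nil disables writing
    @flush_every = flush_every  # rewrite the dot file after this many elements
    @stack = []                 # names of the currently open elements
    @counts = Hash.new(0)       # tag name => occurrences
    @edges = Hash.new(0)        # [parent, child] => occurrences
    @seen = 0                   # total opening tags seen so far
  end

  # Call when an opening tag is read.
  def start_element(name, _attrs = [])
    @counts[name] += 1
    @edges[[@stack.last, name]] += 1 if @stack.last
    @stack.push(name)
    @seen += 1
    write_dot if @dot_path && (@seen % @flush_every).zero?
  end

  # Call when a closing tag is read.
  def end_element(_name)
    @stack.pop
  end

  # Render tags as nodes and parent-child nesting as edges, both with counts.
  def to_dot
    lines = ["digraph tags {"]
    @counts.each { |tag, n| lines << %(  "#{tag}" [label="#{tag}\\n#{n}"];) }
    @edges.each do |(parent, child), n|
      lines << %(  "#{parent}" -> "#{child}" [label="#{n}"];)
    end
    lines << "}"
    lines.join("\n") + "\n"
  end

  # Overwrite the dot file with the statistics gathered so far.
  def write_dot
    File.write(@dot_path, to_dot)
  end
end
```

Because the file is overwritten on every flush, any viewer that watches the dot file redraws the graph as the import proceeds. An edge that should not exist, or a count that looks wrong, stands out in the picture long before the run finishes.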

I reported discrepancies by citing facts and line numbers. These were hard to refute. Each could be traced to flaws in our partner's data or bugs in their hand-crafted export program.

With each interaction my complaints moved from simple relationships to more complex semantics. Had we imported the data without cleaning it, we would be hunting their bugs in our code forever.

See Exploratory Parsing, where I generalized this approach beyond XML.