Review of “Data Crunching”

June 21st, 2005 by Mathias Meyer

If you’re reading this, you probably spend some quality time developing software. If you’re developing software, chances are that you have to move data around on a daily basis (lucky you, if you don’t): getting data from one text format to another, moving data from a legacy system to a newer project’s database, transforming XML into some more readable format for your boss, or trying to get some useful data out of a former colleague’s own binary format. Whatever you do in that manner, you’re crunching data. Greg Wilson seems to have spent a lot of time crunching data and wants to share his wisdom with the world of pragmatic programmers. The book’s code examples are in Python and Java. I hadn’t worked with Python before, but being familiar with Ruby and Groovy I didn’t find it hard to get an idea of what the Python code does (and I’m starting to like Python). So you’ve been warned about that.
Being a big fan of The Pragmatic Programmers’ bookshelf I didn’t hesitate to buy a copy of “Data Crunching” as well. Since I spend a lot of time doing stuff with some more or less usable data I thought it might be a good read to get some fresh ideas. And as it turns out that was a good choice.
Let’s dive into the world of crunching data. Greg takes it easy on the reader in the introduction. He starts off with short examples from his professional career, which helps a lot in getting an idea of what data crunching actually is. If you didn’t already know, reading the first chapter will give you some hints. The book is organised in a simple way. The following chapters take you through the data sources and formats you’ll most likely come into contact with: text, regular expressions, XML, binary data, and relational databases. The book ends with a short chapter about the so-called horseshoe nails, that is, things that didn’t fit anywhere else. But we’ll get to that later. Not surprisingly, every chapter ends with a short summary.
The (more or less) simplest data you can work with is text. While some genius programmer in some company whose products you use (or once used) can always come up with a great new text format that nobody will ever understand, there’s a good chance that you’ll at least get an idea of its meaning by looking at a text file. Greg takes an example from the introduction some steps further to show the basics of working with text files, and also how to work with and around the common pitfalls. Since this is a pragmatic book, you also learn how to keep your data crunching code nice and clean, how to deal with normalising and collision detection and, of course, the basics of working with the UNIX shell (the tool of my choice for dealing with most “normal” text). After reading this chapter you have a very good idea about dealing with text. It would be almost impossible to compress more information about dealing with text into one chapter.
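As a rough illustration of normalising and collision detection (my own sketch, not an example from the book, and only one plausible reading of those terms), the Python below reduces names to a canonical form and reports which distinct raw spellings end up sharing the same key. The function names `normalise` and `find_collisions` are made up for this sketch.

```python
from collections import defaultdict


def normalise(name):
    """Collapse runs of whitespace, trim the ends, and lowercase the text."""
    return " ".join(name.split()).lower()


def find_collisions(names):
    """Map each normalised key to the distinct raw spellings that share it.

    Only keys with more than one raw spelling are reported.
    """
    groups = defaultdict(set)
    for raw in names:
        groups[normalise(raw)].add(raw)
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }
```

Calling `find_collisions` on a list of raw values gives a quick picture of how messy the input is before anything gets written to its destination.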
Ah, regular expressions. The sheer joy of getting to know all the differences between grep, sed, awk, vim, Perl RE and the like just keeps me alive. Giving probably the best and shortest (but still understandable) introduction to working with regular expressions, Greg also gives good examples of what you can and cannot (or should not) do with them. Skimming through the pages you’ll find that regular expressions can be applied to a lot of problems when it comes to handling input.
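To give a flavour of regular expressions applied to input handling, here is a minimal Python sketch of my own (not taken from the book). The log line format and the name `parse_line` are invented for the illustration.

```python
import re

# Hypothetical line format, invented for this sketch:
#   "2005-06-21 ERROR disk full"
LINE_PATTERN = re.compile(r"^(\d{4})-(\d{2})-(\d{2})\s+(\w+)\s+(.*)$")


def parse_line(line):
    """Return (year, month, day, level, message), or None if the line does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    year, month, day, level, message = match.groups()
    return int(year), int(month), int(day), level, message
```

Returning `None` for lines that don’t match keeps the decision about bad input (skip, log, abort) with the caller rather than burying it in the pattern.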
Working with XML is something I never really got comfortable with, but I gotta say Greg managed to change my mind here. He introduces the two basic techniques for working with XML, SAX and DOM, showing their strengths and their weaknesses. Pretty much nothing else to say here. The good thing is that he prefers showing how to work with JDOM (Java) and xml.dom.minidom (Python) rather than the clunky C-style DOM API. The real beauty of working with XML is XPath, at least for me. On the other hand there is XSLT, which is more verbose than useful. You might get a similar impression reading this chapter. But I’m not here to judge (well, not about XPath and XSLT, anyway); the day might come when I’ll have to get back to XSLT. It’s always good to know the choices you have.
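For readers who haven’t seen xml.dom.minidom, a small sketch of the DOM style might look like the following. This is my own illustration, not code from the book; the sample document and the name `titles_by_year` are made up.

```python
from xml.dom import minidom

# Sample document invented for this sketch.
SAMPLE = """<library>
  <book year="2005"><title>Data Crunching</title></book>
  <book year="1999"><title>The Pragmatic Programmer</title></book>
</library>"""


def titles_by_year(xml_text):
    """Build a {year: title} dict by walking the parsed DOM tree."""
    doc = minidom.parseString(xml_text)
    result = {}
    for book in doc.getElementsByTagName("book"):
        title_node = book.getElementsByTagName("title")[0]
        # The title text lives in a text node that is the first child of <title>.
        result[int(book.getAttribute("year"))] = title_node.firstChild.data
    return result
```

DOM loads the whole document into memory and lets you navigate it freely, which is convenient for small files; SAX, by contrast, streams events and suits documents too large to hold at once.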
If you haven’t had a chance to work with binary data yet, then the next chapter is for you. One could debate whether there’s still a need to fiddle with binary data in the modern world. Or you could just give it a shot. The examples are pretty straightforward and understandable. Greg does an impressive job of working through the ups and downs of working with binary data. My fears certainly turned into curiosity after this chapter. After a short introduction to the world of 0 and its buddy 1, you’ll learn how to pack and unpack different data types in fixed and variable formats with metadata.
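Packing and unpacking a variable-length record with a bit of metadata can be sketched with Python’s `struct` module. This is my own minimal example, not the book’s; the record layout (a 16-bit id followed by a 32-bit payload length) and the function names are assumptions made for the illustration.

```python
import struct

# Assumed layout for this sketch: little-endian, no padding,
# unsigned 16-bit record id followed by unsigned 32-bit payload length.
HEADER = struct.Struct("<HI")


def pack_record(record_id, text):
    """Encode the text as UTF-8 and prefix it with a fixed-size header."""
    payload = text.encode("utf-8")
    return HEADER.pack(record_id, len(payload)) + payload


def unpack_record(data):
    """Read the header, then slice out exactly as many payload bytes as it announces."""
    record_id, length = HEADER.unpack_from(data, 0)
    start = HEADER.size
    payload = data[start:start + length]
    return record_id, payload.decode("utf-8")
```

The length field is the metadata here: the fixed-size header tells the reader how many bytes of variable-size data follow, so records can be read back to back from one stream.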
The chapter on relational databases starts off with the best summary of SQL I’ve read so far, including joins, nested queries and normalisation. Besides text, databases definitely are one of my favourite tools for data crunching, whatever dialect of SQL they speak. The SQL you’ll learn in this chapter might be almost everything you’ll ever need for working with data from MySQL, Oracle, and the like. Since working with SQL in your code is not the hardest part here, Greg keeps the focus on showing what you can do with SQL itself.
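As a small illustration of letting SQL do the work (again my own sketch, not the book’s), the following uses Python’s built-in `sqlite3` module with an in-memory database. The tables, rows and the name `books_per_author` are invented for the example.

```python
import sqlite3


def books_per_author():
    """Count books per author with a join and GROUP BY, entirely in SQL."""
    conn = sqlite3.connect(":memory:")
    # Schema and data are made up for this sketch.
    conn.executescript("""
        CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE book (
            id INTEGER PRIMARY KEY,
            author_id INTEGER REFERENCES author(id),
            title TEXT
        );
        INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO book VALUES (1, 1, 'First'), (2, 1, 'Second'), (3, 2, 'Third');
    """)
    rows = conn.execute("""
        SELECT author.name, COUNT(book.id)
        FROM author JOIN book ON book.author_id = author.id
        GROUP BY author.name
        ORDER BY author.name
    """).fetchall()
    conn.close()
    return rows
```

The Python side only opens a connection and fetches rows; the join and the aggregation happen in the database.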
The grand finale is a small collection of so-called horseshoe nails: miscellaneous techniques that will help you while crunching data but didn’t really fit anywhere else. I definitely agree with Greg that those nails didn’t fit anywhere else, but they’re very much worth reading anyway. He introduces some basic tools like JUnit, diff and Make. He finishes with some short information about encoding/decoding, floating point arithmetic and working with dates and times.
This book is a gold mine for the software developer, be it a beginner or one who has been crunching data on a daily basis for years. The examples are very applicable to the everyday life of a developer. They are simple enough to be immediately understood, but powerful enough to be good snippets to reuse, work on, or build your own data crunching code on. Greg does an amazing job of keeping the examples and the text at a level that is both understandable and helpful for every developer. The book should be on your shelf (as well as on mine) whenever you have to write a small script or program to work with yet another data format the world didn’t know existed. Greg will keep you sane and on track with his book. It’s, after all, a pragmatic book! Just like with the other ones (which I can recommend without hesitation), you’ll find tons of information packed into an entertaining, but nonetheless helpful book.
