February 26th, 2008

Lousy reporting at TechCrunch

Consider this gem from [[http://www.techcrunch.com/2008/02/20/yahoo-search-wants-to-be-more-like-google-embraces-hadoop/|a recent post on TechCrunch]]:

//Hadoop is an open-source implementation of Google’s MapReduce software and file system. It takes all the links on the Web found by a search engine’s crawlers and “reduces” them to a map of the Web so that ranking algorithms can be run against them.//

As one of the commenters rightly said, that is probably the most inaccurate description of Hadoop (or of the Map-Reduce paradigm in general). The author, Erick Schonfeld, updated the post with another explanation based on feedback in the comments:

//What MapReduce and Hadoop do is break up a computation problem into mangeable chunks and distribute them to different processors—that is the “map” part, it is mapping the data. Once all of the individual results are in, they are combined into one big result—that is the reduce part. Search engines, in turn, use this technique to literally map the Web.//

While this is better, it is still far from an accurate and concise description. For instance, I’m not sure I understand (or, for that matter, that even Erick understands) the relevance or meaning of the phrase “it is mapping the data” in this context. The same goes for the claim that results are “combined into one big result” in the reduce phase.
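For reference, here is a minimal single-process sketch of the paradigm, using the usual word-count example. The names ''map_fn'', ''reduce_fn'' and ''run_mapreduce'' are my own for illustration and are not Hadoop's or Google's API, and nothing here is distributed. It only shows the shape of the computation: a map function that emits key/value pairs for each input record, a grouping step that collects values by key, and a reduce function that is applied once per key.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit one (word, 1) pair for every word in a single input record."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: fold all values that share one key into a single value."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy in-memory driver (hypothetical helper, not a real framework API)."""
    # Map phase: every record is processed independently of the others,
    # which is what lets a real system spread this work across machines.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle phase: group the intermediate values by their key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one reduce call per distinct key.
    return dict(reduce_fn(key, values) for key, values in groups.items())


docs = [
    ("d1", "the quick brown fox"),
    ("d2", "the lazy dog"),
    ("d3", "the fox"),
]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```

In this sketch the reduce phase produces one output value per distinct key (one count per word), not a single combined result, and the map phase transforms records into key/value pairs rather than "mapping the data" in any cartographic sense.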

Given the visibility and readership of a prominent blog like TC, I find such reporting to be below par. Had the author taken five minutes to understand the basic map-reduce paradigm, all this confusion could easily have been avoided.

January 11th, 2008

New MapReduce article in CACM

Several people have already [[http://googlesystem.blogspot.com/2008/01/google-reveals-more-mapreduce-stats.html|noted]] that Google has published updated statistics on [[http://labs.google.com/papers/mapreduce.html|MapReduce]] in a [[http://doi.acm.org/10.1145/1327452.1327492|recent article]] in the Communications of the ACM.

While numbers from Google are always interesting, what struck me was the **absolutely pathetic** quality of the graphs in the article. To see what I mean, check out the graphs on Page 5 (I think you need an ACM account to get the PDF). They are hardly readable, either in print or on screen (zooming in doesn’t help). Here is a screenshot (I have included some of the surrounding text to give you an idea of the resolution):

{{ http://floatingsun.net/wordpress/wp-content/uploads/2008/01/screenshot15.png|MapReduce graph}}

As a member of the academic community, I’m quite disappointed and surprised that neither the authors nor the editors took note of such an obvious shortcoming. MapReduce is great work, and a publication like CACM reaches a much broader audience than the conference proceedings of OSDI (where MapReduce was originally published), so I would expect the presentation to be top-notch (and remember, this is Google we’re talking about). What irks me most is that these are the //exact// same graphs (or at least some of them are) from the original MapReduce paper ({{http://labs.google.com/papers/mapreduce-osdi04.pdf|pdf here}}). Was it so hard to just copy and paste or import the figures without messing up the resolution so badly?