
Lousy reporting at TechCrunch

Consider this gem from [[http://www.techcrunch.com/2008/02/20/yahoo-search-wants-to-be-more-like-google-embraces-hadoop/|a recent post on TechCrunch]]:

//Hadoop is an open-source implementation of Google’s MapReduce software and file system. It takes all the links on the Web found by a search engine’s crawlers and “reduces” them to a map of the Web so that ranking algorithms can be run against them.//

As one of the commenters rightly said, that is probably the most inaccurate description of Hadoop (or of the MapReduce paradigm in general). The author, Erick Schonfeld, updated the post with another explanation based on feedback in the comments:

//What MapReduce and Hadoop do is break up a computation problem into mangeable chunks and distribute them to different processors—that is the “map” part, it is mapping the data. Once all of the individual results are in, they are combined into one big result—that is the reduce part. Search engines, in turn, use this technique to literally map the Web.//

While this is better, it is still far from an accurate and concise description. For instance, I’m not sure I understand (or, for that matter, that even Erick understands) what the phrase “it is mapping the data” is supposed to mean in this context. Nor is it accurate to say that results are “combined into one big result” in the reduce phase: reduce is applied once per key, so a job normally produces many independent outputs rather than a single one.
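For contrast, here is a minimal single-process sketch of the paradigm, using the classic word-count example. It is not Hadoop's API: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are made up for illustration, and the grouping step stands in for the shuffle that a real framework performs across machines. The point is only that "map" emits intermediate key/value pairs and "reduce" runs once per key.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: turn one input record into intermediate (key, value) pairs."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: called once per key with all values emitted for that key."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver; a real framework distributes each step."""
    groups = defaultdict(list)
    # Map phase: every input record is processed independently.
    for key, value in inputs.items():
        for out_key, out_value in map_fn(key, value):
            # Shuffle: group intermediate values by their key.
            groups[out_key].append(out_value)
    # Reduce phase: one call per distinct key, yielding one result per key.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


docs = {"d1": "the web is big", "d2": "the web is a graph"}
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
```

Note that the result is one entry per word, not "one big result", and that nothing in the map step involves drawing a map of anything.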

Given the visibility and readership of a prominent blog like TC, I find such reporting to be below par. Had the author taken five minutes to understand the basic MapReduce paradigm, all this confusion could easily have been avoided.