Category: Technology

August 4th, 2010

Toying with node.js

A commenter rightly complained that despite my claims of “playing around” with node.js, all I could come up was with the example in the man page. I replied saying that I did intend to post something that I wrote from scratch, and as promised, here is my first toy node.js program:

var sys = require('sys');
var http = require('http');
var url = require('url');
var path = require('path');

function search() {
  stdin = process.openStdin();
  stdin.setEncoding('utf8');
  stdin.on('data', function(term) {
    term = term.substring(0, term.length - 1);
    var google = http.createClient(80, 'ajax.googleapis.com');
    var search_url = "/ajax/services/search/web?v=1.0&q=" + term;
    var request = google.request('GET', search_url, {
      'host': 'ajax.googleapis.com',
      'Referer': 'http://floatingsun.net',
      'User-Agent': 'NodeJS HTTP client',
      'Accept': '*/*'});
    request.on('response', function(response) {
      response.setEncoding('utf8');
      var body = ""
      response.on('data', function(chunk) {
        body += chunk;
      });
      response.on('end', function() {
        var searchResults = JSON.parse(body);
        var results = searchResults["responseData"]["results"];
        for (var i = 0; i < results.length; i++) {
          console.log(results[i]["url"]);
        }
      });
    });
    request.end();
  });
}

search();

This program (also available as a gist) reads in search terms on standard input, and does a Google search on those terms, printing the URLs of the search results.

I was quite surprised (and a bit embarrassed) at how long it took me to get this simple program working. For instance, it took me the better part of an hour to realize that when I read something from stdin, it includes the trailing newline (as the user hits ‘Enter’). Earlier, I was using the input as-is for the search term, and that was leading to a 404 error, because the resulting URL was malformed.

Debugging was also harder, as expected. Syntax errors are easily caught by V8, but everything else is still obscure. I’m sure some of the difficulty is because of my lack of expertise with Javascript. But at one point, I got this error:

events:12
        throw arguments[1];
                       ^
Error: Parse Error
    at Client.ondata (http:881:22)
    at IOWatcher.callback (net:517:29)
    at node.js:270:9

I still haven’t figured out exactly where that error was coming from. Nonetheless, it was an interesting exercise. I’m looking forward to writing some non-trivial code with node.js now.

August 3rd, 2010

Some thoughts on dbShards

I heard about dbShards via two recent blog posts — one by Curt Monash and the other by Todd Hoff. It seemed like an interesting product, so I spent some time digging around on their website.

As the name suggests, dbShards is all about sharding. Sharding, also known as partitioning, is the process of distributing a given dataset into smaller chunks based some policy. AFAIK, the term “shard” was popularized recently by Google even though the concept of partitioning is at least a few decades old. Most distributed data management systems implement some form of sharding by necessity, since the entire data set will not fit in memory on a single node (if it would, you should not be using a distributed system). And therein lies the USP of dbShards — it brings sharding (and with it, performance and scalability) to commodity, single-node databases such as MySQL and Postgres.

So how does it work? Well, dbShards acts as a transparent layer sitting in front of multiple nodes running MySQL, lets say. Transparent, because they want to work with legacy code, meaning no or minimal client side modifications. Inserting new data is pretty simple: dbShards using a “sharding key” to route an incoming tuple to the appropriate destination. Queries are a bit more complex, and here the website is skimpy on details. Monash’s post mentions that join performance is good when sharding keys are the same — this is not a surprise. I’m not interested in what other kinds of query optimizations are in place. When data is partitioned, you really need a sophisticated query planner and optimizer that can minimize data movement and aggregation, and push down as much computation as possible to individual nodes.

I found the page on replication intriguing. I’m guessing when they say “reliable replication”, they mean “consistent replication” in more common parlance (alternative, that dbShards supports strong consistency, as opposed to eventual or lazy consistency). This particular bit in the first paragraph caught my eye: “deliver high performance yet reliable multi-threaded replication for High Availability (HA)”. I’m not sure how to read this. Are they implying that multi-threaded replication is typically not performant? And usually you do NOT want threading for high availability, because a single thread can still take the entire process down. The actual mechanism for replication seems like a straightforward application of the replicated state machine approach.

But making a replicated state machine based system scale requires very careful engineering, otherwise it is easy to hit performance bottlenecks. I’d be very interested in knowing a bit more about the transaction model in dbShards and how it performs on larger systems (tens to hundreds of nodes).

The pricing model is also quite interesting. I think it is the first vendor I know of that is pricing on CPU and not storage (their pricing is $5,000 per year per server). I think this is indicative of the target customer segment as well — I would imagine dbShards works well with a few TBs of data on machines with a lot of CPU and memory.

July 21st, 2010

What is node.js?

: Image via Wikipedia

If you follow the world of Javascript and/or high-performance networking, you have probably heard of node.js. If you already grok Node, then this post is not for you; move along. If, however, you are a bit confused as to exactly what Node.js is and how it works, then you should read on.

The node.js website doesn’t mince words in describing the software: “Evented I/O for V8 JavaScript.” While that statement is precise and captures the essence of node.js succinctly, at first glance it did not tell me much about node.js. I did what anyone interested in node.js should do: downloaded the source and started playing around with it.

So what exactly is node.js? Well, first and foremost it is a Javascript runtime. Think of your web browser; how does it run Javascript? It implements a Javascript runtime and supports APIs that make sense in the browser such as DOM manipulation etc. Javascript as a language itself is fairly browser agnostic. So node.js is yet another runtime for Javascript, implemented primarily in C++.

Because node.js focuses on networking, it does not support the standard APIs available in a browser. Instead, it provides a different set of APIs (with fantastic documentation). Thus, for instance, HTTP support is built into node.js — it is not an external library.

The other salient feature of node.js is that it is event driven. If you are familiar with event driven programming (ala Python Twisted, Ruby’s Event Machine, the event loop in Qt etc), you know what I’m talking about. The key difference though is that unlike all these systems, you never explicitly invoke a blocking call to start the event loop — node.js automatically enters the event loop as soon as it has finished loaded the program. A corollary is that you can only write event driven programs in node.js, no other programming models are supported. Another consequence of this design choice is that node.js is single-threaded. To exploit CPU parallelism, you need to run multiple node.js instances. Of course, there are several node.js modules and projects already available to address this very issue.

To implement a runtime for Javascript, node.js first needs to parse the input Javascript. node.js leverages Google’s V8 Javascript engine to do this. V8 takes care of interpreting the Javascript so node.js need not worry about syntactical issues; it only need to implement the appropriate hooks and callbacks for V8.

node.js claims to be extremely memory efficient and scalable. This is possible because node.js does not expose any blocking APIs. As a result, the program is completely callback driven. Of course, any kind of I/O (disk or network) will eventually block. node.js does all blocking I/O in an internal thread pool — thus even though the application executes in a single thread, internally there are multiple threads that node.js manages.

Overall, node.js is very refreshing. The community seems great and there is a lot of buzz around the project right now, with some big companies like Yahoo starting to use experiment with node.js. node.js is also driving the “server side Javascript” movement. For instance, Joyent’s Smart platform allows you to write your server code in Javascript, which they can then execute on their hosted platforms.

Finally, no blog post about node.js is complete without an example of node.js code. Here is a simple web server:

[gist id=485001]

June 30th, 2010

Rupee symbol finalists

The ToI reported a few days ago that there are five finalists for the Indian Rupee symbol:

Rupee symbol finalists

I had previously written about the Rupee symbol here. The ToI seems to suggest that #4 is the likely candidate. According to the article, the symbol should have been decided by now, but I haven’t found any information to that effect. Does anyone know the final outcome? I can’t say I like any of the finalists that much, but I guess out of the five #4 is probably one of the better ones.

May 12th, 2010

How the mouse moves

Random interesting find of the day: IOGraphica. Here’s mine for about 7 hours at work:

Such a simple app, but such a fascinating output. An easy way to create computer generated art! Couple of observations:

I have a dual-monitor setup at work. I use the left monitor for email for browsing and the right monitor for code. The mouse patterns clearly reflect this usage pattern. I tend to rest the mouse roughly equally on the both the monitors.
I was very intrigued by the fact that most of the mouse motions are very smooth. Most curves almost look parabolic. There are very few jerks and jittery lines. Once again, nature seems poetic even in the most chaotic and random actions.