Category: Featured

August 3rd, 2010

Some thoughts on dbShards

I heard about dbShards via two recent blog posts — one by Curt Monash and the other by Todd Hoff. It seemed like an interesting product, so I spent some time digging around on their website.

As the name suggests, dbShards is all about sharding. Sharding, also known as partitioning, is the process of distributing a given dataset into smaller chunks based on some policy. AFAIK, the term “shard” was popularized recently by Google even though the concept of partitioning is at least a few decades old. Most distributed data management systems implement some form of sharding by necessity, since the entire data set will not fit in memory on a single node (if it would, you should not be using a distributed system). And therein lies the USP of dbShards: it brings sharding (and with it, performance and scalability) to commodity, single-node databases such as MySQL and Postgres.

So how does it work? Well, dbShards acts as a transparent layer sitting in front of multiple nodes running, let’s say, MySQL. Transparent, because they want to work with legacy code, meaning no or minimal client-side modifications. Inserting new data is pretty simple: dbShards uses a “sharding key” to route an incoming tuple to the appropriate destination. Queries are a bit more complex, and here the website is skimpy on details. Monash’s post mentions that join performance is good when sharding keys are the same; this is not a surprise. I’m more interested in what other kinds of query optimizations are in place. When data is partitioned, you really need a sophisticated query planner and optimizer that can minimize data movement and aggregation, and push down as much computation as possible to individual nodes.
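To make the routing idea concrete, here is a minimal sketch of hash-based routing on a sharding key. This is my own illustration and not how dbShards actually does it; the node names, the hash function and the `routeTuple` helper are all assumptions.

```javascript
// Hypothetical list of backend nodes; a real setup would hold connection details.
const SHARDS = ['mysql-node-0', 'mysql-node-1', 'mysql-node-2', 'mysql-node-3'];

// Simple deterministic string hash (a djb2 variant), kept in unsigned 32-bit range.
function hashKey(key) {
  let hash = 5381;
  for (const ch of String(key)) {
    hash = ((hash * 33) ^ ch.charCodeAt(0)) >>> 0;
  }
  return hash;
}

// Pick the destination node for a tuple from the value of its sharding key.
function routeTuple(tuple, shardingKey) {
  const index = hashKey(tuple[shardingKey]) % SHARDS.length;
  return SHARDS[index];
}

// Rows from two tables that share a sharding key value land on the same node,
// which is why a join on that key needs no cross-node data movement.
const orderRow = { customer_id: 42, total: 99.5 };
const customerRow = { customer_id: 42, name: 'Ada' };
console.log(routeTuple(orderRow, 'customer_id'), routeTuple(customerRow, 'customer_id'));
```

A plain modulo like this reshuffles most keys whenever the number of nodes changes, so a real system would likely use something smarter (range tables, consistent hashing or a lookup directory).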

I found the page on replication intriguing. I’m guessing when they say “reliable replication”, they mean “consistent replication” in more common parlance (alternatively, that dbShards supports strong consistency, as opposed to eventual or lazy consistency). This particular bit in the first paragraph caught my eye: “deliver high performance yet reliable multi-threaded replication for High Availability (HA)”. I’m not sure how to read this. Are they implying that multi-threaded replication is typically not performant? And usually you do NOT want threading for high availability, because a single thread can still take the entire process down. The actual mechanism for replication seems like a straightforward application of the replicated state machine approach.

But making a replicated state machine based system scale requires very careful engineering, otherwise it is easy to hit performance bottlenecks. I’d be very interested in knowing a bit more about the transaction model in dbShards and how it performs on larger systems (tens to hundreds of nodes).
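For readers who have not met the replicated state machine idea, here is a toy sketch. It is purely illustrative and says nothing about dbShards internals; the `Replica` class, the command format and `catchUp` are names I made up. The point is that deterministic replicas applying the same commands in the same order end up in the same state.

```javascript
// A replica is a deterministic state machine over a key/value map.
class Replica {
  constructor() {
    this.state = new Map();
    this.applied = 0; // number of log entries applied so far
  }

  apply(command) {
    if (command.op === 'put') {
      this.state.set(command.key, command.value);
    } else if (command.op === 'delete') {
      this.state.delete(command.key);
    }
    this.applied += 1;
  }
}

// Bring a replica up to date by applying the log entries it has not seen yet.
function catchUp(replica, log) {
  while (replica.applied < log.length) {
    replica.apply(log[replica.applied]);
  }
}

// The shared, totally ordered command log. Getting every replica to agree on
// this order is the hard (and expensive) part in practice; here it is an array.
const commandLog = [
  { op: 'put', key: 'a', value: 1 },
  { op: 'put', key: 'b', value: 2 },
  { op: 'delete', key: 'a' },
];

const primary = new Replica();
const backup = new Replica();
catchUp(primary, commandLog);
catchUp(backup, commandLog);
console.log([...primary.state], [...backup.state]);
```

Every write has to go through that single agreed-upon order, which is where the scaling bottlenecks tend to show up.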

The pricing model is also quite interesting. I think it is the first vendor I know of that is pricing on CPU and not storage (their pricing is $5,000 per year per server). I think this is indicative of the target customer segment as well — I would imagine dbShards works well with a few TBs of data on machines with a lot of CPU and memory.

July 27th, 2010

How to buy a new car

A few weeks ago we were in the market for a new car. Now, I like to think of myself as a cautious buyer: I like to do my research, I’m not much of an impulse shopper and I’m generally suspicious of sales people. A new car is a significant investment; naturally I felt extra prudent. Of course, all my friends kept wondering why I was making such a big deal: you go into a dealership, pick up the car, do the paperwork and walk out, as simple as that. I say good for them! But I sleep more peacefully knowing that I had my bases covered.

Toy Car — Courtesy http://www.flickr.com/photos/ericrobinson/

A quick Google search on “how to buy a new car” led me straight to the very comprehensive CarBuyingTips.com. It is probably a great resource for many people. But after spending a few hours clicking through the numerous links on there, I felt exhausted. There was way too much (redundant) information, badly organized and overall just not very easy to consume. As they say, hindsight is 20/20. So, with the experience of having just purchased a new vehicle, here is my attempt at a concise, five-step guide to buying a new car.

Figure out what you want: You should know exactly what make and model you want, down to the last detail. This includes the interior color, upholstery and exterior color, as well as any other options and accessories. The more precise you are in what you want, the better off you will be. My first impression while researching new cars was that I could get whatever configuration I wanted: if the dealership doesn’t have it in stock, they’ll simply order it in. Unfortunately, most dealers will only work with what stock they have. So checking for availability is critical. Go ahead and schedule those test drives, but let the dealers know upfront that you are not looking to buy just yet. The dealers will ask for your contact information though, so be prepared for a barrage of emails and phone calls from them until you’ve made your purchase.
Get the numbers: Once you have identified the configuration you want, find your car on Edmunds.com. Edmunds will give you the invoice price of your car. Go ahead and add all the options and select the colors to get a final estimated invoice price. The more informed you are, the better your chances are when negotiating with dealers and making an informed decision.
Get a quote from CarsDirect: Buying a car online these days is not only possible, but highly recommended. You save the hassle of driving to dealerships, wasting time over the phone etc. Start your hunt for the best price by getting a quote from carsdirect.com. They partner with local dealerships and have very competitive pricing. My experience with CarsDirect was fantastic and I’d have definitely bought a car from them had a local dealer not given me a much better deal.
Get quotes from local dealers: Open a spreadsheet, fire up your browser and start calling your local dealerships. Ask for the new car sales department and let them know exactly what configuration you are looking for. Ask them for price and availability. Always ask for the out-the-door price, including taxes and rebates. This way there will be fewer surprises on the final bill. Make a point to let them know that you are talking to other dealers. Jot down each dealer’s quote in the spreadsheet (add the CarsDirect quote here as well). This process can take some time because you may not be able to reach them on the first attempt and there might be some back and forth while they get back to you with details. I recommend setting aside 2 slots of 2 hrs each for these phone calls.
Decide and Buy: Once you have all competing quotes, you can make your decision. The final decision will probably depend not just on the price, but other factors such as availability, location of the dealership, your experience with the dealership etc. If you finance your car, most car companies typically have their own financing arm which usually provides great APRs. If not, talk to your bank. For the final paperwork, you should make a visit to the dealership. Be sure to read the fine print and know exactly what service (if any) the dealer will provide above and beyond the warranty and services provided by the manufacturer.

That’s pretty much it! I read a lot of horror stories online about swindling and cheating in dealerships. My personal experience with at least the Toyota dealers in the Bay Area was pretty good. Most of them were very straightforward and to the point. They did not want to waste their time or mine, and did not try to pressure or hoodwink me into a bad deal. You might also find this guide useful.

July 21st, 2010

What is node.js?


If you follow the world of Javascript and/or high-performance networking, you have probably heard of node.js. If you already grok Node, then this post is not for you; move along. If, however, you are a bit confused as to exactly what Node.js is and how it works, then you should read on.

The node.js website doesn’t mince words in describing the software: “Evented I/O for V8 JavaScript.” While that statement is precise and captures the essence of node.js succinctly, at first glance it did not tell me much about node.js. I did what anyone interested in node.js should do: downloaded the source and started playing around with it.

So what exactly is node.js? Well, first and foremost it is a Javascript runtime. Think of your web browser; how does it run Javascript? It implements a Javascript runtime and supports APIs that make sense in the browser such as DOM manipulation etc. Javascript as a language itself is fairly browser agnostic. So node.js is yet another runtime for Javascript, implemented primarily in C++.

Because node.js focuses on networking, it does not support the standard APIs available in a browser. Instead, it provides a different set of APIs (with fantastic documentation). Thus, for instance, HTTP support is built into node.js — it is not an external library.

The other salient feature of node.js is that it is event driven. If you are familiar with event driven programming (à la Python Twisted, Ruby’s Event Machine, the event loop in Qt etc), you know what I’m talking about. The key difference though is that unlike all these systems, you never explicitly invoke a blocking call to start the event loop: node.js automatically enters the event loop as soon as it has finished loading the program. A corollary is that you can only write event driven programs in node.js; no other programming models are supported. Another consequence of this design choice is that node.js is single-threaded. To exploit CPU parallelism, you need to run multiple node.js instances. Of course, there are several node.js modules and projects already available to address this very issue.
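A tiny sketch of what “automatically enters the event loop” means in practice. The `timeline` array is just my own bookkeeping for the demonstration; `setTimeout` is the standard timer function.

```javascript
const timeline = [];

// Scheduling a callback only registers it with the event loop; nothing runs yet.
setTimeout(() => {
  timeline.push('timer callback');
  console.log(timeline.join(' -> '));
}, 0);

// This line runs before the callback above, even with a 0 ms delay.
timeline.push('end of script');

// There is no explicit "run the loop" call anywhere. Once the last statement
// has executed, node.js enters the event loop on its own, fires the pending
// callback, and exits when no work remains.
```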

To implement a runtime for Javascript, node.js first needs to parse the input Javascript. node.js leverages Google’s V8 Javascript engine to do this. V8 takes care of interpreting the Javascript so node.js need not worry about syntactical issues; it only needs to implement the appropriate hooks and callbacks for V8.

node.js claims to be extremely memory efficient and scalable. This is possible because node.js does not expose any blocking APIs. As a result, the program is completely callback driven. Of course, any kind of I/O (disk or network) will eventually block. node.js does all blocking I/O in an internal thread pool — thus even though the application executes in a single thread, internally there are multiple threads that node.js manages.
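Here is a rough sketch of what the callback style looks like for disk I/O, using `fs.writeFile` and `fs.readFile` from the built-in `fs` module. The scratch file name and the `events` array are my own choices for the demonstration; as far as I understand it, the actual file operations are carried out off the main thread and only the callbacks run on it.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical scratch file, used only for this demonstration.
const scratch = path.join(os.tmpdir(), `node-io-demo-${process.pid}.txt`);
const events = [];

// Both calls return immediately; the callbacks run later, on the single
// application thread, once the underlying I/O has completed.
fs.writeFile(scratch, 'hello from disk\n', (writeErr) => {
  if (writeErr) throw writeErr;
  events.push('write finished');

  fs.readFile(scratch, 'utf8', (readErr, data) => {
    if (readErr) throw readErr;
    events.push('read finished: ' + data.trim());
    console.log(events.join('\n'));
  });
});

// Recorded first, because the script never waits for the disk.
events.push('main script done');
```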

Overall, node.js is very refreshing. The community seems great and there is a lot of buzz around the project right now, with some big companies like Yahoo starting to experiment with node.js. node.js is also driving the “server side Javascript” movement. For instance, Joyent’s Smart platform allows you to write your server code in Javascript, which they can then execute on their hosted platforms.

Finally, no blog post about node.js is complete without an example of node.js code. Here is a simple web server:

[gist id=485001]
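In case the gist does not show up for you, here is a minimal stand-in using the built-in `http` module. It is not necessarily identical to the gist; the function names and the port number are my own choices.

```javascript
const http = require('http');

// node.js calls this function once for every incoming HTTP request.
function handleRequest(request, response) {
  response.writeHead(200, { 'Content-Type': 'text/plain' });
  response.end('Hello World\n');
}

// listen() does not block: it registers interest in new connections and
// returns. The event loop then keeps the process alive while the server is open.
function startServer(port) {
  const server = http.createServer(handleRequest);
  server.listen(port, '127.0.0.1');
  return server;
}

// To try it out, uncomment the next line and open http://127.0.0.1:8124/
// startServer(8124);
```

Note that there is no loop waiting for connections anywhere in the program; the handler is simply invoked whenever a request arrives.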

May 19th, 2010

Thoughts on the Rupee symbol

This post has been sitting around in my “drafts” for more than a year now. I just figured I would get it out of the way — better late than never.

In March 2009, the Indian government (specifically, the Finance Ministry) announced a contest to design a symbol for the Indian Rupee. Sometime in April, the Ministry put out a press release listing all the eligible applicants; there were around 2300 eligible candidates it seems.

At some point after that, images of a few of the designs started surfacing:

As is the case with most Indian Government websites, the Finance Ministry website is a disaster. There is very little useful information there, and there is no way to search for information. Case in point: I was not able to find any information about the design contest on their website. The image above is the result of a Google search.

A couple of thoughts on the designs above (note that I do not know if these are even actual candidates; I’m assuming they are):

It isn’t entirely clear to me why we need a symbol in the first place. Sure, writing ‘$’ is probably nicer than writing ‘USD’, but ‘Rs’ isn’t all that bad.
Any symbol for a currency should be really simple to draw. Simple. You should be able to draw it by hand in a few strokes. How many of the above designs do you find that simple?
Even if we pick a symbol, for it to actually start getting used, it has to be readily available on all computing platforms. Does Unicode have a provision for adding new symbols?

In December 2009, the Economic Times reported that the Ministry had shortlisted five finalists. Really? Wow. Again, no information to be found from the Ministry itself. It would have been pretty amazing (and easy to set up) if the Ministry had set up a public poll and asked Indian citizens which symbol they liked best.

Does anyone know what happened to the Indian rupee symbol design contest? I couldn’t find anything on Google after December 2009.

April 4th, 2010

India Census 2011

Amidst all the chatter, billboards, news coverage and ads about the 2010 census here in the US, I heard a story on the radio about the upcoming census in India next year. Since I didn’t know much about the census process in India and since this particular census seemed particularly newsworthy, I decided to learn more about it. Here’s what I found.

First, I was a bit surprised and disappointed that the top hit for a Google search for “india census 2011” does not point to an Indian government website for the census. Instead, most of the top hits are from various news stories. An interesting aside here: check out the same search carried out on google.co.in; the results are quite different. In fact, I found it quite interesting that the censusindia.gov website is the top hit for the US search, but not in the India search.

Second, the census website is pretty shabby. Forget about SEO, typography, accessibility — the folks who are building these websites are still living in 1999. Someone needs to tell them about good web design. Design and usability issues apart, the content itself is lacking in many places; even incorrect in some places. Consider this page:

The breadcrumb style navigation is only for appearances; the links don’t really work.
It says “Census 2001”!!!!
The <title> element says “Evolution of Data Processing”. There’s no indication of the title on the page itself. Instead, it says “Preparation for the 2011 census.”
Best viewed in 1024×768? Wow.
None of the foreign language links at the bottom of the page work. It would be trivial to link to a Google-translated page.
Neither the header image nor the header text links to the Census home page. This is really basic stuff. I wonder if any of the developers building the website have actually used it.
Most of the links on the Census home page are “under construction.”

Anyways, I don’t want to waste this post on nitpicking. After all, the issues at stake here are considerably more serious than web design. Without further ado, here are the key facts:

This is the 15th census. The first one took place in 1872.
I couldn’t get any information about how the previous censuses were conducted, but this time around the Indian Government is aiming to record each and every one of the 1.2 billion individuals. That’s a really lofty goal.
This Census is also being used to piggyback another significant project: the construction of a National Population Register (NPR). The NPR will contain some key stats about everyone in India. The NPR will also be used to generate biometric IDs for all Indians at some point.
For the first time in Census history, some NGOs will be involved in the training phase.
A lot of geospatial technologies are in play. The only article I could find was this.
Another first: this Census will be collecting some data on basic sanitation, drinking water etc.
The 2001 census employed roughly 2 million “enumerators” (data collection personnel). This census will undoubtedly require more.

The FAQ has several interesting nuggets:

The information collected about individuals is kept absolutely confidential. In fact this information is not accessible even to Courts of law.

Once this database has been created, biometrics such as photograph, 10 fingerprints and probably Iris information will be added for all persons aged 15 years and above.

The issue of Cards will be done in Coastal Villages to start with. After this the coastal Towns will be covered and so on till the entire country is covered.

35 questions relating to Building material, Use of Houses, Drinking water, Availability and type of latrines, Electricity, possession of assets etc. will be canvassed.

Like I said earlier, these are lofty goals and this is a massive effort. But there’s so little information out there. I have a lot of questions and concerns. For one, the construction of the NPR poses a huge privacy risk. How is the Government going to ensure that corrupt officials don’t sell out the data to whoever pays the right price? What kind of audit policies and procedures are in place such that a neutral third-party can attest to the integrity of the entire process? Who all within the government will be able to look up my data? If the Government is going to store the biometric data as well, aren’t they constructing a gold-mine of identity-theft ammunition?

Digging around for more information, I found this fascinating presentation (linked off of the United Nations Statistics Division). The PPT claims that it was prepared by someone from within the Census organization, but I can’t vouch for its authenticity.

India Census Data Processing


I really hope that a lot more information on the technology behind the Census comes out in the coming months. The news channels in India should take some time out from reporting on tabloid/gossip issues and focus some energy on something that will impact every single Indian.