
Throwaway mailing lists

Have you ever found yourself on a mailing list, wanting to send an email to everyone on it except two members? A common case is planning a surprise for someone on the list.


In general, I find myself on long email threads involving a different subset of people for different occasions (birthdays, anniversaries, etc.) several times a year. These threads quickly become long and unwieldy. People keep adding others as the thread progresses, and the only way the newly added members can figure out what’s going on is from the emails that follow; there is no way for them to go back and read the discussion so far.

That got me thinking: wouldn’t it be great to have a service that provides throwaway mailing lists? Hear me out. Here’s how the service would work (a rough code sketch follows the list):

  • To start a new mailing list, I simply send an email to newlist@mycoolservice.com. In the email, I also include a list of email addresses I want to seed the list with.
  • The service sends me back the address of a newly created throwaway list. This could be of the form some-random-number@googlegroups.com.
  • For all practical purposes, this is exactly like any other mailing list (or Google Group). We can add more members, search the messages, and so on.
  • Start your discussion and let the thoughts flow.
  • At some point, the purpose behind the list will cease to exist (a successful surprise, for instance). Needless to say, further discussion on the topic will also cease.
  • You forget you even created the mailing list. After it has been idle for some time (say, two weeks), the service automatically deletes it. Any future messages to that address bounce back with a note saying the list has been deleted and to contact the administrator.
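
Here is a minimal sketch of what the core of such a service might look like. To be clear, this is my own illustration: mycoolservice.com, the in-memory dict, and smtp_send() are hypothetical stand-ins for a real mail pipeline and datastore.

    import email
    import secrets
    import time

    IDLE_TTL = 14 * 24 * 3600  # delete lists that have been idle for two weeks

    lists = {}  # list address -> {"members": set of addresses, "last_used": timestamp}

    def smtp_send(recipient, body):
        """Stand-in for handing a message to a real SMTP relay."""
        print(f"-> {recipient}: {body[:60]}")

    def handle_inbound(raw_message):
        """Route one inbound message: create a list, deliver, or bounce."""
        msg = email.message_from_string(raw_message)
        sender, to_addr = msg["From"], msg["To"]
        if to_addr == "newlist@mycoolservice.com":
            # Seed the new list from addresses found in the message body.
            members = {word.strip(",;") for word in msg.get_payload().split() if "@" in word}
            members.add(sender)
            addr = f"{secrets.token_hex(6)}@mycoolservice.com"
            lists[addr] = {"members": members, "last_used": time.time()}
            smtp_send(sender, f"Your throwaway list is ready: {addr}")
        elif to_addr in lists:
            # Normal delivery: fan out to all members and refresh the idle timer.
            lists[to_addr]["last_used"] = time.time()
            for member in lists[to_addr]["members"]:
                smtp_send(member, raw_message)
        else:
            smtp_send(sender, "This list has been deleted; please contact the administrator.")

    def reap_idle_lists():
        """Run periodically (say, from cron) to delete idle lists."""
        cutoff = time.time() - IDLE_TTL
        for addr in [a for a, info in lists.items() if info["last_used"] < cutoff]:
            del lists[addr]

A real implementation would obviously need persistent storage and some abuse protection, but conceptually the whole service really is this small.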

Does anyone else think this could be useful?

India Census 2011

Amidst all the chatter, billboards, news coverage and ads about the 2010 census here in the US, I heard a story on the radio about the upcoming census in India next year. Since I didn’t know much about the census process in India, and since this one seemed particularly newsworthy, I decided to learn more about it. Here’s what I found.

First, I was a bit surprised and disappointed that the top hit for a Google search for “india census 2011” does not point to an Indian government website for the census. Instead, most of the top hits are various news stories. An interesting aside: the same search carried out on google.co.in gives quite different results. In fact, I found it interesting that the censusindia.gov.in website is the top hit for the US search, but not for the India search.

Second, the census website is pretty shabby. Forget SEO, typography, and accessibility; the folks building these websites are still living in 1999. Someone needs to tell them about good web design. Design and usability issues apart, the content itself is lacking in many places, and even incorrect in some. Consider this page:

  • the breadcrumb-style navigation is only for appearances; the links don’t actually work
  • It says “Census 2001”!
  • the <title> tag says “Evolution of Data Processing”, but there’s no indication of that title on the page itself; instead, the page says “Preparation for the 2011 census.”
  • Best viewed in 1024×768? Wow.
  • None of the foreign-language links at the bottom of the page work. It would be trivial to link to a Google-translated page.
  • Neither the header image nor the header text links to the Census home page. This is really basic stuff. I wonder whether any of the developers building the website have actually used it.
  • Most of the links on the Census home page are “under construction.”

Anyway, I don’t want to waste this post on nitpicking. After all, the issues at stake here are considerably more serious than web design. Without further ado, here are the key facts:

  • This is the 15th census. The first one took place in 1872.
  • I couldn’t find any information about how the previous censuses were conducted, but this time around the Indian Government is aiming to record each and every one of the country’s 1.2 billion residents. That’s a really lofty goal.
  • This Census is also being used to piggyback another significant project: the construction of the National Population Register (NPR). The NPR will contain some key stats about everyone in India, and it will eventually be used to generate biometric IDs for all Indians.
  • For the first time in Census history, some NGOs will be involved in the training phase.
  • A lot of geospatial technologies are in play; this was the only article I could find.
  • Another first: this Census will be collecting some data on basic sanitation, drinking water etc.
  • The 2001 census employed roughly 2 million “enumerators” (data collection personnel). This census will undoubtedly require more.

The FAQ has several interesting nuggets:

The information collected about individuals is kept absolutely confidential. In fact this information is not accessible even to Courts of law.

Once this database has been created, biometrics such as photograph, 10 fingerprints and probably Iris information will be added for all persons aged 15 years and above.

The issue of Cards will be done in Coastal Villages to start with. After this the coastal Towns will be covered and so on till the entire country is covered.

35 questions relating to Building material, Use of Houses, Drinking water, Availability and type of latrines, Electricity, possession of assets etc. will be canvassed.

Like I said earlier, these are lofty goals and this is a massive effort. But there’s so little information out there, and I have a lot of questions and concerns. For one, the construction of the NPR poses a huge privacy risk. How is the Government going to ensure that corrupt officials don’t sell the data to whoever pays the right price? What audit policies and procedures are in place so that a neutral third party can attest to the integrity of the entire process? Who within the government will be able to look up my data? And if the Government is going to store the biometric data as well, aren’t they constructing a gold mine of identity-theft ammunition?

Digging around for more information, I found this fascinating presentation (linked from the United Nations Statistics Division). The PPT claims it was prepared by someone within the Census organization, but I can’t vouch for its authenticity.

I really hope that a lot more information on the technology behind the Census comes out in the coming months. The news channels in India should take some time out from reporting on tabloid/gossip issues and focus some energy on something that will impact every single Indian.

Big Data Analytics

DISCLAIMER: As with all other material on this blog, these are my thoughts and do NOT reflect the opinions of my employer.

I really like the tagline on our logo: big data. fast insights.

But leaving the marketing aside, what does it mean really? What is all the hoopla about big data analytics?

The way I see it, there are a few key observations here:

  1. Data is growing. This is almost self-evident, so I won’t bother presenting any evidence.
  2. Data is driving businesses more than ever. Whether it is search, advertising, insurance, finance, health care, governance — data is becoming an integral part of more and more business processes.
  3. Finally, data movement is slow. And I mean really, really slow compared to our processing and memory speeds. Once you get into the range of hundreds of terabytes or petabytes of data, you really don’t want to keep moving it into isolated silos for analytics. (A quick back-of-envelope calculation follows this list.)
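
To put a rough number on observation 3, here is my own back-of-envelope calculation, assuming a dedicated 10 Gb/s link and ignoring protocol overhead entirely:

    # How long does it take to move 100 TB over a dedicated 10 Gb/s link?
    data_bytes = 100 * 10**12             # 100 TB
    link_bytes_per_sec = 10 * 10**9 / 8   # 10 Gb/s = 1.25 GB/s
    hours = data_bytes / link_bytes_per_sec / 3600
    print(f"{hours:.0f} hours")           # ~22 hours, and that's the best case

Almost a full day to copy 100 TB under ideal conditions; a petabyte takes well over a week.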

Clearly, none of these observations is particularly new or insightful. However, I do think some of their implications are quite powerful, and they were new at least to me. For instance, (3) implies that once you have accumulated a lot of data in one place (imagine hundreds of TB or more), it is extremely difficult and time-consuming to move it around. This, in turn, means that more often than not, data is likely to stay in a single place.

Traditionally, it was not uncommon to have a large data warehouse serve as the repository of all data, with smaller data sets carved out of this master data set (so-called data marts) as required. This approach is becoming increasingly infeasible: carving out 100 TB data marts from a 1 PB data warehouse is simply not going to scale.

At the same time, it is clear that a one-size-fits-all approach to data storage and analysis is not practical either. Some data sets naturally lend themselves to a relational data model, while others are better suited to unstructured processing (Hadoop), document-oriented processing (CouchDB or MarkLogic), graph analysis (Neo4j), and so on. Forcing a single model or access mechanism down all customers’ throats is not tenable.

So what would the ideal platform for big data analytics look like? One that allows you to store and access data in various ways, seamlessly.
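
To make that concrete, here is a toy sketch of the idea, purely my own illustration (the engine objects are hypothetical stand-ins, not real client libraries): one platform that routes each workload to the engine whose data model fits it, over the same stored data.

    class AnalyticsPlatform:
        """Route each workload to the engine whose data model fits it."""

        def __init__(self):
            self._engines = {}  # model name -> engine object exposing .run()

        def register(self, model, engine):
            # e.g. "relational", "document", "graph", "batch"
            self._engines[model] = engine

        def query(self, model, request):
            engine = self._engines.get(model)
            if engine is None:
                raise ValueError(f"no engine for data model {model!r}")
            # All engines read the same underlying storage; only the
            # processing model differs, so the data itself never moves.
            return engine.run(request)

The point is not the dispatch code, of course, but the layout underneath it: one copy of the data, many processing models on top.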


My experiences with Apple: A poem


I’m a Linux guy; Windows was never my thing honey
Apple seemed interesting, but required too much money

I have ideological problems with Apple too,
What with all the DRM and hardware lock-in they do.

But people are crazy about Apple, and I used to wonder why,
I had a dream: to own Apple products that I didn’t have to buy.

A few months back my wife gifted me an iPhone, bro!
And then at work I got the new Macbook Pro!!

Thus suddenly I was an Apple user,
Sure, some people called me a sore loser.

Allow me to share my early experiences,
Some accolades and some grievances.

I’ll try to keep a neutral tone,
Shall focus on the Mac and not the iPhone.

Integration, integration, integration!
The attention to detail gives a wonderful sensation.

User experience is the key,
Excellent design is for all to see.

They’ve taken care of the enterprises,
Exchange support, Google integration — no surprises.

It’s by far the best laptop I’ve ever used,
The hardware is slick, the software is smooth.


But boy do I hate iTunes,
It’s so broken it should be called Looney Tunes.

Try connecting multiple iPhones to the same device,
Or plug your iPhone in another laptop (poor advice).

Sync is threatening, sounds like a bully.
“I shall sync or destroy”, that just sounds silly.

The Terminal app should aspire higher,
No 256-color support leaves much to desire.

Keyboard shortcuts are hard to find,
Change them? you must be out of your mind!

“Features” like “Spaces” are overrated,
More like awaited, belated and deflated.

I prefer iTerm over Terminal and Adium for chat,
Chrome over Safari, and this over that.

I’m certainly not blown away,
But a Mac is convenient, I have to say.


Waf: a pleasant build system

A good software project must have a good build system. Unless you have a small code base consisting entirely of dynamic, scripted languages, you probably need to “build” your code before you can use it. Until about a year ago, the only build tool I used and was familiar with was GNU Make. Make and the autotools family of tools have served the developer community well for the past few decades.

But the Make model is rife with problems. Here are a few of them:

  • Make requires the use of its own domain-specific language, which is, in general, not a good idea. Have you looked at any sizable project’s Makefile lately? It’s hard to understand, and harder to modify.
  • In the same vein, autoconf/automake are notoriously hard to use. Bear in mind that these tools are supposed to make your life easier.
  • Makefiles are so hard to write and extend that several popular build systems today are essentially Makefile generators. A good example is CMake.
  • Make relies heavily on file timestamps to detect changes, which is fragile: clock skew or a stray touch triggers spurious rebuilds, while a change that preserves the timestamp goes unnoticed.
  • Make is slow.
  • Makefiles are not modular. Recursive Make is especially evil.

I recently began work on a new pet project. As is usually the case, I spent far more time figuring out which tools and libraries to use than actually writing any code for the project :) Part of the investigation was surveying the state of the art in build systems. At work, we had started using SCons for most of our builds, which was already a huge improvement over Make. But SCons has its own set of issues.

One of the nicest features of SCons is that build files are regular Python files. This provides enormous flexibility and immediate familiarity. Unfortunately, the SCons documentation leaves much to be desired, and I still don’t understand the SCons execution model very well. For instance, I know how to extend SCons to support cross-compilation for multiple platforms, but I don’t really understand why those modifications work; quite a bit of black magic goes on behind the scenes. As a concrete example, there are several magic variables, such as _LIBDIRFLAGS, that have strange powers.
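
For contrast, here is the kind of SConstruct file I mean. Environment and Program are standard SCons API; the flags and file names are just placeholders of my own:

    # SConstruct -- SCons build files are ordinary Python scripts.
    env = Environment(CCFLAGS=['-O2', '-Wall'])
    env.Program(target='hello', source=['main.c'])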


After some more looking around, I discovered Waf. And now that I’ve played around with it a little bit, I’m happy to say that it is the most pleasant build system I’ve ever used. Things I really like about Waf:

  • The execution model just makes sense to me. You typically build a project in phases: a configure phase, to sort out dependencies, tools, and so on; the actual build phase; and an install phase. It is not uncommon to have a ‘dist’ step as well, to prepare the source for distribution. Waf treats these operations as first-class entities; there is a very strong notion of workflow built into it. (A minimal example follows this list.)
  • Comprehensive documentation. Check out the Waf book and the wiki.
  • Waf has a very strong task model, with a much stronger notion of dependencies (powered by content hashes, not timestamps). Waf also ensures that all generated files end up in a separate “build” directory, so your source tree always remains clean.
  • Using Waf is a breeze: there are no big dependencies, no packages to install, no bloated software to include with your code. Just a single ~80 KB script.
  • Progress indication and colored output are built in, not an afterthought. Like SCons, Waf build files are regular Python files.
  • Waf is fast. Faster than SCons.
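
Here is a minimal wscript to show the phases in action. One caveat: the Waf API varies a bit between versions, so treat this as a sketch in the style of recent releases, and main.c is a made-up source file.

    # wscript -- Waf build scripts are plain Python.
    APPNAME = 'hello'
    VERSION = '0.1'

    top = '.'       # where the source tree starts
    out = 'build'   # all generated files land here, keeping the source tree clean

    def configure(conf):
        # The configure phase: detect a C compiler and cache the results.
        conf.load('compiler_c')

    def build(bld):
        # The build phase: declare targets; Waf re-runs only the tasks
        # whose inputs' content hashes have changed.
        bld.program(source='main.c', target='hello')

With this in place, ./waf configure runs the checks once, ./waf build (or just ./waf) builds incrementally, and ./waf install and ./waf dist cover the remaining phases.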

Of course, Waf is not perfect. Coming from the Make/SCons world, I sorely miss the ability to build specific targets; yes, there are ways to achieve this in Waf, but they are all clumsy. And the API documentation (and the source itself) is a bit hard to parse.