Floating Sun » Software
http://floatingsun.net
Mon, 07 Jan 2013 02:53:26 +0000

GaesteBin: a secure pastebin for Google App Engine
http://floatingsun.net/2012/10/22/gaestebin-a-secure-pastebin-for-google-app-engine/
Tue, 23 Oct 2012 04:21:55 +0000 | Diwaker Gupta

Related posts:
  1. Experiences with Google App Engine
  2. Introducing uBoggle!
  3. gooLego: Google’s software building blocks
TL;DR: gaestebin is a private, secure, open source pastebin for Google App Engine.

Pastebins are incredibly useful. But most public pastebins are not suitable for sharing within a company (think code fragments, log messages etc.), and most private pastebins are either ugly (hastebin excepted!), hard to set up and maintain, or forced to live behind the firewall (for security).

So I decided to scratch my itch and whipped up a pastebin that fit my needs. Features:

  • Usability: it’s a pastebin — you create a paste, you view a paste. That’s it. No need to specify the language — we guess it, thanks to highlight.js. It also looks good (I think), thanks to Solarized.
  • Ease of deployment: no installation, no maintenance. Just deploy to Google App Engine.
  • Security: you need a Google account to create a paste. No login is required to view a paste, though. Bonus: if you use Google Apps at your company, you can have a private pastebin for your company by restricting the app to your domain.
  • Open source: Fork it, baby!
Bitcasa: First Impressions
http://floatingsun.net/2012/01/24/bitcasa-first-impressions/
Wed, 25 Jan 2012 07:47:49 +0000 | Diwaker Gupta

Related posts:
  1. Galaxy Nexus: First Impressions
  2. Whats with __MACOSX in Zip files?
Bitcasa

I got my invite for the Bitcasa beta last week but only got around to installing it yesterday. I’ve only used it sparingly thus far. If you are in a hurry, here’s the TL;DR version:

  • Users might find the “cloudify” model confusing
  • Built using osxfuse (not to be confused with MacFUSE) and Qt
  • Infinite storage sounds too good to be true. What’s the catch?
  • Building trust with users will take time

Cloudification and Confusion

Here’s Bitcasa on what cloudify does:

When a folder is Cloudified, a corresponding virtual folder is created on the Bitcasa server and the contents of your local folder are copied up to the server. When Connected to the Bitcasa server, any changes or additions to the folder will live on the server. When not Connected to the Bitcasa server, any changes or addition to the folder will live locally.

Just think about that for a second. The “cloudify” model sounds great in principle, but it does add a lot of complexity in terms of how users interact with the system. For instance, when I’m offline and make changes to one of my cloudified folders, that change happens presumably locally. I would assume that when I come back online, these changes are synced back to Bitcasa ala Dropbox. But what if I accidentally disconnect a folder, make some changes and then reconnect — per the FAQ, the changes made locally won’t be synced.

The consumer cloud storage market is fairly mature now, and one can learn a lot by looking at how people have responded to other systems. This thread on Quora is particularly insightful: again and again, simplicity comes up as one of the key reasons behind Dropbox's success.

My prediction is that Bitcasa’s cloudify feature will be leveraged primarily by power users and the rest would end up using the default Bitcasa folder, Dropbox style.

Nuts and Bolts

Bitcasa seems to be built primarily using Qt. This isn’t a surprise: Qt is a mature, open source and cross-platform library.

$ otool -L Bitcasa
Bitcasa:
 /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 159.1.0)
 /usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.5)
 /usr/lib/libcrypto.0.9.8.dylib (compatibility version 0.9.8, current version 44.0.0)
 @executable_path/../Frameworks/libmacfuse_i64.2.dylib (compatibility version 10.0.0, current version 2.0.0)
 /usr/lib/libssl.0.9.8.dylib (compatibility version 0.9.8, current version 44.0.0)
 /System/Library/Frameworks/CoreServices.framework/Versions/A/CoreServices (compatibility version 1.0.0, current version 53.0.0)
 @executable_path/../Frameworks/QtWebKit.framework/Versions/4/QtWebKit (compatibility version 4.7.0, current version 4.7.4)
 @executable_path/../Frameworks/QtXml.framework/Versions/4/QtXml (compatibility version 4.7.0, current version 4.7.4)
 @executable_path/../Frameworks/QtGui.framework/Versions/4/QtGui (compatibility version 4.7.0, current version 4.7.4)
 @executable_path/../Frameworks/QtNetwork.framework/Versions/4/QtNetwork (compatibility version 4.7.0, current version 4.7.4)
 @executable_path/../Frameworks/QtCore.framework/Versions/4/QtCore (compatibility version 4.7.0, current version 4.7.4)
 /usr/lib/libstdc++.6.dylib (compatibility version 7.0.0, current version 52.0.0)
 /usr/lib/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1105.0.0)

$ mount
Sample Videos on /Users/diwaker/Bitcasa/Sample Videos (osxfusefs, nodev, nosuid, synchronous, mounted by diwaker)
TryBitcasa on /Users/diwaker/TryBitcasa (osxfusefs, nodev, nosuid, synchronous, mounted by diwaker)
TryBitcasaDedup on /Users/diwaker/TryBitcasaDedup (osxfusefs, nodev, nosuid, synchronous, mounted by diwaker)

Note further that Bitcasa represents "connected" folders as mount points over the existing folders, which is why changes made while a folder is disconnected won't propagate to Bitcasa's copy of that folder. They are using osxfuse, which implies that Bitcasa is intercepting file system calls; this is in contrast to Dropbox-like systems, which detect changes to the local filesystem asynchronously. I haven't compared fine-grained read/write performance just yet.

Here’s a snapshot of the Bitcasa Folders UI:

Bitcasa also does some deduplication. Uploading 100MB of mostly random data took around 4 minutes on a pretty fat pipe, which isn't bad at all. Copying that data back out took just as long, if not longer. A copy of the same folder took less than 10 seconds to cloudify!

Security

Much has been said about Bitcasa’s security. However, most of the articles are concerned with a specific dimension of security: encryption.

A detailed discussion of Bitcasa's security in general, and encryption in particular, deserves a post of its own. For now, suffice it to say that even after several years in production, Dropbox still hit some pretty nasty security snafus in 2011. Like a lot of you, I'm very concerned about security, especially with a service that is offering me infinite storage for free! It takes time to build trust with your users; there is no shortcut.

Overall, Bitcasa is definitely interesting. Dropbox was almost beginning to monopolize the consumer cloud storage market, so some good competition will hopefully benefit the end users in the long run.

Mac Tip: Get wifi password from another (connected) Mac
http://floatingsun.net/2011/09/10/mac-tip-get-wifi-password-from-another-connected-mac/
Sun, 11 Sep 2011 01:36:39 +0000 | Diwaker Gupta

Related posts:
  1. Web based password manager
  2. Web based password managers: 3 years later
  3. Screens around the web: password restrictions
Here’s the situation: you are at a friend’s place and, like all responsible hosts, they have a password-protected wifi network. Your friend is busy (or unavailable), so you can’t ask her for the password. Of course, you are known to not give up easily. You look around and realize: aha! Someone over there on the couch is busy with their laptop, so they must know the password. Unfortunately, they don’t remember it. But the password must be somewhere on their laptop, since they are connected after all. So how do you find it?

OK, that probably sounds contrived. But the truth is that I did have the need to extract the wifi password from my wife’s laptop earlier today and thought I’d share the (pretty simple) process.

Step one: open keychain access

Step two: search for the network name (SSID)

Step three: check ‘Show password’ (you may need to enter your password first, since this requires administrator privileges).

Voila!

The silent victories of open source
http://floatingsun.net/2011/03/27/the-silent-victories-of-open-source/
Mon, 28 Mar 2011 06:38:57 +0000 | Diwaker Gupta

Related posts:
  1. Linux and Closed Source Software
  2. Microsoft: Open source ‘not reliable or dependable’ | CNET News.com
  3. Enough with Linux as a second class citizen!
Tux, the Linux penguin
Image via Wikipedia

For years, advocates of free/libre/open source software (henceforth FLOSS) have proclaimed, year after year, that this is the year of Linux, or the year that open source goes mainstream, or the year that open source finally takes off. But it never has, at least in the traditional sense. Linux-based desktops haven’t penetrated either the enterprise or consumer markets, and with a few notable exceptions (Apache httpd, for instance), most FLOSS products, be it office software like OpenOffice or multimedia software such as Gimp and Inkscape, remain popular only within economically insignificant niches. And yet, this year, more than ever before, open source forges ahead with its silent victories.

Consider the following shifts:

  • All the top brands of the day (Apple, Google, Facebook, Twitter, Amazon) stand tall on the shoulders of FLOSS giants.
  • Contributing software back to the open source community is becoming increasingly common, even expected. Take a look at the GitHub repositories of Twitter and Facebook, or the various Google projects. In fact, when screening engineering candidates, I often look for and encourage people to talk about their open source contributions.
  • Most of the activity around “big data” and “cloud computing” is being driven in large part by FLOSS, whether it is the Hadoop-powered ecosystem or the Xen/Linux powered Amazon Web Services.
  • Given the current smartphone landscape, it is highly likely that Android will become ubiquitous on tablet devices and a variety of consumer smart phones. Already, Android has more search mindshare than Linux, despite the fact that Linux is part of the Android stack.
  • If you start a software company today, I would bet that you will find yourself bootstrapping almost entirely using open source software. The entire development process — from the GCC compiler toolchain, to the build systems, to the scripting languages, to the version control systems, to the code review systems, to the continuous integration systems — everything is dominated by FLOSS products. Good bug trackers and enterprise Wikis are the last bastions but it is just a matter of time.

I’ve had a chance to see the enterprise software market up close and increasingly find more and more open source everywhere I look. FLOSS has not arrived, it has taken over.

Startup Infrastructure: Where Linux Fails
http://floatingsun.net/2011/01/09/startup-infrastructure-where-linux-fails/
Sun, 09 Jan 2011 20:31:46 +0000 | Diwaker Gupta

Related posts:
  1. Skype adds video for Linux
  2. Linux and Closed Source Software
  3. Year of the Linux Desktop

It is no secret that I’m an open source evangelist and so when it was time to set up internal infrastructure at work, naturally the first order of business was to evaluate the various OSS projects out there — everything from wikis, bug trackers, source control, code review and project management. Running Ubuntu LTS (10.04) on all of our servers was a no-brainer and there were plenty of excellent options for most everything else as well (a follow-up post on our final choices later). The Linux ecosystem is fabulous for most of the infrastructure needs of a startup, but I learnt the hard way that there are still some areas where Linux needs a lot of work before it can become competitive with proprietary, non-Linux solutions.

Authentication

Centralized account management (users and groups) and authentication is a critical component of any IT deployment, no matter the size. Even for a small startup, creating users and groups by hand on each new server, and wiring up a separate authentication mechanism for each new service, simply does not scale. That is precisely why Active Directory is so ubiquitous in enterprises.

LDAP was the obvious solution in Linux-land, and I figured it would be trivial to set up an OpenLDAP server that could manage user/group information for us. It would also be the single authentication source for all servers and services. I was so wrong.

After struggling with OpenLDAP for several painful hours, I gave up — the documentation is fragmented, Google doesn’t help much and personally I think the LDAP creators had never heard of “usability” when designing it. The seemingly simple task of creating some new users and groups involved several black-magic incantations of the LDAP command line tools. Getting servers to authenticate against the resulting directory was even harder.
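To give a flavor of those incantations: adding even one user and one group means hand-writing an LDIF file and feeding it to the command line tools. The tree below, under a hypothetical dc=example,dc=com base, is purely illustrative, not our actual directory:

```ldif
# newuser.ldif -- illustrative entries only
dn: uid=jdoe,ou=People,dc=example,dc=com
objectClass: inetOrgPerson
objectClass: posixAccount
uid: jdoe
cn: Jane Doe
sn: Doe
uidNumber: 10001
gidNumber: 10001
homeDirectory: /home/jdoe

dn: cn=engineering,ou=Groups,dc=example,dc=com
objectClass: posixGroup
cn: engineering
gidNumber: 10001
memberUid: jdoe
```

This gets loaded with something like `ldapadd -x -D "cn=admin,dc=example,dc=com" -W -f newuser.ldif`, and that is before you get to slapd's ACLs and the PAM/NSS glue on each client.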

Just as I was about to throw in the towel and set up an AD instance in-house, I stumbled upon the 389 Directory Server (formerly the Fedora Directory Server). With new-found hope, I set about installing it on Ubuntu and hit another roadblock: there are no up-to-date packages of FDS for Ubuntu. Reluctantly, I set up a Fedora instance (the only one so far) and installed FDS. Thankfully, Red Hat has put together really comprehensive documentation and guides for the Directory Server, which proved invaluable.

From there on, it was mostly smooth sailing (only a few minor hiccups). Finally, we have a nice GUI to manage users and groups, and all servers and services authenticate against a single Directory Server. But the journey was unnecessarily painful. Here’s what I’d like to see:

  • Up-to-date packages of FDS for Ubuntu, with sane defaults and working functionality out of the box
  • Ready-to-consume documentation on how to integrate LDAP with various web applications, Linux distros etc. (I’ll put together some of this soon)
  • More awareness — I should have found FDS a lot sooner than I did, but it is certainly not very well marketed
  • Single sign on: This is a whole different beast

Remote Access

At my previous company, we had a Cisco VPN solution. There were plenty of Cisco compatible VPN clients on Windows and Mac. In fairness, it was relatively easy to get vpnc working on Ubuntu as well. In fact, with Network Manager, you can manage your VPN connections using a simple and intuitive UI. But the setup was not very reliable and my connections would get dropped relatively frequently. It was impossible to have a long-running VPN session without disruption. I’m not sure if the problem was with the Cisco hardware or the Ubuntu vpnc client; I did see similar issues with the built-in VPN client on Mac OS X.

But at least VPN on Linux works. I can’t say the same about other remote access mechanisms, in particular IPSec and L2TP over IPSec. It took me some time to figure out which package to use (Strongswan, Openswan, iked etc.); another couple of hours to get the Openswan configuration just right; and several hours of struggling to automatically set up DNS lookups when using the IPSec connection (I gave up and ended up using entries in /etc/hosts!). There is no UI in Network Manager to manage IPSec connections either. Strongswan does have an NM plugin, but it only works with IKEv2 (certificate-based authentication), while I had to use IKEv1 (shared-key authentication).
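For reference, the general shape of an Openswan IKEv1 shared-key setup is roughly the following. This is a sketch only: the connection name, gateway address and subnet are placeholders, and the proposals must match whatever the remote end expects:

```
# /etc/ipsec.conf (fragment)
conn office
    type=tunnel
    authby=secret            # IKEv1 with a pre-shared key
    left=%defaultroute
    right=vpn.example.com
    rightsubnet=10.0.0.0/8
    pfs=yes
    auto=start

# /etc/ipsec.secrets
%any vpn.example.com : PSK "shared-secret-goes-here"
```

Getting from this skeleton to a tunnel that actually passes traffic is where the hours went.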

At the end of the day, I do have a working IPSec tunnel and it is definitely more reliable than the Cisco VPN (been up for more than 2 days without disruption). But all this can and should become a lot more seamless.

These are a few areas where Linux failed me in setting up the infrastructure for a startup; it shines most everywhere else. Hopefully these last few kinks will get ironed out soon.

Reclaim the TV
http://floatingsun.net/2010/09/17/reclaim-the-tv/
Sat, 18 Sep 2010 00:07:18 +0000 | Diwaker Gupta

Do you:

  • wonder who watches all of the 700+ channels that you get from your cable provider?
  • wonder why you are paying more than $50/month for all those channels, when you only watch a handful?
  • find yourself channel surfing, just because you can, not because you know what you want to watch?
  • feel like you are not getting the most out of your TV?
  • feel like you can’t wait for Google TV or the Boxee box to improve your television experience?

If you answered yes to any of the above questions, I have a few tips for you. There have been times when I spent an hour flipping through channels, not really watching anything in particular, and later felt like an idiot for having wasted my time. All this while, I just accepted cable TV as a given, an inevitable companion of our internet service provider… until I met a few friends who were perfectly happy without any cable service.

With that inspiration, I started looking around for a better TV experience. There is so much content available online (and good content!) that I wasn’t really worried about lack of material to watch. Both Google TV and the Boxee box seemed very promising (not to mention, the new Apple TV). There were two little problems:

  1. Neither the Boxee box nor Google TV is here yet, and
  2. All of these devices have very limited functionality. They address the problem of Internet TV, but I don’t want to use my TV just to watch TV. I want it to become a hub for all my media. I want to access my local music, photos, videos and more.

So I did what any self-respecting geek would do — I built my own media center PC. It runs Ubuntu; it hosts all my media in a single location; it serves as media server and storage server; and it serves as a compute server when I have to transcode media. You can find all the gory details in this guide: HOWTO Build an Ubuntu based media center.

And just like that, I’m a GNOME user
http://floatingsun.net/2010/09/09/and-just-like-that-im-a-gnome-user/
Fri, 10 Sep 2010 06:21:59 +0000 | Diwaker Gupta

Related posts:
  1. Inconsistent font rendering in GNOME and KDE
  2. Priceless quote from Linus
  3. Why KDE?
When I first started using Linux (more than a decade ago), I did my share of playing around with various desktop environments: the classic FVWM, GNOME, KDE, Enlightenment etc. I settled down with KDE. Over the years, I kept coming back to GNOME to check it out but somehow KDE always felt home to me.

Well guess what, not any more. As of a few days ago, I’m (mostly) a GNOME user.

I still love KDE (the desktop) and KDE based applications (KMail, Amarok etc). It is still infinitely more configurable than anything comparable in GNOME (Evolution and Thunderbird are still fairly limited in comparison) and over the years I’ve tweaked it to just the way I like it. But GNOME has something the KDE project does not: Canonical.

That’s right, I switched to GNOME because of Canonical, the company that drives Ubuntu development. Sure, there is a lot of effort behind the various Ubuntu variants such as Kubuntu, Xubuntu etc. But make no mistake: none of these variants are first-class citizens in the Ubuntu ecosystem.

The switch was a result of my recent experience setting up Ubuntu on my home theater PC. The effort Canonical has put into making the Ubuntu experience seamless and pleasant is clearly visible. Pretty much everything works out of the box: folders that I share show up on other computers in my home network; bluetooth, webcam etc. all work just fine; setting up remote desktop is a breeze; Avahi/Bonjour works like a charm; and I can set up a DAAP server to share my music and have it show up in iTunes just like that.

Note that none of these things are limited to Ubuntu in any way. But the user experience in Ubuntu is unparalleled in comparison with Kubuntu etc. Subtle niceties like the notifications (the Ayatana project), the Me menu, the messaging menu and the “light” themes come together in a very cohesive way to deliver an experience that rivals that of Mac OS. But beyond the subtleties, Canonical is shaping the future of Linux on the desktop, laptop and mobile devices: the Unity interface, multi-touch support for mobile devices and more. Bottom line: having a company put its weight behind a desktop has ramifications.

So as much as I love thee, KDE, for now we shall part ways. I’m still using some KDE apps (like digiKam), but until Canonical decides to officially adopt Kubuntu, GNOME it is.

Toying with node.js
http://floatingsun.net/2010/08/04/toying-with-node-js/
Wed, 04 Aug 2010 17:58:34 +0000 | Diwaker Gupta

Related posts:
  1. What is node.js?
  2. Whats up with PageRank?
A commenter rightly complained that despite my claims of “playing around” with node.js, all I could come up with was the example from the man page. I replied saying that I did intend to post something I wrote from scratch, and as promised, here is my first toy node.js program:

var sys = require('sys');
var http = require('http');
var url = require('url');
var path = require('path');

function search() {
  var stdin = process.openStdin();
  stdin.setEncoding('utf8');
  stdin.on('data', function(term) {
    // Strip the trailing newline that arrives when the user hits 'Enter'.
    term = term.substring(0, term.length - 1);
    var google = http.createClient(80, 'ajax.googleapis.com');
    var search_url = "/ajax/services/search/web?v=1.0&q=" + term;
    var request = google.request('GET', search_url, {
      'host': 'ajax.googleapis.com',
      'Referer': 'http://floatingsun.net',
      'User-Agent': 'NodeJS HTTP client',
      'Accept': '*/*'});
    request.on('response', function(response) {
      response.setEncoding('utf8');
      // The body arrives in chunks; accumulate them until 'end' fires.
      var body = "";
      response.on('data', function(chunk) {
        body += chunk;
      });
      response.on('end', function() {
        // Print the URL of each search result.
        var searchResults = JSON.parse(body);
        var results = searchResults["responseData"]["results"];
        for (var i = 0; i < results.length; i++) {
          console.log(results[i]["url"]);
        }
      });
    });
    request.end();
  });
}

search();

This program (also available as a gist) reads in search terms on standard input, and does a Google search on those terms, printing the URLs of the search results.

I was quite surprised (and a bit embarrassed) at how long it took me to get this simple program working. For instance, it took me the better part of an hour to realize that when I read something from stdin, it includes the trailing newline (as the user hits ‘Enter’). Earlier, I was using the input as-is for the search term, and that was leading to a 404 error, because the resulting URL was malformed.

Debugging was also harder, as expected. Syntax errors are easily caught by V8, but everything else is still obscure. I’m sure some of the difficulty is because of my lack of expertise with Javascript. But at one point, I got this error:

events:12
        throw arguments[1];
                       ^
Error: Parse Error
    at Client.ondata (http:881:22)
    at IOWatcher.callback (net:517:29)
    at node.js:270:9

I still haven’t figured out exactly where that error was coming from. Nonetheless, it was an interesting exercise. I’m looking forward to writing some non-trivial code with node.js now.

Some thoughts on dbShards
http://floatingsun.net/2010/08/03/some-thoughts-on-dbshards/
Wed, 04 Aug 2010 06:20:52 +0000 | Diwaker Gupta

Related posts:
  1. Some thoughts on iCloud
  2. Thoughts on the Rupee symbol
  3. Experiences with Google App Engine
I heard about dbShards via two recent blog posts — one by Curt Monash and the other by Todd Hoff. It seemed like an interesting product, so I spent some time digging around on their website.

dbShards

As the name suggests, dbShards is all about sharding. Sharding, also known as partitioning, is the process of distributing a given dataset into smaller chunks based on some policy. AFAIK, the term “shard” was popularized recently by Google, even though the concept of partitioning is at least a few decades old. Most distributed data management systems implement some form of sharding out of necessity, since the entire data set will not fit in memory on a single node (if it would, you should not be using a distributed system). And therein lies the USP of dbShards: it brings sharding (and with it, performance and scalability) to commodity, single-node databases such as MySQL and Postgres.

So how does it work? Well, dbShards acts as a transparent layer sitting in front of multiple nodes running, let’s say, MySQL. Transparent, because they want to work with legacy code, meaning no or minimal client-side modifications. Inserting new data is pretty simple: dbShards uses a “sharding key” to route an incoming tuple to the appropriate destination. Queries are a bit more complex, and here the website is skimpy on details. Monash’s post mentions that join performance is good when the sharding keys are the same — this is no surprise. I’d be curious to know what other kinds of query optimizations are in place. When data is partitioned, you really need a sophisticated query planner and optimizer that can minimize data movement and aggregation, and push down as much computation as possible to individual nodes.

I found the page on replication intriguing. I’m guessing that when they say “reliable replication”, they mean “consistent replication” in more common parlance (alternatively, that dbShards supports strong consistency, as opposed to eventual or lazy consistency). This particular bit in the first paragraph caught my eye: “deliver high performance yet reliable multi-threaded replication for High Availability (HA)”. I’m not sure how to read this. Are they implying that multi-threaded replication is typically not performant? And usually you do NOT want threading for high availability, because a single thread can still take the entire process down. The actual mechanism for replication seems like a straightforward application of the replicated state machine approach.

But making a replicated state machine based system scale requires very careful engineering, otherwise it is easy to hit performance bottlenecks. I’d be very interested in knowing a bit more about the transaction model in dbShards and how it performs on larger systems (tens to hundreds of nodes).

The pricing model is also quite interesting. dbShards is the first vendor I know of that prices on compute rather than storage ($5,000 per server per year). I think this is indicative of the target customer segment as well: I would imagine dbShards works well with a few TBs of data on machines with a lot of CPU and memory.

What is node.js?
http://floatingsun.net/2010/07/21/what-is-node-js/
Wed, 21 Jul 2010 19:39:07 +0000 | Diwaker Gupta

Related posts:
  1. Toying with node.js
The logo of the Node.js Project
Image via Wikipedia

If you follow the world of Javascript and/or high-performance networking, you have probably heard of node.js. If you already grok Node, then this post is not for you; move along. If, however, you are a bit confused as to exactly what Node.js is and how it works, then you should read on.

The node.js website doesn’t mince words in describing the software: “Evented I/O for V8 JavaScript.” While that statement is precise and captures the essence of node.js succinctly, at first glance it did not tell me much about node.js. I did what anyone interested in node.js should do: downloaded the source and started playing around with it.

So what exactly is node.js? Well, first and foremost it is a Javascript runtime. Think of your web browser; how does it run Javascript? It implements a Javascript runtime and supports APIs that make sense in the browser such as DOM manipulation etc. Javascript as a language itself is fairly browser agnostic. So node.js is yet another runtime for Javascript, implemented primarily in C++.

Because node.js focuses on networking, it does not support the standard APIs available in a browser. Instead, it provides a different set of APIs (with fantastic documentation). Thus, for instance, HTTP support is built into node.js — it is not an external library.

The other salient feature of node.js is that it is event driven. If you are familiar with event-driven programming (à la Python’s Twisted, Ruby’s EventMachine, the event loop in Qt etc.), you know what I’m talking about. The key difference is that, unlike all these systems, you never explicitly invoke a blocking call to start the event loop: node.js automatically enters the event loop as soon as it has finished loading the program. A corollary is that you can only write event-driven programs in node.js; no other programming models are supported. Another consequence of this design choice is that node.js is single-threaded. To exploit CPU parallelism, you need to run multiple node.js instances. Of course, there are several node.js modules and projects already available to address this very issue.

To implement a runtime for Javascript, node.js first needs to parse the input Javascript. node.js leverages Google’s V8 Javascript engine for this. V8 takes care of interpreting the Javascript, so node.js need not worry about syntactical issues; it only needs to implement the appropriate hooks and callbacks for V8.

node.js claims to be extremely memory efficient and scalable. This is possible because node.js does not expose any blocking APIs. As a result, the program is completely callback driven. Of course, any kind of I/O (disk or network) will eventually block. node.js does all blocking I/O in an internal thread pool — thus even though the application executes in a single thread, internally there are multiple threads that node.js manages.

Overall, node.js is very refreshing. The community seems great and there is a lot of buzz around the project right now, with some big companies like Yahoo starting to experiment with node.js. node.js is also driving the “server side Javascript” movement. For instance, Joyent’s Smart platform allows you to write your server code in Javascript, which they then execute on their hosted platform.

Finally, no blog post about node.js is complete without an example of node.js code. Here is a simple web server:

[gist id=485001]