Random interesting find of the day: IOGraphica. Here’s mine for about 7 hours at work:
Such a simple app, but such a fascinating output. An easy way to create computer-generated art! A couple of observations:
I have a dual-monitor setup at work. I use the left monitor for email and browsing and the right monitor for code. The mouse patterns clearly reflect this usage. I tend to rest the mouse roughly equally on both monitors.
I was intrigued to see that most of the mouse motions are very smooth. Most curves look almost parabolic; there are very few jerks or jittery lines. Once again, nature seems poetic even in the most chaotic and random actions.
DISCLAIMER: As with all other material on this blog, these are my thoughts and do NOT reflect the opinions of my employer.
I really like the tagline on our logo: big data. fast insights.
But leaving the marketing aside, what does it mean really? What is all the hoopla about big data analytics?
The way I look at things, a few key observations here are:
Data is increasing. This is almost self-evident, so I won’t bother with presenting any evidence.
Data is driving businesses more than ever. Whether it is search, advertising, insurance, finance, health care, governance — data is becoming an integral part of more and more business processes.
Finally, data movement is slow. And I mean really, really slow compared to our processing and memory speeds. Once you get into the range of hundreds of terabytes or petabytes of data, you really don’t want to keep moving that data around into isolated silos for doing analytics.
Clearly, none of these observations is particularly new or insightful. However, I do think some of the implications of these observations are quite powerful and were new, at least to me. For instance, (3) implies that once you have accumulated a lot of data in one place (imagine hundreds of TB or more), it is extremely difficult and time-consuming to move that data around. This, in turn, means that more often than not, data is likely to reside in a single place.
Traditionally, it was not uncommon to have a large data warehouse that would be the repository of all data. Then smaller data sets could be carved out from this master data set (also known as data marts) as required. This approach is becoming increasingly infeasible. Carving out 100TB data marts from a 1PB data warehouse is simply not going to scale.
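To put some rough numbers on why moving data at this scale hurts, here is a back-of-envelope calculation. The 10 Gb/s link speed is an assumption for illustration, and the result ignores protocol overhead, disk throughput, and contention — real transfers would be slower still:

```python
# Idealized best case: copy 100 TB over a dedicated 10 Gb/s link.
data_bytes = 100 * 10**12          # 100 TB
link_bits_per_sec = 10 * 10**9     # 10 Gb/s (assumed)

seconds = data_bytes * 8 / link_bits_per_sec
hours = seconds / 3600
print(f"{hours:.1f} hours")        # about 22.2 hours, best case
```

Nearly a full day to carve out a single 100TB data mart, under ideal conditions — and that's before you consider keeping the copy in sync with the warehouse.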
At the same time, it is clear that a one-size-fits-all approach to data storage and analysis is not practical either. Some data sets naturally lend themselves to a relational data model, while others might be more suited to unstructured processing (Hadoop), document-oriented processing (CouchDB or MarkLogic), graph analysis (Neo4J), and so on. Forcing a single model or access mechanism down all customers’ throats is not tenable.
So what would the ideal platform for big data analytics look like? One that allows you to store and access data in various ways, seamlessly.
A good software project must have a good build system. Unless you have a small code base consisting entirely of dynamic, scripted languages, you probably need to “build” your code before you can use it. Until around a year ago, the only build tool that I used and was familiar with was GNU Make. Make and the autotools family of tools have served the developer community well for the past few decades.
But the Make model is rife with problems. Here are a few of them:
Make requires the use of its own domain-specific language — this is, in general, not a good idea. Have you looked at any sizable project’s Makefile lately? It’s hard to understand, and harder to modify.
In the same vein, autoconf/automake are notoriously hard to use. Bear in mind that these tools are supposed to make your life easier.
Makefiles are so hard to write and extend that several popular build systems today are essentially Makefile generators. A good example is CMake.
Make relies heavily on file timestamps to detect changes.
Make is slow.
Makefiles are not modular. Recursive Make is especially evil.
I recently began work on a new pet project. As is usually the case, I spent a lot more time figuring out what tools and libraries I would use for my project than actually writing any code for it :) Part of the investigation was to survey the state of the art in build systems. At work, we started using SCons for most of our builds, which was already a huge improvement over Make. But SCons has its own set of issues.
One of the nicest features in SCons is that build files are regular Python files. This provides enormous flexibility and immediate familiarity. Unfortunately, the SCons documentation leaves much to be desired. I still don’t understand the execution model of SCons very well. For instance, I know how to extend SCons to support cross-compilation for multiple platforms. However, I don’t really understand why those modifications work — there’s quite a bit of black magic that goes on behind the scenes. As a concrete example, there are several magic variables such as _LIBDIRFLAGS that have strange powers.
After some more looking around, I discovered Waf. And now that I’ve played around with it a little bit, I’m happy to say that it is the most pleasant build system I’ve ever used. Things I really like about Waf:
The execution model just makes sense to me. You typically build a project in phases: there’s a configure phase, to sort out dependencies, tools etc; there’s the actual build phase; and then there’s the install phase. It is not uncommon to have a ‘dist’ step as well, to prepare the source for distribution. Waf understands these operations as first class entities. There is a very strong notion of workflow built into Waf.
Comprehensive documentation. Check out the Waf book and the wiki.
Waf has a very strong task model. There is a much stronger notion of dependencies (powered by content hashes, not timestamps). Waf also enforces that all generated code ends up in a separate “build” directory, so your source tree always remains clean.
Using Waf is a breeze — there are no big dependencies, no packages to install, no bloated software to include with your code. Just a single 80 KB script.
Progress indication and colored output are built in, not an afterthought. Like SCons, Waf build files are regular Python files.
Waf is fast. Faster than SCons.
Of course, Waf is not perfect. Coming from a Make/SCons world, I sorely miss the ability to build specific targets. Yes, there are ways to achieve this in Waf, but they are all clumsy. The API documentation (and the source itself) is a bit hard to parse.
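To make the phase model concrete, here is roughly what a minimal wscript for a small C program looks like. This is a sketch from memory — the exact tool names and API depend on your Waf version, so check the Waf book before copying it:

```python
# wscript — Waf reads this file; run `waf configure`, then `waf build`.

def options(opt):
    # Register command-line options used by C compiler detection.
    opt.load('compiler_c')

def configure(conf):
    # The configure phase: find a compiler and check dependencies once,
    # up front, instead of on every build.
    conf.load('compiler_c')

def build(bld):
    # The build phase: all outputs land in the separate build/ directory,
    # leaving the source tree untouched.
    bld.program(source='main.c', target='app')
```

Each top-level function corresponds to one of the workflow phases described above, which is what makes the model so easy to follow.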
Readers of this blog will know that I’m a big fan of GNU screen. While screen is a great tool, it hasn’t seen any major development or feature addition in quite some time. The code base is pretty old, there are some ancient bugs that still linger, and support for modern terminals (such as 256 colors by default) is not quite up to speed. I recently discovered byobu and was extremely happy with it — it completely overhauled my screen user experience. You can read all about byobu here.
I thought I had attained screen nirvana… until I found tmux (hat tip xed). So what exactly is tmux?
tmux is intended to be a modern, BSD-licensed alternative to programs such as GNU screen. Major features include:
A powerful, consistent, well-documented and easily scriptable command interface.
A window may be split horizontally and vertically into panes.
Panes can be freely moved and resized, or arranged into one of four preset layouts.
Support for UTF-8 and 256-colour terminals.
Copy and paste with multiple buffers.
Interactive menus to select windows, sessions or clients.
Change the current window by searching for text in the target.
Terminal locking, manually or after a timeout.
A clean, easily extended, BSD-licensed codebase, under active development.
And how is tmux better than screen? That’s question #1 in the FAQ:
tmux offers several advantages over screen:
- a clearly-defined client-server model: windows are independent entities which may be attached simultaneously to multiple sessions and viewed from multiple clients (terminals), as well as moved freely between sessions within the same tmux server;
- a consistent, well-documented command interface, with the same syntax whether used interactively, as a key binding, or from the shell;
- easily scriptable from the shell;
- multiple paste buffers;
- choice of vi or emacs key layouts;
- an option to limit the window size;
- a more usable status line syntax, with the ability to display the first line of output of a specific command;
- a cleaner, modern, easily extended, BSD-licensed codebase.
There are still a few features screen includes that tmux omits:
- builtin serial and telnet support; this is bloat and is unlikely to be added to tmux;
- wider platform support, for example IRIX and HP-UX, and for odd terminals.
I’ve been using tmux exclusively for the last couple of weeks and I really like it so far. For once, I can actually understand the configuration file :) But there are a few things that I miss from screen:
I found the screen way of scrolling in a buffer and copying text much easier to use than tmux’s. Unless I’m missing something, the only way to scroll a buffer in tmux and copy some text is by using vi-like keyboard commands. While this is doable, it is not always quick or convenient.
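For what it's worth, copy mode can at least be made a little more vi-friendly with a couple of configuration lines. The binding syntax below is what I believe works for the tmux versions of this era — the key-table names have changed across releases, so treat this as an assumption and check your tmux man page:

```
# ~/.tmux.conf — use vi keys in copy mode (enter with prefix+[).
# Note: newer tmux versions use `bind-key -T copy-mode-vi` instead.
setw -g mode-keys vi
bind-key -t vi-copy 'v' begin-selection
bind-key -t vi-copy 'y' copy-selection
```

Even so, it remains a keyboard-driven affair — there is no quick mouse-style scroll-and-grab like screen users may be accustomed to.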
byobu made it really easy to add various status indicators. Wish I had something similar for tmux.
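That said, tmux's status line does support running arbitrary shell commands via `#(command)` substitution, which can approximate some of byobu's indicators by hand. A rough sketch — the specific format strings and commands here are my own guesses, not anything byobu-equivalent:

```
# ~/.tmux.conf — a crude stand-in for byobu-style status indicators.
set -g status-interval 15
set -g status-left '#[fg=green]#H '
set -g status-right '#(uptime | cut -d "," -f 1) | %H:%M %d-%b'
```

It works, but every indicator has to be wired up manually — byobu's point was precisely that you didn't have to.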