Tagged: mysql

Some thoughts on dbShards

I heard about dbShards via two recent blog posts — one by Curt Monash and the other by Todd Hoff. It seemed like an interesting product, so I spent some time digging around on their website.

dbShards

As the name suggests, dbShards is all about sharding. Sharding, also known as partitioning, is the process of distributing a given dataset into smaller chunks based on some policy. AFAIK, the term “shard” was popularized recently by Google, even though the concept of partitioning is at least a few decades old. Most distributed data management systems implement some form of sharding by necessity, since the entire data set will not fit in memory on a single node (if it would, you should not be using a distributed system). And therein lies the USP of dbShards — it brings sharding (and with it, performance and scalability) to commodity, single-node databases such as MySQL and Postgres.

So how does it work? Well, dbShards acts as a transparent layer sitting in front of multiple nodes running, let’s say, MySQL. Transparent, because they want to work with legacy code, meaning no or minimal client-side modifications. Inserting new data is pretty simple: dbShards uses a “sharding key” to route an incoming tuple to the appropriate destination. Queries are a bit more complex, and here the website is skimpy on details. Monash’s post mentions that join performance is good when sharding keys are the same — this is not a surprise. I’m interested in what other kinds of query optimizations are in place. When data is partitioned, you really need a sophisticated query planner and optimizer that can minimize data movement and aggregation, and push down as much computation as possible to individual nodes.
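dbShards does not document its routing internals, so the sketch below is just my illustration of the general sharding-key technique; the hash function, shard count, and connection strings are all made up for the example.

```python
import hashlib

# Hypothetical illustration of sharding-key routing. dbShards' actual
# policy is not documented; this only shows the general idea.
SHARDS = [
    "mysql://shard0.example.com/appdb",
    "mysql://shard1.example.com/appdb",
    "mysql://shard2.example.com/appdb",
    "mysql://shard3.example.com/appdb",
]

def shard_for(sharding_key: str) -> str:
    """Map a sharding key (say, a customer id) to one database node."""
    digest = hashlib.md5(sharding_key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# An insert touches exactly one node:
print(shard_for("customer:42"))

# A query that constrains the sharding key can be routed the same way;
# one that does not must be fanned out to every shard and the partial
# results merged by the routing layer. Joins between tables sharded on
# the same key stay local to each node, which is why they are fast.
```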

I found the page on replication intriguing. I’m guessing that when they say “reliable replication”, they mean “consistent replication” in more common parlance (alternatively, that dbShards supports strong consistency, as opposed to eventual or lazy consistency). This particular bit in the first paragraph caught my eye: “deliver high performance yet reliable multi-threaded replication for High Availability (HA)”. I’m not sure how to read this. Are they implying that multi-threaded replication is typically not performant? And you usually do NOT want threading for high availability, because a single bad thread can still take the entire process down. The actual mechanism for replication seems like a straightforward application of the replicated state machine approach.
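To make that concrete, here is a toy replicated state machine in Python. This is the textbook technique, not dbShards’ code: every replica applies the same totally ordered log of deterministic commands, so all replicas converge to identical state.

```python
# Toy replicated state machine: identical, deterministic replicas applying
# the same ordered command log end up in the same state.
class Replica:
    def __init__(self):
        self.state = {}
        self.applied = 0  # index of the next log entry to apply

    def apply(self, log):
        """Apply, in order, any log entries this replica has not seen yet."""
        for op, key, value in log[self.applied:]:
            if op == "set":
                self.state[key] = value
            elif op == "delete":
                self.state.pop(key, None)
            self.applied += 1

# Agreeing on this totally ordered log, and producing it at high
# throughput, is the hard part that the toy leaves out.
log = [("set", "x", 1), ("set", "y", 2), ("delete", "x", None)]

r1, r2 = Replica(), Replica()
r1.apply(log)
r2.apply(log)
assert r1.state == r2.state == {"y": 2}
```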

But making a system built on replicated state machines scale requires very careful engineering; otherwise it is easy to hit performance bottlenecks. I’d be very interested in knowing a bit more about the transaction model in dbShards and how it performs on larger systems (tens to hundreds of nodes).

The pricing model is also quite interesting. This is the first vendor I know of that is pricing on compute rather than storage (their pricing is $5,000 per year per server). I think this is indicative of the target customer segment as well — I would imagine dbShards works well with a few TBs of data on machines with a lot of CPU and memory.

Amarok ate your stats?

I have a reasonably large music collection. In a desire to trim this collection down to stuff that I really like, over the years I have painstakingly rated my songs in Amarok. But twice now, I have had Amarok eat all my ratings, scores, play counts, and other stats. This post will show you how to recover from such a loss.

Amarok

The first solution is for a situation where an Amarok upgrade caused the stats to be lost. I was using the default Amarok setup — that is, all the stats are stored in a file on disk using sqlite. During the upgrade, Amarok decided to create a new database. Fortunately, it left the old sqlite database file lying around. If you still have that old database, you can restore the stats using the following script:
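Treat this as a sketch rather than a drop-in tool: it assumes the Amarok 1.4 schema, where the stats live in a statistics table keyed by url (verify with .schema statistics first), and the file names are placeholders for your own paths.

```python
import sqlite3

# Copy ratings, scores, and play counts from the leftover Amarok 1.4
# sqlite database into the freshly created one. Adjust the column list
# if your schema differs.
OLD_DB = "collection.db.old"  # placeholder: your leftover database file
NEW_DB = "collection.db"      # placeholder: the new database file

old = sqlite3.connect(OLD_DB)
new = sqlite3.connect(NEW_DB)

rows = old.execute(
    "SELECT url, createdate, accessdate, percentage, rating, playcounter "
    "FROM statistics"
).fetchall()

for row in rows:
    # INSERT OR REPLACE keeps re-runs harmless, assuming the usual
    # unique index on url.
    new.execute(
        "INSERT OR REPLACE INTO statistics "
        "(url, createdate, accessdate, percentage, rating, playcounter) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        row,
    )

new.commit()
print("Restored stats for %d tracks." % len(rows))
```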

The next scenario arose when I switched to Amarok 2 recently. Now, granted, Amarok 2 is still under heavy development, and hence the requisite migration paths might not be in place. But I still hate to lose all my stats. Amarok 2 uses embedded MySQL instead of sqlite to store all its data. The following script uses your old database to restore all the stats in the new database.
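Again, a sketch under assumptions: Amarok 2 splits track paths into a urls table and keys statistics on urls.id, and the old “percentage” column maps to the new “score”. The table and column names below are from my install and may differ in yours; the socket path and database name are likewise placeholders. It uses the mysql-connector-python package, but any MySQL client library will do.

```python
import sqlite3
import mysql.connector  # pip install mysql-connector-python

# Copy stats from the old sqlite database into Amarok 2's embedded MySQL.
# Prerequisite: start mysqld against the embedded data directory first
# (see the note below), e.g. something like:
#   mysqld --datadir=$HOME/.kde/share/apps/amarok/mysqle \
#          --socket=/tmp/amarok.sock
old = sqlite3.connect("collection.db")  # placeholder path
new = mysql.connector.connect(
    unix_socket="/tmp/amarok.sock", user="root", database="amarok"
)
cur = new.cursor()

restored = 0
for url, rating, score, playcount in old.execute(
    "SELECT url, rating, percentage, playcounter FROM statistics"
):
    # Amarok 2 keys statistics on urls.id, not on the raw path, and it
    # stores paths with a leading "." in urls.rpath; LIKE papers over
    # the prefix difference.
    cur.execute("SELECT id FROM urls WHERE rpath LIKE %s LIMIT 1",
                ("%" + url,))
    hit = cur.fetchone()
    if hit is None:
        continue  # track not (yet) scanned into the new collection
    url_id = hit[0]
    cur.execute("SELECT id FROM statistics WHERE url = %s LIMIT 1",
                (url_id,))
    if cur.fetchone() is None:  # no statistics row yet: create one
        cur.execute(
            "INSERT INTO statistics (url, rating, score, playcount) "
            "VALUES (%s, %s, %s, %s)",
            (url_id, rating, score, playcount),
        )
    else:
        cur.execute(
            "UPDATE statistics SET rating = %s, score = %s, playcount = %s "
            "WHERE url = %s",
            (rating, score, playcount, url_id),
        )
    restored += 1

new.commit()
print("Restored stats for %d tracks." % restored)
```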

Note that for this to work, you need to first start a MySQL server using the embedded MySQL directory as the data backend, as pointed out here.

Hope that helps!