Throw away mailing lists

Have you ever found yourself in a situation where you are on a mailing list and you want to send an email to all but 2 members on that list? A common case here is planning a surprise for someone on that list.

Photo by http://www.anna-om-line.com/

In general, I find myself on (long) email threads containing a different subset of people for different occasions (birthdays, anniversaries, etc.) several times a year. The email threads quickly become long and unwieldy. People keep adding other people as the thread progresses, and the only way the newly added people can figure out what’s going on is by looking at the content of future emails. There is no way for anyone to go back and read all the discussion so far.

That got me thinking: wouldn’t it be great to have a service that provides throw away mailing lists? Hear me out. Here’s how the service would work:

  • To start a new mailing list, I simply send an email to newlist@mycoolservice.com. In the email, I also include a list of email addresses I want to seed the list with.
  • The service sends me back the address of a newly created throw away list. This could be of the form some-random-number@googlegroups.com.
  • For all practical purposes, this is exactly like any other mailing list (or Google Group). We can add more members, search the messages etc.
  • Start your discussion and let the thoughts flow.
  • At some point, the purpose behind the list will cease to exist (successful surprise, for instance). Needless to add, further discussions on the topic will also cease.
  • You forget you even created this mailing list. After the mailing list has been idle for some time (say two weeks), the service automatically deletes the mailing list. Any future messages to that address will bounce back saying that the list has been deleted, please contact the administrator.
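
The lifecycle above can be sketched in a few lines of Python. This is only an in-memory illustration, not a real mail service: the class name `ThrowawayLists`, the domain `lists.example.com`, and the method names are all made up for the example, and the two-week idle limit is the figure suggested above.

```python
import secrets
from datetime import datetime, timedelta

IDLE_LIMIT = timedelta(weeks=2)  # idle period after which a list is deleted


class ThrowawayLists:
    """Hypothetical in-memory registry of throw away mailing lists."""

    def __init__(self, domain="lists.example.com"):
        self.domain = domain
        self.lists = {}  # list address -> {"members": set, "last_activity": datetime}

    def create(self, seed_members, now):
        """Create a list seeded with the given addresses and return its address."""
        while True:
            address = f"{secrets.randbelow(10**9)}@{self.domain}"
            if address not in self.lists:
                break
        self.lists[address] = {"members": set(seed_members), "last_activity": now}
        return address

    def purge_idle(self, now):
        """Delete every list that has been idle for longer than IDLE_LIMIT."""
        expired = [a for a, entry in self.lists.items()
                   if now - entry["last_activity"] > IDLE_LIMIT]
        for address in expired:
            del self.lists[address]

    def post(self, address, now):
        """Record a message to the list, or bounce if the list no longer exists."""
        self.purge_idle(now)
        entry = self.lists.get(address)
        if entry is None:
            return "bounce: this list has been deleted, please contact the administrator"
        entry["last_activity"] = now
        return "delivered"
```

A real service would of course also need to receive and relay mail; the sketch only shows the create, use, and expire steps.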

Does anyone else think this could be useful?

On email obfuscation

Executive summary: I think email obfuscation is futile.

If you want to find out more, read on.

Email obfuscation or address munging has become so commonplace these days that we have come to take it for granted. If our favorite website or content management system or wiki or mailing list web interface doesn’t protect our emails, we frown and scowl and shun the blasphemous beast out of our sight forever. How can one even think of NOT hiding the email address?!! Why would anyone be so crazy as to just hand over our precious email address to the evil spammers on a platter?

I argue that email obfuscation really doesn’t buy us much. Sure, it might make the job of the spammer a tad bit harder, but as I will show, most of these problems and their solutions have become so commoditized that the marginal utility of obfuscation is fast approaching zero. Over the years we have seen some really clever ways of hiding our email addresses, but none of them are bulletproof and most of them are trivially easy to deal with. Others have also pointed out the dangers of email obfuscation.

Images don’t work

One common way of hiding email from bots and spammers is to create an image containing the email address. Facebook does this, for instance. Now, using images for email is inherently bad for accessibility, because text readers will not be able to read it. But even otherwise, there are several free character recognition programs that do a pretty good job of decoding such email-images. I have experimented with jocr and ocrad with a reasonable success rate.

The sophistication of OCR software points to a more general problem though: any obfuscation approach that preserves the visual rendering of the email is vulnerable to OCR-based attacks. That is, if the email address visually looks like an email address on screen to the reader, then one can always simply take a screenshot of the web page in question and shove it through an OCR program to recover the email address. Thus, HTML-encoding the “mailto” link (as done by Dokuwiki and many others) is just as vulnerable as images containing emails. Similarly, Javascript-based techniques are not effective because the visual representation of email addresses is preserved.
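
For HTML-encoded links in particular, a harvester may not even need OCR. The sketch below assumes the obfuscation is nothing more than HTML character references (the sample markup is made up, and I have not checked exactly what Dokuwiki emits); the helper name `extract_mailto` is hypothetical. Python’s standard `html.unescape` does the decoding:

```python
import html
import re


def extract_mailto(markup):
    """Decode HTML character references, then pull addresses out of mailto: links."""
    decoded = html.unescape(markup)
    # Stop at quotes, '>', '?', or whitespace so only the address itself is captured.
    return re.findall(r"mailto:([^\"'>?\s]+)", decoded)


# Made-up example: "mailto:user@example.com" with some characters written
# as numeric character references.
encoded = ('<a href="&#109;&#97;&#105;&#108;&#116;&#111;&#58;'
           'user&#64;example&#46;com">mail me</a>')
print(extract_mailto(encoded))
```

The OCR argument is the more general one; this only shows that entity encoding by itself falls to a few lines of standard-library code.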

Clever ways of communicating the email don’t work

So we have established that if an email “looks like” an email to a human, it can be identified as such by a computer without too much trouble. Which leaves us with the other option: communicate the email address such that visually it does NOT look like an email address. Of course, this defeats the whole point of coming up with a canonical way of representing email addresses in the first place, but let us set that aside for now.

One could argue that there are an infinite number of ways to communicate the email address to an astute reader without revealing it to a machine. Here are a few of them:

  • “username AT domainname”
  • “mailto:username(at)domain(dot)com”
  • “usernameREMOVETHIS@domain.com”
  • can without great difficulty be pieced together from edu, username, domain, @ and some dots

The first three are fairly easy to deal with. Here is a simple Python script that takes as input a set of regular expressions and tries to match each line of the input with them:

[gist id=90969]

The patterns file might look like:

(\S+)\.at\.(\S+) %s@%s
(\S+)\s+at\s+(\S+)\s+dot\s+(\S+) %s@%s.%s
(\S+)\[at\](\S+)\[dot\](\S+) %s@%s.%s
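
The embedded gist is not reproduced here, so the following is only a minimal sketch of what such a script might look like, not the original code. It assumes each pattern line is a regular expression, then whitespace, then a `%s`-style template (as in the three lines above); the function names `load_patterns` and `deobfuscate` are invented for the example.

```python
import re


def load_patterns(lines):
    """Parse 'REGEX TEMPLATE' lines into (compiled_regex, template) pairs."""
    patterns = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The template is the last whitespace-separated field;
        # everything before it is the regular expression.
        regex, template = line.rsplit(None, 1)
        patterns.append((re.compile(regex, re.IGNORECASE), template))
    return patterns


def deobfuscate(text_lines, patterns):
    """Try every pattern on every input line and return the rebuilt addresses."""
    found = []
    for line in text_lines:
        for regex, template in patterns:
            for match in regex.finditer(line):
                found.append(template % match.groups())
    return found


PATTERN_LINES = [
    r"(\S+)\.at\.(\S+) %s@%s",
    r"(\S+)\s+at\s+(\S+)\s+dot\s+(\S+) %s@%s.%s",
    r"(\S+)\[at\](\S+)\[dot\](\S+) %s@%s.%s",
]

SAMPLE = [
    "contact: alice at example dot com",
    "bob[at]example[dot]org",
    "carol.at.example.net",
]

for address in deobfuscate(SAMPLE, load_patterns(PATTERN_LINES)):
    print(address)
```

In practice the pattern lines would be read from a file, and the input from a crawled page rather than a hard-coded list.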

I will go out on a limb and claim that with a corpus of about 40-50 rules, one can easily parse 80% of the email addresses that are obfuscated like this. Not to mention the fact that such addresses are just waiting to be harvested by spammers, as a quick Google search reveals.

Email addresses are easy to simply guess!

Let’s say you were successful in hiding your email address from spammers and conveying it to a genuine reader. What is to stop the spammer from simply guessing your email address? Most professional email addresses (whether in academic institutes, research labs, companies, etc.) are usually a predictable function of our names. Common examples are:

  • firstname.lastname@company.com
  • firstinitialLastName@company.com
  • firstinitialMiddleInitialLastName@company.com and so on

Since we are hardly as careful about revealing our names as we are about revealing our email addresses, it is much easier for spammers to harvest names and then just construct an exhaustive list of email addresses from those names.
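
To make the point concrete, here is how little code it takes to turn a name into the three address shapes listed above. The function name `candidate_addresses` and the sample name and domain are invented for illustration; a real list of conventions would be longer.

```python
def candidate_addresses(first, last, domain, middle=""):
    """Build likely addresses for a person from the common naming conventions."""
    first, last, middle = first.lower(), last.lower(), middle.lower()
    candidates = [
        f"{first}.{last}@{domain}",     # firstname.lastname
        f"{first[0]}{last}@{domain}",   # first initial + last name
    ]
    if middle:
        # first initial + middle initial + last name
        candidates.append(f"{first[0]}{middle[0]}{last}@{domain}")
    return candidates


for address in candidate_addresses("Jane", "Doe", "company.com", middle="Quinn"):
    print(address)
```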

Similarly, email addresses are trivially easy to reconstruct from screen names, IM handles and network usernames. The proliferation of social networking websites with bad privacy features and our carelessness in putting out personal information all over the place without even realizing it makes this problem even more severe.

If all else fails, bulk email address lists can be purchased without much trouble. There is a flourishing underground economy on the Internet for this kind of stuff.

The fear factor

Technical arguments apart, I think obfuscation is like running away from the problem. Spamming is like terrorism in many ways: it feeds off of the fears of the public. I want to give out my email address on my website, but I can’t do so for fear of my email address being harvested. The amount of time and energy we spend obfuscating our addresses and de-obfuscating an address every time we come across one translates into dollars for someone. After all, time is money. There is no doubt that there is a LOT of money in spam. Not just for the spammers, but also for the “good guys”: the entire industry of anti-virus, anti-spam, firewall and security products that has sprung up as a result.

So where does that leave us?

For several years now I have tried hard to make sure my email address is available, but not easily harvestable. Yet, from all I can tell, it hasn’t had a huge impact on the amount of spam I get. Sure enough, I would probably have gotten even more spam had it been available as is, but as far as an end user goes, I don’t see the difference between 100 spams daily and 1,000 spams daily. And since the amount of spam I get has been monotonically increasing over time, it suggests that despite my efforts, at least some spammers out there already do have my email. In which case, it doesn’t really matter how many more get it; it’s the first outbreak that counts.

If you absolutely do feel the need to obfuscate your email, please try to do graceful email obfuscation. Another solution is to use disposable email addresses, but this is really an impractical solution given that our email addresses are increasingly becoming our identities.

Any long-term solution will likely involve non-trivial changes to the entire email infrastructure. Recent research has revealed that spam conversion rates are abysmally low: only about one in ONE MILLION spams gets converted into some money. Despite these low conversion rates, spam continues to increase, so clearly spammers are still profitable. The only way to permanently get rid of spam is to make it prohibitively expensive for spammers to send out such huge amounts of spam. But this is not going to happen any time soon.
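
A back-of-the-envelope sketch shows why a one-in-a-million conversion rate can still pay. Only the conversion rate comes from the paragraph above; the revenue per conversion, the per-message costs, and the message volume below are made-up numbers chosen for illustration, and the function names are hypothetical.

```python
def break_even_cost_per_message(conversion_rate, revenue_per_conversion):
    """Highest per-message sending cost at which a campaign still breaks even."""
    return conversion_rate * revenue_per_conversion


def campaign_profit(messages_sent, conversion_rate,
                    revenue_per_conversion, cost_per_message):
    """Expected revenue minus total sending cost."""
    expected_revenue = messages_sent * conversion_rate * revenue_per_conversion
    return expected_revenue - messages_sent * cost_per_message


RATE = 1e-6       # about one conversion per million messages (from the text)
REVENUE = 100.0   # assumed revenue per conversion, purely illustrative

print(break_even_cost_per_message(RATE, REVENUE))
# Assumed costs: one below and one above the break-even point.
print(campaign_profit(10**9, RATE, REVENUE, cost_per_message=1e-5))
print(campaign_profit(10**9, RATE, REVENUE, cost_per_message=1e-3))
```

Under these assumptions spam stays profitable as long as sending a message costs less than the conversion rate times the revenue per conversion, which is why raising the per-message cost is the lever that matters.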

In the meantime, use aggressive spam filtering. Don’t worry about being too conservative. If someone really wants to get their email to you, they WILL somehow get it to you. There are lots of good spam filtering solutions out there; use them. Large email providers like Gmail and Yahoo should take the lead in developing state-of-the-art spam filtering solutions, since they have access to massive amounts of training data.

What did you do last year?


Happy new year y’all!

The past couple of days my feed reader has been chock full of posts about one of the following: the year in review, predictions for 2008, reflections and introspections. So much so that I got tired of reading about the “new year” and never got around to writing MY end of the year post, but I’m sure the world didn’t miss much. But I did run into an interesting problem as I was thinking about what could have been my end of the year post: exactly what all did I do last year?

So I started by writing down all the months, the idea being that I would put down all the significant events that happened in any given month next to it. The hope was that there wouldn’t be that many of them, so the list should be fairly manageable. Now, I have always known that my memory is not that great, and that is why I tend to rely on tools to do the dirty bookkeeping for me: calendars, todo lists, reminders, etc. But it was still a little shocking when I couldn’t immediately recall what I did in, let’s say, May of last year. Of course I did remember things once I thought about it a little bit, often relying on context (what happened before May, after May, etc.).

The bottom line is that it wasn’t as easy as I thought it would be. For some months, I actually had to go back to my email inbox and other digital archives to figure out the salient happenings. This got me thinking about **personal information analysis and visualization**. And the more I thought about it, the more excited I became.

I was actually surprised to find so little information on the web about this. With our increasing information overload, cheap storage, and tons of archived data (online and offline), I think this space has tremendous potential for both academic and commercial ventures. For instance, here’s a really simple thing I want to be able to do: for a given time period (say 2007), I want to analyze and visualize all of my emails so that I can quickly figure out:
* who did I communicate with the most?
* what were the main topics I wrote about?
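
A rough sketch of that analysis, using only Python’s standard library, might look like the following. It is a toy under several assumptions: “topics” are just the most frequent subject-line words, the stopword list is tiny, and the names `summarize`, `STOPWORDS` and `me` are invented for the example.

```python
import re
from collections import Counter
from email.utils import getaddresses, parsedate_to_datetime

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"re", "fwd", "the", "and", "for", "with", "from", "your", "about"}


def summarize(messages, year, me):
    """Count correspondents and subject words for messages dated in `year`."""
    correspondents = Counter()
    topics = Counter()
    for msg in messages:
        date_header = msg.get("Date")
        if not date_header:
            continue
        try:
            sent = parsedate_to_datetime(date_header)
        except (TypeError, ValueError):
            continue  # skip messages with an unparseable date
        if sent.year != year:
            continue
        headers = (msg.get_all("From", []) + msg.get_all("To", [])
                   + msg.get_all("Cc", []))
        for _name, address in getaddresses(headers):
            address = address.lower()
            if address and address != me:
                correspondents[address] += 1
        words = re.findall(r"[a-z']+", msg.get("Subject", "").lower())
        topics.update(w for w in words if len(w) > 2 and w not in STOPWORDS)
    return correspondents, topics
```

The `messages` argument can be any iterable of `email.message.Message` objects, for example the messages yielded by `mailbox.mbox(...)` over an exported archive; `correspondents.most_common(10)` and `topics.most_common(10)` then answer the two questions above.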

I couldn’t find any open source tool to do even this. And my initial Googling hasn’t turned up much in commercial offerings either. The closest thing I could find was a project called [[http://alumni.media.mit.edu/~fviegas/projects/themail/study/index.htm|themail]] from the MIT Media Lab, but there’s no code that I can download. Then there is [[http://carohorn.de/anymails/|Anymails]], but it seems to be just a cool visualization without a lot of information (especially the kind I want).

If you know about any free or paid tools that can do this kind of analysis, please drop a line in the comments. And while you are at it, try to think about what YOU did all of last year :-)