On email obfuscation

Executive summary: I think email obfuscation is futile.

If you want to find out more, read on.

Email obfuscation, or address munging, has become so commonplace these days that we have come to take it for granted. If our favorite website or content management system or wiki or mailing list web interface doesn’t protect our emails, we frown and scowl and shun the blasphemous beast out of our sight forever. How can one even think of NOT hiding the email address?!! Why would anyone be so crazy as to just hand over our precious email addresses to the evil spammers on a platter?

I argue that email obfuscation really doesn’t buy us much. Sure, it might make the job of the spammer a tad harder, but as I will show, most of these problems and their solutions have become so commoditized that the marginal utility of obfuscation is fast approaching zero. Over the years we have seen some really clever ways of hiding our email addresses, but none of them are bulletproof and most of them are trivially easy to deal with. Others have also pointed out the dangers of email obfuscation.

Images don’t work

One common way of hiding email from bots and spammers is to create an image containing the email address. Facebook does this, for instance. Now, using images for email is inherently bad for accessibility, because screen readers will not be able to read it. But even otherwise, there are several free character recognition programs that do a pretty good job of decoding such email-images. I have experimented with jocr and ocrad with a reasonable success rate.

The sophistication of OCR software points to a more general problem though: any obfuscation approach that preserves the visual rendering of the email is vulnerable to an OCR-based attack. That is, if the email address visually looks like an email address on screen to the reader, then one can always simply take a screenshot of the web page in question and shove it through an OCR program to recover the email address. Thus, HTML-encoding the “mailto” link (as done by Dokuwiki and many others) is just as vulnerable as images containing emails. Similarly, Javascript-based techniques are not effective because the visual representation of email addresses is preserved.
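For entity-encoded links, a harvester does not even need the screenshot-and-OCR route: the standard library can undo the encoding directly. The following is a minimal sketch, assuming the link is encoded with numeric character references; the helper names `entity_encode` and `extract_mailto_addresses` and the address `user@example.com` are made up for illustration and do not reflect how any particular wiki encodes its links.

```python
import html
import re


def entity_encode(text):
    """Encode every character as a numeric HTML character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)


def extract_mailto_addresses(page_source):
    """Decode HTML entities, then pull out anything following 'mailto:'."""
    decoded = html.unescape(page_source)
    # Stop at whitespace, quotes, angle brackets, or a '?' query string.
    return re.findall(r"mailto:([^\s\"'<>?]+)", decoded)


# A hypothetical obfuscated link; no literal '@' appears in the page source.
obfuscated = '<a href="{}">contact me</a>'.format(
    entity_encode("mailto:user@example.com")
)

print(obfuscated)
print(extract_mailto_addresses(obfuscated))
```

The decoding step is a single call to `html.unescape`, which is why this style of encoding adds so little cost for whoever is harvesting.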

Clever ways of communicating the email don’t work

So we have established that if an email “looks like” an email to a human, it can be identified as such by a computer without too much trouble. Which leaves us with the other option: communicate the email address such that visually it does NOT look like an email address. Of course, this defeats the whole point of coming up with a canonical way of representing email addresses in the first place, but let us set that aside for now.

One could argue that there are an infinite number of ways to communicate the email address to an astute reader without revealing it to a machine. Here are a few of them:

“username AT domainname”
“mailto:username(at)domain(dot)com”
“[email protected]”
can without great difficulty be pieced together from edu, username, domain, @ and some dots

The first three are fairly easy to deal with. Here is a simple Python script that takes as input a set of regular expressions and tries to match each line of the input with them:

[gist id=90969]

The patterns file might look like:

(\S+)\.at\.(\S+) %s@%s
(\S+)\s+at\s+(\S+)\s+dot\s+(\S+) %s@%s.%s
(\S+)\[at\](\S+)\[dot\](\S+) %s@%s.%s
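The embedded script is not reproduced on this page, so here is a minimal sketch of that kind of script, not the original. It assumes each rule is a regular expression followed by whitespace and a `%s`-style template, as in the patterns file above; the function names `load_rules` and `deobfuscate` and the sample addresses are hypothetical.

```python
import re

# Rules in the same "regex template" layout as the patterns file above.
PATTERNS = r"""
(\S+)\.at\.(\S+) %s@%s
(\S+)\s+at\s+(\S+)\s+dot\s+(\S+) %s@%s.%s
(\S+)\[at\](\S+)\[dot\](\S+) %s@%s.%s
"""


def load_rules(patterns_text):
    """Parse each non-empty line into a (compiled regex, template) pair."""
    rules = []
    for line in patterns_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The template contains no spaces, so split on the last whitespace run.
        regex, template = line.rsplit(None, 1)
        rules.append((re.compile(regex, re.IGNORECASE), template))
    return rules


def deobfuscate(lines, rules):
    """Return one reconstructed address per line that matches any rule."""
    found = []
    for line in lines:
        for regex, template in rules:
            match = regex.search(line)
            if match:
                found.append(template % match.groups())
                break  # first matching rule wins for this line
    return found


if __name__ == "__main__":
    sample = [
        "alice.at.example.org",
        "reach bob at example dot com",
        "carol[at]example[dot]net",
        "this line has no address",
    ]
    for address in deobfuscate(sample, load_rules(PATTERNS)):
        print(address)
```

Rules are tried in order and the first match wins, so more specific patterns should come earlier in the file.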

I will go out on a limb and claim that with a corpus of about 40-50 rules, one can easily parse 80% of the email addresses that are obfuscated like this. Not to mention the fact that such addresses are just waiting to be harvested by spammers, as a Google search reveals.

Email addresses are easy to simply guess!

Let’s say you were successful in hiding your email address from spammers and conveying it to a genuine reader. What is to stop the spammer from simply guessing your email address? Most professional email addresses (whether in academic institutes, research labs, companies, etc.) are usually a predictable function of our names. Common examples are:

Since we are hardly as careful about revealing our names as we are about revealing our email addresses, it is much easier for spammers to harvest names and then just construct an exhaustive list of email addresses from those names.
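To make the "predictable function of our names" point concrete, here is a small sketch of how a list of candidate addresses could be built from a name. The specific formats in it (first.last, first initial plus last name, and so on) are my own assumptions about common conventions, not a list taken from this post, and the function name, the sample name, and the domain `example.edu` are hypothetical.

```python
def candidate_addresses(full_name, domain):
    """Build plausible addresses from a name, using assumed naming conventions."""
    parts = full_name.lower().split()
    first, last = parts[0], parts[-1]
    local_parts = [
        f"{first}.{last}",    # first.last
        f"{first}{last}",     # firstlast
        f"{first[0]}{last}",  # first initial + last name
        f"{first}{last[0]}",  # first name + last initial
        first,                # first name only
        last,                 # last name only
    ]
    # dict.fromkeys drops duplicates while keeping the original order.
    unique = list(dict.fromkeys(local_parts))
    return [f"{local}@{domain}" for local in unique]


for address in candidate_addresses("Ada Lovelace", "example.edu"):
    print(address)
```

A handful of guesses per name is cheap to generate, which is the whole problem: the name and the employer are usually public already.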

Similarly, email addresses are trivially easy to reconstruct from screen names, IM handles and network usernames. The proliferation of social networking websites with bad privacy features and our carelessness in putting out personal information all over the place without even realizing it makes this problem even more severe.

If all else fails, bulk email address lists can be purchased without much trouble. There is a flourishing underground economy on the Internet for this kind of stuff.

The fear factor

Technical arguments apart, I think obfuscation is like running away from the problem. Spamming is like terrorism in many ways: it feeds off the fears of the public. I want to give out my email address on my website, but I can’t do so for fear of my email address being harvested. The amount of time and energy we spend obfuscating our addresses and de-obfuscating an address every time we come across one translates into dollars for someone. After all, time is money. There is no doubt that there is a LOT of money in spam. Not just for the spammers, but also for the “good guys”: the entire industry of anti-virus, anti-spam, firewall and security products that has sprung up as a result.

So where does that leave us?

For several years now I have tried hard to make sure my email address is available, but not easily harvestable. Yet, from all I can tell, it hasn’t had a huge impact on the amount of spam I get. Sure enough, I would probably have gotten even more spam had it been available as is, but as an end user, I don’t see the difference between 100 spams daily and 1,000 spams daily. And since the amount of spam I get has been monotonically increasing over time, it suggests that despite my efforts, at least some spammers out there already do have my email. In which case, it doesn’t really matter how many more get it; it’s the first outbreak that counts.

If you absolutely do feel the need to obfuscate your email, please try to do graceful email obfuscation. Another solution is to use disposable email addresses, but this is really an impractical solution given that our email addresses are increasingly becoming our identities.

Any long-term solution will likely involve non-trivial changes to the entire email infrastructure. Recent research has revealed that spam conversion rates are abysmally low: about one in ONE MILLION spams gets converted to some money. Despite these low conversion rates, spam continues to increase, so clearly spammers are still profitable. The only way to permanently get rid of spam is to make it prohibitively expensive for spammers to send out such huge amounts of spam. But this is not going to happen any time soon.
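The economics here reduce to a one-line inequality: spamming pays as long as the cost of sending one message is below the conversion rate times the revenue per conversion. The sketch below works that through. Only the one-in-a-million rate comes from the text above; the revenue figure and the two per-message costs are made-up numbers chosen purely for illustration.

```python
def break_even_cost_per_message(conversion_rate, revenue_per_conversion):
    """Highest per-message sending cost at which spam still breaks even."""
    return conversion_rate * revenue_per_conversion


def is_profitable(conversion_rate, revenue_per_conversion, cost_per_message):
    """True if the expected revenue per message exceeds the cost of sending it."""
    return cost_per_message < break_even_cost_per_message(
        conversion_rate, revenue_per_conversion
    )


CONVERSION_RATE = 1e-6          # about one in a million, as cited above
REVENUE_PER_CONVERSION = 100.0  # hypothetical dollars earned per conversion

threshold = break_even_cost_per_message(CONVERSION_RATE, REVENUE_PER_CONVERSION)
print(f"break-even cost per message: ${threshold:.6f}")

# Hypothetical per-message costs: nearly free today vs. artificially expensive.
print(is_profitable(CONVERSION_RATE, REVENUE_PER_CONVERSION, 0.00001))
print(is_profitable(CONVERSION_RATE, REVENUE_PER_CONVERSION, 0.001))
```

Under these assumed numbers, any scheme that pushes the sender's cost above a small fraction of a cent per message would flip the sign, which is what "prohibitively expensive" has to mean in practice.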

In the meantime, use aggressive spam filtering. Don’t worry about being too conservative. If someone really wants to get their email to you, they WILL somehow get it to you. There are lots of good spam filtering solutions out there, so use them. Large email providers like Gmail and Yahoo should take the lead in developing state-of-the-art spam filtering solutions, since they have access to massive amounts of training data.

2 comments

September 14th, 2008 - 1:15 pm tekkie

I believe most bots are still running through the code as OCR is a time consuming process.

As for regular obfuscation, Mac OS X users can use a Dashboard widget called obfuscatr. It provides JS or just plain encoding of your email. See the details at flash tekkie.

obfuscatr was also featured in MacWorld Italy of March 2008.

May 12th, 2009 - 1:11 pm Pingback: Obfuscate no more: why your email address should go au naturale - Jason Priem