On Writing Human-Only Readable E-mail Addresses in Plain Text
Since I am not a big fan of spam, I normally obfuscate my e-mail address whenever I put it up on a publicly accessible site in plain text. I usually don’t do anything special; I just write something like name {at} example {dot} net. As far as I am aware, spammers have not started to collect addresses formatted this way, but it has always bothered me, because it would be so simple to collect them.

Of course, that fear is far greater with any sort of computer-generated obfuscation. For example, mailman has a particularly dumb format: every address is written like name at example.net. Since spammers don’t have to worry about false positives, they could collect every address from a mailman archive just by joining the word immediately before each ‘at’ to the word immediately after it with an ‘@’. Without missing any real addresses, the false positive count could be reduced just by checking for a dot in the latter word. Really, the difficulty in extracting such addresses is minimal. The only reason that it is not an issue yet is that there are still enough people out there who put addresses up without any sort of obfuscation, so the spammer’s task is still easy without it. I think that spammers will have to start collecting such addresses soon enough, if they haven’t started already.

So my goal is to determine exactly how I would go about building an e-mail extraction system were I a spammer; this way, I can determine what sort of addresses could not be extracted easily. To start, let’s go through the assumptions we are making about spammers:
- False positives aren’t important. There will be plenty of bad addresses already, so a few more won’t hurt.
- Want to keep everything simple. The spammer is not looking for the theoretically best system, just something that works and is simple to write.
- Want to write things for the most global cases. If someone does something unique, then we should expect the system to fail.
- Keep the system to the level of joining together strings. Don’t look for cyphers or anything like that.
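To make the mailman example from earlier concrete, here is a minimal Python sketch of the "join the words around each ‘at’" idea. The function name extract_mailman_addresses and the require_dot flag are my own hypothetical names, and the punctuation stripping is an assumption about how archive text tends to look, not anything mailman specifies.

```python
def extract_mailman_addresses(text, require_dot=True):
    """Join the word before each standalone 'at' to the word after it."""
    words = text.split()
    found = []
    for i in range(1, len(words) - 1):
        if words[i] != "at":
            continue
        # Assumption: trim punctuation that commonly hugs an address in prose.
        local = words[i - 1].lstrip("(<[\"'")
        domain = words[i + 1].rstrip(".,;:!?)>]\"'")
        # Optional filter from the text: keep only candidates with a dot
        # in the word after 'at'.
        if require_dot and "." not in domain:
            continue
        found.append(local + "@" + domain)
    return found


sample = "Contact name at example.net. We met at noon."
print(extract_mailman_addresses(sample))
# prints ['name@example.net']
```

With require_dot=False the same sample also yields met@noon, which is exactly the sort of false positive a spammer would not care about.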
So of all things, the obfuscation method is least likely to do anything to the actual text in the address. In the name@example.net example, ‘name’, ‘example’, and ‘net’ are never changed. The most reliable constants here are the top-level domains such as ‘com’, ‘net’, ‘org’, ‘edu’, etc. On the web, the most frequent occurrences of these TLDs will be in URLs and e-mail addresses. The first step would be to filter out the URLs from this mix. Any web address written without a protocol would slip through and result in a false positive. The next step is to collect all the obvious, unobfuscated addresses. With the rest of the TLDs found, if the match is connected by a dot to anything to the left, join that word to the word occurring two positions to the left with an ‘@’. Otherwise, join the TLD to the word two positions to the left with a ‘.’, and then join that new string to the word two positions to the left of that with an ‘@’. The spammer surely could think of other methods like the ones I have outlined.

This exercise makes it clear that the only way to avoid most trouble is to come up with some sort of encoding method that is human-readable, but is opaque to these sorts of general extraction methods. With something like an e-mail address on a uchicago system, if it is listed on a uchicago site, it is possible to make abbreviations like uchic.... edu. These sorts of obfuscations could still be detected by a sophisticated extraction system, but it would be too much of a hassle for too limited results. There are other tricks one could employ along these same lines; for example, the addresses name..../\7.....example#...#net or USER:(....name....) |AT| ADDY:(....example{dot}net....) . The problem with these is that they are only barely human-readable, and a means of extraction is not that far off.
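The TLD-anchored joining described above can be sketched in a few lines of Python. This is only one guess at how a lazy spammer might write it: the TLDS set, the two regular expressions, and the name extract_by_tld are all my own assumptions, and the separator words ({at}, at, {dot}, and so on) are never inspected, only skipped over by position.

```python
import re

# Assumption: a tiny TLD list; a real harvester would use a longer one.
TLDS = {"com", "net", "org", "edu"}
# URLs with an explicit protocol; bare "www.example.com" is NOT caught,
# so it can still produce a false positive, as the text notes.
URL_RE = re.compile(r"\b[a-z][a-z0-9+.-]*://\S+", re.IGNORECASE)
# Obvious, unobfuscated addresses.
PLAIN_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_by_tld(text):
    text = URL_RE.sub(" ", text)           # step 1: drop URLs
    found = PLAIN_RE.findall(text)         # step 2: take the obvious ones
    text = PLAIN_RE.sub(" ", text)
    words = [w.strip(".,;:!?()<>\"'") for w in text.split()]
    for i, word in enumerate(words):
        lower = word.lower()
        if "." in lower:
            # TLD already joined by a dot, e.g. "example.net":
            # the name is two words to the left.
            host, _, tld = lower.rpartition(".")
            if tld in TLDS and host and i >= 2:
                found.append(words[i - 2] + "@" + word)
        elif lower in TLDS and i >= 4:
            # Bare TLD, e.g. "example {dot} net": the domain is two words
            # to the left, and the name two words to the left of that.
            found.append(words[i - 4] + "@" + words[i - 2] + "." + word)
    return found


sample = ("Write to name {at} example {dot} net or see "
          "http://example.org/about. Also bob at example.com "
          "and carol@example.edu.")
print(extract_by_tld(sample))
# prints ['carol@example.edu', 'name@example.net', 'bob@example.com']
```

Nothing here is clever, which is the point: both the {at}/{dot} style and the mailman style fall to the same positional joins.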
What is the best solution? For all intents and purposes, I consider JavaScript obfuscation equivalent to putting addresses in plain text without obfuscation. As I have previously discussed, it is pretty easy to extract the contents of the DOM from Firefox. The first method that comes to mind is essentially a series of variations on the barely readable obfuscations. Basically, using PHP or something else server-side, addresses can be written as normal, and then encoded. The problem with this solution is that these methods are barely human-readable, and the text of the address itself is still left unaltered. What sort of method could leave everything human-readable but also modify the text itself? I haven’t been able to think of anything. Perhaps I will think of something soon.


