
kanorben.net - blog

My personal blog on technology, programming, life, and the random


 

    On Writing Human-Only Readable E-mail Addresses in Plain Text

    December 12th, 2007 by knorby

    Since I am not a big fan of spam, I normally do something to obfuscate my e-mail address whenever I put it up on a publicly accessible site in plain text. I usually don’t do anything special; I just write something like name {at} example {dot} net. As far as I am aware, spammers have not started to collect addresses formatted this way, but it has always bothered me, because it would be so simple to collect them. Of course, that fear is far greater with any sort of computer-generated obfuscation.

    For example, mailman has a particularly dumb format: every address is written like name at example.net. Since spammers don’t have to worry about false positives, they could collect every address from a mailman archive just by joining every word immediately before an ‘at’ to the word immediately after it with an ‘@’. Without reducing the accuracy, the false positive count could be reduced just by checking for a dot in the latter word. Really, the difficulty in extracting such addresses is minimal. The only reason it is not an issue yet is that there are still enough people out there who put addresses up without any sort of obfuscation, so harvesting plain addresses is still the easier job. I think that spammers will have to start collecting obfuscated addresses soon enough, if they haven’t started already.

    So my goal is to work out exactly how I would build an e-mail extraction system were I a spammer; this way, I can determine what sort of addresses could not be extracted easily. To start, let’s go through the assumptions we are making about spammers:

    • False positives aren’t important. There will be plenty of bad addresses already, so a few more won’t hurt.
    • Want to keep everything simple. The spammer is not looking for the theoretically best system, just something that works and is simple to write.
      • Want to write things for the most global cases. If someone does something unique, then we should expect the system to fail.
    • Keep the system to the level of joining together strings. Don’t look for cyphers or anything like that.
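    Under those assumptions, the mailman case from earlier really is only a few lines of work. Here is a minimal sketch in Python; the function name and the regular expression are my own illustration, not anything a real harvester is known to use:

```python
import re

# Hypothetical pattern: a word, a standalone lowercase "at", then another word.
AT_PATTERN = re.compile(r"(\S+)\s+at\s+(\S+)")

def extract_at_addresses(text):
    """Join the word before each standalone 'at' to the word after it."""
    candidates = []
    for local, domain in AT_PATTERN.findall(text):
        # Drop punctuation that commonly clings to the words in running text.
        local = local.lstrip("<([")
        domain = domain.rstrip(".,;:)>]")
        # Cheap false-positive filter: the right-hand word must contain a dot.
        if "." in domain:
            candidates.append(f"{local}@{domain}")
    return candidates

print(extract_at_addresses("Contact name at example.net for details. We met at noon."))
```

    The dot check is what throws out ordinary phrases such as "met at noon"; everything else is plain string joining, which is exactly the level of effort assumed in the list above.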

    So of all things, the obfuscation method is least likely to do anything to the actual text in the address. In the name@example.net example, ‘name’, ‘example’, and ‘net’ are never changed. The only constants here are the top-level domains such as ‘com’, ‘net’, ‘org’, ‘edu’, etc. On the web, the most frequent occurrences of these TLDs will be in URIs and e-mail addresses, so the first step would be to filter the URLs out of this mix. Any URL written without a protocol would result in a false positive.

    The next step is to find all the obvious addresses. With the rest of the TLDs found, if the match is connected by a dot to a word on its left (as in example.net), join that domain with an ‘@’ to the word two positions to its left, skipping the separator in between. Otherwise, join the TLD with a ‘.’ to the word two positions to its left, and then join that new string with an ‘@’ to the word two positions further left. The spammer surely could think of other methods like the ones I have outlined.

    This exercise makes it clear that the only way to avoid most trouble is to come up with some sort of encoding that is human-readable but opaque to these sorts of general extraction methods. With something like an e-mail address on a uchicago system, if it is listed on a uchicago site, it is possible to use abbreviations like uchic.... edu. These sorts of obfuscations could still be detected by a sophisticated extraction system, but it would be too much of a hassle for too limited results. There are other tricks one could employ along the same lines; for example, the addresses name..../\7.....example#...#net or USER:(....name....) |AT| ADDY:(....example{dot}net....) . The problem with these is that they are only just human-readable, and a means of extraction is not that far off.
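    The TLD-anchored joining described above can be sketched roughly as follows. This is only an illustration under my own assumptions: the TLD set is a tiny sample, "words" are whatever whitespace splitting produces, and all of the names are hypothetical.

```python
import re

TLDS = {"com", "net", "org", "edu"}  # small illustrative subset, not a real list
PLAIN_ADDRESS = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
STRIP_CHARS = ".,;:()<>[]"

def extract_candidates(text):
    """Collect obvious addresses, then guess at obfuscated ones around TLDs."""
    found = PLAIN_ADDRESS.findall(text)  # the obvious, unobfuscated addresses
    words = [w.strip(STRIP_CHARS) for w in text.split()]
    for i, word in enumerate(words):
        # Skip URLs that state a protocol and addresses already collected.
        if "://" in word or "@" in word:
            continue
        lowered = word.lower()
        if "." in lowered and lowered.rsplit(".", 1)[1] in TLDS:
            # TLD already joined by a dot (example.net): hop over the separator.
            if i >= 2:
                found.append(f"{words[i - 2]}@{word}")
        elif lowered in TLDS:
            # Bare TLD: rebuild the domain first, then hop again for the name.
            if i >= 4:
                found.append(f"{words[i - 4]}@{words[i - 2]}.{word}")
    return found

print(extract_candidates("write to name {at} example {dot} net or see http://example.org"))
```

    Note that the code never looks at what the separators actually say; it only counts positions. That is why {at}, (at), or any other single-word marker fails equally, and also why a protocol-less URL in running text turns into a bogus address, which the spammer does not care about.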

    What is the best solution? For all intents and purposes, I consider JavaScript obfuscation equivalent to putting addresses in plain text without obfuscation. As I have previously discussed, it is pretty easy to extract the contents of the DOM from Firefox. The first method that comes to mind is essentially a series of variations on the barely readable obfuscations. Basically, using PHP or something else server-side, addresses can be written as normal and then encoded when the page is generated. The problem with this solution is that these methods are barely human-readable, and the text itself is still left unaltered. What sort of method could leave everything human-readable but also modify the text itself? I haven’t been able to think of anything. Perhaps I will think of something soon…
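    For what it is worth, the server-side encoding idea might look something like the sketch below. I have written it in Python rather than PHP, the marker lists are just the variations used as examples in this post plus one more, and the function name is made up for illustration.

```python
import random

# Marker variants are illustrative; most are borrowed from the examples in this post.
AT_MARKERS = ["|AT|", "{at}", "/\\7"]
DOT_MARKERS = ["{dot}", "#...#", "(dot)"]

def obfuscate(address, rng=None):
    """Rewrite name@example.net with randomly chosen separators and padding."""
    rng = rng or random.Random()
    local, domain = address.split("@", 1)
    at = rng.choice(AT_MARKERS)
    dot = rng.choice(DOT_MARKERS)
    padding = "." * rng.randint(2, 5)
    # The name and domain labels pass through untouched; only separators change.
    return f"{local}{padding}{at}{padding}{domain.replace('.', dot)}"

print(obfuscate("name@example.net"))
```

    The comment on the last line of the function is the whole problem: ‘name’, ‘example’, and ‘net’ still appear literally in the output, so an extractor that anchors on the TLD and counts positions has something to work with no matter how the separators are shuffled.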


     
    Creative Commons License - © 2007 Karl Norby