
kanorben.net - blog

My personal blog on technology, programming, life, and the random

 

    Using Firefox to Screen Scrape from the Command-line

    November 6th, 2007 by knorby

    So, here is the problem. I want to be able to get the source for a page after it has been rendered by Firefox (that is, after any load-time JavaScript manipulations have been made, etc…). In other words, I want to be able to serialize the DOM in Firefox from the command-line. Essentially, I am trying to write a massive hack. There are a few problems that need to be overcome first. For one, Firefox requires some display. Since I only really care about Linux/BSD/Sun systems, I have to go through X11 (speaking of massive hacks…). Basically, I need a dummy X11 session. I don’t care what is displayed; I just want to send it somewhere. VNC, fortunately, provides this interface. It is worth noting at this point that I have not fully written this yet (laziness + hard-ass school = project stagnation), but I have a very good idea of what it will do.

    Anyways, the display is one small part of the problem; the trick here is getting the DOM out. I had some fun here. Unfortunately, DOM serialization must be done through JavaScript. Gecko provides a really nice little tool: XMLSerializer. I am not aware of anything like it in another browser, which just further supports my belief that Firefox/anything Gecko-based is simply the lesser of evils (bad design being evil, of course). Why Mozilla decided that the mix of XUL (an XML format Mozilla came up with to design interfaces) and JavaScript would be a sensible thing to build a browser around, I don’t know, but it is useful here. The normal browser interface can be found at chrome://browser/content/browser.xul. You can have a lot of fun loading lots of these inside each other (see image). If you load browser.xul with Firebug, you can play around with all of Firefox’s standard functions, which is always fun.
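As a rough illustration of the XMLSerializer step described above, here is a minimal sketch. The function name `serializeDom` and its optional second parameter are hypothetical (they are not part of Gecko or of the planned tool); the parameter exists only so the function can be exercised outside a browser. Inside Firefox, `XMLSerializer` and its `serializeToString` method are provided by Gecko.

```javascript
// Sketch only: turn a DOM document (or node) into its source string.
// `serializeDom` and the `serializer` parameter are hypothetical names.
function serializeDom(doc, serializer) {
  // Inside Gecko, fall back to the built-in XMLSerializer.
  var s = serializer || new XMLSerializer();
  return s.serializeToString(doc);
}
```

In a page or chrome script this would be called as `serializeDom(document)` once the page has finished loading, so that any load-time script changes are reflected in the output.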

    browser.xul window multiload

    If you are creating a traditional extension, I suppose you would want to look at this stuff as well, but it is especially helpful here. Once this set of deep Firefox functions has been revealed, the actual loading of a page is rather trivial. The real problem is I/O. I need to be able to pass Firefox the link I want to open from the command-line, and have it write the serialized page to a specified location. Fortunately, there is JSLib, which provides things like file I/O in JavaScript. From here, the solution is simple. I just want to make a copy of browser.xul and add a few scripts to it. I then want to parse GET arguments on this file when it is loaded, since I can pass these to Firefox from the command-line. I would want one for the URL and one for the output path. Of course, these would have to be escaped before they could actually be passed to Firefox. That’s it! I was planning on calling it FireScraper. Hopefully I can finish it soon.
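The escaping and GET-argument parsing described above could look something like the sketch below. Everything named here is an assumption for illustration: the chrome URL `chrome://firescraper/content/browser.xul`, the parameter names `url` and `out`, and the helper names `buildScraperUrl` and `parseQuery` do not come from the post. It also assumes, as the post does, that a query string can be attached to the copied browser.xul and read back by the script (for example from `window.location.search`).

```javascript
// Hypothetical location of the modified copy of browser.xul.
var SCRAPER_XUL = "chrome://firescraper/content/browser.xul";

// Command-line side: escape both values so that characters such as
// "?", "&", "=" and spaces survive being passed to Firefox.
function buildScraperUrl(targetUrl, outputPath) {
  return SCRAPER_XUL +
    "?url=" + encodeURIComponent(targetUrl) +
    "&out=" + encodeURIComponent(outputPath);
}

// Browser side: turn a query string like "?url=...&out=..." back
// into an object of decoded key/value pairs.
function parseQuery(search) {
  var args = {};
  var query = search.replace(/^\?/, "");
  if (query === "") {
    return args;
  }
  var pairs = query.split("&");
  for (var i = 0; i < pairs.length; i++) {
    var eq = pairs[i].indexOf("=");
    var key = eq === -1 ? pairs[i] : pairs[i].slice(0, eq);
    var value = eq === -1 ? "" : pairs[i].slice(eq + 1);
    args[decodeURIComponent(key)] = decodeURIComponent(value);
  }
  return args;
}
```

The string returned by `buildScraperUrl` is what would be handed to Firefox on the command-line; the script added to the copied browser.xul would then call `parseQuery` to recover the page URL and the output path.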

    Posted in VNC, XUL, coding, design, firebug, firefox, internet, javascript, mozilla, screen scraping

    3 Responses

    1. Writing human-readable encodings of e-mails in plain text to avoid spam | kanorben.net Says:
      December 12th, 2007 at 2:56 am

      [...] javascript obfuscation the equivalent to putting addresses in plain text without obfuscation. As I have previously discussed, it is pretty easy to extract the contents of the DOM from firefox. The first method that comes to [...]

    2. My Current Projects | kanorben.net Says:
      January 4th, 2008 at 12:15 am

      [...] - The project resulting from the method I outlined to screen scrape using Firefox from the command-line. Still has a lot of work to be done, but I have already done much of the needed [...]

    3. Bjorn Ellis-Gowland Says:
      July 15th, 2008 at 3:44 am

      Hi knorby,

      I am in the process of #finding# something similar

      My skills are pretty limited to php / just starting out learning how to do the most basic of linuxes!!!

      Anyway, I am in the process of producing a simple proxy IP address spider - it is easy to grab HTML responses using cURL, but a lot of these sites then produce their output using JS to thwart the likes of me!!

      Well, I have been digging very deep into the www abyss and found this:

      http:// http://www. wenwenba. com/ seaflower.jsp

      I haven’t yet been in contact with the author, but to sum it up, it is based on FF3. I am still snooping around to understand the inner workings - it will take a while!!

      I presume it starts a FF3 service that listens for pages - the FF engine then processes them and spits out the completed DOM. It is simple and means I can use an easy shell_exec('crawl http://www.google.com') in my php scripts to get the output.

      Anyway, check it out - I would love to learn more about it.

      Bjorn/

     
    Creative Commons License - © 2007 Karl Norby