• Home
  • About Me
  • Contact

kanorben.net - blog

My personal blog on technology, programming, life, and the random

 

November 2008
M T W T F S S
« Oct    
 12
3456789
10111213141516
17181920212223
24252627282930

Blogroll

  • Boing Boing
  • BorjaNet
  • Brian Mayer
  • Dean Armstrong’s Blog
  • Ellen Smith’s blog
  • Faraocious
  • Gross or Awesome?
  • Marcus Westin’s Blog
  • Nightmares of David Bowie’s Package
  • Paul Mantz’s Blog
  • Slashdot
  • Tomorrow with Alex Beinstein
  • Valleywag

Personal Sites

  • DOIT Fortune Database
  • My bookmark’s on del.icio.us
  • My CS account page
  • My Facebook Profile
  • My LinkedIn Page
  • My Picasa Albums
  • My Twitter
  • pyXSD
  • The SUCCESS Blog
  • UofC ACM Site

webcomics

  • Questionable Content
  • Saturday Morning Breakfast Cereal
  • The Perry Bible Fellowship
  • Welcome To The Future
  • xkcd

Meta

  • Register
  • Log in
  • Entries RSS
  • Comments RSS
  • WordPress.org
Add to Google Add to My Yahoo! Subscribe with Bloglines
Bloggers' Rights at EFF

Twitter Updates

    RSS My Del.icio.us

    • Developers Aren't Gonna Read It | Agile Software Development
    • Code Like a Pythonista: Idiomatic Python
    • Samuel Adams Utopias | Larsblog
    • BASTILLE-LINUX
    • from Vim to Emacs - part 1
    • OpenMacGrid | MacResearch
    • Tutorial: Backups with Launchd | MacResearch

    RSS My Facebook Posted Items

    • nomin' on ribs
    • Chomsky says pick the lesser of two evils
    • Elite Officer Recalls Bin Laden Hunt, Delta Force Commander Says The Best Plan To Kill The Al Qaeda
    • Language Fail
    • Safety Fail

    Using Firefox to Screen Scrape from the Command-line

    November 6th, 2007 by knorby

    So, here is the problem. I want to be able to get the source for a page after it has been rendered by Firefox (that is, loading javascript manipulations have been made, etc…). In other words, I want to be able to serialize the DOM in Firefox, from the command-line. Essentially, I am trying to write a massive hack. There are few problems that need to be overcome first. For one, Firefox requires some display. Since I only really care about Linux/BSD/Sun systems, I have to go through X11 (speaking of massive hacks…). Basically, I need a dummy X11 session. I don’t care what is displayed, I just want to send it somewhere. VNC, fortunately, provides this interface. It is worth noting at some point that I have not fully written this yet (laziness + hard-ass school = project stagnation), but I have a very good idea of what it will do. Anyways, the display is one small part of the problem; the trick here is getting the DOM out. I had some fun here. Unfortunately, DOM serialization must be done through javascript. Gecko provides a really nice little tool: XMLSerializer. I am not aware of anything like it in another browser, which just further supports my belief that Firefox/anything Gecko-based is simply the lesser of evils (bad design being evil of course). Why Mozilla decided the mix of XUL (an xml format Mozilla came up to design interfaces) and javascript would be sensible things to build a browser around, I don’t know, but it is useful here. The normal browser interface can be found at chrome://browser/content/browser.xul. You can have a lot of fun loading lots of these inside each other (see image). If you load browser.xul with firebug, you can play around with all of Firefox’s standard functions, which is always fun.

    browser.xul window multiload

    If you are creating a tradition extension, I suppose you would want to look at this stuff as well, but it is especially helpful here. Once this set of deep Firefox functions has been revealed, the actual loading of page is rather trivial. The real problem is I/O. I need to be able to pass firefox the link I want to open from the command-line, and write it to a specified location. Fortunately, there is JSLib, which provides things like I/O in javascript. From here, the solution is simple. I just want to make a copy of browser.xul, and add a few scripts into it. I then want to parse GET arguments on this file when loaded, since I can pass these to Firefox from the command-line. I would want one for the url, and one for the output path. Of course, these would have to be escaped before they could actually be passed to Firefox. That’s it! I was planning on calling it FireScraper. Hopefully I can finish it soon.

    Posted in VNC, XUL, coding, design, firebug, firefox, internet, javascript, mozilla, screen scraping | 3 Comments

     
    Add to Technorati Favorites - Creative Commons License - © 2007 Karl Norby