Using Firefox to Screen Scrape from the Command-line
So, here is the problem. I want to be able to get the source for a page after it has been rendered by Firefox (that is, after any javascript manipulations on load have been made, etc…). In other words, I want to be able to serialize the DOM in Firefox from the command-line. Essentially, I am trying to write a massive hack.
There are a few problems that need to be overcome first. For one, Firefox requires some display. Since I only really care about Linux/BSD/Sun systems, I have to go through X11 (speaking of massive hacks…). Basically, I need a dummy X11 session. I don’t care what is displayed; I just want to send it somewhere. VNC, fortunately, provides this interface. It is worth noting that I have not fully written this yet (laziness + hard-ass school = project stagnation), but I have a very good idea of what it will do.
Anyways, the display is one small part of the problem; the trick here is getting the DOM out. I had some fun here. Unfortunately, DOM serialization must be done through javascript. Gecko provides a really nice little tool: XMLSerializer. I am not aware of anything like it in another browser, which just further supports my belief that Firefox/anything Gecko-based is simply the lesser of evils (bad design being evil, of course).
Why Mozilla decided that a mix of XUL (an XML format Mozilla came up with for designing interfaces) and javascript would be a sensible thing to build a browser around, I don’t know, but it is useful here. The normal browser interface can be found at chrome://browser/content/browser.xul. You can have a lot of fun loading lots of these inside each other (see image). If you load browser.xul with Firebug, you can play around with all of Firefox’s standard functions, which is always fun.
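For the serialization step itself, here is a minimal sketch of what calling XMLSerializer looks like. It assumes the code runs somewhere Gecko exposes XMLSerializer (a page or chrome script); the helper name serializeDom is made up for illustration, and how the rendered page's document is obtained from inside browser.xul is left out.

```javascript
// Sketch only: turn a live DOM document into a string of markup.
// XMLSerializer is the Gecko-provided object mentioned above;
// serializeDom is a hypothetical helper name, not part of any library.
function serializeDom(doc) {
  var serializer = new XMLSerializer();
  // serializeToString walks the tree as it currently exists,
  // so changes made by scripts after load are included.
  return serializer.serializeToString(doc);
}
```

From a script running in a page, calling serializeDom(document) would return the markup for the DOM as it stands at that moment, not the source that was originally downloaded.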
If you are creating a traditional extension, I suppose you would want to look at this stuff as well, but it is especially helpful here. Once this set of deep Firefox functions has been revealed, the actual loading of a page is rather trivial. The real problem is I/O. I need to be able to pass Firefox the link I want to open from the command-line, and have it write the serialized page to a specified location. Fortunately, there is JSLib, which provides things like I/O in javascript.
From here, the solution is simple. I just want to make a copy of browser.xul and add a few scripts to it. I then want to parse GET arguments on this file when it is loaded, since I can pass these to Firefox from the command-line. I would want one for the url and one for the output path. Of course, these would have to be escaped before they could actually be passed to Firefox. That’s it! I was planning on calling it FireScraper. Hopefully I can finish it soon.
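The escaping and GET-argument parsing described above could look roughly like the sketch below. The parameter names url and out, the two function names, and the example chrome path are all assumptions made for illustration, since FireScraper is not written yet. Writing the result to disk (the JSLib part) is not shown.

```javascript
// Sketch only. "url" and "out" are hypothetical parameter names.

// Build the address to hand to Firefox on the command-line.
// encodeURIComponent escapes ?, &, = and spaces, so a target URL
// with its own query string cannot be confused with our arguments.
function buildScrapeUrl(base, targetUrl, outputPath) {
  return base +
    "?url=" + encodeURIComponent(targetUrl) +
    "&out=" + encodeURIComponent(outputPath);
}

// Parse a query string such as window.location.search
// back into an object of unescaped values.
function parseScrapeArgs(search) {
  var args = {};
  var query = search.replace(/^\?/, "");
  if (query === "") {
    return args;
  }
  var pairs = query.split("&");
  for (var i = 0; i < pairs.length; i++) {
    var eq = pairs[i].indexOf("=");
    var key = eq === -1 ? pairs[i] : pairs[i].slice(0, eq);
    var value = eq === -1 ? "" : pairs[i].slice(eq + 1);
    args[decodeURIComponent(key)] = decodeURIComponent(value);
  }
  return args;
}

// Example round trip; the chrome path is a made-up placeholder.
var full = buildScrapeUrl(
  "chrome://firescraper/content/browser.xul",
  "http://example.com/?a=1&b=2",
  "/tmp/out file.html"
);
var parsed = parseScrapeArgs(full.slice(full.indexOf("?")));
```

After the round trip, parsed.url and parsed.out hold the original, unescaped values. Whether Firefox preserves a query string on a chrome:// address in every version is something to verify rather than assume.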




Writing human-readable encodings of e-mails in plain text to avoid spam | kanorben.net said,
[...] javascript obfuscation the equivalent to putting addresses in plain text without obfuscation. As I have previously discussed, it is pretty easy to extract the contents of the DOM from firefox. The first method that comes to [...]
My Current Projects | kanorben.net said,
[...] – The project resulting from the method I outlined to screen scrape using Firefox from the command-line. Still has a lot of work to be done, but I have already done much of the needed [...]
Bjorn Ellis-Gowland said,
Hi knorby,
I am in the process of #finding# something similar
My skills are pretty limited to php / just starting out learning how to do the most basic of linuxes!!!
Anyway I am in the process of producing a simple proxy ip address spider – easy to grab html responses using cURL but a lot of these sites then produce output using js to thwart the likes of me!!
Well I have been digging very deep into the www abyss and found this :
http:// www. wenwenba. com/ seaflower.jsp
Haven't yet been in contact with the author, but to sum it up it is based on FF3. I am still snooping around to understand the inner workings – it will take a while!!
I presume it starts a FF3 service that listens for pages – FF engine then processes and spits out completed DOM. It is simple and means I can use an easy shell_exec('crawl http://www.google.com') in my php scripts to get the output.
Anyway, check it out – I would love to learn more about it.
Bjorn/
virteman said,
Hello, that is what I want. How is FireScraper going?
Paulo said,
Very good review
In the past I used a VNC virtual session to automatically launch firefox from the command line.
At launch, I gave it the URL to open and the virtual session which Firefox could use.
This was the solution I found at the time (2 yrs ago) to be able to “automatically” use firefox from “command line”
To pass parameters into Firefox, I appended my parameters to the URL passed to Firefox, and made a Firefox extension which:
– parsed the URL and extracted my parameters from it
– opened the real URL without the parameters
I think I will look into JSLib as it could be the solution to pass parameters in a more graceful way
Greetings,