November 6th, 2007 by knorby

I have always thought that the social dynamics of digg were a little odd, to say the least, but I have never been able to put my finger on why. I don’t quite know why I even browse digg, but I do; at least some of the posted stories are interesting or useful. Digg is one of the few high-traffic sites I have seen where headings like “BREAKING,” “AMAZING,” or some other word in all caps are somehow considered acceptable. I guess the most noticeable dynamic, though, is the pure sensationalism. It is hard to believe half of the stories posted. Sure, the Internet is famous for bullshit, but “web 2.0” + pure bullshit seems to be in a category of its own. Perhaps digg is simply the combined expression of Internet culture, but I believe it is a force far darker.
Posted in culture, digg, internet | No Comments
November 6th, 2007 by knorby
So, here is the problem. I want to be able to get the source for a page after it has been rendered by Firefox (that is, after any JavaScript manipulations made on load have been applied, etc…). In other words, I want to be able to serialize the DOM in Firefox from the command-line. Essentially, I am trying to write a massive hack. There are a few problems that need to be overcome first. For one, Firefox requires some display. Since I only really care about Linux/BSD/Sun systems, I have to go through X11 (speaking of massive hacks…). Basically, I need a dummy X11 session. I don’t care what is displayed; I just want to send it somewhere. VNC, fortunately, provides this interface. It is worth noting that I have not fully written this yet (laziness + hard-ass school = project stagnation), but I have a very good idea of what it will do.

Anyways, the display is one small part of the problem; the trick here is getting the DOM out. I had some fun here. Unfortunately, DOM serialization must be done through JavaScript. Gecko provides a really nice little tool: XMLSerializer. I am not aware of anything like it in another browser, which just further supports my belief that Firefox/anything Gecko-based is simply the lesser of evils (bad design being evil, of course).

Why Mozilla decided that the mix of XUL (an XML format Mozilla came up with to design interfaces) and JavaScript would be a sensible thing to build a browser around, I don’t know, but it is useful here. The normal browser interface can be found at chrome://browser/content/browser.xul. You can have a lot of fun loading lots of these inside each other (see image). If you load browser.xul with Firebug, you can play around with all of Firefox’s standard functions, which is always fun.
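The XMLSerializer step mentioned above can be sketched roughly as follows. This is only an illustration, not the finished FireScraper code: the helper name serializeRenderedDom is made up here, and the optional second argument exists only so the function can be tried outside of Firefox.

```javascript
// Minimal sketch: turn an already-rendered document into a string.
// `serializeRenderedDom` is a hypothetical helper name, not a Firefox API.
function serializeRenderedDom(doc, SerializerCtor) {
  // In Gecko, XMLSerializer is available as a global. Accepting a
  // constructor as a parameter is an assumption added for testability.
  var Ctor = SerializerCtor || XMLSerializer;
  var serializer = new Ctor();
  // serializeToString walks the live DOM tree, so the result reflects
  // whatever scripts have changed since the page was loaded.
  return serializer.serializeToString(doc);
}
```

From a Firebug console, calling serializeRenderedDom(document) should hand back the markup for the page as it currently stands, rather than the source originally sent by the server.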

If you are creating a traditional extension, I suppose you would want to look at this stuff as well, but it is especially helpful here. Once this set of deep Firefox functions has been revealed, the actual loading of a page is rather trivial. The real problem is I/O. I need to be able to pass Firefox the link I want to open from the command-line, and have it write the serialized page to a specified location. Fortunately, there is JSLib, which provides things like I/O in JavaScript. From here, the solution is simple. I just want to make a copy of browser.xul and add a few scripts to it. I then want to parse the GET arguments on this file when it is loaded, since I can pass these to Firefox from the command-line. I would want one for the URL and one for the output path. Of course, these would have to be escaped before they could actually be passed to Firefox. That’s it! I was planning on calling it FireScraper. Hopefully I can finish it soon.
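The argument passing described above might look something like this. It is a sketch under assumptions: the parameter names url and out, the function names, and the chrome:// address in the usage note are all invented for illustration, and none of it has been run against a copied browser.xul.

```javascript
// Build the address to hand to Firefox, escaping both arguments.
// "url" and "out" are hypothetical parameter names.
function buildLaunchUrl(base, targetUrl, outputPath) {
  return base +
    "?url=" + encodeURIComponent(targetUrl) +
    "&out=" + encodeURIComponent(outputPath);
}

// Parse a query string such as window.location.search into an object.
function parseArgs(search) {
  var args = {};
  var query = search.replace(/^\?/, "");
  if (query === "") {
    return args;
  }
  var pairs = query.split("&");
  for (var i = 0; i < pairs.length; i++) {
    var idx = pairs[i].indexOf("=");
    // A pair without "=" is treated as a key with an empty value.
    var key = idx === -1 ? pairs[i] : pairs[i].slice(0, idx);
    var value = idx === -1 ? "" : pairs[i].slice(idx + 1);
    args[decodeURIComponent(key)] = decodeURIComponent(value);
  }
  return args;
}
```

The idea would be to call buildLaunchUrl("chrome://firescraper/content/browser.xul", pageUrl, path) when putting together the command line (that chrome:// address is a placeholder), and then call parseArgs(window.location.search) from one of the scripts added to the copied file. Escaping with encodeURIComponent keeps any "&" or "?" in the target URL from being mistaken for part of the outer query string.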
Posted in VNC, XUL, coding, design, firebug, firefox, internet, javascript, mozilla, screen scraping | 3 Comments