2010-04-21

Performance lost.

One of the exercises I usually do when learning a new language is to write a web crawler. A web crawler represents a nice cross-section of the sort of features I care about in a language: handling lots of data, network IO, disk IO, concurrency, and so on. If I can write a decent web crawler in a language without too much friction, then it is likely that I will be able to use it for other types of servers as well.

(The one thing this test is not good for is exposing the sort of latency issues you can run into when dealing with managed memory and lots of data and operations. For that, implementing some sort of search engine with various forms of caching is probably a better test. A crawler is a completely different type of problem where you just need throughput: the individual network connections will by nature have high, and highly variable, latency, but since you have lots and lots of them, the overall throughput can still be quite high.)

The other day I stumbled across a web crawler I wrote in plain old C. At the time I needed a very fast crawler for a specific purpose, and I had some novel ideas on how to solve some of the hard problems in crawling. I remember not being too pleased with the initial version of the crawler, so I rewrote significant portions of it a couple of times. In the end I had a relatively simple crawler prototype that I was rather pleased with. It ran on a dual 500 MHz P3 with about 1 GB of memory. At full blast it saturated the network interface on the machine (a 100 Mbit card). If memory serves, it could crawl about 1100-1200 documents per second at an average CPU load of between 12% and 15%, leaving lots of CPU headroom for any document processing one might want to do.
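The trick in a crawler like that is not any single clever piece of code, it is the IO model: one loop multiplexing a large number of non-blocking sockets, so throughput comes from the number of in-flight connections rather than from any single connection being fast. As a rough sketch of that pattern (not the original crawler code; poll()-based, with the host and request hard-coded as placeholders):

    /* A minimal sketch of the multiplexed-IO pattern (not the original crawler
     * code): many non-blocking sockets driven from one poll() loop.  The host
     * and request are hard-coded placeholders; a real crawler adds DNS caching,
     * HTTP parsing, robots.txt, politeness, error handling and so on. */
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <poll.h>
    #include <netdb.h>
    #include <sys/socket.h>

    #define NCONN 64   /* number of concurrent fetches */

    static int connect_nonblocking(const struct addrinfo *ai)
    {
        int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0) return -1;
        fcntl(fd, F_SETFL, O_NONBLOCK);
        connect(fd, ai->ai_addr, ai->ai_addrlen);   /* returns immediately (EINPROGRESS) */
        return fd;
    }

    int main(void)
    {
        struct addrinfo hints = {0}, *res;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo("example.com", "80", &hints, &res) != 0) return 1;

        struct pollfd fds[NCONN];
        for (int i = 0; i < NCONN; i++) {
            fds[i].fd = connect_nonblocking(res);
            fds[i].events = POLLOUT;                /* writable means connected */
        }

        const char *req = "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n";
        char buf[4096];
        int remaining = NCONN;

        while (remaining > 0 && poll(fds, NCONN, 5000) > 0) {
            for (int i = 0; i < NCONN; i++) {
                if (fds[i].fd < 0 || fds[i].revents == 0) continue;
                if (fds[i].revents & (POLLERR | POLLNVAL)) {
                    close(fds[i].fd); fds[i].fd = -1; remaining--;
                } else if (fds[i].revents & POLLOUT) {      /* connected: send the request */
                    write(fds[i].fd, req, strlen(req));
                    fds[i].events = POLLIN;                 /* then wait for the response */
                } else if (fds[i].revents & (POLLIN | POLLHUP)) {
                    ssize_t n = read(fds[i].fd, buf, sizeof buf);
                    if (n <= 0) {                           /* EOF or error: this fetch is done */
                        close(fds[i].fd); fds[i].fd = -1; remaining--;
                    }
                    /* a real crawler would hand the bytes to a parser/link extractor here */
                }
            }
        }
        freeaddrinfo(res);
        return 0;
    }

The point is the shape of it rather than the details: with hundreds of such connections in flight, the CPU mostly sits waiting for the network, which is consistent with saturating a 100 Mbit card at a low CPU load.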

All crawlers I have written since have been resource hogs in comparison. (Curiously, I have never written a reasonably complete web crawler in Java. I should do that one day, but for now none of the projects I'm working on needs a fast web crawler, and I am sure the ones that exist are more than good enough for most purposes. If I were to write a Java web crawler, though, I would probably have a look at MINA to see if they have succeeded in creating workable abstractions for NIO.)

The absolute worst-performing crawler I have ever written was in Ruby. I spent about a month learning and hacking Ruby in 2006, and some of that time went into writing a web crawler. I'm a bit fuzzy on the numbers, but what I remember is that the performance was appalling: on a 1.8 GHz machine with 1 GB of RAM it initially consumed all available CPU while crawling tens of documents per second. After a day of optimizing (including modifying Ruby's network libraries) I was able to get the CPU consumption down to about 35%. I would expect that Ruby has gotten a lot better since then, and in all fairness I might have done something stupid to get such sucky performance. Who knows. It just annoyed me immensely that I had to hack the network libraries at all because they were horribly naive.

(As a comparison, the original PHP network libraries were atrociously bad, but after a rewrite some time in the late 90s or early 00s they had pretty good performance. Given how horrible PHP is, language creators should at least have the goal of beating it. I also wrote a socket implementation for a JavaScript VM that Markku Rossi did in the 90s, but the project caved before I was able to submit my socket library. I did manage to write a simple web crawler in JavaScript before I lost interest in the project. The API looked like a very simplified version of NIO (which didn't exist yet) and had good support for multiplexing.)

One thing that struck me while looking at the old C crawler was how horribly bad a lot of our "modern" tools have become. Take something like Tweetdeck on Adobe Air: when it isn't doing anything it consumes a few percent CPU, and when it is doing something I regularly see its CPU usage jump to 90% while it gobbles up hundreds of megabytes of memory. And it isn't as if it is doing anything CPU-intensive: it is making some API calls and handling a negligible amount of data.

Either Adobe Air is a piece of rubbish or the people who made Tweetdeck have no idea what they are doing.  You tell me.  Either way it does make you think when an application like that consumes such enormous amounts of CPU and memory.

PS: The reason you would want to use something like MINA to write a web crawler is that I have yet to find an HTTP client library that isn't shit, so you would have to roll your own anyway. It amazes me how clumsy these things tend to be.
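For the simple case, rolling your own is less work than it sounds. As a rough illustration (a blocking, single-connection HTTP/1.0 GET with a placeholder host, ignoring chunked encoding, redirects and timeouts), the whole "client" boils down to writing a request and splitting the response at the blank line:

    /* A rough illustration of "rolling your own" for the simple case: one
     * blocking connection, a hand-written HTTP/1.0 request, and splitting the
     * headers from the body.  example.com is a placeholder; chunked encoding,
     * redirects, timeouts and proper error handling are all left out. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>

    int main(void)
    {
        struct addrinfo hints = {0}, *res;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo("example.com", "80", &hints, &res) != 0) return 1;

        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) != 0) return 1;

        /* the entire "client library": write a request... */
        const char *req = "GET / HTTP/1.0\r\n"
                          "Host: example.com\r\n"
                          "Connection: close\r\n\r\n";
        write(fd, req, strlen(req));

        /* ...read the response, and find the blank line between headers and body */
        static char resp[1 << 20];
        size_t len = 0;
        ssize_t n;
        while ((n = read(fd, resp + len, sizeof resp - len - 1)) > 0)
            len += (size_t)n;
        resp[len] = '\0';

        char *body = strstr(resp, "\r\n\r\n");
        printf("status line: %.*s\n", (int)strcspn(resp, "\r"), resp);
        printf("body bytes:  %zu\n", body ? len - (size_t)(body + 4 - resp) : (size_t)0);

        close(fd);
        freeaddrinfo(res);
        return 0;
    }

The annoying parts of HTTP (chunked transfer, redirects, character sets, broken servers) are of course exactly where the existing client libraries tend to get clumsy.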
