2008-01-15

Ancient history, Part 3: more on the origins of FAST Search

Back in the day when FAST was a startup, there was a lot of talk in the press about the origins and nature of our technology.

Most of what was presented as "fact" in the press was, at best, misleading and often just wrong. You can't really blame anyone for this. Whenever people from our management talked to the press, the journalists would invariably get the details wrong or they'd take something out of context and run with it as far as possible. Or worse, journalists would try to distill facts by skimming inaccurate articles by other journalists, thus multiplying the bogus claims.

One myth I heard quite often was that Web Search was based on FTPSearch. That we had somehow taken the original FTPSearch code and "adapted" it for use on the web. This, of course, is nonsense.

Web search and FTPSearch were in fact two completely separate and unrelated code bases.
At the time of writing even the Wikipedia entry for Alltheweb has gotten this wrong.

FTPSearch, which was an independent implementation of the Archie search system, was originally written by Hugo Gunnarsen while he was doing his master's thesis at NTH (or NTNU as it is called today). The project grew out of a discussion Hugo had with Anders Christensen (always a great source of inspiration at PVV) about what one might do with the MS160 search chip that Professor Arne Halaas had developed. The original FTPSearch ran on a 50 MHz 486 running Linux 0.99 and was developed during the fall of 1993. Tegge acted as mentor for Hugo while he was working on the project, although Tegge wasn't very involved in the project itself at that point.

The first time I heard of the project was when I almost destroyed the ISA board with the search chip, by accidentally dropping a large pizza on top of it. Hugo was not amused, but still kind enough to explain what the chip could do. All while suspiciously eyeing my pizza as if he expected me to drop it on his thesis project. Again. (Hugo insists I spilled Coke in his keyboard rather than dropped a pizza on the board while it was lying around the coffee table, but I only spilled Coke in a keyboard once at PVV, and that was the indestructible keyboard of the IBM RS/6000.)

After Hugo got his degree he left NTH and started working for SGI in January 1994. Hugo later joined FAST in 1999 and now works for Yahoo!

The project was taken over by Tor Egge -- "Tegge". Tegge ported FTPSearch from Linux to NetBSD some time in 1994-1995 and the project grew on. An early casualty of the port to NetBSD was the support for the MS160 chip. According to Hugo, Tegge never ported the MS160 code to NetBSD and thus the search chip was dropped. The main reason was that the new machine (a Pentium, insert "ooh"-noises) would run searches significantly faster without using the MS160.
In addition to improving the service, Tegge added more features and grew the index to cover most FTP servers worth indexing. One of the neat things he added was proper regular expressions.

Around the same time Stig Bakken developed a web interface to FTPSearch in Perl. Up until then, FTPSearch had only been available to users of special search clients that used the Prospero protocol to talk to Archie servers. With the increasing adoption of the World Wide Web, Stig figured that a web interface would be useful. The first interface was pretty crude, but he kept improving it until it was a lot easier to use than an Archie client. This interface was eventually replaced by a web interface written in C by Tegge.

The IT department at NTNU graciously let Tegge host the service at the university. It consumed quite a bit of bandwidth since it was a popular service on the net. In fact, the web interface for FTPSearch was the most popular web site in Norway for quite a while -- something the corporate types in the budding Internet industry in Norway disliked intensely as it wasn't what they thought of as a "proper website". They'd routinely avoid mentioning FTPSearch when talking about the most popular websites in Norway -- even though it was seeing a lot more traffic than the rest of the sites. This was the source of much merriment.

I don't think the IT department at NTNU has ever been properly credited for lending a hand. Especially Odd Arne Haugen and Jan Ole Waagen. Not only did FTPSearch get a box to run on and much-needed bandwidth to serve up searches, but they also let Tegge use "Storm" for indexing the data. "Storm" was a four-CPU SGI machine with 1 GB of memory, which was an almost obscene amount of RAM back in those days. The IT department didn't even kick the project out when Tegge managed to completely hose the "unbreakable" journaling filesystem on the machine during an index build.

Eventually Fast Search and Transfer ASA bought the rights to FTPSearch. However, the FTPSearch code had already been released as open source, so long after FAST bought the rights, you could still download the source code of an older version, compile it and run it.

In fact, the source is still available from:

http://ftpsearch.archive.sunet.se:8000/search-info/software.html


As mentioned earlier, Web Search was a completely different code base. I can understand how some people might be tempted to conclude that it was the same code, but it wasn't.

The first version of Web Search was the fruit of Knut Magne Risvik's master's thesis work. For all practical purposes, it was a research prototype and not so much a finished product. However, Knut Magne wanted to see if it could be taken further -- if it could form the basis of a large scale search engine. Knut Magne interrogated Tegge on the subject, and Tegge was only "mildly sceptical to the whole idea" -- which, as these things go, is almost equivalent to a golden seal of approval. A ten page memo on the feasibility of building Alltheweb was hammered out by Knut Magne and sent off to Espen Brodin, who then got Robert Keith to cut us a check in order to make it so.

(Although it wasn't called "Alltheweb" back then. If you can believe it, the internal name was even more awkward: "TheWorldsBiggestSearchEngine" or TWBSE for short. Full marks for ambition -- a smack in the back of the head for the ghastly name)

The next revision of the search engine grew out of the combined efforts of Knut Magne and Tegge. As the project, the company and the workforce grew, more and more people were involved in developing the search engine, but Tor Egge and Knut Magne continued to play a key role in its development.

Our first real customer was Lycos. We provided the search services for their portal. Our search results were presented, dressed up in drag, on their portal pages. Portals were all the rage at the time. All the rubbish any normal person can't possibly be bothered to look at all crammed into one site. The density of STUFF on these pages was obscene. Every square millimetre was full of blinky stuff, links, advertising and even more links.

But hey, everyone was doing it that way, so they could be no better.

After getting the regular web search up and running we started to work on Image Search. In every sense of the word an interesting project. I'm no longer sure if this was something Lycos pushed for or something our people said we could do (or more likely, practically had in the can, ready to ship), but all of a sudden we had to figure out how to do it.

A lot of interesting pieces of code got written to do this. For one, libraries for encoding and decoding image formats were not really written by people who knew a whole lot about writing robust software, so the libraries we used had a tendency to crash. I didn't envy the guys who had to deal with this.

Also I vividly remember a small HTML parser someone wrote inside Finn Arne's original HTML parser, which was able to parse HTML backwards. The idea was to locate IMG elements in the HTML code and then extract N characters of context that would have been rendered as text in front of and behind the IMG tag. I can still remember Finn Arne pursing his lips and shaking his head in disgust. I guess he felt somewhat violated for having his neat HTML parser defiled in this manner.
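To make the idea concrete, here is a minimal sketch in Python of pulling N characters of rendered text from either side of each IMG element. It is not the code we wrote back then: instead of parsing backwards it makes a single forward pass with the standard library's `html.parser` and remembers where in the text stream each image appeared. The names `ImageContextParser` and `image_contexts` are made up for this illustration.

```python
from html.parser import HTMLParser


class ImageContextParser(HTMLParser):
    """Collects text data and the text offset at which each <img> occurs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []   # pieces of character data, in document order
        self.text_len = 0      # total length of the text seen so far
        self.images = []       # (src attribute, offset into the text)

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing <img ... /> via handle_startendtag.
        if tag == "img":
            self.images.append((dict(attrs).get("src"), self.text_len))

    def handle_data(self, data):
        # Simplification: script and style contents are not filtered out.
        self.text_parts.append(data)
        self.text_len += len(data)


def image_contexts(html, n):
    """Return (src, text_before, text_after) for every image in the page."""
    parser = ImageContextParser()
    parser.feed(html)
    parser.close()
    text = "".join(parser.text_parts)
    return [
        (src, text[max(0, offset - n):offset], text[offset:offset + n])
        for src, offset in parser.images
    ]
```

Calling `image_contexts(page, 50)` would give, for each image, up to 50 characters of text on each side, which is the kind of context one could index alongside the image.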

Another last minute hack was the thumbnail serving system (when you do a search on any search engine today that has image search, it will show the hits in the form of a grid of thumbnails). Earlier on I had written an Apache module for accessing crawled pages in our crawler storage. Our first crawler stored one web page per file, but even in the early, limited size document crawls, we soon discovered that this wouldn't work. You'd end up with an awful lot of files, and thus an awful lot of random accesses on the disk during processing. So we added our own, rather simple, storage format which allowed us to do processing (e.g. indexing) more efficiently -- by exploiting the fact that sequential access over lots of data is generally quite fast. In essence we stored content inside very primitive miniature filesystems on top of the UNIX filesystem. Of course, this meant that you couldn't access them without using the library we made for handling this file format, so to make things easier I wrote an Apache module to provide browser access to all the content we had crawled.
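For readers who haven't seen this trick, the general idea can be sketched in a few lines of Python. This is not our actual file format, which I am not reproducing here; the record layout (two length fields followed by the key and the body) and the names `append_record`, `scan_records` and `pack_pages` are assumptions made up for this illustration. The point is only that many small documents end up in one big file that can be read front to back.

```python
import io
import struct

# Hypothetical record layout, not the real format:
# 4-byte key length, 4-byte body length (both big-endian), key bytes, body bytes.
HEADER = struct.Struct(">II")


def append_record(out, key, body):
    """Append one crawled document to an open binary stream."""
    key_bytes = key.encode("utf-8")
    out.write(HEADER.pack(len(key_bytes), len(body)))
    out.write(key_bytes)
    out.write(body)


def scan_records(inp):
    """Yield (key, body) pairs by reading the stream sequentially."""
    while True:
        header = inp.read(HEADER.size)
        if not header:
            return  # clean end of file
        if len(header) < HEADER.size:
            raise ValueError("truncated record header")
        key_len, body_len = HEADER.unpack(header)
        key = inp.read(key_len)
        body = inp.read(body_len)
        if len(key) < key_len or len(body) < body_len:
            raise ValueError("truncated record")
        yield key.decode("utf-8"), body


def pack_pages(pages):
    """Pack (url, body) pairs into one bytes object, in order."""
    buf = io.BytesIO()
    for url, body in pages:
        append_record(buf, url, body)
    return buf.getvalue()
```

An indexer would then loop over `scan_records(open(path, "rb"))` and touch the disk in one long sequential sweep instead of opening one file per page. The price is the one mentioned above: nothing can read the content without going through the library.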

One day, Rolf popped his head in my office and asked how much trouble it would be to serve images through the Apache module. Not a lot really. I just had to add something to set the correct content type and it would work. It wouldn't be fast, but it would do the job. We could always replace it with a more suitable system later...

I think it took two or three years before we judged it to be enough of a pain to actually replace it with a proper thumbnail serving system for Image Search. Which, to be quite honest, was a lot longer than I had thought it would last.


For years there was much talk of how the FAST search technology was a hybrid solution -- a search engine powered by special search chips. There were even rumors that we were using polymer memory technology from Opticom -- originally FAST's parent company. Entertaining rumors for sure, and I can only imagine how they messed with the minds of the competition. As entertaining as they were, there was no truth to these rumors. The last time a search chip was used in any of the services was in FTPSearch -- and Tegge stopped using it long before FAST entered the scene.

FAST was actually developing a search chip, but it was never used to power any of our search engines. As far as I know we intended to use it when it was finished, but we never did. In 2002 the department working on the chip was spun off into a separate company (Interagon).

3 comments:

  1. This is a great series and I hope to see more! We have referred to your article over at Pandia.

    If you find errors, just let me know.

    Per Koch
    Pandia

  2. I second that. On the wordy side at times, but in general this is good stuff. It's also nice to see you weaving some PVV folklore into the story here and there. #8D)

    Thank you.

  3. really interesting reading, bjørn! i didn't know much of this, but are really fascinated by how things evolved
