2011-04-24

Hard problems and unreasonable expectations

After spending a decade in the search engine industry you start to take for granted that people of normal intelligence and above can reach their own conclusions about what it means when certain numbers become large enough.  This is an occupational hazard: you become blind to the fact that most people have no way of thinking about large numbers or about how hard a particular problem is.

A very common theme in litigation against search engines, or other parties that deal with large corpora of content, is that some party demands or expects that the search engine or service provider curate the content -- for instance, that potentially offensive material be filtered out.

There are two main ways to approach this:
  • Manual curation
  • Algorithmic curation
Of course, there are many hybrid solutions between these two extremes, but let's explore the extremes to gain some understanding of the bounds of the solution space.

Manual curation.

Manual curation of a large corpus is not impossible per se, but it is unfeasible for very large amounts of data.  This seems to be extremely hard for many people to grasp, so I am going to provide a handy example in the shape of a very rough estimate.

Imagine that the web indexed by a search engine is 1 billion documents.  Now, imagine that it takes, on average, about 10 seconds for a human being to determine whether a web page contains offensive content (some pages merely need a glance while others require you to actually read the page).
That means that to evaluate the whole indexed corpus you would need 10 billion seconds of work.  That's roughly 2.78 million hours.  Assuming an 8-hour work day, that is 347,222 work-days.  Assuming you work 5 days per week, that's about 69,444 weeks, or roughly 17,361 four-week months.  Again assuming you work 11 months per year, that gives us about 1,578 years of work.

So if you have 1,578 ideal workers who can keep it up 8 hours per day without breaks, are never sick, and take only 4 weeks of vacation every year, it will take you one year to chew through 1 billion web pages.
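
If you want to check the arithmetic yourself, here is a minimal sketch in Python using exactly the assumptions above (1 billion pages, 10 seconds per page, 8-hour days, 5-day weeks, 11 working months per year):

  # Back-of-envelope estimate of the manual review workload described above.
  CORPUS_SIZE = 1_000_000_000      # documents in the index
  SECONDS_PER_PAGE = 10            # average human review time per page

  HOURS_PER_DAY = 8
  DAYS_PER_WEEK = 5
  WEEKS_PER_MONTH = 4
  MONTHS_PER_YEAR = 11             # 4 weeks of vacation

  total_hours = CORPUS_SIZE * SECONDS_PER_PAGE / 3600
  work_days = total_hours / HOURS_PER_DAY
  work_years = work_days / (DAYS_PER_WEEK * WEEKS_PER_MONTH * MONTHS_PER_YEAR)

  print(f"hours:      {total_hours:>12,.0f}")   # ~2,777,778
  print(f"work-days:  {work_days:>12,.0f}")     # ~347,222
  print(f"work-years: {work_years:>12,.0f}")    # ~1,578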

So what would this cost?  Well, get out your calculator and run the numbers.  Fixed wage?  Per page?  Whichever way you slice it, it is going to cost a lot.  Now, is it reasonable to impose this cost on, say, search engines?  What does this do to the competitive landscape?  (As if building a web-scale search engine weren't hard enough and expensive enough as it is.)
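
For a very rough sense of scale, here is the same sketch extended with a purely hypothetical hourly rate (the $15/hour figure is mine, not from any real wage data):

  TOTAL_HOURS = 2_777_778        # from the estimate above
  HOURLY_RATE_USD = 15           # hypothetical fully loaded rate

  print(f"labour cost: ${TOTAL_HOURS * HOURLY_RATE_USD:,.0f}")   # ~$41,700,000

Swap in whatever wage model you prefer; the order of magnitude is the point, and that is for a single pass over the corpus.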

And these are extremely conservative numbers.  You see, a large-scale web index was 1 billion documents 10 years ago.  Today the size of a web index is roughly two orders of magnitude higher -- run the same estimate with 100 billion documents and you get on the order of 158,000 years of work -- and the web never stands still.  Not only are new pages added, but older pages are updated.  The problem size grows exponentially.

If you do not understand, instinctively, that problems that grow exponentially cannot be solved efficiently using manual labor, you are intellectually unfit to reason about this sort of problem.

Any 10 year old knows all the math needed to reason about these things.

Algorithmic curation.

An algorithm is a precise recipe for how to solve a problem.  In order to write a computer program that solves a given problem you need a precise recipe -- one that can be realized as a program.

Quick, can you come up with a precise recipe for figuring out whether a page contains offensive material or not?

For the vast majority of you the answer is "no", even if you dedicated the rest of your life to this problem.  In fact, for the vast majority of the world's smartest people working in this field the answer will still be "no" if you expect the algorithm to be right every single time.  You can get some high percentage of the cases right, but it is unlikely that we will ever discover a method that produces the correct answer every time.  (If you have a liberal arts background or some other non-scientific degree you may think of "unlikely" as "won't happen ever".)

Not least because there is no correct answer to this problem.   What is offensive is in the eye of the beholder as well as subject to social and cultural norms.  What you consider offensive may even change over time.  It can also change depending on context.  We are faced with many problems that have this nature when it comes to large scale information processing -- problems that have no definitive answers.
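
To make that concrete, here is a deliberately naive "precise recipe" in Python -- a keyword blocklist.  The blocklist and the example strings are invented purely for illustration; the point is that a rule this simple misfires in both directions, and no blocklist can encode a notion of "offensive" that every reader, culture and context agrees on:

  BLOCKLIST = {"sex"}                 # tiny, invented blocklist

  def looks_offensive(page_text: str) -> bool:
      """Crude substring match against the blocklist."""
      text = page_text.lower()
      return any(term in text for term in BLOCKLIST)

  # False positive: an innocent page tripped up by substring matching
  # (the classic "Scunthorpe problem").
  print(looks_offensive("A sextant is a celestial navigation instrument."))  # True

  # False negative: a page can be offensive to many readers without
  # containing a single blocklisted term.
  print(looks_offensive("A page full of vile insults, none of them on the list."))  # False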

So to sum it up:  it is possible to achieve high success rates for these sorts of problems using an algorithmic approach -- but perfection will not be achievable.  So we will just have to live with imperfect results.

That, or just decide that not being offended is so important to us that we do not want search engines at all.  Or the web.

Postscript.

What is striking when you reason about these things is that you do not need an education in information retrieval, computer science or mathematics to understand that these problems are very hard and that some types of approaches will be completely unfeasible.  Yet on a daily basis, ignorant law-makers, judges, lawyers and opinion leaders make statements that seem uninformed by even the most basic understanding of the problem domains they talk about.

And there are a lot of problems that exhibit the same properties as curating a web index: they have vague problem definitions, they involve ultra-large scale, and they come with expectations that can never be fully met.  It is going to be a real challenge to educate law-makers, judges, bureaucrats and the public on these issues.

But it is important that it is done.  It is important that these things are explained over and over, and in simple terms, so that neither a lack of knowledge nor a lack of intellectual fortitude can be used as an excuse for making dimwitted decisions.

3 comments:

  1. Just because YOU can't do something doesn't mean it CAN'T be done.

  2. There is a difference between "can't be done" in absolute terms and "can't be done" in practical terms.

    In absolute terms I have not seen proof that classifying all web pages, either manually or algorithmically, is impossible. However, in real life we have to consider feasibility and practicality.

    (Though it could be argued that in order to classify all pages perfectly by a manual process, all pages must necessarily be classified by all people, and thus we run into trouble with our limited life-span, necessitating such drastic steps as mandating that the web never have more pages on it than is physically possible for all humans to inspect. I suspect that would be a very low number and it would make for a rather dull web.)

    If you think you can do perfect algorithmic classification of all content on the web, then I suggest you set out to prove that it can be done -- or actually do it. The world eagerly awaits your contribution.

  3. Neither is there a perfect algorithm for ranking pages by their relevance to a given query.

    Note that the problem isn't even well defined, as the only unit of direct relevance is how pleased the query's author would be upon being presented with the ranked set of results. Even if we were able to quantify this kind of reaction, we would have to consider its value on the domain of all possible searchers * all possible queries * all possible reorderings of the index -- and this is a very large domain. In fact, it's super-exponential.

    We can't do any better, since we can't assume there to be any clustering or cliquing within the index. It's completely unstructured. And human opinions of the relevancy of any particular document must be assumed to be independent.

    And clearly nothing less than a perfect global ranking will suffice.
