Hard problems and unreasonable expectations

After spending a decade in the search engine industry you start to take for granted that people of normal intelligence and above are able to reach their own conclusions on what it means when certain numbers become large enough.  This is an occupational hazard: you become blind to the fact that most people have no way of thinking about large numbers or about how hard a particular problem is.

A very common theme in litigation against search engines, or other parties that deal with large corpora of content, is that some party demands or expects that the search engine or service provider curate the content, for instance by filtering out potentially offensive material.

There are two main ways to approach this:
  • Manual curation
  • Algorithmic curation
Of course, there are many hybrid solutions between these two extremes, but let's explore the extremes to gain some understanding of the bounds of the solution space.

Manual curation.

Manual curation of a large corpus is not impossible per se, but it is unfeasible for very large amounts of data. This seems to be extremely hard for many people to grasp so I am going to provide a handy example in the shape of a very rough estimate.

Imagine that the web indexed by a search engine is 1 billion documents.  Now, imagine that it takes, on average, about 10 seconds for a human being to determine if a web page contains offensive content or not (some pages merely need a glance while others require you to actually read the page).
That means that to evaluate the whole corpus indexed you would need 10 billion seconds of work being performed.  That's roughly 2.78 million hours.  Assuming an 8-hour work day, that is 347,222 work-days.  Assuming you work 5 days per week and count roughly 4 weeks per month, that's 17,361 months.  Again assuming you work 11 months per year, that gives us 1578 years of work.

So if you have 1578 ideal workers who can keep it up 8 hours per day without breaks, are never sick and only take 4 weeks of vacation every year, it will take you one year to chew through 1 billion web pages.
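For anyone who wants to rerun the estimate with their own assumptions, here is a minimal sketch in Python.  The function name and parameter names are mine and purely illustrative; the default values are the ones used above, with the "4 weeks per month" step made explicit.

```python
def worker_years(documents, seconds_per_doc=10, hours_per_day=8,
                 days_per_week=5, weeks_per_month=4, months_per_year=11):
    """Rough number of person-years needed to review a corpus by hand.

    All defaults are the back-of-the-envelope assumptions from the text,
    not measured values.
    """
    total_seconds = documents * seconds_per_doc
    hours = total_seconds / 3600
    work_days = hours / hours_per_day
    work_months = work_days / (days_per_week * weeks_per_month)
    return work_months / months_per_year


# One billion documents at ten seconds each.
print(f"{worker_years(1_000_000_000):,.0f} worker-years")
```

Change `documents` or `seconds_per_doc` and the result scales linearly, which is the whole point: there is no clever trick hiding in the arithmetic.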

So what would this cost?  Well, get out your calculator and run the numbers.  Fixed wage?  Per page?  Whichever way you slice it, it is going to cost a lot.  Now, is it reasonable to impose this cost on, say, search engines?  What does this do to the competitive landscape? (As if building a web scale search engine wasn't hard enough and expensive enough as it is).

And these are extremely conservative numbers.  You see, a large scale web index was 1 billion documents 10 years ago.  Today the size of a web index is roughly two orders of magnitude higher and the web never stands still.  Not only are new pages added, but older pages are updated.  The problem size grows exponentially.

If you do not understand, instinctively, that problems that grow exponentially cannot be solved efficiently using manual labor you are intellectually unfit to reason about these sorts of problems.

Any 10 year old knows all the math needed to reason about these things.

Algorithmic curation.

An algorithm is a precise recipe for how you solve a problem.  In order to write computer programs that solve a given problem you need a precise recipe -- that can be realized as a program.

Quick, can you come up with a precise recipe for figuring out whether a page contains offensive material or not?

For the vast majority of you the answer is "no" even if you dedicated the rest of your life to this problem.  In fact, for the vast majority of the world's smartest people working in this field the answer will still be "no" if you expect the algorithm to be right every single time.  You can get some high percentage of the cases right, but it is unlikely that we will ever discover a method that produces the correct answer every time. (If you have a liberal arts background or some other non-scientific degree you may think of "unlikely" as "won't happen ever").

Not least because there is no correct answer to this problem.   What is offensive is in the eye of the beholder as well as subject to social and cultural norms.  What you consider offensive may even change over time.  It can also change depending on context.  We are faced with many problems that have this nature when it comes to large scale information processing -- problems that have no definitive answers.

So to sum it up:  it is possible to achieve high success rates for these sorts of problems using an algorithmic approach -- but perfection will not be achievable.  So we will just have to live with imperfect results.

That, or just decide that not being offended is so important to us that we do not want search engines at all.  Or the web.
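To put a number on "imperfect results", here is a small sketch in Python.  The accuracy figures are hypothetical, picked only for illustration, and the calculation generously pretends that everyone agrees on what "offensive" means.  The function name is mine.

```python
def expected_mistakes(documents, accuracy):
    """Pages a classifier would get wrong at a given (hypothetical) accuracy."""
    return round(documents * (1 - accuracy))


corpus = 1_000_000_000  # the 1 billion document index used earlier

# Hypothetical accuracy levels, not measurements of any real system.
for accuracy in (0.99, 0.999, 0.9999):
    mistakes = expected_mistakes(corpus, accuracy)
    print(f"{accuracy:.2%} accurate: {mistakes:,} pages misjudged")
```

Even at the most optimistic of these made-up rates, the number of misjudged pages is large enough that somebody, somewhere, will find something to be offended by.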


What is striking when you reason about these things is that you do not need an education in information retrieval, computer science or mathematics to understand that these problems are very hard and that some types of approaches will be completely unfeasible.  Yet on a daily basis, ignorant law-makers, judges, lawyers and opinion leaders make statements that seem uninformed by even the most basic understanding of the problem domains they talk about.

And there are a lot of problems that exhibit the same properties as curating a web index in that they have vague problem definitions, involve ultra-large scale, and come with expectations that can never be met fully.  It is going to be a real challenge to educate law-makers, judges, bureaucrats and the public on these issues.

But it is important that it is done.  It is important that these things are explained over and over, and in simple terms, so that a lack of knowledge or a lack of intellectual fortitude cannot be used as an excuse for making dimwitted decisions.


More pictures

Lars Magne thinks my blog has too few pictures.  Here's a random picture:

Free advice for Skype

Let's make this short and sweet: the user interface of Skype sucks.

It is too dominating on the screen and it does not take into account how people actually use similar products.  If you look at the closest relatives of Skype, the aggregating chat clients, you will see what I mean.  Small, focused on being unobtrusive and usually centered around how people actually use communication software.  By comparison, Skype wastes screen real-estate and doesn't really show you that much of what you are interested in.

In fact, I was going to include an illustrative screen-capture, but I decided against it.  When thinking about why, I realized it was because I didn't want a honking big graphic in the middle of my blog posting.  That is how badly designed the Skype UI is.

No, I have no use for a cover-flow-like UI to leaf through my contacts.  Copying UI design mistakes by Apple and then, to make matters worse, using them in the wrong context is just stupid.  Nor do I have any use for seeing a list of whom I talked to two days ago.  No really,  I don't.

What I have some use for is to see which of my contacts are online.  Without having to click around and then reshuffle windows.  This should be the default view.  Not some oddball, tacked-on auxiliary UI wart like now.

One more thing...

What really puzzles me is why nobody at Skype cared enough about the product to ensure that people would have a reason to run Skype all the time when logged in.  In case you hadn't noticed: most people don't.  They fire up Skype when they "need it".  Often initiating the conversation out of band via chat, email or (gasp) phone.

The easiest way to accomplish this would be to make sure Skype is also an IM aggregation client.  A good IM client.  One that is on par with, say, Adium.  With support for all the major chat networks.  (If you work for Skype and are not familiar with Adium I suggest you install it now and start using it.  It is important you understand why Adium is so popular.)

You see, a lot of people would like to have Skype support in their IM client.  This is no accident.  Because they run their IM clients all the time.  They do not run Skype all the time.  Because it is an annoying piece of single-use software.

By ensuring that Skype is also a great IM aggregation client you would give people a real reason to run Skype all the time.  Which would boost the usefulness of the application more than any amount of advertising could ever do.