A very common theme in litigation against search engines, or other parties that deal with large corpora of content, is that some party demands or expects that the search engine or service provider curates the content. For instance, that potentially offensive material is filtered out.
There are two main ways to approach this:
- Manual curation
- Algorithmic curation
Manual curation.
Manual curation of a large corpus is not impossible per se, but it is unfeasible for very large amounts of data. This seems to be extremely hard for many people to grasp so I am going to provide a handy example in the shape of a very rough estimate.
Imagine that the web indexed by a search engine is 1 billion documents. Now, imagine that it takes, on average, about 10 seconds for a human being to determine if a web page contains offensive content or not (some pages merely need a glance while others require you to actually read the page).
That means that to evaluate the whole corpus indexed you would need 10 billion seconds of work being performed. That's roughly 2.78 million hours. Assuming a 8 hour work day, that is 347,222 work-days. Assuming you work 5 days per week that's 17,361 months. Again assuming you work 11 months per year, that gives us 1578 years of work.
So if you have 1578 ideal workers who can keep it up 8 hours per day without breaks, are never sick and only take 4 weeks of vacation every year, it will take you one year to chew through 1 billion web pages.
So what would this cost? Well, get out your calculator and run the numbers. Fixed wage? Per page? Whichever way you slice it, it is going to cost a lot. Now, is it reasonable to impose this cost on, say, search engines? What does this do to the competitive landscape? (As if building a web scale search engine wasn't hard enough and expensive enough as it is).
And these are extremely conservative numbers. You see, a large scale web index was 1 billion documents 10 years ago. Today the size of a web index is roughly two orders of a magnitude higher and the web never stands still. Not only are new pages added, but older pages are updated. The problem size grows exponentially.
If you do not understand, instinctively, that problems that grow exponentially cannot be solved efficiently using manual labor you are intellectually unfit to reason about these sort of problems.
Any 10 year old knows all the math needed to reason about these things.
Algorithmic curation.
An algorithm is a precise recipe for how you solve a problem. In order to write computer programs that solve a given problem you need a precise recipe -- that can be realized as a program.
Quick, can you come up with a precise recipe for figuring out whether a page contains offensive material or not?
For the vast majority of you the answer is "no" even if you dedicated the rest of your life to this problem. In fact, for the vast majority of the worlds smartest people working in this field the answer will still be "no" if you expect the algorithm to be right every single time. You can get some high percentage of the cases right, but it is unlikely that we will ever discover a method that will always produce the correct answer every time. (If you have a liberal arts background or some other non-scientific degree you may think of "unlikely" as "won't happen ever").
Not least because there is no correct answer to this problem. What is offensive is in the eye of the beholder as well as subject to social and cultural norms. What you consider offensive may even change over time. It can also change depending on context. We are faced with many problems that have this nature when it comes to large scale information processing -- problems that have no definitive answers.
So to sum it up: it is possible to achieve high success rates for these sorts of problems using an algorithmic approach -- but perfection will not be achievable. So we will just have to live with imperfect results.
That, or just decide that not being offended is so important to us that we do not want search engines at all. Or the web.
Postscript.
What is striking when you reason about these things is that you do not need an education in information retrieval, computer science or mathematics to understand that these problems are very hard and that some types of approaches will be completely unfeasible. Yet on a daily basis, ignorant law-makers, judges, lawyers and opinion leaders make statements that seem uninformed by even the most basic understanding of the problem domains they talk about.
And there are a lot of problems that exhibit the same properties as curating a web index in that they have vague problem definitions, involve ultra-large scale, and expectations that can never be met fully. It is going to be a real challenge to educate law-makers, judges, bureaucrats and the public on these issues.
But it is important that it is done. It is important that these things are explained over and over, and in simple terms so that lack of knowledge or a lack in intellectual fortitude can not be used as an excuse for making dimwitted decisions.
