August 23, 2008

Good stuff in, bad stuff out

A fun ad from IBM that makes the point... (Thanks to Mark McCabe)

Posted by jmm at 12:07 AM | Comments (0) | Permalink »

March 29, 2008

Keeping the good stuff out at Yahoo! Answers

This is, I think, an amusing and instructive tale. I'm a bit sorry to be telling it, because I have a lot of friends at Yahoo! (especially in the Research division), and I respect the organization. The point is not to criticize Yahoo! Answers, however: keeping pollution out is a hard problem for user-contributed content information services, and that their system is imperfect is a matter for sympathy, not scorn.

While preparing for my recent presentation at Yahoo! Research, I wondered whether Yahoo! Mail was still using the Goodmail spam-reduction system (which is based on monetary incentives). I couldn't find the answer with a quick Google search, nor by searching the Goodmail and Yahoo! corporate web sites (Goodmail claims that Yahoo! is a current client, but there was no information about whether Yahoo! is actually using the service, or what impact it is having).

So, I thought, this is a great chance to give Yahoo! Answers a try. I realize the question answerers are not generally Yahoo! employees, but I figured some knowledgeable people might notice the question. Here is my question, in full:

Is Yahoo! Mail actually using Goodmail's Certified Email? In 2005 Yahoo!, AOL and Goodmail announced that the former 2 had adopted Goodmail's "Certified Email" system to allow large senders to buy "stamps" to certify their mail (see e.g., http://tinyurl.com/2atncr). The Goodmail home page currently states that this system is available at Yahoo!. Yet I can find nothing about it searching Yahoo!Mail Help, etc. My question: Is the system actually being used at Yahoo!Mail? Bonus: Any articles, reports, etc. about its success or impacts on user email experience?

A day later I received the following "Violation Notice" from Yahoo! Answers:

You have posted content to Yahoo! Answers in violation of our Community Guidelines or Terms of Service. As a result, your content has been deleted. Community Guidelines help to keep Yahoo! Answers a safe and useful community, so we appreciate your consideration of its rules.

So, what is objectionable about my question? It is not profane or a rant. It is precisely stated (though compound), and I provided background context to aid answerers (and so they knew what I already knew).

I dutifully went and read the Community Guidelines (CG) and the Terms of Service (TOS), and I could not figure out what I had violated. I had heard elsewhere that some people did not like TinyURLs because it is not clear where you are being redirected, and thus they might be used to maliciously direct traffic. But I saw nothing in the CG or TOS that prohibited URLs in general, or TinyURLs specifically.

So I followed the link they provided to appeal the deletion. A few days later I received a reply that cut-and-pasted the information from the Yahoo! Answers help page explaining why content is deleted. This merely repeated what I had been told in the first message (since none of the other categories applied): my content was in violation of the CG or TOS. But, for the second time, no information was provided about how the content violated these rules.

Another address was provided for appealing the decision, so I wrote a detailed message to it, explaining my question and my efforts to figure out what I had violated. A few days later, I got my third email from Yahoo! Answers:

We have reviewed your appeal request. Upon review we found that your content was indeed in violation of the Yahoo! Answers Community Guidelines, Yahoo! Community Guidelines or the Yahoo! Terms of Service. As a result, your content will remain removed from Yahoo! Answers.

Well... Apparently it's clear to others that my message violates the CG or the TOS, but no one wants to tell me what the violation actually is. Three answers, all three with no specific explanation. Starting to feel like I'm a character in a Kafka novel.

At this point, I laughed and gave up (it was time for me to travel to Yahoo! to give my -- apparently dangerous and community-guideline-violating -- presentation anyway).

I have to believe that there is something about the use of a URL, a TinyURL, or the content to which I pointed that is a violation. I've looked, and found many answers that post URLs (not surprisingly) to provide people with further information. Perhaps the problem is that I was linking to a Goodmail press release on their web site, and they have a copyright notice on that page? But does Yahoo! really think providing a URL is to "otherwise make available any Content that infringes any patent, trademark, trade secret, copyright" (from the TOS)? Isn't that what Yahoo!'s search engine does all the time?

End of story.

Moral? Yahoo! Answers is a user-contributed content platform. Like most, it is fundamentally an open-access publishing platform, so there will be people who want to publish content that is outside the host's desired scope. How to keep out the pollution? Yahoo! uses a well-understood, expensive method to screen: labor. People read the posted questions and make determinations about acceptability. But, as with any screen, there are Type I (false positive) and Type II (false negative) errors: some desirable content gets blocked, and some pollution slips through. Screening polluting content is hard.

(My question probably does violate something, but surely its spirit does not. It was a standard, factual reference question, posed, ironically, to learn a fact I wanted to use in a presentation to Yahoo! Research. With a bit more clarity about what I was violating, I would have contributed desirable content to Yahoo! Answers. Instead, a "good" contributor was kept out.)

Posted by jmm at 10:19 AM | Comments (5) | Permalink »

December 16, 2007

CAPTCHA farms in the courts

As soon as a screen is developed to protect a valuable activity, the incentive to work around it is on the table. Screening works by demanding a test or task that is more costly for the undesirables to perform (the technical requirements are a bit more subtle than this). If it is too costly for them to perform as well as a desirable participant, the undesirables reveal themselves and can be blocked (or charged a different price, etc.).
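In symbols (my own shorthand, not a formal model): let b_D and b_U be the benefit that a desirable and an undesirable participant, respectively, get from access, and c_D and c_U their costs of performing the screening task. The screen separates the two types when

    \[
      c_D \le b_D \quad\text{and}\quad c_U > b_U ,
    \]

that is, when desirables find it worthwhile to perform the task and undesirables do not. Circumvention, which the next paragraph takes up, amounts to driving c_U back down below b_U.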

Incentive designs spawn incentive designs (I also wrote about this in May 2006). If the service, product, or information the undesirables want is sufficiently valuable, it is worth their while to invest in circumventing the screen, driving their cost of performing as well as a desirable low enough to pass.

CAPTCHAs, developed by Luis von Ahn and his colleagues at Carnegie Mellon, are one such screen for keeping undesirables -- in this case software bots -- out of certain valuable information services (like free webmail accounts). Or, in Ticketmaster's case, for keeping them from robotically buying large numbers of hot tickets.

Ticketmaster has sued RMG (and sought a preliminary injunction) over its business of selling PurchaseMaster software, which allegedly enables ticket brokers to score large numbers of desirable tickets in the first few minutes after events go on sale. One of Ticketmaster's protections against bots is a standard CAPTCHA. RMG, in its defense, has publicly stated that it is using one of the now-standard low-cost ways of circumventing the CAPTCHA: the bots hire low-wage humans (in India, in this case) to break the CAPTCHAs so the bots can get on with their business. (The Matrix is coming.)

RMG answered Ticketmaster’s Captchas — the visual puzzles of distorted letters that a customer must type before buying tickets — not with character recognition software, he said, but with humans: “We pay guys in India $2 an hour to type the answers.” (NY Times, 16 Dec 2007)

Another way bots hire humans to do their CAPTCHA work for them is with porn bribes: set up a site offering free access to porn as long as the visitor solves a CAPTCHA or three, and feed those visitors CAPTCHAs thrown up by other sites to block the bots' entrance.


Posted by jmm at 11:03 AM | Permalink »

January 31, 2007

Wikipedia is in trouble

I'm going out on a limb here: unless Wikipedia comes up with a coherent contribution policy that is consistent with the economic value of its content, it will start to deteriorate.

In a widely published Associated Press story, Brian Bergstein reports that Jimmy Wales, Wikipedia founder, Board Chair Emeritus, and currently President of for-profit Wikia, Inc., blocked the account of a small entrepreneur, Gregory Kohs, who was openly selling his services (with attribution) to write Wikipedia articles about businesses. Wales reportedly told Kohs that his MyWikiBiz was "antithetical to Wikipedia's mission", and that even posting his stories on his personal page inside Wikipedia, so independent editors could grab them and insert them into the encyclopedia, was "absolutely unacceptable".

Before I get into my dire forecast, what is antithetical about someone who is paid as a professional writer to prepare content, especially if he is open about that fact? There are three "fundamental" Wikipedia editorial policies with which all contributions must comply:

  1. Neutral point of view (NPOV)
  2. Verifiability
  3. No original research

The first two are relevant here. NPOV means all content "must be written from a neutral point of view (NPOV), representing fairly and without bias all significant views." Verifiability means "any reader should be able to check that material added to Wikipedia has already been published by a reliable source." Kohs stated in his corporate materials that he is committed to compliance with these two policies: he would prepare the content for interested parties, but it would be neutral and verifiable. Of course, on any particular contribution other editors might disagree and choose to revise the content, but that is the core process of Wikipedia.

The problem is deep: arguably all contributors have a subjective (non-neutral) point of view, no matter how much they may wish or believe otherwise. What is rather remarkable about Wikipedia is how well the group editing process has worked to enforce neutrality (and verifiability) through collective action. In any case, there is no clear reason to believe a paid professional writer is going to be systematically more or less neutral than a volunteer writer.

In part, this is just a simple statement about incentives. A reasonable starting point is to accept that everyone who makes the effort to research and write material for Wikipedia is doing it for some motivating reason. Research and writing take time away from other desirable activities, so unless the writer is consistently irrational, by revealed preference she is getting some benefit out of writing that is greater than the opportunity cost of the foregone time. It follows directly that point of view might be biased by whatever is motivating a given writer. To believe otherwise is naive. Dangerously naive, for the future of Wikipedia.

Even if the "everyone is motivated by something" argument is too subtle for some true believers in massive social altruism, there is an obvious problem with Wikipedia's position on Gregory Kohs: surely there are many, many writers who are being paid for the time and effort they devote to Wikipedia, but who are not being open about it. For example, employees of corporations, non-profits, educational institutions, etc., who are asked to maintain a Wikipedia entry on their organization and who do so from an IP address not traceable to it (e.g., from home). We already know from past experience that political operatives have made sub rosa contributions.

So, the problem of distinguishing between a priori neutral and a priori non-neutral contributors is deep and possibly not amenable to any reasonably effective solution. This is a fundamental problem of hidden information: the contributor knows things about her motivations and point of view that are not observable by others. Others can only infer her motivations by seeing what she writes, and at that point the motivations are moot: if her content is not neutral or verifiable, other editors can fix it, and if she systematically violates these principles, she can be banned based on what she did, not who she purports to be.

Indeed, given the intractability of knowing the motivations and subjective viewpoints of contributors, it might seem that the sensible policy would be to encourage contributors to disclose any potential conflicts of interest, to alert editors to be vigilant for particular types of bias. This disclosure, of course, is exactly what Kohs did.

And now, for my prediction that Wikipedia is in trouble. Wikipedia has become mainstream: people in all walks of life rely on it as a valuable source of information for an enormous variety of activities. That is, the content has economic value: economic in the sense that it is a scarce resource, valuable precisely because for many purposes it is better than the next alternative (it is cheaper, or more readily available, or more reliable, or more complete, etc.). Having valuable content, of course, is the prime directive for Wikipedia, and it is, truly, a remarkable success.

However, precisely because the content has economic value to the millions of users, there are millions of agents who have an economic interest in what the content contains. Some are interested merely in content existing at all (for example, there are not many detailed articles about major businesses, which was the hole Kohs was trying to plug). Others might want the content to reflect a particular point of view.

Because there is economic value to many who wish to influence the content available, they will be willing to spend resources to do the influencing. And where there are resources -- value to be obtained -- there is initiative and creativity. A policy that tries to ex ante filter out certain types of contributors based on who they are, or on very limited information about what their subjective motivations might be, is as sure to be increasingly imperfect and unsuccessful as is any spam filtering technology that tries to set up ex ante filtering rules. Sure, some of this pollution will be filtered, but there will also be false positives, and worse, those with an interest in influencing content will simply find new clever ways to get around the imperfect ex ante policies about who can contribute. And they will succeed, just as spammers in other contexts succeed, because of the intrinsic information asymmetry: the contributors know who they are and what their motivations are better than any policy rule formulated by another can ever know.

So, trying to pre-filter subjective content based on extremely limited, arbitrary information about the possible motivations of a contributor will just result in a spam-like arms race: content influencers will come up with new ways to get in and edit Wikipedia, and Wikipedia's project managers will spend ever increasing amounts of time trying to fix up the rules and filters to keep them out (but they won't succeed).

This vicious cycle has always been a possibility, and indeed, we've seen examples of pollution in Wikipedia before. The reason I think the problem is becoming quite dangerous to the future of Wikipedia is its very success: because it has become such a valuable source of content, content influencers will be willing to spend ever-increasing amounts to win the arms race.

Wikipedia is, unavoidably (and hooray! this is a sign of the success of its mission), an economic resource. Ignoring the unavoidable implications of that fact will doom the resource to deteriorating quality and marginalization (remember Usenet?).

Ironically, at first blush there seems to be a simple, obvious alternative right at hand: let Wikipedia be Wikipedia. The marvel of the project is that the collective editorial process maintains very high quality standards. Further, by allowing people to contribute, and then evaluating their contributions, persistent abusers can be identified and publicly humiliated (as Jimmy Wales himself was when he was caught making non-neutral edits to the Wikipedia entry about himself). Hasn't Wikipedia learned its own key lessons? Let the light shine, and better the devil you know.

(Wikipedia itself offers an enlightening summary of the battle of Kohs's efforts to contribute content. This summary serves to emphasize the impossibility of Wikipedia's fantasy of pre-screening contributors.)

Posted by jmm at 12:29 AM | Comments (4) | Permalink »

January 06, 2007

Spyware

Here's a paragraph Rick Wash and I wrote for the USENIX paper, somewhat revised for later use, concerning spyware:

An installer program acts on behalf of the computer owner to install desired software. However, the installer program is also acting on behalf of its author, who may have incentives different from the computer owner's. The author may surreptitiously include installation of undesired software such as spyware, zombies, or keystroke loggers. Rogue installation is a hidden action problem: the actions of one party (the installer) are not easy to observe. One typical design response is to require a bond that can be seized if unwanted behavior is discovered (an escrowed warranty, in essence), or a mechanism that screens unwanted behavior by providing incentives that induce legitimate installers to take actions distinguishable from those of illegitimate installers.
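To make the bonding idea concrete, here is a toy sketch in Python. It is my illustration only, not code from the paper and not any real installer framework's API; the class, the amounts, and the detection probability are all invented for the example.

    # Toy sketch of an escrowed-bond ("warranty") mechanism for installers.
    # All names and numbers are illustrative, not a real API.

    class BondEscrow:
        """Holds a bond posted by an installer. The platform releases it
        after a clean monitoring period, or seizes it if hidden bad behavior
        (spyware, keystroke loggers, etc.) is later detected."""

        def __init__(self):
            self.bonds = {}  # installer_id -> amount held in escrow

        def post_bond(self, installer_id, amount):
            self.bonds[installer_id] = self.bonds.get(installer_id, 0) + amount

        def release(self, installer_id):
            # Return the bond to a well-behaved installer.
            return self.bonds.pop(installer_id, 0)

        def seize(self, installer_id):
            # Forfeit the bond when misbehavior is discovered.
            return self.bonds.pop(installer_id, 0)


    def misbehaving_pays(bond, gain_from_bad_behavior, detection_probability):
        # A rational installer misbehaves only if the expected gain exceeds
        # the expected loss of the escrowed bond.
        return gain_from_bad_behavior > detection_probability * bond


    if __name__ == "__main__":
        escrow = BondEscrow()
        escrow.post_bond("example-installer", 1000)
        # With a 1000-unit bond and a 50% chance of detection, any gain
        # under 500 is deterred.
        print(misbehaving_pays(1000, 300, 0.5))  # False: the bond deters

The whole point is the inequality in misbehaving_pays: the bond must be large enough, given the odds of detection, to make the hidden action unprofitable.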

Posted by jmm at 11:43 PM | Comments (0) | Permalink »

May 10, 2006

CAPTCHAs (2): Technical screens vulnerable to motivated humans

A particularly interesting approach to breaking purely technical screens, like CAPTCHAs, is to provide humans with incentives to end-run the screen. The CAPTCHA is a test that is easy for humans to pass, but costly or impossible for machines to pass. The goal is to keep out polluters who rely on cheap CPU cycles to proliferate their pollution. But polluters can be smart, and in this case the smart move may be "if you can't beat 'em, join 'em".

Say a polluter wants to get many free email accounts from Yahoo! (from which to launch pollution distribution, such as spamming). The approach is to have a computer go through the process of setting up an account at Yahoo!, and to repeat this many times to get many accounts. For many similar settings, it is easy to write code that automatically navigates the signup (or other) service.

CAPTCHAs make it very costly for computers to aid polluters, because most programs fail at decoding a CAPTCHA, or take a very long time to do so.

As I discussed in my CAPTCHAs (1) entry, one approach for polluters to get around the screen is to improve the ability of computers to crack the CAPTCHA. But another is to give in: if humans can easily pass the screen, then enlist large numbers of human hours to pass the test repeatedly. There are at least two ways to motivate humans to prove repeatedly to a CAPTCHA that they are human: pay low-wage workers (usually in developing countries) to sit at screens all day and solve CAPTCHAs, or pay (higher-wage) users in some other currency they value; the most common in-kind payment has been access to a collection of pornography in exchange for each CAPTCHA solved.

This puts us back in the usual problem space for screening: how to come up with a screen that is low cost for desirable human participants, but high cost for undesirable humans?

The lesson is that CAPTCHAs may be able to distinguish humans from computers, but only if the computers act like computers. If they enlist humans to help them, the CAPTCHAs fail.

Ironically, enlisting large numbers of humans to solve problems that are hard for computers is an example of what Luis von Ahn (one of the inventors of CAPTCHAs) calls "social computing".

Posted by jmm at 12:37 AM | Comments (0) | Permalink »

CAPTCHAs (1): Technical screens are vulnerable to technical progress

One of the most wildly successful technical screening mechanisms for blocking pollution in recent years is the CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart). The idea is ingenious, and respects basic incentive-centered design principles necessary for a screen to be successful. However, it suffers from a common flaw: purely technical screens often are not very durable because technology advances. I think it may be important to include human-behavior incentive features in screening mechanisms.

The basic idea behind a CAPTCHA is beautifully simple: present a graphically distorted image of a word to a subject. A computer will not be able to recognize the word, but a human will, so a correct answer identifies a human.
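To make the idea concrete, here is a minimal generator sketch. This is my own toy illustration, assuming the Pillow imaging library is installed; real CAPTCHA systems use far stronger distortions, and the answer is of course kept server-side.

    # Toy CAPTCHA generator: render a random word with jitter and clutter.
    # Sizes, offsets, and noise levels are arbitrary illustrative choices.
    import random
    import string

    from PIL import Image, ImageDraw, ImageFilter, ImageFont


    def make_captcha(length=5, size=(170, 60)):
        # The challenge word the human must read back.
        word = "".join(random.choices(string.ascii_uppercase, k=length))

        img = Image.new("RGB", size, "white")
        draw = ImageDraw.Draw(img)
        font = ImageFont.load_default()

        # Draw each character at a jittered position so the word does not
        # sit on a clean baseline.
        for i, ch in enumerate(word):
            x = 15 + i * 28 + random.randint(-3, 3)
            y = 22 + random.randint(-8, 8)
            draw.text((x, y), ch, fill="black", font=font)

        # Add clutter: random specks and a few crossing lines.
        for _ in range(250):
            draw.point((random.randrange(size[0]), random.randrange(size[1])),
                       fill="gray")
        for _ in range(3):
            draw.line([(random.randrange(size[0]), random.randrange(size[1])),
                       (random.randrange(size[0]), random.randrange(size[1]))],
                      fill="black", width=1)

        return word, img.filter(ImageFilter.SMOOTH)


    if __name__ == "__main__":
        answer, image = make_captcha()
        image.save("captcha.png")
        print("expected answer:", answer)

The screening step itself is then trivial: show the image, and grant access only if the typed response matches the stored answer.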

Of course, as we know from screening theory, for a CAPTCHA to work, the cost for the computer to successfully recognize the word has to be substantially higher than for humans. And, since the test is generally dissipative (wasteful of time, at least for the human user), the system will be more efficient (user satisfaction will be higher) the lower is the screening cost for the humans. So, the CAPTCHA should be very easy for humans, but hard to impossible for computers.

With rapidly advancing technology (not just hardware, but especially machine vision algorithms), the cost of decoding any particular family of CAPTCHAs will decline rapidly. Once the decoding cost is low enough, the CAPTCHA no longer screens effectively: we get a pooling equilibrium rather than a separating equilibrium (the test can't tell computers and humans apart). The creators of CAPTCHAs (von Ahn, Blum, Hopper, and Langford) note, reasonably enough, that this isn't all bad: developing an algorithm that has a high success rate against a particular family of CAPTCHAs is solving an outstanding artificial intelligence problem. But, while good for science, that probably isn't much comfort to people who are relying on CAPTCHAs to secure various open-access systems from automated polluting agents.
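Restating the last two paragraphs in symbols (my notation, not anything from the CAPTCHA papers): let c_H be a human's cost of solving the test, c_M(t) a machine's cost at time t, and b_M the value a polluter's bot gets from passing (a free webmail account, a batch of hot tickets). The screen yields a separating equilibrium while

    \[
      c_H \approx 0 \quad\text{and}\quad c_M(t) > b_M .
    \]

Technical progress steadily lowers c_M(t); once it drops below b_M, the bots pass too, the equilibrium pools, and the test no longer screens.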

The vulnerability of CAPTCHAs to rapid technological advance is now clear. A recent paper shows that computers can now beat humans at single-character CAPTCHA recognition. The CAPTCHA project documents successful efforts to break two CAPTCHA families (ez-gimpy and gimpy-r).

Posted by jmm at 12:12 AM | Comments (0) | Permalink »