June 06, 2013

Everything can -- and will -- be manipulated

Well, not "everything". But every measure on which decisions of value depend (e.g., book purchases, dating opportunities, or tenure) can and will be manipulated.

And if the measure depends on user-contributed content distributed on an open platform, the manipulation often will be easy and low cost, and thus we should expect to see it happen a lot. This is a big problem for "big data" applications.

This point has been the theme of many posts I've made here. Today, a new example: citations of scholarly work. One of the standard, often highly valued (as in, makes a real difference to tenure decisions, salary increases, and outside job offers) measures of the impact of a scholar's work is how often it is cited in the published work of other scholars. Thomson ISI has been providing citation indices for many years. ISI is not so easy to manipulate because -- though it depends on user-contributed content (articles by one scholar that cite the work of another) -- that content is distributed on closed platforms: ISI indexes citations only from a set of published journals whose editorial boards protect their reputation and brand by screening what they publish.

But over the past several years, scholars have increasingly relied on Google Scholar (and sometimes Microsoft Academic) to count citations. Google Scholar indexes citations from pretty much anything that looks like a scholarly article and is reachable by Google's spiders crawling the open web. So, for example, it includes citations in self-published articles, or e-prints of articles published elsewhere. Thus, Google Scholar citation counts depend on user-contributed content distributed on an open platform (the open web).

And, lo and behold, it's relatively easy to manipulate such citation counts, as demonstrated by a recent scholarly paper that did so: Delgado Lopez-Cozar, Emilio; Robinson-Garcia, Nicolas; Torres Salinas, Daniel (2012). "Manipulating Google Scholar Citations and Google Scholar Metrics: simple, easy and tempting." EC3 Working Papers 6, 29 May 2012. Available at http://arxiv.org/abs/1212.0638v2.

Their method was simple: they created some fake papers that cited other papers, and published the fake papers on the Web. Google's spider dutifully found them and increased the citation counts for the real papers that these fake papers "cited".
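To make the mechanics concrete, here is a toy sketch (my own construction, not the paper's code) of the difference between an open-web index and a vetted-journal index. The document names, the cited paper, and the whitelist are all made up; the point is only that an open counter credits citations from any crawled document, fake or not.

```python
from collections import Counter

# Every document the spider can reach, real or fake. All names hypothetical.
crawled_docs = [
    {"source": "journal-of-real-science", "cites": ["smith2010"]},
    {"source": "self-published-pdf",      "cites": ["smith2010"]},
    {"source": "fake-paper-1",            "cites": ["smith2010"]},
    {"source": "fake-paper-2",            "cites": ["smith2010"]},
]

# A closed, ISI-style index only credits citations from vetted journals.
VETTED_JOURNALS = {"journal-of-real-science"}

# Open-web counter (Google Scholar style): credit anything crawled.
open_counts = Counter(c for doc in crawled_docs for c in doc["cites"])

# Closed counter: credit only whitelisted sources.
closed_counts = Counter(
    c for doc in crawled_docs
    if doc["source"] in VETTED_JOURNALS
    for c in doc["cites"]
)

print(open_counts["smith2010"])    # 4 -- inflated by the fake papers
print(closed_counts["smith2010"])  # 1 -- only the vetted journal counts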

The lesson is simple: for every measure that depends on user-contributed content on an open platform, if valuable decisions depend on it, we should assume that it is vulnerable to manipulation. This is a sad and ugly fact about a lot of new opportunities for measurement ("big data"), and one that we must start to address. The economics are unavoidable: the cost of manipulation is low, so if there is much value in manipulating, it will be manipulated. We have to think about ways to increase the cost of manipulation if we don't want to lose the value of the data.
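The underlying decision rule is just a cost-benefit inequality. A minimal sketch, with illustrative numbers of my own choosing:

```python
# Manipulation pays whenever the expected value exceeds the cost.
def worth_manipulating(expected_value: float, cost: float) -> bool:
    return expected_value > cost

# Posting fake citing papers on the open web is nearly free, so even a
# modest payoff (a tenure case, a salary bump) tips the inequality...
print(worth_manipulating(expected_value=1000.0, cost=5.0))     # True

# ...and the designer's main lever is raising the cost (screening,
# identity requirements, editorial review).
print(worth_manipulating(expected_value=1000.0, cost=5000.0))  # False
```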

Posted by jmm at 11:09 AM | Comments (1) | Permalink »

May 27, 2013

Mining social data -- what is revealed?

Here is a recent article about high school students manipulating their Facebook presence to fool college admissions officers. Not terribly surprising: the content is (largely) created and controlled by the target of the background searches (by admissions officers, prospective employers, prospective dating partners, etc.), so it's easy to manipulate. We've been seeing this sort of manipulation since the early days of user-contributed content.

People mining user-contributed content should be giving careful thought to this. Social scientists like it when they can observe behavior, because it often reveals something more authentic than simply asking someone a question (about what they like, or what they would have done in a hypothetical situation, etc.). Economists, for example, are thrilled when they get to observe "revealed preference": the choices people make when faced with a true resource allocation problem. It could be that I purchased A instead of B to fool an observer, but there is a cost to my doing so (I bought and paid for a product that I didn't want), and as long as the costs are sufficiently salient, it is more likely that we are observing preferences untainted by manipulation.

There are costs to manipulating user-contributed content, like Facebook profiles, of course: some amount of time, at the least, and probably some reduced value from the service. (For example, students say that during college application season they hide their "regular" Facebook profile and create a dummy in which they talk about all of the community service they are doing, and how they love bunnies and want to solve world hunger: all fine, but they are giving up the other uses of Facebook that they normally prefer.) But the costs of manipulating user-contributed content often may be low, and thus we shouldn't be surprised if there is substantial manipulation in the data, especially if the users have reason to think they are being observed in a way that will affect an outcome they care about (like college admissions).

Put another way, the way people portray themselves online is behavior and so reveals something, but it may not reveal what the data miner thinks it does.

Posted by jmm at 02:39 PM | Comments (0) | Permalink »

February 01, 2010

Crowd-sourcing combats information asymmetry

Jonathan Zinman and Eric Zitzewitz studied ski resorts' claims about snowfall. They found that, relative to government snow reports, ski resorts claim 23% more snowfall on weekend days than on weekdays. Seems a pretty clear case of deceptive advertising to draw in business, with the risk (of being sued for deception, or of damaging reputation) taken more when the payoff is higher (weekends).
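In spirit (and with made-up numbers; this is my own sketch, not their data or code), the measurement is something like this:

```python
# Hypothetical observations: (day type, resort-claimed inches, government inches).
reports = [
    ("weekday", 5.0, 5.0),
    ("weekday", 8.0, 7.5),
    ("weekend", 9.0, 6.0),
    ("weekend", 12.0, 8.0),
]

def mean_claim_ratio(day_type: str) -> float:
    """Average of (resort claim / government report) for one day type."""
    rows = [(claim, gov) for day, claim, gov in reports if day == day_type]
    return sum(claim / gov for claim, gov in rows) / len(rows)

wd, we = mean_claim_ratio("weekday"), mean_claim_ratio("weekend")
print(f"weekday claim ratio: {wd:.2f}")   # ~1.03
print(f"weekend claim ratio: {we:.2f}")   # 1.50
print(f"extra weekend exaggeration: {we / wd - 1:.0%}")  # ~45% with these toy numbers
```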

Deceptive advertising is a standard case of asymmetric information, of the hidden-characteristics variety. The resort has better information, and chooses what to report.

What incentives might induce honesty? As I mentioned above, there are at least two obvious ones: avoiding a lawsuit (by a government agency, or perhaps a class action on behalf of disgruntled customers), and avoiding a loss of customer goodwill if customers realize the resort is routinely lying.

How to increase those incentives (since apparently they have not been enough to prevent at least some deception)? One way is to raise the fine or other penalties if prosecuted. Another way, particularly for the reputation effect, is to reduce the cost to consumers of getting better information.

And... Zinman and Zitzewitz found that the deception has decreased since the release of an iPhone app that aggregates skier reports of local conditions in real time (and that the reduction in exaggeration is much more notable at resorts with good iPhone reception).

Crowdsourcing: reducing asymmetric information problems.

(Zinman must be pretty happy to have found a co-author with whom he gets first billing in co-authored papers...no mean feat.)

(Via Erin Krupka and the Marginal Revolution blog.)

Posted by jmm at 11:50 AM | Comments (0) | Permalink »

September 01, 2008

The fine line between spam and foie gras

The New York Times (following others) reported today on a large number of detailed, informed, and essentially all flattering edits to Sarah Palin's Wikipedia page, made -- hmmm -- in the 24 hours before her selection as the Republican vice presidential nominee was made public. The edits were made anonymously, and the editor has not yet been identified, though he acknowledges that he is a McCain campaign volunteer.

Good or bad content? The potential conflict of interest is clear. But that doesn't mean the content is bad. Most of the facts were supported with citations. But were they written in overly flattering language? And was the selection of facts unbiased? Much of the material has been revised, toned down, or removed in the few days since, which is not surprising regardless of the quality of this anonymous editor's contributions, given the attention that Ms. Palin has been receiving.

Posted by jmm at 04:18 PM | Comments (0) | Permalink »

August 23, 2008

Good stuff in, bad stuff out

A fun ad from IBM that makes the point... (Thanks to Mark McCabe)

Posted by jmm at 12:07 AM | Comments (0) | Permalink »

July 08, 2008

ICD introductory readings from on high

Students often ask me what they can read to learn about ICD. I've not had a terribly good answer for them. The foundations -- especially mechanism design in economics, game theory, engineering design theory, and social psychology -- are ancient (well, a few decades old) and have very rich literatures, but I haven't seen (and haven't really searched for) good introductions to them. And while these are the building blocks of ICD, the particular area on which we focus -- incentive-centered design for information systems -- and the particular multi-disciplinary approach we take are rather new. I don't know that folks have written any good overviews yet.

However, three quite nice articles just appeared in the American Economic Review that are a step in the right direction. They are focused on mechanism design and microeconomics (not social psychology, computation theory, or specifically applications to information system design). But they are accessible, short, and written by giants in the field; in fact, they are revised versions of the Nobel lectures given by the three laureates recently cited for creating the foundations of mechanism design theory: Leonid Hurwicz, Eric Maskin, and Roger Myerson.

Maskin's overview, "Mechanism Design: How to Implement Social Goals", doesn't require any math. He introduces implementation theory, "which, given a social goal, characterizes when we can design a mechanism whose predicted outcomes (i.e., the set of equilibrium outcomes) coincide with the desirable outcomes" (p. 567).

Myerson's article, "Perspectives on Mechanism Design in Economic Theory", begins to introduce some of the basic modeling elements from the theory, so it has a bit more math, but it's not heavy going for anyone who has had an intermediate microeconomics class. He introduces some of the classic applications from economics: bilateral trade with adverse selection (hidden information), and project management with moral hazard (hidden action).
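For a taste of the bilateral trade problem, here is a toy "split-the-difference" mechanism; the mechanism and the numbers are my own illustration, not from Myerson's article. The hidden-information problem shows up immediately: a trader who misreports her private value can move the price in her favor.

```python
def split_the_difference(buyer_report: float, seller_report: float):
    """Trade iff the buyer's reported value covers the seller's reported
    cost; the price is the midpoint of the two reports."""
    if buyer_report >= seller_report:
        return True, (buyer_report + seller_report) / 2
    return False, None

# True (privately known) valuations -- illustrative numbers only.
buyer_value, seller_cost = 10.0, 4.0

# Honest reports: trade at price 7.0; the seller's profit is 3.0.
traded, price = split_the_difference(buyer_value, seller_cost)
print(traded, price, price - seller_cost)   # True 7.0 3.0

# The seller inflates her reported cost to 8.0: trade at 9.0, profit 5.0.
# Truth-telling is not an equilibrium of this naive mechanism -- exactly
# the kind of problem mechanism design theory is built to analyze.
traded, price = split_the_difference(buyer_value, 8.0)
print(traded, price, price - seller_cost)   # True 9.0 5.0
```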

Posted by jmm at 08:49 AM | Comments (1) | Permalink »

March 29, 2008

Keeping the good stuff out at Yahoo! Answers

This is, I think, an amusing and instructive tale. I'm a bit sorry to be telling it, because I have a lot of friends at Yahoo! (especially in the Research division), and I respect the organization. The point is not to criticize Yahoo! Answers, however: keeping pollution out is a hard problem for user-contributed content information services, and that their system is imperfect is a matter for sympathy, not scorn.

While preparing for my recent presentation at Yahoo! Research, I wondered whether Yahoo! Mail was still using the Goodmail spam-reduction system (which is based on monetary incentives). I couldn't find the answer with a quick Google search, nor by searching the Goodmail and Yahoo! corporate web sites (Goodmail claims that Yahoo! is a current client, but there was no information about whether Yahoo! is actually using the service, or what impact it is having).

So, I thought, this is a great chance to give Yahoo! Answers a try. I realize the question answerers are not generally Yahoo! employees, but I figured some knowledgeable people might notice the question. Here is my question, in full:

Is Yahoo! Mail actually using Goodmail's Certified Email? In 2005 Yahoo!, AOL and Goodmail announced that the former 2 had adopted Goodmail's "Certified Email" system to allow large senders to buy "stamps" to certify their mail (see e.g., http://tinyurl.com/2atncr). The Goodmail home page currently states that this system is available at Yahoo!. Yet I can find nothing about it searching Yahoo!Mail Help, etc. My question: Is the system actually being used at Yahoo!Mail? Bonus: Any articles, reports, etc. about its success or impacts on user email experience?

A day later I received the following "Violation Notice" from Yahoo! Answers:

You have posted content to Yahoo! Answers in violation of our Community Guidelines or Terms of Service. As a result, your content has been deleted. Community Guidelines help to keep Yahoo! Answers a safe and useful community, so we appreciate your consideration of its rules.

So, what is objectionable about my question? It is not profane or a rant. It is precisely stated (though compound), and I provided background context to aid answerers (and so they knew what I already knew).

I dutifully went and read the Community Guidelines (CG) and the Terms of Service (TOS), and I could not figure out what I had violated. I had heard elsewhere that some people do not like TinyURLs because it is not clear where you are being redirected, and thus they might be used to maliciously direct traffic. But I saw nothing in the CG or TOS that prohibited URLs in general, or TinyURLs specifically.

So I followed the link they provided to appeal the deletion. A few days later I received a reply that cut-and-pasted the information from the Yahoo! Answers help page explaining why content is deleted. This merely repeated what I had been told in the first message (since none of the other categories applied): my content was in violation of the CG or TOS. But, for the second time, no information was provided on how the content violated these rules.

Another address was provided to appeal the decision, so I wrote a detailed message to that address, explaining my question, and my efforts to figure out what I was violating. A few days later, I got my third email from Yahoo! Answers:

We have reviewed your appeal request. Upon review we found that your content was indeed in violation of the Yahoo! Answers Community Guidelines, Yahoo! Community Guidelines or the Yahoo! Terms of Service. As a result, your content will remain removed from Yahoo! Answers.

Well... Apparently it's clear to others that my message violates the CG or the TOS, but no one wants to tell me what the violation actually is. Three answers, all three with no specific explanation. Starting to feel like I'm a character in a Kafka novel.

At this point, I laughed and gave up (it was time for me to travel to Yahoo! to give my -- apparently dangerous and community-guideline-violating -- presentation anyway).

I have to believe that there is something about the use of a URL, a TinyURL, or the content to which I pointed that is a violation. I've looked, and found many answers that post URLs (not surprisingly) to provide people with further information. Perhaps the problem is that I was linking to a Goodmail press release on their web site, and they have a copyright notice on that page? But does Yahoo! really think that providing a URL "otherwise make[s] available any Content that infringes any patent, trademark, trade secret, copyright" (from the TOS)? Isn't that what Yahoo!'s search engine does all the time?

End of story.

Moral? Yahoo! Answers is a user-contributed content platform. Like most, that means it is fundamentally an open-access publishing platform. There will be people who want to publish content that is outside the host's desired content scope. How to keep out the pollution? Yahoo! uses a well-understood, expensive method to screen: labor. People read the posted questions and make determinations about acceptability. But, as with any screen, there are Type I (false positive) and Type II (false negative) errors. Screening polluting content is hard.
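A toy example of why any screen faces this tradeoff (my own sketch; I have no idea what Yahoo!'s actual filter looks like): suppose the screen scores posts by counting suspicious tokens and rejects anything at or above a threshold.

```python
# Crude stand-in for a human or automated judgment: count suspicious tokens.
SUSPICIOUS = ("http", "tinyurl", "buy", "free")

def score(post: str) -> int:
    return sum(token in post.lower() for token in SUSPICIOUS)

def rejected(post: str, threshold: int) -> bool:
    return score(post) >= threshold

good_question = "Is Yahoo! Mail using Goodmail? See http://tinyurl.com/2atncr"
spam = "Buy cheap watches here: http://replica-watch.example.com"

for t in (1, 2, 3):
    print(t, rejected(good_question, t), rejected(spam, t))
# Both posts score 2, so no threshold separates them: at t <= 2 the
# legitimate reference question is rejected along with the spam (a false
# positive -- arguably what happened here); at t = 3 both get through,
# and the spam survives (a false negative).
```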

(My question probably does violate something, but surely the spirit of my question does not. I had a standard, factual reference question -- ironically, to learn a fact that I wanted to use in a presentation to Yahoo! Research. A bit more clarity about what I was violating and I would have contributed desirable content to Yahoo! Answers. Instead, a "good" contributor was kept out.)

Posted by jmm at 10:19 AM | Comments (5) | Permalink »

Presentation at Yahoo! Research on user-contributed content

Yahoo! Research invited me to speak in their "Big Thinkers" series at the Santa Clara campus on 12 March 2008. My talk was "Incentive-centered design for user-contributed content: Getting the good stuff in, Keeping the bad stuff out."

My hosts wrote a summary of the talk (which is a bit incorrect in places and skips some of the main points, but is reasonably good), and posted a video they took of the talk. The video, unfortunately, focuses mostly on me without my visual presentation, panning only occasionally to show a handful of the 140 or so illustrations I used. The talk is, I think, much more effective with the visual component. (In particular, it reduces the impact of the amount of time I spend glancing down to check my speaker notes!)

In the talk I presented a three-part story: UCC problems are unavoidably ICD problems; ICD offers a principled approach to design; and ICD works in practical settings. I described three main incentive challenges for UCC design: getting people to contribute; motivating quality and variety of contributions; and discouraging "polluters" from using the UCC platform as an opportunity to publish off-topic content (such as commercial ads, or spam). I illustrated with a number of examples in the wild, and a number of emerging research projects on which my students and I are working.

Posted by jmm at 10:02 AM | Comments (0) | Permalink »