October 31, 2013

Everything (of value) is for sale

There's a truism that bothers many (except economists): if there is a good or service that has value to some and can be produced by someone else at a cost below that value, there will be a market. This is disturbing to many because it is as true for areas of dubious morality (sexual transactions) and outright immorality (human trafficking and slavery) as it is for lawn mowing and automobiles.

Likewise for online activities, as I've documented many times here. You can buy Twitter followers, Yelp reviews, likes on Facebook, and votes on Reddit. And, of course, Wikipedia, where you can buy pages or edits, or even (shades of The Sopranos) "protection".

Here is an article that reports at some length on large-scale, commercialized Wikipedia editing and page management services. Surprised? It's just another PR offering, like the social media management provided by every advertising / marketing / image management firm today.

Posted by jmm at 09:47 AM | Comments (0) | Permalink »

June 06, 2013

Everything can -- and will -- be manipulated

Well, not "everything". But every measure on which decisions of value depend (e.g., book purchases, dating opportunities, or tenure) can and will be manipulated.

And if the measure depends on user-contributed content distributed on an open platform, the manipulation often will be easy and low cost, and thus we should expect to see it happen a lot. This is a big problem for "big data" applications.

This point has been the theme of many posts I've made here. Today, a new example: citations of scholarly work. One of the standard, often highly valued (as in, makes a real difference to tenure decisions, salary increases, and outside job offers) measures of the impact of a scholar's work is how often it is cited in the published work of other scholars. Thomson ISI has been providing citation indices for many years. ISI is not so easy to manipulate because -- though it depends on user-contributed content (articles by one scholar that cite the work of another) -- that content is distributed on closed platforms (ISI only indexes citations from a set of published journals whose editorial boards protect their reputation and brand by screening what they publish).

But over the past several years, scholars have increasingly relied on Google Scholar (and sometimes Microsoft Academic) to count citations. Google Scholar indexes citations from pretty much anything that looks like a scholarly article and is reachable by the Google spiders crawling the open web. So, for example, it includes citations in self-published articles, or in e-prints of articles published elsewhere. Thus, Google Scholar citation counts depend on user-contributed content distributed on an open platform (the open web).

And, lo and behold, it's relatively easy to manipulate such citation counts, as demonstrated by a recent scholarly paper that did so: Delgado Lopez-Cozar, Emilio; Robinson-Garcia, Nicolas; Torres Salinas, Daniel (2012). "Manipulating Google Scholar Citations and Google Scholar Metrics: simple, easy and tempting." EC3 Working Papers 6, 29 May 2012, available at http://arxiv.org/abs/1212.0638v2.

Their method was simple: they created some fake papers that cited other papers, and published the fake papers on the Web. Google's spider dutifully found them and increased the citation counts for the real papers that these fake papers "cited".
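
To see why this works, consider what an open-web citation counter has to do: credit a citation to any crawled document that contains a matching reference, with no way to check who wrote or published the citing document. The toy sketch below is my own illustration (not Google Scholar's actual pipeline; the function name and sample strings are invented), just to make the vulnerability concrete.

    # Toy sketch of an open-web citation counter (my illustration, not Google
    # Scholar's actual pipeline). It credits a citation to any crawled document
    # that mentions an indexed title, with no check of the citing document's
    # provenance -- which is exactly why self-published fakes inflate the counts.
    from collections import Counter

    def count_citations(crawled_documents, indexed_titles):
        """Add one citation to a title for each crawled document that mentions it."""
        counts = Counter()
        for doc in crawled_documents:
            for title in indexed_titles:
                if title.lower() in doc.lower():
                    counts[title] += 1
        return counts

    # Six self-published "papers" that each cite the same real article add six
    # citations; the counter cannot tell them from genuine scholarship.
    fake_papers = ['... as argued in "A Real Article on Citation Metrics" ...'] * 6
    print(count_citations(fake_papers, ["A Real Article on Citation Metrics"]))
    # -> Counter({'A Real Article on Citation Metrics': 6})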

The lesson is simple: any measure that depends on user-contributed content on an open platform, and on which valuable decisions depend, should be assumed vulnerable to manipulation. This is a sad and ugly fact about many of the new opportunities for measurement ("big data"), and one that we must start to address. The economics are unavoidable: the cost of manipulation is low, so if there is much value in manipulating, it will happen. We have to think about ways to increase the cost of manipulation if we don't want to lose the value of the data.
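
Stated as a back-of-the-envelope condition (my own stylized formalization, not something from the paper; the illustrative numbers are invented): manipulation happens whenever the value of moving the measure exceeds the effort cost plus the expected penalty, so defenses amount to raising one of those terms.

    # Stylized manipulation condition (my own formalization, not from the paper).
    # Defenses can raise the effort cost, the detection probability, or the penalty.
    def worth_manipulating(value, effort_cost, p_detect, penalty):
        """Manipulate when the value of shifting the measure beats the expected cost."""
        return value > effort_cost + p_detect * penalty

    # Invented illustrative figures: a tenure-relevant boost worth a lot, cheap fakes,
    # low detection risk -- so manipulation pays.
    print(worth_manipulating(value=5000.0, effort_cost=50.0, p_detect=0.05, penalty=1000.0))  # -> True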

Posted by jmm at 11:09 AM | Comments (1) | Permalink »

May 27, 2013

Mining social data -- what is revealed?

Here is a recent article about high school students manipulating their Facebook presence to fool college admissions officers. Not terribly surprising: the content is (largely) created and controlled by the target of the background searches (by admissions offices, prospective employers, prospective dating partners, etc.), so it's easy to manipulate. We've been seeing this sort of manipulation since the early days of user-contributed content.

People mining user-contributed content should give careful thought to this. Social scientists like it when they can observe behavior, because it often reveals something more authentic than simply asking someone a question (about what they like, what they would have done in a hypothetical situation, etc.). Economists, for example, are thrilled when they get to observe "revealed preference": the choices people make when faced with a real resource allocation problem. It could be that I purchased A instead of B to fool an observer, but there is a cost to my doing so (I bought and paid for a product that I didn't want), and as long as the costs are sufficiently salient, it is more likely that we are observing preferences untainted by manipulation.

There are, of course, costs to manipulating user-contributed content like Facebook profiles: some amount of time, at the least, and probably some reduced value from the service (for example, students say that during college application season they hide their "regular" Facebook profile and create a dummy in which they talk about all of the community service they are doing, and how they love bunnies and want to solve world hunger: all fine, but they are giving up the other uses of Facebook that they normally prefer). But the costs of manipulating user-contributed content may often be low, and thus we shouldn't be surprised if there is substantial manipulation in the data, especially if the users have reason to think they are being observed in a way that will affect an outcome they care about (like college admissions).

Put another way, the way people portray themselves online is behavior and so reveals something, but it may not reveal what the data miner thinks it does.

Posted by jmm at 02:39 PM | Comments (0) | Permalink »

October 01, 2012

I saw this one coming: reviewing myself

Actually, I didn't see this coming, but I wish I had: scholarly authors who make sure they see their own submissions coming, by suggesting themselves (via "sybil" identities) as reviewers (referees) of their own articles! A lovely case of online information manipulation in response to (fairly intense) incentives to increase one's publication count.

How could an editor be dumb enough to send an article back to its author for review? The trick is simple (though it shouldn't be that hard for editors to see through, and apparently checking is becoming more commonplace: so what will be the next clever idea as this particular arms race escalates?). Submit to a journal that asks authors to suggest potential reviewers. (Many journals do this -- one hopes the editor selects some reviewers from an independent list, not just from the author's suggestions!) Then submit a name and university along with a false email address that routes to a mailbox you control. Then, bingo, if the editor selects that reviewer, you get to write the review.

To reduce your chances of getting caught, you can suggest a real and appropriate reviewer, just providing an innocuous but false email address (some variant on his or her name @gmail, for example).

Via The Chronicle of Higher Education.

Posted by jmm at 10:38 AM | Comments (0) | Permalink »

April 06, 2010

Yelp's new idea

Yelp, the user-contributed local business review site, has a well-known set of incentive problems that invite manipulation. First, businesses might want to write overly positive reviews of themselves (under pseudonyms). Second, they might want to write negative reviews of their competitors. Third, they might want to pay Yelp to remove negative reviews of themselves. This last has received a lot of attention, including a class action suit against Yelp alleging that some of its salespeople extort businesses into paying to remove unfavorable reviews.

Yelp has always filtered reviews, trying to remove those that it suspects are biased, whether too positive or too negative. But of course it makes both Type I and Type II errors, and some of the Type IIs (filtering out valid reviews) may be at the root of some of the extortion claims (or not).
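
For readers who mix up the error types: in the usage above, a Type I error is a fake review the filter lets through, and a Type II error is a valid review it wrongly removes. Here is a toy tally of the two rates (nothing to do with Yelp's actual filter; the labels and data are invented):

    # Toy illustration (not Yelp's actual filter): tally the two error types for a
    # review filter against a hand-labeled sample.
    def error_rates(labels, filtered):
        """labels[i] is True if review i is genuinely fake; filtered[i] is True if
        the filter removed it. Returns (type_i_rate, type_ii_rate), where
        Type I = fake review kept, Type II = valid review filtered out
        (matching the usage in the post above)."""
        fakes = [i for i, fake in enumerate(labels) if fake]
        valids = [i for i, fake in enumerate(labels) if not fake]
        type_i = sum(1 for i in fakes if not filtered[i]) / max(len(fakes), 1)
        type_ii = sum(1 for i in valids if filtered[i]) / max(len(valids), 1)
        return type_i, type_ii

    labels   = [True, True, False, False, False, False]   # which reviews are fake
    filtered = [True, False, True, False, False, False]   # which ones the filter removed
    print(error_rates(labels, filtered))                  # -> (0.5, 0.25)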

Yelp has now made a rather simple but, I suspect, quite favorable change: it is making all filtered reviews visible on a separate page (see http://mashable.com/2010/04/06/yelp-extortion-claims/). This transparency, it hopes, will let users see that it is even-handed in its filtering, and that its errors are not themselves biased (or influenced).

Embracing transparency is a strategy that seems to work more often than not in this Web 2.0 age of the Internet. I think it will here. Most folks will never bother to look at the filtered-out reviews, and thus will rely on the very reviews that Yelp thinks are most reliable. Those who do look, if Yelp is indeed being even-handed, will probably find the filtering interesting, but will ignore the filtered reviews in choosing which businesses to frequent. The main risk to Yelp is likely to be that imitators will be better able to reverse-engineer its filtering formulae.

Posted by jmm at 12:57 AM | Comments (0) | Permalink »

May 08, 2009

Another take on manipulating Wikipedia

[Dilbert comic strip (Scott Adams, http://dilbert.com/strips/, 8 May 2009)]

Posted by jmm at 10:54 AM | Comments (0) | Permalink »

April 12, 2009

Lady Chatterley banned again

I am here engaging in what I usually describe as the "manipulation" subspecies of "pollution". I am doing this to participate in the Amazon Rank project to Google-bomb Amazon.

Apparently on 12 April 2009, Amazon removed books it deemed to have "adult content" from its sales rankings. Because of the way its systems work, this now means that the books are not found in standard searches. Example: use the main search box to query "Lady Chatterley's Lover". I just did this and did not get a hit on the book by D. H. Lawrence until #8, and that was an edition available through Amazon's used bookseller partners, not a new copy available from Amazon itself. Search on D. H. Lawrence's name, and his most popular book (LCL) does not come up in the first 16 entries (an audio CD edition pops up at #15).

So, an angry blogger started a Google-bombing campaign to make a search on Amazon Rank turn up a critical definition (follow the link). As a political matter, I support this particular manipulation.

Posted by jmm at 04:05 PM | Comments (0) | Permalink »

December 16, 2008

Manipulating online voting

Is online anonymous (or at least, unauthenticated) voting ever a good idea? When there is no authentication, voters who care can go to the ballot box multiple times. And with scripted robots, won't voting become completely meaningless?

Perhaps. It appears that this is more or less the case with the NHL All-Star Game balloting this year.

There has been a huge discussion about this in the blogosphere (see, e.g., the summary with links by NY Times blogger Stu Hackel). As reported in the "traditional" pages of the NYT, ballot stuffing has, of course, been known to occur for as long as fan voting has been around, but this year it rose to new levels when fans started posting robot scripts and at least one (Pittsburgh) team executive started exhorting fans to get out the text-messaging vote. At one point, two players who had seen zero ice time this year were closing in on starting positions in the All-Star Game.

Online balloting is cheap, accessible, and participatory. But can it be made to work? What sort of incentives can be designed that lead to reliable results when secure authentication is not required?

One recent, very interesting research paper on this topic is by Liad Wagman and Vincent Conitzer (a Ph.D. candidate and an assistant professor at Duke, respectively). (For more permanent reference, the paper was published in the proceedings of the AAAI 2008 conference, where it won a "Best Paper" award.) Their results are fairly limited and not yet applicable to real settings, but they are a key step forward. Their main insight is that if there is some cost to voting, then as the number of voters becomes large it may be possible to design a voting rule under which it is not in anyone's interest to vote multiple times.
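
To get a feel for the role of a voting cost, here is a stylized illustration of the economics only (it is NOT the Wagman-Conitzer mechanism, and the benefit and cost figures are invented): a duplicate ballot only pays off when it changes the outcome, and in a large two-option majority vote that chance shrinks, so even a small per-ballot cost eventually makes stuffing unattractive.

    # Stylized illustration of why a per-ballot cost can deter ballot stuffing
    # (NOT the Wagman-Conitzer mechanism; the benefit and cost figures are invented).
    from math import lgamma, log, exp

    def tie_probability(n_other: int) -> float:
        """P(the other n voters split exactly evenly) in a two-option vote with
        p = 1/2, computed in log space to avoid underflow for large n."""
        if n_other % 2:                 # odd number of others: no exact tie possible
            return 0.0
        k = n_other // 2
        return exp(lgamma(n_other + 1) - 2 * lgamma(k + 1) + n_other * log(0.5))

    def worth_stuffing(n_other: int, benefit: float, cost_per_ballot: float) -> bool:
        """Does one extra (duplicate) ballot have higher expected gain than cost?"""
        return benefit * tie_probability(n_other) > cost_per_ballot

    for n in (10, 100, 1000, 10000):
        print(n, worth_stuffing(n, benefit=1.0, cost_per_ballot=0.01))
    # -> True, True, True, False: stuffing stops paying once the electorate is large enough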

Imposing a small cost on honest participants to screen out the dishonest is not a new idea in the broad area of incentive-centered design. Rick Wash and I have shown that it is a fundamental strategy realized in widely varying information system applications, such as passwords and CAPTCHAs, and of course it is an application of screening and signaling theory, which goes back at least to Spence's seminal paper, for which he won the Nobel Prize. But to the best of my knowledge, this is the first rigorous study of the idea in the area of online voting.

Unfortunately, the Wagman-Conitzer results so far are somewhat negative when there are more than three alternatives (as, for example, in sports all-star balloting) or when individuals can collude and act as a group. Casual intuition suggests why the problem is so hard. More work to be done.

Postscript 12/29:
This is too good to pass up. Today, one day after writing the above article, I received in my email an official mailing from the Detroit Red Wings, urging me (in traditional Chicago fashion) to help out:

Posted by jmm at 02:20 AM | Comments (1) | Permalink »

November 26, 2008

New UCC opportunity, new opportunity for manipulation and spam

Google has made available a striking set of new features for search, which it calls SearchWiki. If you are logged in to a Google account, you can add or delete results, re-order them, and post comments (which can be viewed by others); your changes appear the next time you run the same search.

But the comments are user-contributed content: this is a relatively open publishing platform. If others search on the same keyword(s) and select "view comments", they will see what you entered, which might be advertising, political speech, whatever. As Lauren Weinstein points out, this is an obvious opportunity for pollution, and (to a lesser extent, in my humble opinion, because there is no straightforward way to affect the behavior of other users) manipulation. In fact, he finds that comment wars and nastiness started within hours of SearchWiki's availability:

It seems inevitable that popular search results in particular will quickly become laden with all manner of "dueling comments" which can quickly descend into nastiness and even potentially libel. In fact, a quick survey of some obvious search queries shows that in the few hours that SearchWiki has been generally available, this pattern is *already* beginning to become established. It doesn't take a lot of imagination to visualize the scale of what could happen with the search results for anybody or anything who is the least bit controversial.

Lauren even suggests that lawsuits by site owners whose links in Google become polluted are likely, presumably claiming they have some sort of property right in clean display of their beachfront URL.

Posted by jmm at 10:27 AM | Comments (0) | Permalink »

September 01, 2008

The fine line between spam and foie gras

The New York Times (following others) reported today on a large number of detailed, informed, and essentially all-flattering edits to Sarah Palin's Wikipedia page, made --- hmmm --- in the 24 hours before her selection as the Republican vice presidential nominee was announced. The edits were made anonymously, and the editor has not yet been identified, though he acknowledges that he is a McCain campaign volunteer.

Good or bad content? The potential conflict of interest is clear. But that doesn't mean the content is bad. Most of the facts were supported with citations. But were they written in overly flattering language? And was the selection of facts unbiased? Much of the material has been revised, toned down, or removed in the few days since, which is not surprising regardless of the quality of this anonymous editor's contributions, given the attention that Ms. Palin has been receiving.

Posted by jmm at 04:18 PM | Comments (0) | Permalink »

April 12, 2008

Pollution as revenge

One of my students alerted me to a recent dramatic episode. Author and psychologist Cooper Lawrence appeared on a Fox News segment and made some apparently false statements about the Xbox game "Mass Effect", which she admitted she had never seen or played. Shortly thereafter, irate gamers started posting one-star (the lowest possible score) Amazon reviews of the recent book she was plugging on that Fox News appearance. Within a day or so, there were about 400 one-star reviews, and only a handful any better.

Some of the reviewers acknowledged they had not read or even looked at the book (arguing they shouldn't have to since she reviewed a game without looking at it). Many explicitly criticized her for what she said about the game, without actually saying anything about her book.

When alerted, Amazon apparently deleted most of the reviews. Its strategy appears to have been to delete reviews that mentioned the name of the game, or video games at all (the book has nothing to do with video games). With this somewhat conservative strategy, the remaining reviews (68 at the moment) are still lopsidedly negative (57 one-star, 8 two-star, 3 five-star), more lopsided than I've ever noticed for any serious book, though there's no obvious way to rule these out as legitimate reviews. (I read several and they do seem to address the content of the book, at least superficially.)
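
From the description, the screen amounts to a keyword match; here is a minimal sketch of that kind of rule (my reconstruction of the apparent logic, not Amazon's actual code, and the term list is a guess). Its limitation is also visible: a revenge review that simply avoids the blocked terms sails through, which is consistent with the lopsided reviews that remain.

    # Minimal sketch of the conservative keyword screen described above
    # (my reconstruction of the apparent rule, not Amazon's actual code).
    BLOCKED_TERMS = ("mass effect", "video game", "videogame", "xbox")

    def keep_review(review_text: str) -> bool:
        """Drop any review that mentions the game, or video games at all."""
        text = review_text.lower()
        return not any(term in text for term in BLOCKED_TERMS)

    reviews = [
        "She reviewed Mass Effect without ever playing it, so one star.",
        "The book's argument about teen psychology is thin and poorly sourced.",
    ]
    print([keep_review(r) for r in reviews])   # -> [False, True]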

Aside from being a striking and different example of book review pollution (past examples I've noted have been about favorable reviews written by friends and by the authors themselves), I think this story highlights troubling issues. The gamers have, quite possibly, intentionally damaged Lawrence's business prospects: her sales likely will be lower (I know that I pay attention to review scores when I'm choosing books to buy). Of course, she arguably damaged the sales of "Mass Effect", too. Arguably, her harm was unintentional and careless (negligent rather than malicious). But she presumably is earning money by promoting herself and her writing on TV shows: is it a reasonable social response to discipline her for negligence? (And the reviewers who have more or less written "she speaks about things she doesn't know; don't trust her as an author" may have a reasonable point: so-called "public intellectuals" probably should guard their credibility in every public venue if they want people to pay them for their ideas.)

I also find it disturbing, as a consumer of book reviews, but not video games, that reviews might be revenge-polluted. Though this may discipline authors in a way that benefits gamers, is it right for them to disadvantage book readers?

I wonder how long it will be (if it hasn't already happened) before an author or publisher sues Amazon for providing a nearly open-access platform for detractors to attack a book (or CD, etc.). I don't know the law in this area well enough to judge whether Amazon is liable (after all, she could arguably sue the individual reviewers for some sort of tortious interference with her business prospects), but given the frequency of contributory liability or similar claims in other domains (such as Napster and Grokster facilitating the downloading of copyrighted materials), it seems like some lawyer will try to make the case one of these days. After all, Amazon provides the opportunity for readers to post reviews in order to advance its own business interests.

Some significant risk of contributory liability could be hugely important for the problem of screening pollution in user-contributed content. If you read some of the reviews still on Amazon's site in this example, you'll see that it would not be easy to decide which of them were "illegitimate" and delete all of those. And what kind of credibility would the review service have if its operators made a habit of deciding (behind closed doors) which too-negative reviews to delete, particularly en masse? I think Amazon has done a great job of making it clear that it permits both positive and negative reviews and doesn't over-select the positive ones to display, which was certainly a concern I had when it first started posting reviews. But if authors and publishers can hold it liable when "revenge" reviews appear, I suspect it (and similar sites) will have to shut down reviewing altogether.

(Thanks to Sarvagya Kochak.)

Posted by jmm at 01:42 PM | Comments (0) | Permalink »

March 29, 2008

Presentation at Yahoo! Research on user-contributed content

Yahoo! Research invited me to speak in their "Big Thinkers" series at the Santa Clara campus on 12 March 2008. My talk was "Incentive-centered design for user-contributed content: Getting the good stuff in, Keeping the bad stuff out."

My hosts wrote a summary of the talk (that is a bit incorrect in places and skips some of the main points, but is reasonably good), and posted a video they took of the talk. The video, unfortunately, focuses mostly on me without my visual presentation, panning only occasionally to show a handful of the 140 or so illustrations I used. The talk is, I think, much more effective with the visual component. (In particular, it reduces the impact of the amount of time I spend glancing down to check my speaker notes!)

In the talk I present a three-part story: UCC problems are unavoidably ICD problems; ICD offers a principled approach to design; and ICD works in practical settings. I described three main incentives challenges for UCC design: getting people to contribute; motivating quality and variety of contributions; and discouraging "polluters" from using the UCC platform as an opportunity to publish off-topic content (such as commercial ads, or spam). I illustrated with a number of examples in the wild, and a number of emerging research projects on which my students and I are working.

Posted by jmm at 10:02 AM | Comments (0) | Permalink »

February 12, 2008

Followup: Second GiveWell founder admits deception

Earlier I noted that the founder of the nonprofit GiveWell had been demoted and financially penalized when it was learned that he had used a pseudonym online to recommend his own organization. The New York Times reports today that a second founder has admitted that he also used a false name in an online posting recommending GiveWell.

These are provocative examples of the manipulation problem --- one species of the problem of managing the quality of user-contributed content --- because GiveWell's business is evaluating the reliability and quality of other nonprofits in order to provide advice on where to direct one's charitable donations.

Posted by jmm at 03:50 PM | Comments (0) | Permalink »

January 08, 2008

MetaFilter manipulated by nonprofit that reports on honesty and reliability of nonprofits

The New York Times today reported that the Executive Director of a nonprofit research organization manipulated the Ask MetaFilter question service to steer users to his organization's site.

This is particularly piquant because the manipulator founded his organization (GiveWell) as a nonprofit to help people evaluate the quality (presumably, including reliability!) of nonprofit charitable organizations, and GiveWell itself is supported by charitable donations.

The manipulation was simple, and reminiscent of the well-publicized book reviews by authors and their friends on Amazon: the executive pseudonymously posted a question asking where he could go to get good information about charities, and then under his own name (but without identifying his affiliation) answered his own question by pointing to his own organization.

When discovered, the GiveWell board invoked old-fashioned incentives: they demoted the Executive Director (and founder), docked his salary, and required him to attend a professional development training program. Of course, the expected cost of being caught and punished was evidently not a sufficient incentive ex ante, but the organization apparently hopes that by imposing the ex post punishment he will be motivated to behave in the future, and that by publicizing it other employees will be similarly motivated. The publicity provides an additional incentive: the ED's reputation has been severely devalued, presumably reducing his expected future income and sense of well-being as well.

Posted by jmm at 08:23 AM | Permalink »

January 07, 2008

UCC search arrives...manipulation and pollution to follow soon

Jimmy Wales announced the release of the public "alpha" of his new, for-profit search service, Wikia Search. The service is built on a standard search engine, but its primary feature is that users can evaluate and comment on search results, building a user-contributed content database that Wikia hopes will improve search quality, making Wikia a viable, open (and, it hopes, profitable) alternative to Google.

Miguel Helft, a writer for The New York Times, was quick to note that such a search service might be quite vulnerable to manipulation:

Like other search engines and sites that rely on the so-called "wisdom of crowds," the Wikia search engine is likely to be susceptible to people who try to game the system, by, for example, seeking to advance the ranking of their own site. Mr. Wales said Wikia would attempt to "block them, ban them, delete their stuff," just as other wiki projects do.

The tension is interesting: Wikia promotes itself as a valuable alternative to Google largely because its search and ranking algorithms are open, so that users know more about why some sites are being selected or ranked more highly than others.

"I think it is unhealthy for the citizens of the world that so much of our information is controlled by such a small number of players, behind closed doors," [Wales] said. "We really have no ability to understand and influence that process."

But, although the search and ranking algorithms may be public, whether or not searches are being manipulated through user-contributed content will not be so obvious. It is far from obvious which approach is more dependable and "open". Wikia's success apparently will depend on its ad hoc and technical methods for "blocking, banning and deleting" manipulation.

Posted by jmm at 09:23 AM | Permalink »

April 08, 2006

Polluting user-contributed reviews

A recent First Monday article by David and Pinch (2006) documents an interesting case of book review pollution on Amazon. A user review of one book critically compared it to another. Immediately afterward, a "user" entered another review that blatantly plagiarized a favorable review of the first book, and further user reviews did more plagiarizing.

When the author of the first book discovered the plagiarism, he notified Amazon, which at the time had a completely hands-off policy on user reviews and so refused to intervene even for blatant plagiarism. (The policy has since changed.) Another example of the problem of keeping bad-quality contributions out.

David and Pinch remind us that when an Amazon Canada programming glitch revealed reviewer identities,

a large number of authors had "gotten glowing testimonials from friends, husbands, wives, colleagues or paid professionals." A few had even 'reviewed' their own books, and, unsurprisingly, some had unfairly slurred the competition.

David and Pinch address the issue of review pollution at some length. First, they catalogue six discrete layers of reputation in the Amazon system, including user ratings of others' reviews and a mechanism for reporting abuse. Then they analyzed 50,000 reviews of 10,000 books and CDs, identifying several categories of review pollution automatically (using software algorithms).

They also make an interesting point about the arms-race limitations of technical pollution screens:

The sorts of practices we have documented in this paper could have been documented by Amazon.com themselves (and for all we know may have indeed been documented). Furthermore if we can write an algorithm to detect copying then it is possible for Amazon.com to go further and use such algorithms to alert users to copying and if necessary remove material. If Amazon.com were to write such an algorithm and, say, remove copied material, this will not be the end of the story. Users will adapt to the new feature and will no doubt try and find new ways to game the system.
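
The kind of copy-detection algorithm the quote imagines is not hard to sketch; a standard starting point is overlap between word shingles. The snippet below is my own illustration (not the method David and Pinch actually used), and the similarity threshold is an invented parameter.

    # Copy detection between reviews via word-shingle (k-gram) overlap -- a
    # standard technique, and my own sketch rather than David and Pinch's method.
    def shingles(text: str, k: int = 5) -> set:
        """All k-word windows in the text, lowercased."""
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a: str, b: str, k: int = 5) -> float:
        """Jaccard similarity of the two texts' k-word shingle sets."""
        sa, sb = shingles(a, k), shingles(b, k)
        if not sa or not sb:
            return 0.0
        return len(sa & sb) / len(sa | sb)

    def flag_copies(reviews, threshold: float = 0.5):
        """Pairs of reviews whose overlap is high enough to suggest copying."""
        return [(i, j) for i in range(len(reviews)) for j in range(i + 1, len(reviews))
                if jaccard(reviews[i], reviews[j]) >= threshold]

    original = "This book is a wonderful, insightful read that I recommend to everyone."
    copycat = "This book is a wonderful, insightful read that I recommend to my friends."
    print(flag_copies([original, copycat, "Dull, derivative, skip it."]))  # -> [(0, 1)]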

Posted by jmm at 02:35 PM | Comments (0) | Permalink »