Originally posted on Data Science Central, by Vincent Granville.
Adversarial analytics and business hacking: Amazon case study.
Chances are that you have purchased a book, or visited a restaurant, as a result of reading fake reviews. The problem affects companies such as Amazon and Yelp. On Facebook, massive disinformation campaigns are funded by political money, target thousands of profiles, and are managed by public relations companies that create fake profiles and try to befriend influencers. The focus here is specifically on Amazon book reviews; the Facebook issue will be discussed later. The Yelp issue is well known and has resulted in a class action lawsuit: the allegation is that Yelp's account managers create bad reviews for restaurants, and that if you pay a monthly advertising fee, your rating suddenly and dramatically improves.
Source for picture: Examples of bogus book reviews on Amazon
Amazon is selling books, so it has a conflict of interest when it comes to book (or product) reviews. The purpose of this article is three-fold:
This is the new project for candidates interested in our data science apprenticeship. The full list of projects can be found here. The project description is as follows:
You will have to assess the proportion of fake book reviews on Amazon, test a fake review generator (possibly using EC2 to deploy the reviews), reverse engineer an Amazon algorithm, and identify how the review scoring engine can be improved. Extra mile: create and test your own review scoring engine. Scrape thousands of sampled Amazon reviews and score them, as well as users posting these reviews.
Note that we do not study here the impact of reviews and stars on purchasing behavior or pricing; this will be the subject of another article.
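The first task in the project description, assessing the proportion of fake reviews from a sample, could be sketched as follows. This is only an illustration: it assumes you already have some way of flagging a sampled review as fake (manual labeling, for instance), and the counts used here are made up. It uses the Wilson score interval, one standard way to put a confidence interval around a sampled proportion.

```python
import math
from typing import Tuple


def wilson_interval(flagged: int, sampled: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% coverage)."""
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    if not 0 <= flagged <= sampled:
        raise ValueError("flagged must be between 0 and sampled")
    p = flagged / sampled
    denom = 1 + z ** 2 / sampled
    center = (p + z ** 2 / (2 * sampled)) / denom
    half_width = z * math.sqrt(p * (1 - p) / sampled + z ** 2 / (4 * sampled ** 2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical sample: 2,000 scraped reviews, 180 of them labeled as fake.
flagged, sampled = 180, 2000
low, high = wilson_interval(flagged, sampled)
print(f"Estimated share of fake reviews: {flagged / sampled:.1%} "
      f"(approx. 95% interval: {low:.1%} to {high:.1%})")
```

The interval only reflects sampling error. If the labeling itself is unreliable, or if the sample is not representative of all reviews, the true proportion can fall well outside it.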
1. Fake review detection
Which metrics would you use to detect fake reviews?
These are features that should probably be included in any fake review detection system. HDT (hidden decision trees) is a great data science technique for designing such review scoring engines. What other metrics would you suggest?
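To make the idea of a review scoring engine concrete, here is a minimal sketch. It is not HDT: it is a plain weighted sum of a few signals, and every feature name, threshold, and weight below is a hypothetical choice made for illustration, not a metric taken from Amazon or from a tested model.

```python
from dataclasses import dataclass


@dataclass
class Review:
    # All fields are hypothetical features chosen for illustration.
    verified_purchase: bool
    word_count: int
    reviewer_review_count: int     # reviews previously posted by this reviewer
    reviewer_extreme_share: float  # share of the reviewer's ratings that are 1 or 5 stars (0..1)
    same_day_reviews: int          # reviews posted by this reviewer on the same day


def suspicion_score(r: Review) -> float:
    """Return a score in [0, 1]; higher means more suspicious. Weights are guesses."""
    score = 0.0
    if not r.verified_purchase:
        score += 0.30
    if r.word_count < 20:
        score += 0.15
    if r.reviewer_review_count <= 1:
        score += 0.20
    score += 0.20 * r.reviewer_extreme_share
    if r.same_day_reviews >= 5:
        score += 0.15
    return min(score, 1.0)


plausible = Review(True, 150, 40, 0.2, 1)
suspicious = Review(False, 8, 1, 1.0, 12)
print(suspicion_score(plausible), suspicion_score(suspicious))
```

In practice the weights would be fitted on labeled data rather than set by hand, and the score would be combined with a score for the user posting the review.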
2. Experimental design and proof of concept: test fake reviews on Amazon
Here the data science apprentice is asked to try various strategies to post fake reviews for targeted books on Amazon, and check what works (that is, what goes undetected by Amazon). The purpose is to reverse-engineer Amazon's review scoring algorithm (used to detect bogus reviews), in order to identify weaknesses and report them to Amazon.
Strategies will involve
You might have to fine-tune the suggested parameters to optimize the performance of your fake review posting process. Success here is measured by the proportion of 4- or 5-star books whose rating you managed to reduce to 3 stars or below. The deliverable is a paper summarizing the results of your test, how scalable your strategy is (can it be automated?), and recommended fixes to make Amazon reviews more trustworthy (that is, designing a better review scoring system). A review scoring system scores the reviews and automatically "reviews the reviews" to decide which ones should be accepted.
3. The real business risk associated with reviews
Amazon authors are vulnerable to the following fraud, which would eventually result in significant business loss for Amazon.
Imagine a start-up company selling good reviews for $500 per book, plus a $100 monthly fee. It would work as follows.
How scalable is this? A college student could easily make $500 a day, targeting only a few books each day, and collect the money via PayPal. That's $100k per year, assuming about 200 working days. Because the money is relatively easy to make, a large number of (educated and under-employed) people could be interested in setting up such a scheme, together targeting thousands of authors each day. Or someone might find a way to automate this activity, maybe using a botnet, and make millions of dollars each year. Many authors would eventually refuse to have their books listed on Amazon, and choose to self-publish with platforms such as Lulu. Publishers would also opt out of Amazon. Amazon's revenue from book sales would drop. Or Amazon could simply eliminate all reviews and stop accepting new ones.
Interestingly, it appears that Yelp might be making money with a similar scheme, based on fake reviews and blackmailing small businesses listed on its website. And I've seen companies selling fake Twitter followers or Facebook profiles, though they quickly disappear. Even LinkedIn was recently the victim of a massive scheme involving automatically generated fake profiles.
Websites relying on reviews (of books, products, restaurants, etc.) are vulnerable to massive attacks that could destroy their reputation, and eventually their income.
How could Amazon protect itself from such a risk? By using a better review scoring engine. By relying more on its recommendation engine (users who purchased A also purchased B). By designing a better, fraud-resistant user reputation engine, and integrating user reputation as a metric in the review scoring engine. By displaying high-scoring reviews at the top, or more frequently. Or by dropping user-generated reviews altogether.
Also, Amazon could categorize users, so that a data science book review by a user categorized as "interested in web design" does not carry the same weight as a data science book review by a user categorized as "interested in data science". Or a new company could emerge and start competing with Amazon by offering a much better user experience. Such a company could make additional revenue by offering authors the possibility to have their book featured at the top when a user is searching for books, just like Google does with webmasters who want to promote their website.
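The idea of weighting each review by user reputation and by user category can be sketched as a weighted average of star ratings. This is a rough illustration under stated assumptions: the reputation scale (0 to 1), the interest labels, and the 0.25 discount applied to off-topic reviewers are all hypothetical, and nothing here describes how Amazon actually computes ratings.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass
class RatedReview:
    stars: int
    reviewer_reputation: float          # hypothetical reputation score in [0, 1]
    reviewer_interests: FrozenSet[str]  # hypothetical user categories


def weighted_rating(reviews: Iterable[RatedReview], book_category: str,
                    off_topic_weight: float = 0.25) -> Optional[float]:
    """Average the stars, weighting each review by reputation and category match."""
    weighted_sum = 0.0
    total_weight = 0.0
    for r in reviews:
        relevance = 1.0 if book_category in r.reviewer_interests else off_topic_weight
        weight = r.reviewer_reputation * relevance
        weighted_sum += weight * r.stars
        total_weight += weight
    if total_weight == 0:
        return None  # no usable reviews
    return weighted_sum / total_weight


reviews = [
    RatedReview(5, 0.9, frozenset({"data science"})),
    RatedReview(1, 0.8, frozenset({"web design"})),
    RatedReview(4, 0.6, frozenset({"data science", "statistics"})),
]
plain_mean = sum(r.stars for r in reviews) / len(reviews)
print(f"plain mean: {plain_mean:.2f}, "
      f"weighted: {weighted_rating(reviews, 'data science'):.2f}")
```

With these made-up numbers, the 1-star review from the web design user counts for much less, so the weighted rating ends up above the plain mean. The reputation score itself must be fraud-resistant, otherwise this weighting just moves the problem elsewhere.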
Note: I never write reviews, despite the many requests that I receive from authors or publishers. I don't have the time, and I expect to be paid to provide quality content (reviews, bad or good, of high quality). No executive has time to spend on writing reviews anyway, thus if you write a book aimed at executives, you won't get any reviews from fellow executives. In short, all the reviews will be worthless.