How (not) to lie with statistics
I’ve been looking at the Author Earnings Excel spreadsheets (the “raw data”) for the last few days. Many people have become very, very excited by the conclusions the authors draw from this data, and critics have consistently been shouted down as being pro-publisher, anti-self-publishing, or having various other axes to grind.
I have no dog in this hunt. I’m not a fiction author and very little of my income comes from royalties or other direct remuneration from writing. I have to write and publish in order to earn promotions and salary increases, and to be respected in my field, but that writing is judged on quality, reliability, and academic contribution. No one really cares about sales.
I do, however, care about how data are collected, analyzed, and reported, and this report doesn’t pass my smell test for reliability and validity. As a political scientist trained in both political science and sociological methods, I’ve conducted qualitative and quantitative studies and I’ve hand-built and analyzed a quantitative dataset comprising over 3,000 observations. I’ve also taught research design and methods at the university undergraduate and graduate levels.
My concerns aren’t about math or statistics (there is very little of either being used here). My concerns are about (1) how the data were selected; (2) inferences from one data point to a trend (you literally cannot do this); and (3) inferences about author behavior that are drawn entirely from data about book sales. I would be less concerned if the authors of the report showed more awareness of the limitations of the data. Instead they go the other way and make claims that the data cannot possibly support.
I could write 10,000 words on this pretty easily (I started to), but I’ll spare DA’s readership that and try to be succinct.
(1) This is Amazon-only data. We call this a “sample of convenience,” i.e., the authors figured out a way to scrape Amazon so they did. They justify extrapolating from Amazon to the entire book selling market by saying that Amazon has been “chosen for being the largest book retailer in the world.” This is analogous to me studying politics in the state of Alaska and then claiming my study provides an accurate representation of politics in the entire USA because Alaska is the biggest state. Or studying politics in China and then claiming my study explains politics all over the world because China is the most populous country. In other words: No.
Amazon is the biggest, but it’s not the only or even the majority bookseller. And it’s not a representative bookseller statistically, i.e., there is no reason to believe that Amazon provides a representative snapshot of all book sales. It probably sells the most ebooks, it might sell the most titles, and it is where a lot of self-published authors sell their books. But it does not sell the most total units of (print and e-) books, which means comparisons across categories will most likely be skewed and unreliable. If the authors’ conclusions were limited to Amazon-specific points, I would be less bothered. But they are making claims about general author behavior based on partial, skewed data. No. No. No.
(2) This is one day’s worth of data, one 24-hour period of sales and rankings. We call this a cross-sectional study. Cross-sections are snapshots, which can be very useful to give you an idea of relationships between variables. But they have their own biases, and these biases can be irrelevant (good for your study) or relevant (bad for your study). In this case the 24-hour period comprised parts of the last Tuesday and Wednesday of January 2014. I can think of at least two relevant biases: (a) books for the next month are frequently released on the last Monday-Tuesday of the previous month; and (b) people buy textbooks at the beginning of school semesters (January is the beginning of spring quarter/semester). This latter condition might create substitution effects (people spend their money on non-fiction books) which are not the same across all publishers and categories. I don’t know if these biases matter, but the point is that the authors don’t even tell us that they considered that there might be bias issues in picking this particular time period. I mistrust studies where limitations aren’t at least discussed.
(3) Cross-sections cannot give you trends. Trends need more than one data point. You cannot determine a trend from a single observation. If a book is #1 today, that doesn’t mean it will be #1 tomorrow. You cannot infer anything about the past or the future from a single data point in the cross-section.
(4) Nevertheless, the authors do try to infer from this. In fact, they do a lot of inferring that is analytically indefensible. Let’s take the inferences in turn.
- They infer sales numbers from rankings (because Amazon does not publicly report sales), based on their own books and information from other authors. In asking around I have been told that the sales numbers correspond to rankings fairly well. I’m willing to believe this, but Amazon itself points out that rankings can change without sales figures changing and vice versa. This may be a bias that is sufficiently general that it doesn’t compromise the inferences. But it’s something to keep in mind.
- They take the sales numbers for one day (an inference) and combine them with the publisher data, which gives them the gross and net sales figures for each book. They then take the author’s net revenue and multiply that number by 365 to get the author’s earnings for that book for the entire year. This is completely absurd. According to this “formula,” a book that sells zero copies on January 28/29 nets the author zero dollars for the year. A book that sells 7000 copies and is published by Amazon nets the author over $4 million. Not only is this unbelievable (99% of books move around the list over the year, unless they’re stuck at 0 sales), it casts doubt on every other data and inference decision the authors make. Why on earth would you accept information, let alone take advice, from someone who thinks this is a good way to calculate author earnings? You should be running far, far away. It is very bad analysis. It is horrifically bad inference. This criticism doesn’t even take into account the difficulty of estimating author earnings without including advances, but frankly, in my estimation it’s a sufficiently disabling criticism on its own.
- Author behavior is being inferred from data on books. There is too much author behavior that is simply missing and can’t be inferred in any legitimate way. Authors make choices about editing, packaging, writing quality, etc. that affect reader decisions to purchase books. In addition, anecdotal evidence consistently points to the importance of a backlist, a backlist that can come from self- or publisher-published books. None of these variables is captured in this data. So what we are at risk of is “omitted variable bias,” where the correlations are inaccurate because not everything that matters is present in the dataset.
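To make the multiply-by-365 step in the second bullet concrete, here is a minimal sketch of that arithmetic. The per-copy author net of $1.60 is a hypothetical figure, picked only so that the 7000-copy case lands "over $4 million" as described; the report's actual royalty inputs are not reproduced here.

```python
DAYS_PER_YEAR = 365

# Hypothetical per-copy author net (an assumption for illustration only,
# not a figure taken from the report or its spreadsheets).
ASSUMED_NET_PER_COPY = 1.60


def naive_annual_earnings(copies_sold_in_one_day: int, net_per_copy: float) -> float:
    """Annualize a single day's estimated author revenue by multiplying by 365."""
    return copies_sold_in_one_day * net_per_copy * DAYS_PER_YEAR


# A book that happened to sell nothing during the one observed day.
zero_day_book = naive_annual_earnings(0, ASSUMED_NET_PER_COPY)

# A book that happened to sell 7000 copies during the one observed day.
top_book = naive_annual_earnings(7000, ASSUMED_NET_PER_COPY)

print(f"0 copies that day    -> ${zero_day_book:,.0f} for the year")
print(f"7000 copies that day -> ${top_book:,.0f} for the year")
```

Under that assumed rate the rule hands the first author nothing for the whole year and the second author a little over $4 million, purely on the strength of one day's position. Nothing in the calculation allows a book to move up or down the list.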
(5) Many of the top-selling books come from Amazon imprints. Thomas & Mercer is the top-selling imprint in this dataset. That lucky author is on course to make $4 million, according to the report. (Or not, depending on which planet and universe you inhabit. Here’s hoping she lives in the alternate one.) It makes sense that Amazon does so well at Amazon, since they have many ways of boosting visibility and they naturally use those techniques to sell their own books. And since the NYT and USA Today don’t include exclusive-vendor books on their bestseller lists, we can’t see how Amazon books do when we include other rankings. It’s a classic problem of comparability: Amazon doesn’t include pre-order sales in its rankings and NYT/USA Today don’t include Amazon-only books in their rankings, so you can look at apples at one and oranges at the other, but you can’t look at apples and oranges together.
The authors attempt to bolster their existing data by looking at Bookscan numbers, but because Bookscan doesn’t break down data between ebook and print sales and instead includes overall sales, the information revealed to us in this new report doesn’t help us situate the Amazon data in a larger context. At best the Bookscan numbers might reveal the proportion of books sold at Amazon relative to the larger marketplace captured by Bookscan (although Bookscan doesn’t account for all sales). But instead, the Bookscan numbers are used to compare e-book sales to print sales, which is a completely different issue.
(6) The authors provide the data in an Excel spreadsheet format so that the rest of us can analyze it. I appreciate this, and I’m happy to work with flat files, although there are a lot of advantages to relational databases (e.g., MySQL and Access). But when I downloaded the files I realized that (a) this is not “raw data” and (b) important information is uncollected or removed. In particular:
- When the data were scraped, if they had picked up the release date it would let us know where the book was in its life cycle. Then we could apply a survival model, i.e., one that estimates a rate of sales decay over time, and the date would also help us identify whether the price in the cross-section is permanent or temporary. These data aren’t perfect, because publishers can change release dates (and there are different release dates for different editions, including self-pub updates). But being able to use even imperfect release date information would allow revenue projections to approximate something that isn’t prima facie absurd.
- The “author data” sheet in the file combines all of each author’s books into one observation (one row in the spreadsheet) and labels them with one publisher category. This potentially conflates self- and publisher-published books, books across genre categories, and top-selling and lesser-selling books (case in point: #1 Book is 7000 sales and #1 Author has two books at 7000 sales, so one of #1 Author’s books has 0 sales). I would like to decompose this data but I can’t because the author info has been “anonymized,” as has the title info. Therefore I can’t combine the info provided in the two sheets into one dataset. This is apart from the main problem, of course, which is that there is very little point to running more rigorous statistical analysis because the underlying data have essential reliability and validity problems.
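The release-date point in the first bullet can be illustrated with a toy projection. This is only a sketch under loud assumptions: the power-law decay shape, the exponent of 1.0, and both release dates are hypothetical, and with a single cross-section there is no way to estimate any of them from the report's data. It is meant to show why a book's age would change the projection, not to stand in for a fitted survival model.

```python
from datetime import date

# The report's snapshot covered January 28/29, 2014.
SNAPSHOT_DATE = date(2014, 1, 29)


def projected_copies(observed_daily_copies, release_date, snapshot_date,
                     decay_exponent=1.0, horizon_days=365):
    """Project copies sold over the next `horizon_days`.

    Assumes (hypothetically) that daily sales fall off as a power law in
    days since release: sales(t) is proportional to (1 + t) ** -decay_exponent.
    """
    age_days = (snapshot_date - release_date).days
    if age_days < 0:
        raise ValueError("release_date must not be after snapshot_date")
    return sum(
        observed_daily_copies * ((1 + age_days) / (1 + age_days + day)) ** decay_exponent
        for day in range(horizon_days)
    )


# Flat extrapolation: the same 7000 copies every day for a year.
naive_total = 7000 * 365

# Same one-day sales, two invented release dates.
brand_new = projected_copies(7000, date(2014, 1, 28), SNAPSHOT_DATE)
year_old = projected_copies(7000, date(2013, 1, 29), SNAPSHOT_DATE)

print(f"flat x365:        {naive_total:,.0f} copies")
print(f"released 1 day ago:  {brand_new:,.0f} copies")
print(f"released 1 year ago: {year_old:,.0f} copies")
```

Two books with identical sales on the observed day get very different projections once age is taken into account: under this assumed curve the brand-new release is near its peak and falls away quickly, while the year-old title has already shown staying power. Without the release date the two are indistinguishable, which is the information the scrape left out.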
There is a post at Digital Book World which provides descriptive statistics for the data (something the report’s authors did not, which is also a breach of data analysis norms). The data look to be skewed, and also to be non-normally distributed. I’m betting there are correlation issues that will wipe out at least some of the results in the pretty charts if we subjected those bivariate relationships to proper controls in a multivariate analysis. There is also an excellent criticism of the report’s discussion of star ratings here.
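Here is a toy illustration of how a bivariate relationship can vanish once a control is added. Every number below is invented for the example; the variable names (`self_published`, `backlist_size`, `daily_sales`) are hypothetical and none of the rows come from the report's spreadsheet, which does not contain backlist information at all.

```python
from statistics import mean

# Invented rows: (self_published, backlist_size, daily_sales).
# Sales are constructed to depend ONLY on backlist size (10 copies per title),
# and the self-published authors here happen to have bigger backlists.
books = [
    (0, 1, 10), (0, 1, 10), (0, 1, 10), (0, 5, 50),
    (1, 1, 10), (1, 5, 50), (1, 5, 50), (1, 5, 50),
]


def sales_gap(rows):
    """Mean daily sales of self-published rows minus mean of publisher rows."""
    self_pub = [sales for is_self, _, sales in rows if is_self == 1]
    publisher = [sales for is_self, _, sales in rows if is_self == 0]
    return mean(self_pub) - mean(publisher)


# Bivariate view: compare the two groups while ignoring backlist size.
raw_gap = sales_gap(books)

# Controlled view: make the same comparison within each backlist size.
gap_by_backlist = {
    size: sales_gap([row for row in books if row[1] == size])
    for size in sorted({row[1] for row in books})
}

print("gap ignoring backlist:", raw_gap)
print("gap within each backlist size:", gap_by_backlist)
```

Ignoring backlist, self-published books appear to outsell the others; holding backlist fixed, the gap is zero in every group. Whether anything like this is happening in the real data cannot be checked, because the omitted variables are exactly that: omitted.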
A sentence in the report has been making the rounds:
“Our data suggests that even stellar manuscripts are better off self-published.”
No. That conclusion is writing a check that the data can’t cash.
As an empirical researcher who respects the limits inherent in all data collection and analysis, my strongest advice is to read this report as you would read any interesting tidbit about the publishing industry. Treat it as entertainment, not information. If you’re interested in data analysis more generally, think of this as a stellar example of What Not To Do.
If you pushed me for a recommendation based on what I see in these data, I would say, after reminding you of the insurmountable shortcomings contained within them: If you plan on selling ebooks solely or primarily at Amazon and the opportunity cost of your time is greater than zero, you might want to submit to (and hope you are offered a contract by) an Amazon imprint. Because Amazon books do extremely well and the cut they take may well be worth the time you save doing all your own production and promotion. Somehow I don’t think that’s the takeaway the authors intend, but that’s an obvious one for me. But remember, I don’t have a dog in this hunt. I’m just looking at the data.