How (not) to lie with statistics

I’ve been looking at the Author Earnings Excel spreadsheets (the “raw data”) for the last few days. Many people have become very, very excited by the conclusions the authors draw from this data, and critics have consistently been shouted down as being pro-publisher, anti-self-publishing, or having various other axes to grind.

I have no dog in this hunt. I’m not a fiction author and very little of my income comes from royalties or other direct remuneration from writing. I have to write and publish in order to earn promotions and salary increases, and to be respected in my field, but that writing is judged on quality, reliability, and academic contribution. No one really cares about sales.

I do, however, care about how data are collected, analyzed, and reported, and this report doesn’t pass my smell test for reliability and validity. As a political scientist trained in both political science and sociological methods, I’ve conducted qualitative and quantitative studies and I’ve hand-built and analyzed a quantitative dataset comprising more than 3,000 observations. I’ve also taught research design and methods at the university undergraduate and graduate levels.

My concerns aren’t about math or statistics (there is very little of either being used here). My concerns are about (1) how the data were selected; (2) inferences from one data point to a trend (you literally cannot do this); and (3) inferences about author behavior that are drawn entirely from data about book sales. I would be less concerned if the authors of the report showed more awareness of the limitations of the data. Instead they go the other way and make claims that the data cannot possibly support.

I could write 10,000 words on this pretty easily (I started to), but I’ll spare DA’s readership that and try to be succinct.

(1) This is Amazon-only data. We call this a “sample of convenience,” i.e., the authors figured out a way to scrape Amazon so they did. They justify extrapolating from Amazon to the entire book selling market by saying that Amazon has been “chosen for being the largest book retailer in the world.” This is analogous to me studying politics in the state of Alaska and then claiming my study provides an accurate representation of politics in the entire USA because Alaska is the biggest state. Or studying politics in China and then claiming my study explains politics all over the world because China is the most populous country. In other words: No.

Amazon is the biggest, but it’s not the only or even the majority bookseller. And it’s not a representative bookseller statistically, i.e., there is no reason to believe that Amazon provides a representative snapshot of all book sales. It probably sells the most ebooks, it might sell the most titles, and it is where a lot of self-published authors sell their books. But it does not sell the most total units of (print and e-) books, which means comparisons across categories will most likely be skewed and unreliable. If the authors’ conclusions were limited to Amazon-specific points, I would be less bothered. But they are making claims about general author behavior based on partial, skewed data. No. No. No.

(2) This is one day’s worth of data, one 24-hour period of sales and rankings. We call this a cross-sectional study. Cross-sections are snapshots, which can be very useful to give you an idea of relationships between variables. But they have their own biases, and these biases can be irrelevant (good for your study) or relevant (bad for your study). In this case the 24-hour period comprised parts of the last Tuesday and Wednesday of January 2014. I can think of at least two relevant biases: (1) Books for the next month are frequently released on the last Monday-Tuesday of the previous month; and (2) people buy textbooks at the beginning of school semesters (January is the beginning of spring quarter/semester). This latter condition might create substitution effects (people spend their money on non-fiction books) which are not the same across all publishers and categories. I don’t know if these biases matter, but the point is that the authors don’t even tell us that they considered that there might be bias issues in picking this particular time period. I mistrust studies where limitations aren’t at least discussed.

(3) Cross-sections cannot give you trends. Trends need more than one data point. You cannot determine a trend from a single observation. If a book is #1 today, that doesn’t mean it will be #1 tomorrow. You cannot infer anything about the past or the future from a single data point in the cross-section.
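To make the point concrete, here is a minimal Python sketch. The two sales series are invented for illustration and do not come from the report. Two hypothetical books look identical on the one day a cross-section captures, even though they are moving in opposite directions.

```python
def change_over_period(daily_sales):
    """Return the change between the first and last observation."""
    if len(daily_sales) < 2:
        raise ValueError("a trend needs at least two observations")
    return daily_sales[-1] - daily_sales[0]


# Made-up daily unit sales for two hypothetical books over three days.
climbing = [10, 40, 120]
fading = [900, 400, 120]

# A cross-section keeps only the last day, where the two books look identical.
snapshot_climbing = climbing[-1:]
snapshot_fading = fading[-1:]
print(snapshot_climbing == snapshot_fading)

# With the full series, the two books are clearly heading in opposite directions.
print(change_over_period(climbing), change_over_period(fading))
```

Calling `change_over_period` on either one-day snapshot raises an error, which is the honest answer to “what is the trend?” when you have a single observation.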

(4) Nevertheless, the authors do try to infer from this. In fact, they do a lot of inferring that is analytically indefensible. Let’s take the inferences in turn.

  • They infer sales numbers from rankings (because Amazon does not publicly report sales), based on their own books and information from other authors. In asking around I have been told that the sales numbers correspond to rankings fairly well. I’m willing to believe this, but Amazon itself points out that rankings can change without sales figures changing and vice versa. This may be a bias that is sufficiently general that it doesn’t compromise the inferences. But it’s something to keep in mind.
  • They take the sales numbers for one day (an inference) and combine them with the publisher data, which gives them the gross and net sales figures for each book. They then take the author’s net revenue and multiply that number by 365 to get the author’s earnings for that book for the entire year. This is completely absurd. According to this “formula,” a book that sells zero copies on January 28/29 nets the author zero dollars for the year. A book that sells 7000 copies and is published by Amazon nets the author over $4 million. Not only is this unbelievable (99% of books move around the list over the year, unless they’re stuck at 0 sales), it casts doubt on every other data and inference decision the authors make. Why on earth would you accept information, let alone take advice, from someone who thinks this is a good way to calculate author earnings? You should be running far, far away. It is very bad analysis. It is horrifically bad inference. This criticism doesn’t even take into account the difficulty of estimating author earnings without including advances, but frankly, in my estimation it’s a sufficiently disabling criticism on its own.
  • Author behavior is being inferred from data on books. There is too much author behavior that is simply missing and can’t be inferred in any legitimate way. Authors make choices about editing, packaging, writing quality, etc. that affect reader decisions to purchase books. In addition, anecdotal evidence consistently points to the importance of a backlist, a backlist that can come from self- or publisher-published books. None of these variables are captured in this data. So what we risk is “omitted variable bias,” where the correlations are inaccurate because not everything that matters is present in the dataset.
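The “multiply by 365” step in the second bullet is easy to reproduce. The sketch below is my own restatement of that arithmetic in Python, not the report’s code, and the per-copy figure is a hypothetical value chosen only so the example runs; the report’s actual royalty assumptions are not reproduced here.

```python
def naive_annual_earnings(copies_sold_in_one_day, author_net_per_copy):
    """Multiply a single day's author revenue by 365, as the report's formula does."""
    return copies_sold_in_one_day * author_net_per_copy * 365


# Hypothetical author net per copy, in dollars (an assumption for illustration).
NET_PER_COPY = 1.75

# A book that happened to sell nothing on the sampled day "earns" nothing all year.
print(naive_annual_earnings(0, NET_PER_COPY))

# A book that sold 7000 copies that day is credited with that pace for 365 days.
print(naive_annual_earnings(7000, NET_PER_COPY))
```

At this assumed rate the second figure comes out above $4 million, and the only input driving it is where the book happened to sit on one day.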

(5) Many of the top selling books are Amazon imprints. Thomas & Mercer is the top-selling imprint in this dataset. That lucky author is on course to make $4 million, according to the report. (Or not, depending on which planet and universe you inhabit. Here’s hoping she lives in the alternate one.) It makes sense that Amazon does so well at Amazon, since they have many ways of boosting visibility and they naturally use those techniques to sell their own books. And since the NYT and USA Today don’t include exclusive-vendor books on their bestseller lists we can’t see how Amazon books do when we include other rankings. It’s a classic problem of comparability: Amazon doesn’t include pre-order sales in its rankings and NYT/USA Today don’t include Amazon-only books in their rankings, so you can look at apples at one and oranges at the other, but you can’t look at apples and oranges together.

The authors attempt to bolster their existing data by looking at Bookscan numbers, but because Bookscan doesn’t break down data between ebook and print sales and instead includes overall sales, the information revealed to us in this new report doesn’t help us situate the Amazon data in a larger context. At best the Bookscan numbers might reveal the proportion of books sold at Amazon relative to the larger marketplace captured by Bookscan (although Bookscan doesn’t account for all sales). But instead, the Bookscan numbers are used to compare e-book sales to print sales, which is a completely different issue.

(6) The authors provide the data in an Excel spreadsheet format so that the rest of us can analyze it. I appreciate this, and I’m happy to work with flat files, although there are a lot of advantages to relational databases (e.g., MySQL and Access). But when I downloaded the files I realized that (a) this is not “raw data” and (b) important information is uncollected or removed. In particular:

  • When the data were scraped, if they had picked up the release date it would let us know where the book was in its life cycle. Then we could apply a survival model, i.e., one that estimates a rate of sales decay over time, and the date would also help us identify whether the price in the cross-section is permanent or temporary. These data aren’t perfect, because publishers can change release dates (and there are different release dates for different editions, including self-pub updates). But being able to use even imperfect release date information would allow revenue projections to approximate something that isn’t prima facie absurd.
  • The “author data” sheet in the file combines all of each author’s books into one observation (one row in the spreadsheet) and labels them with one publisher category. This potentially conflates self- and publisher-published books, books across genre categories, and top-selling and lesser-selling books (case in point: #1 Book is 7000 sales and #1 Author has two books at 7000 sales, so one of #1 Author’s books has 0 sales). I would like to decompose this data but I can’t because the author info has been “anonymized,” as has the title info. Therefore I can’t combine the info provided in the two sheets into one dataset. This is apart from the main problem, of course, which is that there is very little point to running more rigorous statistical analysis because the underlying data have essential reliability and validity problems.
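To illustrate why the release date matters, here is a rough Python sketch. It is not a fitted survival model; it simply assumes, for illustration only, that daily sales fall off as a power law in the book’s age, with an exponent `alpha` that I have made up. Under that assumption, the same one-day sales figure implies very different yearly totals depending on how old the book is.

```python
def projected_units(observed_units, days_since_release, alpha, horizon_days=365):
    """Project units sold over the horizon, assuming sales(t) ~ (t + 1) ** -alpha.

    `observed_units` is the one-day figure from the cross-section, and
    `days_since_release` is the book's age on that day. Both the functional
    form and `alpha` are illustrative assumptions, not estimates from data.
    """
    age = days_since_release
    return sum(
        observed_units * ((age + 1) / (age + k + 1)) ** alpha
        for k in range(horizon_days)
    )


flat = projected_units(100, days_since_release=0, alpha=0.0)           # no decay: 100 x 365
new_release = projected_units(100, days_since_release=0, alpha=1.0)    # observed on release day
backlist = projected_units(100, days_since_release=1000, alpha=1.0)    # observed years in

print(round(flat), round(new_release), round(backlist))
```

With no decay the projection collapses to the report’s flat multiplication. With decay, a book caught on its release day projects far lower than a backlist title selling the same number of copies, which is exactly the distinction the missing release date would let us draw.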

There is a post at Digital Book World which provides descriptive statistics for the data (something the report’s authors did not, which is also a breach of data analysis norms). The data look to be skewed, and also to be non-normally distributed. I’m betting there are correlation issues that will wipe out at least some of the results in the pretty charts if we subjected those bivariate relationships to proper controls in a multivariate analysis. There is also an excellent criticism of the report’s discussion of star ratings here.
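As a small illustration of why skew matters when reporting descriptive statistics, here is a Python sketch using invented numbers (they are not taken from the spreadsheet). A single runaway bestseller pulls the mean far away from what a typical title sells.

```python
import statistics

# Invented one-day unit sales for nine hypothetical titles, one of them a runaway hit.
units = [0, 0, 1, 1, 2, 3, 5, 8, 7000]

mean_units = statistics.mean(units)
median_units = statistics.median(units)

# In a skewed distribution the mean says little about the typical book.
print(f"mean={mean_units:.1f} median={median_units}")
```

Reporting the median (and ideally the full distribution) alongside the mean is the kind of basic description the report’s authors left out.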

A sentence in the report has been making the rounds:

“Our data suggests that even stellar manuscripts are better off self-published.”

No. That conclusion is writing a check that the data can’t cash.

As an empirical researcher who respects the limits inherent in all data collection and analysis, my strongest advice is to read this report as you would read any interesting tidbit about the publishing industry. Treat it as entertainment, not information. If you’re interested in data analysis more generally, think of this as a stellar example of What Not To Do.

If you pushed me for a recommendation based on what I see in these data, I would say, after reminding you of the insurmountable shortcomings contained within them: If you plan on selling ebooks solely or primarily at Amazon and the opportunity cost of your time is greater than zero, you might want to submit to (and hope you are offered a contract by) an Amazon imprint. Amazon books do extremely well, and the cut they take may well be worth the time you save doing all your own production and promotion. Somehow I don’t think that’s the takeaway the authors intend, but that’s an obvious one for me. But remember, I don’t have a dog in this hunt. I’m just looking at the data.

How to read and preserve your ePub library

The news for the last couple of weeks has been one of frustration and anger for epub readers as digital rights management continues to thwart readers’ access to their legitimately purchased ebooks. Adobe announced that it would introduce a “hardened” DRM that most people believe will include some kind of “always on” component. Initially Adobe planned to move forward with the implementation of its new DRM in July. This DRM was not backwards compatible and some readers reported losing access to older DRM’ed books with the new Adobe Digital Editions upgrade. Sony has announced it is vacating the US and Canadian market and transferring readers’ libraries from Sony to Kobo by March 20, 2014 at 6 p.m. (EST).

When the transfer happens Sony does caution that “highlights, bookmarks and annotations you made in your Reader Store eBooks will not be available after you transfer your library to Kobo” and “in a few rare cases, ebooks purchased at Reader Store may not be available at Kobo for re-download. In these situations, it is recommended that you download a copy of these titles from Reader Store before April 30, 2014.”

I want to be clear before I go on that the only people who are adversely affected by this issue are individuals who have paid money for these books.  This post is for them. To preserve access to your ePub library, particularly from Sony, you need to take the following steps.

1) Download and install Calibre. Calibre is a free ebook cataloguing system. I highly recommend its use to ANY digital reader.

2) Google Apprentice Alf.  Apprentice Alf keeps a collection of DRM related plugins that work with Calibre.

3) Install the plugins.

Step 1: Open Preferences

The Preferences icon is found in the upper right-hand corner of the main navigation bar. You can also open Preferences with CTRL+P (PC) or CMD+P (Mac).

Step 2: Open Plugins.

Once the Preferences screen is open, scroll to the bottom and click on Plugins.

Step 3: Browse for Plugin.

Click on the little blue icon on the bottom right.

This should launch a dialog box where you can navigate to find your plugin. Highlight the plugin you want to add.   Click the “open” button and then the dialog box will close.

If you are preserving your ePub library, there are no other actions you have to take at this time. In other words, there are no customizations you need to make to the plugins you have just added. Simply click the green “APPLY” arrow in the upper left corner and then restart Calibre.
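If you prefer the command line to the Preferences screens, Calibre also ships a `calibre-customize` tool whose `--add-plugin` option takes the path to a plugin ZIP file. The Python sketch below just wraps that one command; the function names and the plugin path in the usage comment are made up for illustration, and it assumes Calibre’s command-line tools are on your PATH.

```python
import subprocess


def add_plugin_command(plugin_zip):
    """Build the calibre-customize command that installs one plugin ZIP file."""
    return ["calibre-customize", "--add-plugin", str(plugin_zip)]


def install_plugin(plugin_zip):
    """Run the command; raises CalledProcessError if calibre-customize fails."""
    subprocess.run(add_plugin_command(plugin_zip), check=True)


# Example usage (hypothetical path; point it at the ZIP you downloaded):
# install_plugin("Downloads/some_plugin.zip")
```

As with the steps above, restart Calibre afterwards so the newly added plugin is picked up.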

4) Download and install Reader software. It should open in the “My Library” tab. Click on the arrow next to the dropdown box and select “Purchased from local Reader Store.”

You will be asked to sign in to your Reader Store account. Once your Reader credentials are accepted, another popup screen appears asking for your Adobe ID (use the same ID that you used for Adobe Digital Editions; if you don’t have one, go ahead and get a new one).

(Note: If you don’t remember your Adobe ID, open Adobe Digital Editions, go to the HELP menu, and select “Authorization Information.” A screen should pop up and give you the email address associated with your Adobe ID. Often the email address is your Adobe ID.)

I fought with the Adobe ID screen several times, knowing I had entered the password correctly. I don’t know if the server was down or what, but I had to abandon the process.

5) Once Sony has allowed you to access your purchases, you should see a screen showing the covers of your books. Double click on the covers to download. If Sony says your books have been downloaded by another user, then you’ll need to make sure you’ve entered your Adobe ID credentials correctly.

Your ebooks will be saved in a folder on your hard drive called “My Books/Reader.” This is usually found in the My Documents (PC) or Documents (Mac) folder.

If you have problems authenticating your Adobe ID (which I did) you can go directly online and download the books there.

What downloads is a license, not the book. It’s called an ACSM file (and will be unhelpfully named “URLLINK.acsm”), and it basically calls back to the server holding the books and says that this computer is okay to deliver the book to. Then the computer gremlins push the book down the internet line into the My Books/Reader folder. UNLESS you have Adobe Digital Editions as your primary ebook software, in which case the book will download into the Digital Editions folder inside My Documents (PC) or Documents (Mac).
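Because the .acsm file is only a license, it can help to confirm that what landed in your folder is an actual book before moving on. Here is a minimal Python sketch, assuming the book arrives as an EPUB: an EPUB is a ZIP archive that contains a `mimetype` entry reading `application/epub+zip`, whereas an .acsm file is not a ZIP archive at all. The function name is made up for illustration.

```python
import zipfile


def looks_like_epub(path):
    """Return True if the file is a ZIP archive whose 'mimetype' entry marks it as EPUB."""
    if not zipfile.is_zipfile(path):
        return False  # e.g. an .acsm license file, which is not an archive
    with zipfile.ZipFile(path) as archive:
        try:
            return archive.read("mimetype").strip() == b"application/epub+zip"
        except KeyError:
            return False  # a ZIP file, but without the EPUB marker entry


# Example usage (hypothetical file names):
# looks_like_epub("My Books/Reader/some_book.epub")
# looks_like_epub("Downloads/URLLINK.acsm")
```

This only checks the container format. It does not tell you whether the book is DRM-protected or whether it will open on a given device.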

6) Once you have downloaded the ebooks, it is time to drag and drop. For ease of use, have Calibre open on one side of your screen and your folder of ebooks open on the other. Simply drag the ebooks onto the Calibre window. Your books are now backed up, preserved, and able to be read on nearly any device.

That’s it. I know it seems like a lot but it’s really not. To make life easier for you, you can learn how to use automating scripts so that when you download an ebook, it automatically gets imported into Calibre. Check out our posts for Macs and PCs.
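As a rough idea of what such automation can look like, here is a minimal Python sketch that finds the ebooks in a download folder and hands them to Calibre’s `calibredb add` command. This is my own illustration, not the script from the linked posts; the function names and folder path are hypothetical, and it assumes `calibredb` is on your PATH. Depending on your Calibre version, you may need to close the main Calibre window before running it.

```python
import subprocess
from pathlib import Path


def find_ebooks(folder, extensions=(".epub",)):
    """Return the ebook files in a folder, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )


def calibredb_add_command(files):
    """Build the calibredb command that adds the given files to the library."""
    return ["calibredb", "add", *[str(f) for f in files]]


def import_folder(folder):
    """Add every ebook found in the folder to Calibre; returns the files found."""
    files = find_ebooks(folder)
    if files:
        subprocess.run(calibredb_add_command(files), check=True)
    return files


# Example usage (hypothetical path):
# import_folder(Path.home() / "Documents" / "My Books" / "Reader")
```

Running something like this on a schedule, or from a folder-watching tool, gets you the same result as dragging and dropping by hand.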

I also recommend you check out our posts on backing up your digital library.