How (not) to lie with statistics
I’ve been looking at the Author Earnings Excel spreadsheets (the “raw data”) for the last few days. Many people have become very, very excited by the conclusions the authors draw from this data, and critics have consistently been shouted down as being pro-publisher, anti-self-publishing, or having various other axes to grind.
I have no dog in this hunt. I’m not a fiction author and very little of my income comes from royalties or other direct remuneration from writing. I have to write and publish in order to earn promotions and salary increases, and to be respected in my field, but that writing is judged on quality, reliability, and academic contribution. No one really cares about sales.
I do, however, care about how data are collected, analyzed, and reported, and this report doesn’t pass my smell test for reliability and validity. As a political scientist trained in both political science and sociological methods, I’ve conducted qualitative and quantitative studies and I’ve hand-built and analyzed a quantitative dataset of over 3,000 observations. I’ve also taught research design and methods at the university undergraduate and graduate levels.
My concerns aren’t about math or statistics (there is very little of either being used here). My concerns are about (1) how the data were selected; (2) inferences from one data point to a trend (you literally cannot do this); and (3) inferences about author behavior that are drawn entirely from data about book sales. I would be less concerned if the authors of the report showed more awareness of the limitations of the data. Instead they go the other way and make claims that the data cannot possibly support.
I could write 10,000 words on this pretty easily (I started to), but I’ll spare DA’s readership that and try to be succinct.
(1) This is Amazon-only data. We call this a “sample of convenience,” i.e., the authors figured out a way to scrape Amazon so they did. They justify extrapolating from Amazon to the entire book selling market by saying that Amazon has been “chosen for being the largest book retailer in the world.” This is analogous to me studying politics in the state of Alaska and then claiming my study provides an accurate representation of politics in the entire USA because Alaska is the biggest state. Or studying politics in China and then claiming my study explains politics all over the world because China is the most populous country. In other words: No.
Amazon is the biggest, but it’s not the only or even the majority bookseller. And it’s not a representative bookseller statistically, i.e., there is no reason to believe that Amazon provides a representative snapshot of all book sales. It probably sells the most ebooks, it might sell the most titles, and it is where a lot of self-published authors sell their books. But it does not sell the most total units of (print and e-) books, which means comparisons across categories will most likely be skewed and unreliable. If the authors’ conclusions were limited to Amazon-specific points, I would be less bothered. But they are making claims about general author behavior based on partial, skewed data. No. No. No.
(2) This is one day’s worth of data, one 24-hour period of sales and rankings. We call this a cross-sectional study. Cross-sections are snapshots, which can be very useful to give you an idea of relationships between variables. But they have their own biases, and these biases can be irrelevant (good for your study) or relevant (bad for your study). In this case the 24-hour period comprised parts of the last Tuesday and Wednesday of January 2014. I can think of at least two relevant biases: (1) Books for the next month are frequently released on the last Monday-Tuesday of the previous month; and (2) people buy textbooks at the beginning of school semesters (January is the beginning of spring quarter/semester). This latter condition might create substitution effects (people spend their money on non-fiction books) which are not the same across all publishers and categories. I don’t know if these biases matter, but the point is that the authors don’t even tell us that they considered that there might be bias issues in picking this particular time period. I mistrust studies where limitations aren’t at least discussed.
(3) Cross-sections cannot give you trends. Trends need more than one data point. You cannot determine a trend from a single observation. If a book is #1 today, that doesn’t mean it will be #1 tomorrow. You cannot infer anything about the past or the future from a single data point in the cross-section.
(4) Nevertheless, the authors do try to infer from this. In fact, they do a lot of inferring that is analytically indefensible. Let’s take the inferences in turn.
- They infer sales numbers from rankings (because Amazon does not publicly report sales), based on their own books and information from other authors. In asking around I have been told that the sales numbers correspond to rankings fairly well. I’m willing to believe this, but Amazon itself points out that rankings can change without sales figures changing and vice versa. This may be a bias that is sufficiently general that it doesn’t compromise the inferences. But it’s something to keep in mind.
- They take the sales numbers for one day (an inference) and combine them with the publisher data, which gives them the gross and net sales figures for each book. They then take the author’s net revenue and multiply that number by 365 to get the author’s earnings for that book for the entire year. This is completely absurd. According to this “formula,” a book that sells zero copies on January 28/29 nets the author zero dollars for the year. A book that sells 7000 copies and is published by Amazon nets the author over $4 million. Not only is this unbelievable (99% of books move around the list over the year, unless they’re stuck at 0 sales), it casts doubt on every other data and inference decision the authors make. Why on earth would you accept information, let alone take advice, from someone who thinks this is a good way to calculate author earnings? You should be running far, far away. It is very bad analysis. It is horrifically bad inference. This criticism doesn’t even take into account the difficulty of estimating author earnings without including advances, but frankly, in my estimation it’s a sufficiently disabling criticism on its own.
- Author behavior is being inferred from data on books. There is too much author behavior that is simply missing and can’t be inferred in any legitimate way. Authors make choices about editing, packaging, writing quality, etc. that affect reader decisions to purchase books. In addition, anecdotal evidence consistently points to the importance of a backlist, a backlist that can come from self- or publisher-published books. None of these variables are captured in this data. So what we are at risk of having is “omitted variable bias,” where the correlations are inaccurate because not everything that matters is present in the dataset.
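To see just how fragile the ×365 projection is, here is a toy illustration (all the numbers are hypothetical, including the per-copy royalty): two books with identical annual sales get wildly different “annual earnings” depending on which day you happen to sample.

```python
# Two hypothetical books with the SAME true annual sales:
daily_sales_book_a = [20] * 365          # steady seller: 20 copies every day
daily_sales_book_b = [7300] + [0] * 364  # one launch-day spike, then nothing

net_per_copy = 2.00  # assumed author net per copy, in dollars (hypothetical)

def true_annual_earnings(daily_sales):
    return sum(daily_sales) * net_per_copy

def snapshot_estimate(daily_sales, day):
    # the report's method: one day's net revenue x 365
    return daily_sales[day] * net_per_copy * 365

# Both books actually earn the author the same amount over the year:
assert true_annual_earnings(daily_sales_book_a) == true_annual_earnings(daily_sales_book_b)

# But the snapshot estimate depends entirely on which day was scraped:
print(snapshot_estimate(daily_sales_book_a, day=100))  # 14600.0 (happens to be right)
print(snapshot_estimate(daily_sales_book_b, day=0))    # 5329000.0 (absurdly high)
print(snapshot_estimate(daily_sales_book_b, day=100))  # 0.0 (absurdly low)
```

The single-day estimate for book B is off by a factor of 365 in one direction or infinitely wrong in the other, which is the $4-million-or-zero problem in miniature.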
(5) Many of the top selling books are Amazon imprints. Thomas & Mercer is the top-selling imprint in this dataset. That lucky author is on course to make $4 million, according to the report. (Or not, depending on which planet and universe you inhabit. Here’s hoping she lives in the alternate one.) It makes sense that Amazon imprints do so well at Amazon, since Amazon has many ways of boosting visibility and naturally uses those techniques to sell its own books. And since the NYT and USA Today don’t include exclusive-vendor books on their bestseller lists, we can’t see how Amazon books do when we include other rankings. It’s a classic problem of comparability: Amazon doesn’t include pre-order sales in its rankings and the NYT/USA Today don’t include Amazon-only books in their rankings. So you can look at apples at one and oranges at the other, but you can’t look at apples and oranges together.
The authors attempt to bolster their existing data by looking at Bookscan numbers, but because Bookscan doesn’t break down data between ebook and print sales and instead includes overall sales, the information revealed to us in this new report doesn’t help us situate the Amazon data in a larger context. At best the Bookscan numbers might reveal the proportion of books sold at Amazon relative to the larger marketplace captured by Bookscan (although Bookscan doesn’t account for all sales). But instead, the Bookscan numbers are used to compare e-book sales to print sales, which is a completely different issue.
(6) The authors provide the data in an Excel spreadsheet format so that the rest of us can analyze it. I appreciate this, and I’m happy to work with flat files, although there are a lot of advantages to relational databases (e.g., MySQL and Access). But when I downloaded the files I realized that (a) this is not “raw data” and (b) important information is uncollected or removed. In particular:
- When the data were scraped, if they had picked up the release date it would let us know where the book was in its life cycle. Then we could apply a survival model, i.e., one that estimates a rate of sales decay over time, and the date would also help us identify whether the price in the cross-section is permanent or temporary. These data aren’t perfect, because publishers can change release dates (and there are different release dates for different editions, including self-pub updates). But being able to use even imperfect release date information would allow revenue projections to approximate something that isn’t prima facie absurd.
- The “author data” sheet in the file combines all of each author’s books into one observation (one row in the spreadsheet) and labels them with one publisher category. This potentially conflates self- and publisher-published books, books across genre categories, and top-selling and lesser-selling books (case in point: #1 Book is 7000 sales and #1 Author has two books at 7000 sales, so one of #1 Author’s books has 0 sales). I would like to decompose this data but I can’t because the author info has been “anonymized,” as has the title info. Therefore I can’t combine the info provided in the two sheets into one dataset. This is apart from the main problem, of course, which is that there is very little point to running more rigorous statistical analysis because the underlying data have essential reliability and validity problems.
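To make the release-date point concrete: with even a crude exponential-decay model, the same one-day observation projects to very different annual totals depending on where the book is in its life cycle. A minimal sketch, where the decay rate k is entirely an assumption (in a real survival model it would be estimated from longitudinal data):

```python
import math

def project_annual_sales(observed_daily_sales, days_since_release, k=0.01):
    """Project first-year unit sales from a single daily observation,
    assuming sales decay as s(t) = s0 * exp(-k * t).
    k is a hypothetical decay rate (per day), not an estimate."""
    # Back out launch-day sales from the one observation:
    s0 = observed_daily_sales / math.exp(-k * days_since_release)
    # Integrate s0 * exp(-k*t) over the first 365 days:
    return s0 * (1 - math.exp(-k * 365)) / k

# Same observed day (100 copies sold), very different projections
# depending on the book's age:
print(round(project_annual_sales(100, days_since_release=5)))    # young book
print(round(project_annual_sales(100, days_since_release=300)))  # older book
# Versus the report's flat assumption: 100 * 365 = 36,500 for both.
```

The point is not that this particular curve is right; it’s that without the release date there is no way to place the snapshot on any curve at all.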
There is a post at Digital Book World which provides descriptive statistics for the data (something the report’s authors did not, which is also a breach of data analysis norms). The data look to be skewed, and also to be non-normally distributed. I’m betting there are correlation issues that will wipe out at least some of the results in the pretty charts if we subjected those bivariate relationships to proper controls in a multivariate analysis. There is also an excellent criticism of the report’s discussion of star ratings here.
A sentence in the report has been making the rounds:
“Our data suggests that even stellar manuscripts are better off self-published.”
No. That conclusion is writing a check that the data can’t cash.
As an empirical researcher who respects the limits inherent in all data collection and analysis, my strongest advice is to read this report as you would read any interesting tidbit about the publishing industry. Treat it as entertainment, not information. If you’re interested in data analysis more generally, think of this as a stellar example of What Not To Do.
If you pushed me for a recommendation based on what I see in these data, I would say, after reminding you of the insurmountable shortcomings contained within it: If you plan on selling ebooks solely or primarily at Amazon and the opportunity cost of your time is greater than zero, you might want to submit to (and hope you are offered a contract by) an Amazon imprint. Because Amazon books do extremely well, and the cut they take may well be worth the time you save doing all your own production and promotion. Somehow I don’t think that’s the takeaway the authors intend, but that’s an obvious one for me. But remember, I don’t have a dog in this hunt. I’m just looking at the data.
I don’t have anything to do with this situation either, I’m just a reader, and had no previous knowledge of the Author Earnings spreadsheet. But this post totally just made me geeeek OUT. You just made statistics and data collecting interesting to read. I hope you dropped a mic :)
I never bothered to read that post you’ve analyzed. I’ve seen many tweets about it — and many parody tweets about it too. Now that I’ve read all of your post, I’m glad I never wasted my time. And for all those who read the original post and thought it was revelatory, I have just one word for you: SUCKERS!
“No. That conclusion is writing a check that the data can’t cash.”
HAHAHAHAHA!!!
And, yes. So much yes to everything. I can only hope that Howey and all the other authors lapping this up are better writers than statisticians. Probably my flaw is being a better mathematician (not really a statistician) than writer. Oh, well.
I am mystified at the whole “critiquing the critics” of this study. As an author, why wouldn’t you want to use the best “brain tools” possible and produce a study that means something?
Sunita, how dare you bring rationality and credibility to data collecting and analysis ;). I’m not even going to pretend to understand what seems to be a complicated process but I’m taking your word as a scientist who works in this field.
I like to trust those who know what they’re talking about. There’s a reason I go to a doctor and not Google for a diagnosis.
You don’t need to know ANYTHING about research methods or data collection to understand that, just because a book sells X copies on one day, it will not automatically keep selling X copies per day for the next year. Or even the next week.
Thank you, Sunita, for writing such a well-reasoned, no-nonsense post, and articulating what I had been thinking about this whole thing from the start, but did not have the years of experience to explain. Data lovers unite!
Oh, Sunita, THANK YOU. I don’t do financial statistics, but I did a fairly comprehensive course of social science stats, and even though it’s been years since I worked with them, the first time I saw that data and the conclusions that went with it, I went “you have to be kidding me.”
In general, with all the brouhaha over this study, I’ve had to remind myself several times that for my own peace of mind I try never to attribute to malice what can as easily be explained by ignorance, but it’s hard not to see deliberate bias here—not so much in the collection of the data, but in what people are making of it.
I thank you from the bottom of my heart for writing a post I can simply point to and go “that.”
Were you equally critical of the author income report that averaged in the $0 income from people who didn’t even have their first book written?
@Cryselle: You can see the discussion about that DBW report in the comments to this Dear Author post. It’s hard to do such an in-depth analysis of its problems because the information in the press release is limited and most of us haven’t paid the $295 to see the whole thing. But on the other hand, I haven’t seen anyone using that report to make the kind of ludicrous assertions that many authors seem to be doing on the basis of Howey’s report.
Thanks for the kind words! I really wanted my points to be clear to people who don’t have a stats/research-design background, and I’m so glad it’s comprehensible.
@mari: There is a lively debate, with some very bright lines, between people who see themselves as championing self-publishing and those who assert that traditional publishing is still superior for most authors. There is a fair amount of distrust within each camp regarding the motives and interests of the other group, and that distrust extends (for some) to the studies each side produces. This comment at Robin’s Friday news post provides links that give you a sense of the background.
@Cryselle – you must be new to Dear Author, or you would know that skepticism has been expressed about the DBW/WD survey, as late as on Friday. https://dearauthor.com/features/industry-news/friday-news-ambitiously-presented-author-earnings-suspicious-book-similarities-uk-digital-lending-trends-and-video-game-romance-novel-covers/
In addition, if I can toot my own horn, see my comment on that page explaining why calling the DBW/WD survey “an author income report” is fallacious.
@Sunita: Thank you so much for a clear, level headed, rational and logical look at Howey’s website. Of course, since you took the time to establish your bona fides as a social scientist, qualitative/quantitative analyst and academic, prepare to be attacked by Howey and acolytes for being an intellectual smarty-pants and therefore automatically suspect.
It’s astounding to me how anyone possessing an ounce/gram of common sense can look at Howey’s numbers and take them for anything but what they are: a one day crawl of carefully selected genres sold in one format in one store. I respect Howey’s ability to drum up sound and fury, but it signifies absolutely nada. His study is as substantial as sugar water, flavored with gold rush dreams. I feel sorry for anyone who swallows Howey’s Kool-Aid without a heaping side order of skepticism, logic, and basic math ability.
You left out my favorite inference: self-pubbed books are higher rated than trad-pubbed books (by a number that others have pointed out is statistically irrelevant) because they’re approaching par on quality and self-pubbed books deliver more value for money. Which is such an obvious conclusion to grab from this one day crawl, natch (:roll eyes), and assumes that reviewers are a monolith whose behavior and attitudes are consistent across thousands of individuals.
you might want to sign up with an Amazon imprint
FYI, you can’t sign up with an Amazon imprint. Just like with the NY publishers, you have to submit and be considered. They will often make offers to successful indies, but publishing with Montlake or 47North or Thomas & Mercer requires the same submission and acceptance/rejection process as do all publishers. Though if that’s what you meant, ignore me. ;)
@Cryselle: I’ve neither criticized nor praised the DBW study in any meaningful way. How could I with any credibility, when I haven’t read the full report or seen the data? It’s a very expensive report (as Ros notes) and the data are proprietary. I’ve commented in passing (on Twitter, I believe) that the inclusion of authors who have not published a book yet yields results that are not particularly useful if you want to know about published authors. I’ve also commented on the clear competence of the DBW study’s author at Carolyn Jewel’s blog because I thought she was being unfairly criticized for “flaunting” her very relevant credentials. I can’t speak to the competence of the “Data Guru” on the Author Earnings report because his identity hasn’t been revealed. [ETA: And I don’t have any interest in talking about his competence; the data and report speak for themselves.]
@Alison Kent: I did know that, but I phrased it badly, so I will definitely not ignore your comment and I thank you for the correction. ;)
In the oil business, the “decline curve” of a well is looked at carefully by the production folks. Every well does its best the first day, and how the well’s production changes over time tells you what kind of reservoir it’s draining. One benefit of my wife’s book being published by an Amazon imprint last year was the day-to-day sales data Amazon provides. Matching that up with sales rankings gave a good snapshot of that book’s sales cycle, ranking vs. sales. The “sales cycle” turned out to be remarkably similar to that of a certain type of oil well drainage, a depletion drive – most of the wells in the news you hear about “fracking” are that type. Simply put, there is a predictable decline for any one particular book, but the _rate_ of decline differs when sales are plotted on a logarithmic chart.
Here’s an image of just one well’s oil and gas decline http://i.imgur.com/Jxnv29Q.jpg
You can see the ski slope shape – steep at the top, and then a long tail. Software exists to examine this ski slope shape, as some wells have steeper drop offs in production than others.
Here is an example of some production, and then predictions on how the remaining production will go:
http://i.imgur.com/kJqXCft.jpg
You can see there are three “tails” on this “long tail” chart, representing three curves. These “b factor decline curves” are assigned different b values depending on their shape. The top curve represents production with the slowest dropoff; the others drop off more quickly. Some wells will have very similarly shaped curves, even though one well has significantly higher production on the first day.
Book sales have to be similar to this. I don’t know this for fact, but that’s my assumption. On the last graphic, you can imagine book sales with a popular author having the slowest total decline. The poorest curve would represent a small fan base with no word of mouth spread.
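For anyone who wants to play with the shape, the Arps family of decline curves (the standard b-factor model in petroleum engineering) can be sketched in a few lines. All rates here are hypothetical; this is an illustration of the curve family, not my wife’s actual sales:

```python
import math

def arps_rate(qi, di, b, t):
    """Production (or sales) rate at time t under Arps decline.
    qi: initial rate; di: initial decline rate per day;
    b: decline exponent (b=0 exponential, 0<b<1 hyperbolic, b=1 harmonic)."""
    if b == 0:
        return qi * math.exp(-di * t)
    return qi / (1.0 + b * di * t) ** (1.0 / b)

# Same first-day rate, different b factors -> very different long tails:
for b in (0.0, 0.5, 1.0):
    print(f"b={b}: rate after a year = {arps_rate(1000, 0.05, b, 365):.2f}")
```

The higher the b factor, the fatter the tail a year out, which is exactly the difference between a book that dies after launch and one that keeps selling on word of mouth.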
Granted, not every book sale is on Amazon. But Amazon provides a sales rank for each book it sells and Amazon sales rank DOES represent the popularity of a book. But as mentioned in the above blog post, the rankings of Amazon Imprints can’t be compared directly to those of books selling outside of Amazon. A Thomas & Mercer book can’t be compared in total sales to the latest James Patterson thriller. Plus, you have the issue that Amazon Imprints sell very very few paper copies relative to the Kindle sales. So, Amazon has a paper book ranking as well as a Kindle book ranking. Comparing the paper and Kindle rankings and including any Amazon Imprint numbers would be a mistake.
So, the numbers presented are wrong insofar as there’s not enough data to believe in the accuracy of the charts as presented. But I believe the charts could still be correct even using the improperly collated data. Specifically: The Big 5 publishers take more money than they should from many authors by virtue of higher prices. Self publishing wouldn’t have gotten as far as it has unless the author could lower the price on her own books to some magical sell point. Amazon Imprints have come in and priced things in the middle and are making more money for former mid-list authors than they would at the Big 5. Amazon has been able to do this because they own and control what goes on the Kindle.
Some may think the data in the authorearnings sheet may not reflect my thoughts, but I am confident they are accurate. I am confident because, after four dozen books that each sold fewer than 20,000 copies, my wife’s book published by Amazon sold well over 100,000. At full price. It did not make the NYTimes Bestseller List. Or USA Today’s. Because it was only sold on Amazon. We have not provided any data to these lists, btw. I only mention the 100K number because that’s when Amazon editors will provide you with a cute plaque saying so, and we get to show that off.
Amen, sister. Thank you for writing it all down. I used to be a market analyst with General Foods (now Kraft) many moons ago, and that report just looked wrong to me, or rather, the extrapolations. There didn’t seem to be any distribution analysis, either. And you have to take raw data in context when you’re interpreting. A good report will always take account of lags and other factors. This one didn’t even try.
@Walt: I don’t think anyone is discounting the data. As Sunita said ” In asking around I have been told that the sales numbers correspond to rankings fairly well. I’m willing to believe this, but Amazon itself points out that rankings can change without sales figures changing and vice versa.”
The problem is extrapolating that data to match those conclusions or really to be anything but a confirmation of what the data says. That X book sold X number of copies (give or take 10% either way) on a given day. You can also probably extrapolate that on any given day the Kindle bestseller list will contain traditionally published books, Amazon published books and self published books. That’s a conclusion that is markedly different than how the market existed even a couple of years ago.
Beyond those conclusions, however, the data can’t support much else. It’s a start, and if Howey et al had presented it as such there probably wouldn’t be much to take issue with. It’s the conclusions and extrapolations that Howey presents with breathless excitement that are problematic.
I’m not sure what his aim is. If it is to help authors make an informed decision about where they should publish their next book, it doesn’t achieve that goal because the conclusions are fallacious at best. If it is to encourage publishers to change their terms, it actually hurts, because publishers know that the conclusions presented about their books are inaccurate, casting doubt on other self-publishing claims.
There’s no question that some authors are better off self publishing and others are better off doing other things, but the data as presented don’t fulfill the goal of assisting any decision making process; they can only serve to confirm an author’s decision to do A or B.
Seriously awesome – I saw this survey briefly mentioned in another DA article but did not even go read it. No, let’s just say that it is really hard to make me interested in statistics and you so did.
I started to write out a comment and then realized it was just too massive, so put this on my blog.
http://www.courtneymilan.com/ramblings/2014/02/16/some-thoughts-on-author-earnings/
@Walt:
My experience is that it’s not quite a decline curve, because the well can refill with visibility events–for instance, I always see a jump in the rankings for my book when I release a new book, or do a Bookbub ad, etc.
Thanks for this explanation, Sunita. I think when we have two people with expertise in data analysis (you and Dana Weinberg) pointing out the issues with various studies, we should all be taking a step back and asking how and if we can get a more rigorous study in place.
But, I also want to ask — why are these studies necessary, in the political sense? Good data and good analysis are, in my opinion, always a good thing. What I mean is, WHY is this subject suddenly such a flash point?
I know what I think. In the current environment, there is a class of authors who in the past have been pedestrian money makers for traditional publishers, and that class of authors are very likely to make more money self-publishing.
It’s dangerous, in my opinion, for publishers to harp on “self-published authors don’t make much money.” Traditional publishers are not going to offer contracts to most of them. But for the authors they WOULD offer a contract to? Those contract terms are now competing with self-publishing. At the moment, purely on the basis of money offered and the terms of those contracts, such offers are not competitive.
@Jane: Go read Walt’s comment again. He is saying yes you can extrapolate the sales data and it is remarkably similar to an oil well decline curve.
@Josh: We can’t extrapolate where we are in the curve, even if we agree on the shape of the curve, because there is nothing in the data that provides us with that information. *That’s* what we can’t extrapolate. You cannot identify a trend from a single data point. The only reason we would be able to do it with date-in-life-cycle-of-book data is because we could use that observation to go in and create another variable (the number of days the book has been available for sale).
@Walt: I agree that this is a plausible sales curve. But there are others. In addition to the points Courtney Milan raises about bumps based on external shocks (such as reviews in prominent places), we are likely to see also, for less well known authors, a curve that looks more like a normal-distribution curve, which starts low, builds, peaks, and then declines (with or without some external-shock bumps along the way).
I have no trouble believing that the AE data reflect your experiences. I just don’t know where your experiences fit in the overall picture. And the AE data can’t help me figure that out because it’s not a representative sample.
@Carolyn Jewel: I don’t disagree at all with your point about the relative success of self-publishing (or Walt’s comments along the same lines). I think that it’s incontrovertible that at least *some* authors are making more money self-publishing than they would have signing with a traditional publisher, and some may make more money than with Amazon. In Alison Kent’s case, signing with Amazon seems to have been considerably more lucrative for her than the sales she was making on her previous books, which I believe were NY-published.
And I can well imagine that publishers are worried about this and unwisely making sweeping claims about how *no* authors can do better self-publishing.
I’ve been thinking about what might be possible in terms of building a representative sample, and it’s both difficult and expensive. Randomizing the sample is a non-trivial problem. We might be able to use Amazon’s massive title collection (along with some other retailers) to generate such a sample, but then you’d have to decide whether to stick with observational data (book info) and infer author behavior, or you could opt to conduct a survey of the authors (direct data on attitudes, but at least somewhat self-reported) whose books are pulled in that random draw (stratifying the sample and oversampling where necessary). Either way would be expensive and time-consuming.
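As a very rough illustration of that stratified draw (everything here is hypothetical — the strata, the sizes, and the title universe are made up for the sketch):

```python
import random
random.seed(42)

# Pretend universe of titles tagged by publisher type (the strata):
titles = [("title_%d" % i, random.choice(["self", "big5", "amazon", "small"]))
          for i in range(100_000)]

def stratified_sample(titles, per_stratum=500):
    """Draw an equal-sized random sample from each publisher-type stratum."""
    by_stratum = {}
    for title, stratum in titles:
        by_stratum.setdefault(stratum, []).append(title)
    # Oversampling each stratum equally keeps small groups analyzable;
    # weights would be applied later to recover population proportions.
    return {s: random.sample(pool, min(per_stratum, len(pool)))
            for s, pool in by_stratum.items()}

sample = stratified_sample(titles)
print({s: len(v) for s, v in sample.items()})
```

Even this toy version shows where the expense comes in: you need the full title universe before you can draw from it, which is exactly the hard part.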
You could also continue with the Author Earnings data scrapes but do them in a more rigorous and self-aware way, taking into account the potential biases of selecting particular days and/or particular months and compensating for that and adding more variables. And of course relying more on the real “raw” data and avoiding the logic-of-inference problems that permeate the current analyses.
@Janie Watson: I meant to respond to you earlier, sorry! I agree the star-grade stuff is eyebrow-raising. I think the difference is actually statistically significant (I saw a comment somewhere in which someone ran a chi-square test on the data and found significance). But unless we know that the Fiverr, Friend-and-Family, etc. reviews do not differ systematically across the two samples, we can’t say much about how well the star grades reflect honest reader attitudes.
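For readers who want to see the mechanics, here is a minimal sketch of that kind of test on entirely made-up counts. Note how, with samples this large, even a small practical difference (86% vs. 82% high-star in this toy example) sails past the significance threshold, which is exactly why significance alone doesn’t settle anything about reviewer behavior:

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    expected = [
        (a + b) * (a + c) / n, (a + b) * (b + d) / n,
        (c + d) * (a + c) / n, (c + d) * (b + d) / n,
    ]
    observed = [a, b, c, d]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts: (high-star, low-star) for self-pub vs. trad-pub.
stat = chi_square_2x2(4300, 700, 4100, 900)
CRITICAL_05_DF1 = 3.841  # chi-square critical value for df=1, alpha=0.05
print(stat > CRITICAL_05_DF1)  # True: "significant" despite a small difference
```

Statistical significance here tells us the difference is unlikely to be chance; it says nothing about whether the ratings were honestly generated, which is the question that matters.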
Thank you for the article, Sunita. Excellent points, and there’s nothing I can add except a minor correction.
You state, “Amazon doesn’t include pre-order sales in its rankings.” Their info about pre-orders on the help page you referenced earlier in the article specifically refers to pre-orders not being included in sales reports. They do indeed use pre-orders for sales rank, which is a separate beast.
You can see this on their books coming soon page:
http://www.amazon.com/s/ref=lp_283155_nr_p_n_publication_date_2?rh=n%3A283155%2Cp_n_publication_date%3A1250228011&bbn=283155&ie=UTF8&qid=1392581648&rnid=1250225011
Pre-orders acquire sales rank as soon as the first order is placed, and some even reach top 10 in various categories, pushing ahead of books which are actually available now.
One of the things that seems clear to me is that, unless you are published *by* Amazon.com, to reliably make a living self-publishing, you have to write quickly and have multiple books available.
I think sometimes, when evaluating numbers, people look at the traditionally published figures and say: “But you would have made *way more* money if you had done it yourself.” And, if all things were equal, that might be true.
But if you write a book every three years, I don’t think self-publishing is necessarily your best option. The trajectory for most first self-published books by unknown authors is not good. Everyone expects that. It’s the accumulation of backlist, of virtual shelf-space, that seems to build momentum (this was, in the old days, true of paperbacks; in 1990, sales reps said it took about 5 books on the shelf before you had a sense of whether or not an author truly had ‘legs’). Since, in traditional publishing, the opportunity to *have* 5 books on the shelf is pretty much gone unless you have already proven you have legs, this isn’t a good traditional publishing bet.
But. I’ve said elsewhere, and will use this as an example again, Patrick Rothfuss. Name of the Wind. I loved that book. I think it’s brilliant. I handsold it to anyone who said the word ‘fantasy’ and stood still for more than 3 seconds. I am, of course, not the only bookseller to have done this. Patrick sold every conceivable translation right before book one was published, or just after. He came out in hardcover, and his book was – the last time we got the first book in hardcover – at 7 printings. He came out in mass market. He was widely, widely read.
And he wouldn’t have gained this traction with a book that was essentially 300K words and a single, lone, self-pub in epic fantasy. To gain traction, he would have probably needed a minimum of 3 books (my sense is higher, but let me be optimistic) self-published. And: it took him 3 years to write book two. Book two debuted at #1 NYT. Book 2, self-published, would not have done that.
At one book every 3 years, to gain the type of traction that would have rewarded him financially, he would probably have taken 9-20 years. So: in the case of someone whose writing process is Patrick Rothfuss’, I don’t think self-publishing is the best option. Yes, he’s an outlier. But – so much of the discussion on either side of the fence is about the outliers.
I don’t think that self-publishing is terrible. I have done some, and hope to do more in the future. I watch & read & collect data and evaluate it; I read really really interesting blogs & etc. I think it is absolutely true that midlist authors who can gain traction do *better* self-publishing, and for highly adopted genres (romance especially) I think it’s pretty much the future.
But…I don’t think it’s one-size fits all. There is a lot of luck involved, no matter which way you choose; there’s a lot of failure involved no matter which way you choose. And, of course, there is success.
The study made sense to me: if you are talking about self-publishing, you focus on Amazon.com. That’s where the bulk of your sales are going to be. To self-publishers it’s a lot of information they didn’t have, in one large mass.
But…as proof that you are better off self-publishing, I don’t think it works. It really is case-by-case. Even if you write one book a year – if you self-publish, it will take a while to build your backlist. If you traditionally publish, and you do not gain huge market traction – e.g., Konrath – you still have a base of readers, and if you then self-publish you’re not starting out at square one. If you write a book every 3 months, I’d have a different opinion.
@Jane:
“If it is to help authors make an informed decision about where they should publish their next book, it doesn’t achieve that goal because the conclusions are fallacious at best. If it is to encourage publishers to change their terms, it actually hurts because they know that the conclusions presented about their books are inaccurate casting doubt on other self published claims.”
The claim has been made elsewhere that the Big 5 publishers don’t market their own authors’ works as well as they might. Whether this is true or not, it’s clear, at least to me, that this is what the charts are truly intended to help portray. I haven’t looked deeply into the actual data to see how flawed it may be. Moreover, any analysis of the marketplace involving Amazon as a player is going to be a moving target. There are several variables I can list off the top of my head that could change things quickly. The rate at which Amazon pays its content creators is reportedly changing (or not, depending on who you listen to), and how Amazon supports its creators with advertising on the front page of Kindles can easily change. Even the price Amazon charges for their Kindles could influence total future book sales. I’ve heard claims that the “ebook market has stabilized in the percentage of the total market,” but if there’s a sale on ebook devices, more people will have them — and that has to come into play.
Ask an author whether they want the rights to their old manuscripts back. You could find several authors who would just as soon let the publisher keep hawking the thing – mainly because they don’t want to be bothered with the hassle of converting (and that includes [gasp!] scanning old paper!) their old work into ebook format. But by and large, authors who have taken the time to rework and resell their old works of fiction have found that they’re selling more copies than the publisher was. It turns out that working the price model on old books will get a sale when there wasn’t one before. So, on old, out-of-date books, the Big 5 publishers can easily be seen to be either greedy or stupid, because they have this same power and refuse to use it. The data in the author earnings chart doesn’t reflect this, but I’m not sure there are many often-published authors who don’t know it.
If the anecdotal data that I am aware of with respect to old book sales is accurate, how much of a stretch is it really to assume that, within certain sales limits, it also applies to new books of fiction? Isn’t this what Amazon accomplished by lowering the sales price point?
I would imagine that with big print runs and great advances, the Big5 publishers are tied to promoting the best of their contracted creative talent – the biggest successes, and even the giant failures. The mid-list author seems more like the steady income, while all the excitement is at the top of the charts.
Here’s the “So what?” point I’m getting to.
Hypothetical question: What happens if mid-list authors leave the Big5 and go out on their own?
I don’t have an answer for this, just putting it out there.
One more hypothetical question: What happens if one of the Big5 decides to choose a few books to lower the price on, and sees an increase in sales? I’m confident that’s why they’re opening up the ebook-only format. To do just this. I mean, can you imagine the inner turmoil in the boardroom if the ebook faction of a pub house lowered the price on an ebook (but not the print) and so many more print books were then left in the warehouses? And if you lower the price on the print books, the brick and mortar stores make less money from each sale? I’m not in the business, so I profess a cluelessness about it. So, I’ll shut up now.
Last thing: Barnes&Noble could be in the same spot as Amazon is now, with respect to selling ebooks. You authors out there marketing your own books know all too well B&N was very slow when dealing with the technology to help self-pubs sell more books.
Thanks for your site, Jane.
@Michael: Replying to myself here, since I didn’t make it clear that this salesrank includes only sales via Amazon.com. Just clarifying since that help page on Amazon is talking about Nielsen BookScan numbers.
The single day ranking doesn’t capture presales. So if a book is released on the Tuesday of the collection date and given a rank of X, the presales aren’t reflected in that ranking, because presales accumulate over a period of time. I know of several authors whose presales at Amazon far exceed the average sales per ranking publicized on various sites, but their first day ranking on Amazon doesn’t reflect this.
Self published authors who have been bestowed with the presales option have discussed how, because of the way Amazon’s secret algorithms take hour-by-hour sales velocity into account, presales can adversely affect both initial ranking and stickiness.
The benefit of presales is that you can accumulate sufficient sales to appear on the NYT list or high on the USA Today list but it’s a trade off in release day/week visibility and ranking.
@Sunita:” In addition to the points Courtney Milan raises about bumps based on external shocks (such as reviews in prominent places), we are likely to see also, for less well known authors, a curve that looks more like a normal-distribution curve, which starts low, builds, peaks, and then declines (with or without some external-shock bumps along the way).”
First, I forgot to thank you, Sunita for your article.
As to bumps, that’s been the most fascinating thing! In the oil patch, there are times when you work over a well. This happens when a free-flowing well declines and needs help. You shut it in and add a pump at the surface, or a gas lift (injecting gas down so it bubbles up like the old-style aquarium pumps). This will bump the production considerably and there’s a new decline – but ultimately the longer decline curve is pretty much held to.
With my wife’s book at Amazon, there was a corresponding bump in ranking (and sales) for every move they made. Some worked better than others. Lowering the price was something that was eventually done, but just for a week or so. Ultimately, you do get more sales, but with this book the sales ultimately started matching up with the predicted curve. So the sales bumps were pretty much a finite thing.
She has another book coming out in the same series very soon. I’ll watch this one, too.
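The workover analogy maps neatly onto a simple decline-curve model. This toy simulation (an assumed exponential decline with invented numbers, not anyone’s real sales) shows a promotion producing a large but short-lived bump while the long-run curve reasserts itself, much as described above:

```python
import math

# Toy model: baseline daily sales decay exponentially from launch;
# a promotion on a given day adds a temporary bump that also decays.
# All numbers here are invented for illustration.

def daily_sales(day, promo_day=None):
    base = 100 * math.exp(-day / 60)                    # long-run decline curve
    bump = 0.0
    if promo_day is not None and day >= promo_day:
        bump = 40 * math.exp(-(day - promo_day) / 5)    # short-lived boost
    return base + bump

with_promo = [daily_sales(d, promo_day=90) for d in range(180)]
without = [daily_sales(d) for d in range(180)]

# Large gap on the promo day; the two curves reconverge within weeks.
print(f"day 90:  {with_promo[90]:.1f} vs {without[90]:.1f}")
print(f"day 150: {with_promo[150]:.1f} vs {without[150]:.1f}")
```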
@Walt I am typing on my phone so I’ll come back to respond to your comment. Two things stick out to me though. First, a trad publisher’s success and health rests on big books, not midlist. It really only takes one or two big books for a publisher to have a good year. The reliance on the blockbuster is unfortunate, and I think that is what harms midlist authors the most, because publishers spend more money pushing a big book than they do on their lower-performing books.
For many midlist authors self publishing is a lot better than advances of one or two thousand dollars per book and rights that aren’t returned for thirty some years.
As for lower priced ebooks, that is happening already. HarperCollins was the first to do this during the dark days of Agency pricing, and they have continued to do this along with every publisher except maybe Random House – but Macmillan, Hachette, Penguin, etc., are engaged in discounting.
@Jane: I’ve heard people talk about pre-orders and how they affect release day ranks and lists, but I always assumed that looking at a pre-order’s rank on a given day before it actually releases will still tell you how many presales it had that day relative to the sales of other books in the store, so the only time the rankings vs sales are out of whack is actually on release day.
But you know what they say about assumptions, so I could be way off base.
@Jane: I didn’t mean that the website will report sales rank the moment a sale is made. Just that it’s acquired. There’s an overall delay in reporting of sales rank for new books ranging from minutes to a day, regardless of whether they’re pre-orders or orders, but it will be reflected in the rank soon after that first sale even if they weigh the value of pre-sales differently. Are you sure that’s not the delay you’re describing? I’ve seen pre-sale books achieve single day ranking their first day out. Unfortunately Amazon’s opaqueness makes it difficult to generalize that any particular behavior we witness is the rule. It’s taken years for authors to compile enough comparative data to even guesstimate the relationship between rank and sales, and even those best guesses miss the mark by a large margin for some authors I know. Amazon’s really the only one who knows the Colonel’s recipe.
As far as the possible adverse effect of presale ranking on actual release, that’s a very important consideration authors should weigh. A similar pitfall awaits authors who enable preorders on All Romance, since the only time a book appears on the new releases section of the front page is when it’s first on sale — whether it’s pre-order sale or not. Since many fewer customers will order ahead, an author doing pre-orders sacrifices having any front page visibility on release day.
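The rank-to-sales “guesstimates” mentioned above are usually fitted as rough power laws. A minimal sketch of the idea, with invented coefficients (the real published fits vary by source and by era, which is exactly why estimates miss by large margins for some authors):

```python
# Two hypothetical power-law fits of daily sales to Amazon sales rank.
# Published author-compiled estimates take this general shape; these
# coefficients are made up and should not be used for real estimates.

def sales_model_a(rank):
    return 10000 * rank ** -0.85

def sales_model_b(rank):
    return 6000 * rank ** -0.75

for rank in (100, 1000, 10000):
    a, b = sales_model_a(rank), sales_model_b(rank)
    print(f"rank {rank:>6}: model A ~{a:.0f}/day, model B ~{b:.0f}/day")
```

Two fits that nearly agree at rank 100 can disagree by 50% or more at rank 10,000, so any dollar figure computed from a single snapshot of ranks inherits that uncertainty.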
Really. Who cares? It doesn’t change how many books I sell or how much I make. I find all these surveys a bit ludicrous as people try to make a case – for what? For one author trad publishing is the smart move. For another going indie is. For some, it’s hybrid, a term I coined in June 2011 that seems to have caught on.
I prefer not to be a statistic.
Successful Authors Are Outliers, Not Statistics http://writeitforward.wordpress.com/2013/12/10/successful-authors-are-outliers-not-statistics/
Jane wrote – For many midlist authors self publishing is a lot better than advances of one or two thousand dollars per book and rights that aren’t returned for thirty some years.
This is the case, there is no doubt about it: just about every midlist author makes more money self-publishing.
If the Big 5 don’t keep them, I’m sure the odds of finding that next bestseller drop significantly.
I don’t really care about the stats. I’m an author, and I made a decision to trial different channels to market and to find what works for me! Every author will be different depending on many variables. I’d rather see us talk about supporting an author’s choice to publish through whatever channel suits them – TP or SP, hybrid or SP alone. I have writing friends who have never TP’d and who are making what they regard as a living wage off writing. They may never make a NY Times or USA Today list (they do make best-seller lists on Amazon, B&N and Apple), but the bigger lists – they don’t care. They tried the TP way and got nothing. But readers seem to like their stories and they make enough money to give up the day job. I think that’s wonderful. I think the readers benefit too. They get stories they might never have had any exposure to. SP is a new and exciting world. I’m enjoying the ride.
Thanks for the article, but I agree with Bob Mayer. I’ve enjoyed reading everyone’s analysis and frankly I record my sales figures every day and I see the rise and fall and have learned what works and what doesn’t. Have I made a million dollars? Not yet, but for all of us this is a journey and frankly if you don’t believe in what the Indies are making there is nothing I can say, no statistics I can prove to change your mind. And frankly after the last couple of weeks I’m tired of the discussion. So I will smile and go on about my every day business and thank God for my readers and my indie books. Your article is probably the last one I’m going to read for awhile about statistics on who is making what in publishing. I’m over it.
Thanks, Sunita, for taking the time to do this right. I know there are some people who’ve learned from it, even if there are others who should but won’t.
As for “who cares?” I do! I care that people who don’t seem to really know what they’re doing with statistics are claiming to have THE ANSWER — and that their visibility in the field of publishing might get people to listen without realizing that the use of statistics is off.
It’s clear why people are interested in knowing more about what works in getting published/making sales, and I think it’s great when people who have data are willing to share it. But I’m grateful for people who understand data analysis stepping up to help in that process.
@Bob Mayer:
That’s what I think. Like anyone in a business, make a plan and trial what you think works for you. The freedom of the many different channels to market is the point we should focus on. SP or TP or hybrid depends on the author and what works for them.
@Michael – No. My understanding (and this is based on what authors share with me, not my own experience) is that presales can be captured in rankings prior to the release but that they don’t count on the day of release. Only the sales made that day are factored into the ranking. So an author can have, say 20,000 in presales (units just at Amazon) but if they make only 1500 sales on the day of release, then their ranking on release day is reflective of the 1500 NOT the 20,000.
Actually he did say it was a start, there were serious limitations, and that more data, more cross-sectionals, more refinement, and more retailers are coming, so your taking issue with him NOT saying it’s just a start is unjustified. He DID say that. He also said everywhere he extrapolated or guessed.
@Jane: Thank you for the clarification. I see we’re talking about two different things then. I was only pointing out in my original comment that pre-order sales do acquire salesrank. Whether or not the pre-sales that influenced that rank count for naught on release day, unless Howey’s data-scraping friend has specifically filtered out pre-release titles from the data, some are going to be there in the list of top 7000 books, and Howey is going to be way off in his sales guesstimates for those titles. I haven’t looked at the data myself yet to see if pre-orders are included; I only mentioned the possibility for anyone else who may dig into it, since there’s so much else flawed about the report. If it was already made clear somewhere that pre-orders are not present in the scraped data, my apologies.
@Liana Mir: If only he had said that he extrapolated and guessed, but he did not. He made definitive statements such as the one Sunita quoted in her article.
“Our data suggests that even stellar manuscripts are better off self-published.”
I haven’t read the comments, and I’m also a reader with no dog in this fight. All I want to say is that it warms the cockles of my heart to see another reader so concisely and clearly explain why so many of these claims are, at their base, yet another “get rich quickly and easily” scam–even if it’s a scam some authors are perpetrating on themselves.
@Bronwen Evans:
When you draw up a business plan for any type of business, you need good information about the strengths, weaknesses, threats, and opportunities out there in order to figure out the best path for your own business. If all the information out there is flawed because of faulty analysis, poor models, ideological posturing, etc., then rational planning is well nigh impossible and there is no way for an author to plan out the best publishing mix for them. Freedom of choice in a low-information system means you just “go with your gut” or “flip a coin,” because you simply don’t have any other way to assess all those choices. You may end up a success, but until someone works out a way to capture information about your success and other people’s successes in a statistically valid and informative way — all we have are anecdotes, not data.
Sunita’s not telling anyone here that one publishing path or another is the best way; she’s merely pointing out that this particular study’s methodology is flawed and the data provided does not support Howey’s claim that “even stellar manuscripts are better off self-published.” His claim could in fact be correct, but his own stats don’t show this (or really much of anything).
@Michael: One of the big problems with the data as presented is that we have no idea which data have been transformed and which haven’t (e.g., the publisher info is clearly not transformed but is just as it appears in Amazon’s information, but with other variables we just don’t know). I looked for some kind of description of the variables or a key to the non-obvious observations and values, but I haven’t been able to find one. So with the pre-sales we just have to guess.
@Liana Mir: Yes, he’s said it’s preliminary, but if the original data are badly put together why should I have any confidence in the quality of the next round? Twice as much flawed data isn’t necessarily going to be better. And the opacity of the variables (e.g., are the individual observations for daily sales numbers guesswork or derived by some fixed formula or interpolation?) means no one else can perform secondary analysis on the data, let alone replicate the analysis. It doesn’t do me any good to give me the data if I don’t know how the data are compiled and I can’t decompose it to its original values. This is the opposite of “raw data.”
And whether the authors are rowing back any of their claims at various internet locations or not, they are *not* rowing back the claims they make in the report and they are not increasing the transparency and usability of the data. So I feel completely justified in criticizing this behavior. That’s not how empirical analysts respect either the data they use or their empirical colleagues.
Sunita,
I’d like to also add my thanks for your article. There’s so much for new authors to learn, and imho, if an author can both self publish and also have books released by a major publisher, that might be the ideal scenario. There’s nothing stopping an author from still submitting their manuscript to agents and also independently releasing other works. Personally, I’d rather compile info on how all this works in order to educate myself on publishing and also share whatever I can with others who may be just starting out. I take the approach of being a life-long learner on most things.
People can make up their own minds, however it’s always good to get both pro and con on an issue, as in this case, which data to use and which may not apply to each individual’s business plan.
And speaking of Amazon, the 2014 Amazon Breakthrough Novel Awards started today. If anyone wants to see a bit of info on the terms they offer authors, I’d suggest looking at Amazon’s publishing Contract Summary (I made it a short link, since it was so long):
http://goo.gl/iJ1m5f
Another error in their analysis is their use of the snapshot’s price points. Book prices on Amazon change frequently, so calculating author income from the snapshot price, or attempting to infer a correlation between the snapshot price and the average review rating, is, as you say, completely absurd. Reviews, for example, may be the product of books bought at higher or lower prices than the snapshot shows.
@E.A. Williams: That is a really, really good point. I didn’t pay as much attention to the review rating discussion as I should have because I tend to discount the baseline data, i.e., the reviews themselves. But I should have considered how the inference of the relationship between the snapshot price and the review rating was potentially corrupted. Amazon prices can change quite a bit, whether the book is self-published or traditionally published, and I’ve read enough reviews that discuss the price (and written some of my own that bring up price-quality ratio) that I should have been more alert to that.
Thanks so much for reminding me of the importance of this information.
@wikkidsexycool: Thank you. I agree that new authors suffer particularly from low information issues, not least because so much of the data are closely guarded by publishers and retailers. If you’re a veteran author who has a number of books in release, you can at least use your own data to help you make decisions about the future. But when you’re starting out you have to rely on other people’s experiences and that can be tricky.
I also want to point to @Michelle Sagara: ‘s comment upthread, which illustrates really well how the “best” decision for an author is going to depend on a lot of factors that are specific to each individual. For some authors it may be worth leaving money “on the table,” as the AE reports put it, if what they gain is security and time to write. For others, having control over the production and promotion process is absolutely worth the effort. Each author has to make her own decision, and I just want to contribute what I can to make that decision a well-informed one.
That’s really why I wrote this post, because I didn’t want people to use flawed data to make important decisions unless they were aware of the flaws and could take them into account. As Kathryn says, the results in the AE report could be correct in some areas, but it’s not because the data clearly led to those conclusions.
@Bob Mayer: As wikkidsexycool says just a few comments above, the authors just starting, who know nothing about publishing, care about all this. It’s their career and potential income, and being misled by all the hype–however well intentioned–can be heartbreaking.
Oh yes, it’s those authors’ business to get educated on what works and how and for whom, which brings us back to: they care.
And thank goodness there are people out there willing to explain things to them and to educate them on what has worked for each of them–as Courtney Milan has done a couple of times so far. Thank goodness there are people like Sunita who are willing to spend some time dissecting those authority dictates (“This works! Everyone, without exception, should do it now!”) so that fewer people fall into the trap of believing there is only One True Way ™ to do anything–let alone publishing.
He linked to AND explained where he got the KDP sales rank extrapolation and then opened up the full raw data spreadsheet and made it available for crunching, while stating he was going to take all that into account because this is new for him and one of the first major attempts to crack the opacity of Amazon’s unshared numbers.
In other words, he’s improving the algorithms after explaining exactly, and linking to, where he got all his figures and how – which Courtney Milan examined in great detail, along with pointers on how to improve his modeling.
So sorry. You’re wrong. His official stance has been to be as transparent as he knows how to be, answer questions thoroughly, explain that he intends to improve the analysis and methodology as rapidly and exponentially as possible, and ask everyone to please help him if they see areas for improvement. How much more transparent can you get?
@Jane:
“Suggests” is so definitive. After he explained precisely throughout where he’d extrapolated from the data and why, HIS conclusion was that the data suggested it, because he felt his extrapolations were valid for the reasons he gave.
Context.
If you can’t accept that he has made it eminently clear through multiple interviews, posts, and the report itself that this is preliminary and full of limitations and this is what I extrapolated from and why and how and this extrapolation got me that, then you are simply denying that he has been transparent about his process and is excited about what his data suggests to him and wide open to improving that data. You can deny it, but it doesn’t change what he said or that he immediately started researching how better to add in print figures after folks told him Bookscan was insufficient, etc., etc.
@Liana Mir: The original data he provides isn’t fully transparent (as explained by Ms. Milan). And his conclusions are…well, yes they are transparently incorrect.
While he may have backtracked in his conclusions elsewhere, he doesn’t in his original report – the one that everyone links to – as opposed to say Ms. Milan for example who does an inline edit and acknowledgment of error. His report makes bold statements based on extrapolations from one data point which is an hour’s snapshot of rankings on Amazon. I don’t need to go into why that’s a bad data point.
If Howey and his programmer continue to scrape data for a significant period of time, continue to improve the modeling, and tie conclusions to sound data, then we can talk. As it exists today, Sunita’s statement is the one truism I can back: his conclusions are writing a check that the data can’t cash.
As for Howey’s openness to criticism? https://twitter.com/hughhowey/status/434181344201932800
“Hey, they’re selling product, not peddling truth. Scientists don’t react nearly as violently to dissent as charlatans do.”
@Liana Mir: The report gives two different explanations and four different sources for the daily unit sales. In the body of the report the authors state:
That presumably is not from the data scraping at Amazon. The footnote at the end, Footnote 5, then states:
The phrase “here, here, and here” contains links to three additional sources for this data: Theresa Ragan’s, Edward Robertson’s, and the KDP calculator. Courtney Milan concludes that the authors are using the KDP calculator and she hypothesizes that they are linearizing the data (since the KDP calculator gives a range and the report gives specific numbers).
I took the information in the body of the report to be what the authors used, with the footnoted information being alternatives that other researchers could use if they wanted. Milan concludes that they are using the KDP calculator. The point is that one of us is wrong (I’m quite willing to believe it’s me), whereas if the report had contained a table with a description of where every single variable was gathered and how it was then recorded, we would both be using the same information. That’s what I mean by transparency. Instead, we have to draw conclusions and different readings of the text and footnotes can reasonably lead to different conclusions. Data that are being offered for others to use should be clearly defined and labeled. From reading the comments, it’s not even clear to me that these data have been cleaned and rechecked before being released.
You believe I am being unfair to Howey and his co-author because I am not giving them the benefit of the doubt. I, on the other hand, am concerned with the quality, reliability, and consistency of the data that are being made available, not the (probably good) intentions of the dataset creators.
A dataset that is continually being updated is not a dataset that anyone else should be working on and it shouldn’t be released until it is in a relatively stable version. And it certainly shouldn’t be used to make sweeping statements of the kind that are being used to tout its value.
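On Milan’s linearizing hypothesis mentioned upthread: the KDP calculator reportedly maps rank brackets to a *range* of daily sales, so turning that into the single numbers the report shows requires interpolation of some kind. Here is a sketch of one plausible way to do it. The bracket boundaries and sales ranges are invented for illustration; this is a guess at the method, not a description of what the report’s authors actually did, which is precisely the transparency problem.

```python
# Hypothetical rank brackets mapping to (low, high) daily-sales ranges,
# loosely in the spirit of the KDP calculator. All numbers invented.
BRACKETS = [
    (1, 100, (1000, 5000)),
    (100, 1000, (100, 1000)),
    (1000, 10000, (10, 100)),
]

def linearized_sales(rank):
    """Interpolate a single daily-sales figure from a rank bracket's range."""
    for lo_rank, hi_rank, (lo_sales, hi_sales) in BRACKETS:
        if lo_rank <= rank <= hi_rank:
            frac = (rank - lo_rank) / (hi_rank - lo_rank)
            # The best rank in a bracket gets the top of the sales range.
            return hi_sales - frac * (hi_sales - lo_sales)
    raise ValueError("rank outside modeled brackets")

print(linearized_sales(100))   # bracket boundary
print(linearized_sales(550))   # midpoint of the 100-1000 bracket
```

Two researchers making different but equally reasonable choices here (linear vs. log-linear interpolation, different bracket boundaries) would produce different “specific” numbers from the same underlying range, which is why an explicit variable key matters.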
@jane
That was my entire point! He is going to be doing that, and he did say what he had done, so stating that he didn’t say it was “just a start” is false. Did I defend his data? No. I took issue with your claim that he didn’t say it was “just a start,” when he did.
Sure, he’s not a statistician. He never claimed to be and made clear that he wasn’t but made an attempt to start a process everyone wants started. So he didn’t do it quite right out the gate because he’s not a statistician. So? He made that clear, did the best he could, and committed to expanding and improving the data.
Which equates to “just a start.”
And yes, Courtney Milan did a beautiful analysis of it, while stating that he WAS transparent (it didn’t load properly the first time) and it was just a start and pointers on how to improve his model. In short, I agree with her because she’s at least recognizing reality instead of expecting perfection.
@Liana Mir: This is not an attack on self publishing or Hugh Howey personally. It’s a criticism of the conclusions and extrapolations he drew from a single data point. His post, the one that he has failed to edit, clarify, or retract, makes sweeping claims about earnings that aren’t backed up by his own data.
That you find Milan’s post helpful is great. Stick with her.
@Liana Mir:
Hello,
This is what Howey also stated regarding his data:
“we hope others will run their own reports and analyze our data. We hope they will share what they find and that this will foster greater discourse.”
Since he encourages discussion on this, then part of the “greater transparency” would be articles like Sunita’s.
And no, “perfection” isn’t a requirement. However, Howey’s name carries weight, and possibly, to some newbie writers, overwhelming credibility (I say this after reading some of the comments left on his article’s page). That’s why, imho, he may need to either edit or clarify his findings on his main article page sooner rather than later.
I can certainly commend his willingness to share info with others, but as a newbie author who primarily uploads work on Amazon, there are so many variables that his data didn’t and probably can’t account for, that I’m afraid imho it generalizes when more specifics are needed.
Writers need to know what they’re getting into, and Howey’s data and conclusions can be one source. But it’s my sincere hope that new writers realize it’s imperative that they compile info from several reputable sources in order to make an informed decision. I hope those who initially read the article and perhaps took it as a definitive answer will return once adjustments are made.
@Sunita: I confess I don’t follow your confusion over Howey’s footnotes about data source (for actual sales based on rankings).
First – there isn’t a definitive, let alone immutable, relation (not so much correlation) between the number of sales and a given rank.
Liken it to the New York Times Bestseller List(s). Even taking a single list, the actual sales that result in a New York Times Bestseller ranking vary from week to week and throughout the year. This has been well-documented by many within the industry, by publishing professionals and authors. Additionally, the Times list relies upon incomplete sales data and in some ways an inaccurate portrayal of how popular a book actually may be. Some typical criticisms of the methodology are here:
http://en.wikipedia.org/wiki/The_New_York_Times_Best_Seller_list#Criticisms
While Howey’s explanatory notes and footnotes are not as well-organized as a statistical paper intended to be peer-reviewed, there is no disconnect between the two statements you highlighted. When Howey says:
“Again, daily unit sales are estimated by sales ranking, using publicly shared data from dozens of authors who have logged the correlation between rank and daily purchases”
And later says:
“Daily sales according to Amazon rank can be found in numerous places, including here, here, and here. Depending on the source, the model changes, but not enough to greatly affect the results.”
He’s describing in the first note the ultimate source that feeds the aggregators of actual sales and rankings.
A real-world example: “Energy price indexes are compiled by shared data from hundreds of energy trading and marketing firms…Daily energy price indexes can be found in numerous places, including Platts Gas Daily, Platts Megawatt Daily, Inside Ferc, etc.”
Howey is not naive. He’s a published author who’s been around for several years and is very aware that his own sales are not linear. As well, he’s connected with many other authors who share their own sales and he’s very aware that no one else’s sales are linear, either. J.A. Konrath is another self-publishing success who’s been open about his sales and has made it very clear they’re not linear, either.
Howey is definitely enthusiastic about writing and indie authors, and he could have issued a clearer caveat (for naive readers who might not understand how extrapolation works and its inherent weaknesses) that he’s taking a sales estimate and extrapolating it across a year of time to show a potential (not likely) long-term view. Statisticians and analysts do these kinds of “what-if” scenarios all the time. Usually it’s couched as a SWAG and isn’t intended to be part of the core statistics that have been more rigorously compiled and tested.
Howey is aware (again, because he’s not naive and he explains many times over the difficulty in actually achieving any kind of transparency) that his own data is estimates based upon estimates of incomplete data. This doesn’t make his data non-useful. It makes it imprecise, of course. But then, the New York Times Bestseller rankings are extremely imprecise and inconsistent. #1 NYTimes Bestseller means something quite different at any point in time, and there are non-bestsellers that outsell those that make the list. But we still utilize bestseller lists like the Times, and we do so because they have some utility.
I confess I also didn’t quite follow your interest in correlation matrices behind the data. I downloaded the spreadsheet and glanced over it, and didn’t see anywhere that correlations would actually enter into any of the very simple equations being used. This is a tiny slice of data and it’s interesting, obviously incomplete, and only a glimpse of a whole picture. And we know that we’re not going to get full and accurate sales data anytime soon.

While Wal-Mart might have been an early pioneer in tracking inventory and sales to make their supply chain more efficient than other retailers before the others caught on, we don’t see that same kind of efficiency within publishing. We know that sales can be generated from Amazon, large booksellers like Barnes & Noble, and all the myriad small independent bookstores. The publishers themselves know that this kind of disparate system introduces a lot of inefficiency in actually tracking what’s been sold – let alone when. Sales data probably alternately gushes and trickles to publishers, and it’s anything but timely, which is why publishers pay royalties so belatedly. Amazon controls and tracks its own sales and can pay self-publishers quickly through its own platform.
Back to correlations – I think that what’s being confused is the relation between a given sales rank and actual sales. Howey himself said, “Using these snapshots, I could plot the correlation between rankings and sales.” The sales related to any given rank are a very loose figure. They will depend upon the genre or sub-genre (which changes as Amazon redefines these periodically), time of day or time period (just as the New York Times list does), Amazon’s evolving algorithms to assign rankings, and other factors. Indie authors are well-aware of all this, so it’s more of a wet-thumb-in-the-air, and that’s good enough. A SWAG, in other words. Because when you lack actual data or the underlying data itself is changing because of external factors as well as upstream methodology tweaks, you either give up and go home for lack of accurate data, or you take a SWAG to get an idea. It’s imprecise, but sometimes a hazy view is better than quitting for lack of a perfect view.
A given sales rank will have a floor and ceiling and probably offer a distribution (not necessarily a normal distribution) of sales to the ranking. I’m sure that the sites Howey linked to that guesstimate actual sales based upon rank use varying degrees of sophistication or simplification (such as simple average) to compile their estimates. But hearkening back to real-world examples, the energy industry utilizes index prices that are based upon only a subset of self-reported sales, and the publishers of those index prices use various methodologies to work out noise and outliers and attempt to gain a picture that’s as accurate as they can from an incomplete picture. Statisticians do this in all walks of life. Political pollsters develop methodologies that attempt to extrapolate a subset of voters against the bigger picture. Some of these methodologies are criticized for being limited due to reliance upon only landline phones (which may no longer be representative demographically and may skew toward older or more rural respondents) or internet-based, and so on. Flawed methodologies and assumptions are commonplace, in other words. The test comes when they develop and gain traction and competition and actual results can be measured against them. I think Howey’s methodology is only a very tentative first step. I seriously doubt we will see actual full transparency anytime in the next 15 years. But people tend to often not mind flawed or incomplete pictures when the alternative is none at all.
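As a toy sketch of that kind of wet-thumb estimate (all numbers below are invented, not Howey’s data or any real aggregator’s): given a handful of self-reported (rank, daily sales) pairs, you can SWAG the sales for a rank by averaging whatever reports fall near it, and the min/max of those reports gives you a rough feel for the floor and ceiling.

```python
# Hypothetical illustration only -- the (rank, daily_sales) pairs are made up.

reports = [
    (100, 1100), (105, 950), (98, 1300),   # self-reports near rank ~100
    (1000, 120), (980, 140), (1050, 90),   # self-reports near rank ~1000
]

def estimate_sales(rank, reports, window=0.2):
    """Average the self-reported sales whose rank falls within
    +/- window (as a fraction of the rank) of the target rank."""
    nearby = [s for r, s in reports if abs(r - rank) <= window * rank]
    if not nearby:
        return None  # no reports close enough -> no estimate
    mean = sum(nearby) / len(nearby)
    # min/max give a rough feel for the floor and ceiling
    return mean, (min(nearby), max(nearby))

print(estimate_sales(100, reports))
```

The point of the sketch is how loose this is: the estimate depends entirely on which reports happen to land in the window, which is exactly why an average plus a spread is about as much precision as the data supports.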
With correlations, we’re typically trying to calculate a range of coefficients across a matrix between related things. A typical simple example would be currencies which can all be exchanged for one another. In such a case, a correlation matrix is highly useful. We would place the various currencies (USD, CAD, EUR, GBP) across both an X and Y axis and track the correlations between each possible pair. We would also have a diagonal strip of “1” as each individual currency shows perfect correlation with itself. We can also do a time-series correlation for a single product – a good example would be a futures product where each forward month’s price can be correlated against the others. The two time-series of forward months would be represented on the X and Y axes, and the correlations would typically be derived from day-on-day log return price changes across a given span of time. We would know that regardless of the correlation matrix, its own utility would depend heavily upon the length of time used to calculate correlations. Too short a time-period might not capture a full picture. Too long might introduce seasonality that skews results. Short-term events might affect results. One of the best possibilities is a constantly-recalculating correlation matrix that shifts forward in time and reveals trends in correlations. And you can further expand such a correlation matrix multi-dimensionally by comparing any two futures prices at any potential pairing of forward months. Your correlation matrix in such a case may exceed the ability to portray within an application like Excel because of row and column limitations. You can always pre-calculate the needed rows and columns of potential pairings using factorial calculations and watching the whole thing exponentially grow.
More importantly, correlations find their greatest use by defining the strength of the relationship of two related things across time. So back to the currency example – what you want is to look at the relationship of currency movements up and down over a time period and calculate correlations from the way any two currency pairs move. A big weakness of a correlation matrix is liquidity of these movement patterns. If you have an illiquid thing whose value (whatever it may be valued in) never changes or rarely changes, you mathematically tend to run into issues trying to correlate a non-moving thing against another that moves.
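To make the currency example concrete, here’s a minimal sketch with invented daily log returns; it isn’t anyone’s real data, but it shows the diagonal of 1s and the two mirror-image halves I described:

```python
import math

# Invented daily log returns for three currencies (illustration only).
returns = {
    "USD": [0.001, -0.002, 0.003, 0.000, -0.001],
    "EUR": [0.002, -0.001, 0.004, 0.001, -0.002],
    "GBP": [-0.001, 0.002, -0.003, 0.000, 0.001],
}

def corr(x, y):
    """Pearson correlation of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

names = list(returns)
# Build the full matrix: the diagonal is 1 (perfect self-correlation)
# and the halves on either side of it mirror one another.
matrix = {(a, b): corr(returns[a], returns[b]) for a in names for b in names}

for a in names:
    print(a, ["%+.2f" % matrix[(a, b)] for b in names])
```

With only five observations per series the coefficients themselves are meaningless, which illustrates the time-window caveat above: the matrix is only as good as the span of data behind it.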
Sales rankings and actual sales don’t fall into this kind of methodology of making good use of a correlation matrix. You could technically construct one and start from zero through 20,000 sales across one axis and sales ranking across the other axis, and plot out correlations between them. But you’d have extremely low correlations across most of it and then a bulge of high correlation where a given sales rank falls within a range of actual sales. And you’d have to apply some alternative correlation calculations since the rank numbers would never change while the sales connected to those ranks would. A distribution curve would be simpler and pictorially more interesting. But frankly, since the underlying data is constantly changing as to actual sales as well as methodology to construct rankings (so you have inconsistent shifts in two axes) it’s probably better to just accept the dirty data and go with an average, weighted average, or something along those lines and perhaps include standard deviation.
As far as testing results, the truth is that a lot of statistical models fail at various times when tested. You can chi-square, backtest, or apply various specifically individualized tests and iterations of those. And at certain points pretty much any model fails. It’s why probabilistic models rely upon confidence intervals, usually 95% or 99%. When you review the data behind these to see the 1% case, there’s a lot of variation that lies outside in the tails. Tail-events cannot be predicted or pre-measured for the most part.
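To put a number on that 95% figure (a simulation with made-up normally distributed data, nothing to do with book sales): roughly 1 draw in 20 lands outside the ±1.96-sigma band, and those tail draws are exactly where models break. Real-world tails are usually fatter still.

```python
import random

random.seed(42)  # fixed seed so the illustration is repeatable

n = 100_000
draws = [random.gauss(0, 1) for _ in range(n)]

# Count how many draws fall in the tails beyond +/- 1.96 sigma,
# i.e. outside the classic 95% confidence interval.
outside = sum(1 for x in draws if abs(x) > 1.96)
print(outside / n)  # close to 0.05
```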
As far as various other ways the data could be analyzed “better” – that has to be played around with to find out. While statistics has myriad methodologies and calculations and processes, none are one-size-fits-all. For example, if we want to study the volatility of a pattern of numbers, we have many different approaches we can take. We can apply a historical volatility, an exponential weighted moving average, simple standard deviation logic, etc. We can also backtest the results we achieve, and what we often find is that one or more methodology may test better than the others, or may do so only in certain times. Worse, a lot of backward-looking methodologies fail for future predictability – and the reason is that there are always extrinsic factors that cannot be quantified and often cannot even be foreseen.
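For readers who don’t work with stats, here’s a tiny sketch (invented return series, not real data) of two of the approaches named above side by side: plain sample standard deviation, which weights every observation equally, versus a RiskMetrics-style EWMA, which lets recent moves dominate. Same data, two defensible answers; which one “tests better” depends on the period.

```python
import math

# Invented return series, purely for illustration.
returns = [0.01, -0.02, 0.015, -0.005, 0.03, -0.01]

def stdev(xs):
    """Plain sample standard deviation: every observation weighted equally."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def ewma_vol(xs, lam=0.94):
    """RiskMetrics-style EWMA: var_t = lam*var_{t-1} + (1-lam)*x_t**2,
    so recent moves dominate the estimate."""
    var = xs[0] ** 2
    for x in xs[1:]:
        var = lam * var + (1 - lam) * x ** 2
    return math.sqrt(var)

print(stdev(returns), ewma_vol(returns))  # same data, two answers
```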
A lot has happened within publishing over the past few years. We witnessed the demise of Borders, the dwindling of Barnes & Noble along with its once-promising Nook competitor to Amazon’s Kindle, Sony’s withdrawal from the ebook market, Apple’s entry, Kobo’s entry which seems to have plateaued early, price-fixing legal judgments against the big publishing houses and Apple, Amazon’s evolving royalty payouts, algorithms, and author sales and marketing tools, and many other developments. There is no way any model even with complete data could ever predict outcomes in such an environment.
I tend to see Howey’s data as interesting from a “state of now” perspective. As serious statistics, I really don’t care. It isn’t, but then, most statistics involve a lot of assumptions and you can rip them up pretty well once they’re tested under real-life conditions. That’s why businesses that rely upon statistical models allow them to evolve. And they also recognize that “A dataset that is continually being updated is not a dataset that anyone else should be working on” is not necessarily true. The underlying dataset of any statistical model is pretty much always changing. Some are closer to static than others. But a good statistical model adapts and is tweaked to accept changes.
Wow, women talking about stats really bring out the mansplainers.
@Sunny: I’m a woman, and I found nothing condescending or “mansplainy” about Matthew’s comment. In fact, he made a lot of very good points.
I’m very geeked that I understand Sunita’s post (am taking a statistics course this semester). And the romance genre is populated by repressed, sheltered spinsters and housewives, right?
Thank you Sunita–and Courtney and other commenters–for this rigorous discussion.
I was thinking about the data last night and all the really interesting things that it could tell us other than Howey’s conclusion of “self publishing is the only way for anyone who doesn’t want to leave money on the table.”
Such as: how many backlist titles do authors in the top 1000 have in common? How do subsequent releases do compared to a debut release? (In watching the lists there seems to be an issue of indie stickiness.)
What are the best release dates? Is there a consistent bell to the rise/fall of rankings over a certain period of time? At what point do also-boughts (part of Amazon’s recommendation algorithms) kick in, and what do they do to affect the rankings/sales? Are there titles that sell better, or is there a consistency to covers? How about blurbs?
Do authors with the words NYTimes/USA Today Bestseller sell better? E.g., over on http://unbound.bookbub.com/, they suggest that Goodreads reviews have more effect on sales than a mention in Booklist:
This might be the kind of thing that the data scraping of Amazon’s public rankings by the hour can provide and would really be helpful. What I don’t think the data can provide is exactly where Howey went and that is that self publishing is the only way. In the end, if data is continually scraped for a significant period of time and that information is released to actual statisticians to analyze then there is probably some really great information to be gleaned that could actually benefit how authors market, release, package and sell their books.
@Sunny: I think this is just people who work with stats talking about stats. Sunita – as have many others who have reviewed Howey’s analyses – raised valid points about limited data, extrapolation of results, and methodologies.
She suggested other statistical approaches, but I’m not certain correlation matrices would make much difference. When I differ in an opinion, I explain why. Doesn’t mean I’m right. Stats are interesting because there are different approaches. And she may be talking about correlating other factors or in a different way than I considered when I looked at the spreadsheet.
The main problem is the limited data, which contributes to the extrapolation. That isn’t going to change, and it can easily worsen if Amazon doesn’t like having its data scraped, or continues to make changes to genre and sub-genre definitions so that aggregating data meaningfully over time becomes very difficult to do. That’s pretty likely.
But I think the data is interesting. I don’t believe it’s predictive, and I liken Howey’s approach to a SWAG, which can still have uses. The math in the spreadsheet is extremely simple, but that may not be such a problem since the data is so limited.
1) Saying other people must do this, that, the other, and give me a rainbow and then we can talk is churlish. You paid nothing for it, take it for what it’s worth to you or build your own.
2) No harm will come from this. Authors are not going to change behavior en masse because of some data and breathless writing. Even if you believed it 100%, are you really going so far astray?
3) Criticism from experts, even those bordering on pedantry ad absurdum, will help to make the project better. If, of course, the aim of the project is to get better.
@Jane: I wonder if you could comment from your legal background, whether there could potentially be secondary liability issues for anyone using Howey’s scraped data since he violated Amazon’s terms of use to obtain it?
“LICENSE AND ACCESS
Subject to your compliance with these Conditions of Use and your payment of any applicable fees, Amazon or its content providers grant you a limited, non-exclusive, non-transferable, non-sublicensable license to access and make personal and non-commercial use of the Amazon Services. This license does not include any resale or commercial use of any Amazon Service, or its contents; any collection and use of any product listings, descriptions, or prices; any derivative use of any Amazon Service or its contents; any downloading or copying of account information for the benefit of another merchant; or any use of data mining, robots, or similar data gathering and extraction tools.”
I’m not sure that’d be binding on something like academic research, but considering that’s not purely what this is and there is an indirect commercial element here, I wonder if it’s risky. I believe it’s rare for Amazon to take action against scrapers — they would usually just point them to their official API, but I don’t believe Howey would be able to use the API since it wouldn’t provide all the information he’s gathering, and they also require the API be used in conjunction with linking back to products on Amazon.
I tried finding relevant case law, but my search-fu is failing me as I don’t know the right terms to be looking for. All I came up with was something where Craigslist used an anti-hacking law to stop a scraper who tried to use IP rotation to get around a block. I have no reason to think Howey and friend have or would do anything to circumvent a block if put in place, so I don’t think that particular ruling sheds much light.
@Sunny, as another woman chiming in, I too found nothing off-putting or condescending about Matthew’s post – I wish I could say the same about others on this blog entry.
I was referring to the comments at large, otherwise I would have specified it @ someone.
@Michael – Not really. Is that the Amazon’s TOS or the KDP TOS? The KDP TOS may have greater implications for Howey. The first layer is determining whether a TOS actually applies. There are varying decisions on whether a website’s TOS is even enforceable. There are a couple of types of TOS defined by the courts – browse wrap and click wrap. Most of us would likely be click wrap users and the question is whether the customer was given a fair opportunity to review the terms and agreed to be bound by them.
Amazon has restricted accounts and rescinded access based on unauthorized returns and other unstated abuses, so my best guess is that the primary act of scraping subjects you to greater jeopardy but that the secondary liability is low. But again, there isn’t a lot of relevant case law.
Thanks, Jane. That’s from the Amazon TOS. Sounds like TOS law is pretty murky. I’m also pretty certain that Amazon itself does scraping as part of its KDP favorable nation pricing checks and KDP Select enforcement (an employee admitted as much to a client of mine in an email), so they might open a can of worms should they ever get heavy-handed.
I know of a few instances where folks scraping Amazon have been contacted (including myself a few years ago) and warned to use the API instead, but I don’t know if anyone’s ever faced legal action. It’s unfortunate the API is fairly limited in what data it provides on Kindle titles. Print is a little better as they allow you to pull a bit more information.
@Matthew: You’ve written a very long comment and to answer every point would require too many words. Please feel free to follow up with specific questions I’ve failed to address. I’ll address the points that I see as particularly germane to this discussion, but as I say, feel free to ask me for additional material as necessary.
(1) Thank you for your elucidation of the relationship between the sales ranking description in the report and the footnotes. That reduces the range of measures to 3. They continue to have the same problem for me, which is that I don’t know when the reported sales are being drawn from. Given the variation across times of week and times of month that authors report about their rankings, this is a non-trivial problem for a cross-sectional dataset that claims to be a valid temporal snapshot.
(2) I want a correlation matrix because the report is hanging much of its value on bivariate relationships and offering advice about strategic author behavior based on those relationships (correlations). The problem with bivariate relationships is that we don’t know the direction of the influence and we don’t know if those relationships are influenced by other variables. A correlation matrix (or a series of cross tabs, but they’re reporting the same relationships) allows me to see ALL the bivariate relationships, as well as give me an idea of the data structure. In the world I live in, empirically and statistically speaking, you don’t choose an estimation procedure (AKA a “statistical model”) until you have a sense of the data structure. [ETA: I’m talking about the choice of statistical procedures, not the theoretical model, which ideally you should have worked out before you collect the data in the first place.]
I’m not sure we’re using “correlation” in the same way. I am literally talking about the relationship between two variables. That’s it. I just want to see ALL the bivariate relationships. I can’t run that myself because of the anonymization of important variables, and the opacity of certain transformed variables makes me uncomfortable with respect to other variables.
(3) SWAG is what you do over beers, at a conference, among friends. You don’t report it as real analysis. Again, that’s in my world. For those who don’t know what the acronym stands for, I assume we’re translating it as “Scientific Wild Ass Guess,” in other words not scientific and totally guessing.
(4) Constantly updated datasets might be used in-house in commercial, for-profit enterprises, or in proprietary data analysis in which one is only reporting results. But people who are sharing data and collaborating on primary research do not constantly update datasets, at least not without specifying each change carefully.
(5) I am well aware of how political polling works. The head of NBC’s elections unit, i.e., their head pollster, was a student of mine. Polling has well established weaknesses and strengths attached to different samples, different estimation procedures, and different assumptions. That’s part of what I’m trying to establish here, for this new and evolving sample of data.
(3) SWAG is what you do over beers, at a conference, among friends. You don’t report it as real analysis. Again, that’s in my world. For those who don’t know what the acronym stands for, I assume we’re translating it as “Scientific Wild Ass Guess,” in other words not scientific and totally guessing.
THIS. A thousand times, this.
This is what frustrates the hell out of me as someone who wants to see self-publishing flourish. If the best defense you can come up with for what Howey has done is basically, ‘well, he’s trying to do something good’ and ‘he knows there are a lot of problems with it,’ then how about the obvious question: Why not wait until you’ve got a sound database and statistical analysis from which you can make defensible inferences and on which you can base reliable extrapolations?
I appreciate the ways in which the Internet has democratized opinion. But there are some areas in which credentials still count, and this is one of them. There are, as far as I can tell, three people who have thus far demonstrated the appropriate credentials to do and discuss this kind of work, all of them are women, and none of them is Hugh Howey (and the fact that his numbers person is anonymous presents another reliability and transparency problem, IMO). And all three of these people with actual, demonstrable expertise have identified MAJOR issues with Howey’s “report” (and if you haven’t checked in with Courtney Milan’s post since she got the entirety of the “raw data” and has extensively edited her analysis, you definitely should).
In some ways the surge of energy around Howey’s “report” reminds me of the response to Harlequin’s foray into self-publishing a few years ago, even though the perspective has flipped. The passionate insistence that self-publishing was going to taint the marketplace, ruin the genre, and make it harder for authors was swift and strong (http://www.smartbitchestrashybooks.com/index.php/site/want-to-self-publish-how-about-harlequin/). And look how much has changed in only a few years, and how many authors who seemed dead set against self-publishing back then are now doing very well and serving as outspoken advocates of this new market. All of which is great.
However, publishing is still very much an unsolved equation for many authors, and IMO if self-publishing really wants to be persuasive and professional, it needs to rely on thoughtful, sound, well-researched data and analysis. ESPECIALLY if it wants to go the limit. Traditional publishing may be struggling right now, and the balance definitely seems to be shifting to (some) authors who are willing to go it on their own. But traditional publishers are already working their way back into the market — witness Macmillan’s crowd sourced teen romance imprint and Amazon’s Kindle Worlds fan fiction community — and more competition within the self-publishing marketplace is going to kick in, as well. As we see over and over, the pendulum swings back and forth, and if authors and readers don’t want it to swing all the way back to traditional publishing as THE model, there needs to be a solid foundation for self-publishing now. If anything, I think Howey’s “report” is a) taking attention away from authors who have quietly been *doing* the work (a lot of whom are women, btw), and b) providing an unstable analytic and statistical base, which is going to be harder, not easier, to build on reliably.
@Sunita: Hi Sunita –
Thanks for the reply. Statistics are always quite fun, and there are definitely different methodologies.
Regarding your points:
1) Just guessing (since he didn’t say which of the three he used) Howey may have elected to either establish preference for one (or more, depending on a particular ranking) or else distilled the data. It’s definitely a loosey-goosey approach, but the data is messy to begin with. It’s already a distillation of self-reported author sales and ranks and each of the sites probably utilizes a different methodology. This is why I wouldn’t take the report as any kind of gospel but I find it interesting. And I similarly wouldn’t bother trying to hit it with sophisticated statistical methodologies. Rough data in will yield rough results out, no matter what approaches are taken. You just can’t fix estimated data.
2) I think now that you mention it I recall working on a correlation matrix like you describe probably 20 years or so ago. One of the things I remember from that study was differentiating strong correlations as being relevant or meaningful versus random – since unfortunately there are strong and near-perfect correlations in the wild that are statistical junk, like the Redskins Rule:
http://en.wikipedia.org/wiki/Redskins_Rule
Here are a couple examples I worked up to show the correlation types I was describing (the first is time-series of two futures commodities and the second is a currency example). The numbers are made up. The currency one does show the “1” correlation down the diagonal as each currency correlates perfectly with itself. As well, the two halves separated by the diagonal will be mirrors of one another:
http://farm4.staticflickr.com/3750/12603226575_9aff35a036_b.jpg
You can also do multi-dimensional correlations but you can only portray those pictorially via heatmaps and other 3d displays. Also good for visualizing option behavior. But the raw numbers are also good to work with, just not as good as a picture. Scatterplots also work as a picture. I haven’t done those since probably working with Minitab way back when.
I don’t think Howey has anything close to that. As I said before, my impression is that he acknowledged the underlying data is estimated, he has a single scrape of data, and the analyses are more along the lines of wet thumb in the air. I would fear that wrapping stats around estimated (and limited) data either lends an air of greater science behind it or can contribute messiness of its own. That’s why I don’t really have a problem with him taking a SWAG approach. I don’t really consider his analyses to be (or even intended to be) able to withstand rigorous backtesting and the like. When it begins with estimated data, wrapping fancier stats around it rather than the simple Excel formulas he’s using just doesn’t feel right.
Now if he is able to scrape over a period of 6 months to a year and compiles that data, it starts to get more workable because one can theoretically track trends as books come up and down ranks, etc. But I think a daily scrape might be problematic since Amazon doesn’t leave their book genres and organization alone for very long. I wouldn’t want to program the web scrape for something that can shift and where you could lose the value of a few months of collected data because Amazon changes the way bestseller lists are portrayed. Plus they’re always tweaking their algorithms so the “meaning” of a ranking changes.
I do see your issue with his statements about his conclusions from what he observes with the data. A number of other sites particularly zeroed in on that. I personally don’t take issue with it, only because I can take a glance at the data, see it’s a simple analysis based on estimated data, so I take it at face value and don’t worry too much with mis-reading it. Of course, there are people who can be influenced or over-awed with charts and numbers. For that, I don’t have an answer. And as I said, I just feel that putting heavier math around the estimated data isn’t going to really help.
3) I agree on SWAGs, which is why I said ” Statisticians and analysts do these kinds of “what-if” scenarios all the time. Usually it’s couched as a SWAG and isn’t intended to be part of the core statistics that have been more rigorously compiled and tested.” I guess I can look at data and analyses like these and I see them for what they are and take them at that value. I find them interesting. Not conclusive, but interesting. But again, there are certainly people who may not know what they’re seeing and may accept it at face value.
4) I do recognize what you’re saying. As you mention, this is a difference between commercial and research datasets. It’s anathema to the latter, but necessary for the former. I would say that for Howey’s continuing analyses to proceed, he’s going to fall into the changing-dataset category. This makes it a moving target and increases the SWAG-i-ness of the whole endeavor. I suspect he’ll incorporate a lot of the critical feedback and add to his caveats and footnotes. But this thing is never going to be precise, because you can’t do precision when you’re downstream of data that is not only moving but produced by methodologies that are shifting as well. So I just wrap those expectations around all of it, which is another reason why I don’t have a problem with the simple-math approach. If he had not made his underlying data transparent, I probably would have more of an issue.
5) I mixed in detailed examples and descriptions of methodologies for the benefit of readers who don’t work with stats. So I tried to give examples that people could follow as I worked out my thoughts on the whole matter.
From the Wall Street Journal
“In 2013, self-published books accounted for 32% of the 100 top selling e-books on Amazon each week, on average.”
From Barnes and Noble Press Releases (April 9 2013)
“Customer demand for great independent content continues to dramatically increase as 30% of NOOK customers purchase self-published content each month, representing 25% of NOOK Book™ sales every month.”
– – – – – – – – – – – –
It’s common sense that in the three most popular genre fiction categories (romance, sci-fi/fantasy, mystery/thriller) the percentage would be higher.
My job used to be to analyse the figures and present the results to the teams, whether it be sales, marketing or the board. Generally, the higher up they were, the less they knew. Sad but true.
Anyway, I didn’t do the sums, someone else did that for me, but I had to understand what they were doing. As a matter of course I’d order a full array of figures. Standard deviation, standard error, chi squared and Student’s T tests were standard, as well as a series of correlation arrays against variables like the weather conditions, standard economic figures and competitor’s products. Without that the figures were pretty meaningless. That’s what’s happened here. There’s no robustness to the results. It’s not ready for prime time and the results don’t mean a lot.
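The battery of checks described above can be sketched in a few lines. This is a minimal illustration, not the tooling a commercial analytics team would use: the daily sales figures and the "weather" variable below are entirely made up, and only the basic quantities (standard deviation, standard error, a Welch t-statistic, a Pearson correlation) are computed.

```python
import math
import statistics

# Hypothetical daily unit-sales figures for two products (invented numbers,
# standing in for the kind of series a sales analyst would pull).
product_a = [120, 135, 128, 150, 142, 160, 155, 149, 138, 161]
product_b = [110, 118, 121, 119, 125, 130, 127, 133, 129, 131]
# A hypothetical external variable, e.g. a daily temperature index.
weather = [5, 7, 6, 9, 8, 11, 10, 9, 7, 12]

def describe(xs):
    """Standard deviation and standard error of the mean."""
    sd = statistics.stdev(xs)
    return sd, sd / math.sqrt(len(xs))

def welch_t(xs, ys):
    """Welch's t-statistic for two independent samples."""
    vx, vy = statistics.variance(xs), statistics.variance(ys)
    return (statistics.mean(xs) - statistics.mean(ys)) / math.sqrt(
        vx / len(xs) + vy / len(ys))

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                           sum((y - my) ** 2 for y in ys))

sd_a, se_a = describe(product_a)
print(f"Product A: sd={sd_a:.1f}, se={se_a:.1f}")
print(f"Welch t (A vs B): {welch_t(product_a, product_b):.2f}")
print(f"Correlation (A vs weather): {pearson(product_a, weather):.2f}")
```

The point isn’t the arithmetic, which is trivial; it’s that a result presented without this supporting array of figures has no stated robustness.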
That’s what surprised me about the Howey report. There was none of that rigor. A report at that stage of development would be monitored by Kraft (my old employers; well, actually General Foods, who were bought out by Kraft) until it was ready to be presented. There would have to be a bunch of other figures before it was considered rigorous enough for market testing.
And I was puzzled by SWAG, because in the UK, that’s closely associated with Noel Gallagher. Short for “swagger.” The other explanation I wouldn’t use because I was supposed to make the figures as easily understandable as possible, while making them as rock-solid as I could because if anything went wrong, they’d come back and blame the numbers. “You got it wrong so the marketing campaign was wrong.”
Mind you, they’d often cherry pick the results and pick one or two to concentrate on, if it supported what they’d decided to do anyway.
And quantitative data, they loved that because someone like me, and some of the others who’ve spoken here, could construct a questionnaire to give you the answers you want. That is, the interview you have with the person commissioning the report was as important if not more so than the actual results. There’s a great sketch from “Yes Minister” that illustrates this perfectly.
@Matthew: Correlation matrices and scatterplots will not go away, not as long as data structure influences the types of estimation procedures we can use (independent of the theoretical questions we are asking).
I agree that rough data yield rough results. I work with crappy data (my technical term for it) all the time, because better data aren’t available. But the *way* in which it is crappy matters, and that’s what I want to be able to figure out in these data.
@Lynne Connolly: Yes, those are essentially robustness checks, on both the data and the appropriateness of the estimation procedures, and I agree we should always be doing them. That’s what lets other people trust our results.
@Robin/Janet: I’ve been thinking about this comparison since I read your comment, and one of the things that strikes me is that advice about what to do (whether it was 5 years ago or now) is consistently based on the current context. Taking the doom-and-gloom advice of 5 years ago would have been unproductive, and taking the AE report advice now is likely to be equally unhelpful.
Consider the “value ratio” chart in the report, the one that says that readers “value” self-published books at a higher (and statistically significant) rate than Big 5 books. By using “value” the authors get around the problem of thinking about “quality,” which we clearly can’t measure through review ratings (for a variety of statistical and substantive reasons). But even the comparative values are based on previous and current price differences. As E. A. Williams notes above, the review ratings come from a wide period of time, in which the prices may well have fluctuated (and in the Big 5’s case, older books in the sample might have been under Agency pricing when the ratings were given).
If authors who self-publish follow the “lower the prices and increase reader value” advice, then readers will adjust their expectations, especially if lots of authors are taking this advice. And what if the Big 5 also lower their prices? You either have to drive the price very low to sustain the “value” advantage, or the difference goes away. Or maybe something else happens that I can’t think of. But the point is, it’s not a static situation or one that a single author can control.
@Sunita: Hi Sunita –
I agree correlation matrices and scatterplots (and other visualizations) won’t disappear. I haven’t used scatterplots in a very long while, but certainly use correlation matrices and other visualizations. It can take ~500 points of raw data to derive only one point in some correlation matrices. And if the matrix is something like 24×24, it becomes ~150,000 pieces of data to populate the matrix (diagonal plus two mirrored hemispheres). A distribution curve can easily go into the hundreds of thousands or millions of discrete data points.
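As a rough illustration of the point about matrix size versus raw data, here is a sketch that builds a correlation matrix from simulated daily series. All numbers are invented; only the structure matters. Because the matrix mirrors across its diagonal, only n(n-1)/2 of the cells are distinct, but each one still summarizes two full series of raw observations.

```python
import math
import random

# Simulate a handful of daily sales series (hypothetical Gaussian noise
# around an arbitrary mean; nothing here comes from the AE data).
random.seed(1)
n_series, n_days = 5, 365
series = [[random.gauss(100, 15) for _ in range(n_days)]
          for _ in range(n_series)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                           sum((y - my) ** 2 for y in ys))

matrix = [[pearson(series[i], series[j]) for j in range(n_series)]
          for i in range(n_series)]

# The diagonal is 1.0 and the matrix is symmetric, so 5 series need only
# 5*4/2 = 10 distinct correlations, each derived from 2*365 raw points.
for row in matrix:
    print("  ".join(f"{v:+.2f}" for v in row))
```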
So a visualization (including any transformational calcs needed to make raw data meaningful) is much better than trying to work with the raw data alone. Plus the visualization reveals flaws and anomalies in the raw data pretty quickly.
This seems so invested in disagreeing with the underlying source. I think it would behoove those of us in the industry to take what may be a non-representative sample into consideration anyway. Why not empower ourselves by using Howey’s work, self-selective or not?
@Paul:
If by “underlying source” you mean the data, one does not agree or disagree with data. One evaluates its utility and quality. If you mean the creators of the dataset, I’m not sure why you think I’m invested in disagreeing with them. I’m not a fiction author (self-published or otherwise) and I have no desire to be one, I don’t read their books (that I know of, given Data Guru is anonymous), and I don’t frequent writer hangouts like the Kindleboards, Absolute Write, or any other similar forums. My “investment” is as a longtime reader of genre who values the romance novel reader-author community.
Absolutely. Non-representative samples can offer interesting findings from which one can build theories and derive suggestions about how to create representative, scientific samples. Just don’t generalize from them and base your behavior on the idea that they offer generalizable predictions.
I’m not sure what empowerment involves in this case, but if it includes taking his advice, I encourage you to go right ahead, just do so with the knowledge that it is not supported by the data that is used to justify it.
Sunita,
Thanks for the thoughtful analysis. You raise a lot of good points, though I tend to be more optimistic about the study and look forward to seeing how things shake out after they gather more data and delve deeper (informed, hopefully, by the critiques that have popped up).
A couple quick thoughts. You mention that the data is not normally distributed, but is there any reason we would expect it to be? I’m not a statistician, but my understanding is that samples are only normally distributed if they are randomly chosen from the population. These data points were purposefully chosen to be only the top sellers in the market, so I would actually have been quite surprised if they had been normally distributed.
Regarding the biases of the particular day on which the data were taken: it seems that the two biases you mention – traditional book launches and textbook purchases – would actually be working against Howey’s conclusions. Book sales tend to spike the day of a launch due to preorders, so if anything, that would skew the data toward more traditional sales. The textbooks, I guess, wouldn’t factor in as much, since Howey focused on fiction.
Anyways, interesting stuff. I’m looking forward to seeing how things develop.
@Matthew:
Hi! Nice to see you and your cohorts popping in from the Passive Voice.
But despite all your collective “discussion” over at the other site about Sunita’s very clear, very logical and impeccably argued post, and all lovely talk here about statistical models, you seem to fail to recognize that you and she AGREE that the data is, well, crappy. In your own words, “I do see your issue with his statements about his conclusions from what he observes with the data,” “It’s definitely a loosey-goosey approach, but the data is messy to begin with,” “this is why I wouldn’t take the report as any kind of gospel,” etc.
But unfortunately, the issue has never been whether Matthew can understand the data. Bully for you, though (despite the fact that you – along with Howey and the spider coder – have failed to establish any sort of credentials in data analysis or given any of us a reason why we should give credence to your words. This is in severe contrast to Sunita, Courtney Milan and Dana Beth Weinberg).
Nor is anyone doubting that the numbers aren’t a real snapshot of what happened in that moment of time, or that self-published genre books are capable of selling as well as or better than Big 5 genre books. This is a site that discusses romance; we are very familiar with the self-publishing success of Liliana Hart, Bella Andre, Barbara Freethy, and many, many others. You’re not telling us anything we don’t know. The only surprise is that Howey seemed so surprised.
No, the issue is that Howey used messy data – your own words, Matthew – to make sweeping conclusions designed to advise people how to best run their careers.
You and the Passive Voice commentators seem to think Dear Author is some sort of rah rah traditional publishing site and that the people who have issue with Howey’s report are kneejerk responding because we have our noses up New York’s butts. I don’t presume to speak for Jane and her writers, but as a long time lurker and infrequent commenter on this site: BWA-HA-HA!!!!!
Perhaps if you all took those enormous chips off your shoulders regarding self publishing – and I don’t blame you for having them, what with Howey and Konrath, et al, doing their best to keep the chips front and center – and stepped back to take an objective look at how Howey is couching his report, you would understand why people who have no dog in the hunt and/or possess logical thinking skills are skeptical of Howey’s claims. It’s not that we want self-publishing to fail or for traditional publishing to keep its hobnailed boot on writers’ necks. We – or at least I – just want good stories. (But well-edited and properly formatted stories, please.)
And that means providing authors with the best information possible so they can write and publish those stories, not a religious screed masquerading as “the clearest public picture to date of what’s happening,” “bombshell,” “What percentage of the overall reading market does this represent? Our data guru said this was a question we could easily answer,” “The real story of self-publishing,” etc. In fact, Howey goes on to say, “Our fear is that authors are selling themselves short and making poor decisions based on poor data” – and then he provides “loosey-goosey,” “messy” data – your own words, Matthew – himself!
So if you want to know what “side” I am on: I am on the side of authors publishing their works in the best manner that suits their individual needs, talents and capacity while maximizing their income as much as possible.
I am also on the side of logical, clear thinking; fair, considered judgement; and an absence of breathless hyperbole designed to inflame and excite. Unfortunately, it appears that as long as you have Howey as your spokesperson, those things are going to be absent from your side of the conversation. This doesn’t mean I am anti-self-published authors; I will follow Courtney Milan wherever she wants to lead (although I hope it leads her to the laptop to write more books). But when Howey purposefully misconstrues and twists his critics’ words – the way he slimes Dana Beth Weinberg and her work is particularly nasty and reminds me of his egregious “Bitch from World Con” blog post fiasco – he’s already lost the battle. Period.
@Livia Blackburne: Yes, in a perfect world, a random sample would be normally distributed or close to it. If the sample is stratified (or you oversample some groups) then you are less likely to get that.
I don’t expect a normal distribution in this kind of sample. But I want to see the *type* of skewness these data exhibit so that I can take that into account and choose an appropriate estimation technique (and error tests, and so on).
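To make the skewness point concrete, here is a hedged sketch using a made-up power-law fall-off of sales by rank (the exponent is arbitrary and not derived from the AE data). A sample drawn from the top of such a distribution shows a mean well above the median and a large positive skewness coefficient, which is exactly the kind of thing you want to measure before choosing an estimation technique.

```python
import statistics

# Hypothetical illustration: unit sales that fall off with rank roughly as
# a power law (a common stylized model of bestseller lists; the constant
# and exponent here are invented).
sales_by_rank = [int(100000 / rank ** 0.8) for rank in range(1, 1001)]

mean = statistics.mean(sales_by_rank)
median = statistics.median(sales_by_rank)
sd = statistics.stdev(sales_by_rank)

# Fisher-Pearson moment coefficient of skewness.
n = len(sales_by_rank)
skew = (sum((x - mean) ** 3 for x in sales_by_rank) / n) / sd ** 3

print(f"mean={mean:.0f}, median={median:.0f}, skewness={skew:.2f}")
# A mean far above the median and a large positive skewness both flag the
# heavy right tail; normal-theory error estimates would mislead here.
```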
As for the bias examples, yes, they could work against the data (which would be great because it would strengthen the results) or they could work to inflate the results. It’s not a matter of textbooks showing up in the rankings; it’s that if you have a fixed amount of money to spend on books, then you’re buying non-fiction rather than fiction in January. If you are more likely to buy mystery-thrillers, you’re more likely to forego a Big 5 book because mystery-thrillers are disproportionately Big 5-published (thus changing their relative position vs. an indie book). If you buy romance, your decision not to buy more likely depresses sales of an indie book.
I doubt *this* example is really happening, but it’s the kind of perturbation we tend to think about when we pick cross-sections and then try to imagine what might make it biased compared to other cross-sections.
As for pre-orders, I think there are some discussions upthread that talk about how pre-orders are counted and whether they are being adequately incorporated in the Amazon rankings.
@AlexaB: Thank you for your comment. That is such *interesting* information. I hadn’t realized we were being scraped again, but the interest in this topic makes it irresistible, I suppose.
I’m a bit mystified by the belief that I’m anti-self-published books and authors. I’ve had self-published books on my Best Of lists for the last three years, and the self-pubbed and small-press books on my lists far outnumber the NY-published books. I don’t like reading badly written and produced books, but who does?
@AlexaB: I meant to respond to one of your other points: the way Dana Beth Weinberg’s motives, credentials, and research are being smeared is astonishing to me. She freely admits that the survey is a non-random sample and now it’s being criticized for being unscientific, as if non-random aggregate data cannot provide *any* benefits (it can’t be generalized from, but it has plenty of other possible uses). Meanwhile, a dataset that exemplifies “sampling on the dependent variable” is fine.
Courtney Milan, Weinberg, and I aren’t the only critics out there. If you haven’t read Steve Mosby’s post, it’s well worth a look. Mosby is a mystery/thriller writer who publishes in the UK, two characteristics that the AE data probably aren’t capturing particularly well. But major players on the AE side have some sort of beef with him, so maybe they discount his objective information on those grounds.
@AlexaB: Hi Alexa –
I think I made it pretty clear I acknowledged the limitations of Howey’s source data, as well as agreed with Sunita’s and many others’ points about the limitations of the methodologies used – and most obviously the one that has received the most criticism – which is the extrapolation.
You even cited a few of my many mentions: “…you seem to fail to recognize that you and she AGREE that the data is, well, crappy. In your own words, “I do see your issue with his statements about his conclusions from what he observes with the data,” “It’s definitely a loosey-goosey approach, but the data is messy to begin with,” “this is why I wouldn’t take the report as any kind of gospel,” etc.”
I’m sorry there was still an impression I “fail to recognize” issues with the data.
I made many repeated statements about it to try to make it very clear I understood the limits of the data and what this meant:
* “he’s taking a sales estimate and extrapolating it across a year of time to show a potential (not likely) long-term view. Statisticians and analysts do these kinds of “what-if” scenarios all the time. Usually it’s couched as a SWAG and isn’t intended to be part of the core statistics that have been more rigorously compiled and tested.”
* “Howey is aware (again, because he’s not naive and he explains many times over the difficulty in actually achieving any kind of transparency) that his own data is estimates based upon estimates of incomplete data. This doesn’t make his data non-useful. It makes it imprecise, of course.”
* “This is a tiny slice of data and it’s interesting, obviously incomplete, and only a glimpse of a whole picture.”
* “The sales related to any given rank are a very loose figure.”
* “The main problem is the limited data, which contributes to the extrapolation.”
* “But I think the data is interesting. I don’t believe it’s predictive, and I liken Howey’s approach to a SWAG, which can still have uses.”
* “I don’t think Howey has anything close to that. As I said before, my impression is that he acknowledged the underlying data is estimated, he has a single scrape of data, and the analyses are more along the lines of wet thumb in the air.”
etc., etc.
What I was discussing with Sunita was that I do not believe wrapping more robust statistical methodologies – correlation matrices, distribution curves, etc. – will make anything better.
I tried to make that clear many times:
* “But frankly, since the underlying data is constantly changing as to actual sales as well as methodology to construct rankings (so you have inconsistent shifts in two axes) it’s probably better to just accept the dirty data and go with an average, weighted average, or something along those lines and perhaps include standard deviation.”
* “I tend to see Howey’s data is interesting from a “state of now” perspective. As serious statistics, I really don’t care.”
* “She suggested other statistical approaches, but I’m not certain correlation matrices would make much difference.”
* ” And I similarly wouldn’t bother trying to hit it with sophisticated statistical methodologies. Rough data in will yield rough results out, no matter what approaches are taken. You just can’t fix estimated data.”
* “I would fear that wrapping stats around estimated (and limited) data either lends an air of greater science behind it or can contribute messiness of its own.”
* ” I don’t really consider his analyses to be (or even intended to be) able to withstand rigorous backtesting and the like. When it begins with estimated data, wrapping fancier stats around it rather than the simple Excel formulas he’s using just doesn’t feel right.”
etc., etc.
I think Howey’s data is interesting. I said this several times.
I did not say it was transformative, or rigorous or anything more than “interesting”.
Whether I “have failed to establish any sort of credentials in data analysis or given any of us a reason why we should give credence to your words” is fine. I talked about statistical methodologies and approaches, gave examples including visual ones. But it really doesn’t matter whether that shows familiarity or experience with statistics (sometimes jokingly referred to as “sadistics”) or data analyses. When it comes down to it, there is a person on either side of the screen, and a bunch of pixels in-between to try to make sense of. And the impression any of us make of those pixels doesn’t equal the person on the other side. We make the best we can of it.
I don’t think Dear Author is a big publishing shill. I never said it. Never had that impression.
Nor do I blindly follow the gospel of Howey and Konrath. There are certainly a lot of people who do. I think it’s good to have voices like Howey, Konrath, Eisler, and others just as it’s good to have voices like Shatzkin, Steve Zacharius (a *really* interesting and smart guy – I’ve grown extremely impressed by his engagement and love his even-handed demeanor), and voices in-between in the amorphous middle ground – such as Dean Wesley Smith and Kris Rusch (hybrid authors with very practical advice based on their perspectives and experiences).
I don’t believe there is one single truth, in other words.
The question is how great a disservice is Howey doing with his enthusiasm and hyperbole? Or Konrath with his blunt and unabashed criticism? Or any number of people on the traditional side who advocate the path of query letters, agents, the publishers, etc.?
All of these competing voices recommend a particular “right” way to do something.
There are certainly going to be prospective authors who latch onto one person’s recommendations (just as John Locke sold a lot of his “How I sold a million books” book while forgetting to mention using review-farms as part of the secret sauce). People consume blogs and advice and they either take it all in and make their own decision or they can act impulsively. And even in the latter case, it’s not a dead-end because if they decide the advice wasn’t for them, they can take another’s advice and try something different. Or forge their own path. Or use a conglomeration of it all.
The prospective authors who are methodical aren’t going to jump onto anyone’s bandwagon and buy a lottery ticket sight-unseen. To even self-publish still requires investigation and work. Going the traditional route also requires a lot of front-end work. Stephenie Meyer admitted some naivete when she sent off “Twilight” (she sent it to an agency that was normally less likely to accept an unsolicited manuscript, and the size was way beyond the typical and would normally have been an auto-reject). But she also first researched agents, developed a query letter, worked out the submission requirements, and sent it off.
I’m aware that there is both good and bad advice in the world of publishing. I’m also aware that sometime good advice can be wrong for a particular person and bad advice can actually work for another. There is always an element of individuality and pure random luck (both good and bad) at play.
@Matthew:
Awesome.
But you might want to post that on the site from whence you came. I think on this site we’re all in agreement that Howey’s report is imprecise, unable to withstand rigorous testing, far from gospel and leaps to some pretty wild suppositions that his data just can’t support. Interesting, sure. Reliable? No frakking way (and I’m not referring to Walt’s oil drilling fallacy).
You say credentials aren’t important, it’s the pixels on the screen. I say failing to establish someone’s expertise but taking their words at face value on the ‘net is a really good way to gather really bad – and potentially harmful, depending on the subject – information. So once again, not sure why I should listen to your interpretation of the numbers’ robustness – or Howey’s, or his web crawler’s (I refuse to call him a Data Guy – he wrote a code, as near as I can tell) – over the interpretation of those who established their data analysis qualifications.
@AlexaB: Hi Alexa –
I think you’re still misunderstanding what I’m saying.
I didn’t say credentials aren’t important. You originally said I hadn’t established my credentials, whereas others have. In my reply, I said that was fine. I don’t have a problem with whether you believe I have a grasp of statistics or analyses. And I explained that when it comes down to it, there are people and there are pixels between the people. And we make what we can of it.
Way back in my undergraduate days, actually first semester, I had a Psych professor whose first lecture consisted of him not being present; instead, the class listened to a recorded message in which he discussed some foundations of psychology, including how we form impressions and how close they lie to reality. He challenged us to preconceive some facts about him, and when we saw him in the next lecture we could determine how close our impressions came to reality.
What you or I believe when we see one another’s little scribbles on blog posts isn’t going to change the person behind the screen on either side into someone different from who we are in reality. We’re still people who go about our sometimes quite mundane lives day-to-day. And sometimes I forget that when I jot things down, because I focus on something else – the message, a rebuttal, a debate, and so on.
So what I mean is I could say I’ve worked with stats daily the past fifteen years. It could be true or false or in-between. You could believe it or not. And no matter what I wrote or what you believed or disbelieved about it, the person on either end of this is still the person either of us are.
You also say you shouldn’t listen to my “interpretation of the numbers’ robustness”.
You shouldn’t. But you don’t have to. Because at no point did I say that the numbers are robust. I did everything to say that the data is limited, the math is very simplistic, the extrapolations are a major stretch and weakness, and the whole exercise is more akin to a SWAG.
You’re aligning me with others who you truly have an issue with who may be holding up the analysis as the answer. I called it “interesting” – and my opinion of the whole thing probably falls closer to Courtney Milan’s. She has some suggestions of ways Howey might improve and strengthen his analyses and make it more useful and usable.
I am not a partisan in the great indie-vs.-tradpub war. I don’t make absolutist statements that one is better than the other, that anything is a panacea, and so on. I’m a realist and I know that what fits one person well doesn’t fit another quite so well.
@Matthew:
Hi,
You came here pointed by the Passive Voice. It’s possible to visit that site, read your posts and judge for ourselves why you are commenting on Sunita’s post. In essence, you already left that taped message, and I do have a preconceived idea of who you are and why you are here.
I think it’s fascinating, however, that on this site you have done zero harm to Sunita’s credibility. In fact, you ended up agreeing with her regarding the robustness of the data, and can’t even really poke at her on the edges of stats wank (I have an MBA, I followed along). Thanks for playing, though.
Most scientists understand that establishing credibility lends weight to their theories. In fact, most professionals understand that establishing expertise is what leads people to engage them and pay attention to their words. You yourself say that Howey is “not naive” and trot out his best-selling author status as back-up for this statement (because apparently selling a lot of books makes one an expert in how to interpret data and write an authoritative, reliable market research report).
Since you fail to present what gives you the knowledge, education and/or authority to question Sunita or the other established social scientists who have looked at the data and found it wanting – and even so you admit the data cannot support Howey’s assertions and it is messy, loose, and far from robust – I consider this conversation closed. It’s been a pleasure, but it’s just going around and around at this point. And as a female I’m not really one for circle jerks.
@Matthew:
I really think we are talking past each other at this point, so the sensible thing is to agree to disagree. It is unfathomable to me to think of bivariate correlations as “fancy” or “sophisticated.” I think of them as first cuts that help me understand the relationships in the data. I’m not asking to do anything that can’t be done *in Excel.* I realized pretty quickly that the data aren’t presented in a way that will allow me to do much more than that with them.
I don’t “wrap” stats around data. I use statistical techniques (some simple, some quite complex) to make sense of what is going on in data. It’s a tool. I want to use the simplest tools in the arsenal, and the data as presented don’t allow me to do that. And I can’t decompose the data and get rid of the transformations done on it because the transformations haven’t been described or defined.
When I opened the file, I had taken the authors at their word that they wanted other people to work with the data. I believe they were sincere, but it is troubling that they don’t understand why that isn’t possible for several variables of interest. And what’s left isn’t really worth doing much with because any analysis would be necessarily incomplete and possibly misleading. Sometimes bad analysis is worse than no analysis at all.
@Mike Cane: Yep. All kinds of drastic exclamations about how valuable the data is have been going around. The problem is, the data is in its infancy, and people who don’t understand statistics and proper analysis are seeing much more than there is. My biggest problem wasn’t the data and easily digestible charts, but the words between them: the Swiss-cheese logic used to make far-reaching inferences that wrongly sway opinions. He has a lot of people who trust him, yet they aren’t equipped to understand how he came to his findings. Releasing when/as he did was a bit irresponsible. That ship wouldn’t float in a scientific/historical setting.
But I do hope Howey continues with it and makes an honest effort to improve the findings.
Thanks for this post – it explains much of what I tried but couldn’t. Wanted to let you know I posted a link on my new blog post: JCHemphill.blogspot.com
@Sunita: Hi Sunita –
Thanks for your reply.
I don’t think we disagree on the data. I don’t see how the single snapshot can yield much of anything useful with a lot of normal statistical approaches. The data is too limited, and otherwise includes many estimations. And I think in your last paragraph you express the same opinion.
When I talk about more sophisticated statistics, that’s relative. Howey used simple math, easy to follow for the majority of people. There are obviously a lot of analyses and calcs that can be done with any body of data.
Obviously, the extrapolation of a sales point into a year of earnings is a stretch. I was bored, so I took a couple hours and worked up some other stats that could be run.
If you want to take a look, the additions I made are uploaded here:
http://www.sendspace.com/file/bg9xok
I added three things to Howey’s spreadsheet:
1) A simple web scraper for variable URLs – in case anyone is curious what it takes to scrape a site. It can range from “not very much” (I provide a simple example) to sophisticated. Howey’s coder is probably using a pretty sophisticated scrape-and-parse methodology. The parsing is the toughest part.
2) A Monte Carlo simulation of author earnings – that way it’s at least non-linear. The Monte Carlo sim allows a user to select a category: Big Five Published, Amazon Imprint, Small/Medium Published, or Indie Published. This is because the royalty structure is different for each, and, based on Howey’s data, each also tends to sell in different volumes – with the Big Five coming out highest.
I enhanced the flexibility further by allowing you to define a number of price points. This is mainly aimed at Indies because many Indies experiment with different price points throughout the year, including free days. Free days are also allowed in the Monte Carlo simulation. You can also specify a Floor and Ceiling for both Sales and the Days at which the Sales occur at specified price points.
The Monte Carlo uses a random number generator to draw random sales between the specified Sales Floors and Ceilings. You can run 1,000 or more scenarios, and the distribution graph and the usual stats (StDev, kurtosis, skewness, plus max/min/avg, etc.) are calculated. It runs 5,000 scenarios in about 10–15 seconds, and 10,000 scenarios doesn’t take too much longer.
The spreadsheet pre-loads scenarios but you can change all the Prices, Sales Floors and Ceilings, and Days of Sales Floors and Ceilings. I threw in basic error-handling also.
Obviously, it’s still just a simulation, with more variables plus applied randomness – only slightly better than a straight arithmetic extrapolation. But what it illustrates is that there can be considerable variability, even in a simulation. I kept it simple because I didn’t want to spend a lot of time on it and preferred a simple interface. No forecast is going to be accurate anyway; this one just shows a broader range of possibilities (not probabilities) than a linear extrapolation.
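Stripped to its essentials, that kind of simulation is just repeated random draws between a floor and a ceiling. Here is a minimal Python sketch of the same idea – not the spreadsheet’s actual code, and the price, royalty rate, and sales bounds are made-up inputs, not Howey’s numbers:

```python
import random
import statistics

def simulate_year(price, royalty_rate, floor, ceiling, days=365, rng=random):
    """One scenario: draw a random daily sales count between floor and
    ceiling for each day and sum the royalties earned over the year."""
    return sum(rng.randint(floor, ceiling) * price * royalty_rate
               for _ in range(days))

def monte_carlo(n_scenarios=5000, seed=42):
    """Run many scenarios and summarize the distribution of annual earnings."""
    rng = random.Random(seed)
    # Illustrative inputs only: a $4.99 indie title at a 70% royalty,
    # selling somewhere between 0 and 20 copies a day.
    earnings = [simulate_year(4.99, 0.70, 0, 20, rng=rng)
                for _ in range(n_scenarios)]
    return {
        "mean": statistics.mean(earnings),
        "stdev": statistics.stdev(earnings),
        "min": min(earnings),
        "max": max(earnings),
    }

summary = monte_carlo()
print({k: round(v, 2) for k, v in summary.items()})
```

Even this toy version makes the point about variability: the spread between the min and max scenarios is what a single-day linear extrapolation hides.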
3) A simulated Correlation Matrix calculator – assuming we actually had a year of real sales data to compare sales trends with. Obviously, correlations can be run on any number of things. This one analyzes trend – the direction and intensity of day-on-day sales changes over a 365-day period.
The input page allows comparing a simulation of 4 books each from Big Five, Amazon, Small/Medium, and Indie. A user can define a Sales Floor and Ceiling and apply factors on a Monthly basis to the sales to account for the potential of declining sales (such as during summer) or ramping up of sales (such as often happens between October and January). That way the sales are at least non-linear on a daily basis. The simulation also uses a Random Number approach to sales as bounded by the specified Floors and Ceilings. The basic approach could be modified to examine other variables.
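The trend-correlation idea above can be sketched the same way: simulate daily sales with a monthly factor to mimic seasonal ramps, take day-on-day changes, and correlate those change series. All the numbers below, including the Oct–Dec ramp, are invented for illustration:

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def daily_sales(floor, ceiling, monthly_factor, rng, days=365):
    """Random daily sales within bounds, scaled by a per-month factor
    to mimic seasonal declines or ramps."""
    return [rng.randint(floor, ceiling) * monthly_factor[min(d // 31, 11)]
            for d in range(days)]

def day_on_day(series):
    """Direction and size of each day's change - the 'trend' being compared."""
    return [b - a for a, b in zip(series, series[1:])]

rng = random.Random(1)
flat = [1.0] * 12
ramp = [1.0] * 9 + [1.5, 2.0, 2.5]   # made-up October-January ramp-up
book_a = daily_sales(5, 15, ramp, rng)
book_b = daily_sales(5, 15, ramp, rng)
book_c = daily_sales(5, 15, flat, rng)
print(round(pearson(day_on_day(book_a), day_on_day(book_b)), 3))
```

A full matrix is just this pairwise calculation over every pair of tracked books; with real (not simulated) daily data, the off-diagonal values would be the interesting part.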
The spreadsheet doesn’t do anything terribly sophisticated. It’s Excel, after all, and comes in at under 2MB. Nothing is password-protected and the code is open. Formulas are left in place except for the Monte Carlo simulation, since too many volatile formulas affect performance.
It looks like Howey released a more extensive snapshot. Ideally, if he were to aggregate data over a period of time, some more definitive analyses could happen. But there are a ton of variables that make it difficult to draw a lot of pictures even then. Book release dates, pricing changes, promotions, rank algorithm changes, multiple books with loss-leaders, and a lot of others.
I think a population of something like 10,000 (preferably a lot more) books tracked over a period of a year – with daily sales, prices, rankings, and reviews – would make an interesting dataset to work with. One could definitely discern relationships and trends from it. Forecasting, not so much.
@JC Hemphill: That’s a great post, I enjoyed reading it. Thanks for providing the link!
@Matthew: If you enjoy noodling around with the data, go for it. Seeing what the various possible outcomes are depending on the inputs is kind of interesting, although I’m a little terrified that Excel allows you to run Monte Carlo simulations. This isn’t aimed at you, because obviously I don’t know you, but what comes to mind is what we used to say when Stata and other programs developed GUIs (so you didn’t have to know the command-line syntax anymore): “Awesome! Now people can run estimation procedures they don’t understand and report results they can’t interpret.”
For me, there’s no point to *my* playing around with data if I fundamentally mistrust what I’m looking at in the first place. The variables of interest to me, besides sales, are genre category, title, and author. Two of the three have had their information removed, and the other I believe has not been provided. Without title and author you can’t track individual books across cross-sections; you can only compare ranking slots by publisher. I also want a finer breakdown of the different publisher categories, which I don’t think the AE authors have any interest in providing. So the information that would *really* help authors – i.e., which author and book attributes correlate with sales and ranking position (especially when controlling for other relationships) – is simply not possible to estimate.
As long as so much of the data remain opaque or uncollected, this is a dataset of highly limited usefulness and highly suspect predictive or prescriptive value. If you can do things with it that you find useful, that’s great, but I’m done here. It’s crap data; worse, it’s *opaque* crap data (unlike the crap data I’ve occasionally had to work with). Adding cross-sections of more crap data, or simulating 1000 events of crap rather than sticking to one event of crap, isn’t improving the situation.
It’s been interesting talking to you about this, but I’m moving on to spend my time and effort on data that is transparently gathered and reported, in which I can therefore have some confidence.
@Sunita: Hi Sunita –
That was kind of what I meant when I said it was probably better for the data to just have simple formulas rather than taking a more sophisticated route. When more complexity is applied, it becomes harder to follow and easier to introduce new issues that become hidden because many people tend to trust models. Data needs to be solid before applying modeling.
This is a fairly basic Monte Carlo – the whole code is about 650 lines and the Monte Carlo less than 200 lines (including formatting and formulas). It’s easy to build more complex models even in Excel, including utilizing custom functions, DLLs, etc. Or simply pass data and run it server-side. Or use a MATLAB plugin.
But each successive layer makes the process more complex and difficult to follow or validate. My experience is that people often trust complex things they don’t understand, which is rather scary. I’m not a fan of black boxes. I like to know what’s happening with the data within the model itself.
But as you said, the snapshot data is limited in its ability to be worked with.
It was nice talking with you as well. I understand better where you’re coming from, as from a research perspective the data needs to be very solid, assumptions made clear and defensible, and results verifiable.
Hello, I’m Heather. I’m new to this blog. And I got Sunita confused with Jane. Sorry. I sent the owner of this blog a question, based on this post.
You said: “…new report doesn’t help us situate the Amazon data in a larger context.”
I’m somewhat new to all this and yet I sense one of the main points here is just that:
Howey–and many ebook industry champions–isn’t looking at the larger context. It’s easy for people like him and others who have already climbed up on the rock to comment on how simple the process is, but they seem to continually overlook that new writers are facing a completely different ball game. But we get shut down because hey, let’s listen to the success guys and maybe it’ll rub off on us.
I was even surprised today to find out that in the early days even Howey hadn’t promoted “Wool” as heavily as most try to do now. I don’t mean his success was a fluke, but maybe a new writer needs to understand the CONTEXT of what was happening on Amazon when he found that success. I’ve seen a lot of this behaviour, travelling from blog to blog.
Someone will always be first. That is the nature of change.
I feel it was the same story with Stephen King: right place, right time. And you shouldn’t let the newbies kill themselves trying to repeat a success that was largely based on factors beyond anyone’s control.
Context. I’m not getting enough of this from some of the largest ‘champions’ of this ebook industry. They find a premise to prove and overlook what just happened: the newbie isn’t someone who was traditionally published and moved his fanbase over to self-publishing. A newbie cannot compete with that; it’s not the same context. Hm.
I feel that’s an issue I encounter with a lot of writers I read ebook industry info about. They start with a premise like: “we need to show you Amazon will work for you just like it did for us!”
@Heather Lovatt: Hi Heather, thanks for commenting.
I agree with you that spectacular successes are rarely good models for the rest of the people engaged in the same enterprise. As the comment thread to my “market for lemons” post shows, even Hugh Howey can’t pin down exactly what made Wool take off like it did. We are almost never able to separate the factors that can be reproduced from the factors that are unique or idiosyncratic to a particular example, so it can’t really provide a template.
But everyone wants to know how to catch lightning in a bottle, so to speak.