One of my laments as an avid reader has been the disappearance of real editors at major publishing houses. Arguably the most famous, and possibly the most skilled, book editor was Max Perkins, but every publisher once boasted editors who at least strove to equal Perkins’ genius. Today (with a few exceptions, of course) what passes for an editor at a major publishing house is really an acquisitions editor backed by the marketing department, both of whom see sales (i.e., profit) as the sole purpose of their jobs.
This month came word of an algorithm developed by three men from the Department of Computer Science at Stony Brook University that, they claim, can predict the success of novels (i.e., sales). They claim further that its accuracy can be as high as 84 percent. Needless to say, I was curious to see what they had devised. After all, using math to determine literary success sounds intriguing.
The three men who developed the algorithm also authored a paper reporting their findings in the hitherto unknown (at least to me) Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, copyrighted by the Association for Computational Linguistics. So what, you are reasonably wondering, is “computational linguistics”? The easiest answer comes from the Association’s webpage, which (contrary to my earlier assertion) is remarkably clear:
“Computational linguistics is the scientific study of language from a computational perspective. Computational linguists are interested in providing computational models of various kinds of linguistic phenomena. These models may be ‘knowledge-based’ (‘hand-crafted’) or ‘data-driven’ (‘statistical’ or ‘empirical’). Work in computational linguistics is in some cases motivated from a scientific perspective in that one is trying to provide a computational explanation for a particular linguistic or psycholinguistic phenomenon; and in other cases the motivation may be more purely technological in that one wants to provide a working component of a speech or natural language system. Indeed, the work of computational linguists is incorporated into many working systems today, including speech recognition systems, text-to-speech synthesizers, automated voice response systems, web search engines, text editors, language instruction materials, to name just a few.”
So, with the background out of the way, let’s go back to the algorithm developed by the gentlemen from Stony Brook University. Early in their paper they explain that for the majority of their experiments they “procure novels from Project Gutenberg.” Project Gutenberg currently offers more than 40,000 books in electronic format for free download. Of course, for the books to be free, they must also be in the public domain, meaning that they are no longer under copyright. The authors used the synopses provided by Project Gutenberg to help determine genre classifications, and they based their determination of whether a book was successful on the number of times it had been downloaded from Project Gutenberg.
As I mentioned, the books available on Project Gutenberg are in the public domain, so a quick look at its five most downloaded books (and therefore the most successful books from the study’s standpoint) shows: 1) Adventures of Huckleberry Finn by Mark Twain, 2) The King in Yellow by Robert W. Chambers, 3) Pride and Prejudice by Jane Austen, 4) The Adventures of Sherlock Holmes by Arthur Conan Doyle, and 5) Alice’s Adventures in Wonderland by Lewis Carroll. These are certainly worthwhile books by and large, works that have stood the test of time, but my suspicion is that a lot of students out there took advantage of the “free” nature of Project Gutenberg and downloaded these books from the site rather than buy copies, thereby skewing the download totals.
The authors go on to list a variety of other criteria (like limiting authors in the study set to no more than two books) and conclude their outline of the dataset construction by saying: “These constraints make sure that we learn general linguistic patterns of successful novels, rather than a particular writing style of a few successful authors.”
Well, their analysis led to insights like the following: in the adventure genre, less successful books leaned on words like “never,” “risk,” “worse,” “bruised,” “heavy” and “hard.” More successful books used words – or rather, a word – like “not.” Folks, I feel compelled at this point to note that I am not making this stuff up! [Notice that I used the word “not” in the previous sentence.]
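Out of curiosity, the kind of lexical comparison the paper describes can be sketched in a few lines of Python. The two snippets below are invented stand-ins, not text from the study or from any novel; the point is only to show how relative word frequencies for two groups of prose might be tallied and set side by side.

```python
from collections import Counter
import re

def word_freqs(text):
    """Relative frequency of each lowercase word in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Hypothetical snippets standing in for "successful" and "less successful" prose.
successful = "She was not afraid, and she did not hesitate at the door."
less_successful = "He would never risk the heavy climb; it was hard, and worse awaited."

freq_s = word_freqs(successful)
freq_u = word_freqs(less_successful)

# Compare how often each sample uses a few of the words the study flagged.
for word in ["not", "never", "risk", "hard"]:
    print(f"{word}: {freq_s.get(word, 0.0):.3f} vs {freq_u.get(word, 0.0):.3f}")
```

The actual study, of course, ran far more sophisticated statistical models over full novels; this sketch only illustrates the raw counting that underlies any such comparison.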
At this point, it remains to be seen whether the authors’ algorithm actually works. Perhaps if they invested money in purchasing e-books from the past decade – both bestsellers and books that were not successful – and then ran their program on this sample group, we (and they) would have a better idea whether there is, in fact, some correspondence between word usage and a novel’s success or failure. As things stand now, the question of whether math can help identify literary success is still open.
Years ago, however, I did note one interesting aspect of at least some successful novels, which had to do with structure rather than language. I first noticed it when I read Jean Auel’s Clan of the Cave Bear and later in John Grisham’s The Firm.
In Grisham’s case, he has stated in numerous interviews that when he sat down to write The Firm he was specifically trying to write a bestseller. So he studied other bestsellers – particularly suspense novels – and then began writing.
To say that The Firm was successful would be a substantial understatement. It launched Grisham’s career (after his well-reviewed first novel, A Time to Kill, was a commercial failure), and he can now publish virtually anything he writes.
So what did Grisham do in the structure of his book (as Auel did before him)? In short, he organized his plot like a sine curve (i.e., a smooth, repetitive oscillation). Any novel – particularly a suspense novel – is made up of a series of crises and resolutions, and, in the case of The Firm, the number of pages between these minor crises was almost always the same [if my recollection serves, it was about nine pages in the edition I read]. Likewise, the number of pages between the resolutions of these minor crises was the same. Auel’s Clan of the Cave Bear followed the exact same pattern [11 pages in this instance, if my memory holds true].
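For the mathematically inclined, this pacing pattern can be sketched with a few lines of Python. The nine-page cycle is only my recollection for The Firm, and the page count is arbitrary; the sketch simply shows that modeling tension as a sine wave puts the crises (peaks) at evenly spaced pages, with the resolutions (troughs) spaced the same way in between.

```python
import math

# Illustrative numbers, not taken from the novel: one full
# crisis-and-resolution cycle every 9 pages over a 54-page stretch.
PAGES = 54
CYCLE = 9  # pages per crisis/resolution cycle

def tension(page):
    """Narrative tension modeled as a sine wave, oscillating between 0 and 1."""
    return 0.5 + 0.5 * math.sin(2 * math.pi * page / CYCLE)

# A crisis is a local peak of tension: higher than the pages on either side.
crises = [p for p in range(PAGES)
          if tension(p) >= tension(p - 1) and tension(p) >= tension(p + 1)]
print(crises)  # peaks land exactly CYCLE pages apart
```

Running this, every gap between consecutive peaks comes out to exactly nine pages, which is the rhythm described above.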
So why does this work? The obvious answer is that the pace of the book holds readers’ attention and keeps them turning the pages to see what happens next. The less obvious reason is that humans love rhythm, and a novel organized along the lines of a sine curve provides a perfect rhythm.
So, perhaps there is something to math helping determine the success or failure of a novel. Whether the gentlemen from Stony Brook University have discovered a replacement for editors (real or acquisitions), however, remains to be seen.