Computer scientists at Stony Brook University have developed an algorithm that predicts, with an accuracy reaching 84%, the “success” of a novel. In a paper presented in October at a conference on natural language processing, authors Vikas Ganjigunte Ashok, Song Feng, and Yejin Choi lay out the process by which they analyzed various linguistic characteristics that they found correlated with critical and commercial success (the two are not distinguished in the results). The resulting algorithm was developed from a corpus provided by the online literary archive Project Gutenberg, which contains the full text of some 42,000 books out of copyright.
The researchers then tested their model against a “handful” of more recent works, culled from the lower reaches of the Amazon sales rankings, and were able to “predict,” with a measure of accuracy, the success of these books. Though some outlets, in writing about the study, have noted its amusing identification of Dan Brown’s The Lost Symbol as a less successful book, that novel was also the only true bestseller the researchers included in their non-Gutenberg set (“because of negative critiques it had attracted from media despite its commercial success”). It was joined in this category of algorithmic condemnation by works from Bernard Malamud, William Faulkner, and Philip Roth.
But the study did deliver some intriguing (and some mundane) conclusions about the shared characteristics of bestsellers: “readability” and — paging Jonathan Franzen — literary success are negatively correlated; a glut of “thinking verbs” will correlate more closely with success than “emotional or action verbs”; and nouns and adjectives win out against verbs.