Amazon’s take on the Book of Mormon

Amazon.com has an algorithm for noting the “Statistically Improbably Phrases” in any given book. The idea is to look for word combinations that are uncommon generally but common in the book in the hope that this provides potential buyers some insight into what the book is about. Here are the ones for the Doubleday edition of the Book of Mormon:

hearts upon riches,
lievest thou,
exceeding faith,
mine epistle,
more wicked part,
nowise inherit,
hath covenanted,
abominable church,
land northward,
labor exceedingly,
angel spake,
choice above all other lands,
plain unto,
thou hast beheld,
continual peace,
beareth record,
time cometh,
exceedingly great joy,
made manifest unto,
land southward,
whoso believeth,
soul delighteth,
stiffnecked people,
secret abominations,
salvation cometh

Not a bad list, all things considered, although where is “and it came to pass”? I guess that phrase isn’t improbable enough. Nor, apparently, are the various references to deity, which, although ubiquitous in the Book of Mormon, are not uncommon in other books. The link is here, although, you know, if you’re reading this I bet you already have a copy of the Book of Mormon.

16 comments for “Amazon’s take on the Book of Mormon”

  1. What a philosophy we could construct from this list! It’s a relief that “secret abominations” and “hearts upon riches” do not form the subject of much discussion, but how sad that “continual peace” and “exceedingly great joy” are so “improbable.” How would the tourist board of Arizona change its advertising if it realized that “land southward” is probably read/said more in Utah than elsewhere? And how will the statistical frequency of “abominable church” change in the Bible Belt during the Romney campaign?

  2. Hmmmm… Since those phrases are all less than 5 words in length, I wonder what algorithm they’re using? Maybe bi-grams something or other? Wonder what it would look like if you used long snippets…. n-grams where n = 10? n = 20?

  3. A NM,

    I confess complete ignorance, but on the other hand surely huge chunks of the book stand out as unique if you make the chunks long enough (with the obvious exception being the overlap with the Bible).

    Connor,

    That is weird, because the Doubleday link in the post you mention goes to a different page than the one I linked to. The Doubleday book I link to suggests buying The Pearl of Great Price (which seems surprisingly good). Such are the vagaries of computer algorithms.

    Ardis, I think you should contact Arizona and New Mexico.

  4. Well, my favorite “improbable phrase” would have to be more wicked part. However improbable one might suppose that phrase to be, a sizable part of Christianity has historically been given over to controlling (in some measure) our more wicked part.

  5. Most n-gram tools don’t go above n=5, sadly. I agree, though, that it would be interesting to go n=20 or so. The problem isn’t reprocessing the BoM with n=20; it’s whether or not the original source material (for comparison) has been processed with n=20 (which it hasn’t, probably).

  6. Gavin/Nonny Mouse,

    Is the n keeping track of the maximum number of words in each phrase or is it more esoterically related to the number of words in a phrase?

  7. Frank: bigrams look at the statistical probabilities of the text in 2 word chunks, tri-grams in 3 word chunks, and n-grams in n word chunks. I went back and looked over their description of the algorithm (pretty vague, but still interesting) and it looks like what they do is come up with the statistical probabilities of all the words (or sets of n words) in their entire corpus of scanned text, and then look at which ones appear statistically more often or less often in the target book compared to the rest of the corpus. Which is definitely interesting, even on the 2 word level.

    Gavin’s right: it isn’t enough just to compute the BOM 5-grams or what have you; you have to re-process the entire corpus too. It’s just that 2 word phrases tend to be pretty small, you know? :)

  8. I was writing my post while you asked your question :) So, for n-grams usually what you do is just look in successive n-word windows. So, you count up all the probabilities for the first sentence from this comment like this, where n = 2: “I was”, “was writing”, “writing my”, “my post”, “post while”, “while you”. Etc. And you figure out what the chance of each of those combinations in the document is. Then you reference that versus the chance of that occurring in the overall corpus. If it just occurs once, then you know it’s pretty unique to your volume, and can be an “interesting phrase”.

    That’d be the easiest way to generate that info, but it’s too hard to tell from their description if that’s what they’re really doing.
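For what it’s worth, the window-and-compare idea described in comment 8 can be sketched in a few lines of Python. This is only a guess at the general approach, not Amazon’s actual algorithm: the function names `ngrams` and `improbable_phrases`, the add-one smoothing, and the simple ratio score are all assumptions made up for illustration.

```python
from collections import Counter


def ngrams(text, n=2):
    """Return the successive n-word windows of `text`, lowercased, in order."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def improbable_phrases(book, corpus, n=2, top=5):
    """Rank the book's n-grams by how over-represented they are versus the corpus.

    Hypothetical scoring: (rate in book) / (smoothed rate in corpus).
    """
    book_counts = Counter(ngrams(book, n))
    corpus_counts = Counter(ngrams(corpus, n))
    book_total = sum(book_counts.values())
    corpus_total = sum(corpus_counts.values())

    scores = {}
    for phrase, count in book_counts.items():
        book_rate = count / book_total
        # Add-one smoothing (an assumption) so that phrases missing from
        # the corpus do not cause a division by zero.
        corpus_rate = (corpus_counts[phrase] + 1) / (corpus_total + 1)
        scores[phrase] = book_rate / corpus_rate

    # Highest ratio first: common in the book, rare in the corpus.
    return sorted(scores, key=scores.get, reverse=True)[:top]


if __name__ == "__main__":
    sentence = "I was writing my post while you asked your question"
    print(ngrams(sentence, 2))
```

Raising `n` gives the longer snippets wondered about in comments 2 and 5, with the catch already noted there: the comparison corpus has to be counted at the same `n`.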

  9. This is, at a very generic level, one of the techniques they used to catch the Unabomber.

  10. Is it just me, or do those phrases look like examples of subject lines from spam email?

  11. Mike, I noticed that, too. Just add the word “viagra” and they cover most of the spams currently in my inbox:

    viagra. exceedingly great joy
    thou hast beheld viagra
    soul delighteth viagra
    secret abominations viagra
    viagra, more wicked part
    land southward, viagra
    viagra, salvation cometh

  12. DKL: This looks like the high priests’ variation on the old game played by the priests, adding “in a bathtub” to the names of the hymns.

  13. I guess “for behold, I say unto you nay,” was too long to count.

    Meanwhile, the Book of Mormon, as sold by the Church for $2, has to be a ridiculously good value, since the Doubleday version at $17 comes out to 16,889 words per dollar. Though, depending on the index used, as few as 16% of all books with the Search Inside features enabled are “harder” to read.

  14. “Statistically Improbably Phrases”
    What are the odds someone wishing to address statistically improbable phrases would mis-spell “improbable”?

    semper et ubique
