Before You Trust Book Scout, You Should See This

How we tested Book Scout's accuracy — and what we're still getting wrong.

We use the word "trust" a lot when talking about Book Scout. Trust that what it surfaces is accurate. Trust that it won't flag something that isn't there.

At some point we realized we should actually test whether that trust is earned. So we did.

How we tested it

We put together a list of 27 books. Children's classics, YA, adult literary fiction — Charlotte's Web, The Hunger Games, A Little Life, Pride and Prejudice, The Shining, Thirteen Reasons Why. Books with nothing in them and books with everything.

Before running anything, we wrote down what we expected to see. Charlotte's Web should surface Death & Grief. The Shining should surface Violence, Horror, and Drug & Alcohol Use: Jack Torrance's alcoholism drives the whole story, so it should show up. Thirteen Reasons Why should flag Suicide and Sexual Assault. Pride and Prejudice should return nothing, because there's nothing there.

We also wrote down what shouldn't appear. Harry Potter shouldn't be flagged for Occult content — Hogwarts is a fantasy school. The Giver shouldn't be flagged for Political content — it's a dystopian novel about conformity, not advocacy for a political party.

Then we ran the AI against all 27 books and scored every result.
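
Scoring a single book comes down to two set comparisons: what came back versus what should have, and what came back versus what shouldn't have. The sketch below is a minimal illustration under stated assumptions, not Book Scout's actual code. The Tally class and score_book function are hypothetical names, and the must-not-appear set for Pride and Prejudice is made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    """Running totals across all scored books (hypothetical structure)."""
    hits: int = 0            # expected themes the AI returned
    present_checks: int = 0  # themes we expected to see
    false_alarms: int = 0    # returned themes that should have stayed quiet
    absent_checks: int = 0   # themes we expected NOT to see


def score_book(returned, should_appear, should_not_appear, tally):
    """Compare one book's returned themes against the written-down expectations."""
    tally.hits += len(returned & should_appear)
    tally.present_checks += len(should_appear)
    tally.false_alarms += len(returned & should_not_appear)
    tally.absent_checks += len(should_not_appear)
    return tally


tally = Tally()

# Charlotte's Web: one expected theme, and Violence must not appear.
score_book(
    returned={"Death & Grief"},
    should_appear={"Death & Grief"},
    should_not_appear={"Violence"},
    tally=tally,
)

# Pride and Prejudice: nothing expected (illustrative must-not-appear set).
score_book(
    returned=set(),
    should_appear=set(),
    should_not_appear={"Violence", "Sexual Content"},
    tally=tally,
)

print(tally)
```

In this sketch, a returned theme that appears on neither list is simply left unscored. How the real test treats that case is not something this post specifies.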

The two numbers that actually matter

Most "accuracy" claims only track whether the AI found what it should have found. We track that — but we also track the opposite: how often it flags something that isn't there.

Both matter. If we tell you Charlotte's Web contains violence, you'll stop trusting us for the books where it counts.

Recall: when a theme is genuinely present, how often does the AI find it? 95.3% across 191 checks.

False positive rate: when a theme isn't there, how often does the AI flag it anyway? 2.0%. Out of 148 things we expected it to stay quiet about, it got 3 wrong.

Movie-style rating accuracy (G / PG / PG-13 / R): 96.3%
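
For anyone who wants to check the arithmetic, here is a minimal sketch. Only the false positive counts (3 of 148) are stated outright. The hit counts of 182 for recall and 26 for rating accuracy are inferred as the only whole numbers that round to the published percentages, and the rating denominator of 27 assumes one rating per book, so treat those three values as assumptions.

```python
def rate(count: int, total: int) -> str:
    """Format count/total as a percentage with one decimal place."""
    return f"{count / total:.1%}"


recall = rate(182, 191)              # 182 is inferred, not a stated figure
false_positive_rate = rate(3, 148)   # both counts are stated in the post
rating_accuracy = rate(26, 27)       # 26 and 27 are inferred

print(recall, false_positive_rate, rating_accuracy)  # 95.3% 2.0% 96.3%
```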

What the results actually look like

Charlotte's Web came back with one theme: Death & Grief, Mild. That's it. For a children's book where a spider dies peacefully, that feels right.

The Shining came back with seven themes: Violence, Disturbing Content, Abuse, Drug & Alcohol Use, Death & Grief, Psychological Distress, and Profanity. Jack Torrance's alcoholism is the reason the family is in danger — and it's in the results.

The Hunger Games got Violence and Death & Grief but nothing for Sexual Content, Drug & Alcohol Use, or Political Content, because book one doesn't have those things.

What we're still getting wrong

This part matters as much as the scores.

A Little Life has been a tough one. It's the most intensely distressing book in our dataset: childhood sexual abuse, prolonged self-harm, and suicide, all depicted in unflinching detail. The AI correctly surfaces six themes at the highest severity level. But it consistently misses Drug & Alcohol Use, even though the main character's substance dependency runs through hundreds of pages. When we ask it directly ("does this book contain drug addiction?"), it answers correctly. It just doesn't volunteer the theme on its own when it's competing with six Graphic-severity items. We're still working on this.

Harry Potter and the Deathly Hallows occasionally misses War & Conflict. The Battle of Hogwarts is large-scale combat with named characters dying — it should show up. Sometimes it does, sometimes it doesn't. We've improved consistency here and are still tuning it.

Wonder sometimes returns Death & Grief as a flag, which we consider a false positive. There's a grief subplot, but it's not the kind of thing a parent searching for content warnings would expect to see highlighted. The threshold question for lighter books is something we keep revisiting.

We're publishing these cases rather than glossing over them. A 95% recall number only means something if you know what the remaining 5% looks like.

The part that's harder to test

The aggregate scores measure whether the AI finds things most readers would agree are there. But parents aren't looking for averages — they're looking for their specific concern.

That's what the Content Finder is for. You tell us what you're looking for in your own words, and we check for it.

We ran this separately across 9 books and 17 specific topics — things like "teen suicide" in Thirteen Reasons Why, "childhood sexual abuse" in Perks of Being a Wallflower, "drug addiction" in A Little Life. It found the topic 94.1% of the time when it was genuinely there.

When the topic wasn't there — "racism" in Thirteen Reasons Why, "violence" in Charlotte's Web, everything we checked in Pride and Prejudice — it returned a false positive zero times across 22 checks.

The Content Finder is where the product earns its keep for parents who need a specific concern answered rather than a general content overview.

How the numbers got better

These aren't the scores we started with. Recall was around 84% when we first built this. The false positive rate was over 8%.

Getting from there to here was mostly about understanding why things failed, not just that they did. The Occult false positive on Harry Potter kept appearing even after we told the model explicitly that genre fantasy isn't real-world occult practice — it took building a structured self-check step into the prompt before it would reliably stick. The Drug & Alcohol Use misses on The Shining turned out to be a counting problem: we had capped the model at returning five themes per book, and it was dropping alcoholism in favor of the more obvious horror content. Removing the cap let it surface what's actually there.

We run the same 27-book test every time we change a prompt, so a fix for one book can't quietly break another.
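
One way to picture that regression check: compare each run's pass/fail results with the previous run's and list anything that flipped from pass to fail. The sketch below assumes results are stored as a mapping from (book, check) to a boolean. That storage format, the find_regressions name, and the sample data are all illustrative, not a description of the actual test harness.

```python
def find_regressions(baseline, current):
    """Return checks that passed in the baseline run but fail in the current run.

    A check missing from the current run is treated as a failure.
    """
    return sorted(
        check
        for check, passed in baseline.items()
        if passed and not current.get(check, False)
    )


# Made-up results for illustration only.
baseline = {
    ("Harry Potter", "no Occult flag"): True,
    ("The Shining", "Drug & Alcohol Use"): False,
}
current = {
    ("Harry Potter", "no Occult flag"): False,  # broke after the prompt change
    ("The Shining", "Drug & Alcohol Use"): True,  # fixed by the prompt change
}

regressions = find_regressions(baseline, current)
print(regressions)  # [('Harry Potter', 'no Occult flag')]
```

Comparing individual checks rather than only the aggregate score matters here, because one fix and one new failure can cancel out and leave the headline number unchanged.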

The AI isn't the last word

One thing worth knowing: the AI doesn't have to be the final answer.

If you read a book and something in the theme breakdown looks wrong — too harsh, too lenient, or just missing something obvious — you can flag it. A human reviews every flag. That's intentional. We'd rather have a correction process than pretend the AI is infallible.

You can also leave a comment on any book with your own take. Not everyone's threshold is the same. What reads as mild violence to one parent might be a dealbreaker for another. Comments let readers share that context with each other in a way no AI analysis can fully replace.

The scores above measure what the AI gets right on its own. The flagging system and comments are how the community fills in the gaps.

The honest version

We're not claiming this is perfect. A Little Life is a good example of why: the AI knows about the drug dependency, but it doesn't reliably surface it on its own. We'll keep working on it.

There's also a hard limit worth being upfront about: the AI's training data cuts off around June 2024. Books published after that don't have results, and the model will tell you so rather than guess. If you're searching for a 2025 or 2026 release, Book Scout will say it doesn't have enough reliable information instead of making something up. That's a real limitation, but declining to guess is the right call.

For everything published before mid-2024, the 95% recall and 2% false positive rate give you a reasonable baseline for trusting what this tool surfaces.

If you search a book and the results don't match what you know is in there, we'd genuinely like to hear about it. That's how the 84% became 95%.

Happy reading.
