Measuring Information Density at Scale
How do you measure information density? Most people can open a page and tell whether it contains a lot of marketing fluff. That works if your website has a dozen posts, but once you reach hundreds or thousands of pages (as our clients do), you have to quantify and automate the process (unless there are some interns you want to make very unhappy).
At PromptMarketing we have tried multiple ways to measure information density, such as lexical and propositional density, but the one that gave us the most consistent results while remaining scalable turned out to be an embedding-based approach.
The high-level overview:
1. Chunk pages at the sentence level
2. Embed each sentence
3. Compare to the centroid of a list of marketing fluff sentences
4. If the average sentence of a page is close to this centroid, you are most likely dealing with a page with low information density
Why Measure Information Density in the First Place?
Web-searching LLMs rely heavily on embedding-based approaches, both at the page level and at the chunk level. This means that in order to perform well in AI search, your page and chunk embeddings need to be close to those of the search queries you are optimizing for.
Fluffy marketing sentences tend to dilute your embeddings, pulling them away from your target keywords. And because every sentence on your page is a citation opportunity, your overall citation probability will tank if your sentences don't contain much information on average.
In short:
- Fluff dilutes similarity. Generic sentences pull your page and chunk embeddings away from the queries you're targeting.
- Every sentence is a citation opportunity. If your sentences are weak on average, LLMs have less reason to cite you.
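To make the dilution effect concrete, here is a toy sketch. It assumes a chunk embedding behaves roughly like the mean of its sentence vectors, which is a simplification for illustration and not how any particular embedding model works. The two-dimensional vectors and the names `query`, `informative` and `fluff` are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pool(vectors):
    """Component-wise mean, used here as a stand-in for a chunk embedding (assumption)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Hypothetical 2-dimensional "embeddings": axis 0 is the target topic, axis 1 is generic marketing talk
query = [1.0, 0.0]
informative = [0.9, 0.1]
fluff = [0.1, 0.9]

focused_chunk = mean_pool([informative])
diluted_chunk = mean_pool([informative, fluff, fluff])

focused_similarity = cosine_similarity(query, focused_chunk)
diluted_similarity = cosine_similarity(query, diluted_chunk)
print(f"focused: {focused_similarity:.2f}, diluted: {diluted_similarity:.2f}")
```

In this toy setup, adding two fluff sentences to one informative sentence moves the pooled vector away from the query, so the diluted chunk scores lower than the focused one.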
Workflow
1. Extract markdown using Screaming Frog
2. Split each article into sentences
3. Embed each sentence
4. Compare each sentence embedding to the centroid of a collection of marketing fluff sentences
5. Take the average similarity per page, then compute `1 - avg`
We call this the specificity score. Higher specificity generally means higher information density.
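A minimal sketch of steps 4 and 5, assuming the sentences have already been embedded by whatever model you use and that "similarity" means cosine similarity (both are assumptions; the function names are illustrative). The three-dimensional vectors are toy stand-ins for real embeddings.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def specificity_score(sentence_embeddings, fluff_embeddings):
    """1 minus the average similarity between a page's sentences and the fluff centroid."""
    fluff_centroid = centroid(fluff_embeddings)
    similarities = [cosine_similarity(e, fluff_centroid) for e in sentence_embeddings]
    avg = sum(similarities) / len(similarities)
    return 1 - avg

# Toy vectors: in practice these come from your embedding model
fluff = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
fluffy_page = [[0.9, 0.1, 0.0], [1.0, 0.1, 0.0]]
specific_page = [[0.0, 1.0, 0.2], [0.1, 0.2, 1.0]]

print(specificity_score(fluffy_page, fluff))    # close to 0: sentences sit near the fluff centroid
print(specificity_score(specific_page, fluff))  # much higher: sentences point elsewhere
```

In a real run you would compute the fluff centroid once and reuse it for every page, since it does not depend on the page being scored.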
The exact numbers will depend on your embedding model and fluff sentence collection, but as a rule of thumb:
- Below 0.70: definitely needs a rewrite
- 0.70–0.75: underperforming, prioritize for review
- 0.75–0.80: acceptable, revisit later
- Above 0.80: generally fine
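These bands can be encoded as a small triage helper. The cut-offs are the ones listed above. How the boundary values are assigned (0.75 goes to the higher band, exactly 0.80 counts as acceptable) and the label strings are assumptions made for this sketch.

```python
def triage_label(specificity):
    """Map a specificity score to one of the rule-of-thumb bands."""
    if specificity < 0.70:
        return "rewrite"
    if specificity < 0.75:
        return "prioritize for review"
    if specificity <= 0.80:
        return "acceptable, revisit later"
    return "generally fine"

# Hypothetical page scores, keyed by URL path
page_scores = {"/about-us": 0.62, "/blog/post-a": 0.73, "/blog/post-b": 0.78, "/guide": 0.86}
labels = {path: triage_label(score) for path, score in page_scores.items()}
```

If you switch embedding models or fluff collections, expect to recalibrate the cut-offs before reusing the labels.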
How to Do an Information Density Audit
Plot the specificity scores at 0.05 intervals: x-axis is specificity, y-axis is the count of pages in that range.
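The binning itself is simple. Here is a sketch that counts pages per 0.05-wide bucket, assuming scores fall between 0 and 1. The small epsilon and the clamping of a score of exactly 1.0 into the last bucket are implementation choices made for this example.

```python
import math
from collections import Counter

BINS_PER_UNIT = 20  # 1 / 0.05

def bin_index(score):
    """Index of the 0.05-wide bucket a score falls into (lower bound inclusive)."""
    # The epsilon guards against floating-point error right at a bucket boundary
    idx = math.floor(score * BINS_PER_UNIT + 1e-9)
    # Clamp so that 1.0 lands in the 0.95-1.00 bucket and nothing goes below 0
    return max(0, min(idx, BINS_PER_UNIT - 1))

def histogram(scores):
    """Count of pages per bucket, keyed by a 'low-high' label, for non-empty buckets only."""
    counts = Counter(bin_index(s) for s in scores)
    return {
        f"{i * 5 / 100:.2f}-{(i + 1) * 5 / 100:.2f}": counts[i]
        for i in sorted(counts)
    }

# Hypothetical specificity scores for six pages
example_scores = [0.62, 0.71, 0.74, 0.75, 0.83, 1.0]
for label, count in histogram(example_scores).items():
    print(f"{label}  {'#' * count}")
```

The resulting dictionary can be passed to any charting tool. The text bars printed here are only a quick sanity check.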
Here's an example distribution for one of our clients:

Many pages fall in the 0.70–0.75 range, and roughly half of all pages underperform with a score below 0.75. In our experience, pages with low specificity scores tend to share the same problems: they lack third-party citations, contain few concrete action points, and are light on declarative sentences that actually say something. You'll also find a lot of older posts that were padded to hit a word count, or content that exists because someone needed to publish something that week, not because it had something to say.
Because GEO analysis of course goes further than just your own website, we also analyzed the information density distribution of our client's biggest competitor! This competitor was consistently outperforming everyone on LLMs. Here are the two distributions compared:

As you can see, the competitor's information density distribution looks a lot healthier than the client's: the peak sits in the 0.75–0.80 range, with outliers in the 0.90–0.95 and even 0.95–1.00 ranges!
Here's a distribution from another client:

This one looks a lot healthier. Most of their content sits well above 0.75. But notice the single page in the 0.60–0.65 range: that turned out to be their "about us" page. It was packed with fluff, which is exactly what you don't want for a page like that in the age of AI search. Your "about us" page needs to be one of the strongest on your site, clearly defining what your company actually does.
In both cases, a single chart gave us an immediate overview of which pages needed attention, even across hundreds of posts.
What We Tried Before
I like to keep tools like this as simple as possible, so we explored a few approaches before landing on embeddings:
- Lexical density: inconsistent, didn't give a clear signal on what was fluff versus substance
- Propositional density: same problem, results were all over the place
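For reference, lexical density is usually defined as the share of content words among all words in a text. Below is a rough sketch of that metric, not a reconstruction of our earlier tooling. It approximates content words with a small hand-picked function-word list, which is an assumption for illustration (a fuller implementation would typically rely on part-of-speech tagging).

```python
import re

# Small illustrative list of function words; real lists are much longer
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "to", "in", "on", "at", "for",
    "with", "by", "from", "is", "are", "was", "were", "be", "been", "it", "this",
    "that", "we", "you", "they", "i", "our", "your", "their", "as", "not",
}

def lexical_density(text):
    """Fraction of words that are not in the function-word list."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    content_words = [w for w in words if w not in FUNCTION_WORDS]
    return len(content_words) / len(words)

print(lexical_density("We deliver innovative cutting-edge solutions"))
```

Note that a buzzword-heavy sentence can still score high on this metric, because buzzwords count as content words.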
The embedding-based approach was the first to give a satisfactory, consistent result. The logical next step would have been using LLMs to score each page directly, but that would be expensive, slow, and nondeterministic, which makes it overkill for what is essentially a triage step.
Conclusion
AI search is embedding-driven, and your content is competing at the vector level. Pages stuffed with generic marketing language get pulled away from the queries that matter, and LLMs have less to cite when every other sentence says nothing.
The specificity score gives you a fast, scalable way to find the weakest pages in your content library. Run the audit, sort by score, and start rewriting from the bottom up. It won't tell you how to fix a page, but it reliably tells you which pages need fixing, and at scale, that's the harder problem to solve.