The Quest for Reading Time Estimates, Part 2: Implementation and Verification.
Summary
- Goal: show reading time estimates on my blog posts that go beyond dividing number of words by a flat reading rate.
- Hypothesis: adjusting reading rate according to average word length yields more precise reading time estimates than using a flat reading rate.
- Verification: experiments using a JavaScript implementation of adjusted reading rate and flat reading rate show that this is not the case.
- Discussion: using a flat reading rate in words per minute outperforms the adjusted reading rate, refuting my hypothesis.
Introduction & Recap
Because I was not content with the state of the art of estimating reading times online, I tried to come up with a better algorithm. Based on a review of the literature, and a weighing of exactness against practicability, I decided to use formulae that take average word length into account as a predictor of text complexity:
reading_rate_en = 238 * 4.6 / [average_word_length_of_post]
reading_rate_de = 185 * 5.97 / [average_word_length_of_post]
Both are based on the work of Brysbaert (2019). For an overly detailed account of why I chose these formulae, see the previous post. My expectation is that a reading rate adjusted in this way yields more precise estimates of reading time.
In this post, I describe my implementation of this academic theory and my findings from critical examination of the implementation. The results are somewhat disillusioning.
Implementation
I implemented the formulae in vanilla JavaScript, because that's the language of the static site generator I use. Estimation takes place at compile time and uses the compiled blog posts (i.e., HTML as a string) as input.
Since the input will contain HTML, the implementation needs two parts: 1) sanitization of the input, and 2) calculation of the reading time.
Sanitizing the Input
I don't want HTML markup to mess with the word count, so I first need to filter out the HTML tags. I used text parsing based on regular expressions to find all HTML tags and punctuation and replace them with either an empty string or a blank space. Nothing fancy here.
/** Remove any HTML tags, linebreaks, punctuation or hyphens from a string
*
* For the purpose of later calculating reading rate and reading time of
* the input text.
*
* Regex used for matching HTML tags is taken from the `insane` npm library:
* https://github.com/bevacqua/insane/blob/master/parser.js#L7
*
* @param {string} inputText text to sanitize
* @returns {string} input text sans any HTML tags, linebreaks, punctuation or hyphens
*/
const sanitizeMarkup = (inputText) => {
const regexStartTag = /<\s*([\w:-]+)((?:\s+[\w:-]+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)\s*>/gi;
const regexEndTag = /<\s*\/\s*([\w:-]+)[^>]*>/gi;
const regexNewLine = /[\n\r]/g;
const regexPunctuation = /[.?;,:!"'] /g; // note: no pipes here; inside a character class they would match literal '|'
const regexHyphens = / - /g;
return inputText.replaceAll(regexStartTag, '')
.replaceAll(regexEndTag, '')
.replaceAll(regexNewLine, ' ')
.replaceAll(regexPunctuation, ' ')
.replaceAll(regexHyphens, ' ');
}
Calculating Reading Time
Assuming a sanitized input string, we can simply split it (using whitespace as separator) and use the resulting array for all further calculations (word count, average word length, etc.). The function makes use of the adjusted reading rate optional, mainly because I wanted to compare both calculation methods during testing and verification.
/** Takes a text and estimates the reading time based on average reading speed of the given language.
*
* Reading speeds are taken from academic literature. You can find out more about choice of reading speeds
* here: https://auferbauer.net/en/blog/2025-04-29-2025-04-29_reading_time_estimate_part_1/
*
* You may opt to use an adjusted reading rate that takes average word length of the text vis-à-vis average
* word length of that language into account. Refer to https://doi.org/10.1016/j.jml.2019.104047
*
* @param {string} text the text for which you want to estimate reading time.
* @param {string} language the language of the text (influences assumed reading speed in words per minute); takes either 'en' or 'de', defaults to 'en'.
* @param {boolean} useAdjustedReadingRate whether you want to use an adjusted reading rate that takes average word length into account.
* @returns {number} Estimated reading time of the text in seconds.
*/
const estimateTime = (text, language, useAdjustedReadingRate) => {
if(!text || typeof text !== 'string') {
throw new Error('Text was not provided as string.');
}
let languageReadingRate = 0;
let languageAvgWordLength = 0;
switch (language) {
case 'de':
languageReadingRate = 185;
languageAvgWordLength = 5.97;
break;
case 'en':
default:
languageReadingRate = 238;
languageAvgWordLength = 4.6;
break;
}
const words = text.split(/\s+/).filter((word) => word.length > 0);
const wordCount = words.length;
const characterTotal = words.reduce(
(characterTotal, word) => characterTotal + word.length,
0
);
const documentAvgWordLength = characterTotal / wordCount;
const adjustedReadingRate = languageReadingRate * languageAvgWordLength / documentAvgWordLength;
const readingTimeSec = (wordCount / (useAdjustedReadingRate ? adjustedReadingRate : languageReadingRate))*60;
if (process.env.NODE_ENV === 'development') {
console.debug(`Total characters: ${characterTotal}`);
console.debug(`Word count: ${wordCount}`);
console.debug(`Average word length: ${documentAvgWordLength}`);
console.debug(`Adjusted reading rate (${language}): ${adjustedReadingRate}`);
console.debug(`Estimated reading time (${language}): ${readingTimeSec} sec`);
}
return readingTimeSec;
}
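To illustrate the arithmetic inside estimateTime, here is the core calculation in isolation; the word count and character total are invented example values:

```javascript
// English base values from Brysbaert (2019): 238 wpm, 4.6 characters per word.
const languageReadingRate = 238;
const languageAvgWordLength = 4.6;

// Hypothetical post: 1200 words totalling 6240 characters.
const wordCount = 1200;
const documentAvgWordLength = 6240 / wordCount; // 5.2 characters per word

// Flat rate: word count over words per minute.
const flatTimeSec = (wordCount / languageReadingRate) * 60;

// Adjusted rate: slower, because this text's words are longer than average.
const adjustedReadingRate = languageReadingRate * languageAvgWordLength / documentAvgWordLength;
const adjustedTimeSec = (wordCount / adjustedReadingRate) * 60;

console.log(Math.round(flatTimeSec));     // 303 seconds
console.log(Math.round(adjustedTimeSec)); // 342 seconds
```

The longer-than-average words push the adjusted estimate roughly 40 seconds above the flat one; the verification that follows checks which of the two is closer to reality.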
Verification
My original hypothesis was that the function given above, with the adjusted reading rate enabled, would yield a more precise reading time estimate than a flat rate. To test this hypothesis, I need some ground truth. The literature offers a few text corpora for which reading times have been measured. For English, I used the following:
- Frank et al. provide eye-tracking data and self-paced reading times (48 and 117 subjects, respectively) for a corpus of English sentences. I selected nine sentences (70 words total) at random for evaluation.
- Trauzettel-Klosinski and Dietz showcase a single passage from a standardized international reading test (156 words), including reading time statistics for 436 participants.
- Siegelman et al. provide eye-tracking data for an internationalized corpus of text passages. I arbitrarily selected passage #9 (200 words) and calculated the average reading time of 42 participants from the openly available data.
Here is how the estimated reading time calculated by my code compares to the empirical measurements:
Corpus | Measured | Estimated | Deviation |
---|---|---|---|
1. | 55.02s | 51.57s | 6% |
2. | 40.4s | 37.04s | 8% |
3. | 48.15s | 53.49s | 11% |
That is actually a fairly decent result. Deviations are within acceptable limits for my purpose. Interestingly though, not adjusting the reading rate for average word length and instead using a flat reading rate of 238 words per minute yields even better results:
Corpus | Measured | Estimated | Deviation |
---|---|---|---|
1. | 55.02s | 53.95s | 2% |
2. | 40.4s | 39.33s | 3% |
3. | 48.15s | 50.17s | 4% |
That is a gratifying result as far as my code is concerned. However, it also means that my attempt at finding a better measure for reading speed was in vain.
But what about German? It was distinctly more difficult to find texts to use as ground truth for German. There are only two sources for which both the text and reading speed measurements are available:
- The Potsdam Text Corpus (PoTeC) contains eye-tracking data for twelve scientific texts in German (Jakobi et al., 2024).
- The international corpus by Siegelman et al. that I already used for verification in English also provides German versions.
I was unable to use PoTeC for verification: the reading times I calculated from the raw data implied reading rates far below those reported elsewhere in the literature, as low as 40 words per minute for some participants. Most likely, I did not handle the data correctly. Whatever the cause, the values lie so far outside expected parameters that I chose not to use PoTeC.
This leaves me with the text of Siegelman et al. as the single data point for verification:
Measured | Estimated | Deviation |
---|---|---|
51.78s | 60.46s | 16% |
Not too bad, but not great either. However, same as for English, not adjusting the reading rate for word length and instead using a flat reading rate (of 185 words per minute) yields better results:
Measured | Estimated | Deviation |
---|---|---|
51.78s | 54.81s | 5% |
Discussion
At the outset of my quest stood the hypothesis that using a flat reading rate would yield inaccurate reading time estimates. Hence I attempted to find a better measure that is precise yet still practical, and I thought I had found one by adjusting the reading rate according to average word length.
Surprisingly, for all text corpora in the literature that offer reading time measurements, using a flat reading rate in words per minute outperforms the adjusted reading rate: while the adjusted estimates deviate from the empirical measurements by 6%-16%, the flat-rate estimates deviate by only 2%-5%.
I see no need to investigate other approaches or complicate things further. There really isn't any point trying to improve a 2% deviation from empirical data. Flat reading rate based on words per minute is the more robust tool for estimating reading time and performs more than satisfactorily.
The code I used to examine my hypothesis is thus no better or worse than most of the other reading time estimation code you will find floating around the web. At the time of writing, I'm using it to estimate reading times on this blog, which brings my quest for reading time estimates to a close. In the end, KISS strikes once again.
See you in the next post!