The Quest for Reading Time Estimates, Part 1: Finding an Appropriate Formula.
As I am starting a new blog, I wanted to add a reading time estimate to each article. Surely, this is a solved problem, right? Well, yes and no.
The problem appears solved in that there are How-To articles and tutorials for it. However, most assume a universal fixed reading rate in words per minute. Then, they simply divide the total number of words in the article by that reading rate and call it a day.
Thus, the problem is not solved, insofar as I am not satisfied with this approach. It seems oversimplified to me. What about the nature of the text? Surely, a technical paper will yield different reading rates than a tabloid article. And does this magic universal reading rate apply to both German and English?
Factors that Influence Reading Rate
As it turns out, there are many factors that influence reading rate. They can be roughly divided into three categories, depending on whence they emanate:
- There is complexity that is inherent in the text itself. The higher this complexity, the slower the reading speed will be. The ratio of long words (Miller and Coleman 1971; Radner et al. 2002), as well as syntactical sentence structure (Gibson 2001; Pauly and Nottbusch 2020) correlate with this complexity.
- There are factors that influence reading rate which are inherent to the reader; such as literacy (Radner et al. 2002; Trauzettel-Klosinski and Dietz 2012) and age (Trauzettel-Klosinski and Dietz 2012; Brysbaert, Keuleers, and Mandera 2019).
- Some factors that influence reading rate come from how the text is displayed and the environment in which content is consumed. Contrast (Legge et al. 1990), text size (Bailey, Clear, and Berman 1993), and font (Mansfield, Legge, and Bane 1996) fall into this category.
Both the second and third categories are unknowable to me for all practical purposes. I have no certainty as to how my blog is displayed or who reads it (at least a priori). This leaves me with only complexity as a candidate for estimating reading times.
Text Complexity as Determinant of Reading Rate
As text complexity is my only candidate for estimating reading rate, I need some way to model it. Multiple theories exist on how to model text complexity. Two that have been popular in literature are:
- Dependency locality theory (DLT) postulates that, during reading, cognitive resources are expended on 1) storage of the sentence structure so far, and 2) integration of the currently read word into that structure. It further states that complexity depends on the distance between the two elements being integrated (Gibson 2001). Simplified: the longer and more nested a sentence is, the higher its complexity.
- Surprisal models complexity as "an information-theoretic concept that reflects the expectedness of each word given its preceding context" (Henderson et al. 2016). Meaning: The less likely a word is to appear in a given context, the higher its complexity.
A complete theory of complexity will likely need to integrate both of these measures, as they have been suggested to be uncorrelated (Demberg and Keller 2008).
While these theories are academically interesting, their application seems hardly feasible for my purpose. The expected gain in accuracy is unlikely to outweigh the effort of implementation. I am, after all, looking to estimate reading times for a blog that maybe five people and one LLM will ever read.
Practical Approach: Readability Scores
Directly implementing the academic theories regarding text complexity appears excessive for my purpose. However, there is a related category of models that is more practical for estimating reading rate: text readability scores.
The most prominent text readability score, as far as my impression from internet research goes, is the Flesch Reading-Ease index. Other examples include the SMOG grade or Coleman-Liau index. All of these are tuned for the English language. In contrast, the Wiener Sachtextformel and Lesbarkeitsindex (LIX) work well for German (https://barrierefreies.design/werkzeuge/lesbarkeit-analysieren).
Most of these scores rely on calculating the length of sentences and of words, the latter defined by the number of syllables they contain. This is their link to the purely academic theories on text complexity. Alas, it is also their caveat for my purpose.
The calculation of sentence length in itself is not trivial. Consider punctuation in the middle of a sentence; such as, e.g., the one you are reading right now. Robust calculation of syllables appears even further out of reach. Attempts have been made, but are not reliable (see this example, this example, or this one).
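The pitfall can be sketched with a naive sentence splitter (a deliberately simplistic illustration; the function name is my own, not from any library):

```javascript
// Naive sentence counting: assume sentences end at '.', '!' or '?'.
// This is a deliberately simplistic sketch to illustrate the pitfall.
function naiveSentenceCount(text) {
  return text
    .split(/[.!?]+/)
    .filter((part) => part.trim().length > 0).length;
}

// The abbreviation "e.g." makes one sentence count as three:
console.log(naiveSentenceCount("Consider punctuation, e.g. the one you are reading.")); // → 3
```

Handling abbreviations, decimal numbers, ellipses, and quotations correctly quickly turns this into a hard problem.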
Simplifying Further
Readability scores are easier to implement than the purely academic theories on text complexity. However, most still rely on metrics that I can't reliably compute with a reasonable amount of effort: sentence length and syllable count.
Do I have to resort to words per minute alone, after all? As it turns out, I do not. Because, luckily for me, someone came up with a simple formula that still factors in complexity.
In 2019, Brysbaert conducted a meta-review of the literature on reading rates (Brysbaert 2019). He chose words per minute as the metric for reading rate due to its high rate of adoption. However, he suggests a corrected formula for calculating the expected reading rate:

reading_rate = 238 * 4.6 / [average_word_length]

Here, 238 is the best available estimate for reading rate in words per minute, and 4.6 is the average word length in non-fiction texts. Thus, the formula uses words per minute as a basis, but factors in word length as a representation of complexity.
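Brysbaert's correction is simple enough to express in a few lines of JavaScript (a sketch; the function name is my own choice):

```javascript
// Brysbaert's corrected reading rate (Brysbaert 2019):
// 238 wpm is the best available estimate of reading rate,
// 4.6 the average word length in non-fiction texts.
function correctedReadingRate(avgWordLength) {
  return (238 * 4.6) / avgWordLength;
}

// Longer average words lower the estimated rate:
console.log(correctedReadingRate(5.2) < correctedReadingRate(4.6)); // → true
```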
That seems like a promising start. An applicable formula without error-prone calculation of sentence length or syllable count. While there is some risk of oversimplification, word length correlates well with text difficulty (Miller and Coleman 1971). This is corroborated by Radner et al., who note that longer words correlate with a decrease in reading speed (Radner et al. 2002). Thus I will consider word length a reasonable proxy for complexity.
Adapting for German
As I also write in German, I still need to adapt Brysbaert's formula with German values for average reading speed and word length. Sadly, there is no meta-review of German reading rates that makes these values easily available.
Duden claims an average length of 5.97 letters per word - that's pretty much cut and dried, coming from the foremost authority on the German language corpus. The numbers get more diverse when it comes to average reading rate.
Radner et al. report a reading speed of 209 words per minute for short sentences and 170 words per minute for long sentences (Radner et al. 2002). Trauzettel-Klosinski and Dietz measure 179 words per minute in an international standardised test (Trauzettel-Klosinski and Dietz 2012). Pauly and Nottbusch report a mean reading speed between 184 and 210 words per minute, depending on syntactic sentence structure (Pauly and Nottbusch 2020).
Based on this literature, I'll assume an average reading rate of 185 words per minute for German readers. The reasoning behind this choice is that I'll be writing about technical topics; hence a value from the lower end of the range. However, I usually put effort into lowering the complexity of my writing; therefore I didn't choose the absolute lowest observed reading rate. I will adjust this value as needed once I have some empirical data.
The Result
Putting all of the above together, the formulas I'll use to calculate the reading rate are:
reading_rate_en = 238 * 4.6 / [average_word_length_of_post]
and
reading_rate_de = 185 * 5.97 / [average_word_length_of_post]
for English and German, respectively. Considering all the literature known to me, this appears to be the best compromise between keeping things pragmatic and factoring in some measure of text complexity. I expect word length to be a reasonable approximation of overall complexity, in accordance with (Brysbaert 2019; Miller and Coleman 1971; Radner et al. 2002).
Finally, the formula for estimating the reading time of a given blog post in minutes would be:
reading_time_minutes = [words_in_post] / reading_rate
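Everything above fits into a short vanilla JavaScript sketch (the constants are the ones derived above; the function name, the RATES object, and the whitespace tokenizer are my own choices, not a finished implementation):

```javascript
// Language-specific baselines: average reading rate (wpm) and
// average word length, as derived above.
const RATES = {
  en: { wpm: 238, avgWordLength: 4.6 },
  de: { wpm: 185, avgWordLength: 5.97 },
};

function estimateReadingTimeMinutes(text, lang) {
  // Naive tokenizer: split on whitespace, drop empty tokens.
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const avgWordLength =
    words.reduce((sum, w) => sum + w.length, 0) / words.length;
  const { wpm, avgWordLength: baseline } = RATES[lang];
  // Corrected reading rate: baseline wpm scaled by relative word length.
  const readingRate = (wpm * baseline) / avgWordLength;
  return words.length / readingRate;
}
```

Note that the naive whitespace tokenizer counts Markdown syntax and punctuation as part of words, which slightly inflates the average word length; the implementation will have to deal with that.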
Caveats
I took some liberties when plucking numbers from literature. I completely disregarded the typographical aspects under which average reading rates were determined. Many factors could influence the average reading rate on my site: contrast, font type, font size, line length, etc.
Of course, all of this is conjecture and lacks critical examination for now. Trying to empirically refute the proposed formula is out of scope for one single blog post. I'll work on that in the future.
What's Next?
In the next post, I'll describe an implementation of the proposed formula. We'll be using vanilla JavaScript to estimate the reading time of blog posts written in Markdown / HTML format. Once that is implemented, I may start some empirical tests of the calculated reading times. Stay tuned!