Text skimming is a strategy for rapidly getting the main ideas of a text without reading every word: the reader dips into the text, taking in the first paragraph, unusual words, and proper nouns (e.g. BBC, AACC). Skimming is useful for quickly extracting the concepts, ideas, and themes of a text, or for quickly locating a particular passage in a previously read text. Using this strategy, the reader scans titles and introductory sentences, then rapidly skips across the text, dipping into it for proper nouns, unusual words, enumerations, and so forth. However, picking out unusual words can be difficult – they are not visually differentiated from the surrounding text.
Instead, the format of the text can be adjusted to make these words visually dominant. In a skim-formatting approach, the font formats of individual words are varied so that the unusual words stand out: words that are infrequent in English are set heavy-weight, making them visually pop out from the surrounding words, while very common words are set light-weight so they visually recede.
Below are some PDFs of full books formatted for skimming.
The Wizard of Oz, by L. Frank Baum, is set in 5 levels of Source Sans Pro, a highly readable sans-serif font with a lot of variation between its heavy-weight and light-weight versions. Note how some major themes stand out in the heavy-weight words of the opening paragraphs – whirlwind, cyclone, gray, blistered, gray. (Baum-WizardOfOz)
The Adventures of Buster Bear, by Thornton Burgess, is a children’s book. It is also set in 5 levels of Source Sans Pro, but shifts towards the lighter weights, so the overall tone is a bit lighter and there aren’t quite so many heavy-weight words vying for attention. (Burgess-BusterBear)
A Tale of Two Cities, by Charles Dickens, is set in 5 levels of a serif font. Serif fonts are more typically used in books; however, the variation between their heaviest and lightest weights is not as pronounced as with some sans-serif fonts. (Dickens-TwoCities)
A trio of texts by the Wright brothers, including How We Invented the Airplane, required a bit of tweaking – too many words were coming out heavy-weight. See the notes on the algorithm below if interested. (Wright-Airplane)
Baseline word frequencies for English are based on Wiktionary word-frequency lists derived from Project Gutenberg. Texts are tokenized into individual words using the Python NLTK toolkit, and each word is then ranked against the Wiktionary frequencies. Typically, the rank breaks used were as follows:
- < 100: extra-light
- 100-500: light
- 500-1000: normal (book)
- 1000-20,000: bold
- 20,000+: black
This approach worked reasonably well for novels, but the Wright text is more technical, so more of its words are uncommon and too much of the text came out heavy-weight. In that case the thresholds were shifted to 200, 1000, 5000, and 20,000, which seemed to remedy the issue.
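The ranking-and-thresholding step described above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the lookup table, function names, and the tiny stand-in frequency data are all assumptions.

```python
# Assumes a dict mapping each word to its rank in a Wiktionary-style
# frequency list (1 = most common). Tiny stand-in data for illustration.
FREQ_RANK = {"the": 1, "gray": 1200, "cyclone": 25000}

# (upper rank bound, weight class) pairs, matching the breaks above.
BREAKS = [(100, "extra-light"), (500, "light"), (1000, "normal"),
          (20000, "bold")]

def weight_class(word):
    """Map a word to a font-weight class by its frequency rank."""
    # Words absent from the list are treated as very rare.
    rank = FREQ_RANK.get(word.lower(), float("inf"))
    for upper, cls in BREAKS:
        if rank < upper:
            return cls
    return "black"

print(weight_class("the"))      # very common -> extra-light
print(weight_class("cyclone"))  # rare -> black
```

Shifting the thresholds for a more technical text, as was done for the Wright material, amounts to swapping in a different `BREAKS` table.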
The NLTK-based Python script applies various heuristics to format plain text. For example, a single line of text surrounded by whitespace is assumed to be a sub-heading, and a word surrounded by _underscores_ is italicized. Numbers are supposed to be extracted and assigned heavy weights, but the frequency list appears to contain some numbers, so some numbers come out lighter-weight. Word stemming and contractions are not handled specially. Quote processing is also currently weak – a space follows every quote, which is not right, and the plain quotes should be replaced with proper open/close quotes.
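The two heuristics mentioned – the blank-line sub-heading rule and the underscore-italics rule – might look something like this. A rough sketch with hypothetical function names, not the script itself:

```python
import re

def classify_lines(lines):
    """Tag each non-empty line as 'heading' or 'body': a single line of
    text surrounded by blank lines is assumed to be a sub-heading."""
    tagged = []
    for i, line in enumerate(lines):
        if not line.strip():
            continue  # skip blank lines themselves
        prev_blank = i == 0 or not lines[i - 1].strip()
        next_blank = i == len(lines) - 1 or not lines[i + 1].strip()
        kind = "heading" if prev_blank and next_blank else "body"
        tagged.append((kind, line))
    return tagged

def italicize(text):
    """Replace _word_ with an <em> span."""
    return re.sub(r"_(\w+)_", r"<em>\1</em>", text)
```

The heading rule is deliberately crude; it will misfire on, say, a one-line paragraph, which is the usual trade-off with plain-text heuristics.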
The Python script generates an HTML file in which each word is assigned a class (1-5); an external CSS file can then easily be used to try different fonts and weights. The rendered HTML is then output to a PDF file – which is required since this blog doesn’t seem to handle CSS and multiple fonts.
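The HTML-emission step might be sketched as below. The class names (`w1`–`w5`) and the CSS weight mapping are assumptions chosen to match Source Sans Pro's weight range, not the actual output of the script.

```python
def to_html(words, weight_class):
    """Wrap each word in a <span> carrying its weight class (1-5),
    so an external stylesheet controls the rendering."""
    spans = " ".join(f'<span class="w{weight_class(w)}">{w}</span>'
                     for w in words)
    return f"<p>{spans}</p>"

# An external stylesheet could then map classes to font weights:
CSS = """
.w1 { font-weight: 200; }  /* extra-light */
.w2 { font-weight: 300; }  /* light */
.w3 { font-weight: 400; }  /* normal */
.w4 { font-weight: 700; }  /* bold */
.w5 { font-weight: 900; }  /* black */
"""

# Toy example: pretend "the" is class 1 and everything else class 5.
print(to_html(["the", "cyclone"], lambda w: 1 if w == "the" else 5))
```

Keeping the weight mapping in CSS rather than in the generated HTML is what makes it cheap to experiment with different fonts and break points.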