Dickens and Oz formatted for Skimming

Text skimming is a strategy for rapidly getting the main ideas of a text without reading every word, dipping into the text instead: reading the first paragraph, unusual words, and proper nouns (e.g. BBCAACC). Skimming can be useful for quickly picking up the concepts, ideas and themes of a text, or for quickly finding a particular passage in a previously read text. Using this strategy, the reader reads titles and introductory sentences, then skips rapidly across the text, dipping in for proper nouns, unusual words, enumerations and so forth. However, picking out unusual words can be difficult, because they are not differentiated from the other text.

Skim Formatting

Instead, the format of the text can be adjusted to make these words visually more dominant. In a skim formatting approach, the font formats of individual words are varied so that the unusual words stand out. Infrequent words in the English language are set heavy-weight, making them visually pop out from the surrounding words, while very common words are set light-weight so that they visually recede.

First two paragraphs of the Wizard of Oz formatted for skimming

Below are some PDFs of full books formatted for skimming.

Wizard of Oz, by L. Frank Baum, is set in 5 levels of Source Sans Pro, a highly readable sans serif font with a lot of variation between heavy-weight and light-weight versions. Note how some major themes stand out in the heavy-weight words in the opening paragraphs – whirlwind, cyclone, gray, blistered, gray. (Baum-WizardOfOz)

The Adventures of Buster Bear, by Thornton Burgess, is a children’s book. It is also set in 5 levels of Source Sans Pro, but shifts towards the lighter weights, so the overall tone is a bit lighter and there aren’t quite so many heavy-weight words vying for attention. (Burgess-BusterBear)

A Tale of Two Cities, by Charles Dickens, is set in 5 levels of a serif font. Serif fonts are more typically used in books; however, the variation between the heaviest weight and the lightest weight is not as pronounced as with some sans-serif fonts. (Dickens-TwoCities)

Emma, by Jane Austen, is also set in a serif font, in this case the elegant Cormorant font, with ligatures turned on for the heavy-weight words (Austen-Emma).

A trio of texts by the Wright brothers including How We Invented the Airplane required a bit of tweaking – too many words were coming out heavy-weight. See notes on the algorithm below if interested. (Wright-Airplane).

Algorithm Notes

Baseline word frequencies for the English language are based on the Wiktionary word frequency lists from Project Gutenberg. Texts are split into individual words using the Python NLTK toolkit. Each word is then ranked based on the Wiktionary frequencies. Typically, the breaks used were as follows:

  • < 100: extra-light
  • 100-500: light
  • 500-1000: normal (book)
  • 1000-20,000: bold
  • 20,000+: black
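
As a rough illustration, the breaks listed above can be turned into a lookup from frequency rank to font weight. This is only a minimal sketch, not the script used for the PDFs: the function names are made up for the example, a rank that falls exactly on a break is assumed to go to the heavier side, and a word missing from the frequency list is assumed to be rarer than everything on it.

```python
from bisect import bisect_right

# Rank breaks and weight names taken from the list above.
BREAKS = [100, 500, 1000, 20000]
WEIGHTS = ["extra-light", "light", "normal", "bold", "black"]


def build_ranks(words_by_frequency):
    """Map each word to its 1-based rank in a most-frequent-first list."""
    return {word: i for i, word in enumerate(words_by_frequency, start=1)}


def weight_for_rank(rank, breaks=BREAKS):
    """Return the weight name for a rank.

    Assumption: a rank equal to a break goes to the heavier side.
    """
    return WEIGHTS[bisect_right(breaks, rank)]


def weight_for_word(word, ranks, breaks=BREAKS):
    """Look up a word's rank (case-insensitively) and return its weight name."""
    # Assumption: a word that is not in the list is treated as rarer
    # than every listed word, so it lands in the heaviest weight.
    rank = ranks.get(word.lower(), breaks[-1])
    return weight_for_rank(rank, breaks)
```

With a real frequency list passed to `build_ranks`, a very common word such as "the" would come out extra-light, while a word absent from the list would come out black.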

This approach generally worked well for novels, but the Wright text is more technical and contains more uncommon words, which produced too much heavy-weight text. In this case the thresholds were shifted to 200, 1000, 5000 and 20,000, which seemed to remedy the issue.
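
One way to see the effect of shifting the thresholds is to count what share of words lands in each of the five levels under each set of breaks. The sketch below is illustrative only: the `class_shares` helper and the sample ranks are hypothetical, not measurements from the Wright texts.

```python
from bisect import bisect_right
from collections import Counter

NOVEL_BREAKS = [100, 500, 1000, 20000]    # breaks typically used for the novels
WRIGHT_BREAKS = [200, 1000, 5000, 20000]  # shifted breaks used for the Wright texts


def class_shares(ranks, breaks):
    """Return the fraction of words in each level, 1 (lightest) to 5 (heaviest).

    Expects a non-empty list of frequency ranks.
    """
    counts = Counter(bisect_right(breaks, r) + 1 for r in ranks)
    return {level: counts[level] / len(ranks) for level in range(1, 6)}


# Hypothetical frequency ranks for the words of a short technical passage.
sample_ranks = [3, 40, 150, 700, 1200, 3000, 4500, 8000, 25000, 60]

novel = class_shares(sample_ranks, NOVEL_BREAKS)
wright = class_shares(sample_ranks, WRIGHT_BREAKS)

# Share of words in the two heaviest levels under each set of breaks.
print(novel[4] + novel[5], wright[4] + wright[5])
```

For these made-up ranks, fewer words fall into the two heaviest levels under the shifted breaks, which is the kind of change the adjustment was after.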

The NLTK parser and associated Python script have some logic to attempt to format plain text. For example, a single line of text surrounded by whitespace is assumed to be a sub-heading, and a word surrounded by _underscores_ is italicized. Numbers are supposed to be extracted and assigned heavy weights, but the word list seems to include some numbers, so some numbers come out in lighter weights. Word stemming and contractions are not given any special handling. Processing of quotes is also not handled well at present: a space follows every quote, which is not right, and the plain quotes should be replaced with proper open/close quotes.
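
The two plain-text heuristics mentioned above (single-line blocks as sub-headings, underscores as italics) might look something like the following. This is a sketch under assumptions rather than the actual script: it uses regular expressions instead of NLTK, the function names are invented, and it assumes paragraphs are hard-wrapped across several lines so that a one-line block really is a heading.

```python
import re

# A single word wrapped in underscores, e.g. _very_
ITALIC = re.compile(r"_([^\W_]+)_")


def classify_blocks(text):
    """Split text on blank lines and label each block 'heading' or 'paragraph'.

    Assumption: a block consisting of a single line is a sub-heading.
    """
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    return [("paragraph" if "\n" in b else "heading", b) for b in blocks]


def italicize(text):
    """Replace _word_ with <em>word</em>."""
    return ITALIC.sub(r"<em>\1</em>", text)
```

A short paragraph that fits on one line would be mislabeled as a heading by this rule, so it only suits texts wrapped at a fixed width.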

The Python script generates an HTML file wherein each word has an assigned class (1-5); and then an external CSS file can be easily used to try different fonts and different weights. The rendered HTML can then be output to a PDF file – which is required since this blog doesn’t seem to handle CSS and multiple fonts.
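
To give a feel for the HTML step, here is a minimal sketch of wrapping each word in a span with a level class from 1 to 5. The class names (`w1` to `w5`), the `level_for_word` callback and the numeric font weights in the CSS are all assumptions for illustration; the real script and stylesheet may differ.

```python
import html
import re

# Hypothetical external stylesheet: one rule per level, easy to swap
# for a different font or a different spread of weights.
CSS = """
body { font-family: "Source Sans Pro", sans-serif; }
.w1 { font-weight: 200; }
.w2 { font-weight: 300; }
.w3 { font-weight: 400; }
.w4 { font-weight: 700; }
.w5 { font-weight: 900; }
"""


def to_skim_html(text, level_for_word):
    """Wrap each word in <span class="wN">; everything else passes through escaped.

    level_for_word is any callable that returns a level from 1 to 5 for a word.
    """
    parts = []
    # re.split with a capturing group alternates non-word text and words,
    # so the odd positions hold the words.
    for i, piece in enumerate(re.split(r"(\w+)", text)):
        if i % 2:
            level = level_for_word(piece)
            parts.append(f'<span class="w{level}">{html.escape(piece)}</span>')
        else:
            parts.append(html.escape(piece))
    return "".join(parts)
```

Because the weights live only in the stylesheet, trying a different font or weight spread means editing the CSS and re-rendering, without regenerating the HTML. Note that this simple split breaks a contraction at the apostrophe, so it shares the contraction limitation mentioned above.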



About richardbrath

Richard is a long-time visualization designer and researcher. Professionally, I am one of the partners of Uncharted Software Inc. I am also pursuing a part-time PhD in data visualization at LSBU. The opinions on this blog are related to my personal interests in data visualization, particularly the research interests related to my PhD work. This blog is about exploratory aspects of data visualization, not proven principles.