Word Stems Visualized

In this blog there have been many posts of words visualized where differences are accentuated and encoded using bold, italics, underlines, etc. But what if you want to visualize the similarities?

Stemming is a basic task in a lot of text analytics where the same semantic word has variant spellings, for example, to indicate different verb tenses (e.g. swim, swam, swimming). But there are also interesting derivations with different meanings, e.g. swim, swimmer, swimmable. So, how could you visualize these to focus on the commonality across the word roots, not the differences?

Stem & leaf plots are possible, but previous examples shown here create lists of leaves, not well suited to comparing syllables. Word trees have been discussed here before, but word trees put a big gap between different parts of text and may vary sizes and weights of different chunks of text in the tree.

Here are six English word sets, each with a common five-letter root:

WordStems.png

Visually, there are six root-word-plots here. Each has the common root word running vertically along the left side of the plot as a stem (e.g. night-, ortho-, micro-, etc.) and affixes branching out horizontally along the right side of the plot. Reading the left root followed by a right affix gives a full word (e.g. nightcap, nightclub, nightfall, etc.). When there are common intermediate syllables, these span across the words that share them (e.g. graph in orthographic, orthographical, orthography).
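As a rough sketch of the grouping behind such a plot (not the code used to make the image above), the affixes left after removing the root can be arranged into a compressed prefix tree, so that shared intermediate chunks such as graph, log or nom become their own nodes. The function names `build_tree`, `root_plot` and `render` are hypothetical, and the split is purely by shared letters, not by syllables.

```python
from os.path import commonprefix  # character-wise common prefix of a list of strings


def build_tree(suffixes):
    """Group suffixes into a compressed prefix tree of nested dicts.

    Keys are shared chunks of text; an empty-string key marks a word
    that ends at that node (e.g. "graphic" inside "graphic(al)").
    """
    tree = {}
    groups = {}
    for s in suffixes:
        if s == "":
            tree[""] = {}
        else:
            groups.setdefault(s[0], []).append(s)
    for members in groups.values():
        chunk = commonprefix(members)
        if len(members) == 1:
            tree[chunk] = {}  # a single leaf: keep the affix whole
        else:
            tree[chunk] = build_tree([m[len(chunk):] for m in members])
    return tree


def root_plot(root, words):
    """Strip the common root (assumed to start every word) and group the rest."""
    return build_tree([w[len(root):] for w in words if w.startswith(root)])


def render(tree, indent=0):
    """Return indented text lines, one per chunk, as a crude stand-in for the layout."""
    lines = []
    for chunk, subtree in tree.items():
        lines.append(" " * indent + (chunk or "(word ends here)"))
        lines.extend(render(subtree, indent + len(chunk)))
    return lines


astro = ["astrologer", "astrological", "astrology",
         "astronomer", "astronomical", "astronomy", "astroturf"]
tree = root_plot("astro", astro)
print("\n".join(render(tree)))
```

For the astro- words this yields two intermediate nodes, log and nom, each with the leaves er, ical and y, plus a lone turf leaf. Because it splits on shared letters only, it reproduces the same non-syllabic breaks as the plot.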

Visually scanning any group makes it easy to compare the affixes. For example, under night-, the affixes -cap, -club, -fall, -ie and -ingale are all completely different, and the derived words all refer to completely different objects. Under micro-, microbe and microbiology have much more in common in meaning (tiny life forms, or the study thereof), although both differ greatly in meaning from microchip and microcosm. Under astro- there are two very different branches of study, namely astrologer, astrological, astrology versus astronomer, astronomical and astronomy.

You can see that the first column, showing words starting with night- and with stand-, has root words that are independent words used to form compound words. Visually, these compound words don’t have additional shared syllables: the prefix is being used to create new words to define unique objects. However, the latter two columns have Greek prefixes (ortho-, micro-, chrom-, astro-), none of which are independent words. And in each of these, there are common syllables indicating subsets of related words. At the same time, these common prefixes can be used to create new, highly different words that deviate more from the others, such as astroturf or microchip.

From a design standpoint, the visual layout borrows from stem & leaf plots, with additional intermediate grouping and only singular leaves (so, not much like a stem and leaf plot :-). Design-wise, it also seems problematic that the words don’t quite split on syllables: for example, the plot shows astro-nom-er whereas it should be as-tron-o-mer.

Technically, it is not easy to create text of different sizes and widths that all visually appear to have similar stroke weights. Ideally, a font based purely on strokes rather than fills would work well for this. The early vector-based computer fonts by Allen Hershey would be great (I used them once upon a time on an old Tektronix 4014). However, their obscure format isn’t readily adaptable to modern font standards. Please, Frank Grießhammer, I hope you can find the time to release the Hershey fonts in OTF format! This is one example of a real-world application for vector fonts.

 


About richardbrath

Richard is a long-time visualization designer and researcher. Professionally, I am one of the partners of Uncharted Software Inc. I am also pursuing a part-time PhD in data visualization at LSBU. The opinions on this blog are related to my personal interests in data visualization, particularly around research interests related to my PhD work. This blog is about exploratory aspects of data visualization, not proven principles.
This entry was posted in Alphanumeric Chart, Data Visualization, Text Visualization.
