Stem & Leaf Bigrams

Stem & leaf plots are quirky alphanumeric data visualizations. In typical usage they show distributions of numeric values. Here’s a simple dataset of heights of my family on the left, transformed into a stem and leaf plot on the right:

bigram-stemLeafIntro

At a macro level, the resulting stem & leaf plot shows a distribution. At a micro level, you can easily read off the minimum value (152) and the maximum value (187), and probably make a reasonable guess at the median value. In this case the median (172) is explicitly indicated with an added underline.
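As a rough sketch of how such a plot is built, the snippet below splits each value into a stem (all digits but the last) and a leaf (the last digit). The heights are hypothetical values chosen only to match the minimum, maximum and median mentioned above; they are not the dataset in the figure, and the function name `stem_and_leaf` is my own.

```python
from collections import defaultdict
from statistics import median

def stem_and_leaf(values):
    """Group integers into stems (all but the last digit) and leaves (last digit)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    # Walk every stem between the smallest and largest so gaps stay visible.
    lines = []
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

# Hypothetical heights in cm (illustrative only, not the data in the figure).
heights = [152, 158, 163, 172, 175, 181, 187]

for line in stem_and_leaf(heights):
    print(line)
print("median:", median(heights))
```

Sorting before grouping keeps the leaves in ascending order within each stem, which is what lets the row lengths read as a histogram.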

Character-based Stem & Leaf Plots

So can stem & leaf plots be extended to plot text-based data?

Consider a simple example based on letter pairs. Bigrams are pairs of letters. They can be analyzed across a large body of text and certain letter pairs will be common in certain languages. This can be useful for applications such as auto-detection of language and for cryptography.

Since bigrams are pairs of letters, one could split a bigram into a first letter (for the stem) and a second letter (for the leaf). The resulting stem & leaf bigram plot for all the letter pairs that occur with a frequency above 0.5% in the English language is shown here:

bigram-Leading

The stem (left side) is the leading character of the bigram; the leaves (right side) are all the trailing letters, with font weight indicating frequency. You can see that many of the most common English bigrams start with E. And you can see that ER, IN and TH are among the most frequent bigrams.
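A minimal sketch of this construction follows, assuming bigrams are counted over letters only (spaces and punctuation dropped) and that the 0.5% threshold is applied to each bigram's share of all bigrams. The sample text and the function names are hypothetical. Since plain text cannot vary font weight, the share is printed next to each leaf instead.

```python
from collections import Counter, defaultdict

def letter_bigrams(text):
    """Count adjacent letter pairs, ignoring anything that is not A-Z."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    return Counter(a + b for a, b in zip(letters, letters[1:]))

def leading_stem_plot(counts, min_share=0.005):
    """Map each leading letter to (trailing letter, share) pairs, most frequent first."""
    total = sum(counts.values())
    stems = defaultdict(list)
    for bigram, n in counts.most_common():
        if n / total > min_share:
            stems[bigram[0]].append((bigram[1], n / total))
    return dict(sorted(stems.items()))

sample = "the rain in the north then thinned"  # hypothetical sample text
plot = leading_stem_plot(letter_bigrams(sample))
for stem, leaves in plot.items():
    print(stem, "|", " ".join(f"{leaf}({share:.0%})" for leaf, share in leaves))
```

With a real corpus the threshold prunes rare pairs; on a sample this small almost every bigram passes it.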

Unlike in a stem & leaf plot based on numeric values, the trailing characters may be of interest in their own right. So we shift the stem to the trailing character when we add a second stem & leaf plot on the right:

Bigrams-Both

In the second stem & leaf plot, the stem (now on the far right) indicates the trailing character, and all the leaves indicate the leading character. E is also the most common trailing letter. However, it is now a bit more visible that a trailing N is fairly high frequency.
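Switching the stem from the leading to the trailing letter is just a change of grouping key. Here is a small sketch under that assumption; the counts are made-up numbers for a handful of bigrams, not measured frequencies, and `stem_plot` is a hypothetical helper.

```python
from collections import defaultdict

def stem_plot(bigram_counts, stem_index):
    """Group bigrams by the letter at stem_index (0 = leading, 1 = trailing).

    The other letter becomes the leaf; leaves are ordered by descending count.
    """
    stems = defaultdict(list)
    for bigram, n in bigram_counts.items():
        stems[bigram[stem_index]].append((n, bigram[1 - stem_index]))
    return {
        stem: [leaf for _, leaf in sorted(pairs, reverse=True)]
        for stem, pairs in sorted(stems.items())
    }

# Hypothetical counts (illustrative numbers only).
counts = {"TH": 50, "HE": 45, "IN": 40, "ER": 38, "AN": 35, "RE": 30, "ON": 28}

leading = stem_plot(counts, 0)
trailing = stem_plot(counts, 1)

for stem, leaves in leading.items():
    print(f"{stem} | {' '.join(leaves)}")
print()
for stem, leaves in trailing.items():
    print(f"{' '.join(leaves):>5} | {stem}")
```

In the second loop the stem is printed on the right, mirroring the layout described above.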

Stem & Leaf Trigrams

And the approach can be extended to trigrams. The plot below is centered on the second letter of high frequency English language trigrams:

Bigram-Trigram

The trigrams are ordered from the center out, so in the top row the top trigram is HAT, followed by EAR, followed by WAS. The most common trigram is THE, and there appear to be many trigrams with T at the center. (Note: the dataset is from Practical Cryptography, http://bit.ly/1wf1Lgc, which didn’t consider word spaces.)
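The center-out ordering can be sketched as follows: group trigrams by their middle letter, rank them by count, then write the leading letters right-to-left and the trailing letters left-to-right so that the most frequent trigram sits closest to the stem. The counts below are hypothetical; only the ranking HAT, EAR, WAS is taken from the plot, and the function names are my own.

```python
from collections import Counter, defaultdict

def letter_trigrams(text):
    """Count adjacent letter triples, ignoring anything that is not A-Z."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    return Counter("".join(t) for t in zip(letters, letters[1:], letters[2:]))

def centered_rows(counts, top=3):
    """Build one row per middle letter, most frequent trigram nearest the center."""
    by_center = defaultdict(list)
    for trigram, _ in counts.most_common():
        by_center[trigram[1]].append(trigram)
    rows = {}
    for center, trigrams in sorted(by_center.items()):
        ranked = trigrams[:top]
        left = "".join(t[0] for t in reversed(ranked))  # rank 1 ends up next to the stem
        right = "".join(t[2] for t in ranked)
        rows[center] = f"{left} {center} {right}"
    return rows

# Hypothetical counts; only the relative order of HAT, EAR, WAS comes from the plot.
counts = Counter({"THE": 9, "HAT": 3, "EAR": 2, "WAS": 1})
for row in centered_rows(counts).values():
    print(row)
```

Reading the A row outward from the center gives H-A-T first, then E-A-R, then W-A-S.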

Word Bigrams as Stem & Leaf Plots

Text-based stem & leaf plots can go beyond characters as units and extend to words and phrases. I did a previous blog post regarding characters and adjectives from Grimms’ Fairy Tales, which essentially was a stem & leaf plot.

Another variant is to use people’s names as bigrams: the typical forename-surname is a word bigram. The forename-surname form, commonly used in the West, can be used to construct textual stem & leaf plots. Here are some of the families that were passengers on the Titanic. Surnames form the central stems: first class on the left half of this diagram (in a fancy serif font), third class on the right half (in a plain sans serif font):

bigrams-Titanic

In this example, leaves on the left side of the stems are women (italic) and leaves on the right are men. Children are indicated in ALLCAPS. Those who perished are bold. For example, top left is Ethel Fortune, a first class adult woman who survived. In the same family are Mark and Charles Fortune, first class adult men who died. Among the first class, one can see that the women (along the far left) are almost entirely non-bold, meaning that they survived. Among the first class men there are many more bold entries (died), although many of the survivors are in ALLCAPS (children). Thus we see that, among the first class, many women and children survived. However, in the right half of this plot the third class is almost entirely bold: most of the third class perished, regardless of whether they were women or children.
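The encoding can be approximated in plain text. The sketch below uses only the three Fortune family members described above; the `Passenger` record and the helper names are hypothetical. Since plain text has no bold or italic, asterisks stand in for bold (perished), and the italic for women is dropped because position already encodes it.

```python
from dataclasses import dataclass

@dataclass
class Passenger:
    forename: str
    surname: str
    sex: str        # "F" or "M"
    child: bool
    survived: bool

def leaf(p):
    """Children in ALLCAPS; asterisks stand in for bold (perished)."""
    name = p.forename.upper() if p.child else p.forename
    return name if p.survived else f"*{name}*"

def family_row(surname, passengers):
    """Women to the left of the surname stem, men to the right."""
    family = [p for p in passengers if p.surname == surname]
    women = " ".join(leaf(p) for p in family if p.sex == "F")
    men = " ".join(leaf(p) for p in family if p.sex == "M")
    return f"{women} | {surname} | {men}"

passengers = [
    Passenger("Ethel", "Fortune", "F", child=False, survived=True),
    Passenger("Mark", "Fortune", "M", child=False, survived=False),
    Passenger("Charles", "Fortune", "M", child=False, survived=False),
]
print(family_row("Fortune", passengers))
```

Passenger class (first vs. third) would decide which half of the diagram a row goes in; it is left out here to keep the sketch short.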

Phrases as Stem & Leaf Plots

As a final example, we expand the approach out to phrases. This example is based on the book of Psalms from the Bible, which has a number of repetitious structures. Here, we identify some of the most common phrases, indicated along the horizontal grey line. Above the grey line are the leaves that precede the common phrase, below the grey line are the leaves that follow the common phrase.

bigram-PsalmPhrases

Interestingly, the phrase “I will praise thee” is common, but there is no commonality among the preceding and following phrases. However, “O give thanks unto the Lord” is typically preceded by “Praise ye the Lord” and followed by “for he is good”.
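One way to collect such leaves is to find each occurrence of a common phrase and tally the few words before and after it. This is only a sketch: the sample sentence is assembled from the phrases quoted above rather than the full text of Psalms, and the function name `phrase_context` and the four-word window are assumptions of mine.

```python
import re
from collections import Counter

def phrase_context(text, phrase, width=4):
    """Tally the `width` words before and after each occurrence of `phrase`."""
    words = re.findall(r"[a-z']+", text.lower())
    target = phrase.lower().split()
    before, after = Counter(), Counter()
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            end = i + len(target)
            pre = " ".join(words[max(0, i - width):i])
            post = " ".join(words[end:end + width])
            if pre:
                before[pre] += 1
            if post:
                after[post] += 1
    return before, after

# Sample assembled from the phrases quoted above (not the full Psalms text).
sample = "Praise ye the Lord. O give thanks unto the Lord; for he is good"
before, after = phrase_context(sample, "O give thanks unto the Lord")
print(before.most_common(1))
print(after.most_common(1))
```

A phrase like “I will praise thee” would show up as many different contexts with a count of one each, while a formulaic phrase concentrates its counts on a few contexts.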

About richardbrath

Richard is a long-time visualization designer and researcher. Professionally, I am one of the partners of Uncharted Software Inc. I am also pursuing a part-time PhD in data visualization at LSBU. The opinions on this blog are related to my personal interests in data visualization, particularly around research interests related to my PhD work. This blog is about exploratory aspects of data visualization, not proven principles.
This entry was posted in Alphanumeric Chart, Data Visualization, Font Visualization, Text Visualization.
