Text in Visualization – thesis on-line

You can now find my full thesis on-line. Instead of reading the whole thesis to learn about the design space of text in visualization, you can find a two-page overview that summarizes the entire thesis on pages v-vi.

The first half of the thesis (page v on the left) methodically defines the design space by reviewing many examples. The second half of the thesis (page vi on the right) then tests the breadth of the design space by creating many different kinds of extended and novel visualizations and provides general critiques. If you want to drill down into any area, little blue subscripts are links to the corresponding chapters.

For readers of the blog, you’ll find more detail on many items previously discussed here.

Posted in Data Visualization | Leave a comment

Word Stems Visualized

In this blog there have been many posts of words visualized where differences are accentuated and encoded using bold, italics, underlines, etc. But what if you want to visualize the similarities?

Stemming is a basic task in a lot of text analytics where the same semantic word has variant spellings, for example, to indicate different verb tenses (e.g. swim, swam, swimming). But there are also interesting derivations with different meanings, e.g. swim, swimmer, swimmable. So, how could you visualize these to focus on the commonality across the word roots, not the differences?

Stem & leaf plots are possible, but previous examples shown here create lists of leaves, not well suited to comparing syllables. Word trees have been discussed here before, but word trees put a big gap between different parts of text and may vary sizes and weights of different chunks of text in the tree.

Here are six English word sets, each with a common five letter root:


Visually, there are six root-word-plots here. Each has the common root word running vertically along the left side of the plot as a stem (e.g. night-, ortho-, micro-, etc.) and affixes branching out horizontally along the right side of the plot. Together, the left root and right affix form full words (e.g. nightcap, nightclub, nightfall, etc.). When there are common intermediate syllables, these span across the common words (e.g. graph in orthographic, orthographical, orthography).
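
The grouping behind such a plot can be sketched in a few lines of Python. This is a minimal sketch, not the code used to draw the plots above: the function name `group_by_shared_segment`, the `min_shared` threshold and the word "orthodox" in the sample list are assumptions for illustration, and the function finds shared runs of letters rather than true syllables.

```python
from collections import defaultdict
from os.path import commonprefix


def group_by_shared_segment(root, words, min_shared=3):
    """Group words sharing `root` by a common segment that follows the root.

    Returns a dict mapping each shared segment ('' when a word shares nothing
    beyond the root) to the list of remaining endings.
    """
    # Strip the root; keep only words that actually start with it.
    rests = sorted(w[len(root):] for w in words if w.startswith(root))

    # Bucket the remainders by their first letter as candidate groups.
    buckets = defaultdict(list)
    for rest in rests:
        buckets[rest[:1]].append(rest)

    groups = defaultdict(list)
    for bucket in buckets.values():
        shared = commonprefix(bucket)
        if len(bucket) > 1 and len(shared) >= min_shared:
            # e.g. 'graph' spans 'graphic', 'graphical', 'graphy'
            groups[shared].extend(r[len(shared):] for r in bucket)
        else:
            # No meaningful shared segment: each ending is its own leaf.
            groups[""].extend(bucket)
    return dict(groups)


# Illustrative word list (an assumption, not the exact set in the plot).
ortho = ["orthodox", "orthographic", "orthographical", "orthography"]
print(group_by_shared_segment("ortho", ortho))
```

With the night- words, no group shares three or more letters after the root, so every affix stays a separate leaf, which matches the compound-word columns described here.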

Visually scanning any group means that the affixes can be easily compared. For example, night- -cap, -club, -fall, -ie, -ingale are all derived words with completely different affixes, all referring to completely different objects. Under micro-, microbe and microbiology have much more commonality in meaning, being about tiny life forms or the study thereof, although they differ greatly in meaning from microchip and microcosm. Under astro- there are two very different branches of study, namely astrologer, astrological, astrology versus astronomer, astronomical and astronomy.

You can see that the first column, showing words starting with night- and with stand-, has root words that are independent words used to form compound words. Visually, these compound words don’t have additional shared syllables: the prefix is being used to create new words to define unique objects. However, the latter two columns have Greek prefixes (ortho-, micro-, chrom-, astro-), none of which are independent words. And in each of these, there are common syllables indicating subsets of related words. At the same time, these common prefixes can be used to create new, highly different words that deviate more from the others, such as astroturf or microchip.

From a design standpoint, the visual layout borrows from stem & leaf plots, with additional intermediate grouping and only singular leaves (so, not much like a stem and leaf plot:-). Design-wise, it also seems problematic that words don’t split quite on syllables: for example, the plot shows astro-nom-er whereas it should be as-tron-o-mer. 

Technically, it is not easy to create text of different sizes and widths that all visually appear to have similar stroke weights. Ideally, a font based purely on strokes rather than fills would work well for this. The early vector-based computer fonts by Allen Hershey would be great (which I used once-upon-a-time on an old Tektronix 4014). However, their obscure format isn’t readily adaptable to modern font standards. Please, Frank Grießhammer, I hope you can find the time to release the Hershey fonts in OTF format! This is one example of a real-world application for vector fonts.


Posted in Alphanumeric Chart, Data Visualization, Text Visualization | Leave a comment

Why history matters in data visualization

In any thesis or academic peer-reviewed paper, positioning your work in context of prior research is paramount to show your unique contribution and how your work “stands on the shoulders of giants”[ref].

My recent PhD thesis goes beyond the typical references of the last 10-20 years in my field (data visualization) and even the origins of my field (arguably, the foundations were set 50 years ago by Jacques Bertin [ref]).  I look beyond the field to other old domains such as cartography, typography and the arts.


A lot of what we are “inventing” in visualization has precedents in history and other domains. If there are precedents – maybe something was learned in the other field that we can leverage? Here are a few examples:

Sunburst chart
(aka hierarchical pie chart, concentric chart)

John Stasko and Eugene Zhang did this great visualization of a sunburst chart back in 2000. It’s a great approach to intuitively show hierarchical data. And you can find many great implementations on D3 these days too:


But there are earlier precedents. I particularly like this one: A Zoological Chart from Fike’s Concentric Charts of the Sciences, from 1890 (110 years earlier than Sunburst):


There are some really interesting details here in this pre-sunburst chart. Text rotates to best fit each segment – and is spaced out to fill wide wedges, tight for narrow wedges. There are great little images out at the edge of the hierarchy, presumably a great way to engage bored students. Delicate colors don’t fight with the text. And particularly interesting, the chart is padded with empty slots so that each circle is complete – not ragged like most sunburst charts.

Word Trees

Word trees are awesome. The examples by Martin Wattenberg and Fernanda Viégas are viscerally and intellectually engaging with wonderful examples from classic texts. Not as many examples in D3.js, and, unfortunately, IBM’s Many Eyes implementation no longer exists:


But there are interesting earlier examples. How about this example from 1541 in a text by Loys Vasse?

It’s a sentence that’s been structurally split into a tree. It’s quite similar to the WordTree, in that sentences can be split apart into trees, whether representing repetition across many sentences (such as WordTree) or logically structuring content (such as Vasse’s example). In fact, this hierarchical structuring of text lasts for hundreds of years in print documents. We can see examples nearly 200 years later in Chambers’ Cyclopaedia in 1728:


Interestingly, the approach is not strictly limited to trees, but can be generalized to draw sentences as directed acyclic graphs, such as this example (again from Vasse):

So what?

Why do we care about these old examples? They aren’t interactive, they don’t dynamically update to different content and they were certainly difficult to create in their old technologies.

They are important because they show other approaches for solving similar problems.

In the early 2000’s I had a particularly vexing project where we needed to show a hierarchy and through the design process both tree maps and sunbursts were rejected by the client, as were other representations such as a graph, a file structure, a radial graph, and so on. All were “too complicated”. This was pre-D3, so lots of prototyping code was being written (and discarded). Instead, we revved a sunburst with padding, so that the chart was always fully circular, not ragged. The client loved it. Two years later, I saw Fike’s Concentric Charts and was impressed that Fike found a similar solution 115 years earlier. If I’d been aware of Fike’s example, we might have reached the solution faster with less code.

Similarly, the old word trees hint at other potential uses for Word Trees. And so on.


If we assume that old techniques are interesting, then what? How do we find these old examples? You can’t find “concentric charts” via Google Search if you don’t know the search term. And since Fike’s concentric charts predate the Internet (and have a very tiny Internet footprint), even searching for “concentric charts” doesn’t return these vintage results. So far, browsing is the best answer that I have: on-line such as archive.org, museum websites, library websites, antique prints, blogs, etc. But also, browsing in the real-world, such as museums, art galleries and libraries.

Let me know if you find any more great charts by Fike: despite the plural “charts” in Fike’s title, the above chart is the only example I’ve found.

Posted in Data Visualization | Leave a comment

Microtext Line Charts: Sample Code

I’ve presented Microtext Line Charts a number of times. There is a lot of interest and a lot of questions. Questions generally come in two flavors.

  1. How do you implement this?
  2. What happens if:
    • there are more data points in the line
    • the lines cross each other more frequently
    • the lines have sharp corners rather than interpolated bends
    • the text is a bit larger (or bit smaller)
    • the text is differentiated using caps; or italics; or weight; or etc.
    • the text has a halo, doesn’t use color, uses different sizes on different lines, etc.
    • the text animates with each successive update
    • you could put data values in the lines at a high point or a low point
    • you could shift the text so that it is less likely to overlap
    • the text is a narrative explanation instead of labels
    • etc.!

Short answer to both questions: Here’s a link to an interactive example on CodePen. Try it out, copy it, make changes, run evaluations. It uses random fonts, colors and data. It has buttons to turn on/off the underlying lines and change text size. If you do use the technique, I appreciate acknowledgement (e.g. refer to this post, or cite this research paper).

For those asking how the code works, essentially D3 is a library that manipulates SVG. SVG has built-in text-on-path functionality. D3 makes lines for line charts. These lines can be used as paths. For SVG text, you can add text to a text path and then associate the text path with the line. From the SVG reference:

In addition to text drawn in a straight line, SVG also includes the ability to place text along the shape of a ‘path’ element. To specify that a block of text is to be rendered along the shape of a ‘path’, include the given text within a ‘textPath’ element which includes an xlink:href attribute with an IRI reference to a ‘path’ element.
— W3 SVG Specification, Text On A Path
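
To make the mechanism concrete without D3, here is a minimal sketch in Python that writes out the same SVG structure by hand: a `path` with an id, and a `text` element whose `textPath` points at that id. It is not the CodePen code; the function name `microtext_line_svg`, the sample points and the styling values are assumptions, and it uses straight segments where D3's line generator would typically interpolate.

```python
from xml.sax.saxutils import escape


def microtext_line_svg(points, label, path_id="line0", width=400, height=200):
    """Return an SVG document with `label` repeated along a polyline."""
    # Build the path data: move to the first point, then draw line segments.
    d = " ".join(
        f"{'M' if i == 0 else 'L'}{x:g},{y:g}" for i, (x, y) in enumerate(points)
    )
    # Repeat the label; text that runs past the end of the path is not drawn.
    text = escape((label + " ") * 20)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'xmlns:xlink="http://www.w3.org/1999/xlink" '
        f'width="{width}" height="{height}">\n'
        f'  <path id="{path_id}" d="{d}" fill="none" stroke="lightgray"/>\n'
        f'  <text font-size="8">\n'
        f'    <textPath xlink:href="#{path_id}">{text}</textPath>\n'
        f'  </text>\n'
        f'</svg>'
    )


# Hypothetical data: four points of one series.
svg = microtext_line_svg([(0, 150), (100, 60), (200, 120), (400, 30)], "Series A")
print(svg)
```

Saving the printed output as an .svg file and opening it in a browser should show the label running along the line; removing the `stroke` on the path leaves only the microtext.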

For those asking all the other questions, click the link to the sample code. Each time you refresh the page, the random data will be different – more points, fewer points, more volatility, less volatility, different colors. And you can modify the code from there.

See the Pen Microtext Line Chart by RBrath (@Rbrath) on CodePen.

For example, white halos around the text *could* be added by changing the stroke outline of the text (bad idea! – the stroke width will eat into the fill of the letterform, reducing text legibility in a representation where legibility is already challenged by overlapping text); or better, the halo could be added by making a second copy of the text with a white fill and a fat white stroke under the other text.
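
The two-copy halo can be sketched as follows. This is an illustrative Python helper that emits SVG markup, not code from the CodePen example; the name `haloed_text_path` and the default color and stroke width are assumptions.

```python
from xml.sax.saxutils import escape


def haloed_text_path(label, path_id, fill="steelblue", halo_width=3):
    """Return SVG markup: a white halo copy drawn first, then the real text."""
    content = escape(label)
    text_path = f'<textPath xlink:href="#{path_id}">{content}</textPath>'
    # SVG paints in document order, so the halo copy must come first.
    halo = (
        f'<text fill="white" stroke="white" stroke-width="{halo_width}" '
        f'stroke-linejoin="round">{text_path}</text>'
    )
    # The top copy has no stroke, so its letterforms keep their full fill.
    top = f'<text fill="{fill}" stroke="none">{text_path}</text>'
    return halo + "\n" + top


print(haloed_text_path("Series A", "line0"))
```

Because the stroke sits on the lower copy only, the visible text keeps its original weight while still being separated from whatever it overlaps.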

Instead of one long string of microtext, individual pieces of microtext could be placed along the line, then nudged left or right (dx) to reduce collision. Similarly, a text label for a line’s high point or low point could be moved to that point by shifting its left-right position along the line.
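
One simple way to do the nudging is a left-to-right sweep that pushes each label just far enough to clear the previous one. The sketch below is an assumption about how this might be done, not part of the sample code: the name `nudge_labels`, the `gap` parameter and the sample numbers are hypothetical, and it only ever nudges to the right. The resulting offsets could then be applied, for example, through the `startOffset` attribute of `textPath`.

```python
def nudge_labels(labels, gap=2.0):
    """Shift labels rightwards along a line so that none overlap.

    `labels` is a list of (desired_start, width) pairs measured along the
    path; returns the adjusted start offsets in the original order.
    """
    order = sorted(range(len(labels)), key=lambda i: labels[i][0])
    starts = [0.0] * len(labels)
    next_free = float("-inf")  # first offset not occupied by a placed label
    for i in order:
        desired, width = labels[i]
        # Keep the desired spot if free, otherwise push right just enough.
        starts[i] = max(desired, next_free)
        next_free = starts[i] + width + gap
    return starts


# Hypothetical labels: (desired start, width) in pixels along the path.
print(nudge_labels([(10, 30), (25, 30), (50, 20), (200, 40)]))
```

A label that is already clear of its neighbors, like the last one here, stays exactly where it was requested.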

And so on.

Posted in Data Visualization, Line Chart, Microtext, Text Visualization | Leave a comment

Successful PhD Defense!

I recently successfully defended my PhD. Yay! It was almost 3 hours, as there were many questions. There have since been many congratulations and questions from others.  The most common question is:

How did you complete a part-time PhD in 5 years?

This is a really good question. I had previously completed a part-time masters degree in the 1990’s, which unfortunately took me 6 years to do. When doing any kind of independent research, it’s easy to fall into a hole where you get side-tracked on something not important, over-work some code more than necessary, design a poor experiment, complete a task without being aware of prior work, and so on. Back when I started the PhD, I specifically made a list of things to avoid/improve so that I wouldn’t fall into the same trap as before.

  1. Meet with your supervisor frequently. It’s easy to have scheduling conflicts, but in the days of Skype, web meetings, Slack, etc., it’s pretty easy to reschedule and do live meetings. My supervisor and I both agreed on meeting at least once a month and we’d reschedule as needed so the meetings didn’t get missed. This is really important to avoid the above pitfalls.
  2. Lots of small tasks instead of really big tasks. Decomposing a big research project into small tasks is a good idea regardless of the circumstances. However, when part-time, this is really important. Small tasks can be chunked into a weekend or two.
  3. Know your limitations. You’re not on campus, you don’t have the same access to resources, you don’t have the same access to big blocks of time. I would have liked to do an evaluation study, but it had more overhead (e.g. experiment design, ethics committee), less access to students, and it would have been a big task. Instead, I did a number of small surveys.
  4. Submit, submit, submit. Submit posters, talks, papers, journals and so on. The submission process means that you have to organize your ideas, perform some focused research, analyse results — all of which are good. Then, you get reviews. Sometimes these are disappointing rejections (I got a -3 on a 1-5 score range on one paper), but there are lots of good nuggets of useful information in each rejection.
  5. Workshops. Instead of really big conferences, workshops and side-conferences are a great venue to get feedback on work in progress. Workshop papers are smaller scale making it easier to do the work and write the paper rather than the big conference. The workshops also provide for a more collaborative environment to get feedback from your direct peers specifically interested in your topic, as opposed to the mega-conference where questions can be somewhat random. If you do a really good job on a workshop paper, you might get invited to submit to a journal too. Overall, I had 12 peer-reviewed publications during my PhD vs. 2 for my masters (which was longer duration).
  6. Solicit cross-disciplinary feedback. Likely whatever you’re working on has applications across domains or at least there are different constituents of stakeholders. Directly approach those different stakeholders and get their input. They have different viewpoints. In my case, I reached out to typographers and cartographers a couple years into my thesis; and both these groups helped identify significant gaps in my work. I might have been able to get away without their feedback since my thesis reviewers were not typographers nor cartographers, but it made for a much stronger, much more defensible thesis because I’d incorporated their feedback.
  7. Background. Too many papers that I review seem to be missing related relevant research. Google Scholar has made searching through a lot of current peer-reviewed research relatively easy. But don’t stop there: there is likely older relevant research that can also be found: many of the world’s largest libraries and museums are online, old websites and old texts can be found on archive.org, and so on.
  8. Blog! I used the blog as a means of forcing me to always write about something related to my research (Ahem, I did not always achieve one post per month). It’s great to get feedback from the Internet at-large and see what resonates across the Internet. I thought my posting about Pokemon would have more reposts than it did. I got more reposts on my discussion regarding 500 years of separation than expected.
  9. Time-outs. There are unplanned events that always occur and need to be accommodated. My wife’s stepfather passed away. My mom sold her house and downsized. I had pneumonia for a couple months. You have to take some time out, but then you need mechanisms to get started again so that you don’t lose momentum. Always having a paper submitted somewhere means you’ll get a response. Having commitments such as supervisor meetings or blog posts to do gets you back on track.

In case it’s not obvious yet, rapid iteration with frequent feedback is at the core of almost all the above tasks. Essentially, it’s about putting in place mechanisms to keep you on track, guided and focused.  It worked fairly well for me so far – now I just need to do the “minor revisions” and keep focused on getting those done.

Posted in Data Visualization | Leave a comment

Building a Better Table of Contents

As I near the completion of my thesis, I find it challenging to navigate through 250 pages of prose and remember where certain arguments or certain pieces of evidence are located; or challenging to get a broad overview of the narrative sequence.

Search is only effective if I can recall the name of a reference or if the search term is suitably unique. Unfortunately technical terms are frequently repeated – for example – preattentive occurs 46 times. Even using keyword in context (KWIC) in the search panel still provides an excessively long list, and perhaps the associated words of interest aren’t within a few words of the search term. Furthermore, search doesn’t work for the broad overview.
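
For readers unfamiliar with keyword in context, the idea is simply to list every hit together with a few words on either side. A minimal sketch in Python, where the function name `kwic`, the `window` size and the sample sentence are all assumptions for illustration (this is not how any particular search panel is implemented):

```python
import re


def kwic(text, keyword, window=4):
    """Return (left context, match, right context) for each keyword hit."""
    words = re.findall(r"\w+", text)
    hits = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            hits.append((left, word, right))
    return hits


# Hypothetical sample text, not taken from the thesis.
sample = "Color is a preattentive attribute. Not every preattentive cue suits text."
for left, match, right in kwic(sample, "preattentive", window=2):
    print(f"{left:>12} | {match} | {right}")
```

The limitation described above is visible in the `window` parameter: anything further from the keyword than the window never shows up in the list.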

A Table of Contents should help, but not as much as I had hoped. A few problems:

  • On a large document, the table of contents is spread out across many pages, requiring paging and requiring the use of short term memory to hold onto relevant headings.
  • Headings tend to be terse, so it may not be obvious what content is associated with a heading.
  • Furthermore, headings reduce information down to a few words – the narrative stringing the pieces of the argument together is lost.
  • While there are different levels of headings, overall, the representation of a table of contents is fairly uniform and undifferentiated, thereby not providing landmarks to facilitate navigation back and forth across the pages.

Table of Contents: an ordered list of headings without connective narrative and minimal landmarks spread across many pages. 

It may be suggested that the gist of the document should instead be conveyed by the abstract. An abstract is more terse and provides the narrative sequence of the document – however – the abstract does not facilitate navigation of the document as it does not provide any links or page references. Furthermore, the length of abstracts tends to be quite limited, meaning that only the broadest overview can be described in the abstract.

Instead, I decided to use a very old technique. Chambers’ Cyclopaedia from 1728 creates a View of Knowledge – a unique narrative description of contents of his encyclopedia, all neatly organized in a hierarchical word-tree. Here’s a subset – you can read it left to right as a continuous narrative:


So, instead of a Table of Contents, perhaps Chambers’ approach is more useful. Here is a narrative Description of Contents for my thesis:


In one page, this description provides a brief readable narrative, outlining all the major arguments of the thesis, using FULL CAPS (and whitespace) to indicate major parts of the thesis. SmallCaps indicate each major heading. Superscripts (in blue) are links to page numbers.

[Aside: Note that word trees tend to use word size to indicate data. Unfortunately, word size uses a lot of space; full caps, small caps, superscripts and color can all be used to indicate data while offering higher data density and improved readability.]

The approach can be applied recursively: each part of the thesis has its own description of contents. For example, here’s the description of Part II:

In this example, the description contains many thumbnail images providing cues to the contents associated with each section (e.g. what’s the difference between text semantics and typographic semantics? – hint: comic book text is the image associated with typographic semantics). The large image at the bottom is a diagrammatic representation summarizing the design space derived from the preceding sections. Other cues (e.g. lines) can be added too, to connect related items that are further apart. Images also act as landmarks, and commonality between different types of images helps visually separate out the various sections as well.

Does it work? I created the Descriptions to help me clarify how the pieces fit together. A couple of reviewers have commented on how the descriptions have helped them understand the document – acting as a preview to each part and as a quick reference to navigate around the document. Obviously, more effort is required to create the description than a table of contents or abstract – but I’ve found it much more useful than either the abstract or table of contents, both to navigate the document and to help move portions of the thesis into better topics and flows.


Posted in Data Visualization, Design Space, Search | Leave a comment

Color in Text Visualization

Highlighting text via bold or italic is less noticeable than color. There is an incredibly long history of the use of color in a continuum from medieval times to present. From a visualization perspective, color is scoped to the level of words, or Proper Nouns, such as keywords in context (KWIC). Color ranks highly in a study by Strobelt et al. Color is already among the most highly used attributes for encoding data into text (as of Jan 2016: 71 of 249 text visualizations at textvis.lnu.se used color; only size was slightly higher at 76, and the next closest was orientation, down at 10). And color has a strong visceral appeal – note all the tag clouds that vary word color randomly.

So, what more can be said about color in text?

The scope of text in visualization doesn’t need to be constrained to words: color can be applied to entire phrases, or down to individual glyphs. Certainly medieval authors were fond of illuminating initial letters of paragraphs and rubrication for the leading letter in sentences:


Colorful leading letters: Illuminated initials leading paragraphs, rubrication on lead initials of sentences. (bonus, note the sparkline at top right)

Note also that color does not need to be constrained to a single color. In the above manuscript, the illuminated initials are blue set on an ornate red background. The idea of varying both foreground and background color of text has certainly occurred to some visualization designers as well:


Spatiotemporal tags vary both foreground and background color.

Mixing different foreground and background colors can result in illegible type: the words architecture and alemania are hard to read in the above example, as are some of the foreground/background combinations in this stem&leaf timetable:



Red text on an orange bar isn’t particularly readable.
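
One common yardstick for this kind of legibility problem, not something used in the examples above, is the contrast ratio defined in the WCAG accessibility guidelines. The sketch below computes it in Python; the function names and the specific RGB values standing in for "red" and "orange" are assumptions, since the exact colors in the chart are not known.

```python
def relative_luminance(rgb):
    """WCAG 2.x relative luminance of an sRGB color given as 0-255 ints."""
    def linear(channel):
        c = channel / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg):
    """Contrast ratio from 1 (identical colors) to 21 (black on white)."""
    lum_fg, lum_bg = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(lum_fg, lum_bg), min(lum_fg, lum_bg)
    return (lighter + 0.05) / (darker + 0.05)


# Assumed stand-ins for the chart's colors.
red, orange = (255, 0, 0), (255, 165, 0)
print(round(contrast_ratio(red, orange), 2))
```

For these stand-in colors the ratio comes out at roughly 2:1, well under the 4.5:1 that WCAG recommends for normal-size text, which is consistent with the red-on-orange labels being hard to read.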

Typographers have different ways of combining multiple colors into letters. Chromatic fonts contain complementary shapes, each printed in a separate color, each layer carefully aligned on a press. Consider this historic poster:

Chromatic fonts on historic poster: colored outlines and drop shadows. Image courtesy of  Gerry Leonidas, Department of Typography & Graphic Communication, University of Reading. 

Chromatic fonts emerged in the mid-1800’s, along with large sizes, many weights, and a wide variety of serifs and decorations. For the first 40 years of computer screens, 72-96 DPI probably wasn’t sufficient for chromatic fonts and likely would have resulted in visual artifacts (i.e. jagged bits of high contrast colors interfering with legibility). Support for chromatic fonts on computers is very recent and currently not consistent across browsers.

But computer-based chromatic fonts open up interesting design possibilities. Wood letters with ink-based printing are limited: the approach scales to only a few discrete ink colors. Instead, with software-based chromatic fonts, many hues and gradients are feasible and can be applied to portions of letters. The implication for visualization is that color can be applied to any subset of letters or parts of glyphs; potentially used to indicate quantities; still maintain legibility; and have a much more dynamic aesthetic than plain text. Some of the examples of computer-based chromatic fonts hint at very interesting future visualization possibilities:



Sample chromatic fonts via colorfonts.wtf


It would be very interesting to see what new ideas some visualization research or graphic design programs might come up with!
(Title image conveniently cropped from Specimens of Chromatic Wood Type… 1874)

Posted in Data Visualization, Font Visualization, Text Visualization | 1 Comment