I’m really appreciate some of the examples I’m seeing in the wild. Here’s some fantastic examples from Georgios Karamanis. I like the shifted + bold text indicating voting topics at the UN, and word-pairs describing makeup. And Georgios provides the code, so if you’re into R and want to see how to implement some of the text visualization techniques from the book, see his github.
Also exciting to see an endorsement from Michael Friendly, and many others on Twitter. Thanks for the posts.
I’ve talked about the book internally at our company (Uncharted) and have been pleasantly surprised to see some of the ideas weaving their way into some of our visual analytics, such as a button indicating the color legend within the button glyph; or a technique for interactively labelling neighbourhoods while zooming around a massive network.
While I can’t really do a book tour during Covid, I did a talk at Naomi Robbins’ Data Visualization NYC Meetup and interview with Lee Feinberg’s Analytic Stories. Looking back at the videos, I see I may have talked over a couple people – sorry! Happy to follow up.
This technique of fitting type onto lines or into shapes has been going on for centuries: I like Calligrammes from the early 20th century, medieval monks speaking in scrolls (on the cover of the book!), text set into the shape of an axe in a book from 1530, or awesome psychedelic posters, such as Wes Wilson‘s posters from the 1960’s. For any visualization researchers interested in algorithms for fitting text into complex shapes, see Ron Maharik’s Digital Micrography research and PhD thesis.
News headlines about the GameStop price swings this week reminded me of some old SVG visualizations of stock bubbles and crashes that I’ve done. SVG was around before D3. I generated SVG visualizations in the mid-2000’s well before D3.js. It was painful in comparison, but at the time it was a lot of fun.
The objective was experimentation: what could be done with a scalable vector graphics library? As such, it wasn’t constrained to screen resolution and screen dimensions (which at that time was typically 1600 x 1200) – rather you could do much more detail, import into Illustrator and print things at very high resolution with lots of detail, lots of transparency, and tiny text.
Here’s an example: Microsoft’s daily stock price from late 1992 to early 2004.
That’s 12 years of daily data with about 250 trading days per year resulting in a 3000 px wide visualization — which now draws just fine on my 4k display. With SVG, it is easy to layer in many different analyses creating different marks, lines and areas. Here’s a closeup of Lucent’s daily stock price during the Internet bubble:
Many different graphical marks indicate data and derived indicators:
The blue line indicates the daily price.
The small green/yellow/orange/red bars behind the blue price line indicate the monthly price move from the beginning to the end of the month, colored by whether the price increases (green) or decreases (red). It aligns neatly with the monthly grid, filling in stripes within the grid.
The fat green/yellow/orange line behind that indicates yearly change in price. (There’s also light grey boxes behind that align with the thicker yearly grid.)
The green/purple circles indicating successive high/low points. Values for the highs and lows are indicated in text as well as the date. All successive low points are connected by a straight purple lines, all successive high points are connected by green straight lines. The zone between the green/purple lines form an envelope around the price range.
The many orange lines are moving averages each with a different time period, forming a guilloché. When moving averages start to cross it is indicator of a change in trend, e.g. from up-trend to down-trend (as many stock traders know). In 1997, the averages start to converge but then the trend continues. However, in 2000, the moving averages successively start to cross, and by June most of the moving averages have crossed, before the much steeper crash.
The large arcs and corresponding fills indicate major trends. Up-trends start at trend low to the ending trend high (e.g. 3/19/1997 at $10.75 to 12/8/1999 at $71.31: up almost 7x!) Down-trends start at trend high and end at trend low (e.g. June 5, 2000 at $56.23 to a point outside the closeup: 9/27/2002 at $0.77: down to almost 1% from the start, ouch!) This is what a bubble looks like and how it plays out.
The filled shadows in pink and blue indicate the 52-week low price and 52-week high price. On the first two-thirds of this chart, the pink shadow is predominant as the stock keeps going up with a few “lakes” of blue filled in the occasional dips. In the last third of the chart, it’s almost entirely under the blue shadow as the stock tanks.
Why is it important to have all these different markers, bars, lines, arcs, text, and so on?
There are many techniques to analyze timeseries data – moving averages, standard deviations, envelopes, and so on. In finance, one can explicitly study timeseries analysis. And more broadly, these analyses apply beyond stock prices to electrical grid loads, network utilization, software performance, automobile diagnostics, and so on.
SVG is scalable, so I applied it to longer and longer timeseries. Here’s the Dow Jones Industrial Average(R), when I re-ran the code with in 2010 to compare the 2008 financial crisis to the 1929 crash:
So, 114 years of daily data results in a plot more than 28000 pixels wide. That’s more than my 4K screen can display in it’s entirety at full resolution. But paper can. The visualization uses a log scale: you can see the magnitude of the 2008 crash at the top right is far smaller compared to the crash of 1929. I.e. from the index peak in 2007 around 14000 to a low of 6500 in 2009, the index lost a bit more than half of it’s value; whereas in 1929 the index peaked at 381 in September 1929 then dropped to 44 in 1932, down 88% of its original value! There was a lot of pain in 2007-8, but the massive intervention in the markets by the Federal Reserve and central banks helped stave off a much bigger crash and avoid a much bigger, longer recession.
There’s a few other things going on with this experimental visualization. Here’s a closeup of the 1950s:
In addition to the the many different shadings and markers, the grid lines also participate in indicating data range: rather than extend across the display, the grid lines are localized to the line +/- a range. Axis labels also follow data-driven rules. The price labels in this snapshot follow the grid (note the stock market high of 381 in 1929 wasn’t surpassed until 1954, some 25 years later). The date labels are completely data driven – indicating dates on the top side of the line when the price hits new highs, and on the bottom side of the line when the price hits new lows.
Why is it important to have such long timeseries in so much detail?
Our firm has clients with financial timeseries that are more than 200 years long. Seeing all the detail is important. The current GameStop bubble is not unique, there have been many, many more, going back to railroad mania in the 1840’s or the South Sea bubble in 1720’s. Different bubbles will play out in different ways: having detail allows for comparison to prior bubbles for insight into the current bubble. Recoveries from recessions will be different for different sectors. Some market experts will use this information to inform their portfolio strategies in response to GameStop, or in response to Covid, or in response to an election cycle.
And why the variations in grids, arcs and areas?
D3 is great and much can be done out-of-box. But, when you just use the standard examples applied to different data, you might not be indicating the things that matter to the end user (i.e. the purpose is insight, not pictures). As such, it’s important to use the toolset to experiment with the underlying graphics to highlight the core insights.
I’ve done much more experimentation with SVG before D3, but those will be for some future blog posts.
LondonLives.org is a collection of 240,000 historic manuscripts from eighteenth century London. These have been collected, organized and analyzed, such as Sharon Howard’s summary of 2894 coroner inquests from London 1760-1799 with each case including subjects, verdicts and causes of death as well as links back to original handwritten manuscripts. The dataset is fascinating: How might this data be visualized?
Each each row in the summary dataset indicates subjects (e.g. Susanna Thompson, Sarah Cox), verbs (e.g. crushed, falling) and objects (chimney). These can extracted using natural language processing (although, note that Howard cautions that the causes of death are not fully accurate, nor is my natural language processing). These extracted words can be assembled into a hierarchy, for example verb, object, subject: – Crushed – Chimney – Susanna Thompson, Sarah Cox
There are many, many different techniques for visualizing hierarchies (e.g. treevis.net): – Treemaps and sunburst focus on accurate representation of values using area, meaning that fitting legible text can be difficult. – Graph representations, such as nodes and links, emphasize the structure, again potentially making difficult to fit all legible labels. – Org-charts are designed to fit text, but the left-right/up-down layout can result in layouts that become very wide or very deep, making it difficult to fit an org-chart with 3000 items to a rectangular screen and display all the text at a legible size. – More generally, most quantitative visualizations aren’t designed for representing large amounts of text.
Instead, typographers and publishers have techniques for depicting textual hierarchies not shown in any visualization compilation. Indexes and dictionaries are designed to show a large number of words in hierarchies in a very dense format. Dictionaries use wide variety of formatting, for example, using very heavy-weight text to make the defined word stand-out. Alphabetic ordering facilitates search. This supports quick non-linear skimming to the word of interest; then different formats are used within the definition (e.g. italics, small caps, etc.,) facilitating jumping to the type of data of interest. Size of entries is relevant: words that have many meanings have longer entries. In effect, indexes and dictionaries are visualizations, although constrained to printed pages bound in a book. With larger screens, indexes can be set out to be entirely visible on single screen or two.
Returning to the coroner inquests data, consider an index-like visualization. Organizing the hierarchy by verb, then object, then subject results in a list of subjects under an object, which in turn are listed under a verb. Here’s all the ways that people died under the verb CRUSHED:
Each object is listed in a separate coloured section. Each section starts with the object in bold (e.g. chimney), and has a list of named subjects in a narrow brown font (e.g. Christiana Jorden and 3 others). Individual names are links to the original manuscripts, e.g. Sarah Cox:
Given some unusual objects, a sample cause of death is shown in italics, (e.g. crushed by falling chimney). More frequent causes result in taller sections, for example being crushed by a chimney was more common than being crushed by a scaffold; while being crushed by a house or a theatre crowd was more common than chimneys.
Also, I was curious about gender: were women more prone to certain causes of death than men (assuming that women’s deaths were investigated equally to men’s deaths)? After each object a pair of numbers indicates the gender ratio, e.g. 1/3 indicates 1 man and 3 women were crushed by chimneys. To facilitate skimming for this ratio, the background colour of each section is shaded light blue to light red. This colour scale isn’t meaningful when viewing causes with only a couple deaths but provides some macro-scale patterns when viewing the larger dataset.
Here’s the 2894 inquests presented in an index layout, as a simple interactive HTML/CSS visualization. On a 4K screen all text is legible and readable:
Like a dictionary, the biggest blocks have the most entries: the biggest blocks are DROWNED (in column 3 in the image above) and HANGED (column 6). Both of these tend to be light blue (more men than women). The largest light red block (more women than men) is under BURNT (in the first column).
In the full visualization, the colour of subjects’ names correspond to the verdict: brown is for deaths deemed accidents, red for homicides, green for suicides, etc. These can be interactively filtered, for example, the most popular method of suicide was hanging (biggest box), generally more male suicides are recorded than female (generally more blue than red), many methods were used by both genders (e.g. drowning, poisoning, cutting), although gun shots are almost entirely male.
This example is from my recent book Visualizing with Text, including a comparison to a treemap. It’s a good illustration of a visualization that provides high-level perceptual patterns (large coloured blocks) and low-level details (words and phrases) within the same visualization – which Edward Tufte refers to as micro/macro readings. The book review of Visualizing with Text by Alec Barrett succinctly summarizes the benefit as being able to “zoom with one’s attention.”
Thanks to LondonLives.org and Sharon Howard for collecting, organizing and summarizing these historic documents.
I’ve done a couple of presentations with content from my book Visualizing with Text to grad classes in the last month. In both cases, a couple of people expressed concern regarding the complexity of some visualizations, being particularly data dense. This is an argument that I have heard on occasion throughout my career as I am often involved in the design and development of data-dense visualization for domain experts. Data-dense displays are not uncommon in domain applications – consider a few examples:
These visualizations are packed with data. The map has many layers: roads, railways, rivers, canals, drains, buildings, labels, textures, etc. This periodic table has icons, symbols, numbers, names, pathways, ions, solutes, charges, etc. The trading desk shows a wide variety of information associated with securities: timeseries, events, color-coded news, internal data. The electric grid shows hundreds of entities: transformers, capacitors, hydro dams, wind farms, interchanges, power corridors of different voltages, status, thresholds, etc.
And, if you look at the Uncharted research website, you’ll see many more data-dense visualizations that our company has worked on.
Data density slows down understanding
The essence of the data-density criticism is that, with a greater number of data points and/or multi-variate data, the viewer can become confused as to where to look. There may be many different visual patterns competing for attention. Should the viewer be focusing on local patterns in a subset of the display, or macro patterns across the entire display? Should the viewer be attending to the color, or the size, or the labels? If each represents different data then the semantics will be different, based on the visual channel that the viewer is attending to. Worse, if the viewer starts to move back and forth between many different channels, one may forget the encodings and become more confused.
Some people have become conditioned to think of data visualization as they might see in the popular press, visualizations made for communication, or visualizations that utilize straightforward visualization types that you might get with a library such as D3 or Vega:
Viewers may be conditioned to think of these as encompassing all (or most) visualization types, with many articles organizing visualization into a limited palette of visualization layouts: periodic tables, chart choosers, lists, galleries, zoos, and more. These visualizations typically don’t have many data points (20-200), and typically show only a couple of variables. Data is homogeneous – you’re not looking at multiple datasets with different types of entities jumbled together. Answers are easier, because there really aren’t many different dimensions or layers to be considered.
But not all problems are simple.
Complex problems may need complex data
The images in Figure 1 show that multi-metric, data-dense visual representations exist in practice – in both historic visualizations and modern interactive visualizations. These complex visualizations bring together multiple datasets in layers, in many windows, in large displays – i.e. into data-dense representations.
This extra data is required, because there may be multiple answers possible. If the price of a stock goes down, it may be due to the overall market going down, to a competitor dropping their prices, to poor sales data in the company’s earnings, to a negative news story about the company, or other factors. The extra visual context facilitates reasoning across the many plausible causes to assess the situation. In the stock example, multiple causes can be true at once: an expert needs to see all and determine which are most relevant to the current situation.
In addition to quantitative data, there may be other facts and evidence: qualitative data, news, videos, and so on. There may be multiple perspectives to consider. There may be different time horizons to consider. (For example, the stock market collapse in 2008 was triggered by the collapse of Lehman Brothers on September 15, 2008; but months before Bear Stearns (a competitor) was acquired when it ran into funding issues, and even earlier some mortgage origination companies went bankrupt.)
More generally, wicked problems are not easily solvable. The problem can be framed in more than one way, different stakeholders have different timelines, constraints and resources are subject to change, and there is no singular definitive answer.
As a colleague tells me: Complex problems have easy to explain wrong answers.
The Value of Data Density
Communication: There are many different reasons for creating visualizations. The low density visualizations in Figure 2 may be part of narrative visualizations, for explaining the results of an analysis. Data has been distilled to a few key facts.
Dashboard: Or, simple low-density visualizations may be part of a overview dashboard, with many small visualizations, each of which provides an overview of a different process, and can be typically clicked on for more detailed analysis. These overviews only need to provide sense of status: if there are any issues then the viewer has workflows to access more detail.
Beyond the communication and dashboard uses, there are many other uses for visualizations, where density may be valuable:
Organization: For example, the map and the periodic table in Figure 1 organize large amounts of data. These many layers of data allow cross referencing between many different types of information. On the map, the user may need to know the location of buildings (objectives), roads (connections), canals and railroads (obstructions) in order to plan a route.
Monitoring: The market data terminal and the electric grid operations wall in Figure 1 provide real-time monitoring across many data streams. Many heterogeneous datasets come together into a single display. Time is of the essence in real-time operations. Detailed data can’t be hidden a few clicks away: all key information must be designed and organized for quick scanning and immediate access.
Analysis: Knowledge maps and network visualizations are often about analysis of complex data. SciMaps.org has 100 knowledge maps, each collecting and visually representing many facets of a particular corpus; such as Figure 3 left, an interactive visualization of the edit history of Wikipedia articles by Wattenberg & Viégas. Visual Complexity has 1000 network visualizations, such as Figure 3 right, a dynamic visualization of social networks indicating people, links, activity, postings, sequence and message age by Offenhuber & Donath. Both of these are visualizations about text over time: edits, exchanges, persons. In both cases there are many dimensions to understand and comprehend.
Exploration: Data-dense visualizations aren’t limited to domain experts attempting to understand complex datasets for their jobs. In Dear Data, Georgia Lupi and Stephanie Posavec create some awesome multi-variate data-dense visualizations of mundane day-to-day data. Why? Exploratory data analysis needs to consider lots of different data – the exploration is required to consider, assess, investigate, compare, understand and comprehend many different data elements. To do exploratory data analysis with only a well-organized quantitative dataset may miss much relevant data (e.g. see Data Feminism). Lupi and Posavec show by example that many different attributes that can be extracted from everyday life and then made explicitly visible for an initial exploratory view.
Data Density and Visualizing with Text
The objective of the book is to define the design space of Visualizing with Text forall kinds of visualizations, simple or complex. Section 2 in the book deals with simple labels, such as scatterplots, bar charts and line charts: text can be used to make simple visualizations more effective. Section 3 in the book goes further, using multiple visual attributes to convey more data (figure 5).
The Future of Data Density
Data density is likely to become a bigger issue in the next decade. Greater awareness of bias in data makes it more important to represent more datasets in a visualization. Analysis of richer data types – such as text, video and imagery – will likely necessitate new ways to layer in additional visual representations. Bigger data will have even more variables requiring more ways to show more data, or risk summarizing too much useful detail out of data. Specific visualization applications, such as cybersecurity, fake news, and phishing, need to deal with ever more complex attacks which implies more nuanced analyses based on more complex data.
Data density will become increasingly important to future visualization and visualization research.
I just received my author’s copy of Visualizing with Text this morning! It’s awesome to finally hold the book after 2 years of writing (and the start of this blog 7 years ago!):
Flipping through triggers some memories, like finding this user study on charts from 1961 comparing labels vs. legends! (Can you BELIV that there were user studies 59 years ago, before VisWeek? see Sarbaugh et al: Comprehension of Graphs):
Or a larger effort specifically for the book is captured on this page. Since the book defines a design space for visualizing with text, I felt compelled to demonstrate the flexibility of the design space to create many different visualizations of one document: here’s 14 different visualizations of Alice’s Adventures in Wonderland:
And, here’s a page on very different uses of visualizations (beyond using visualization for preattentive perception of patterns). On the left is a system diagram of a power grid (an inventory use that organizes all the assets in the grid, courtesy of ISO New England). Top right is an infographic by Nigel Holmes of a graph, where the edges are literal text implicating individuals (a communication use that distills days of testimony down to select statements, courtesy of Nigel):
“Preview” is now working on the CRC Press site, and “Look Inside” is now working on Amazon.
Visualizing with Text releases any day now: I hope to have my copy before the end of VisWeek. I’ve finally posted all the figures that I authored on-line with a CC-BY-SA 4.0 license. There’s 158 high-resolution images and diagrams from the book in the PDF file. These may be a nice complement to the eBook or physical book as some of the text may be too small to be readable in some of the larger visualizations. My figures are all released with a CC-BY-SA license, so they can be reused, for example, for teaching or mixed up into a collage or whatever.
There’s another 99 figures that are not mine – I’ve included links to online versions of these images where available on the last page.
Sometimes people ask me which of the visualizations I like the best. The answer varies over time, although I am currently biased towards the text-dense multi-variate visualizations designed for a large screen, such as these ones (Figures 6.19, 8.10, 9.8, 10.11, 11.3, 12.11) – see the PDF for high-res versions:
Why? Viscerally, I like the rich texture of shapes, colors, and structure where multiple patterns appears – visualization should be supportive of representing complexity and affording multiple interpretations. In my day-to-day work, I often design visualizations for financial market professionals: they don’t necessarily make money if they have the same ideas and same thesis as everyone else. Data-dense visualizations that prompt multiple hypotheses can be a good thing. (see also Barbara Tversky’s keynote at Visualization Psychology earlier today!).
I also think these dense visualizations push the boundaries of the design space of visualization and of text-visualization. Perceptually, multi-variate data can be a challenge. Data-dense visualizations can be a challenge. The linearity of text (i.e. you have to read words in some order) vs. the volume of information is a challenge: what happens to the global pattern? what happens if “overview first” doesn’t necessarily provide a macro-pattern?
Over on EagerEyes, Robert Kosara recently asked, what happened to ISOTYPE? It’s a good question, with a multi-faceted answer. Here’s a few facets, mostly focused on inherent limitations of the Isotype design approach:
1. What’s an Isotype line chart?
Isotype examples are typically bar charts and maps, as they show counts of things – i.e. what we call unit visualizations. Extending it to other types of visualizations is non-trivial. Certainly scatterplots shouldn’t be too hard (e.g. fruit, animals), but what about line charts. Certainly timeseries data is important to plot for many analyses but what’s the Isotype answer for line charts? Typically Isotype reduces the timeseries down to a few time periods and draws them as their typical stacked bar. As such, Isotype is an approach to creating charts, but Isotype is not a comprehensive system for all types of visualizations.
Some have tried, for example, see the Agricultural Outlook Charts from the USDA in the 1950’s. Some of the bar charts in these publications are heavily influenced by Isotype, such as the coins in the Figure 1 left. However the line chart in Figure 1 right struggles with icons, instead the icons are limited to identifying the line, and indicating the trend with the horse representationally and quantitatively heading down-hill.
Figure 2 shows another 1950’s publication heavily influenced by Isotype: Midcentury White House Conference on Children and Youth, A Chart Book. On the left is a bar chart that could almost have been lifted straight out of Isotype, wonderfully clear. On the right is a pie-chart infused with Isotype icons, where only by luck the thin wedges of the pie fit it smaller icons assigned to them (and strictly speaking, the pie segment with “both parents” should have repetition of that icon through out the area, but that wouldn’t quite work either.
2. What’s the icon for GDP or CPI?
Icons for concrete real-world objects can be easy to design, as the real-world object can be the basis for the icon, e.g. people, fruit, animals, tractors, and so on. It gets trickier when some of those categories are visually similar: Isotype never created separate icons for wheat, barley and rye, for example. And Tufte’s log animal chart has three rather similar looking small furry mammals, which can be only be definitively decoded by referring to the labelled scatterplot on the previous page (exercise for the reader to now go find the previous page:-)
After that, icons get difficult to design for abstract objects. What would the icons be for financial data asset classes, such as: stock, bond, credit default swap, collateralized debt obligation, repo, option, future and a forward? Even concrete real-world entities can be misinterpreted, such as the famous dogcow.
Making simple, expressive icons requires design effort. Gerd Arntz‘s wonderfully expressive icons helped drive the success of Isotype, but are beyond the design capabilities of the average non-designer. Gerd could create an icon for the abstract concept of unemployment with a human icon, looking down, at rest, hands in pocket: it’s brilliant, but not easy to design especially with such clean, clear graphical shapes that can be easily printed.
3. And what about the axes (and the values)?
Perhaps most audacious move of Isotype is the removal of the numeric axes. Isotype charts are beautiful with their clean depiction of icon stacks and graphical cues. Removing the numeric axes is brilliant because you can easily and visually compare ratios of different stacks, e.g. one stack of icons is twice as long as another stack of icons.
But what if you want to know the values?
It’s a common task to want to know what number a stack of icons represents. Unfortunately, Isotype makes it hard. You have to first count the number of icons. Then you have to find the legend, where it tells you how many items that one icon represents. So, for example if I look at Eheschliessungen in Deutschland (Die Bunte Welt, 1928, page 42), I see that in 1919-1922 there are 8 marriage icons and each icon represents 400,000 marriages, so 8 x 400,000 = 3.2 million marriages. That’s math. That’s cognitive effort.
Furthermore, when the design system uses a small number of icons, it’s not very precise. Isotype tends to use full icons or nice fractions such as 1/4 or 1/2 of an icon. The prior stack of 8 icons could easily represent 3.1 million or 3.3 million.
Often simple designs are the result of hard work. Simplicity takes effort. In The transformer: principles of making Isotype charts (Hyphen 2009), Marie Neurath’s first hand account describes the design task of transforming data into an Isotype representation (what we might now refer to as encoding). Marie explains a myriad of design decisions made in different charts to get the desired reading of the result. For example, coffins are replaced with tombstones to address the issue of relative size of adjacent icons and potential misinterpretation. Or, doubling with width of an adjacent bar so that relative portions can be perceived. And so on. These are non-obvious design solutions, arrived through a design process to achieve a good effect that may seem obvious in retrospect. (Unfortunately, image copyright status is uncertain).
Has Isotype really disappeared?
The prior four points are focused on Isotype’s limitations that make it hard for Isotype to extend more generally across data visualization. I don’t even address points such as modernism (which both Robert Kosara have both previously talked about) or technical changes (by the late 1950’s phototypesetting became the norm, but the technology tended to soften edges and fine detail, so crisp icons may have been difficult to reproduce- I discuss some of these other factors, which also limit the use to text, in my forthcoming book Visualizing with Text).
That said, I believe that unit visualizations have inherited the legacy of Isotype. And there are some fantastic unit visualizations:
Shapes: Less expressive than photos, simple shapes are a good choice for unit visualizations with hundreds of thousands of items, such as SandDance.
Labels: I’m personally interested in the use of labels in unit visualizations, such as this example of the passengers on the Titanic. I’m not the first, there are more compelling examples such as Maya Lin’s Vietnam Memorial, which in turn was based on earlier lists of casualties.
Physical units: Physical visualizations are well suited to using units. These include specifically designed physical unit visualizations using concrete scales, as well as examples in the real world where the units can be perceived individually or as part of a whole, such as the fields of WWI crosses.
Over the summer, eight of the examples from Visualizing with Text have been re-implemented as Observable notebooks! These can be interacted with, code is visible, and code can be edited live! All the demos can be accessed from the supplementary website.
They are all text-intensive visualizations with some interesting datasets, such as countries in the world where more people are dying than being born, countries with big jumps in Covid unemployment, and coroners’ reports indicating many ways to die in Georgian London.
For the rest of this post, I’ll focus on one demo: word skimming. This demo processes paragraphs of text, ranks words based on frequencies compared to a large corpus (Wikipedia), then weights the words so that least common words are formatted to stand-out. This formatting facilitates perception of uncommon words, which is a taught strategy for skimming text:
While skim formatting may seem strange in prose on a computer screen, the technique has existed for centuries, in instruction manuals, advertisements, comic books, and even software code editors, as shown in some of the examples in the book.
In this particular demo, the viewer can easily cut and paste a completely different text to format for skimming. They can also select from a few different formats, such as adjusting font weight, font width, or x-height. Here’s the same demo using the opening paragraphs from The Decline and Fall of the Roman Empire and formatted using both font weight and x-height to indicate uncommon words, such as valor, emperors, Trajan and Hadrian:
Since this example is implemented in Observable with all code editable, it’s straight-forward to modify the code and try other visual attributes that aren’t used in the demo code. For example, here’s the opening paragraph from Moby Dick, with uncommon words highlighted in red:
To highlight the words in red, a single line of code can be added (without worrying about adapting the drop-down menu, or configuring d3 scales, etc.) In this case, the following was added to the update cell:
// words are in an array elements [0-n] and d.rank is the rank of each word:
elements[i].style.setProperty("color", (d.rank > 5000) ? "red":"black")
Essentially, if the rank value is over 5000, color the word red, else color the word black.
Instead of a toggle red/black, we could easily add a d3 scale and use a color ramp, so that colors range from black, through blue and purple to red. (Note, no SVG is used, this is all HTML – D3 is only used in this code to facilitate scaling numeric values into ranges that suit the variable font used). In the snapshot below, a color scale is used in addition to font weight and x-height in this snapshot of text from Frankenstein:
While feasible, this is not necessarily recommended. The many changes of hue, weight and x-height creates many distractors, which reduces the readability of the sequential text. As a design effect, it may be an objective to create disruptive formats, such as the wonderful encoding by Ben Fry of Frankenfont.
On the otherhand, if the goal is skimming, then it is desirable to easily read the immediate surrounding context to aid comprehension. Changes of many formats reduces the typographic consistency requiring greater attention to decipher the surrounding text. Typographers discuss this as readability, which is described more in my book (briefly) and in typography books in more detail. One must take care when manipulating typographic attributes to not create Frankenstein paragraphs unless it suits the particular task or objective.
My book Visualizing with Text is nearing publication (early November!). One goal of the book is to appeal to both researchers (with structured text with logical arguments), and designers (with many examples and pictures). I really like books with lots of images. Like the quote from Alice’s Adventures in Wonderland, in the title, I prefer well illustrated books. And I don’t like reading about visualizations where you only get little teeny snapshots: after all the primary subject is visualization!
Text is good to explain, structure and provide context, but pictures are important to a designer: to see how the conceptual description is actualized; to see all the little design decisions that contribute to the whole; to see the anomalies and things that could be improved; to provoke design inspiration; and so on. Arguably, Jacques Bertin was successful with Semiology of Graphics 50 years ago because he presented a theory supported with many examples and illustrations. My ideal goal is to have a 50/50 split between text and images. But, measuring text to images is a bit tricky:
It turns out that the publisher and I count images differently. We can easily agree on page count (274), but my publisher currently counts 146 illustrations and I count 250 images (+/- a few to be sorted out). What? How can we differ so much?
I think the publisher is counting image captions. On the otherhand, I’m counting the individual images that might be referred to in a single caption, for example here’s figure 1.3. It has a single caption but two different images: left is a book from 1497, right is a different book from 1589. I count this as two images (the images come from two completely different sources).
Then, there are parts of the book where I discuss using visualization techniques inline with text. Since, I’m advocating inline visualization, these are simply text richly formatted within the flow of the prose. In this case, there is no image caption, so they don’t count at all, whereas I count it as one image (I had to write the code to crunch the data to format the text).
But there’s more nuances. What about a couple words formatted within written text, simply to facilitate cross-referencing to an image? I don’t count those. And what about an image that’s a composite of many teeny snapshots? I count those as one image. And what about a table that’s got some visualization formatting associated with each cell based on data attributes? That might count as a table and not an illustration. Again, I had to write the code to crunch the data to format the text, so again I count that as an image.
So, overall, I get to around 250 images, out of 274 pages. Given that the Table of Contents, Preface and Index don’t have images, that gets close to the 50/50 split.
Types of Pictures
Returning to the topic of pictures, a few other stats are useful. The pictures split approximately 50/50 between pictures that are my own vs. pictures of other visualizations. I think it’s good to provide a grounding based on real-world precedence. For the pictures from other sources, I’ve tried to include URL’s to them, so readers and educators can easily find the online versions where available.
With regards to my pictures, some are new unique visualizations, some are sequences of pictures, such as showing the same visualization using different attributes, or zoomed in or so on. There’s about 80 different examples of visualizations with text that I created in the book. Some have been published before, on this blog or in research papers.
But I wanted to create some new content (why buy a book with old pictures). So there’s new, unpublished visualizations that will be making their first appearance in the book, including scatterplots of cars, an adjacency matrix of dialogue, a syntax diagram, a massive textual stem & leaf diagram, assorted tables with visualization characteristics, a data comic, some expressive lengthening examples, and a topic model visualization. Here’s eight examples taken from the draft:
Larger versions of these pictures will be available when finalized and placed on the publisher’s website with a CC license shortly before the book is released.
Yesterday the Data Visualization Society hosted a Fireside Chat regarding typography and visualization, which was fun to participate in. There were too many questions to answer all. One question with many variants was: “Which font should I use in my visualization?” The answers given noted that there isn’t any one font, it depends on the use. In this post, I’ll list a few that I tend to use and why; and a few caveats.
For things like tick labels or labels in the plot, I tend to use a font that will be robust on the screen at a small size: it needs to be legible. This is not the place for a “display font” (fine serifs, funky letterforms). Use a workhorse font, such as the ones you might see heavily used in mobile design, such as these sans serifs: Roboto, Source Sans Pro or Segoe. A very close second is a slab serif font. Slab serifs are chunky serifs so they can work well at small sizes. Two that I like are Rockwell and Roboto Slab.
Data driven text
I like to use data-driven text in visualizations. Like labels in maps, type can express data values not only by varying size, but also by varying attributes such as font weight, width, typeface and so on. Much of this blog has examples of data driven text, such as the emotion word graph above, as well as my upcoming book Visualizing with Text. Here’s a sample of type attributes that can be data driven:
Even though the row “Typeface” shows some rather funky fonts, for data driven fonts, I tend to stick to a small number of different typefaces that can be readily distinguished. Readily distinguished means that each font should look different from the other fonts used but still work at small sizes. Again this rules out display fonts. I might use a mix such as a sans serif with a high x-height (e.g. Source Sans Pro), a slab serif (e.g. Roboto Slab), a serif with a low x-height and humanist letter forms (e.g. Garamond; or maybe a high stress serif, such as Bodoni), a blackletter font (no current favorite, avoid ornate ones, Lucida Blackletter is OK), and maybe a handletter font (again, avoid ornate ones, I like Tekton Pro: verticals are vertical and not sloped). Here’s a snapshot so you can see how different some of these fonts can be:
When encoding quantitative values into text, the most common approach in maps is to use small variation in size, or variation in font weight. You need to use a font with a large variation in weight from lightweight to heavyweight. Again, Source Sans Pro and Roboto offer a wide range of weights. Variable fonts often offer a wide range of weights. Some fonts also offer variation in widths – in this case I might use Saira which has many weights and many widths, but there may be better variable font choices now. Variable fonts are also better suited for web: instead of downloading the 36 weight and width combinations, a single font can be downloaded then configured in CSS.
Titles and Subtitles
Titles and subtitles are generally larger. This gives you more options. Often titles and subtitles may contain a sentence or two. Readability is a consideration and serifs are often considered more readable. I tend to like to use slab serifs (e.g. Roboto Slab) or a geometric sans (e.g. Gill Sans or Lato) for titles. Geometric sans tend to use simple geometric forms, such as a perfect circle for the letter o, which tends to make them wider than other sans serifs, which is why I don’t use geometric sans within the visualization.
There’s always caveats. If you’re creating a visualization where the labels use codes, such as airline flights (e.g. AC123), bonds (e.g. IBM2.5-250515), airline reservation codes, etc, make sure that the numbers and letters are clearly distinct – for example O0 or I1l may look too similar (e.g. Titillium Web). This is a real problem in many displays such as air traffic control, electric grid operations, financial market screens, and just about any modern app where items refer to ID codes. Font B612 was specifically designed to maximize these differences usable at small sizes in visual displays. Also note that many monospaced fonts are designed to accentuate these differences, such as Inconsolata.