Dashed and patterned lines for visualization (aka 1D texture)

Bertin and Texture

Jacques Bertin discusses texture in visualization, in his landmark book Semiology of Graphics (1967 original French edition, 1983 English translation). Much of modern visualization theory and implementation hinges on Bertin’s framework, including the notion of visual attributes, marks and layouts (for example, Wilkinson’s Grammar of Graphics, or Bostock’s D3).

Bertin talks about texture as a visual attribute and shows quite a few examples of texture applied to points, lines and areas. Bertin’s work exists at the heyday of Letraset – transfer film used by graphic designers to create professional graphics. Letraset provided professional fonts, graphics and textures that could be applied by any designer to create professional graphics for use in glossy magazine ads, to black-and-white screen graphics in newspapers, down to small scale zines – it was the democratization of desktop publishing before computation desktop publishing appeared in the mid 1980’s. Letraset and their competitors offered a wide variety of ready-made textures:

Samples from 300+ ready-made textures from Letraset (1,2,3,4).

And these kinds of textures appear in Bertin’s lines and maps and substantially influenced his work. Note how textures can be combined together to represent multiple variables:

Bertin uses textures in many visualizations, to represent quantitative values (left), or layer multiple categories (center and right).

But what is texture? What are the variables that make up texture? What are the parameters for use? Did Bertin articulate all the possibilities, or are there more?

This is too big of a question for a single blog post, so instead, consider a very narrow use of texture as applied in only one dimension – that is – a texture applied to a line. Texture constrained to a line is very limited, particularly if the line is understood to have no width (at least for this blog post).

Dash Patterns: Length, Gaps and Rhythm

Like Bertin, we can think of texture as a sequence of on/off values applied to a line to create a sequence of dashes. The regularity of the sequence allows the viewer to distinguish between a line with a short dash pattern with big gaps and or long dash pattern with small gaps, such as A50 and F90 in the image below. The wide range in variation indicates that these lines could be used to show quite a few different categories within one visualization. Also, both the dash length and the the proportion of ink could be utilized to represent quantitative values associated with the line, e.g. less ink for lower contour lines, more ink for higher contour lines; short dashes for low certainty, long dashes for high certainty, etc.

Dashes varying in length, gap proportion, rhythm, and randomness. Link to code using D3 and dasharray.

In the bottom half of the image, other variants are considered. Rhythm can vary: on the left, the lines have a consistent dash-gap rhythm ABABAB, as seen in line G0. In the center column, a dot is added into the gap: the line now has the rhythm ABCBABCB, such as G1. It has the same ink as its counterpart in the 0 column, and is clearly differentiated. The right column has two dots (rhythm ABCDCBABCDCBA – which is clearly easier to describe as a visual line shown in G2 rather than as an alphabetic sequence). Lines with none, one, or two dots are clearly differentiated, allowing for many more possible combinations dash sequences to be used.

Randomizing Lengths and Gaps

In the bottom half of the image, within each column the regularity of the line pattern is perturbed. That is, with each successive line a bit more randomness is added. In G0 the dash lengths and gaps are consistent, in H0 the difference is almost unnoticeable, and by K0 there is some noticeable differences in dash lengths and gaps. G0 is the same line as E70. The K0 line retains the highest similarity to line E70, even though dash lengths and gaps have been modified more heavily. The ability to make minor adjustments to dash lengths and gaps while retaining the same line identity is a property exploited by old mapmakers – essentially it is preferable for the dashed line to be solid at a corner, not a gap:

Dashed lines are solid at corners and gap over other lines.

As seen on this Ordnance Survey map from 1900, the dashes are solid at corners. The draftsman (drafts-person) has made minor adjustments to the intervals to retain the dash style while making the point of change in line direction visually explicit. Another interesting adjustment is that the gaps in the dashed lines occur when a dashed line crosses another line, as can be seen on the dotted line crossing lines in the far left and far right in the above thumbnails. This facilitates visual separation of the lines, not allowing them to be become an intersecting blob.

These kinds of adjustments are critical to making dashed lines effective in visualization. Consider a timeseries chart using a dashed line. If the gap occurs at a high point or low point, then it is impossible to determine the value from visually reading the chart, significantly reducing the effectiveness of the chart:

Left chart from 1912 shows dashed line with every high point and every low point solid. Right chart from Excel some high points and low points are gaps making it impossible to visually determine the value.

In the left chart, the drafts-person has a solid dash at every corner, whereas the Excel chart on the right has arbitrary dash lengths and gaps leading to corners not being explicitly visible. Since the corners on a line chart represent the actual data points, the arbitrary rendering results in data points not being visible! Yes, it is feasible in Excel to also plot the data points (as boxes, circles, stars, etc), but that only clutters the chart and breaks the rhythm of the line style — the drafts-people didn’t need to add extra marks for each point.

Given that a data visualization is supposed to show data, it is strange that the line style pattern overrides the visibility of the datapoints. Therefore, implementing effective dashed lines for visualizations isn’t as simple as defining a dasharray in SVG or D3. A lower-level dashed line module would be needed to make sure solid dashes occur at corners. This is non-trivial: what if there are multiple sharp angles in a short sequence (e.g. on the 18th day in the 1912 chart above), or multiple line crossings?

Showing Rhythm with Transparency, Saturation, Brightness or Hue

In Bertin’s example, texture is binary: it’s on or off. Much more is possible using computer graphics as opposed to transfer film. Transparency, saturation, brightness or hue can be modified with the regularity of the dash sequence:

Texture patterns represented with varying gradients and color. Link.

In the top block, L50 – Q90, gradient transparency varies from solid to transparent to create the same pattern as previously seen in the dashes on lines A50 – F90. The lines in the 50 column use a linear gradient that goes smoothly from min to max and back to min, whereas in column 70 the gradient drops off more quickly than it rises, and in column 90, the gradients have a sharp drop off. Interestingly, the non-equal application of the gradient appears to give the lines the illusion of motion, like motion blur in photographs.

In the bottom block, Rs – Wh, the same patterns are used with attributes of color. In the first column, only saturation is changed (from orange to grey) which is difficult to perceive as there is no contrast between the two colors. In the second column, only brightness is changed, with the alternating pattern of light orange, dark orange clearly visible. And the third column changes hue, resulting in a repeating rainbows along the lines. Again, attempting to use SVG and D3 to implement these is difficult and the approach used in my quick code won’t work with curving lines.

More importantly, these examples suggest that texture is operating at a different level than visual attributes such as length, transparency, hue and so forth; as the texture can be represented in any these other variables. So Bertin’s notion of texture and how texture fits into the formal definitions of visual attributes of data visualization may be more nuanced than the current models used in visualization research.

Conjunction of Textures

Bertin uses multiple overlapping textures to convey multiple variables in his maps: textures of small dense dots, with large coarse dots, with lines oriented in one direction, and lines oriented in a different direction. Can multiple dash patterns be combined on one-dimensional lines? Yes:

In the top row, the line X1 has a very short pattern with a lot of white space is shown. Then, on the left, a line Y2 also has a short pattern and an interval that is a multiple of the interval for X1. The two patterns can be interleaved form the line XY12. The right image from Bertin back at the top of this article uses this interleaving appproach: it uses carefully designed textures such that dots fit neatly between the diagonal lines.

On the right, the first line X1 is the same short dash pattern, while the second line Y7 has a long dash pattern: there is no way to combine the two patterns without intersection. Here the patterns are combined to alternate on/off to create a reverse video effect (i.e. the patterns are XORed together).

This kind of approach might be desirable for charts with many different dimensions, for example, a census chart plotting unemployment overall/minorities, and for all ages/and those under age 25. Thus lines can be overall solid, minorities short dash; all ages solid (same data as overall), under 25 long dash, and the conjunction of minorities under 25 in the conjunction of the two dash patterns.

So What?

If dash patterns are problematic, then why use them at all? Sometimes there may be a need to use many lines, more than can be comfortably differentiated using color. Dashed lines are also common on line charts to show predicted data, or on maps to show unpaved paths: semantically a dashed line can effectively convey the data has uncertainty. Dashed lines have real uses.

Texture is an under-explored area of data visualization. Historical charts and maps do show that these can be effectively used. Visualization tools and libraries, however, use dashes arbitrarily and don’t take into account how to draw dashes to suit perceptual needs. Furthermore, our definition of the design space around texture may be somewhat lacking – perhaps some future grad student will want to take on some of these issues.

Posted in Data Visualization, texture | Leave a comment

Visualizing with Text: Endorsements and more examples

Thanks everyone who’s bought a copy of Visualizing with Text. I hope you’re enjoying it.

I’m really appreciate some of the examples I’m seeing in the wild. Here’s some fantastic examples from Georgios Karamanis. I like the shifted + bold text indicating voting topics at the UN, and word-pairs describing makeup. And Georgios provides the code, so if you’re into R and want to see how to implement some of the text visualization techniques from the book, see his github.

Examples of visualizing with text by George Karamanis.

Also exciting to see an endorsement from Michael Friendly, and many others on Twitter. Thanks for the posts.

Photo of Michael Friendly visually endorsing Visualizing with Text.

I’ve talked about the book internally at our company (Uncharted) and have been pleasantly surprised to see some of the ideas weaving their way into some of our visual analytics, such as a button indicating the color legend within the button glyph; or a technique for interactively labelling neighbourhoods while zooming around a massive network.

While I can’t really do a book tour during Covid, I did a talk at Naomi Robbins’ Data Visualization NYC Meetup and interview with Lee Feinberg’s Analytic Stories. Looking back at the videos, I see I may have talked over a couple people – sorry! Happy to follow up.

Also, I noticed some Tweets regarding Typograms, or more generally laying out type to fit into shapes such as Aaron Kuehn‘s beautifully typographic anatomical posters. It’s a technique discussed in the book, such as Automatic Typographic Maps, which in turn were based on Axis Maps; Jean-Luc Arnaud‘s typographic maps or Kate McLean‘s smell maps.

This technique of fitting type onto lines or into shapes has been going on for centuries: I like Calligrammes from the early 20th century, medieval monks speaking in scrolls (on the cover of the book!), text set into the shape of an axe in a book from 1530, or awesome psychedelic posters, such as Wes Wilson‘s posters from the 1960’s. For any visualization researchers interested in algorithms for fitting text into complex shapes, see Ron Maharik’s Digital Micrography research and PhD thesis.

Posted in Data Visualization, Microtext, Text Visualization, Thematic Map | Leave a comment

Visualizing with SVG before D3: Timeseries

News headlines about the GameStop price swings this week reminded me of some old SVG visualizations of stock bubbles and crashes that I’ve done. SVG was around before D3. I generated SVG visualizations in the mid-2000’s well before D3.js. It was painful in comparison, but at the time it was a lot of fun.

The objective was experimentation: what could be done with a scalable vector graphics library? As such, it wasn’t constrained to screen resolution and screen dimensions (which at that time was typically 1600 x 1200) – rather you could do much more detail, import into Illustrator and print things at very high resolution with lots of detail, lots of transparency, and tiny text.

Analyzing timeseries

Here’s an example: Microsoft’s daily stock price from late 1992 to early 2004.

MSFT daily stock price 1992-2004 with different markers and shadings.

That’s 12 years of daily data with about 250 trading days per year resulting in a 3000 px wide visualization — which now draws just fine on my 4k display. With SVG, it is easy to layer in many different analyses creating different marks, lines and areas. Here’s a closeup of Lucent’s daily stock price during the Internet bubble:

Lucent stock price closeup of 1997-2001.

Many different graphical marks indicate data and derived indicators:

  • The blue line indicates the daily price.
  • The small green/yellow/orange/red bars behind the blue price line indicate the monthly price move from the beginning to the end of the month, colored by whether the price increases (green) or decreases (red). It aligns neatly with the monthly grid, filling in stripes within the grid.
  • The fat green/yellow/orange line behind that indicates yearly change in price. (There’s also light grey boxes behind that align with the thicker yearly grid.)
  • The green/purple circles indicating successive high/low points. Values for the highs and lows are indicated in text as well as the date. All successive low points are connected by a straight purple lines, all successive high points are connected by green straight lines. The zone between the green/purple lines form an envelope around the price range.
  • The many orange lines are moving averages each with a different time period, forming a guilloché. When moving averages start to cross it is indicator of a change in trend, e.g. from up-trend to down-trend (as many stock traders know). In 1997, the averages start to converge but then the trend continues. However, in 2000, the moving averages successively start to cross, and by June most of the moving averages have crossed, before the much steeper crash.
  • The large arcs and corresponding fills indicate major trends. Up-trends start at trend low to the ending trend high (e.g. 3/19/1997 at $10.75 to 12/8/1999 at $71.31: up almost 7x!) Down-trends start at trend high and end at trend low (e.g. June 5, 2000 at $56.23 to a point outside the closeup: 9/27/2002 at $0.77: down to almost 1% from the start, ouch!) This is what a bubble looks like and how it plays out.
  • The filled shadows in pink and blue indicate the 52-week low price and 52-week high price. On the first two-thirds of this chart, the pink shadow is predominant as the stock keeps going up with a few “lakes” of blue filled in the occasional dips. In the last third of the chart, it’s almost entirely under the blue shadow as the stock tanks.

Why is it important to have all these different markers, bars, lines, arcs, text, and so on?

There are many techniques to analyze timeseries data – moving averages, standard deviations, envelopes, and so on. In finance, one can explicitly study timeseries analysis. And more broadly, these analyses apply beyond stock prices to electrical grid loads, network utilization, software performance, automobile diagnostics, and so on.

Long timeseries

SVG is scalable, so I applied it to longer and longer timeseries. Here’s the Dow Jones Industrial Average(R), when I re-ran the code with in 2010 to compare the 2008 financial crisis to the 1929 crash:

Dow Jones Industrial Average 1896-2010 daily: 28500 days, every point plotted.

So, 114 years of daily data results in a plot more than 28000 pixels wide. That’s more than my 4K screen can display in it’s entirety at full resolution. But paper can. The visualization uses a log scale: you can see the magnitude of the 2008 crash at the top right is far smaller compared to the crash of 1929. I.e. from the index peak in 2007 around 14000 to a low of 6500 in 2009, the index lost a bit more than half of it’s value; whereas in 1929 the index peaked at 381 in September 1929 then dropped to 44 in 1932, down 88% of its original value! There was a lot of pain in 2007-8, but the massive intervention in the markets by the Federal Reserve and central banks helped stave off a much bigger crash and avoid a much bigger, longer recession.

There’s a few other things going on with this experimental visualization. Here’s a closeup of the 1950s:

Closeup of Dow Jones Industrial Average during the 1950’s

In addition to the the many different shadings and markers, the grid lines also participate in indicating data range: rather than extend across the display, the grid lines are localized to the line +/- a range. Axis labels also follow data-driven rules. The price labels in this snapshot follow the grid (note the stock market high of 381 in 1929 wasn’t surpassed until 1954, some 25 years later). The date labels are completely data driven – indicating dates on the top side of the line when the price hits new highs, and on the bottom side of the line when the price hits new lows.

Why is it important to have such long timeseries in so much detail?

Our firm has clients with financial timeseries that are more than 200 years long. Seeing all the detail is important. The current GameStop bubble is not unique, there have been many, many more, going back to railroad mania in the 1840’s or the South Sea bubble in 1720’s. Different bubbles will play out in different ways: having detail allows for comparison to prior bubbles for insight into the current bubble. Recoveries from recessions will be different for different sectors. Some market experts will use this information to inform their portfolio strategies in response to GameStop, or in response to Covid, or in response to an election cycle.

And why the variations in grids, arcs and areas?

D3 is great and much can be done out-of-box. But, when you just use the standard examples applied to different data, you might not be indicating the things that matter to the end user (i.e. the purpose is insight, not pictures). As such, it’s important to use the toolset to experiment with the underlying graphics to highlight the core insights.

I’ve done much more experimentation with SVG before D3, but those will be for some future blog posts.

Posted in Data Visualization, Line Chart, Timeseries | Leave a comment

Visualizing Causes of Death in Georgian London

LondonLives.org is a collection of 240,000 historic manuscripts from eighteenth century London. These have been collected, organized and analyzed, such as Sharon Howard’s summary of 2894 coroner inquests from London 1760-1799 with each case including subjects, verdicts and causes of death as well as links back to original handwritten manuscripts. The dataset is fascinating: How might this data be visualized?

Each each row in the summary dataset indicates subjects (e.g. Susanna Thompson, Sarah Cox), verbs (e.g. crushed, falling) and objects (chimney). These can extracted using natural language processing (although, note that Howard cautions that the causes of death are not fully accurate, nor is my natural language processing). These extracted words can be assembled into a hierarchy, for example verb, object, subject:
– Crushed
– Chimney
– Susanna Thompson, Sarah Cox

There are many, many different techniques for visualizing hierarchies (e.g. treevis.net):
Treemaps and sunburst focus on accurate representation of values using area, meaning that fitting legible text can be difficult.
Graph representations, such as nodes and links, emphasize the structure, again potentially making difficult to fit all legible labels.
Org-charts are designed to fit text, but the left-right/up-down layout can result in layouts that become very wide or very deep, making it difficult to fit an org-chart with 3000 items to a rectangular screen and display all the text at a legible size.
– More generally, most quantitative visualizations aren’t designed for representing large amounts of text.

Instead, typographers and publishers have techniques for depicting textual hierarchies not shown in any visualization compilation. Indexes and dictionaries are designed to show a large number of words in hierarchies in a very dense format. Dictionaries use wide variety of formatting, for example, using very heavy-weight text to make the defined word stand-out. Alphabetic ordering facilitates search. This supports quick non-linear skimming to the word of interest; then different formats are used within the definition (e.g. italics, small caps, etc.,) facilitating jumping to the type of data of interest. Size of entries is relevant: words that have many meanings have longer entries. In effect, indexes and dictionaries are visualizations, although constrained to printed pages bound in a book. With larger screens, indexes can be set out to be entirely visible on single screen or two.

Returning to the coroner inquests data, consider an index-like visualization. Organizing the hierarchy by verb, then object, then subject results in a list of subjects under an object, which in turn are listed under a verb. Here’s all the ways that people died under the verb CRUSHED:

Before government building codes and safety standards, many people died by being crushed.

Each object is listed in a separate coloured section. Each section starts with the object in bold (e.g. chimney), and has a list of named subjects in a narrow brown font (e.g. Christiana Jorden and 3 others). Individual names are links to the original manuscripts, e.g. Sarah Cox:

Given some unusual objects, a sample cause of death is shown in italics, (e.g. crushed by falling chimney). More frequent causes result in taller sections, for example being crushed by a chimney was more common than being crushed by a scaffold; while being crushed by a house or a theatre crowd was more common than chimneys.

Also, I was curious about gender: were women more prone to certain causes of death than men (assuming that women’s deaths were investigated equally to men’s deaths)? After each object a pair of numbers indicates the gender ratio, e.g. 1/3 indicates 1 man and 3 women were crushed by chimneys. To facilitate skimming for this ratio, the background colour of each section is shaded light blue to light red. This colour scale isn’t meaningful when viewing causes with only a couple deaths but provides some macro-scale patterns when viewing the larger dataset.

Here’s the 2894 inquests presented in an index layout, as a simple interactive HTML/CSS visualization. On a 4K screen all text is legible and readable:

Causes of death in Georgian London. The biggest blocks correspond to the most frequent causes, e.g. drowned, hanged.

Like a dictionary, the biggest blocks have the most entries: the biggest blocks are DROWNED (in column 3 in the image above) and HANGED (column 6). Both of these tend to be light blue (more men than women). The largest light red block (more women than men) is under BURNT (in the first column).

In the full visualization, the colour of subjects’ names correspond to the verdict: brown is for deaths deemed accidents, red for homicides, green for suicides, etc. These can be interactively filtered, for example, the most popular method of suicide was hanging (biggest box), generally more male suicides are recorded than female (generally more blue than red), many methods were used by both genders (e.g. drowning, poisoning, cutting), although gun shots are almost entirely male.

So what?

This example is from my recent book Visualizing with Text, including a comparison to a treemap. It’s a good illustration of a visualization that provides high-level perceptual patterns (large coloured blocks) and low-level details (words and phrases) within the same visualization – which Edward Tufte refers to as micro/macro readings. The book review of Visualizing with Text by Alec Barrett succinctly summarizes the benefit as being able to “zoom with one’s attention.”

Thanks to LondonLives.org and Sharon Howard for collecting, organizing and summarizing these historic documents.

Posted in Alphanumeric Chart, Data Visualization, Text Skimming, Text Visualization | Leave a comment

In Defence of Data-Dense Visualizations

I’ve done a couple of presentations with content from my book Visualizing with Text to grad classes in the last month. In both cases, a couple of people expressed concern regarding the complexity of some visualizations, being particularly data dense. This is an argument that I have heard on occasion throughout my career as I am often involved in the design and development of data-dense visualization for domain experts. Data-dense displays are not uncommon in domain applications – consider a few examples:

Figure 1: Some data-dense visualizations. Clockwise from top left, map of Abu Kebir, 1918; Earth Scientist’s Periodic Table of the elements and their ions, 2013; Financial trading floor desk, 2012; NYISO’s video wall of electrical grid, 2014.

These visualizations are packed with data. The map has many layers: roads, railways, rivers, canals, drains, buildings, labels, textures, etc. This periodic table has icons, symbols, numbers, names, pathways, ions, solutes, charges, etc. The trading desk shows a wide variety of information associated with securities: timeseries, events, color-coded news, internal data. The electric grid shows hundreds of entities: transformers, capacitors, hydro dams, wind farms, interchanges, power corridors of different voltages, status, thresholds, etc.

And, if you look at the Uncharted research website, you’ll see many more data-dense visualizations that our company has worked on.

Data density slows down understanding

The essence of the data-density criticism is that, with a greater number of data points and/or multi-variate data, the viewer can become confused as to where to look. There may be many different visual patterns competing for attention. Should the viewer be focusing on local patterns in a subset of the display, or macro patterns across the entire display? Should the viewer be attending to the color, or the size, or the labels? If each represents different data then the semantics will be different, based on the visual channel that the viewer is attending to. Worse, if the viewer starts to move back and forth between many different channels, one may forget the encodings and become more confused.

Some people have become conditioned to think of data visualization as they might see in the popular press, visualizations made for communication, or visualizations that utilize straightforward visualization types that you might get with a library such as D3 or Vega:

Figure 2: Some common visualization techniques, not particularly data dense compared to Figure 1.

Viewers may be conditioned to think of these as encompassing all (or most) visualization types, with many articles organizing visualization into a limited palette of visualization layouts: periodic tables, chart choosers, lists, galleries, zoos, and more. These visualizations typically don’t have many data points (20-200), and typically show only a couple of variables. Data is homogeneous – you’re not looking at multiple datasets with different types of entities jumbled together. Answers are easier, because there really aren’t many different dimensions or layers to be considered.

But not all problems are simple.

Complex problems may need complex data

The images in Figure 1 show that multi-metric, data-dense visual representations exist in practice – in both historic visualizations and modern interactive visualizations. These complex visualizations bring together multiple datasets in layers, in many windows, in large displays – i.e. into data-dense representations.

This extra data is required, because there may be multiple answers possible. If the price of a stock goes down, it may be due to the overall market going down, to a competitor dropping their prices, to poor sales data in the company’s earnings, to a negative news story about the company, or other factors. The extra visual context facilitates reasoning across the many plausible causes to assess the situation. In the stock example, multiple causes can be true at once: an expert needs to see all and determine which are most relevant to the current situation.

In addition to quantitative data, there may be other facts and evidence: qualitative data, news, videos, and so on. There may be multiple perspectives to consider. There may be different time horizons to consider. (For example, the stock market collapse in 2008 was triggered by the collapse of Lehman Brothers on September 15, 2008; but months before Bear Stearns (a competitor) was acquired when it ran into funding issues, and even earlier some mortgage origination companies went bankrupt.)

More generally, wicked problems are not easily solvable. The problem can be framed in more than one way, different stakeholders have different timelines, constraints and resources are subject to change, and there is no singular definitive answer.

As a colleague tells me: Complex problems have easy to explain wrong answers.

The Value of Data Density

Communication: There are many different reasons for creating visualizations. The low density visualizations in Figure 2 may be part of narrative visualizations, for explaining the results of an analysis. Data has been distilled to a few key facts.

Dashboard: Or, simple low-density visualizations may be part of a overview dashboard, with many small visualizations, each of which provides an overview of a different process, and can be typically clicked on for more detailed analysis. These overviews only need to provide sense of status: if there are any issues then the viewer has workflows to access more detail.

Beyond the communication and dashboard uses, there are many other uses for visualizations, where density may be valuable:

Organization: For example, the map and the periodic table in Figure 1 organize large amounts of data. These many layers of data allow cross referencing between many different types of information. On the map, the user may need to know the location of buildings (objectives), roads (connections), canals and railroads (obstructions) in order to plan a route.

Monitoring: The market data terminal and the electric grid operations wall in Figure 1 provide real-time monitoring across many data streams. Many heterogeneous datasets come together into a single display. Time is of the essence in real-time operations. Detailed data can’t be hidden a few clicks away: all key information must be designed and organized for quick scanning and immediate access.

Analysis: Knowledge maps and network visualizations are often about analysis of complex data. SciMaps.org has 100 knowledge maps, each collecting and visually representing many facets of a particular corpus; such as Figure 3 left, an interactive visualization of the edit history of Wikipedia articles by Wattenberg & Viégas. Visual Complexity has 1000 network visualizations, such as Figure 3 right, a dynamic visualization of social networks indicating people, links, activity, postings, sequence and message age by Offenhuber & Donath. Both of these are visualizations about text over time: edits, exchanges, persons. In both cases there are many dimensions to understand and comprehend.

Figure 3: Knowledge maps and network visualizations.

Exploration: Data-dense visualizations aren’t limited to domain experts attempting to understand complex datasets for their jobs. In Dear Data, Georgia Lupi and Stephanie Posavec create some awesome multi-variate data-dense visualizations of mundane day-to-day data. Why? Exploratory data analysis needs to consider lots of different data – the exploration is required to consider, assess, investigate, compare, understand and comprehend many different data elements. To do exploratory data analysis with only a well-organized quantitative dataset may miss much relevant data (e.g. see Data Feminism). Lupi and Posavec show by example that many different attributes that can be extracted from everyday life and then made explicitly visible for an initial exploratory view.

Figure 4: Dear Data. Left, Lupi’s visualization on doors. Right: Posavec’s visualization on clocks.

Data Density and Visualizing with Text

The objective of the book is to define the design space of Visualizing with Text for all kinds of visualizations, simple or complex. Section 2 in the book deals with simple labels, such as scatterplots, bar charts and line charts: text can be used to make simple visualizations more effective. Section 3 in the book goes further, using multiple visual attributes to convey more data (figure 5).

Figure 5. Some more complex visualizations from Visualizing with Text.

The Future of Data Density

Data density is likely to become a bigger issue in the next decade. Greater awareness of bias in data makes it more important to represent more datasets in a visualization. Analysis of richer data types – such as text, video and imagery – will likely necessitate new ways to layer in additional visual representations. Bigger data will have even more variables requiring more ways to show more data, or risk summarizing too much useful detail out of data. Specific visualization applications, such as cybersecurity, fake news, and phishing, need to deal with ever more complex attacks which implies more nuanced analyses based on more complex data.

Data density will become increasingly important to future visualization and visualization research.

Posted in Data Visualization, Text Visualization | 1 Comment

Visualizing with Text: author’s copy and new content

I just received my author’s copy of Visualizing with Text this morning! It’s awesome to finally hold the book after 2 years of writing (and the start of this blog 7 years ago!):

Here’s the book with the nice glossy cover.

Flipping through triggers some memories, like finding this user study on charts from 1961 comparing labels vs. legends! (Can you BELIV that there were user studies 59 years ago, before VisWeek? see Sarbaugh et al: Comprehension of Graphs):

Label or legend? An experiment from 1961.

Or a larger effort specifically for the book is captured on this page. Since the book defines a design space for visualizing with text, I felt compelled to demonstrate the flexibility of the design space to create many different visualizations of one document: here’s 14 different visualizations of Alice’s Adventures in Wonderland:

14 different visualizations of Alice’s Adventures in Wonderland.

And, here’s a page on very different uses of visualizations (beyond using visualization for preattentive perception of patterns). On the left is a system diagram of a power grid (an inventory use that organizes all the assets in the grid, courtesy of ISO New England). Top right is an infographic by Nigel Holmes of a graph, where the edges are literal text implicating individuals (a communication use that distills days of testimony down to select statements, courtesy of Nigel):

Different uses of a visualizations.

“Preview” is now working on the CRC Press site, and “Look Inside” is now working on Amazon.

Posted in Graph Visualization, Text Visualization | 1 Comment

Visualizing with Text – high-res figures now on-line

Visualizing with Text releases any day now: I hope to have my copy before the end of VisWeek. I’ve finally posted all the figures that I authored on-line with a CC-BY-SA 4.0 license. There’s 158 high-resolution images and diagrams from the book in the PDF file. These may be a nice complement to the eBook or physical book as some of the text may be too small to be readable in some of the larger visualizations. My figures are all released with a CC-BY-SA license, so they can be reused, for example, for teaching or mixed up into a collage or whatever.

Some of the figures from Visualizing with Text.

There’s another 99 figures that are not mine – I’ve included links to online versions of these images where available on the last page.

And links to many of the external figures.

Sometimes people ask me which of the visualizations I like the best. The answer varies over time, although I am currently biased towards the text-dense multi-variate visualizations designed for a large screen, such as these ones (Figures 6.19, 8.10, 9.8, 10.11, 11.3, 12.11) – see the PDF for high-res versions:

Some text-dense visualizations.

Why? Viscerally, I like the rich texture of shapes, colors, and structure where multiple patterns appears – visualization should be supportive of representing complexity and affording multiple interpretations. In my day-to-day work, I often design visualizations for financial market professionals: they don’t necessarily make money if they have the same ideas and same thesis as everyone else. Data-dense visualizations that prompt multiple hypotheses can be a good thing. (see also Barbara Tversky’s keynote at Visualization Psychology earlier today!).

I also think these dense visualizations push the boundaries of the design space of visualization and of text-visualization. Perceptually, multi-variate data can be a challenge. Data-dense visualizations can be a challenge. The linearity of text (i.e. you have to read words in some order) vs. the volume of information is a challenge: what happens to the global pattern? what happens if “overview first” doesn’t necessarily provide a macro-pattern?

A couple of these visualizations I just presented for the first time yesterday at the Visualization for Digital Humanities workshop in a paper titled Literal Encoding: Text is a first-class data encoding.

Posted in Data Visualization, Design Space, Text Visualization | Tagged , , , | Leave a comment

The transformation of Isotype

Over on EagerEyes, Robert Kosara recently asked, what happened to ISOTYPE? It’s a good question, with a multi-faceted answer. Here’s a few facets, mostly focused on inherent limitations of the Isotype design approach:

1. What’s an Isotype line chart?

Isotype examples are typically bar charts and maps, as they show counts of things – i.e. what we call unit visualizations. Extending it to other types of visualizations is non-trivial. Certainly scatterplots shouldn’t be too hard (e.g. fruit, animals), but what about line charts. Certainly timeseries data is important to plot for many analyses but what’s the Isotype answer for line charts? Typically Isotype reduces the timeseries down to a few time periods and draws them as their typical stacked bar. As such, Isotype is an approach to creating charts, but Isotype is not a comprehensive system for all types of visualizations.

Some have tried, for example, see the Agricultural Outlook Charts from the USDA in the 1950’s. Some of the bar charts in these publications are heavily influenced by Isotype, such as the coins in the Figure 1 left. However the line chart in Figure 1 right struggles with icons, instead the icons are limited to identifying the line, and indicating the trend with the horse representationally and quantitatively heading down-hill.

Figure 1: Isotype is not well suited to a line chart with many time intervals.

Figure 2 shows another 1950’s publication heavily influenced by Isotype: Midcentury White House Conference on Children and Youth, A Chart Book. On the left is a bar chart that could almost have been lifted straight out of Isotype, wonderfully clear. On the right is a pie-chart infused with Isotype icons, where only by luck the thin wedges of the pie fit it smaller icons assigned to them (and strictly speaking, the pie segment with “both parents” should have repetition of that icon through out the area, but that wouldn’t quite work either.

Figure 2: Isotype is also not well suited to this pie chart.

2. What’s the icon for GDP or CPI?

Icons for concrete real-world objects can be easy to design, as the real-world object can be the basis for the icon, e.g. people, fruit, animals, tractors, and so on. It gets trickier when some of those categories are visually similar: Isotype never created separate icons for wheat, barley and rye, for example. And Tufte’s log animal chart has three rather similar looking small furry mammals, which can be only be definitively decoded by referring to the labelled scatterplot on the previous page (exercise for the reader to now go find the previous page:-)

After that, icons get difficult to design for abstract objects. What would the icons be for financial data asset classes, such as: stock, bond, credit default swap, collateralized debt obligation, repo, option, future and a forward? Even concrete real-world entities can be misinterpreted, such as the famous dogcow.

Making simple, expressive icons requires design effort. Gerd Arntz‘s wonderfully expressive icons helped drive the success of Isotype, but are beyond the design capabilities of the average non-designer. Gerd could create an icon for the abstract concept of unemployment with a human icon, looking down, at rest, hands in pocket: it’s brilliant, but not easy to design especially with such clean, clear graphical shapes that can be easily printed.

3. And what about the axes (and the values)?

Perhaps most audacious move of Isotype is the removal of the numeric axes. Isotype charts are beautiful with their clean depiction of icon stacks and graphical cues. Removing the numeric axes is brilliant because you can easily and visually compare ratios of different stacks, e.g. one stack of icons is twice as long as another stack of icons.

But what if you want to know the values?

It’s a common task to want to know what number a stack of icons represents. Unfortunately, Isotype makes it hard. You have to first count the number of icons. Then you have to find the legend, where it tells you how many items that one icon represents. So, for example if I look at Eheschliessungen in Deutschland (Die Bunte Welt, 1928, page 42), I see that in 1919-1922 there are 8 marriage icons and each icon represents 400,000 marriages, so 8 x 400,000 = 3.2 million marriages. That’s math. That’s cognitive effort.

Furthermore, when the design system uses a small number of icons, it’s not very precise. Isotype tends to use full icons or nice fractions such as 1/4 or 1/2 of an icon. The prior stack of 8 icons could easily represent 3.1 million or 3.3 million.

If the chart had a numeric axis, you could just scan it and estimate the number directly – much easier. Or you could put the number directly in the chart. In Figure 3, the same marriage chart from Isotype is replicated with US data the Midcentury White House Conference on Children and Youth, with the addition of quantitative values at the end of the icon stack:

Figure 3: Isotype has only the icons and the legend: if you want to know the value, you can estimate the value by counting icons and multiplying. In this derivative of Isotype, you can just read the number.

4. Good Isotype is hard

Often simple designs are the result of hard work. Simplicity takes effort. In The transformer: principles of making Isotype charts (Hyphen 2009), Marie Neurath’s first hand account describes the design task of transforming data into an Isotype representation (what we might now refer to as encoding). Marie explains a myriad of design decisions made in different charts to get the desired reading of the result. For example, coffins are replaced with tombstones to address the issue of relative size of adjacent icons and potential misinterpretation. Or, doubling with width of an adjacent bar so that relative portions can be perceived. And so on. These are non-obvious design solutions, arrived through a design process to achieve a good effect that may seem obvious in retrospect. (Unfortunately, image copyright status is uncertain).

Has Isotype really disappeared?

The prior four points are focused on Isotype’s limitations that make it hard for Isotype to extend more generally across data visualization. I don’t even address points such as modernism (which both Robert Kosara have both previously talked about) or technical changes (by the late 1950’s phototypesetting became the norm, but the technology tended to soften edges and fine detail, so crisp icons may have been difficult to reproduce- I discuss some of these other factors, which also limit the use to text, in my forthcoming book Visualizing with Text).

That said, I believe that unit visualizations have inherited the legacy of Isotype. And there are some fantastic unit visualizations:

  1. Photos: Instead of the effort of designing icons, why not use photographs of the entities of interest? A favourite is Münzkabinett’s interactive piles of coins (Gortana et al).
  2. Shapes: Less expressive than photos, simple shapes are a good choice for unit visualizations with hundreds of thousands of items, such as SandDance.
  3. Labels: I’m personally interested in the use of labels in unit visualizations, such as this example of the passengers on the Titanic. I’m not the first, there are more compelling examples such as Maya Lin’s Vietnam Memorial, which in turn was based on earlier lists of casualties.
  4. Physical units: Physical visualizations are well suited to using units. These include specifically designed physical unit visualizations using concrete scales, as well as examples in the real world where the units can be perceived individually or as part of a whole, such as the fields of WWI crosses.
Posted in bar chart, Data Visualization, Isotype, Line Chart, Pie Chart | Tagged , , | Leave a comment

Visualizing with Text – Interactive Demos

Over the summer, eight of the examples from Visualizing with Text have been re-implemented as Observable notebooks! These can be interacted with, code is visible, and code can be edited live! All the demos can be accessed from the supplementary website.

They are all text-intensive visualizations with some interesting datasets, such as countries in the world where more people are dying than being born, countries with big jumps in Covid unemployment, and coroners’ reports indicating many ways to die in Georgian London.

Manipulating Code

For the rest of this post, I’ll focus on one demo: word skimming. This demo processes paragraphs of text, ranks words based on frequencies compared to a large corpus (Wikipedia), then weights the words so that least common words are formatted to stand-out. This formatting facilitates perception of uncommon words, which is a taught strategy for skimming text:

While skim formatting may seem strange in prose on a computer screen, the technique has existed for centuries, in instruction manuals, advertisements, comic books, and even software code editors, as shown in some of the examples in the book.

In this particular demo, the viewer can easily cut and paste a completely different text to format for skimming. They can also select from a few different formats, such as adjusting font weight, font width, or x-height. Here’s the same demo using the opening paragraphs from The Decline and Fall of the Roman Empire and formatted using both font weight and x-height to indicate uncommon words, such as valor, emperors, Trajan and Hadrian:

Variable width and variable x-height are uncommon in most prose text formatting. This demo uses the relatively new technology of variable fonts. With variable fonts, the font designer can expose properties, such as weight, width and x-height as quantitative parameters. These parameters are then adjusted based on a simple algorithm which splits apart words (using the JavaScript library compromise), calculates word rank (based on word lists from Wikipedia), and then adjusts the variable font parameters accordingly.

Since this example is implemented in Observable with all code editable, it’s straight-forward to modify the code and try other visual attributes that aren’t used in the demo code. For example, here’s the opening paragraph from Moby Dick, with uncommon words highlighted in red:

To highlight the words in red, a single line of code can be added (without worrying about adapting the drop-down menu, or configuring d3 scales, etc.) In this case, the following was added to the update cell:

// words are in an array elements [0-n] and d.rank is the rank of each word:    
elements[i].style.setProperty("color", (d.rank > 5000) ? "red":"black") 

Essentially, if the rank value is over 5000, color the word red, else color the word black.

Instead of a toggle red/black, we could easily add a d3 scale and use a color ramp, so that colors range from black, through blue and purple to red. (Note, no SVG is used, this is all HTML – D3 is only used in this code to facilitate scaling numeric values into ranges that suit the variable font used). In the snapshot below, a color scale is used in addition to font weight and x-height in this snapshot of text from Frankenstein:

While feasible, this is not necessarily recommended. The many changes of hue, weight and x-height creates many distractors, which reduces the readability of the sequential text. As a design effect, it may be an objective to create disruptive formats, such as the wonderful encoding by Ben Fry of Frankenfont.

On the otherhand, if the goal is skimming, then it is desirable to easily read the immediate surrounding context to aid comprehension. Changes of many formats reduces the typographic consistency requiring greater attention to decipher the surrounding text. Typographers discuss this as readability, which is described more in my book (briefly) and in typography books in more detail. One must take care when manipulating typographic attributes to not create Frankenstein paragraphs unless it suits the particular task or objective.

Posted in compromisejs, Data Visualization, Observable Notebook, Text Skimming, Text Visualization, Variable Fonts | Tagged , , , | Leave a comment

“…and what is the use of a book without pictures or conversations?”

My book Visualizing with Text is nearing publication (early November!). One goal of the book is to appeal to both researchers (with structured text with logical arguments), and designers (with many examples and pictures). I really like books with lots of images. Like the quote from Alice’s Adventures in Wonderland, in the title, I prefer well illustrated books. And I don’t like reading about visualizations where you only get little teeny snapshots: after all the primary subject is visualization!

Text is good to explain, structure and provide context, but pictures are important to a designer: to see how the conceptual description is actualized; to see all the little design decisions that contribute to the whole; to see the anomalies and things that could be improved; to provoke design inspiration; and so on. Arguably, Jacques Bertin was successful with Semiology of Graphics 50 years ago because he presented a theory supported with many examples and illustrations. My ideal goal is to have a 50/50 split between text and images. But, measuring text to images is a bit tricky:

Counting images

It turns out that the publisher and I count images differently. We can easily agree on page count (274), but my publisher currently counts 146 illustrations and I count 250 images (+/- a few to be sorted out). What? How can we differ so much?

I think the publisher is counting image captions. On the otherhand, I’m counting the individual images that might be referred to in a single caption, for example here’s figure 1.3. It has a single caption but two different images: left is a book from 1497, right is a different book from 1589. I count this as two images (the images come from two completely different sources).

Then, there are parts of the book where I discuss using visualization techniques inline with text. Since, I’m advocating inline visualization, these are simply text richly formatted within the flow of the prose. In this case, there is no image caption, so they don’t count at all, whereas I count it as one image (I had to write the code to crunch the data to format the text).

But there’s more nuances. What about a couple words formatted within written text, simply to facilitate cross-referencing to an image? I don’t count those. And what about an image that’s a composite of many teeny snapshots? I count those as one image. And what about a table that’s got some visualization formatting associated with each cell based on data attributes? That might count as a table and not an illustration. Again, I had to write the code to crunch the data to format the text, so again I count that as an image.

So, overall, I get to around 250 images, out of 274 pages. Given that the Table of Contents, Preface and Index don’t have images, that gets close to the 50/50 split.

Types of Pictures

Returning to the topic of pictures, a few other stats are useful. The pictures split approximately 50/50 between pictures that are my own vs. pictures of other visualizations. I think it’s good to provide a grounding based on real-world precedence. For the pictures from other sources, I’ve tried to include URL’s to them, so readers and educators can easily find the online versions where available.

With regards to my pictures, some are new unique visualizations, some are sequences of pictures, such as showing the same visualization using different attributes, or zoomed in or so on. There’s about 80 different examples of visualizations with text that I created in the book. Some have been published before, on this blog or in research papers.

But I wanted to create some new content (why buy a book with old pictures). So there’s new, unpublished visualizations that will be making their first appearance in the book, including scatterplots of cars, an adjacency matrix of dialogue, a syntax diagram, a massive textual stem & leaf diagram, assorted tables with visualization characteristics, a data comic, some expressive lengthening examples, and a topic model visualization. Here’s eight examples taken from the draft:

Larger versions of these pictures will be available when finalized and placed on the publisher’s website with a CC license shortly before the book is released.

Posted in Data Visualization, Text Visualization | Tagged , , , , | Leave a comment