Legacies of Isotype

ISOTYPE was a dramatic reconceptualization of statistical graphics in the 1930s by Otto and Marie Neurath and their collaborators. Contemporary charts, such as those seen in Brinton, were mostly black: simple dots or lines, tiny captions, and dense grid lines, axes, ticks and labels. Isotype instead was bold: almost always devoid of grid lines, axes and tick marks; minimal, bold sans-serif text; and usually reliant on repetition of expressive icons to convey quantities. Compare the two images below. Isotype evolved at the same time as Modernism, when the same ideas (broadly, “less is more”) were applied to many areas of design, including architecture, art, dance and industrial design.

How did Isotype’s visual language diffuse across charts, visualization and interfaces over the following decades? Here are three legacies:

Pictographic Icons

Perhaps the best-known feature of Isotype is its pictographic icons. Pictographic icons became increasingly important with post-war globalization: they are recognizable across languages and use less space than long labels. Standardized icons became popular across many areas of society, such as highway traffic signs, Olympic symbols, airport signage and warning symbols. Later, Mac and Windows made icons core interaction elements in graphical user interfaces (how many icons are visible on your screen right now? I count more than 125). Here’s a mid-1970s set of standardized symbols for the US Dept. of Transportation:

Isotype_SymbolSigns_CooperHewittOrg_18673291

Standardized icons from the mid-1970s, US Dept. of Transportation.
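Isotype’s quantity-by-repetition convention is easy to sketch in code: a value is rounded to a whole number of icons, each standing for a fixed unit. This is a minimal illustration; the function name, rounding rule and icon choice are mine, not from any Isotype tooling:

```python
def to_icons(value, unit, icon="▮"):
    """Isotype-style pictograph row: each icon stands for `unit` of the quantity.

    The value is rounded to a whole number of icons, since partial icons
    are hard to read (one reason fractions are awkward in pictographs).
    """
    count = round(value / unit)
    return icon * count

# 1,250,000 workers at 250,000 per icon -> a row of 5 icons
row = to_icons(1_250_000, 250_000)
```

Note the rounding step is where precision is lost, which is exactly the gap that the labelled values discussed below fill in.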

No Grids

The diffusion of Isotype benefited in part from technical changes in printing: the move from metal-based printing (which could handle fine detail) to offset printing (based on photographic compositing, which reduced the ability to reproduce fine details such as thin lines and crisp serifs). As a result, thin grid lines and small text became more difficult to use than chunky icons, large patches of color and bold, heavy-weight labels. This lines up well with the design ideology of Isotype. If we look at some charts from the mid-1970s, we can see the remains of Isotype: few or no grid lines, minimal text, and expressive pictographs:

Isotype_Graphis_Diagram_1976_p24.PNG

Charts from 1975: low on grids, low on text and some icons (Graphis Diagrams, 1976)

Labeled Values

Isotype worked hard to reduce text, but charts after Isotype suggest that showing numeric values is important. In the prior image, numeric values are explicitly labelled in all six charts. Presumably viewers want an estimate of the quantities corresponding to the visual marks, and they don’t want the cognitive load of counting icons or guessing the area of a circle, a folded corner or the relative width of smoke. Or perhaps it is difficult for icons to express fractions. Regardless, numerical values, either as labels on marks or labels on axes, came back. This was probably one of the first aspects of Isotype to slip; here’s a US Dept. of Agriculture bar chart from 1950, highly influenced by Isotype:

Isotype_Agricultural_Outlook_Charts_1950_Fuel_p25..PNG

Chart from 1950, highly influenced by Isotype (compare to first pair of images).

It has the icons (although moved to the axis and explicitly labelled) and minimal grids (although an outer frame has been added around the plot area). And, like the 1970s charts, it explicitly labels the bars with values.

The take-away is that removing value labels completely may have been a step too far on Isotype’s part. Even Haroz et al.’s study on “Isotype” charts included quantities along the y-axis in all test conditions. A numeric axis, labelled bars, or some other numeric guidance on the values seems to be broadly desired. We see these labelled values in many charts, such as Excel charts that label both the numeric axis and the value per bar (3 of the 11 quick styles provide both), like this one:

Isotype_Excel_Chart_with_Axis_and_Labels.PNG

or the USA Today Snapshots (which use many cues from Isotype, including pictographs, minimal text and no grids):
Isotype_USA_Today_Snapshot_Chart_Parks.PNG
Or in the very first bar chart in the very first tutorial of D3js (“Let’s make a bar chart”):

Isotype_D3js_bars_with_values.PNG
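The labelled-value convention running through all of these examples can be sketched as a tiny text bar chart, where each bar carries its explicit numeric label. This is an illustrative sketch, not the output of any of the tools above:

```python
def labeled_bars(data, width=20):
    """Render bars with explicit value labels, per the post-Isotype convention.

    Each row shows the category name, a length-encoded bar, and the exact
    numeric value, so the viewer never has to estimate from length alone.
    """
    top = max(data.values())
    lines = []
    for name, value in data.items():
        bar = "█" * max(1, round(width * value / top))
        lines.append(f"{name:<10}{bar} {value}")
    return "\n".join(lines)

chart = labeled_bars({"Coal": 120, "Oil": 80, "Gas": 45})
```

The same pattern (length encoding plus an explicit label) is what both the Excel quick styles and the D3 tutorial bar chart produce.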

Posted in Data Visualization, Isotype

Which scatterplot is preferred?

Next week, we’ll be presenting a machine learning (ML) and visualization paper at the IV2019 conference in Paris. The core idea is to tag and display relevant news headlines in a real-time ambient visualization system.

From an ML perspective, the challenge is to use an open-source news dataset (e.g. GDELT) where thousands of headlines are available and updated frequently (e.g. every 15 minutes), but the provided tags don’t match the needs. In general, classifying news stories is an ongoing challenge, as new topics and words emerge (e.g. Brexit, tariffs) and topics change over time (e.g. Clinton, environment). We provide a classification module where expert users start with a simple text search against the headlines. We then automatically suggest additional relevant keywords, which the user may explore, add, or remove. Additionally, the user may inspect sample headlines associated with any of the keywords, grouped by similarity. Then the user can explicitly mark target headlines and keywords as matching their intended search topic (or not). Finally, we run a classifier to tag all the headlines with respect to the topic defined by the sample keywords and headlines. The user can iterate, modify, reclassify, and so on. Much more detail on the technical approach is in the paper.
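The paper’s classification module is more sophisticated, but the loop (search, suggest keywords, accept or reject, reclassify) can be sketched with simple co-occurrence counting. All function names and the matching logic here are illustrative assumptions, not the system’s actual implementation:

```python
from collections import Counter

def suggest_keywords(headlines, seeds, top_n=3):
    """Suggest expansion keywords: words that co-occur with seed terms
    in matching headlines, most frequent first."""
    seed_set = {s.lower() for s in seeds}
    counts = Counter()
    for h in headlines:
        words = set(h.lower().split())
        if words & seed_set:                  # headline matches a seed term
            counts.update(words - seed_set)   # count its other words
    return [w for w, _ in counts.most_common(top_n)]

def tag_headlines(headlines, keywords):
    """Tag every headline that mentions any accepted keyword."""
    kw = {k.lower() for k in keywords}
    return [h for h in headlines if set(h.lower().split()) & kw]
```

A user starting from the seed "brexit" would see co-occurring terms like "tariffs" suggested, accept or reject them, then re-run `tag_headlines` over the full feed.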

From the visualization perspective, the challenge is how to display a subset of these headlines. We were working with an existing animated ambient visualization which provided either a scatterplot or a map upon which we could display headlines, and which would automatically select headlines at random to pop up. Note that a map with point data is essentially a scatterplot: the x and y locations of the data points are based on longitude and latitude, plus an underlying image shows geographic features such as land and sea.
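The map-equals-scatterplot point can be made concrete: under an equirectangular projection, latitude and longitude are just linear rescalings to x/y screen coordinates. This is a sketch of that idea; the actual system may use a different projection:

```python
def equirect(lat, lon, width=800, height=400):
    """Equirectangular projection: lat/lon become plain x/y scatterplot
    coordinates via two linear rescalings."""
    x = (lon + 180.0) / 360.0 * width   # -180..180 deg -> 0..width
    y = (90.0 - lat) / 180.0 * height   # 90..-90 deg -> 0..height (y grows downward)
    return x, y
```

Once points are placed this way, the map image underneath is purely contextual; the data layer is an ordinary scatterplot.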

We created four scatterplot variations: 1) a scatterplot map (with an underlying land/sea image); 2) a scatterplot with explicit axes (recency vs. number of stories); 3) a scatterplot based on a multi-dimensional projection (e.g. PCA or t-SNE); and 4) a scatterplot based on a random layout (i.e. like a word cloud), which is visually similar to the multi-dimensional projection:

VisIRML1
VisIRML2
VisIRML3

These visualizations were then reviewed with a small target group of users, in a meeting environment. This is where things get really interesting:

  1. Map: Everyone likes the map. It is immediately understandable.
  2. Scatterplot with explicit axes: Everyone logically understands the representation, but most are otherwise unenthusiastic. A few people are very interested, but given the broad audience for an ambient visualization, there is not enough support to push this into production.
  3. Scatterplot projection: The experts are not expert in multidimensional data projection. Multidimensional projection is difficult to explain and difficult to comprehend; people can’t use what they don’t understand. Unfortunately, dead on arrival.
  4. Scatterplot random layout (aka cloud): Surprisingly, some people really like word clouds, but this community of experts is not interested. It’s art, not information.

So, the map wins: there is a strong preference for the map over all the other scatterplot variants. Why is the map so strongly preferred? Unfortunately, the project didn’t have scope to investigate this, and there are various confounding effects: titles are smaller on the map, the scatterplots also use color-coded text, the map had leader lines to associate headlines with locations, and so on.

Here are some hypotheses why there was such a strong preference for maps:

a. Maps are easy to decode. The specific map we used is always global, and a global map is easy to decode because people are very familiar with them: it has very low cognitive load. Low cognitive load may be very important in an ambient visualization, as people don’t want to think hard about what they are seeing. Scatterplots, however, have higher cognitive load. You have to be actively engaged to decode them: you have to reference back and forth between data points and an axis, or you need to understand what a projection means. And a cloud doesn’t encode anything positionally, so from an information standpoint the map provides more information than a random cloud.

b. Maps automatically engage prior knowledge. A news headline situated in Iowa means that we can bring all of our knowledge about Iowa immediately as context to interpret the visual representation. If the news headline is about corn, that probably matches prior knowledge that Iowa is farmland and may have a lot of corn crops.

c. Maps are visceral. In Norman’s Design of Everyday Things, some objects elicit a visceral response: immediate and emotive. The map is immediately accessible and engaging. Ranking map vs. scatterplot with axes vs. multidimensional projection scatterplot, my hunch is that the map is the most viscerally engaging.

Thoughts? Comments?

Posted in Data Visualization, Text Visualization

SparkWords

SparkWords are words in running text, such as narrative prose or lists, where the words have additional data embedded in them as visual attributes such as color, bold or italic. The simplest use is differentiation, for example italic to indicate the name of a ship, such as Titanic. But attributes can be combined: one could indicate political candidates using color to represent party, italic to indicate gender, and underline to indicate an incumbent: Mazie Hirono, John McCain or Bernie Sanders.

SparkWords can go a lot further. Here’s a paragraph with some text about departments in France using SparkWords:

SparkWords_Semiologie_Graphique

Word weight and the proportions of red, green and blue are based on data: four different quantitative values are conveyed by visual attributes applied to the words. There is no need for a separate legend; it’s embedded in the explanation. There’s also no need for spark lines or spark bars: if bars were used instead, you would still need some kind of interaction to identify the individual bars. With SparkWords, the words uniquely encode the identity of each item. That is, in addition to weight, color, etc., the word itself encodes one more dimension of data.

Note that SparkWords do NOT adjust the size of words, because in running text the size of text stays consistent.

Here’s another example, an entire paragraph of SparkWords showing all the 2018 baseball games of the NY Yankees:

SparkWords_Yankees2018
Each three-letter sequence is an opposing team (e.g. TOR for Toronto, TBR for Tampa Bay). Each character is a game: red for a loss, green for a win (grey if no game). The background bar is the score differential. The initial game, represented by the first T in the paragraph, was a win for the Yankees with five runs over the Toronto Blue Jays. The second game was won with a smaller run differential (2); the third game was lost by a couple of runs.
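A SparkWord like the Yankees games above could be generated as styled inline text. This sketch emits HTML with a hypothetical inline-CSS scheme (win/loss as text color, run differential as a partial background bar); the published figures were not necessarily produced this way:

```python
def sparkword(char, win, diff, max_diff=10):
    """Render one game as a colored character with a background bar
    sized by run differential (hypothetical inline-CSS encoding)."""
    color = "green" if win else "red"
    pct = min(abs(diff), max_diff) / max_diff * 100  # bar height as a percentage
    return (f'<span style="color:{color};'
            f'background:linear-gradient(0deg, #ddd {pct:.0f}%, transparent {pct:.0f}%)">'
            f'{char}</span>')
```

Concatenating one such span per game yields a paragraph of running text that is simultaneously a season-long chart.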

So What?

Perhaps most interesting is that SparkWords represent a different way to think about visualizations. Instead of a separate visualization and a separate paragraph explaining it, with SparkWords the visualization moves directly into the words of the narrative text, and no additional visualization is needed. The notion of a separate plot area, or even a micro-sized plot such as a spark line, is not required. And SparkWords provide more context than plain text: in the example above, there’s a lot of similarity between Gers, Creuse and Lozere (same green, same light weight), but Paris, Seine and Nord are different: Paris is more blue (services) while Nord is more industrial (red). Or, in the case of the Yankees, there’s a lot of green overall, but some bad sequences of losses (TBR and BOS swept three-game series in the middle of the season).

Instead of putting text into a visualization, SparkWords puts the visualization into the text.

So why should we want to use SparkWords? There is an increasing need to explain data: data journalism, explainable AI, automated insights, data-driven natural language generation. Visualization by itself does not direct attention: there are many possible patterns, and it’s not obvious what to look at or what the specific insights are. Data narratives do explicitly talk about specific data points and trends, but do not provide context to help inform critical understanding. Techniques that offer tighter integration between explanation and visualization can be much more informative. SparkWords, like data comics, automated annotations and in-line visualizations (e.g. spark lines), bring visualization and narrative closer together. SparkWords are the only option that is pure text, so there may be some use cases where SparkWords are uniquely well suited for explanations.

I’ll be talking more about SparkWords at EuroVis on Thursday next week (June 6) at the 11-1 session on Text Visualization.

Posted in Alphanumeric Chart, Data Visualization, SparkWord

Data Comics (with NFL play example)

Data comics are a great extension to infographics. Data comics are essentially a narrative explanation of a visualization set out in a comic-like format. The overall sequence explains the story. (e.g. see this paper for some examples comparing infographics to data comics).

I wanted to get a better sense of data comics, so I made one. As a starting point, I took an example of NFL data from my previous book Graph Analysis and Visualization (Brath and Jonker, 2015). Here’s the resulting data comic of two NFL teams and their sequences of plays during 2011:

NFL_run_vs_pass_frequency_data_comic

Hopefully the story is self-explanatory in the comic. The purpose of making this comic though was to learn more about data comics – what works and what doesn’t? Essentially this is research-through-design, wherein insights are gained by making something rather than studying theory:

“The term ‘experiment’ is narrowly understood (in ‘the scientific method’) as a piece of controlled research, in which variables are isolated and controlled, and a hypothesis is validated or rejected. But the term has another use – in a much broader sense of ‘trying something out to see if it works’ as either part of an inquiry or program (Redström 2011) or as part of an action-oriented intervention (Halse et al. 2010)” – Stappers and Giaccardi.

Steps include having some problem or hypothesis, design iteration to create prototypes, and an end result of knowledge gain.

Why a data comic?

So the first question is: why a data comic? There are many ways to combine narrative explanations with visualization, ranging from infographic posters, to interactives with steppers, to long narratives alternating between paragraphs and charts, to long scrolling narratives where the scroll triggers an interaction such as a filter or zoom, to visualizations with sequential tutorials in a side panel (step 1: do this; step 2: now do this), and so on.

So why data comics? Like most infographics, most of the data and the story are made explicit. The story isn’t buried under tooltips or required interaction. I’ve always been an advocate of not hiding too much data under interactions. However, unlike some infographics, the narrative storytelling sequence is much more explicit in a data comic. If you have an explicit narrative, a comic offers a strong sequential structure and follows a recognized convention.

But wait a minute. There are other ways to provide a strong narrative structure. Long scrolling visual stories on websites are pretty much data comics too, aren’t they? (e.g. there are 15 long scrolling visual stories in Archie Tse’s post about scrolling storytelling at the NY Times). While a comic is page-oriented with a left-to-right structure (left image), the scrolling layout is essentially a set of panels oriented vertically (center image):

data_comic_vs_scrolling_data_story_vs_surrounding_sequenced_story

And I’ve created other strongly narrative visualizations that are somewhere between a data comic and an infographic, as shown by the third example, which we’ve implemented in some fully automated data-driven charts. This third example was informed by a process we’ve often used for documenting visualization wireframes: rather than many pages of wireframes, we create a single wireframe with sequential annotations around it (some old examples here are referred to as paper landscapes). Cues such as sequential numbers and leader lines are used, in addition to a general left-to-right, top-to-bottom flow, to enhance the sequential narrative.

Knowledge gained

Strong narrative sequence is inherent in a data comic. It follows well-known comic conventions, likely familiar world-wide, and therefore requires little training. This narrative sequence does not need additional supporting cues, such as numbers to sequence chunks of text.

However:

Text and visualization don’t need to be constrained to panels. Just like in a movie, where a character may still be talking after the scene cuts to the next, an annotation or visualization can extend across multiple panels. I initially attempted to make the above data comic work by repeating the visualization tree in each panel:

data_comic_viz_constrained_to_panels.PNG

This is an arbitrary constraint inherited from the comic convention. Yes, it’s a small multiple that can be compared and contrasted, but in the NFL example above nothing is changing to compare and contrast! It’s a waste of ink. Instead, the visualization can extend across the panels. This reduces the pain of repeating the same visualization scene to scene, and it creates more space to enrich the visualization: in this NFL example it allows the addition of useful labels:

data_comic_viz_spanning_panels.PNG

And spanning text or visuals is a technique used in comics for many decades. Here’s an example from 1953 from the Digital Comic Museum: (Aehaya!)

data_comic_cave_girl_text_spans_comic.PNG


Incremental legend. Because the panels are smallish and the text is brief, the visualization can’t be explained up front: the layout, the colors and the scales can’t all be covered in the first panel. So the legend gets split into pieces in different panels and revealed throughout the story. But legend pieces don’t always occur in the panel where they are discussed. For example, the horizontal scales at the bottom of the second row also apply to the corresponding panels on the upper row, but that might not be obvious.

Similarly, the column labels (team, 1st down, 2nd down, etc.) float strangely between the narrative text and the visualization. Ideally, they are associated with the viz (same color and same font as the viz), but there is potential for confusion. The integration of visualization legend and labels, and the explanation of the visualization technique in relation to the narrative story, could likely have been done in a better way.

Narrative. The top row of panels tends to be descriptive observations about the data. The narrative in the lower panel is more comparative, contrasting the two rows. Unfortunately, the narrative in this NFL example is hand-written, and it’s not easy to write a story to fit the limited space available. And the hand-written story was created only for these two teams, so the stories are no longer useful when looking at other teams.

An even better solution would be machine generation of the story, such that when the viewer changes the team the narrative updates appropriately (see the previous post on insight generation). Obviously there are some interesting research opportunities in interactive data comics plus natural language generation.

Callouts. Speech bubbles from comics can easily be used to call out data or insights. They can act like tooltips to let the data speak. In this NFL example, the individual plays and players are lost when the data is aggregated to create the tree visualization, so some information about the top players behind the plays is made visible using callouts.

SparkWords. One challenge in any visual explanation is linking the narrative text to the visual representation. In a comic, the text and the characters can be tightly linked using a variety of cues. For example, the pointy bit on the speech bubble links it to a character. The placement of a sound-effect is proximate to the thing making the sound. Or the font used matches the character and their emotional state (e.g. shaky scary letters for a ghost).

SparkWords encode data using the same color-coding as the visualization. In this NFL example, the words run, pass, and other in the narrative use the same color-coding as the corresponding bar in the visualization. Given that there are many bars, the color-coded words presumably can be more easily associated with the corresponding colored bar. However, the color-coding of the words occurs before the explanation of the color-coding, so there is the potential that these colors could confuse the reader. SparkWords will be the subject of a future post.

Posted in Data Comic, Data Visualization, Design Space

Visualizing Quantitative Values in 3D

We’re working on a few data visualization projects at Uncharted using VR, AR and 3D printing. Given the rise of these new techniques, it may be time to dust off 3D data visualization (again). What are the use cases where 3D visualization works? What were the things that were difficult with 3D on the desktop that devices or 3D prints might solve? Yes, 3D has issues such as occlusion, navigation, perspective foreshortening and so on. And 3D is already known to be effective for things that are already inherently 3D, such as fluid flow analysis or 3D medical imaging.

For this particular post, I’ll consider some cases where 3D may be effective for visualizing quantities, such as scatterplots, bar charts and surfaces:

1. Length

Length is effective for representing quantities in 2D (Bertin, Mackinlay, Cleveland and McGill, Heer and Bostock, etc. all agree on this). The viewer can make quick comparisons of ratios, for example to estimate whether one bar is twice as long as another. In 2D, error increases when baselines are not aligned, but lengths are still much more accurate than, say, hue, brightness or area.

Going into 3D perspective, presumably the error in estimating lengths will increase due to perspective distortion. But is it really that much of an error? There are extremely strong perspective cues that we use to make judgements in 3D spaces. For example, we know parallel lines converge towards the horizon, such as the edges of a roadway. Regular patterns also help: the regularity of a dashed lane line in perspective provides a cue for estimating distance.

So error will increase in perspective, but lengths in perspective can still be quite accurate. Consider this old “pin map” from Brinton (from 100 years ago!):

3D_pin_map_Willard_Cope_Brinton_Graphic_Methods_1914_smaller.png

All the pin-stacks are set on a common base. The perspective effect, judging from the base, appears to be not particularly distorted. The consistent size of the round pin-heads further increases confidence that sizes aren’t distorted. A viewer likely has a high degree of confidence to say that the height of Boston is around 2.5x that of New York.

Compare this to a contemporary 2D map, using bubbles to indicate quantities:

2D_bubble_map_EnvironmentAmerica.png

A viewer likely has much less certainty comparing the bubble in New Hampshire to the bubble in Rhode Island. New Hampshire is bigger, but by how much? 3x? 4x? 5x? Area is less accurate than length.

2. Perspective is just a log transformation

While some people consider perspective to distort data, it is better thought of as a systematic, non-linear compression applied to the entire scene, much like a log transformation. Non-linear transformations such as logs are common in data visualization; we’re just used to transforming only the plot area, not the entire scene. Here’s a bar chart from the 1970s tilted back in 3D (with a weird bend at the back):

3D_Tilted_Bar_Chart_California_Water_Atlas.jpg

At the front of the scene, i.e. near the base of the chart, we can see more detail than at the back. Small bars are comparable: in September (far right) the Feather River appears to be 2x the American River, which in turn is perhaps 5x Putah Creek. Large values are also visually comparable to other large bars: in April the Feather River is almost 2x the Yuba River. The perspective effect is much stronger in this example, but the strong grid lines and the vanishing effect on the consistent-width bars are strong cues facilitating estimation: you can see the dip in Putah Creek from Oct to Nov, with values in the low 100s, and the slight dip in the Feather River from Mar to May, with values in the mid 10,000s.

You can apply the perspective distortion along the x-axis instead. Here’s a timeseries chart with a few years of daily data:

3D_Tilted_Timeseries_Uncharted_Software.png

In the foreground, far right, each day is clearly visible. In the background, far left, individual days are not, in effect compressing time for older dates. This time compression is typical in a lot of timeseries analysis: a typical tabular analysis might provide comparisons such as week-to-date, month-to-date, quarter-to-date or year-to-date.

Essentially, this is a focus+context visualization technique (e.g. see TableLens or fisheye views). The right side clearly shows the discrete daily movement of the price, with more than 30 times the 2D area of the start of the timeseries; the left side provides the context, where daily movement is not clearly visible but the longer trend and broader vertical range are.

However, perspective provides additional value beyond other focus+context techniques. A table lens or fisheye is discontinuous in its magnification, adding cognitive load as the user switches back and forth between the close-up and the context. Perspective provides a continuous transformation across the display, facilitating continuous comparison between the detail and the context.

Trends across the perspective are clearly visible. For example, a straight line could be drawn from the starting point (at $10 in Jan 2009) to the high point (near $27 in May 2011), and this line would be near to many of the other high points in 2009 and 2010.  And this straight line would remain a valid straight line regardless of the perspective viewpoint.
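The claims above are easy to check numerically: under a pinhole perspective projection, apparent size falls off as 1/depth, yet straight lines in 3D remain straight on screen. A minimal sketch, not tied to any particular chart:

```python
def project(x, y, z, f=1.0):
    """Pinhole perspective: screen position scales as f/z,
    so apparent size shrinks with depth."""
    return f * x / z, f * y / z

# Equal 3D bar heights at increasing depth project to shrinking screen heights.
heights = [project(0, 1, z)[1] for z in (1, 2, 4)]  # 1.0, 0.5, 0.25

def collinear(p, q, r, eps=1e-9):
    """2D cross-product test: zero means the three projected points lie on one line."""
    return abs((q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])) < eps

# Points on a straight 3D line stay on a straight 2D line after projection.
on_screen = [project(t, t, 1 + t) for t in (0, 1, 2)]
```

The second property is why the straight reference line drawn across the timeseries above remains valid from any viewpoint.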

3. 3D bars may facilitate comparisons

3D bars are commonly used as an example of where 3D should not be used. Tall 3D bars in the foreground can occlude short 3D bars in the background. Short bars can appear more salient than their height warrants, because their visible tops add graphical area beyond the height alone. And so on.

But 2D bars also have issues and introduce biases. Here’s a quick example:

3D_bar_chart_vs_2D_bar_chart_smallest.png

In 2D, the bars must be oriented either vertically or horizontally. The orientation introduces bias: it is far easier to compare across bars in columns than across bars in rows. In the 3D representation the viewer can compare by row or by column. In 3D the viewer can also distinguish between a zero value (the flat bar in cell B2) and a null value (no bar). There are probably a few experiments that could be done here for a keen masters or PhD student.

4. Meshes and Surfaces

Rather than just bars or lines, rectangular meshes are well suited to 3D. When the mesh is spaced at regular intervals, there is a strong perspective cue facilitating comparison across points on the mesh, and relative heights between points can be assessed. Here are a couple of examples from Brinton’s book:

3D_Surface_Brinton_1914.PNG

3D surfaces have many modern applications, such as plotting distributions across two variables, evaluating financial derivatives, etc. Here’s an example surface showing the Canadian yield curve (along the right side, i.e. interest rates for one month out to 10 years) and the value of that curve every day over 5 years (left side) (via Uncharted):

3D_Surface_Canadian_Yields_2006-2010.png

The huge drop in short-term rates in 2008 is immediately visible as interest rates dive during the financial crisis. Areas where the surface is nearly flat or tilted at an angle, and periods where there are curves and kinks, are visible as well. These waves, wobbles and kinks are visible in part due to the consistent grid lines of the mesh and the color applied to the surface. It is also aided by the careful lighting and material configuration in the 3D scene, which creates highlights.

3D Printed Surfaces

Given a data-driven 3D computer-generated surface, why not print it? Here’s the same dataset, as a 3D print (set on a matching laser cut wood box):

3D_Surface_Canadian_Yields_2006-2010_3D_Print_Uncharted.jpg

The grid in the 3D print is obtained by changing the print material from transparent plastic to black plastic at regular intervals. While there are no tooltips nor interactive slicing, some other observations are facilitated by a physical object. It’s tactile: you can feel how the shape changes. Some of the sharp ridges and deep crevices are more easily explored as a tactile 3D object. In a physical environment the viewer can easily tumble the object to any orientation without awkward keyboard or mouse movements. And one can easily adjust the position of the object relative to physical light sources to see highlights (or not) or otherwise gain insight into the complex shape. And there is a light in the box to illuminate the surface from behind.

There’s more to 3D than just estimating lengths and heights. Perhaps there are many future blog posts to be done on other aspects such as navigating 3D, text in 3D, mental models in 3D and so on.

Posted in Data Visualization

Generating stories about data with visualization

Early in my career, I’d create data visualizations and without fail, my manager would ask: “So, what’s the story here?” In data visualization the objective isn’t the visualization – it’s the insight gained from the visualization.

Visualizations don’t announce their insights. Whether dashboards with a couple of bar charts, massively complex visualizations of billions of tweets, or hairball graphs, there are many possible insights. Narrative visualization is the addition of a story to a visualization, to explain it and to highlight specific insights. The NY Times and The Guardian create human-authored narratives to explain insights. But visualizations with human-authored narratives like the Times’ are a lot of work. Instead, an office worker without a graphics team might add a paragraph or two on top of a visualization, maybe with a link or two that pivots the view.

Instead, data-driven Natural Language Generation (NLG) can completely sidestep visualization. The approach is to use data and advanced analytics to algorithmically derive insights, then assemble those insights into computer-generated text. Some of the results are impressive, generating not just insights but interesting stories.

But automated insights wrapped in natural language lose all the contextual data. One alternative is to simply place a visualization beside an NLG paragraph, but that requires the reader to do all the work of cross-referencing back and forth between the paragraphs and the visualization.

Why not automate insights, and put them directly in the charts?

Visualization libraries such as Semiotic have built-in code-driven annotations, so it would be feasible to automatically generate the insight, map that to some kind of annotation, then plot the annotation. This sounds great, but before we can do that we need to know:

What are the kinds of insights that work well with annotations?

This is something that we’ve done in a number of different ways on different projects over the years. Looking back, there are some common patterns, such as insights about a specific data point, about the plot area, or about event data:

 

Insights about data points

Scrutinizing specific data points is a common task in data visualization. These data points might be extremes (to identify the leaders and laggards); outliers (to validate that the data isn’t erroneous); or benchmarks that help orient the viewer in the data (like a landmark). Labeling points is straightforward in a variety of different visualizations, as these diagrams suggest:

Annotations_about_data_points.PNG
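As a minimal sketch of how such point-level insights could be derived automatically, the function below finds extremes and IQR outliers and emits annotation payloads ready for any charting library. The function name and payload shape are my own illustration, not from any particular library.

```python
# Sketch: derive point-level annotation payloads (extremes and outliers)
# from raw data. The payload dicts could then be mapped onto a chart's
# annotation layer; names here are illustrative, not a real library API.

def point_annotations(points):
    """points: list of (label, value) pairs. Returns a list of annotation dicts."""
    values = sorted(v for _, v in points)
    n = len(values)
    # Rough quartiles and the usual 1.5*IQR outlier fences.
    q1, q3 = values[n // 4], values[(3 * n) // 4]
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    annotations = []
    for label, value in points:
        if value == values[-1]:
            annotations.append({"label": label, "value": value, "note": "maximum"})
        elif value == values[0]:
            annotations.append({"label": label, "value": value, "note": "minimum"})
        elif value < lo or value > hi:
            annotations.append({"label": label, "value": value, "note": "outlier"})
    return annotations

data = [("A", 4), ("B", 5), ("C", 6), ("D", 5), ("E", 40), ("F", 3)]
for a in point_annotations(data):
    print(a)
```

Here the leader ("E") and laggard ("F") get labeled; a benchmark point could be tagged the same way from a lookup list rather than a computation.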

 

Insights about the plot area

In other cases, the desired insights are at an aggregate level. For example, understanding the range of the data is a common task. How big is the difference between the biggest and smallest data points? On a stock chart, there’s a big difference between a stock with a 2% range and one with an 80% range, and a big difference in how an investor responds to that magnitude.

Trend is related, and there are many ways to measure it: averages, last minus first, regression, moving averages, curve fitting, and so on.

Sometimes the challenge isn’t the data but the semantics of the plot: scatterplots can be challenging because determining the meaningful combinations of coordinates, such as the sweet spot, requires some cognitive effort. Highlighting areas on the plot or using contours can be effective here.

Another pattern, common in sports commentary, is the threshold: such as the sports superstar approaching the all-time record. This is easily translated into a visual annotation such as a line:

Annotations_about_plot_area.PNG
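The plot-area insights above can also be sketched as code. This toy function derives range, trend (using an ordinary least-squares slope as one of the many possible trend measures mentioned above), and a threshold check; the function name and output shape are illustrative assumptions.

```python
# Sketch: derive plot-area insights (range, trend, threshold) as a dict of
# annotation payloads. OLS slope stands in for "trend"; any of the other
# trend measures (moving average, last minus first, ...) could be swapped in.

def plot_area_insights(series, threshold=None):
    """series: list of y-values sampled at regular intervals."""
    n = len(series)
    lo, hi = min(series), max(series)
    insights = {"range_pct": 100.0 * (hi - lo) / lo if lo else None}

    # Trend: slope of the least-squares fit y = a + b*x, with x = 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(series))
    den = sum((i - mean_x) ** 2 for i in range(n))
    insights["slope"] = num / den

    # Threshold: has the series reached the record / target level?
    if threshold is not None:
        insights["crossed_threshold"] = hi >= threshold
    return insights

print(plot_area_insights([100, 102, 101, 105, 108], threshold=107))
```

Each entry in the result maps naturally to a visual annotation: the range to a bracket, the slope to a trend line, the threshold to a horizontal reference line.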

 

Insights associated with an event

Sometimes the insight is an event that already has associated commentary, such as a news story, a tweet, or a pivotal moment. In these cases a narrative snippet may already exist, such as a news headline, and it can be depicted directly as a textual annotation:

Annotations_about_events.PNG
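A minimal sketch of this pattern: given an existing list of dated headlines, anchor each one to a point in the series so the snippet can be drawn as a text annotation. ISO date strings compare chronologically, which keeps the sketch dependency-free; the function name and payload shape are assumptions for illustration.

```python
# Sketch: attach existing event commentary (e.g. news headlines) to the
# latest series point at or before the event date, producing payloads for
# a chart's text-annotation layer. Names here are illustrative.

def event_annotations(series, events):
    """series: list of (iso_date, value); events: list of (iso_date, headline)."""
    dates = [d for d, _ in series]
    lookup = dict(series)
    out = []
    for event_date, headline in events:
        # ISO strings sort chronologically: pick the latest series date
        # at or before the event, falling back to the first point.
        candidates = [d for d in dates if d <= event_date]
        anchor = max(candidates) if candidates else dates[0]
        out.append({"date": anchor, "value": lookup[anchor], "text": headline})
    return out

series = [("2020-01-01", 10), ("2020-02-01", 12), ("2020-03-01", 9)]
events = [("2020-02-15", "Product recall announced")]
print(event_annotations(series, events))
```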

 

So what?

Annotations explicitly label insights directly on a plot, with the full context visible. The viewer gains the benefit of the authored insight, and is also fully informed by the context to ask critical questions or otherwise probe the data. It’s this ability to understand the authored insight and then derive our own insights that makes narrative visualization so compelling.

The above patterns are just a start. What is the full catalog of insight patterns that go with visualizations? Do they work across the wide variety of esoteric visualizations? And, most importantly, which insights are meaningful? There are many possible insights, so from an automation standpoint, which should be promoted to a visible annotation?

 

Posted in Annotation, Data Visualization

Bertin extended to text (pt 2)

In a previous post, I discussed Bertin’s writings on text attributes in visualization in the classic Sémiologie Graphique (a discussion that appears only in the French edition, not the English translation). In particular, I noted that Bertin has been highly influential in the fields of visualization and cartography, not only because he provided a framework for creating visualizations, but also because he created hundreds of examples to illustrate the breadth of possibilities.

Figure 1. Bertin’s dataset of populations in 90 departments.

So now, 50 years after Bertin, I decided to mash up some of the text-based visualization ideas that I’ve been using with Bertin’s original French population dataset.

Bertin originally used a small data set of 90 French departments, with population counts for three different occupations (agriculture, manufacturing, and services). The dataset was small enough for Bertin to publish on half a page (page 100 of the English edition, shown here in Figure 1). The additional columns are totals and ratios.

Bertin takes the dataset and creates nearly 100 different visualizations: bar charts, scatterplots, ternary plots, parallel coordinate plots, maps, cartograms, and so on (pages 101–137 in the English edition). A small subset is shown in Figure 2 below. But none of them use text.

Figure 2. Just a few of the many different visualizations that Bertin constructs from the same small dataset of populations per department.

I take the same dataset (why not?) and create a dozen new, text-rich visualizations (shown in Figure 3). Typically I use the names of the departments in the visualizations, so individual departments can be identified directly: there is no need to cross-reference tables, no need to rely on interactions.

Figure 3: 12 new text-dense visualizations based on the same dataset as Bertin.

For example, I’ve previously talked about microtext line charts. In the center is a parallel coordinates plot where the lines connecting columns have been replaced with microtext – shown as a much larger image in Figure 4. Color is based on percent of occupation: green for high agriculture, red for high manufacturing and blue for high services. At a macro-level you can see the inverse relationship between agriculture and manufacturing. At a detail level you can trace the lines a bit more easily and directly identify them.

Figure 4: Microtext parallel coordinates chart. Names and codes for each department are along each line. Click for big version.

A full research paper on these visualizations appears in a special issue of the journal Cartography and Geographic Information Science (CaGIS), volume 46, issue 2, marking the 50th anniversary of Jacques Bertin’s Sémiologie Graphique. Full volume description here, and this link is a free view of the paper (first 50 viewers only).

 

Posted in Data Visualization, Microtext, Text Visualization