Communication Design Studio: Project 3

John Baldridge
Nov 11, 2020
World’s Biggest Data Breaches & Hacks Select losses greater than 30,000 records. Screenshot from https://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/. Data sources: databreaches.net, IDTheftCentre and media reports

WEEK 11 — CLASS 21

Tuesday, November 10, 2020

Project 3 introduction

For Project 3, we were each assigned an existing data visualization and asked to analyze it to determine its narrative and key points. After analyzing the World’s Biggest Data Breaches & Hacks data visualization, it became clear that the author wanted to showcase both the frequency and severity of data breaches over time. Based on the visualization, each year we seem to see more data breaches with greater data exposure.

The interactive data visualization allows the user to filter the data to reveal other key points. For example, the user can break the data down by sector (academic, government, tech, retail, etc.) and method (hacked, lost device, inside job, etc.) and then see whether most data breaches stem from being hacked or from user error, the “oops” category.

The user also has the ability to toggle between the year of the exposure and the data sensitivity level. The user can then hover over each bubble to learn more about the breach. For example, the Facebook bubble reveals that in Sept 2019, “Several unprotected databases were found to contain the phone numbers of around 20% of all Facebook users, with (in some cases) names and locations.” It then links to the corresponding Fast Company article.

In the backend, the data that is feeding the data visualization looks like this.

Row 59 holds the information behind the Facebook bubble that the user hovered over. What the user does not see is the 1–5 data sensitivity rating. The Facebook breach was given a “2,” indicating that Social Security numbers and/or other personal details were exposed. On the front end of the data visualization, this rating is shown by the color of the bubble. Below is the color range, which depends on the severity of the breach.

WEEK 11 — CLASS 22

Thursday, November 12, 2020

Stacie Rohrbach

This week we were assigned a reading from Richard Saul Wurman. In the reading, Wurman argues that although information is infinite, the ways we can organize it are finite. Wurman says that all information can be organized by the LATCH method: L (Location), A (Alphabet), T (Time), C (Category), and H (Hierarchy).

Screenshot from Richard Saul Wurman’s “LATCH: The Five Ultimate Hatracks” page 40

Wurman talks about how designer Charles Eames created an exhibit on Thomas Jefferson and Benjamin Franklin that was shown as a timeline.

The exhibition spanned 120 years of American history (1706–1826), from the American Colonial experience and its European heritage, to the point when the young Nation was able to make its great move westward. It follows the careers of Franklin and Jefferson through the important times during the formulation of the Declaration of Independence, throughout the Revolutionary War, and during the early stages of the Constitutional government. https://www.eamesoffice.com/blog/franklin-jefferson-and-charles-ray/

Source: The office of Charles and Ray Eames, Metropolitan Museum of Art in New York

Cartesian coordinate system

A widely used coordinate system for data visualization is the 2D Cartesian coordinate system, in which each location is uniquely specified by an x and a y value.

“Standard cartesian coordinate system. The horizontal axis is conventionally called x and the vertical axis y. The two axes form a grid with equidistant spacing. Here, both the x and the y grid lines are separated by units of one. The point (2, 1) is located two x units to the right and one y unit above the origin (0, 0). The point (-1, -1) is located one x unit to the left and one y unit below the origin.” Source: https://clauswilke.com/dataviz/coordinate-systems-axes.html

Narrative structure

Data can tell compelling narratives. The narrative-driven piece below was created by Pitch Interactive in 2015.

Since 2004, the US has been practicing in a new kind of clandestine military operation. The justification for using drones to take out enemy targets is appealing because it removes the risk of losing American military, it’s much cheaper than deploying soldiers, it’s politically much easier to maneuver (i.e. flying a drone within Pakistan vs. sending troops) and it keeps the world in the dark about what is actually happening. It takes the conflict out of sight, out of mind. http://drones.pitchinteractive.com/

Questions

What have you gathered from the readings and class activities to date?

It was very helpful to see examples of data visualizations that are working well. It was also helpful to see that data can be presented in a more abstract way than precise bar charts. An example of abstract visualization can be found below. By using a well-known object, like a water bottle, and a familiar geographic location, like New York City, the massive scale is quickly understood. This approach is far more effective than simply stating that 480 billion water bottles were sold in 2018.

The pile visualized below is around 2.4 km high and dwarfs the glittering skyscrapers of the Financial District at the tip of Lower Manhattan. Source: https://graphics.reuters.com/ENVIRONMENT-PLASTIC/0100B275155/index.html

What facets of your data are you considering using in your project and why?

I would like to show that users are giving organizations more and more data, and perhaps have them think twice about who they are giving their data to and why. I would also like to explore the idea that as organizations collect more and more data, the breaches they experience become much more destructive.

What design research question is guiding your project?

I would like to explore the best way to represent “data” in a visual way. Data can range from emails and passwords to SSNs, addresses, and personal health records, so what might be the best way to represent that? I was thinking that I could have one standard 8.5x11 sheet of paper represent 1 record. This might be an interesting visual, especially if I multiply the thickness of a piece of paper by the number of records (30M minimum). I could then stack the paper next to popular landmarks to show scale.
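The arithmetic behind the paper-stack idea can be sketched in a few lines. This is only a rough illustration: the sheet thickness of 0.1 mm is my assumption for typical office paper, and the function name is hypothetical.

```python
# Assumption: one sheet of office paper is roughly 0.1 mm thick.
SHEET_THICKNESS_MM = 0.1


def stack_height_m(records: int, sheet_mm: float = SHEET_THICKNESS_MM) -> float:
    """Height in meters of a paper stack with one sheet per record."""
    return records * sheet_mm / 1000.0


# Example record counts: the 30M figure mentioned above and a 3 billion record breach.
for records in (30_000_000, 3_000_000_000):
    print(f"{records:,} records -> {stack_height_m(records):,.0f} m of paper")
```

Under that assumption, 30M records would already make a stack about 3 km tall, which is why a landmark comparison seems like a workable way to show scale.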

What organization methods do you imagine leveraging in the data (LATCH)?

Regarding the LATCH method, the most useful ways to organize the information I was provided will be by Time, Category, and Hierarchy.

What coordinate system(s) do you see emerging as logical and appropriate?

One coordinate system that I started to develop, below, is showing the number of records lost per year.

What may serve as a logical sequence for people to move through the content (narrative/indexical/combo)?

I would like to use a combination of narrative and indexical. The indexical approach would help show the sheer amount, frequency, and severity of data breaches over time, while the narrative approach will help explain why this is important and why the reader should care.

WEEK 12 — CLASS 23

Tuesday, November 17, 2020

Screenshot from the in-class review of Data Points: Visualization That Means Something by Nathan Yau
Source: https://msucreativecomp.files.wordpress.com/2016/08/data_points.pdf

After reviewing the data set I was given, I found some holes I would like to address. For example, when I organize the data by year and volume of breach, it looks as though there were no data breaches in 2010.

Figure from the revised data set showing the number of records lost by the year

A quick Google search suggests otherwise (see “Top 10 Data Breaches of 2010”). That being said, I’m sure there are other holes in the data set, and I will have to add the missing data. Also, the chart above makes 2020 look like a slow year for data breaches, but this is misleading because the last entry for 2020 is from May. I will need to add the current 2020 data.
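A check like the one described above, looking for years with no entries at all, could be sketched as follows. The rows and field names here are hypothetical placeholders, not the real data set.

```python
def missing_years(rows, start, end):
    """Return the years in [start, end] that have no breach rows at all."""
    seen = {row["year"] for row in rows}
    return [year for year in range(start, end + 1) if year not in seen]


# Hypothetical sample rows for illustration only.
sample = [
    {"entity": "Org A", "year": 2008, "records_lost": 1_000_000},
    {"entity": "Org B", "year": 2009, "records_lost": 250_000},
    {"entity": "Org C", "year": 2011, "records_lost": 40_000},
]

print(missing_years(sample, 2008, 2011))  # prints [2010]
```

An empty year is only a hint that the data may be incomplete, so each gap would still need to be confirmed against outside reporting.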

Updated data set using color-coding. Entities are shown in red (on the left) if they are repeat offenders.

Another interesting angle I was looking at was the method in which the data was taken. Not surprisingly, the most frequent method of a data breach is hacking. This is followed by poor security, lost device, inside job, and then by accident (“oops”).

Figure from the revised data set showing the frequency of the method used to expose the data
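The tally behind a frequency figure like this one can be sketched with a simple counter. The rows below are hypothetical stand-ins for the data set, and the field name `method` is my assumption.

```python
from collections import Counter


def method_frequency(rows):
    """Count breaches per method, most frequent first."""
    return Counter(row["method"] for row in rows).most_common()


# Hypothetical sample rows for illustration only.
sample = [
    {"entity": "Org A", "method": "hacked"},
    {"entity": "Org B", "method": "hacked"},
    {"entity": "Org C", "method": "hacked"},
    {"entity": "Org D", "method": "poor security"},
    {"entity": "Org E", "method": "poor security"},
    {"entity": "Org F", "method": "lost device"},
]

for method, count in method_frequency(sample):
    print(f"{method}: {count}")
```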

I also thought it might be interesting to see the largest data breaches to date and see if there were any similarities in the organizations that were hacked.

The data set indicates that data breaches are on the rise, and external sources confirm this trend. According to a report by the nonprofit Identity Theft Resource Center, data breaches rose 17% in 2019 compared with a year earlier, bringing the number to 1,473 (MarketWatch, 2019). According to the same report, banking, credit, and financial-sector breaches in 2019 accounted for 61% of the records exposed, despite accounting for only 8% of breaches that year.

Figure from the revised data set showing the number of records exposed by the organization

Above I have listed the 15 largest data breaches from 2008–2020. The largest breach is from Yahoo, which admitted in October 2017 that all of its 3 billion user accounts had been compromised (Reuters, 2017). The breach happened when one Yahoo employee clicked on a spear-phishing email linked to hackers aligned with the Russian state security service. The “smallest” breach listed was the 134 million records exposed when Heartland Payment Systems was hacked. Heartland later paid out $145 million in compensation for fraudulent payments due to the hack (CSO, 2017).

WEEK 12 — CLASS 24

Thursday, November 19, 2020

I started to work on the narrative structure and on how I might represent severity and amount in an abstract form. I first created a 1–5 color-coding scale to show the levels of severity.

DATA SENSITIVITY SCALE

  1. Just email address/Online information
  2. SSN/Personal details
  3. Credit card information
  4. Health & other personal records
  5. Full details

I then started to work on how I could represent the “amount” variations. All records lost start at 30K and grow to over 3M. At first, I thought I could show the amount size by line width as shown below.

After sketching this out, I thought that this visualization wasn’t working that well and was not as clear as it could be.

For the second draft above, I decided to show the amount by the size of the circle. There are 5 different circle sizes, each indicating an amount range.

DATA AMOUNT SCALE

  1. 30K — 100K
  2. 100K — 500K
  3. 500K — 1M
  4. 1M — 3M
  5. 3M +
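The scale above can be expressed as a small lookup. This is a sketch under one assumption: the scale does not say which level a boundary value such as exactly 100K belongs to, so I have assumed that each boundary belongs to the higher level. The function name is hypothetical.

```python
from bisect import bisect_right

# Lower bounds of levels 2 through 5 (assumption: boundaries belong to the higher level).
_LEVEL_LOWER_BOUNDS = [100_000, 500_000, 1_000_000, 3_000_000]
_MINIMUM_RECORDS = 30_000  # the scale starts at 30K records


def amount_level(records_lost: int) -> int:
    """Map a number of records lost to the 1-5 data amount scale."""
    if records_lost < _MINIMUM_RECORDS:
        raise ValueError("the scale starts at 30K records")
    return bisect_right(_LEVEL_LOWER_BOUNDS, records_lost) + 1


print(amount_level(45_000))       # level 1
print(amount_level(750_000))      # level 3
print(amount_level(134_000_000))  # level 5
```

Each level would then pick one of the five circle sizes.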

I also wanted to add an aural component. When the user hovers over each circle they will hear an alarm that gets louder depending on the severity of the data lost.

Answer on medium: How are you layering information and defining points of entry? What is informing your decisions?

Information layering

Below you will see how I'm layering the information. At first, the user will only see a range of color-coded circles organized by year. These circles range in size, depending on the number (amount) of records lost. The number (amount) of records is also shown in the middle of the circle.

As the user hovers over the circles, they will produce a siren sound effect ranging from soft to loud (on a spectrum from 20% to 100%) depending on the severity of the data lost. The circle will also pulse during the hovering stage.

The user will then be able to click the circle to review the source organization of the data breach. The user can then navigate to the news article featuring more information about the data breach.

SOUND EFFECT BASED ON DATA SENSITIVITY SCALE

  1. Just email address/Online information (20%)
  2. SSN/Personal details (40%)
  3. Credit card information (60%)
  4. Health & other personal records (80%)
  5. Full details (100%)
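The mapping from sensitivity level to siren volume listed above is a simple linear step of 20% per level. A minimal sketch, with a hypothetical function name and the volume expressed as a fraction between 0 and 1:

```python
SENSITIVITY_LABELS = {
    1: "Just email address/Online information",
    2: "SSN/Personal details",
    3: "Credit card information",
    4: "Health & other personal records",
    5: "Full details",
}


def siren_volume(level: int) -> float:
    """Return the siren volume (0.2 to 1.0) for a sensitivity level of 1 to 5."""
    if level not in SENSITIVITY_LABELS:
        raise ValueError("sensitivity level must be between 1 and 5")
    return level / 5


for level, label in SENSITIVITY_LABELS.items():
    print(f"{level}. {label}: {siren_volume(level):.0%}")
```

In a web prototype, the returned fraction could be passed to whatever volume control the audio player exposes.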

Sound range example 1

Sound example 1. Exploration of using sound to show levels of impact

Sound range example 2

Sound example 2. Exploration of using sound to show levels of impact

I also revised the data by updating the color scheme based on the severity color scale. I also hid unimportant columns.

WEEK 13 — CLASS 25

Tuesday, November 24, 2020

For most consumers, it’s hard to understand what data is and why it’s important to protect it. For this project, I would like to present the data in a relatable, engaging, and informative way.

To do this, my plan is to personify the data and use the metaphor of a “data monster” to show the different levels of data and why some could be more dangerous than others.

The idea is to show 5 levels of monsters, each increasing in intensity as we move up the severity scale.

Data Monsters

Each monster will grow or shrink in size depending on the number of records lost.

The more personally identifiable information you feed the monster, the bigger and more dangerous it will become. If the data monster breaks free, it could wreak havoc and never be fully contained again.

Using a data visualization tool developed by Datawrapper GmbH (https://www.datawrapper.de/charts), I was able to take my entire data set and format it in a meaningful and interactive way.

The user can search for certain companies or organizations to reveal repeat offenders. They can also sort the data by the entity, records lost, year, and data sensitivity.

Interactive data display showing data breaches over time

I was also able to build the scatterplot below using the same platform. In this visualization, the scale of the circles indicates the sensitivity of the data on a 1–5 scale, with 5 being the most sensitive. The data is plotted on a logarithmic scale to accommodate the Apollo data breach, which lost 9,000,000,000 records. Unfortunately, there was no way to color-code the scale.
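To see why the log scale matters here, the sketch below compares where a breach would sit along an axis running from 30K records (the smallest breaches in the data set) to the 9B-record outlier. The function names are hypothetical, and this is not how Datawrapper computes positions internally; it only illustrates the general idea.

```python
import math

AXIS_MIN = 30_000         # smallest breaches in the data set
AXIS_MAX = 9_000_000_000  # the 9B-record outlier


def linear_position(records: int) -> float:
    """Position from 0 to 1 along a linear axis."""
    return (records - AXIS_MIN) / (AXIS_MAX - AXIS_MIN)


def log_position(records: int) -> float:
    """Position from 0 to 1 along a base-10 logarithmic axis."""
    span = math.log10(AXIS_MAX) - math.log10(AXIS_MIN)
    return (math.log10(records) - math.log10(AXIS_MIN)) / span


for records in (30_000, 3_000_000, 134_000_000, 9_000_000_000):
    print(
        f"{records:>13,}  linear={linear_position(records):.4f}  "
        f"log={log_position(records):.4f}"
    )
```

On the linear axis, a 3M-record breach lands almost on top of the 30K breaches, while the log axis spreads the smaller breaches across the chart.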

I was however able to code the tooltip hover state. As the user hovers over the data, more information is revealed. In addition, when the user clicks the circle, the element is “sticky.” This allows the user to interact with the tooltip content.

I also created a chart that uses the size of the circles to represent the number of records lost.

WEEK 14 — CLASS 27

Tuesday, December 1, 2020

Today we had our interim presentation with special guests Adrian Galvin, Senior User Interface Designer at NASA Jet Propulsion Laboratory, and Angela Norwood, Associate Professor at York University.

Adrian suggested that I look at my color scale and check to see if it’s accessible. He offered two resources that I will look into.

Viridis palette

According to the resource, the Viridis palette was first developed “for the python package matplotlib, and was implemented in R later.”

Color design for cartography

A number of color palettes have been created to account for color blindness, with the intention of being more accessible. Below is one color palette that I would be interested in using. It was created by IBM and can be found in the IBM Design Library.

Proposed new color blind safe color palette
Proposed new color blind safe color palette coupled with the scale

Angela suggested that I consider what, if anything, I’m trying to say with the overlapping transparent circles. Ideally, the circles would be color-coded and not overlap in the final design.

Another comment was about rounding the numbers to make the data visualization more understandable. Stacie also suggested that I embrace the outliers. In this data set, the outlier is a breach with 9B records lost, which, when presented without a log scale, makes it difficult to show the entire data visualization at a readable scale.

Updated data monsters using the new color palette

Below is the updated data visualization showing the new color-coding scale along with a more detailed key.

The data shows the world’s largest data breaches and hacks from 2004 to December 2020. The data can be sorted by entity breached, records lost, year breached, and data sensitivity on a scale of 1–5, 5 being the most sensitive records.
Graphic style and mood for the narrative piece

The goal is to bring our data visualization together with our Seminar One course with Molly Wright Steenson. In Molly’s course, we talked about design justice, race & technology, metaphors, data, values & ethics, and more.

For our final Seminar One paper, we are to create a narrative that frames, explains, and tells the story of what we’ve explored in our data visualization. We were asked to “think of it like a piece of data journalism that brings your story to life.”

Using the metaphor of a data monster, I want to inform consumers about the dangers of data breaches and how they might reconsider who they give access to their personal information.

My draft of the story can be found here: https://preview.shorthand.com/Rv1QDLbC8YyBBvdS

WEEK 14 — CLASS 28

Thursday, December 3, 2020

Given the limitations of the data tools I have been using, I wanted to lay out some concepts using Adobe Illustrator. The concept below is meant to show both data sensitivity levels and the number of records lost. The worst possible combination would be the very top right sector (Level 5, with over 3M records lost), while the best possible scenario for a data breach would be the bottom left sector (Level 1, with a low number of records lost).
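The grid logic described above can be sketched as a small function that places a breach in a 5x5 grid. Which scale runs along which axis is my assumption (amount level on the horizontal axis, sensitivity level on the vertical axis), and the names are hypothetical.

```python
GRID_SIZE = 5  # both scales run from level 1 to level 5


def grid_cell(sensitivity_level: int, amount_level: int) -> tuple[int, int]:
    """Return (column, row), counted from 0 at the bottom left of the grid.

    Assumption: amount level sets the column, sensitivity level sets the row.
    """
    for level in (sensitivity_level, amount_level):
        if not 1 <= level <= GRID_SIZE:
            raise ValueError("levels must be between 1 and 5")
    return (amount_level - 1, sensitivity_level - 1)


def describe(cell: tuple[int, int]) -> str:
    """Label the two extreme corners of the grid."""
    if cell == (GRID_SIZE - 1, GRID_SIZE - 1):
        return "top right: worst case"
    if cell == (0, 0):
        return "bottom left: best case"
    return "somewhere in between"


print(describe(grid_cell(sensitivity_level=5, amount_level=5)))
print(describe(grid_cell(sensitivity_level=1, amount_level=1)))
print(describe(grid_cell(sensitivity_level=2, amount_level=4)))
```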

The first draft of the data visualization
An interactive chart that allows the user to navigate the information by year. This will enable the user to see trends over time. The user can then hover over key points and learn more about each breach.
This is the key to the data visualization. The data sensitivity color scale, coupled with the data breach amount scale, gives each data breach a “target” icon that helps explain its size and sensitivity.

WEEK 15 — CLASS 29

Tuesday, December 8, 2020

After showing the color scale and graphics to a few classmates and Stacie, it has become clear that Level 5 might be easier to understand if it were changed to red. I have updated the color palette for the scale, and I have also introduced black, white, grey, and green to support the other content in the data visualization.

Updated color palette
Sub colors

WEEK 15 — CLASS 30

Thursday, December 10, 2020

I worked on creating phases of exploration using toggle switches throughout the data visualization. Users can then explore different pathways and compare and contrast information as they move around the piece. The user can then “turn on” different years to see the data expand upwards.

FINAL PRESENTATION

Thursday, December 17, 2020

What I learned is that the number of records lost isn’t always the most important story. I created “data monsters” to help folks better understand their relationship with their data.

Questions I am investigating are:

  • How many people are affected by data breaches and why should we care?
  • How sensitive was the data lost?
  • Is it happening more or less frequently?
  • Are we losing more or fewer records over time?
  • Who has our data and why do they have it in the first place?
  • Why should we care?

The overall goal of this project is to empower consumers to understand data breaches, help them own their data, and most importantly hold those who are trusted with the data more accountable for its protection.

Final thoughts

I really enjoyed working on this project. I learned a lot about presenting data in a usable way. I also learned that it’s very easy to present misleading data, and that the designer has an obligation to present the data in an honest and objective way if the project calls for that impartiality. For example, it would have been very convenient for me to remove the outlier data point, but I worked to include it in order to maintain the integrity of the data visualization.

Lastly, I really appreciated all of the feedback from my classmates and Molly and Stacie. This feedback made all of the difference as I moved through different concepts. Thank you!

Final Presentation (PDF version)

Final Story (Shorthand link)
