The 3 Vs of Big Data revisited: Venn diagrams and visualization

This discussion is about visualization. The three Vs of big data (volume, velocity, variety) and the three skills that make a data scientist (hacking, statistics, domain expertise) are typically visualized using a Venn diagram, which represents all 8 possible combinations through set intersections. In the case of big data, I believe (visualization, veracity, value) are more important than (volume, velocity, variety), but that's another issue. It matters here, though, because one of my Vs is visualization, and all these Venn diagrams are visually wrong: the color at the intersection of two sets should be the blend of the colors of the two parent sets, for easy interpretation and easy generalization to 4 or more sets. For instance, if we have three sets A, B, C painted respectively in red, green, blue, the intersection of A and B should be yellow, and the intersection of all three should be white.

Here, I'll discuss how to create better diagrams, and then focus on how to add extra dimensions to an existing chart - including not just visual elements, but sound.

1. Venn diagrams

If you want to represent 3 sets, you need to choose 3 base colors, one per set; the colors of the intersections are then computed automatically using the additive color rule. It makes sense to use red, green, blue as the base colors for two reasons:

• It maximizes the minimum distance between any two of the 8 colors in the Venn diagram, making interpretation quick and easy (we assume the background color is black, which counts as the eighth color)
• It is very easy for the eye to reconstruct the proportion of red, green, blue in any color, making interpretation easier.

Actually, you don't even need to use Venn diagrams when using this color scheme: instead you can use 8 non-overlapping rectangles, with the size of each rectangle representing the number of observations in each set or subset. By contrast, choosing red, green and yellow as the three base colors would be a very bad idea, because the blend of red and green is yellow, which is also the color of the third set.
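The color addition rule described above can be sketched in a few lines of Python. This is only an illustration: the helper names `blend` and `region_colors` are made up for this example, and clipping each channel at 1 is an assumption about how to handle sums that overflow.

```python
from itertools import combinations

def blend(colors):
    """Additive blend: sum each channel, clipped at 1.0 (assumed). No colors gives black."""
    return tuple(min(1.0, sum((c[i] for c in colors), 0.0)) for i in range(3))

def region_colors(base):
    """Map every combination of sets (including none, the background) to its color."""
    names = sorted(base)
    regions = {}
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            regions[subset] = blend([base[n] for n in subset])
    return regions

# Red, green, blue as base colors
rgb = {"A": (1.0, 0.0, 0.0), "B": (0.0, 1.0, 0.0), "C": (0.0, 0.0, 1.0)}
# Red, green, yellow as base colors (the bad choice)
rgy = {"A": (1.0, 0.0, 0.0), "B": (0.0, 1.0, 0.0), "C": (1.0, 1.0, 0.0)}

rgb_regions = region_colors(rgb)
rgy_regions = region_colors(rgy)

print(rgb_regions[("A", "B")])           # color of the intersection of A and B
print(len(set(rgb_regions.values())))    # number of distinct colors with red, green, blue
print(len(set(rgy_regions.values())))    # number of distinct colors with red, green, yellow
```

Comparing the two counts shows how many regions become indistinguishable when the third base color is itself a blend of the other two.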

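If you have 4 sets, and assuming the intensity of each R/G/B component is a number between 0 and 1 (as in the rgb function in the R language), a candidate set of base colors is {(0.5,0,0), (0,0.5,0), (0,0,0.5), (0.5,0.5,0.5)}, corresponding to dark red, dark green, dark blue and grey. Note, however, that under plain additive blending the first three colors sum to (0.5,0.5,0.5), the same grey as the fourth set, so this choice does not fully satisfy the first property above.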
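To check the first property for any candidate set of base colors, one can enumerate the colors of all regions and look for the closest pair. Here is a minimal sketch, assuming plain additive blending clipped at 1 and Euclidean distance in RGB space (the text specifies neither); `closest_regions` is a hypothetical helper, not an existing library function.

```python
from itertools import combinations
from math import dist

def blend(colors):
    """Additive blend: sum each channel, clipped at 1.0 (assumed)."""
    return tuple(min(1.0, sum((c[i] for c in colors), 0.0)) for i in range(3))

def closest_regions(base):
    """Return (distance, region_1, region_2) for the two regions with the closest colors.

    A region is a tuple of set names; the empty tuple is the black background.
    """
    names = sorted(base)
    regions = [
        (subset, blend([base[n] for n in subset]))
        for k in range(len(names) + 1)
        for subset in combinations(names, k)
    ]
    return min(
        (dist(c1, c2), s1, s2)
        for (s1, c1), (s2, c2) in combinations(regions, 2)
    )

three_sets = {"A": (1.0, 0.0, 0.0), "B": (0.0, 1.0, 0.0), "C": (0.0, 0.0, 1.0)}
four_sets = {
    "A": (0.5, 0.0, 0.0),   # dark red
    "B": (0.0, 0.5, 0.0),   # dark green
    "C": (0.0, 0.0, 0.5),   # dark blue
    "D": (0.5, 0.5, 0.5),   # grey
}

print(closest_regions(three_sets))
print(closest_regions(four_sets))
```

A distance of 0 means that two regions cannot be told apart by color alone, so the larger the reported minimum, the easier the diagram is to read.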

For 5 sets or more, it is better to use a table rather than a diagram, although you can find interesting but very intricate (difficult to read) Venn diagrams on Google.

If you are not familiar with how colors blend, do this exercise: in your favorite graphic editor, create a rectangle filled with yellow. Next to it, create another rectangle filled with pixels that alternate between red and green: this second rectangle will appear yellow to your eyes, although maybe not the exact same yellow as the first one. If you fine-tune the brightness of your screen, you might be able to get the two rectangles to display the exact same yellow. This raises an interesting point: the eye can very easily distinguish between two almost identical colors in two adjacent rectangles, but it cannot detect more pronounced differences when the change is not abrupt and instead occurs through a gradient of colors. Every great visualization should exploit features that the eye and brain can process easily, and avoid signals that are too subtle for an average eye/brain to detect quickly.
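If you prefer to generate the two rectangles with a script rather than a graphic editor, here is a minimal sketch using only the Python standard library. It writes a binary PPM image, which many image viewers can open. The checkerboard layout, the image size and the file name are arbitrary choices made for this example.

```python
import os
import tempfile

def make_comparison_image(width=200, height=100):
    """Left half: solid yellow. Right half: red and green pixels in a checkerboard."""
    yellow, red, green = (255, 255, 0), (255, 0, 0), (0, 255, 0)
    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            if x < width // 2:
                row.append(yellow)
            else:
                # Alternate red and green both along and across rows
                row.append(red if (x + y) % 2 == 0 else green)
        rows.append(row)
    return rows

def write_ppm(path, rows):
    """Write rows of (r, g, b) tuples as a binary PPM (P6) file."""
    height, width = len(rows), len(rows[0])
    with open(path, "wb") as f:
        f.write(f"P6\n{width} {height}\n255\n".encode("ascii"))
        for row in rows:
            f.write(bytes(channel for pixel in row for channel in pixel))

pixels = make_comparison_image()
out_path = os.path.join(tempfile.gettempdir(), "yellow_test.ppm")
write_ppm(out_path, pixels)
print(out_path)
```

View the image at its native size, without zooming or smoothing. The right half emits on average only half the red and half the green light of the solid yellow, which is one reason the two yellows may not match.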

Question: Can you look at the colors of all objects in your room, and easily detect the red/green/blue components? It's a skill that can easily be learned. Should decision makers (who spend time understanding visualizations produced by their data science team) learn this skill? It could improve decision making. Also being able to quickly interpret maps with color gradients (in particular the famous rainbow gradient) is a great skill to have, for data insight discovery.

2. Adding dimensions to a chart

Typically, colors are represented by 3 dimensions: intensity in the red, green and blue channels. You could add a metallic aspect and fluorescence as two extra dimensions, but that would make the charts more complex, and I don't think it adds value. Plus, it's difficult to render a metallic color on a computer screen.

For time series, producing a video rather than an image automatically and easily adds the time dimension: see my shooting stars video. In this case, you can add two extra dimensions with sound: volume (e.g. to represent the number of dots at a given time, that is, in a given frame of the video) and frequency (e.g. to represent entropy). But these are summary statistics attached to each frame, and it's probably better to represent them by moving bars in the video rather than by sound. You could also have a video where each time you move the cursor, a different sound (attached to each pixel) is produced, but that is getting too complicated. The best solution in this case is to have two videos or two images showing two different sets of metrics, rather than trying to stuff all the dimensions into just one document.
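For readers who still want to experiment with the sound idea, the mapping from per-frame statistics to volume and frequency could look like the sketch below. Everything specific in it is an assumption made for illustration: the function names, the 220 to 880 Hz range, the linear scaling, the sample rate and the three example frames.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed; kept low to keep the example small)

def frame_tone(n_dots, entropy, max_dots, max_entropy,
               low_hz=220.0, high_hz=880.0):
    """Map two per-frame statistics to (amplitude in [0, 1], frequency in Hz)."""
    amplitude = min(1.0, n_dots / max_dots) if max_dots > 0 else 0.0
    share = min(1.0, entropy / max_entropy) if max_entropy > 0 else 0.0
    frequency = low_hz + share * (high_hz - low_hz)
    return amplitude, frequency

def synthesize(frames, samples_per_frame=800):
    """Return sine-wave samples in [-1, 1], one short tone per video frame.

    frames is a list of (number_of_dots, entropy) pairs, one pair per frame.
    """
    max_dots = max(n for n, _ in frames)
    max_entropy = max(h for _, h in frames)
    samples = []
    for n_dots, entropy in frames:
        amp, freq = frame_tone(n_dots, entropy, max_dots, max_entropy)
        for i in range(samples_per_frame):
            samples.append(amp * math.sin(2 * math.pi * freq * i / SAMPLE_RATE))
    return samples

# Three made-up frames: (number of dots, entropy)
frames = [(10, 0.5), (40, 1.0), (20, 2.0)]
samples = synthesize(frames)
print(len(samples))
```

The samples could then be written to an audio file, for instance with the standard `wave` module. Each tone restarts at phase zero, so audible clicks between frames are likely; a real implementation would need to smooth the transitions.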

For most people, the brain has a hard time quickly processing more than 4 dimensions at once, and this should be kept in mind when producing visualizations. Beyond 5 dimensions, any additional dimension probably makes your visual less and less useful for value extraction, unless you are a real artist!
