A Data Science Central Community

This discussion is about visualization. The three Vs of big data (volume, velocity, variety) and the three skills that make a data scientist (hacking, statistics, domain expertise) are typically visualized with a Venn diagram, whose set intersections represent all 8 possible combinations. For big data, I believe (visualization, veracity, value) matter more than (volume, velocity, variety), but that's another issue. What matters here is that one of my Vs is visualization, and all these Venn diagrams are visually wrong: the color at the intersection of two sets should be the blend of the colors of the two parent sets, for easy interpretation and easy generalization to 4 or more sets. For instance, if three sets A, B, C are painted red, green and blue respectively, the intersection of A and B should be yellow, and the intersection of all three should be white.

Here, I'll discuss how to create better diagrams, and then focus on how to add extra dimensions to an existing chart - including not just visual elements, but sound.

**1. Venn diagrams**

If you want to represent 3 sets, you need to choose 3 base colors, one per set; the colors of the intersections are then computed automatically using the additive color rule. It makes sense to use red, green and blue as the base colors for two reasons:

- It maximizes the minimum distance between any two of the 8 colors in the Venn diagram, making interpretation quick and easy (assuming a black background, which is the eighth color)
- It is very easy for the eye to reconstruct the proportion of red, green, blue in any color, making interpretation easier.
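
The color addition rule can be sketched in a few lines of Python. This is only an illustration under my own assumptions: intensities are numbers between 0 and 1, blending is a per-channel sum clipped at 1, and "distance" is the Euclidean distance in RGB space. The names `BASE`, `blend`, `region_colors` and `min_color_distance` are made up for this example.

```python
from itertools import combinations, product
from math import dist

# Base colors for sets A, B, C as (R, G, B) intensities in [0, 1].
BASE = {"A": (1.0, 0.0, 0.0), "B": (0.0, 1.0, 0.0), "C": (0.0, 0.0, 1.0)}


def blend(members, base=BASE):
    """Additively blend the base colors of the given sets, clipping at 1."""
    return tuple(
        min(1.0, sum((base[m][ch] for m in members), 0.0)) for ch in range(3)
    )


def region_colors(base=BASE):
    """Map every membership combination (including 'no set') to its color."""
    names = sorted(base)
    regions = {}
    for flags in product([False, True], repeat=len(names)):
        members = tuple(n for n, f in zip(names, flags) if f)
        regions[members] = blend(members, base)
    return regions


def min_color_distance(regions):
    """Smallest Euclidean RGB distance between any two region colors."""
    return min(dist(c1, c2) for c1, c2 in combinations(regions.values(), 2))


colors = region_colors()
print(colors[("A", "B")])          # red + green
print(colors[("A", "B", "C")])     # red + green + blue
print(min_color_distance(colors))  # closest pair among the 8 colors
```

The empty tuple `()` is the region outside all three sets, which comes out black, matching the black background assumed above.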

Actually, you don't even need a Venn diagram with this color scheme: you can instead use 8 non-overlapping rectangles, with the size of each rectangle representing the number of observations in the corresponding set or subset. By contrast, choosing red, green and yellow as the three base colors would be a very bad idea, because the intersection of red and green is yellow, which is also the color of the third set.
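
To size the 8 rectangles, you first need the number of observations in each of the 8 disjoint regions. Here is a minimal sketch, assuming observations are hashable values and each set is a plain Python `set`; the function names and the toy data are hypothetical, chosen only for illustration.

```python
from itertools import product


def region_counts(universe, sets):
    """Count observations in each of the 2**k disjoint membership regions.

    `sets` maps a set name to a Python set of observations. The result maps
    a tuple of set names (exactly the sets an observation belongs to) to the
    number of observations with that membership.
    """
    names = sorted(sets)
    counts = {
        tuple(n for n, f in zip(names, flags) if f): 0
        for flags in product([False, True], repeat=len(names))
    }
    for obs in universe:
        key = tuple(n for n in names if obs in sets[n])
        counts[key] += 1
    return counts


def rectangle_widths(counts, total_width=100.0):
    """Widths of equal-height rectangles, proportional to region counts."""
    total = sum(counts.values())
    if total == 0:
        return {region: 0.0 for region in counts}
    return {region: total_width * n / total for region, n in counts.items()}


# Hypothetical toy data: observations are the integers 0..19.
universe = range(20)
sets = {
    "A": {x for x in universe if x % 2 == 0},  # even numbers
    "B": {x for x in universe if x % 3 == 0},  # multiples of 3
    "C": {x for x in universe if x >= 10},     # upper half
}
counts = region_counts(universe, sets)
widths = rectangle_widths(counts)
```

Each key of `widths` identifies one rectangle, which you would then fill with the blended color of the sets named in the key.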

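If you have 4 sets, and assuming the intensity of each R/G/B component is a number between 0 and 1 (as in the rgb function in the R language), a good set of base colors aiming at the first property above is: {(0.5,0,0), (0,0.5,0), (0,0,0.5), (0.5,0.5,0.5)}, corresponding to dark red, dark green, dark blue and grey. One caveat: under pure color addition, the intersection of the first three sets blends to (0.5,0.5,0.5), the same grey as the fourth set on its own, so these two regions must be told apart by their position or by a label.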
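
A quick way to check a candidate palette for 4 sets is to enumerate the blended color of all 16 regions and look for regions that end up with the same color. The sketch below does this for the base colors listed above, assuming blending is a plain per-channel sum; the helper names are mine.

```python
from itertools import combinations, product

# Base colors from the text: dark red, dark green, dark blue, grey.
BASE4 = {
    "A": (0.5, 0.0, 0.0),
    "B": (0.0, 0.5, 0.0),
    "C": (0.0, 0.0, 0.5),
    "D": (0.5, 0.5, 0.5),
}


def blended_regions(base):
    """Map each of the 2**k membership combinations to its summed color."""
    names = sorted(base)
    regions = {}
    for flags in product([False, True], repeat=len(names)):
        members = tuple(n for n, f in zip(names, flags) if f)
        regions[members] = tuple(
            sum((base[m][ch] for m in members), 0.0) for ch in range(3)
        )
    return regions


def collisions(regions):
    """Pairs of distinct regions that receive exactly the same color."""
    return [
        (r1, r2)
        for (r1, c1), (r2, c2) in combinations(regions.items(), 2)
        if c1 == c2
    ]


regions4 = blended_regions(BASE4)
clashes = collisions(regions4)
for r1, r2 in clashes:
    print(r1, "and", r2, "share color", regions4[r1])
```

With these intensities no channel exceeds 1, so no clipping is needed. The same two functions can be reused to test any other set of base colors before drawing the diagram.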

For 5 sets or more, it is better to use a table rather than a diagram, although you can find interesting but very intricate (difficult to read) Venn diagrams on Google.

If you are not familiar with how colors blend, try this exercise: in your favorite graphic editor, create a rectangle filled with yellow. Next to it, create another rectangle filled with pixels that alternate between red and green: this second rectangle will appear yellow to your eyes, though maybe not the exact same yellow as the first. However, if you fine-tune the brightness of your screen, you might be able to get the two rectangles to display the exact same yellow. This raises an interesting point: the eye can very easily distinguish between two almost identical colors in two adjacent rectangles, but it cannot distinguish more pronounced differences if the change is not abrupt and instead occurs via a gradient of colors. Every great visualization should exploit features that the eye and brain can process easily, and avoid signals that are too subtle for an average eye/brain to detect quickly.
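
If you prefer to generate the two rectangles programmatically rather than in a graphic editor, here is a rough sketch in plain Python. It assumes 8-bit channels (0 to 255) and a checkerboard as the alternating pattern, and it serializes to the plain-text PPM format, which many image viewers can open. All function names are my own.

```python
RED = (255, 0, 0)
GREEN = (0, 255, 0)
YELLOW = (255, 255, 0)


def checkerboard(width, height, c1=RED, c2=GREEN):
    """Rows of pixels alternating between c1 and c2 in a checkerboard."""
    return [
        [c1 if (x + y) % 2 == 0 else c2 for x in range(width)]
        for y in range(height)
    ]


def solid(width, height, color=YELLOW):
    """Rows of pixels that all share a single color."""
    return [[color] * width for _ in range(height)]


def mean_color(pixels):
    """Arithmetic mean of each channel over all pixels."""
    n = sum(len(row) for row in pixels)
    return tuple(
        sum(p[ch] for row in pixels for p in row) / n for ch in range(3)
    )


def to_ppm(pixels):
    """Serialize pixels as a plain-text PPM (P3) image."""
    height, width = len(pixels), len(pixels[0])
    lines = ["P3", f"{width} {height}", "255"]
    for row in pixels:
        lines.append(" ".join(f"{r} {g} {b}" for r, g, b in row))
    return "\n".join(lines) + "\n"


dithered = checkerboard(64, 64)
plain = solid(64, 64)
print(mean_color(dithered), mean_color(plain))
```

The channel averages of the alternating rectangle are half those of the solid yellow one, which is one plausible reason the two yellows do not match at first. How the eye actually perceives the mix also depends on the display, which this arithmetic ignores. To view the result, write the string returned by `to_ppm` to a file with a `.ppm` extension.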

**Question**: Can you look at the colors of all objects in your room, and easily detect the red/green/blue components? It's a skill that can easily be learned. Should decision makers (who spend time understanding visualizations produced by their data science team) learn this skill? It could improve decision making. Being able to quickly interpret maps with color gradients (in particular the famous rainbow gradient) is also a great skill to have for discovering insights in data.

**2. Adding dimensions to a chart**

Typically, colors are represented by 3 dimensions: intensity in the red, green and blue channels. You can add a metallic aspect and fluorescence as two extra dimensions, but this makes the charts more complex, and I don't think it adds value. Plus, it's difficult to render a metallic color on a computer screen.

For time series, producing a video rather than an image adds the time dimension automatically and easily: see my shooting stars video. In this case, you can add two extra dimensions with sound: volume (e.g. to represent the number of dots at a given time, that is, in a given frame of the video) and frequency (e.g. to represent entropy). But these are summary statistics attached to each frame, and it's probably better to represent them with moving bars in the video than with sound. You could also have a video where a different sound (attached to each pixel) is produced each time you move the cursor, but this is getting too complicated. The best solution in this case is to have two videos or two images showing two different sets of metrics, rather than trying to stuff all the dimensions into a single document.
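
For readers who still want to experiment with the sound idea, here is a minimal sketch of mapping two per-frame statistics to a tone. Everything in it is an assumption of mine rather than the method used in the video: the linear mapping, the 220 to 880 Hz range, the frame duration, and the names `frame_tone`, `max_count` and `max_entropy`.

```python
import math


def frame_tone(dot_count, entropy, max_count, max_entropy,
               duration=0.04, sample_rate=44100,
               min_freq=220.0, max_freq=880.0):
    """Return audio samples in [-1, 1] for one video frame.

    Volume (amplitude) encodes the number of dots in the frame and
    frequency (pitch) encodes the entropy, both scaled linearly.
    """
    # Scale both statistics to [0, 1]; guard against empty ranges.
    loudness = dot_count / max_count if max_count > 0 else 0.0
    share = entropy / max_entropy if max_entropy > 0 else 0.0
    loudness = max(0.0, min(1.0, loudness))
    share = max(0.0, min(1.0, share))

    freq = min_freq + share * (max_freq - min_freq)
    n_samples = round(duration * sample_rate)
    return [
        loudness * math.sin(2 * math.pi * freq * t / sample_rate)
        for t in range(n_samples)
    ]


# Hypothetical frame: 50 dots out of at most 100, entropy 1.0 out of 2.0.
samples = frame_tone(50, 1.0, max_count=100, max_entropy=2.0,
                     duration=0.5, sample_rate=8000)
```

The samples for successive frames can be concatenated, scaled to 16-bit integers and written out with Python's standard `wave` module. Because each frame starts its sine wave from zero phase, audible clicks between frames are possible; smoothing them is left out of this sketch.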

For most people, the brain has a hard time quickly processing more than 4 dimensions at once, and this should be kept in mind when producing visualizations. Beyond 5 dimensions, any additional dimension probably makes your visual less and less useful for value extraction, unless you are a real artist!

**Related articles**

- Shooting stars
- Visualization through videos, using open source tools
- Internet Topology - Massive and Amazing Graphs
- Simple solutions to make videos with R
- 3-D Visualizations with rotating charts, for small and big data
- Great graphic diagrams
- Two more interesting graphs
- 14 questions about data visualization tools
- The top 20 data visualisation tools
- Another cute graph
- 5 books on data visualization
- Registered meteorites that has impacted on Earth visualized
- When data flows faster than it can be processed
- Big Data Ecosystem
- From chaos to clusters - statistical modeling without models
- Big Data Analytics Ecosystem
- The 3Vs that define Big Data
- 53.5 billion clicks dataset available for benchmarking and testing
- The curse of big data
- How to detect a pattern? Problem and solution

© 2019 BigDataNews.com is a subsidiary of DataScienceCentral LLC and not affiliated with Systap
