// Remove fullscreen button from SageCell.

Excel TUTORIAL for Statistics Applications

Part 2: Data Visualization

Email Vladimir Dobrushkin

The first step in trying to interpret data is often to visualize it in some way. Data visualization can be as simple as creating a summary table, or it could require generating charts to help interpret, analyze, and learn from the data. Data visualization is very helpful for identifying data errors and reducing the size of your data set by highlighting important relationships and trends.

Data Visualization is important in order to interpret data and convey your analysis to others. The first step in trying to interpret data is often to visualize it in some way. Data visualization can be as simple as creating a summary table, or it could require generating charts to help interpret, analyze, and learn from the data. Data visualization is very helpful for identifying data errors and reducing the size of your data set by highlighting important relationships and trends.

Effective Design Techniques

Excel is the most commonly used tool in the business world to show basic data visualization. We can store the data in Excel as two separate vectors via the following command:

One of the most helpful ideas for effective data visualization is the concept of the data-ink ratio that was introduced by the American statistician and professor emeritus of political science, statistics, and computer science at Yale University Edward Tufte (born in 1942), the expert whose work has contributed significantly to designing effective data presentations, in his 1983 book, The Visual Display of Quantitative Data:

\begin{align*} \mbox{Data-ink ratio} &= \frac{{\mbox Data-ink}}{\mbox{Total ink used to print the graphic}} \\ &= \mbox{proportion of a graphic's ink devoted to the } \newline & \qquad \mbox{non-reduntant display of data information} \\ &= 1.0 - \mbox{proportion of a graphic that can be erased}. \end{align*}
Good graphics should include only data-Ink. Non-Data-Ink is to be deleted everywhere where possible. The reason for this is to avoid drawing the attention of viewers of the data presentation to irrelevant elements. The goal is to design a display with the highest possible data-ink ratio (that is, as close to the total of 1.0), without eliminating something that is necessary for effective communication.


A table is a basic unit for data storage. Data in the table is stored in a cell that is an intersection of a vertical column and horizontal row. The table has a specified number of columns, but can have any number of rows. With DataGrip, you can perform data manipulation and data definition operations with tables. Tables are more appropriate when:

  • The user needs to reference specific numerical values.
  • The user wants to compare numerical values and not vague similarities.
  • The values being displayed are unable to be compared using charts.
Here are two examples of tables:

 First Name  Last Name  Sex  Spending  Phone Number
 John  Smith  Male  $ 115.89  (401) 865-6141
 Jack  Richardson  Male  $ 56.24  (401) 865-6142
 Mary  Williams  Female  $ 84.33  (401) 865-6143
 Donald  Tramp  Male  $ 321.58  (401) 865-6144
 Nancy  Pelosi  Female  $ 212.37  (401) 865-6145
 Kevin  McCarthy  Male  $ 247.96  (401) 865-6146
 Liz  Cheney  Female  $ 89.51  (401) 865-6147
  Unit sales of the Apple iPhone Worldwide from 2007 to 2018 (in millions)
 Year  Sales  Year  Sales
 2007  1.39  2013  150.26
 2008  11.63  2014  169.22
 2009  20.73  2015  231.22
 2010  39.99  2016  211.88
 2011  72.29  2017  216.76
 2012  125.05  2018  217.72

In designing an effective table, keep in mind the data-ink ratio and avoid the use of unnecessary ink tables:

  • Avoid using vertical lines because they make tables look cluttered and therefore unreadable.
  • Horizontal lines are generally used to separate the titles of columns.
  • In large tables, vertical lines or light shading can be usedful to help the reader distinguish the columns and rows.
  • Horizontal lines are used for separating column titles from data values or when indicating that a calculation has taken place.
  • Columns of numerical values in a table should be right-aligned.
  • All values should include the same number of digits to the right of the d ecimal.
  • Left-align text values within a column in a table.
  • Column headings should either match the alignment of data in teh columns or be centered over the values.

Crosstabulation

A useful type of table for describing data of two variables is a crosstabulation, which provides a tabular summary of data for two variables.

Consider the following example that shows a survey of what electronic gadget a group of people were likely to buy in the next 6 months dependent on their age.

  Crosstabulation of Age and Purchase of Electronic Gadget (2017)
 Age  Laptop  Phone  Tablet  Digital Camera
 20--24  38%  29%  31%  12%
 25--29  19%  15%  24%  17%
 30--34  23%  19%  11%  27%
 35--39  19%  12%  9%  30%
 above 40  12%  17%  5%  31%

You can see the distinctive connection between the age and the purchase of the electronic gadget.

PivotTables in Excel

A crosstabulation in Microsoft Excel in known as PivotTable. PivotTables in Excel are interactive and they may be used to display statistics or a simple count of numbers. So follow the following steps.

  1. Click Insert, click PivotTable in the Tables group
  2. When the Create PivotTable appears: choose Select a Table or Range, select New Worksheet, click OK
  3. In the PivotTable Fields task pane, go to Drag fields between areas below
  4. Select Value Field Settings from the list of options
  5. When the Value Field Settings dialog box appears: Under Summarize value field by, select Count, Click OK.

Example: Consider the data that comes from reviews of 25 Ice Cream Shops in New York.

  Quality Rating for Ice Cream Shops in New York
 Ice Cream  Quality Rating  Price ($)  Wait Time (min)
 1  Average  5  10
 2  Good  7  6
 3  Bad  4  3
 4  Average  6  5
 5  Average  5  15
 6  Excellent  8  8
 7  Good  5  0
 8  Excellent  6  13
 9  Good  6  20
 10  Average  6  12
 11  Excellent  7  5
 12  Good  7  16
 13  Bad  5  21
 14  Average  6  5
 15  Average  5  2
 16  Good  3  10
 17  Good  5  11
 18  Excellent  3  3
 19  Good  7  0
 20  Average  9  13
 21  Bad  4  15
 22  Average  5  11
 23  Excellent  10  6
 24  Good  8  3
 25  Bad  6  2
                    


Usually we deal through a lot of data that may include website performance, sales performance, product adoption, customer service, marketing campaign results ... the list goes on. When you manage multiple content assets, such as social media or a blog, with multiple sources of data, it can get overwhelming. So you need to visualize and analyze the data in such a way that you can quickly extract insights and actionable information. Therefore, you need an efficient tool for this information based on your understanding what types of graphs to use -- and why. It is important to avoid choosing the wrong visual aid or simply defaulting to the most common type of data visualization because it could cause confusion with the viewer or lead to mistaken data interpretation. One of the remedies to this situation is using charts---a great tool for displaying data. There are known the following types of charts:

  • Column Chart is a graphic representation of data. Column charts display vertical bars going across the chart horizontally, with the values axis being displayed on the left side of the chart.
  • Line Graph is also known as a line chart, is a type of chart used to visualize the value of something over time. For example, a finance department may plot the change in the amount of cash the company has on hand over time. The line graph consists of a horizontal x-axis and a vertical y-axis.
  • Bar Graph (also called Bar Chart) is a graphical display of data using bars of different heights.
  • Area Chart displays graphically quantitative data. It is based on the line chart. The area between axis and line are commonly emphasized with colors, textures and hatchings.
  • Dual Axis Chart is a great way to easily illustrate the relationship between two different variables. They illustrate a lot of information with limited space and allow you to discover trends you may have otherwise missed if you’re switching between graphs.
  • Stacked Bar Graph is a chart that uses bars to show comparisons between categories of data, but with ability to break down and compare parts of a whole. Each bar in the chart represents a whole, and segments in the bar represent different parts or categories of that whole.
  • Mekko Chart (sometimes also called marimekko chart) is a two-dimensional stacked chart. In addition to the varying segment heights of a regular stacked chart, a Mekko chart also has varying column widths.
  • Pie Chart is a circular statistical graphic, which is divided into slices to illustrate numerical proportion.
  • Scatter Plot Chart (also called a scatterplot, scatter graph, scatter chart, scattergram, or scatter diagram) is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. If the points are color-coded, one additional variable can be displayed.
  • Bubble Chart is a type of chart that displays three dimensions of data.
  • Waterfall Chart is a form of data visualization that helps in understanding the cumulative effect of sequentially introduced positive or negative values. These intermediate values can either be time based or category based. The waterfall chart is also known as a flying bricks chart or Mario chart due to the apparent suspension of columns (bricks) in mid-air. Often in finance, it will be referred to as a bridge.
  • Funnel Chart is often used to represent stages in a sales process[1] and show the amount of potential revenue for each stage. This type of chart can also be useful in identifying potential problem areas in an organization’s sales processes. A funnel chart is similar to a stacked percent bar chart.
  • Bullet Chart or Bullet Graph is a variation of the Bar Chart type, developed by Stephen Few. Generally, it is used to display progress toward a certain goal.
  • Heat Map is a graphical representation of data where the individual values contained in a matrix are represented as colors. "Heat map" is a newer term but shading matrices have existed for over a century.
A scatter chart is a graphical presentation of the relationship between two quantitative variables.