Researchers have to work with a lot of statistical data and in most cases need to find measures of central tendency like the mean or median of a data set, measures of dispersion like range and variance, or signs of skewness.
Creating a Box Plot in Microsoft Excel comes in very handy as is makes easy to quickly identify some of these parameters in a visual summary without hassles.
The Box plot is also known as “box and whisker chart” perhaps because of its shape and appearance.
It proffers a cool way to visualize the summary of datasets and gain more insights into how the data is distributed.
This tutorial will take us through what a box plot really is, how to create box plots using Microsoft Excel, as well as how results can be interpreted.
Here’s How to Create a Box Plot in Microsoft Excel
The box plot or “box and whisker” chart consists of two major components and divides numerical data into four parts known as “quartiles”.
The two major components as the nickname implies are the box and the whiskers.
The box spans the first and third quartiles and also has a line in it that defines the median which is also the second quartile.
The box indicates the part of the data distribution that is most concentrated.
The whiskers of the box plot are lines that indicate the minimum and maximum values of the data distribution that are outside the box.
In some cases, there are no whiskers or only one whisker.
Why should you use the Box Plot?
The box plot is very useful especially when you need to visualize the distribution of a given dataset and see the range through which it spans.
It summarizes data using five (5) key indicators – minimum value, maximum value, first quartile, second quartile (median), and third quartile.
It helps you interpret the data distribution easily using five (5) key indicators.
You can find outliers and tell if the data is symmetric or skewed.
Box plots are useful for comparing different data distributions particularly when the datasets are very large.
Dissecting a Box Plot
As mentioned earlier, the box plot contains five key indicators.
Let us take a closer look at the various indicators in a box plot and what each one stands for.
- The tip of the bottom whisker represents the minimum value in the dataset.
- The length of the bottom whisker tells us about the distance between the median and the minimum value. It also helps us see clearly whether the dataset is skewed.
- The height of the box tells us about the spread in the data. This essentially translates to whether there is a large variation in the data points or whether they are more concentrated around the median.
- The bottom and top edges of the box respectively mark the first and third quartiles of the dataset.
- The line around the middle of the box represents the median or second quartile.
- The length of the top wicker tells us how far the maximum value is from the median. This can give us insights into whether the dataset is skewed or if there could be outliers.
- The tip of the top whisker represents the maximum value in the dataset.
The following table contains the number of tourists that visited two different resorts over the period of one year:
Month | Resort A Tourists | Resort B Tourists |
January | 14000 | 24000 |
February | 16000 | 8000 |
March | 28000 | 12000 |
April | 30000 | 33000 |
May | 25000 | 15000 |
June | 20000 | 27000 |
July | 35000 | 19000 |
August | 50000 | 32000 |
September | 32000 | 43000 |
October | 21000 | 25000 |
November | 12000 | 39000 |
December | 53000 | 67000 |
Creating a Box Plot
We now have a general overview of what a box plot is and what makes it useful.
We have also dissected a typical box plot to see the various parts it contains.
Now let’s dive right in and see how to create box plots with real data!
Versions of Microsoft Excel from 2013 and older do not support creating a box plot “out of the box”, so you’ll need to go through a few steps to create one.
Thankfully, Microsoft Excel 2016 and above come with the box plot feature and make it a lot easier to create.
Not to worry, we’ll run through how you can create the box plot in both ways so that you can follow through irrespective of your version of Microsoft Excel.
The following table contains the number of tourists that visited two different resorts over the period of one year:
Month | Resort A Tourists | Resort B Tourists |
January | 14000 | 24000 |
February | 16000 | 8000 |
March | 28000 | 12000 |
April | 30000 | 33000 |
May | 25000 | 15000 |
June | 20000 | 27000 |
July | 35000 | 19000 |
August | 50000 | 32000 |
September | 32000 | 43000 |
October | 21000 | 25000 |
November | 12000 | 39000 |
December | 53000 | 67000 |
We will use the data provided in this table to create box plots in this tutorial.
1. Creating a Box Plot in Newer Versions of Microsoft Excel
For versions of MS Excel from 2016 till date, creating a box plot is quite straightforward since there is already a built-in function for this. Unlike older versions, you do not need to compute quartiles or interquartile differences.
Let’s work through the steps together:
- Select the range of the data without the vertical & horizontal headers (C4 to D15 in this case)
- Click the “Insert” menu from the menu bar and select the icon for “Insert Statistic Chart”
- Select “Box & Whisker”
- A box plot of your data is displayed. Easy peasy!
Note that by default, the box plot computes the quartiles using the exclusive median. To toggle between exclusive median and inclusive median, right-click on any of the box plots and select “format data series” to open the right pane, then select your preferred Quartile calculation mode.
The chart below shows the final box plot with inclusive mean. That’s all there is to that!
2. Creating a Box Plot in Microsoft Excel 2013 & Older
To create a box plot in older versions of Microsoft Excel, you need to improvise by first creating a stacked chart and then converting it to a box plot.
So, we’ll take it one step at a time by doing the following:
- Compute the five (5) key descriptors (as mentioned above).
- Compute the Interquartile differences.
- Create a stacked column chart.
- Convert the stacked chart into the required box plot.
We will now dive into the details of these steps.
- Compute The Five (5) Key Descriptors
First, we compute the five key descriptors (as mentioned earlier) in the order which they appear on the table below:
Descriptors | Formulae |
Min Value | = MIN(range) |
1st Quartile | = QUARTILE.INC(range,1) |
2nd Quartile (Median) | = QUARTILE.INC(range,2) |
3rd Quartile | = QUARTILE.INC(range,3) |
Max Value | = MAX(range) |
- Range in this case is C4:C15 and D4:D15 respectively for the two resorts.
- INC helps us use inclusive median directly such that we do not need to change it later. (Otherwise, we would have used QUARTILE.EXC)
3. Compute The Interquartile Differences
Next, we find the differences between each of the five values we have just computed. These will define the heights of each column segment of the stacked column chart.
4. Create A Stacked Column Chart
Now we will go ahead to create a stacked column chart using the interquartile differences we just computed.
These differences will define the heights of the column segments of the chart which we will convert afterwards to a box plot.
- Select the range of cells containing your Interquartile Differences (Including the column headers) – C30:D35 in this instance, click “Insert” on the menu bar, and select the “Insert Column or Bar Chart” icon as shown in the image.
- Select the “Stacked Column” icon from the 2D Column group.
- A stacked column chart is displayed. Observe that it does not resemble a box plot just yet because we selected our data horizontally, rather than vertically.
- We simply reverse the order by right-clicking on the chart and clicking “Select Data” from the context menu.
- Click “Switch Row/Column” in the dialog that is displayed and click OK.
The stacked column chart is now gradually transforming to a box plot but we’re not there just yet. You can also change the chart title and remove the legend if you so decide.
5. Convert the Stacked Column Chart to a Box Plot
The final step is to transform the stacked column chart we have just created into a real box plot.
Remove the lowest segment of the stacked columns
- Select the lowest segment of any of the stacked columns, and it automatically selects the lowest segment of all other columns. (Hint: Any segment you select in a column auto-selects the corresponding segment in the other columns).
- Click on the “Format” tab and select “Format Selection”. From the “Current Selection” ribbon group.
- This opens the “Format Data Series” sidebar on the right. Select the “Fill & Line” icon, then select “No fill” from the list of radio buttons under the “Fill” category.
- Observe that the lowest segments of the stacked charts disappear, leaving us with just four segments each.
Convert the top segment and second from the bottom to whiskers for the box plot.
- Select the second segment from bottom, for the stacked columns
- Under the “Chart Design” tab, select “Add Chart Element” from the “Chart Layouts” group.
- Select “Error Bars” → “Standard Deviation” from the dropdown that appears.
- This adds error bars to all the stacked columns as shown in the image:
- Select any of the error bars, then click on error bar options on the right sidebar and select “Minus” under Direction, “No cap” under End style, and “Percentage” under Error Amount; then key in 100%.
- Remove the second to bottom segment just like we did to the bottom segment earlier by selecting it and removing the fill color (Click on “No fill” from Format Data Series).4
- Repeat these steps for the top segment of the stacked column chart to also convert it to a whisker and your chart should look like the one in the image below:
Change the fill and outline for the remaining two segments
- Our box plot is ready, but we also need to format it to look better. Select any of the segments and click on “Fill & Line” from the “Format Data Series” sidebar. And change the color to a color of your choice. Also change the border to “Solid line” and select a suitable color.
- Your box plot is ready!
- You can also change individual box plots for various series when you double-click on the segment, right-click, and change the color from the “Fill color” menu.
Now your box plot should look very similar to the one we plotted earlier with the actual box plot functionality:
Some Limitations of the Box Plot
As much as the box plot is very helpful for analyzing large chunks of data, it is imperative that we note a few limitations:
- For bell-shaped or Gaussian distributions, the box plot gives a very good representation; however, it may hide some important parameters for bi-modal or other non-Gaussian distributions.
- Some people like to think that box plots are not very intuitive at first glance especially for people who may not be familiar with things like quartiles and interquartile differences. This category of people may not be able to interpret the box plot correctly.
Conclusion
When you’re dealing with data that is symmetrically distributed around the median, the box plot is very helpful for understanding the distribution and range of the data even if the dataset is very large.
Hence, we can conclude that the advantages of using box plot far outweighs its limitations!