A histogram is used to provide a visual interpretation of discrete or continuous data. It is represented as a bar graph with bars of different heights. It represents a frequency distribution with the help of bars whose widths represent class intervals or bins and the areas of whom are proportional to the corresponding frequencies. In this article we learn Histogram in Microsoft Power BI.
For instance, we can show students’ ages on a histogram, where a bar would represent number of students in each year.
Uses of Histogram
- To represent frequency distribution of univariate numerical data
- To identify outliers in the data
- To find out how much skewed or uniform our data is
How to Create and visualize Histogram in Microsoft Power BI?
Let us say we have captured some call centre information like caller ID, duration of calls (in seconds) etc. in a ‘Call_Log’ table and we wish to have a count of Caller ID by minutes (bins or buckets) on a histogram. Let us follow the below steps to do so:
Frequency Distribution of data
- For this, we first load this data in our Power BI workbook. To pull out this information, click on ‘Get Data’ -> ‘Excel’:
On doing this, we will see the files in Excel. We can select the required call center dataset or file:
On opening this, we will see a visual like below:
- Now click on ‘Load’ button and the program will load the data in the workbook.
Convert Call Duration from seconds to minutes
- Now we go to ‘Data’ view, and we can see that there are two columns: Caller ID and Call Duration.
The duration of calls is in seconds. We will now convert these into minutes by dividing these by 60.
- To do this, right click on ‘Call Duration’ column and select ‘New Column’:
Rename the column name as ‘Call Duration (Minutes)’ and this column would be computed with a formula as below:
On pressing ‘Enter’, we will see the new call duration column as below:
Create bins or class intervals
- Now click on ‘Report’ view and then click on ‘Modeling’ -> ‘New Column’:
Rename the column as ‘Xaxis’ and compute it with a formula as below:
So we can see the areas in the ‘Fields’ section are: Call Duration, Caller ID, Call Duration (Minutes), Xaxis.
Create Bins using Xaxis
Now let us create bins or class intervals using Xaxis by grouping it.
- For this, click on ellipses across Xaxis in the ‘Fields’ pane, and select ‘New group’ as follows:
On doing this, a dialog box appears as below:
Now we can change the ‘Name’ to ‘Minutes (bins)’ and select the ‘Bin Type’ as ‘Number of bins’ since we want the frequency:
Now we want the size of each class interval or bin to be 2 minutes (0-2 minutes, 2-4 minutes, and so on). So we change the ‘Bin count’ by trial and error to see how the ‘Bin size’ is changing. So let’s say we change the Bin count to 9, and we get the Bin size as 2.14, which is approximately 2:
Now when we click on ‘OK’, we see a new grouped column: ‘Minutes (bins)’ in the ‘Fields’ section:
- Now click on ‘Stacked Column Chart’ button:
Selecting this, we get a column chart as below:
Now we drag ‘Minutes (bins)’ to the ‘Axis’ area, and ‘Caller ID’ t the ‘Values’ area of the ‘Visualizations’ pane:
So we can see that we get a count of Caller ID on a bar graph as follows:
So the bins are ranging from 0-2, 2-4,…………, 16-18 minutes.
If we wish to show labels, we can click on ‘Format’ button in the ‘Visualizations’ pane and switch on the ‘Data Labels’:
So we can see that the maximum number of calls are resolved in 2-4 minutes, with a count of 118. Similarly, we can see number of calls resolved in each bucket or bin.
So we have seen the frequency distribution of ‘Call_Log’ data using histogram.
Identify outliers in the data
Next we see how we can identify outliers in the data using histogram.
Outliers are extreme values or data points that are far (or different) from the other observations.
In order to detect outliers, we use the approach of standard normal distribution. In standard normal distribution, any value that lies outside the area of -3 to +3 standard deviations is considered an outlier since this range is 99.7% of the area.
So let us first convert our histogram to standard normal distribution histogram.
Convert actual values to z scores
We know that a standard normal variate is defined as: Z= (X- µ)/sigma, where X is a normal variate (Minutes from Call_Log table in this case) with mean µ and standard deviation sigma, and Z is a standard normal variate. By z score, we mean that how many standard deviations above or below a raw score is.
To calculate the z scores, we first need to compute the population mean and standard deviation, i.e. µ and sigma.
- For this, click on ‘More options’ ellipses across the ‘Call_Log’ table in the ‘Fields’ pane and select ‘New measure’:
Mean or average can be computed with a formula as below:
Now standard deviation can be computed with a formula as below:
This will give the average and standard deviation computed as a new measure which we can see in the ‘Fields’ pane:
Now when we have the population mean and S.D, we can calculate the z scores.
We have earlier created our histogram using Minute (bins) which were computed using Xaxis. Now we again create a new column as Xaxis2 computed with a formula as below:
So we can see Xaxis2, which is our Z score in the ‘Fields’ pane, and let us now copy our histogram to have its another visual (standardized form) which will have the z scores:
Now let us create bins using Xaxis2 by grouping it.
- For this, click on the ellipses across Xaxis2 in the ‘Fields’ pane, and select ‘New group’ as follows:
Now let all the things be set to default and select the ‘Bin Size’ as 0.5:
Now drag ‘Xaxis2’ to the ‘Axis’ areas in the ‘Visualizations’ pane:
So we get a standardized histogram with data distributed from -1.5 to +3.0, and mean is at 0.
To look at average value of entire plot, we can click on the following button and drag ‘Avg Call Duration’ to the ‘Fields’ area in the ‘Visualizations’ pane:
So we can see that average call duration is 6.58.
In this case, since no value is lying outside -3.0 to =3.0 S.D, so there are no outliers.
Now we manually add some outliers to our dataset in the Excel file to check if we can detect those with the help of this standardized histogram:
Now we save this Excel file and then refresh our Power BI workbook as below:
So we will see that the average call duration has changed, and the new values are reflected in the histogram:
So we can see that these values are after +3 standard deviations. So these are identified as outliers in our dataset.
So the first plot was used to show the frequency distribution of the data, and the second standardized plot is used to identify outliers in the data.
So we have seen how to create histogram in Microsoft Power BI, and use it to show frequency distribution, identify outliers in the data, and check skewness or uniformity in the data.
If you wants to learn more click here...