Barplot Madness

Maybe it is because of the change in fields and scientific environment, but I increasingly doubt data presented in talks, posters and articles. Mostly because I find myself wondering whether the choice of statistical description of the data makes any sense. I am not speaking of advanced statistics at all, just the simplest descriptive techniques and the choice of the right plot for it.

figure 1. X represents spike rates, categories cell types A and B respectively. Whisker indicates standard deviation.

My doubts are usually raised when I see a bar plot as sketched in figure 1. What looks like a rather okay representation of data actually does not really give you an intuitive idea of how the data are distributed.

A barplot is most useful to represent counts in histograms or percentages. The bar is supposed to indicate that there would be data points spanning the whole range from the top of the bar down to the x-axis (figure 2). Standard deviations on a barplot would indicate that one measured the whole thing several times and there is some wiggle in the overall sum of each measurement. An example would be counting the number of labeled neurons in the same brain area of several animals and showing the cell counts. Here, the relation between the sizes of the bars directly translates into the relation between the counts.

figure 2. Barplots indicate that the whole range of values can be found in your data.

However, the truth behind a plot might actually look more like I sketched in figure 3, especially when it comes to the comparison between the spike rates of two populations of neurons. A population of neurons in each category gathers around a certain value, while most of the value range is not representative of the data set. People actually are aware that this is a major flaw. Still, they keep making bar plots for traditional reasons. A somewhat half-baked attempt to rescue the presentation involves plotting the actual data into the bar plot, as I did in figure 3. This trend in bar plot presentation is what actually made me aware of how badly such plots describe the distribution of the data.

I suggest a pretty simple solution: use whisker plots (figure 4). Here, the mean value is indicated by a horizontal line and whiskers indicate the standard deviation above and below the mean. In contrast to the bar plot, there is no graphical element that envelops a data range that does not exist in your data set. Instead, the graphical elements envelop the data relatively tightly and the difference between the distributions is more visible.

figure 3. The real distribution is usually quite different from what a bar plot indicates.

A more important issue arises when I find data that look like what I sketched in figure 5. I have seen this on WAY too many occasions. Let us think about it: you pool the data from cell type A and describe it (obviously blindly) using the mean and standard deviation. Then you look at your data and… neither of these values is represented in your data set.

Just to make clear what I mean: the mean and standard deviation are supposed to show where the main body of your data is to be found and how it is distributed. If they are nowhere near your data points, the description is plain wrong.
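To put numbers on this, here is a minimal sketch in Python (standard library only). The six spike rates are made up for illustration and are not taken from any figure in this post; they simply form two clusters, as in the pooled cell type A case.

```python
import statistics

# Hypothetical spike rates (Hz) for pooled "cell type A".
# Invented values: three slow cells and three fast cells.
rates_a = [4.0, 5.0, 6.0, 24.0, 25.0, 26.0]

mean_a = statistics.mean(rates_a)
sd_a = statistics.stdev(rates_a)  # sample standard deviation (n - 1)

# How far away from the mean is the closest actual measurement?
nearest_gap = min(abs(r - mean_a) for r in rates_a)

print(f"mean = {mean_a:.1f} Hz, SD = {sd_a:.1f} Hz")
print(f"closest data point lies {nearest_gap:.1f} Hz away from the mean")
```

With these invented values the mean lands at 15 Hz, right in the empty gap between the two clusters, and no cell fires within 9 Hz of it. The mean describes a neuron that was never recorded.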

figure 4. Whisker plots describe the data much better than bar plots.

The dataset actually shows a bimodal distribution. One cluster is pretty much identical to the neurons from category B, the other one is clearly higher. I have come across exactly this kind of comparison very often, where one subpopulation looks like the other category while another is clearly different. But by bluntly pooling the data, the difference between categories is underestimated. Of course, if your six cells are now distributed over two categories, you need more data, either to see if you just missed cells that have intermediate firing or to have enough data to show better separation. But that’s how it is. There is no use in describing data you already have in a bad way, just to get your ‘n’. It is better to say ‘this is how it looks now, we are increasing the statistical power of our set’ than to play ignorant. People. See. That.

figure 5. How to mess up your statistics and significance.

Messing up the impression one has of your data also comes from using the mean and standard deviation to describe your data in the first place. These measures are meant to describe data that are normally distributed. But real electrophysiological data (and behavioral data, too) are usually not normally distributed. In figure 7 I show data that are not normally distributed, but also not stretched out enough to clearly define outliers to exclude (this is usually the case). The left plot shows the use of mean and standard deviation. It is obvious that the mean overestimates where the main body of your data is. The standard deviation is pretty far off, too. In more extreme examples the standard deviation can stretch far beyond your dataset, sometimes into impossible value ranges. This is because this kind of descriptive statistic needs a symmetric distribution.
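A small sketch of that last point, again in plain Python. The eleven spike rates are hypothetical, chosen only to be right-skewed; they are not the data behind figure 7.

```python
import statistics

# Hypothetical right-skewed spike rates (Hz). Rates cannot be negative.
rates = [0.5, 1.0, 1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 9.0, 14.0, 18.0]

mean = statistics.mean(rates)
sd = statistics.stdev(rates)  # sample standard deviation (n - 1)
median = statistics.median(rates)

# Lower end of a "mean minus one SD" whisker
lower_whisker = mean - sd

# How many cells fire below the mean?
n_below_mean = sum(r < mean for r in rates)

print(f"mean = {mean:.2f} Hz, median = {median:.2f} Hz, SD = {sd:.2f} Hz")
print(f"lower whisker ends at {lower_whisker:.2f} Hz")
print(f"{n_below_mean} of {len(rates)} cells fire below the mean")
```

For these made-up numbers the lower whisker ends below zero, a spike rate that cannot exist, and 8 of the 11 cells fire below the mean. The three fast cells drag the mean away from where most of the data sit.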

figure 6. The data actually show significantly different sub-populations of category A. Time to make new categories. Here, As and Af stand for slow and fast spiking neurons of category A. The slow ones are actually not different from B!

A set of data points that is not normally distributed is much better described by the median and quartiles, presented in a box-whisker plot like in figure 7. The median is, simply put, the center value. That is, you sort your data by value and choose the one that is positioned right in the center (for instance the 6th data point in a set of 11). The box then envelops the main body of your data. This is achieved by drawing the lower side of the box at the first quartile (25% of your points lie below this value) and the upper side of the box at the third quartile (75% of the data lie below this line). The whiskers extend to the minimum and maximum values respectively.

I hope it becomes obvious that the median and quartiles actually describe the data better. If you look into it, and you use Matlab, you will find that the boxplot() function uses the median and quartiles. It also includes an option (‘notches’) that makes it easy to see whether two medians are significantly different: the notches mark a comparison interval around each median, and two medians differ at the 5% significance level if their notches do not overlap. Notches that extend outside the box are a sign that your sample is too small.
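If you do not use Matlab, the numbers behind such a box are easy to compute yourself. Below is a minimal sketch using Python's `statistics` module. The function name `box_summary` and the eleven values are my own illustration, and the `"inclusive"` quartile method is just one of several conventions, so other tools may give slightly different box edges for the same data.

```python
import statistics

def box_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a box-whisker plot.

    Whiskers go to the minimum and maximum, as described in the text.
    Quartiles use the 'inclusive' method, which is one convention of several.
    """
    data = sorted(values)
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]

# Hypothetical right-skewed spike rates (Hz), 11 cells.
rates = [0.5, 1.0, 1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 9.0, 14.0, 18.0]

low, q1, med, q3, high = box_summary(rates)

# With 11 values the median is the 6th value after sorting.
sixth_value = sorted(rates)[5]

print(f"whiskers: {low} to {high} Hz")
print(f"box: {q1} to {q3} Hz, median {med} Hz")
```

Here the median equals the 6th sorted value (2.0 Hz), and the box covers the range where the bulk of the cells sit, no matter how far the few fast cells stretch the upper whisker.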

figure 7. Data that are not normally distributed are better described by using the median and quartiles instead of mean and standard deviation.

I hope this helped. Or maybe there is a good reason to plot bars and use mean values for data sets that are not normally distributed? Please comment!