Saturday 3 October 2009

More on histograms

Someone asked me whether it would be possible to make a histogram that looks like those that MS Office can create: bars filled with a gradient, casting a shadow, and labelled with their values. Frankly, I just couldn't admit defeat, and I had to figure out a way. But then, it turned out to be rather simple, so I thought that I would share it here. We are going to make this figure




from this data, called 'msbar.dat'
Max 1.0 1.0 1.2
Min 0.9 0.2 0.95
Avg 0.95 0.5 1.1


I have already discussed a way of beefing up histograms (it was titled Shiny histograms, or something similar), but I must say, that method is a bit convoluted. The reason is that we used a parametric plot. We can avoid that by plotting to a file first, and then plotting that file as many times as needed. We cannot avoid reading our data file, but that is rather simple. We can do it either with a very primitive external script, or by writing the script in gnuplot, as we discussed in, say, the post on the recession graph. Given that we do not have to process any data, the latter method is probably less desirable. Gnuplot is not for printing lines and the like, after all.

After the introduction, let us see the script!
reset
# First, the gradient for the bars
set xrange [0:1]; set yrange [0:1]; set isosample 2, 200
set table 'msbar_bar.dat'
splot y
unset table

# Then we make the shadow
set isosample 200, 50
set table 'msbar_bar_sh.dat'
splot (1.0-exp((x-1.0)*20.)) #*(1.0-exp((y-1.0)*20.0))
unset table

reset
unset key; unset colorbox
xw = 0.2; set boxwidth xw; sh = 0.03
gf(x) = x*x*x*x #change this, if a tighter gradient is needed
set cbrange [0:7]
set xrange [-0.5:2.5]
set yrange [0:1.4]
set grid ytics lw 0.5 lc rgb "#868686"
set xtics nomirror
set ylabel 'Performance [a.u.]'
set palette defined (0 '#e0e8f5', 1 '#31bd71', 2 '#e0e8f5', 3 '#d99795', 4 '#e0e8f5', 5 '#9ab5e4', 6 "#ffffff", 7 "#a2a2a2")
plot 'msbar.dat' u 0:(0):xticlabel(1) w l, \
'' u ($0-xw):($2+0.1):(stringcolumn(2)) w labels, \
'' u ($0):($3+0.1):(stringcolumn(3)) w labels, \
'' u ($0+xw):($4+0.1):(stringcolumn(4)) w labels, \
'msbar_bar_sh.dat' u ($1*xw-1.5*xw+sh):($2*1.0-sh):($3+6.0) w ima, \
'' u ($1*xw-.5*xw+sh):($2*1.0-sh):($3+6.0) w ima, \
'' u ($1*xw+.5*xw+sh):($2*1.2-sh):($3+6.0) w ima, \
'' u ($1*xw-1.5*xw+sh+1):($2*0.9-sh):($3+6.0) w ima, \
'' u ($1*xw-.5*xw+sh+1):($2*0.2-sh):($3+6.0) w ima, \
'' u ($1*xw+.5*xw+sh+1):($2*0.95-sh):($3+6.0) w ima, \
'' u ($1*xw-1.5*xw+sh+2):($2*0.95-sh):($3+6.0) w ima, \
'' u ($1*xw-.5*xw+sh+2):($2*0.5-sh):($3+6.0) w ima, \
'' u ($1*xw+.5*xw+sh+2):($2*1.1-sh):($3+6.0) w ima, \
'msbar_bar.dat' u ($1*xw-1.5*xw):($2*1.0):(gf($3)) w ima, \
'' u ($1*xw-.5*xw):($2*1.0):(gf($3)+2.0) w ima, \
'' u ($1*xw+.5*xw):($2*1.2):(gf($3)+4.0) w ima, \
'' u ($1*xw-1.5*xw+1):($2*0.9):(gf($3)) w ima, \
'' u ($1*xw-.5*xw+1):($2*0.2):(gf($3)+2.0) w ima, \
'' u ($1*xw+.5*xw+1):($2*0.95):(gf($3)+4.0) w ima, \
'' u ($1*xw-1.5*xw+2):($2*0.95):(gf($3)) w ima, \
'' u ($1*xw-.5*xw+2):($2*0.5):(gf($3)+2.0) w ima, \
'' u ($1*xw+.5*xw+2):($2*1.1):(gf($3)+4.0) w ima, \
'msbar.dat' u ($0-xw):2 w boxes lt -1 lw 0.5 lc rgb "#4f81bd", \
'' u ($0):3 w boxes lt -1 lw 0.5 lc rgb "#4f81bd", \
'' u ($0+xw):4 w boxes lt -1 lw 0.5 lc rgb "#4f81bd"


If you look at the graph, we have three different gradients, and the shadow. That makes four colour schemes altogether. Since we would like to avoid the difficulties associated with multiplot, we have to cram all those colour schemes into one colour range. More on this a bit later.

So, first we tabulate the gradient that will fill the bars, and then the "surface" that will represent our shadow. Then we set up our figure: we turn off the key, and unset the colour box. We also specify the width of our bars, 'xw', and set the box width accordingly. This latter step is needed because we want the bars to have a border. The shadow shift, 'sh', is also defined here, and so is 'gf', the function that shapes the gradient inside the bars. In the next line, we set our colour range, in this particular case to [0:7]. As we have pointed out, we need 4 colour ranges, and they should not overlap; therefore, we need 3 gaps between these ranges. For the sake of simplicity, we define each gap and each range to have a width of 1. This is why we end up with a total colour range of [0:7]: 4 ranges plus 3 gaps. If you have more or fewer differently coloured bars to plot, you can redefine this range accordingly. The next couple of steps are trivial, up to the definition of the colour schemes. We want only simple gradients; thus, it is enough to define the colours at the end points of each range. If you are unhappy with the colouring of your bars, you should change the colours here.
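The band arithmetic can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in the text (every band and every gap has width 1); the function name colour_bands and its parameters are made up here and are not part of gnuplot.

```python
def colour_bands(n_series, width=1.0, gap=1.0):
    """Lay out one colour band per bar series plus one band for the shadow.

    Returns the overall colour range and the offset at which each band
    starts. The last offset belongs to the shadow.
    """
    n_bands = n_series + 1              # gradients for the bars, plus the shadow
    step = width + gap                  # distance between the starts of two bands
    offsets = [k * step for k in range(n_bands)]
    cb_max = offsets[-1] + width        # the last band ends the colour range
    return (0.0, cb_max), offsets


cbrange, offsets = colour_bands(3)
print(cbrange)   # (0.0, 7.0)
print(offsets)   # [0.0, 2.0, 4.0, 6.0]
```

For the three series of this post, the offsets 0, 2 and 4 are the values added to the bar gradients, and 6 is the value added to the shadow. With a fourth series the same reasoning would give a colour range of [0:9], and the palette would need two more end points.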

Having set up the figure, we can plot the data. First, we plot the labels for the xtics, and the values of the bars. Next comes the shadow. It might be a bit of an overkill for this figure, so you can skip all the 'msbar_bar_sh.dat' lines, i.e., everything up to 'msbar_bar.dat'. Note that we simply go through all the data points in our file, and shift the shadow in each step. The height of the shadow is given by the product of the second column of 'msbar_bar_sh.dat' (which takes values between 0 and 1) and the particular data point in 'msbar.dat'. We also add a small downwards and rightwards shift to the shadows, lest they should be covered by the bars. The most important point, however, is that we plot the shadow as
'' u ($1*xw-.5*xw+sh):($2*1.0-sh):($3+6.0) w ima

i.e., the third column is shifted by 6. We do this so that the shadow is pushed into the [6:7] range, where the shadow's colour scheme is defined. Plotting the bars happens in a similar fashion. The only differences are that we do not add 'sh' to the columns, and that we push the values into the appropriate colour range by adding 0, 2, or 4 to 'gf($3)'. At the very end, we plot the data file once more, this time with boxes, so that the bars can have a border.

If you want to use this plot many times, it might be worthwhile to implement it in a script in the language of your choice. The only thing required is printing lines and numbers. In pseudocode, it would look something like this:
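for i      # bar index inside a group: 0, 1, 2
  for j    # group index, i.e., row of the data file: 0, 1, 2
     d = read(datafile(i,j))
     print "'msbar_bar_sh.dat' u ($1*xw+(%d-1.5)*xw+sh+%d):($2*%f-sh):($3+6.0) w ima, \", i, j, d
     print "'msbar_bar.dat' u ($1*xw+(%d-1.5)*xw+%d):($2*%f):(gf($3)+%d) w ima, \", i, j, d, 2*i
  end
end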
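As one possible concrete version of the idea, here is a minimal Python sketch that prints the image lines of the plot command. The helper names parse_rows and plot_lines are invented for this example, and the data are embedded as a string only to keep the sketch self-contained; in practice one would read 'msbar.dat' instead. It assumes the layout used in this post: one label column followed by the values, bars of width xw centred on the group position, colour offsets of 0, 2, 4, ... for the bars, and the shadow band placed right after the last bar band.

```python
DATA = """\
Max 1.0 1.0 1.2
Min 0.9 0.2 0.95
Avg 0.95 0.5 1.1
"""


def parse_rows(text):
    """Turn 'label v1 v2 ...' lines into a list of (label, [values])."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue                      # skip empty lines and comments
        rows.append((parts[0], [float(v) for v in parts[1:]]))
    return rows


def plot_lines(rows):
    """Return the shadow lines followed by the bar lines, without separators."""
    n = len(rows[0][1])                   # number of bars in each group
    shadow_offset = 2.0 * n               # assumed: shadow band follows the bar bands
    shadows, bars = [], []
    for g, (_label, values) in enumerate(rows):   # g: position of the group on x
        for k, d in enumerate(values):            # k: index of the bar in the group
            shift = k - n / 2.0                   # -1.5, -0.5, 0.5 for three bars
            shadows.append(
                "'msbar_bar_sh.dat' u ($1*xw%+g*xw+sh+%d):($2*%g-sh):($3+%g) w ima"
                % (shift, g, d, shadow_offset))
            bars.append(
                "'msbar_bar.dat' u ($1*xw%+g*xw+%d):($2*%g):(gf($3)+%g) w ima"
                % (shift, g, d, 2.0 * k))
    return shadows + bars                 # shadows first, so the bars cover them


rows = parse_rows(DATA)
for line in plot_lines(rows):
    print(line + ", \\")                  # the 'with boxes' lines follow in the script
```

The printed lines go between the label lines and the final 'with boxes' lines of the plot command. For a data file with a different number of value columns, the colour range and the palette in the gnuplot script have to be extended by hand to match, and xw may need to be reduced so that the bars of one group still fit.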


Now, suppose that you want to add a legend to the figure. The usual way, setting the key, will not work here, for obvious reasons. However, we can easily add one by hand. We just have to find some space for it. Since there is no space left on the figure, we have to make some: we will use multiplot, and set the size of the main figure to be less than 1. Therefore, adding the following lines to our gnuplot script should produce the legend.
set multiplot
set size 0.8, 1
... (The main plot, identical to our original plot)
set origin 0.75, 0
unset border; unset xtics; unset ytics; unset ylabel; unset xlabel
set label 1 at first -0.15, 1.35 "Flatland"
set label 2 at first -0.15, 1.15 "Curvedland"
set label 3 at first -0.15, 0.95 "Land"
plot 'msbar_bar.dat' u ($1*xw-0.4):($2/10.0+1.3):(gf($3)) w ima, \
'msbar_bar.dat' u ($1*xw-0.4):($2/10.0+1.1):(gf($3)+2) w ima, \
'msbar_bar.dat' u ($1*xw-0.4):($2/10.0+0.9):(gf($3)+4) w ima

unset multiplot

The script above results in the figure below:
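If the series names change, the legend commands can be generated in the same spirit as the plot lines. The following Python sketch is only an assumption-laden illustration: the function legend_commands and its parameters are hypothetical, and the positions simply continue the pattern of the legend script (swatches 0.2 apart starting at 1.3, labels 0.05 above the bottom of each swatch, colour offsets 0, 2, 4, ...).

```python
def legend_commands(names, top=1.3, step=0.2, label_x=-0.15):
    """Return gnuplot commands that draw one gradient swatch and label per name."""
    cmds = []
    for k, name in enumerate(names):
        # Label text sits slightly above the bottom edge of its swatch.
        cmds.append('set label %d at first %g, %g "%s"'
                    % (k + 1, label_x, top + 0.05 - k * step, name))
    swatches = [
        "'msbar_bar.dat' u ($1*xw-0.4):($2/10.0%+g):(gf($3)+%d) w ima"
        % (top - k * step, 2 * k)
        for k in range(len(names))
    ]
    cmds.append("plot " + ", \\\n".join(swatches))
    return cmds


for cmd in legend_commands(["Flatland", "Curvedland", "Land"]):
    print(cmd)
```

The output replaces the three 'set label' lines and the plot command of the legend; the multiplot, size, origin and unset lines stay as they are.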

4 comments:

  1. Well, thanks for your effort

    Now, I don't want to imagine how the above script will look if the user has some other data file

    With even more effort, you could do this in assembler also, but you only prove here that Excel rocks, I'm afraid

    Cheers

  2. It rocks !

    Just a bug in the proposed script : you missed the horizontal move in : "($1*xw-(%d-1.5)*xw)" ; it should be something like "($1*xw-(%d-1.5)*xw+%d)" and thus as a whole :

    for i
    for j
    d = read(datafile(i,j))
    print "'msbar_bar_sh.dat' u ($1*xw-(%d-1.5)*xw+sh):($2*%d-sh+%d):($3+6.0) w ima, \", i, j, d
    print "'msbar_bar.dat' u ($1*xw-(%d-1.5)*xw+%d):($2*%d):($3+%d) w ima, \", i, j, d, i
    end
    end

  3. Hi Joker,

    Yes, you are right, the bars should be shifted. Thanks for pointing that out!
    Zoltán

  4. These instructions are for 3 columns (max, min and avg). What should we do if we have another column?
    Thanks in advance.
