Saturday, 20 June 2009

Stacking shiny histograms

As I promised last week, we will re-visit the problem of the shiny histograms, and we will stack them. After that discussion, it will be plain sailing, because we have everything at our disposal, we have simply got to re-arrange the columns a bit. Obviously, there are two ways of stacking histograms, and I will give a script for both cases. Thus, we will achieve something like either of these plots. (The data file is the same as that in the post on 14th June.)








The difference of these two plots is just that the second one is normalised, so all values are divided by the sum in a particular group.
Now, as for the script generating the first graph, that reads as this:
reset
f(x, a, b, c) = a*exp(-(x-b)*(x-b)/c/c)                          
A = 0.6; B = 0.5; C = 0.2; D=0.5                                 
unset key; unset colorbox; unset xtics; set pm3d map             
set yrange [0:6.6]                                          
set parametric; set urange [0:1]; set vrange [0:1]               
set xrange [0:5.5]                                          
set multiplot                                                    
set isosamples 100, 100                                          
set palette model HSV functions 2.0/3.0, 0.4-0.4*gray, (4.0+2.0*gray)/6.0
splot 5.5*u, 6.6*v, 2*u+v w pm3d                       
set isosamples 2, 2                                              
set palette model HSV function 1.0, 1.0-f(gray, A, B, C), (1.0+2.0*f(gray, A, B, C))/3.0                                                         
splot D*u+0.5, 0.1.0*v, u w pm3d,\
D*u+1.5, 1.0*v, u w pm3d,\
D*u+2.5, 2.0*v, u w pm3d,\
D*u+3.5, 1.2*v, u w pm3d,\
D*u+4.5, 1.6*v, u w pm3d
unset ytics; unset ylabel
set palette model HSV function 0.75, 1.0-f(gray, A, B, C), (1.0+2.0*f(gray,A, B, C))/3.0
splot D*u+0.5,1.0+2.0*v,u w pm3d,\
D*u+1.5, 1.0+1.1*v, u w pm3d,\
D*u+2.5, 2.0+1.2*v, u w pm3d,\
D*u+3.5, 1.2+2.4*v, u w pm3d,\
D*u+4.5, 1.6+2.1*v, u w pm3d
set label 1 "First" at 0.5, -1.32 rotate by 45
set label 2 "Second" at 1.5, -1.32 rotate by 45
set label 3 "Third" at 2.5, -1.32 rotate by 45
set label 4 "Fourth" at 3.5, -1.32 rotate by 45
set label 5 "Fifth" at 4.5, -1.32 rotate by 45
set palette model HSV function 0.5, 1.0-f(gray, A, B, C), (1.0+2.0*f(gray,A, B, C))/3.0
splot D*u+0.5, 3.0+1.0*v, u w pm3d,\
D*u+1.5, 2.1+3.2*v, u w pm3d,\
D*u+2.5, 3.2+3.4*v, u w pm3d,\
D*u+3.5, 3.6+1.1*v, u w pm3d,\
D*u+4.5, 3.7+1.5*v, u w pm3d
unset multiplot


Most of the script has been analysed in the previous post, so I will point out the differences only. If you recall, when we plotted the bargraphs side by side, we had to shift the 'x' position, so that they didn't overlap. Now, since we want to stack them, we have to shift the 'y' position, and in particular, we have to shift it by a value that renders the bars just on top of each other. So, if you look at the three lines highlighted in blue, you will notice that the multiplier of the dummy variable 'v' is always added to the shift of 'v' in the consecutive plot, i.e., in the first plot, the multiplier of 'v' is 1 and the shift is 0, in the second plot, the multiplier of 'v' is 1.1 and the shift is 1, and in the third plot, the multiplier of 'v' is 3.2, and the shift is 2.1, which happens to be the sum of 1 and 1.1. These were the appropriate shift for the second column in the plot. In the rest of the script, we just have to follow the same recipe, and indeed, this is what happens. (I won't demonstrate this, though:-)

Obviously, one would like to do this automatically, and this can be done with the following gawk script:
#!/bin/bash

gawk  'BEGIN {i=0; max=0}
{
if($0!~/#/) {
label[i] = $1
sum=0
for(j=2;j<=NF;j++) {
      v[i,j-1] = $j
      sum+=$j
}
if(max<sum) max=sum
i++
}
}
END {
print "reset"
print "f(x, a, b, c) = a*exp(-(x-b)*(x-b)/c/c)"
print "A = 0.6; B = 0.5; C = 0.2; D=0.5"
print "unset key; unset colorbox; unset xtics; set pm3d map"
printf "set yrange [0:%f]\n", max
printf "set parametric; set urange [0:1]; set vrange [0:1]\n"
printf "set xrange [0:%f]\n", i+0.5
print "set multiplot"
print "set isosamples 100, 100"
print "set palette model HSV functions 2.0/3.0, 0.4-0.4*gray, (4.0+2.0*gray)/6.0"
printf "splot %f*u, %f*v, 2*u+v w pm3d\n", i+0.5, max
print "set isosamples 2, 2"
for(l=1;l<j-1; l++) {
 if(l==j-2) {
for(k=0;k<i;k++) {
      printf "set label %d \"%s\" at %f, -%f rotate by 45\n", k+1, label[k], k+0.5, max/5.0
}
 }
 if(l==2) print "unset ytics; unset ylabel"
 printf "set palette model HSV function %f, 1.0-f(gray, A, B, C), (1.0+2.0*f(gray, A, B, C))/3.0\n", 1.0-(l-1)/(j-1)
 printf "splot "
 for(k=0;k<i-1;k++) {
      printf "D*u+%f,%f+%f*v,u w pm3d,\\\n", k+0.5, base[k], v[k,l]
      base[k]+=v[k,l]
 }
 printf "D*u+%f,%f+%f*v,u w pm3d\n", i-0.5, base[i-1], v[i-1,l]
 base[i-1]+=v[i-1,l]
}
print "unset multiplot"

}' $1


It is very easy to turn this script into one that puts out normalised stacks. All we have to do is to divide the values by the appropriate sum, and re-set the 'y' range to [0:1.1] or something similar, shift the 'y' position of the labels (remember, it was defined as a function of 'max' above, but max is just 1.1 in the present case), and there is nothing else. This difference is highlighted in blue.

#!/bin/bash

gawk  'BEGIN {i=0; max=0}
{
if($0!~/#/) {
label[i] = $1
sum=0
for(j=2;j<=NF;j++) {
      v[i,j-1] = $j
      sum+=$j
}
if(sum!=0)
   for(j=2;j<=NF;j++) {
         v[i,j-1] /= sum
   }
i++
}
}
END {
print "reset"
print "f(x, a, b, c) = a*exp(-(x-b)*(x-b)/c/c)"
print "A = 0.6; B = 0.5; C = 0.2; D=0.5"
print "unset key; unset colorbox; unset xtics; set pm3d map"
print "set yrange [0:1.1]"
print "set parametric; set urange [0:1]; set vrange [0:1]"
printf "set xrange [0:%f]\n", i+0.5
print "set multiplot"
print "set isosamples 100, 100"
print "set palette model HSV functions 2.0/3.0, 0.4-0.4*gray, (4.0+2.0*gray)/6.0"
printf "splot %f*u, 1.1*v, 2*u+v w pm3d\n", i+0.5
print "set isosamples 2, 2"
for(l=1;l<j-1; l++) {
 if(l==j-2) {
for(k=0;k<i;k++) {
      printf "set label %d \"%s\" at %f, -0.2 rotate by 45\n", k+1, label[k], k+0.5
}
 }
 if(l==2) print "unset ytics; unset ylabel"
 printf "set palette model HSV function %f, 1.0-f(gray, A, B, C), (1.0+2.0*f(gray, A, B, C))/3.0\n", 1.0-(l-1)/(j-1)
 printf "splot "
 for(k=0;k<i-1;k++) {
      printf "D*u+%f,%f+%f*v,u w pm3d,\\\n", k+0.5, base[k], v[k,l]
      base[k]+=v[k,l]
 }
 printf "D*u+%f,%f+%f*v,u w pm3d\n", i-0.5, base[i-1], v[i-1,l]
 base[i-1]+=v[i-1,l]
}
print "unset multiplot"

}' $1

2 comments:

  1. Greetings!!!
    I cannot belieave it, beautiful plots done by gnuplot!!!
    Thanks for making it public. please keep posting
    thanks.
    Arman.

    ReplyDelete
  2. good solutions....
    i have one problem not able to plot histograms for the same data looks like.
    4 20 25
    16 44 41
    32 32 11
    64 22 25
    126 41 51
    256 24 18
    512 43 18

    I am not getting "4" "6" as labels... similar to one problem discussed in ur blog "Jan" "FEB" and lable for each box.... can some one help ....

    ReplyDelete