Sunday, 9 May 2010

Ministry of Silly Walks

In a comment last week, someone asked whether it was possible to draw a Marimekko plot, i.e., a histogram in which both directions contain relevant information. In other words, the question is whether we could draw a square, and populate it with rectangles in such a way that the area of the rectangles is read from a file. I thought that it should be possible, but on the way, I also found out a couple of interesting things. If you keep reading, you will see it for yourself, how we can define a vector, whose value is taken from a data file, and how we can manipulate the elements of that vector. In some sense, this is similar to the trick that we made use of, when we generated a parametric plot from a file. But we can do things in a slightly better fashion.

First, we will need a data file, and for the sake of conformity with the question of the commenter, I will just use this
"" France Germany Japan Nauru
Defense 9163 4857 2648 9437
Agriculture 3547 5378 1831 1948
Education 7722 7445 731 9822
Industry 4837 147 3449 6111
"Silly walks" 3441 7297 308 7386

(We can already deduce that with the sole exception of Japan, countries spend a large chunk of their GDP on silly walk.)

Now, our first attempt could be this:

reset

file = 'marimekko.dat'

set style data histograms
set style histogram columnstacked
set style fill solid border -1
set boxwidth 1.0

set xrange [-1:5]
set yrange [0:5e4]

plot newhistogram at 0, file u 2 title col, \
newhistogram at 1, file u 3 title col, \
newhistogram at 2, file u 4 title col, \
newhistogram at 3, file  u 5 title col, \
and it should be quite obvious that this is not what we want:

It just falls short of our expectations in every respect: There is a gap between the columns, the colours are not consistent, and the width of the columns is equal. The only reasonable thing that happened here is that we can actually set the position of the columns. This will become important later on.

Let us try to improve on the figure, step by step. First, we will place the histograms in a multiplot, for that will make life a lot easier: this is our only way of manipulating the column width during the plot. In this spirit, our second script will be this:

reset

file = 'marimekko.dat'

set style data histograms
set style histogram columnstacked
set style fill solid border -1
set boxwidth 1.0

set xrange [-1:5]
set yrange [0:5e4]
set multiplot

plot newhistogram at 0, file u (f($2)) title col
plot newhistogram at 1, file u (f($3)) title col
plot newhistogram at 2, file u (f($4)) title col
set boxwidth 0.3 
plot newhistogram at 2.65, file  u (f($5)) title col
unset multiplot

This is somewhat better, for the colours are now consistent, and we also see that the last column has a different width. We also see how the positioning works: the right hand side of Japan's column is at 2.5, and since the width of Nauru's column is 0.3, its centre has got to be shifted by 0.15 with respect to 2.5. That adds up to 2.65. However, if we watch closely, we will also notice that the ytics and labels are drawn four times; after all, we have four plots. What, if we unset the ytics after the first plot? Well, we would end up with this

Rather upsetting! The problem is that once the tics are unset, the size of the figure changes, so we can no longer count on the plots' proper alignment. However, there is an easy remedy for this: all we have to do is not to unset the ytics, but to set them invisible. That is, we can do
plot newhistogram at 0, file u (f($2)) title col
set ytics ("      ", 30000)
plot newhistogram at 1, file u (f($3)) title col
plot newhistogram at 2, file u (f($4)) title col
set boxwidth 0.3 
plot newhistogram at 2.65, file  u (f($5)) title col
where we have 6 white spaces in the quote. You might wonder why on Earth 6. Well, the answer is that the label "30000" is actually " 30000", which takes up 6 characters' space. With this trick, we get

We have already achieved quite a lot, and slowly, but surely, we are getting to our goal. Just do not despair!

The next thing that we would need is proper scaling of the columns: we want all of them to be between 0 and 100 (%), i.e., we would have to sum all columns first, and then divide the values by the sum. And that is the snag: we have four columns, and we have to do the summing for each column independently, and before the final plots. Otherwise, our multiplot will be messed up. And this is where the array comes in handy: if we just had an array, and could retrieve values from it, we would be saved. And of course, we can do this. Let us take a small detour!

If we think about it, the array (5, 4, 6, 7, 8) is nothing but a finite series: its first element is 5, second element is 4, and so on. But we could also look at the series as a function: a mapping from the set of natural numbers to, well, to anything. In the example above, to natural numbers. It doesn't matter. My point is that an array is a function, a function for which h(0) = 5, h(1) = 4, h(2) = 6, h(3) = 7, and h(4) = 8. As long as this is true, we do not care what h(1.1) is. We need the function's values only at integer numbers. Then the only question is how we could define this function "on the fly". Being a physicist, and a lazy man, I would propose the following:
g(x,a) = (abs(x-a) < 0.1 ? 1 : 0)
h(x) = 5 * g(x,0) + 4 * g(x,1) + 6 * g(x,2) + 7 * g(x,3) + 8 * g(x,4)
g(x,a) is (apart from some numerical factors) nothing but a very primitive representation of a Dirac-delta, centred on 'a'. You can convince yourself that h(x) defined in this way fulfils the requirements above.

After this digression, let us see what we can do with this, and issue the following commands!
ARRAY = "h(x) = 0"
array(x, counter) = ( ARRAY.sprintf(" + %f*g(x,%d)", x/100.0, counter+1) )
ff(x, counter) = (($0 > 0 ? ARRAY = array(x, counter-1) : 1), total = total + x, x)
plot 'marimekko.dat' using 0:(ff($2, 2))
At this point, the variable ARRAY should look something like this
ARRAY = "h(x) = 0 + 91.630000*g(x,2) + 35.470000*g(x,2) + 77.220000*g(x,2) + 48.370000*g(x,2) + 34.410000*g(x,2)"
and if we evaluate it, the function value h(2) returns the sum of the numbers in the second column. (Apart from a factor of 100, of course.) Note that in order to take out the first line, which is the header, we have to use the condition
($0 > 0 ? ARRAY = array(x, counter-1) : 1)
which updates ARRAY only if we are processing the second record, at least. Also note that in order to get the sum of all columns, all we have to do is call this plot as many times as many columns there are. In the light of this, our next script could be this
reset

file = 'marimekko.dat'
col = 4

g(x,a) = (abs(x-a) < 0.1 ? 1 : 0)
ARRAY = "h(x) = 0"
array(x, counter) = ( ARRAY.sprintf(" + %f*g(x,%d)", x/100.0, counter) )
 
ff(x, counter) = (($0 > 0 ? ARRAY = array(x, counter) : 1), x)
plot for [i=2:col+1] 'marimekko.dat' using 0:(ff(column(i), i)) 

set xrange [-1:3]
set yrange [0:110]

eval(ARRAY);

set style data histograms
set style histogram columnstacked
set style fill solid border -1
set boxwidth 1.0

set multiplot
plot newhistogram at 0, file u ($2/h(2)) title col
set ytics ("    " 20)
plot newhistogram at 1, file u ($3/h(3)) title col
plot newhistogram at 2, file u ($4/h(4)) title col
set boxwidth 0.3
plot newhistogram at 2.65, file u ($5/h(5)) title col
unset multiplot
and this results in this figure

So, we are almost there: the columns are rescaled, and placed neatly next to each other. The only missing ingredient is the setting of the widths. But that is really easy: we only have to determine what the grand total is, and then scale the columns accordingly. Our script can, then, be modified as follows
reset
file = 'marimekko.dat'
col = 4

total = 0.0
g(x,a) = (abs(x-a) < 0.1 ? 1 : 0)
ARRAY = "h(x) = 0"
array(x, counter) = ( ARRAY.sprintf(" + %f*g(x,%d)", x/100.0, counter+1) )

ff(x, counter) = (($0 > 0 ? ARRAY = array(x, counter-1) : 1), total = total + x, x)
plot for [i=2:col+1] 'marimekko.dat' using 0:(ff(column(i), i)) 

set xrange [-0.3:1]
set yrange [0:110] 
eval(ARRAY);

set style data histograms
set style histogram columnstacked
set style fill solid border -1   

total = total / 100.0
position = 0.0

set multiplot
set boxwidth h(2)/total
plot newhistogram at position, file u ($2/h(2)) title col

set ytics ("    " 20)
set boxwidth h(3)/total; position = position + (h(2)+h(3))/total/2.0
plot newhistogram at position, file u ($3/h(3)) title col

set boxwidth h(4)/total; position = position + (h(3)+h(4))/total/2.0
plot newhistogram at position, file u ($4/h(4)) title col

set boxwidth h(5)/total; position = position + (h(4)+h(5))/total/2.0
plot newhistogram at position, file u ($5/h(5)) title col
unset multiplot
and this is what we wanted!

Adding labels to the rectangles is relatively easy: we could do the following

plot file using (position):(l($5)):5 with labels tc rgb "#ffffff"
where l(x) is a function that keeps track of the previous values of the column, and adds them as new values are processed. The definition of this function should be trivial.

The last thing that I would add here is that by using macros, we can tidy up the script: we no longer would need all those long and repetitive lines. In fact, we could also add another instruction to our 'ff' function, which would generate the plot command. The advantage of that is that in this way, we do not have to repeat the plot commands four times: we simply put that in our for loop, and then evaluate the resulting string. I discussed this trick in my last post, so, if you are interested in the details, you can look it up there.