Thursday, 3 December 2009

Defining some new plot styles

I have long wanted to write a blog post on this subject, but somehow, I never got the time to do it. But the time has come, and I will do it now. One of the plot styles that I actually like in origin (no matter what I think of the software in overall) is the one where the circumference of a circle is drawn in a colour different to that of the body of the circle, as in this graph



The only glitch is that this is not supported in gnuplot. Well, this is not completely true, because one can use a new prologue for the postscript output, but that works for postscript only, nothing else. What should we do then? There are two options: one has already been discussed, in connection with the bubble charts, sometime in early September. Nevertheless, that method is feasible only when the points do not overlap. What we did there was to plot the data twice (or however many times). But the problem is that those are actually two different plots, so if the adjacent circles overlap, we will not have what we want. Just try it, if you are still not convinced! The second option is that we keep reading. So, the question is then, how could we trick gnuplot into thinking that we have only one plot, not two. Well, the short answer is that we plot only one plot, not two, and we then make sure that points in the plot are coloured properly. I think it is time to show my script, that should make everything clear.

reset
set table 'two.dat'
plot [0:10] sin(x)+0.1*rand(0), cos(x)+0.1*rand(0)
unset table

f(x) = cos(x*pi)
set cbrange [-1:2]; unset colorbox; set border back

set palette defined (-1 "#000000", 1 "#ff0000", 2 "#0000ff")

plot "< gawk '{print $0; print $0}' two.dat" using 1:2:(2+f($0)/5.0):(f($0+1)) index 0 \
with points pt 7 ps var lt palette title 'red', \
'' using 1:2:(2+f($0)/5.0):(f($0+1)*2) index 1 \
with points pt 7 ps var lt palette title 'blue'

As always, we generate some data. Note, that we plot sin(x) and cos(x), which will be in the same data file, but in two separate data blocks. This will become important later. What we need to know about data blocks is that in the data file they are separated by pairs of blank lines, and that we can ask gnuplot to plot only certain data blocks. We indicate our choice with the use of the index keyword. If you want to know more about this, issue the
?index
command, and look at the data file 'two.dat'!

OK, so we have some data, if a bit obscure at the moment. Next, we define a funny function. If you look closely, you will realise that f(x) takes the value of -1 for odd, and 1 for even numbers. In principle, any function should do here, but one has got to be a bit careful: those functions that take 1 or -1 at isolated points might not work. If you do not want to dive into the details of this, take my word for it that f(x) defined above will just be perfect for our purposes.

The next step is the definition of a colour palette with the colour range. (We take off the colour box, too, and set the border to back, but these are irrelevant nuances.) We use the palette for colouring our graph: in the first curve, every other point will be black and red, while in the second curve, every other point will be black and blue. Thus, I have already divulged my trick: we have to duplicate our data set in a way that every line is copied at its position, so, in 'two.dat' we will have something like this
...
 9.29293 -0.93128  i
 9.29293 -0.93128  i
 9.39394 -0.978266  i
 9.39394 -0.978266  i
 9.49495 -0.923256  i
 9.49495 -0.923256  i
...

For this we use an external gawk script, but it is really just one line, it could not be any simpler. Once we duplicated our data, we plot it, and colour every second point as black, and every second point as red. Also note that we use the index keyword to choose the data block that we need. If you have only one data block, you can skip this. But not only do we colour the points differently, but we also resize them: after all, if all had the same size, we would not see the ones that are plotted first. For all this machinery, we use the fact that when a 2D plot is given as 4 columns, the size of the points can be determined by the third column, while the fourth column can be assigned to take care of the colour. In this case, the colour is taken from the palette that we defined above. The general form of such a plot is
plot 'foo' using 1:2:3:4 with points pt 7 ps var lt palette
where 'var' signifies the variable point size (ps), while palette is the colour. Note that we use our snappy f(x) function to choose the size and the colour of every second point, simply by counting the ordinal number of the particular data record, $0: For even numbers, we set the size 2+1/5, and the colour to black (colour value of -1), while for odd numbers, the size is 2-1/5, and the colour is red (colour value of 0). If you have a single data, the whole script would be simply
f(x) = cos(x*pi)
set cbrange [-1:1]; unset colorbox; set border back

set palette defined (-1 "#000000", 1 "#ff0000")

plot "< gawk '{print $0; print $0}' two.dat" using 1:2:(2+f($0)/5.0):(f($0+1)) \
with points pt 7 ps var lt palette title 'red'
Well, this is it for today. Next time I will try to show some of the new stuff in gnuplot 4.4. Cheers,
Zoltán

Tuesday, 1 December 2009

New version of gnuplot is released!

I have been waiting for this for a long time, but at long last, it has happened! The new version of gnuplot has been released with the designation 4.4. You can download the binary or the source code from the sourceforge repository. In the future, I will discuss the new things, and show what can be done with them.
Cheers,
Zoltán

Restricting fit parameters

Chris asked an interesting question today, namely, how one can restrict the fit range in gnuplot. What he meant by that was not the range of the data points (that is really easy, the syntax is the same as for plot), but the range of fit parameters. In some cases, it is a quite reasonable requirement, because we might know from somewhere that certain parameter values just do not make any sense. As it turns out, it is rather easy to achieve this in gnuplot. All we have to do is to come up with a function that restricts its values in the desired range.

After this interlude, let us see an example! We will create some data with the following gnuplot script:
reset
a=1.0; b=1.0; c=1.0
f(x) = a*exp(-(x-b)*(x-b)/c/c)
set table 'restrict.dat'
plot [-2:4] f(x)+0.1*(rand(0)-0.5)
unset table

We take a Gaussian, with some noise added to it. Naturally, we would like to fit a Gaussian to this data, and in particular, f(x). But what, if our model is such that 'a' must be in the range [1.1:2], 'b' must be in the range [0.1:0.9], and 'c' must be in the range [0.5:1.5]? We just use in our fit, instead of f(x), another function, g(x), say, of the form
g(x) = A(a)*exp(-(x-B(b))*(x-B(b))/C(c)/C(c))
where A(a), B(b), and C(c) take care of our restrictions. These functions are somewhat arbitrary, but for better or worse, I will take the following three arcus tangents
# Restrict a to the range of [1.1:2]
A(x) = (2-1.1)/pi*(atan(x)+pi/2)+1.1

# Restrict b to the range of [0.1:0.9]
B(x) = (0.9-0.1)/pi*(atan(x)+pi/2)+0.1

# Restrict c to the range of [0.5:1.5]
C(x) = (1.5-0.5)/pi*(atan(x)+pi/2)+0.5
which would look like this


The point here is that as x runs from negative infinity to positive infinity, A(x) runs between 1.1, and 2.0, and likewise for B(x), and C(x). Then the fit goes on as it would normally. Our script is, thus, the following in its full glory:

# Restrict a to the range of [1.1:2]
A(x) = (2-1.1)/pi*(atan(x)+pi/2)+1.1

# Restrict b to the range of [0.1:0.9]
B(x) = (0.9-0.1)/pi*(atan(x)+pi/2)+0.1

# Restrict c to the range of [0.5:1.5]
C(x) = (1.5-0.5)/pi*(atan(x)+pi/2)+0.5

a=0.0
b=0.5
c=0.9
fit f(x) 'restrict.dat' via a, b, c

g(x) = A(aa)*exp(-(x-B(bb))*(x-B(bb))/C(cc)/C(cc))
aa=1.5
bb=0.5
cc=0.9
fit g(x) 'restrict.dat' via aa, bb, cc

plot f(x), g(x), 'restrict.dat' w p pt 6

and it produces the following graph:


At this point, we should not forget, that what we are interested in is not the value of 'aa', 'bb', or 'cc', but the value of 'a', 'b', and 'c'. This means that what we have to take is
A(aa), B(bb), and C(cc). If you print the values of 'a', 'b', and 'c' from the fit to f(x), the value of 'aa', 'bb', and 'cc', and the value of A(aa), B(bb), and C(cc), we get the following results
gnuplot> pr a, b, c
0.984221984191135 0.996600824504231 1.00765240463672

gnuplot> pr aa, bb, cc
-3442408.91578921 1443864.45093385 -0.201236322474146

gnuplot> pr A(aa), B(bb), C(cc)
1.10000008322047 0.899999823634477 0.936788734177637

Obviously, the second print does not make too much sense, we have to compare the last one to the first one. We can here see that we got values in the ranges [1.1:2], [0.1:0.9], and [0.5:1.5], as we wanted to.

Sunday, 29 November 2009

Broken histograms

Sometimes, a histogram is just a bit awkward, for the simple reason that one or two values are extremely high compared to the rest of the graph. In the case of a standard graph, we would use a broken axis to bring all points to the same order of magnitude. We can play the same trick with histograms, in fact, it is, in some sense, even simpler, than the broken axes. All we have to do is to plot a thick line at the proper position in the proper colour. This is the graph that we are going to make today



Our data file, brokenhist.dat, is as follows
"Jan" 1 2
"Feb" 44 4
"Mar" 3 1
"Apr" 2 25
"May" 4 5
"June" 2 1
and here is our script:
reset
blue = "#babaff"
set xrange [-0.5:5.5]
set yrange [0:11]
set isosample 2, 100
set table 'brokenhist_b.dat'
splot 1-exp(-y/2.0)
unset table

unset colorbox
set border 1
set xtics rotate by 45 nomirror offset 0, -1
unset ytics
f(x) = (x < 6 ? x : (x < 30 ? x-17 : x-35) )
g(x) = (x < 6 ? 1/0 : 6)
set boxwidth 0.85
set style fill solid 0.8 border -1
set style data histograms
set palette defined (0 "#ffff   ff", 1 "#babaff")
plot 'brokenhist_b.dat' w ima,\
'brokenhist.dat' u (f($2)) t 'Red bars',\
'' u (f($3)) lc rgb "#00bb00" t 'Green bars', \
'' u 0:(f($2)):2 w labels center offset 0,0.5 t '',\
'' u ($0+0.25):(f($3)):3 w labels center offset 0,0.5 t '',\
'' u 0:(-1):xticlabel(1) w l t '', \
'' u ($0+0.12):(g($3)+0.12):(0.25):(0.25) w vectors lt -1 lc rgb blue lw 5 nohead t '', \
'' u ($0-0.12):(g($2)+0.12):(0.25):(0.25) w vectors lt -1 lc rgb blue lw 5 nohead t ''
OK, so let us look at the code! The first couple of lines are required only, if you want to have some posh background. Likewise, you can drop the 'unset colorbox' line, when you have a white background. We set only the bottom axis, which means that we have to unset the ytics and set to xtics to nomirror. Then we have two helper functions. The definitions of these depend on where you want to have the break point in the histogram. In this particular case, I took 6, but it is arbitrary.

In the next step, we set the properties of the histogram, like the width of the columns, the fill style, and the data style. We also define a palette, but this is needed for the background only. For white background, you can skip this step. You can also skip the first plot, because that is nothing but our fancy background.

The actual plotting begins after this. We plot the two sets of columns, and also plot the data file with labels. The labels are placed at the top of each column (this is why we could do away with the yaxis.) We also 'plot' the axis labels, and finally, plot the two break points. Note that the plotting of the break points is automatic, once we have the definitions of the two helper functions. If you want to have a steeper cut, you could

'' u ($0+0.12):(g($3)+0.12):(0.25):(0.5) w vectors lt -1 lc rgb blue lw 5 nohead t ''
e.g., which stretches the vectors in the vertical direction. Otherwise, we have finished the plot, there is nothing else to do.
I should point out here that, in order to have a seamless cut, we have to use a colour for the vectors that is identical to the background at that particular point. This implies that we could not have a gradient at y=6. The background colour is virtually constant at y=6 (c.f. the definition of 'brokenhist_b.dat'. While it would not be impossible to implement a cut over a gradient, I believe, it is probably not worth the trouble it involves.

Sunday, 22 November 2009

Update

I have had some time, so I moved all recent posts to their permanent place on
my web page
. I have "sexed up" the homepage a bit, so, hopefully, browsing will be a tad easier. Let me know, if there are any problems! (I know that there is a small glitch with the cascaded style sheets on IE6. IE8 should work without problems. Firefox 3.5 is also OK.) There is a zipped version of the complete site, if you want to read it off-line.
Comments should still be posted here, please!



Cheers,
Zoltán

Some basic statistics with gnuplot


In my previous post, I mentioned a patch that you can compile into gnuplot, and that should make plots with some statistical properties a bit easier. Now, the problem with that patch is that, if you don't want to, or can't take the trouble of compiling gnuplot for yourself, it is no use. However, for most things contained in the patch, there is a work-around that should function properly in gnuplot 4.2. I will discuss those now.

The first thing I did with the statistical patch was to plot the mean, minimum and maximum of a data set. This can easily be done in the following way.
reset
# Produce some dummy data
set sample 200
set table 'stats2.dat'
plot [0:10] 0.5+rand(0)
unset table

set yrange [0:2]
unset key

# Retrieve statistical properties
plot 'stats2.dat' u 1:2
min_y = GPVAL_DATA_Y_MIN
max_y = GPVAL_DATA_Y_MAX

f(x) = mean_y
fit f(x) 'stats2.dat' u 1:2 via mean_y

# Plotting the minimum and maximum ranges with a shaded background
set label 1 gprintf("Minimum = %g", min_y) at 2, min_y-0.2
set label 2 gprintf("Maximum = %g", max_y) at 2, max_y+0.2
set label 3 gprintf("Mean = %g", mean_y) at 2, max_y+0.35
plot min_y with filledcurves y1=mean_y lt 1 lc rgb "#bbbbdd", \
max_y with filledcurves y1=mean_y lt 1 lc rgb "#bbddbb", \
'stats2.dat' u 1:2 w p pt 7 lt 1 ps 1

At the beginning of our script, we just produce some dummy data, and call a dummy plot. This plot does nothing but fills in the values of the minimum and maximum of the data set. Then we fit a constant function. You can convince yourself that this returns the average of the data set.

In the plotting section, we produce three labels that tell us something about the data set, and plot the data range with shaded region. Easy enough, and in just a couple of lines, we created this figure



Now, what should we do, if we were to calculate the standard deviation. Well, we know how to calculate the average, so we will use that. Here is the script:

reset
set sample 200
set table 'stats2.dat'
plot [0:10] 0.5+rand(0)
unset table

set yrange [0:2]
unset key
f(x) = mean_y
fit f(x) 'stats2.dat' u 1:2 via mean_y

stddev_y = sqrt(FIT_WSSR / (FIT_NDF + 1 ))

# Plotting the range of standard deviation with a shaded background
set label 1 gprintf("Mean = %g", mean_y) at 2, min_y-0.2
set label 2 gprintf("Standard deviation = %g", stddev_y) at 2, min_y-0.35
plot mean_y-stddev_y with filledcurves y1=mean_y lt 1 lc rgb "#bbbbdd", \
mean_y+stddev_y with filledcurves y1=mean_y lt 1 lc rgb "#bbbbdd", \
mean_y w l lt 3, 'stats2.dat' u 1:2 w p pt 7 lt 1 ps 1

What we utilise here is the fact that the fit function also sets a couple of variables. One of them is the sum of the residuals, which is called FIT_WSSR, while another is the number of degrees of freedom, FIT_NDF. However, we know that the number of degrees of freedoms is one less, than the number of data points, for we fit a function with a single parameter. Therefore, if we take the square root of the sum of residuals divided by the number of degrees of freedom plus one, we get the standard deviation. The rest of the plot is trivial, and this script results in the following graph:


Incidentally, this can also be used for removing points that are very far from the mean. The following script takes out those data that are more than one standard deviation away from the mean.
reset
set sample 200
set table 'stats2.dat'
plot [0:10] 0.5+rand(0)
unset table

set yrange [0:2]
unset key
f(x) = mean_y
fit f(x) 'stats2.dat' u 1:2 via mean_y

stddev_y = sqrt(FIT_WSSR / (FIT_NDF + 1 ))

# Removing points based on the standard deviation
set label 1 gprintf("Mean = %g", mean_y) at 2, min_y-0.15
set label 2 gprintf("Sigma = %g", stddev_y) at 2, min_y-0.3
plot mean_y w l lt 3, mean_y+stddev_y w l lt 3, mean_y-stddev_y w l lt 3, \
'stats2.dat' u 1:(abs($2-mean_y) < stddev_y ? $2 : 1/0) w p pt 7 lt 1 ps 1
with the corresponding figure


Only the last line is relevant: we use the ternary operator to decide whether we want to keep the point: if the deviation from the mean is less, than the standard deviation, we hold on to our data, otherwise, we replace it by 1/0, which is undefined, and gnuplot quietly ignores it. If you want to learn more about the working of the ternary operator, check out my post on the plotting of an inequality.

We have, thus, already found a solution for two of the problems addressed in the patch. What about the third one, adding arrows to the plot at the position of the minimum or maximum, say. We can do that, too. Here is the script:

reset
set sample 50
set table 'stats1.dat'
plot [0:10] 0.5+rand(0)
unset table

set yrange [0:2]
unset key
plot 'stats1.dat' u 1:2
min_y = GPVAL_DATA_Y_MIN
max_y = GPVAL_DATA_Y_MAX

plot 'stats1.dat' u ($2 == min_y ? $2 : 1/0):1
min_pos_x = GPVAL_DATA_Y_MIN
plot 'stats1.dat' u ($2 == max_y ? $2 : 1/0):1
max_pos_x = GPVAL_DATA_Y_MAX

# Automatically adding an arrow at a position that depends on the min/max
set arrow 1 from min_pos_x, min_y-0.2 to min_pos_x, min_y-0.02 lw 0.5
set arrow 2 from max_pos_x, max_y+0.2 to max_pos_x, max_y+0.02 lw 0.5
set label 1 'Minimum' at min_pos_x, min_y-0.3 centre
set label 2 'Maximum' at max_pos_x, max_y+0.3 centre
plot 'stats1.dat' u 1:2 w p pt 6

First, we retrieve the values of the minimum and the maximum by using a dummy plot. Having done that, we retrieve the positions of the minimum and maximum, by calling a dummy plot on the columns
plot 'stats1.dat' u ($2 == min_y ? $2 : 1/0):1

What this line does is substitute min_y, when the second column (whose minimum we extracted before) is equal to the minimum, and an undefined value, 1/0, otherwise. The minimum of this plot is nothing, but the x position of the first minimum. Likewise, had we assigned
min_pos_x = GPVAL_DATA_Y_MAX

that would have given the position of the last minimum of the data file. Obviously, these distinctions make sense only, if there are more than one minimum or maximum. Knowing the x and y positions of the minimum and maximum, we can easily set the arrows. We, thus, have the following figure



Adding labels showing the value should not be a problem now.

Well, this is for today. Till next time!

Monday, 9 November 2009

Patching gnuplot

One of the major advantages of open-source code is that if you would like to add some new features, you can easily do that. This applies to gnuplot, too, in fact, doing that does not require anything special. I have been quite inactive on this blog recently, and the reason is that Philipp Janert and I have been working on a patch to gnuplot.

The steps of patching gnuplot are described on gnuplot's main web page. There are a number of patches uploaded to gnuplot's patch tracker, on which quite a few new features, still in the development phase, are published. It is really worthwhile to try them out, first, to provide feedback as to what is useful and what is not, and second, to help the developers to find bugs and other glitches, like what the syntax of a command should be and so on.

Our patch is related to an old debate as to what gnuplot really is. At many a place, you will find the statement that "gnuplot is a plotting utility, not a statistical analysis package". I have nothing against this statement, however, when saying so, we have to tell what we mean by plotting. So, is plotting just placing a thousand dots at positions that represent our data? Or do we want more? E.g., throwing out data points that are unreasonably far from the mean. Or showing the mean, and the standard deviation? Or calling the reader's attentional to some special points, like the minimum or the maximum in a data set? And many similar things. I believe, plotting requires much more, than just showing the measurement data: a plot makes sense only, if we can point out what is to be pointed out. By the way, fitting falls into this category, and fitting has been an integral part of gnuplot for ages. The point being that the original statement (gnuplot is a plotting utility, not a statistical analysis package) has been wrong for a long time.

The patch that I mentioned above was announced yesterday on the gnuplot development mailing list and you can find the patch for the source and the
documentation on patch tracker. I have put a couple of examples on my gnuplot web site under patch. You can also find the full documentation.

I would like to ask you, if you feel crafty and you can, download the patch, and try it, and let us know whether you find it useful, what else, do you think, we could do with it and so on. It would really help the development. Once the patch makes it to the main code, I will discuss various option on these pages.

Just to wet your appetite, here is a figure that you could very easily make with the new patch. (You can find the code on my web site.)




Many cheers,
Zoltán