Sunday 9 August 2009

Plotting the recession - some basic data analysis with gnuplot

Today, we will see how we can do some very rudimentary data analysis, using gnuplot. Of course, if you want to analyse your data, then you can either use a dedicated script, written in a far better fit language, e.g., in Perl or in gawk, or you can do the whole thing in a tabular environment, using Excel or Origin, or Grace, or Labplot, or I do not know what. And I have also read somewhere that gnuplot conforms with the Unix philosophy, "one task - one tool", and we should not do any data processing with it. Now, there are at least two problems that I see with this approach: one is that some data analysis are really simple, and it is not clear, why one should use a separate tool for that. Furthermore, it will always cause some trouble (at least, in my experience) when the data analysis is not done from gnuplot, for one can never be sure how and on which version of the analysed data file the figure was based. And pipes can only be used on Linux/Unix-type systems, thus, the platform independence of gnuplot is lost. But more importantly, and if we are at the "philosophy" anyway, it is completely unclear (at least, to me) what constitutes plotting: just placing the curves, points and dots on an empty canvas, and writing a couple of labels for the axes, or this and the choice of the curves, points and dots. The problem is that one can in most cases decide what to do with erroneous points and the like only when it is already plotted and shown. So, if that is the case, why should we sequester data analysis from the plotting? Are they not the two complementing parts of the very same procedure?

So, these are my reasons for doing data processing in gnuplot. Of course, in most cases, I do that using an external script, and piping it from the gnuplot prompt. But that would deprive windows users from the pleasures of doing something fancier:)

By the way, if you look at some of the demo scripts of gnuplot 4.3, you will realise that the developers of gnuplot have already parted with the idea of "one task - one tool".

The kind of data processing we will do today requires a "recursive" access to data points, in particular, I will demonstrate how we can calculate the derivative and the integral of a series of data points. Calculating the derivative and integral of an analytic function is trivial, even if you do not have the analytical answer: for the derivative, you can simply plot f(x+dx)-f(x), where dx is just some pre-defined number, and for the integral you can apply a recursive scheme, detailed on the gnuplot demo web site. But they work on known functions, and we would like to calculate something based on "measured" data. For this, I will take the weekly oil price from this year, which you can download from the link. Let us suppose that we plot the oil price, but we also want to highlight the time intervals where the price is dropping. This requires us to calculate the derivative at each point (In this particular case, we could get away just by comparing the value and the next, but we have got to get access to the values anyway, and we would get it by fitting a linear function, whose linear coefficient gives the derivative. So, for better of worse, I will just stick with the derivative.)

For a starter, let us see our two scripts! As I have already pointed out, we will have to walk through the values, one by one, so we will need a 'for' loop, which means two scripts: a "master", and one that we call repeatedly. I have discussed this method at great length in my recent posts.
reset
unset key
A=0; sa=1; sb=0; integ=0.0
f(x) = a*x+b
g(x) = (x>0?GPVAL_Y_MIN:GPVAL_Y_MAX)
set xlabel 'Time [weeks]'
set ylabel 'Crude oil price [$]'
p 'oil.dat' u 0:8

xmax = GPVAL_DATA_X_MAX

set print 'oild.dat' append
l 'derint_r.gnu'
set print

plot 'oild.dat' u 1:(g($2)) w filledcurve lc rgb "#eeeeee", 'oil.dat' u 0:8 w l

and our 'for' loop, that we call 'derint_r.gnu'
a=1.0; b=40.0
fit [A:A+1] f(x) 'oil.dat' u 0:8 via a, b
integ = integ+(a*A+b+a*(A+1)+b)/2.0
if(sa*a<0 a="" b="" integ="" print="" sa="a;" sb="b<br">if(sa*a>0) print A, a, b, integ; sa=a; sb=b
A=A+1
if(A<xmax) reread


The first part of the main script should be clear: we set up the style of our figure, and do a dummy plot to learn something about our data file, 'oil.dat'. f(x) is our linear function, while g(x) is just a helper, whose role will become evident a bit later. Having set up the plot, we call 'derint_r.gnu' 'xmax' times. So, what is in 'derint_r.gnu'? We simply fit our linear function over a range that contains only two points, and we step through the whole file by moving those two points by one in each call to the loop. Then we calculate the integral by adding the area of the next trapezium, and then depending on whether the sign of the derivative, 'a', is different to the previous one, we print out the results once (if the signs are identical), or twice (if they are different). This distinction is necessary because we want to plot rectangles where the derivative is negative. After we are done with the data processing, we close our new file by setting the print, and plot the results, to get the following figure



Note that in the plotting, we called the function g(x), which takes on the value of the minimum of the yrange, if the argument is negative (the price is dropping), while returns the maximum of the yrange otherwise. Of course, there are many things that we could do to improve the image, but I wanted only to demonstrate how we can do the data processing part. You can dress up the image as you like, and for that you can also draw on many a post of mine.

6 comments:

  1. A great blog you have here. Can you suggest a simple method just for highlighting a desired data range? For example, xrange is 0-100 and yrange is 0-20, and I just want to highlight the chart area between x = 20 and 40. Thanks very much.

    ReplyDelete
  2. Hello Ben,

    I am not quite sure of whether I understand your intentions, but let me have a go at it! So, if you want to plot everything, and then highlight (i.e., plot it in a different colour) those values that fall in the range 20-40, then you can do the following:
    f(x) = (x < 20 ? x : (x > 40 ? x : 1/0))
    g(x) = (x > 20 ? (x < 40 ? x : 1/0) : 1/0)
    plot 'foo' u (f($1)):2 with lines lt 1, '' u (g($1)) with lines lt 2

    If, however, you simply want to have a different background colour for that range, then you can either follow the recipe given in the post (with some obvious modifications), or you can use the rectange object as

    set object 1 rectangle from graph 20,0 to graph 40,100 behind fillcolor rgb "#ff0000"

    which will place a red rectangle behind your plot.
    I hope this helps!
    Cheers,
    Zoltán

    ReplyDelete
  3. How might you plot a 3d object but only show the extreme edges as lines? For instance, how would you show just a "V" when plotting an upside down cone given by: splot sqrt(x**2+y**2)*tan(45*pi/180)

    ReplyDelete
  4. I don't think it can be done automatically. I guess, one has just got to know the parametrisation of the surface, and then restrict it in a way that it gives you the cross-section.
    Cheers,
    Zoltán

    ReplyDelete
  5. Thanks very much for your help.
    Ben

    ReplyDelete
  6. Dear Gnuplotter,
    I have a problem in gnuplote to load an external .dat file and to draw it graphically.
    I am working on my thesis with simulator called peersim. the problem is that when i run my simulator in eclipse so compile successfully and i have make a .dat file to which simulation result automatically gone.
    I have tried different command to load that file and draw graph but it gives error. I can mail that file to whether the data contained in it is excutable or not. I am working from two months but got nothing.
    Plz help me in this regards.

    ReplyDelete