Monday, 9 November 2009

Patching gnuplot

One of the major advantages of open-source code is that if you would like to add a new feature, you can do so yourself. This applies to gnuplot, too; in fact, doing so does not require anything special. I have been quite inactive on this blog recently, and the reason is that Philipp Janert and I have been working on a patch to gnuplot.

The steps of patching gnuplot are described on gnuplot's main web page. There are a number of patches uploaded to gnuplot's patch tracker, where quite a few new features, still in the development phase, are published. It is really worthwhile to try them out: first, to provide feedback on what is useful and what is not (including what the syntax of a command should be), and second, to help the developers find bugs and other glitches.

Our patch is related to an old debate as to what gnuplot really is. In many places, you will find the statement that "gnuplot is a plotting utility, not a statistical analysis package". I have nothing against this statement; however, when saying so, we have to say what we mean by plotting. Is plotting just placing a thousand dots at positions that represent our data? Or do we want more, e.g., throwing out data points that are unreasonably far from the mean? Or showing the mean and the standard deviation? Or calling the reader's attention to some special points, like the minimum or the maximum of a data set? And many similar things. I believe plotting requires much more than just showing the measurement data: a plot makes sense only if we can point out what is to be pointed out. By the way, fitting falls into this category, and fitting has been an integral part of gnuplot for ages. The point is that the original statement (gnuplot is a plotting utility, not a statistical analysis package) has been wrong for a long time.
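As a rough illustration of the kind of annotated plot meant here, the following is a minimal sketch only. It assumes a gnuplot release that ships a built-in stats command (4.6 or later), whose syntax and variable names are not necessarily those of the patch discussed in this post. The file name data.dat and the prefix A are hypothetical.

```gnuplot
# Hypothetical two-column data file (x in column 1, y in column 2).
datafile = 'data.dat'

# Compute summary statistics quietly. The results are stored in variables
# whose names start with the chosen prefix, here "A" (an arbitrary choice).
stats datafile using 1:2 name "A" nooutput

# Horizontal line at the mean of y (thicker), spanning the whole graph.
set arrow 1 from graph 0, first A_mean_y to graph 1, first A_mean_y nohead lw 2

# Horizontal lines one standard deviation above and below the mean.
set arrow 2 from graph 0, first (A_mean_y + A_stddev_y) \
            to   graph 1, first (A_mean_y + A_stddev_y) nohead
set arrow 3 from graph 0, first (A_mean_y - A_stddev_y) \
            to   graph 1, first (A_mean_y - A_stddev_y) nohead

# Mark and label the maximum; A_pos_max_y is the x value at which y peaks.
set label 1 sprintf("max = %g", A_max_y) at A_pos_max_y, A_max_y \
    point pt 7 offset 1,1

# Plot only points within three standard deviations of the mean;
# 1/0 is undefined, so gnuplot skips those points.
plot datafile using 1:(abs($2 - A_mean_y) < 3*A_stddev_y ? $2 : 1/0) \
     with points title "data (outliers dropped)"
```

The cutoff of three standard deviations is an arbitrary choice for the sketch. The point is only that the statistics end up in ordinary variables, so they can drive arrows, labels, and filters alike.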

The patch that I mentioned above was announced yesterday on the gnuplot development mailing list, and you can find the patch for the source and the documentation on the patch tracker. I have put a couple of examples on my gnuplot web site under "patch". You can also find the full documentation there.

If you feel crafty, I would like to ask you to download the patch, try it, and let us know whether you find it useful, what else you think we could do with it, and so on. It would really help the development. Once the patch makes it into the main code, I will discuss the various options on these pages.

Just to whet your appetite, here is a figure that you could very easily make with the new patch. (You can find the code on my web site.)



Many cheers,
Zoltán

5 comments:

  1. Dear Zoltán,
    thank you very much for this patch (and also thanks to Philipp Janert). I agree with you that these features (min, max, mean, etc.) are not introducing statistics into Gnuplot, but merely provide a clean and simple way to help represent scientific data.
    I personally needed these features on too many occasions and had to go through messy tricks to get the results I wanted.
    I look forward to seeing this patch included in the official distribution of Gnuplot.

    Thank you again,
    Matteo Tommasini

  2. Dear Matteo,

    Many thanks for the feedback. We have implemented some improvements (e.g., matrix data and iterations are now supported by stats), and hopefully the patch will make it into the main branch sometime soon.
    Cheers,
    Zoltán

  3. Dear Zoltán,
    this is great news :-)
    I will stay tuned...
    All the best,
    Matteo

  4. Hello everybody,

    I already posted this on SF, but I guess you don't mind if I repeat it here:
    First of all I want to say that I really appreciate the stats patch. I really
    needed this functionality, too. I hope it will make it to the public version.

    Now that I have worked with stats for some time, I would like to suggest
    another way of invoking the stats functionality. Because I have not yet
    read the source code, I don't know if it is possible.

    Instead of invoking stats before one or several data files are plotted
    and storing each set of stats results in tons of variables with different
    prefixes, wouldn't it be possible to enhance the plot command with a "stat"
    tag? Maybe an example explains more:

    Instead of
    stats "file1" noout var=file1_
    stats "file2" noout var=file2_
    plot "file1" u 1:($2/file1_max_y), \
    "file2" u 1:($2/file2_max_y)
    ... this will give a plot with a relative y-axis (0 to 1)
    One can easily imagine the work overhead if you want to compare several
    files in such a way.

    For this reason I propose something like this as additional
    functionality:
    plot "file1" u 1:($2/$max_y) stats statsusing 1:2, \
    "file2" u 1:($2/$max_y) stats statsusing 1:2
    Of course the syntax is not perfect, but hopefully the idea can be seen.
    For the standard usage, statsusing (or other {datafile-modifiers}) may be
    omitted. In general, there must be at least a way to have a different
    using than that of the plot command, because the latter will contain
    "variables" like $max_y. I have no idea whether return values like $max_y
    are a good way to handle the stats results.

    Anyway, the stats tag has to retrieve the information and set variables
    before the actual plotting starts.

    I hope you like this idea. Looking forward to comments on that.
    Best regards.

  5. Hello Andreas,

    First of all, many thanks for the feedback! What you suggest is a sound idea, but there are a few obstacles in the way.
    First, we wanted an implementation that doesn't interfere with any existing code, which means that it had to be independent of the plot command, and the like.
    Second, if we went your way, we would have had to rewrite the command-parsing mechanism, which wouldn't have been a piece of cake...
    Thirdly, and perhaps most importantly, in your scheme, statistical properties can't be used for anything but plotting. What I had in mind at the very beginning was putting an arrow at the place of the maximum and adding a label showing the value of the maximum. In your scenario this cannot be done.
    I don't think that assigning the values to variables is such a hassle, by the way. If the many new variables disturb you, you can do the following:

    stats 'foo' using 1:2 var pre "baz_"
    max1 = baz_max
    unset baz_*

    which will leave you with one variable, max1, and nothing else. Note that the syntax of stats is a bit different from the original version. Also note the unset command, which can take a wildcard; there is a recent patch by Philipp Janert that makes this possible. (By the way, the reason for changing the syntax a bit is that in this version the prefix is a string, while originally it was a bareword. A bareword can't be scripted, while a string can.)
    Well, these were our arguments for the syntax of stats. I will try to roll out the new version in the coming days.
    Cheers,
    Zoltán
