How to create stacked area graphs with Gnuplot

2011-03-23 » data, gnuplot, Perl, plotting

Gnuplot is a great plotting utility that can be used either interactively or automatically, from inside scripts of all sorts. However, it can sometimes be quite difficult to use, simply because there is a lot of documentation and it is hard to figure out exactly which piece of it you should read and where to find it.

This is a big problem, because the way you plot data, that is, which Gnuplot options you set, can make a huge difference in the readability of the plot. For example, I had an ASCII file called initial_data.dat that lists the number of daily visitors to four different websites I follow (the complete file used for this article is downloadable from the link at the end of this page):


  20110101 2481 89 896 98
  20110102 2341 83 1341 762
  20110103 560 208 1795 890
  20110104 936 409 1665 419
  20110105 534 562 937 341
  20110106 171 728 953 612
  20110107 200 199 1297 569
  20110108 521 557 990 295
  20110109 1592 227 535 466
  20110110 1363 211 21 299
  20110111 437 222 110 302
  ...


The file format is very simple: the first column is the date, followed by four space-separated columns, one for each website. However, if you tell gnuplot to plot those numbers as simple lines with these commands (the plot command is wrapped here for readability; it must remain one logical line, so each backslash has to be the very last character on its line):



  set terminal png size 1400, 600
  set output "with_lines.png"
  set key center top
  set style fill solid
  set xdata time
  set timefmt "%Y%m%d"
  set format x "%b %d"
  set ytics 300
  set y2tics 300 border
  set xlabel "Website traffic from 20110101 to 20110322"

  set style line 1 linetype 1 pointtype 0 linewidth 1 linecolor 6
  set style line 2 linetype 2 pointtype 0 linewidth 1 linecolor 7
  set style line 3 linetype 3 pointtype 0 linewidth 1 linecolor 8
  set style line 4 linetype 4 pointtype 0 linewidth 1 linecolor 9
  plot ["20110101":"20110131"][:] 'initial_data.dat' using 1:5 t   \
                                  "Website 4" w lines linestyle 4, \
   'initial_data.dat' using 1:4 t "Website 3" w lines linestyle 3, \
   'initial_data.dat' using 1:3 t "Website 2" w lines linestyle 2, \
   'initial_data.dat' using 1:2 t "Website 1" w lines linestyle 1


you'll get the diagram with_lines.png, which isn't really informative: the lines continuously overlap, so it's hard to see quickly which website had the highest traffic on each day. It's even harder to see the total amount of traffic on each day. The solution is to stack the graphs, that is, to use each line as the foundation on which to draw the next one. In other words, instead of drawing four lines, one for each website, I wanted to draw:

  • traffic for website 1, using the numbers in column 2 as they are in the data file
  • traffic for website 2, summing its numbers in column 3 with the corresponding numbers in column 2
  • traffic for website 3, summing its numbers in column 4 with the corresponding numbers in columns 2 and 3
  • traffic for website 4, summing its numbers in column 5 with the corresponding numbers in columns 2, 3 and 4
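The arithmetic in the list above is just a running sum across each row. As a minimal sketch (in Python rather than the Perl used in this article, and with a hypothetical helper name, stack_row), here is what it does to the first row of the sample data:

```python
from itertools import accumulate


def stack_row(line):
    """Return the date and the running totals of the remaining columns."""
    date, *values = line.split()
    return date, list(accumulate(int(v) for v in values))


# First row of initial_data.dat as listed above.
date, stacked = stack_row("20110101 2481 89 896 98")
print(date, *stacked)  # 20110101 2481 2570 3466 3564
```

The last number of each stacked row is the total traffic for that day, which is exactly the value that was hard to read from the overlapping lines.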

It is possible to do this kind of thing directly in Gnuplot, by calling utilities like sed or awk from within the Gnuplot command file, as explained in this page. However, in my opinion it is much better and cleaner to do it before calling gnuplot, with an auxiliary script in Perl, Python or whatever strikes your fancy. The reason is that this way you can pre-process the numbers in much more flexible and self-documenting ways than Gnuplot is capable of. Another way to say this is that it is better to keep number calculation and number plotting completely separate. For example, I also run an expanded version of the script below that calculates and stacks on the fly the one-month moving averages of the traffic to each website.
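The expanded moving-average script is not shown in this article, so the following is only a guess at what such pre-processing might look like, written in Python. The function name, the `window` parameter and the choice of a trailing average (which averages over fewer rows at the start of the data) are all assumptions made for illustration:

```python
from collections import deque
from itertools import accumulate


def moving_average_stacked(rows, window=30):
    """Yield (date, stacked trailing averages) for each (date, values) row.

    `window` is a hypothetical parameter: the number of most recent rows
    averaged for each website. Early rows average over the rows seen so far.
    """
    recent = deque(maxlen=window)   # holds at most `window` latest rows
    for date, values in rows:
        recent.append(values)
        # zip(*recent) groups the values column by column (one per website)
        averages = [sum(col) / len(recent) for col in zip(*recent)]
        yield date, list(accumulate(averages))


# First two rows of the sample data, with a two-row window to keep it short.
rows = [
    ("20110101", [2481, 89, 896, 98]),
    ("20110102", [2341, 83, 1341, 762]),
]
for date, stacked in moving_average_stacked(rows, window=2):
    print(date, *stacked)
```

Doing the averaging and the stacking in the same pass keeps the Gnuplot file unchanged: it still reads a date followed by four already-stacked columns.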

In this case, using a separate script to “stack” the data is also quick. All I had to do to rearrange the numbers as described above was to feed the data file to another script and save the result as another data file:

cat initial_data.dat | perl stack_data.pl > stacked_data.dat

stack_data.pl is the simple Perl script below:


  #!/usr/bin/perl

  use strict;
  use warnings;

  while (<>) {                          # read the data file from standard input
      chomp;
      my @FIELDS = split ' ', $_;       # split the line into its columns
      next unless @FIELDS;              # skip blank lines
      my $DATE = shift @FIELDS;         # the first column is the date
      print $DATE;
      my $LINE_TOTAL = 0;
      for my $VALUE (@FIELDS) {         # re-print the data, replacing each
          $LINE_TOTAL += $VALUE;        # column with the sum of itself and
          print " $LINE_TOTAL";         # all the columns preceding it
      }
      print "\n";                       # write the new line to standard output
  }


The stacked_lines diagram is what you get when plotting the stacked_data.dat file in the same style as the initial graph, that is, with the same "plot" command. That is much better but, in my opinion, it wasn't good enough yet. Personally, I think filled areas are more informative in cases like this. The way to obtain them is to use another style in the plot command. Replace the plot command in the Gnuplot instruction file with this one (it is wrapped here for readability, but it must be entered as one logical line):


  plot ["20110101":"20110131"][:] \
   'stacked_data.dat' using 1:5 t "Website 4" w filledcurves x1 linestyle 4, \
   'stacked_data.dat' using 1:4 t "Website 3" w filledcurves x1 linestyle 3, \
   'stacked_data.dat' using 1:3 t "Website 2" w filledcurves x1 linestyle 2, \
   'stacked_data.dat' using 1:2 t "Website 1" w filledcurves x1 linestyle 1


And this (with_filledcurves.png) is what the plot will look like. Nicer, huh? Please note that the order in which each column is read and plotted in the command above is important. When Gnuplot draws a new line or area, it does so over what it has already drawn or painted. With lines (second diagram in this page) it makes no practical difference. With areas, instead, it's essential to paint them in the right order. This means that you must plot the highest area first ("Website 4" in our example), that is, the one that is the sum of all the original columns. Then comes the one that is the sum of only the first three columns, and so on.
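If you generate the Gnuplot file from a script, the ordering rule is easy to enforce mechanically. The following Python sketch (the function name is hypothetical, and the date range is left out for brevity) builds the plot command by walking the columns from last to first, assuming the stacked file has the date in column 1 and the line styles are numbered 1 to N as in the example above:

```python
def stacked_plot_command(datafile, titles):
    """Build a gnuplot plot command that paints the tallest area first.

    titles[0] belongs to data column 2, titles[1] to column 3, and so on.
    """
    clauses = []
    # Walk the columns from last to first so that the largest running total
    # is drawn first and the smaller ones are painted on top of it.
    for index in range(len(titles), 0, -1):
        column = index + 1          # column 1 is the date
        clauses.append(
            f"'{datafile}' using 1:{column} t \"{titles[index - 1]}\" "
            f"w filledcurves x1 linestyle {index}"
        )
    return "plot " + ", ".join(clauses)


print(stacked_plot_command(
    "stacked_data.dat",
    ["Website 1", "Website 2", "Website 3", "Website 4"],
))
```

Because the command comes out on a single line, there are no continuation backslashes to get wrong.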