Traffic Monitoring with rrdtool

So, as some of you may know, I've been having issues with my Internet at home lately as well as at work (today it appears totally dead at work). Packet loss and latency were both high, and I wanted to see exactly when and where it was happening. I've been using ntop for a while to display usage statistics but found that it was lacking in terms of both fine resolution and error reporting. So, what's a geek to do? Write his own!

Surprisingly, this was a fairly difficult little project to get up and running. Maybe I was misunderstanding the rrdtool docs, but it took me almost a full day just to get basic graphing down. Well, not to let my efforts go to waste, I'm here to share my final results with you today!

The Database

rrdtool is a "round robin database tool" which stores data points and statistics while providing a very nice (albeit difficult) graphing facility. The initial setup of the database can be tricky with a lot of stuff needing explanation. Please note, this is all a nice tidy shell script. No complicated Perl here!

interval=35
start=`date +%s`
b=`expr $interval \* 2`

rrdtool create status.rrd \
    --start $start\
    --step $interval\
    DS:packet_loss:GAUGE:$b:U:U\
    DS:latency:GAUGE:$b:0:450\
    DS:kbytes_in:DERIVE:$b:0:U\
    DS:kbytes_out:DERIVE:$b:0:U\
    RRA:AVERAGE:0.5:1:576\
    RRA:AVERAGE:0.5:6:672\
    RRA:AVERAGE:0.5:24:732\
    RRA:AVERAGE:0.5:144:1460

So, what's this all about now? Well, the part of the script which actually feeds the database does so once every $interval seconds (35 here). rrdtool's --start option takes a time in seconds since the Epoch (00:00 UTC, January 1st 1970), so we need the "+%s" format for the date command.

You'll see multiple "DS" entries - these are the data sources, the actual variables which are stored. The GAUGE type stores each reading as-is, which suits values that go up and down, like latency. The DERIVE type is for counters that keep increasing: rrdtool stores the rate of change between consecutive readings rather than the raw counter value. The $b variable is just the script interval multiplied by two. This field is the heartbeat: the longest gap rrdtool will accept between updates before it marks the data for that period as unknown. Each pass through my loop takes a bit longer than $interval (the pings take a few seconds on top of the sleep), so a heartbeat equal to the interval would likely leave most of the data unknown. Doubling it gives enough slack.

The trailing pair on each DS line (:0:U, for example) sets the minimum and maximum allowed values, in that order. Anything outside that range is stored as unknown and not displayed. This keeps one bogus reading from throwing the chart's scale out of whack. A U simply signifies "Unknown" or "Unspecified", meaning no limit on that side.
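To make the DERIVE type and the 0 minimum a bit more concrete, here's a rough sketch of the arithmetic involved. It's a simplification (rrdtool also fits readings onto step boundaries, which is ignored here), and derive_rate is a made-up helper, not anything rrdtool provides.

```shell
#!/bin/sh
# Simplified sketch of a DERIVE data source with bounds 0:U.
# derive_rate is a hypothetical helper, not part of rrdtool.
derive_rate() {
    # $1 = previous counter reading, $2 = current reading, $3 = seconds between them
    delta=$(( $2 - $1 ))
    if [ "$delta" -lt 0 ]; then
        # The counter went backwards (e.g. the interface was reset).
        # With a minimum of 0 the negative rate is stored as unknown.
        echo "U"
    else
        echo $(( delta / $3 ))
    fi
}

normal=$(derive_rate 1000 1350 35)   # counter grew by 350 kB in 35 s
reset=$(derive_rate 1350 20 35)      # counter dropped after a reset
echo "normal: $normal kB/s, after reset: $reset"
```

Without the 0 minimum, the second case would show up as a large negative spike on the chart.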

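The RRA entries (round robin archives) are where the data actually gets stored. Frankly, this took a lot of futzing, and every site that I referenced used totally different values that didn't seem to relate to one another. Each entry reads RRA:AVERAGE:xff:steps:rows. Here steps is the number of intervals averaged into one stored point and rows is the number of points kept. So the 1,6,24,144 should provide for long term storage of "576 points at 1 interval each", "672 points at 6 intervals each", "732 points at 24 intervals each" and "1460 points at 144 intervals each" (each being the AVERAGE amongst those intervals). With a 35 second interval, the first archive holds 576 times 35 seconds, or about 5.6 hours, of full-resolution data. The 0.5 is the "xfiles factor": up to half of the intervals feeding a point may be unknown before the point itself is stored as unknown. I found that if these values are not just right, the whole system refuses to display data.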
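If you want to check how much history each archive covers, the arithmetic is just steps times rows times the interval. Here's a small sketch using the numbers from the create command above. rra_span is a made-up helper name.

```shell
#!/bin/sh
# Sketch: seconds of history covered by each RRA in the create command.
interval=35   # the --step value, in seconds

# rra_span is a hypothetical helper: steps-per-point * points kept * step length
rra_span() {
    echo $(( $1 * $2 * interval ))
}

span_1=$(rra_span 1 576)       # RRA:AVERAGE:0.5:1:576
span_6=$(rra_span 6 672)       # RRA:AVERAGE:0.5:6:672
span_24=$(rra_span 24 732)     # RRA:AVERAGE:0.5:24:732
span_144=$(rra_span 144 1460)  # RRA:AVERAGE:0.5:144:1460

for s in $span_1 $span_6 $span_24 $span_144; do
    echo "$s seconds = $(( s / 3600 )) hours (rounded down)"
done
```

If you change the interval, these spans change with it, so it's worth re-running the numbers before creating the database.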

Data Collection

On to the actual collection portion! Firstly, a host needs to be picked out for pinging purposes. You noticed at the beginning of the script my interval was set to 35. That's a pretty low interval for a host that you don't own. Luckily, I have access to a nice dedicated box that I can ping with abandon in order to get my stats. You probably won't be so lucky! You can try your ISP's gateway or DNS servers, but I would definitely recommend upping the interval to something greater than three minutes.

With that in hand, we continue. The collection itself is contained in a while loop, which any sysadmin should understand:

while (true); do ( sleep $interval ... ); done

Ok, so the first bit of the collection process is to get the average ping time and packet loss. My link is unreliable, so packet loss is of interest to me - it might not be so to you.

p=`ping -c4 $host`
# Pull the number out of e.g. "25% packet loss"
packet_loss=`echo "$p" | egrep -o "[0-9]+% packet loss" | egrep -o "^[0-9]+"`
# The summary line looks like "rtt min/avg/max/mdev = a/b/c/d ms";
# split on "/" and the average is the fifth field
latency=`echo "$p" | awk -F/ '/min\/avg\/max/ {print $5}'`
# No replies means no summary line, so record the latency as unknown
[ -z "$latency" ] && latency=U

That will ping $host 4 times and yield the total packet loss and average latency across those four. This is done using the version of ping included with Ubuntu 6.0.1 - anything else may yield different numbers or nothing at all.
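Since ping output differs between versions, it can help to try the parsing against canned output before pointing it at a live host. Here's a minimal sketch, assuming the "min/avg/max" summary line that Linux ping prints. The sample text, parse_loss and parse_avg are all made up for illustration.

```shell
#!/bin/sh
# Sketch: extract packet loss and average latency from canned ping output.
sample_ping='PING example.com (192.0.2.1) 56(84) bytes of data.
64 bytes from 192.0.2.1: icmp_seq=1 ttl=56 time=11.9 ms
64 bytes from 192.0.2.1: icmp_seq=2 ttl=56 time=12.4 ms
64 bytes from 192.0.2.1: icmp_seq=4 ttl=56 time=13.1 ms

--- example.com ping statistics ---
4 packets transmitted, 3 received, 25% packet loss, time 3004ms
rtt min/avg/max/mdev = 11.900/12.466/13.100/0.493 ms'

# parse_loss (hypothetical helper): the number in front of "% packet loss"
parse_loss() {
    echo "$1" | sed -n 's/.* \([0-9.][0-9.]*\)% packet loss.*/\1/p'
}

# parse_avg (hypothetical helper): split the summary line on "/";
# the fifth field is the average. Falls back to U (unknown) when
# there is no summary line, as happens at 100% loss.
parse_avg() {
    avg=$(echo "$1" | awk -F'/' '/min\/avg\/max/ { print $5 }')
    echo "${avg:-U}"
}

packet_loss=$(parse_loss "$sample_ping")
latency=$(parse_avg "$sample_ping")
echo "loss=$packet_loss latency=$latency"
```

Swapping sample_ping for the output of your own ping is a quick way to see whether your version needs a different pattern.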

Next, we need to determine the amount of traffic flowing through our ethernet device. My firewall's external interface is "eth1", your mileage may vary. I also blatantly ripped this (except the kilobyte conversion) from someone else's site. Sadly, my Internets are in such disarray that I can't find it ATM.

kbytes_in=`ifconfig eth1 |grep bytes|cut -d":" -f2|cut -d" " -f1`
kbytes_out=`ifconfig eth1 |grep bytes|cut -d":" -f3|cut -d" " -f1` 
kbytes_in=`expr $kbytes_in / 1000`
kbytes_out=`expr $kbytes_out / 1000`
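To see what that cut pipeline is actually grabbing, here's a sketch that runs it against a canned line instead of a live interface. The sample line is in the older ifconfig format with made-up numbers.

```shell
#!/bin/sh
# Sketch: the same cut pipeline, applied to a canned ifconfig line.
sample_line='          RX bytes:123456789 (117.7 MiB)  TX bytes:987654321 (941.9 MiB)'

# Splitting on ":" puts the RX count at the start of field 2 and the
# TX count at the start of field 3; the second cut keeps only the
# digits before the first space.
bytes_in=$(echo "$sample_line" | grep bytes | cut -d":" -f2 | cut -d" " -f1)
bytes_out=$(echo "$sample_line" | grep bytes | cut -d":" -f3 | cut -d" " -f1)

# Integer division, so anything under a full kilobyte is dropped
kbytes_in=$(( bytes_in / 1000 ))
kbytes_out=$(( bytes_out / 1000 ))
echo "in=$kbytes_in kB out=$kbytes_out kB"
```

Newer versions of ifconfig print the byte counts without colons, so this pipeline comes back empty there. On Linux the same counters can also be read from /sys/class/net/eth1/statistics/rx_bytes and tx_bytes.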

Puttin' On the Ritz

We've got all the data that we need now. It's just a quick rrdtool update away from being in our database!

rrdtool update status.rrd $now:$packet_loss:$latency:$kbytes_in:$kbytes_out
# $now is specified earlier in the script as `date +%s`

rrdtool fills in its data sources in the order in which they were defined in the create command. So, this is setting the timestamp $now to have the values specified in the colon separated list. Pretty simple, eh?
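Here's a small sketch of putting that update string together, with any missing value swapped for U so rrdtool records it as unknown. build_update is a made-up helper and the numbers are sample values. The commands are only printed, so this runs without a database.

```shell
#!/bin/sh
# Sketch: build the argument for "rrdtool update".
# build_update is a hypothetical helper, not part of rrdtool.
build_update() {
    # $1 = timestamp, then the four values in DS order;
    # an empty value becomes U (unknown)
    echo "$1:${2:-U}:${3:-U}:${4:-U}:${5:-U}"
}

now=1700000000   # would normally be $(date +%s)
update_arg=$(build_update "$now" 25 12.466 123456 987654)
missing_arg=$(build_update "$now" 100 "" 123456 987654)

# --template names the data sources explicitly, so the order of the
# values no longer has to match the order used at create time
echo "rrdtool update status.rrd --template packet_loss:latency:kbytes_in:kbytes_out $update_arg"
echo "rrdtool update status.rrd $missing_arg"
```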

Graphing

Aah, graphs! Everybody loves 'em. These are fairly straightforward (and my cat is on my keyboard) so I'm just going to copy-paste the most advanced one here. You can find the rest in my network_status.sh.

d_stamp=`date +%Y\/%m\/%d`
rrdtool graph combined.png -a PNG --title="$d_stamp - Combined Last Hour"\
    --vertical-label "Combined Data" --start=end-1h --end now\
    'DEF:mypacketloss=status.rrd:packet_loss:AVERAGE:step=1' \
    'DEF:mylatency=status.rrd:latency:AVERAGE' \
    'CDEF:mypacketloss_neg=mypacketloss,-1,*'\
    'AREA:mypacketloss_neg#FF0000:Packet Loss' \
    'GPRINT:mypacketloss:MAX:Max\: %2.1lf'\
    'GPRINT:mypacketloss:AVERAGE:Avg\: %2.1lf'\
    'GPRINT:mypacketloss:LAST:Last\: %2.1lf Percent\j'\
    'AREA:mylatency#0000FF:Latency' \
    'GPRINT:mylatency:MAX:Max\: %2.1lf'\
    'GPRINT:mylatency:AVERAGE:Avg\: %2.1lf'\
    'GPRINT:mylatency:LAST:Last\: %2.1lf ms\j'\
    > /dev/null    

The above graphing routine yields a lovely graph (well, yours probably won't display this much packet loss).

Conclusion

So, there you have it. Nice traffic monitoring with rrdtool. You can download the script here and give it a whirl. It uses nothing more fancy than ping, ifconfig and rrdtool itself.