Sunday, May 3. 2009

Time Series Analysis on RRD Files

Crist Clark, in a posting on the NANOG mailing list, started an interesting thread on analyzing network traffic by frequency rather than by the traditional time-based view. He opened by asking about Fourier analysis of network traffic time series. A number of responses suggested that wavelet analysis might be the 'more modern' approach, and noted that it has been used for network traffic anomaly detection. One response also indicates that analyzing the RTD (Round Trip Delay) of ping-generated traffic can reveal details of host IP stacks and operating system behavior.

Crist Clark started the thread:

Has anyone found any value in examining network utilization numbers with Fourier analyses? After staring at pretty MRTG graphs for a bit too long today, I'm wondering if there are some interesting periodic characteristics in the data that could be easily teased out beyond, "Well, the diurnal fluctuations are obvious, but looks like we may have some hourly traffic spikes in there too. And maybe some of those are bigger every fourth hour."
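As a rough illustration of the question above, here is a minimal sketch that runs a discrete Fourier transform over utilization samples and reports the strongest periods. Everything in it is an assumption for illustration: the 5-minute polling interval, the synthetic data (a baseline plus a 24-hour and a 1-hour cycle), and the function names. It uses a naive DFT from the standard library so it runs anywhere; on real data an FFT library would be the practical choice.

```python
import cmath
import math

SAMPLE_MINUTES = 5                           # assumed polling interval
SAMPLES_PER_DAY = 24 * 60 // SAMPLE_MINUTES  # 288 samples per day

def synthetic_traffic(days=2):
    """Made-up utilization: baseline + diurnal swing + smaller hourly cycle."""
    n = days * SAMPLES_PER_DAY
    return [
        100.0
        + 40.0 * math.sin(2 * math.pi * t / SAMPLES_PER_DAY)  # 24-hour cycle
        + 10.0 * math.sin(2 * math.pi * t / 12)               # 1-hour cycle
        for t in range(n)
    ]

def dft_magnitudes(x):
    """Naive O(n^2) DFT; returns |X[k]| for k = 0 .. n//2."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def dominant_periods_hours(x, count=2):
    """Periods (in hours) of the strongest non-DC frequency bins."""
    mags = dft_magnitudes(x)
    n = len(x)
    # Skip k = 0, which only carries the mean ("DC") level.
    ranked = sorted(range(1, len(mags)), key=lambda k: mags[k], reverse=True)
    return [n / k * SAMPLE_MINUTES / 60 for k in ranked[:count]]

periods = dominant_periods_hours(synthetic_traffic())
print(periods)  # strongest periods first, in hours
```

With real traffic the cycles will not line up exactly with the DFT bins, so the peaks will be smeared across neighbouring bins rather than landing on single values.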

Dave Plonka responded:

Such techniques are used in the area of network anomaly detection. For instance, a search for "network anomaly detection" at scholar.google.com will yield very many results.

Our 2002 paper, "A Signal Analysis of Network Traffic Anomalies" [ACM SIGCOMM Internet Measurement Workshop 2002, Barford, et al.], is one such work. We mention that we use wavelet analysis rather than Fourier analysis because wavelet/framelet analysis is able to localize events both in the frequency and time domains, whereas Fourier analysis would localize the events only in frequency, so an iterative approach (with varying intervals of time) would be necessary. In general, this is the reason why Fourier analysis has not been a common technique used in network anomaly detection.
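To make the localization point concrete, here is a small sketch using the Haar wavelet, the simplest wavelet there is. This is a stand-in chosen for brevity, not the wavelet/framelet system used in the paper, and the 64-sample series with a single spike is invented. Each detail coefficient covers a known span of samples, so the largest coefficient at each scale says when the event happened as well as at what scale.

```python
import math

def haar_step(x):
    """One level of the Haar transform: scaled pairwise sums and differences."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_details(x, levels):
    """Detail coefficients per level, finest scale first.

    len(x) must be divisible by 2**levels.
    """
    out = []
    approx = list(x)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        out.append(detail)
    return out

def locate_largest(details):
    """For each level, the (first, last) sample index covered by the
    largest-magnitude detail coefficient."""
    spans = []
    for level, detail in enumerate(details, start=1):
        i = max(range(len(detail)), key=lambda j: abs(detail[j]))
        width = 2 ** level  # samples covered by one coefficient at this level
        spans.append((i * width, (i + 1) * width - 1))
    return spans

# Hypothetical input: flat traffic with a one-sample burst at index 41.
series = [100.0] * 64
series[41] = 150.0

spans = locate_largest(haar_details(series, levels=3))
print(spans)  # each span brackets sample 41, more loosely at coarser scales
```

A plain Fourier transform of the same series would spread the spike across every frequency bin, with nothing in the magnitudes to say where in time it occurred.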

That work used data stored in RRD files at five minute intervals. Our subsequent work used data stored at one second intervals, again in RRD files.

Anton Kapela had a couple of messages and a link (look for Kapela):

Indeed, there are. Interesting things emerge in frequency (or phase) space - bits/sec, packets/sec, and ave size, etc. - all have new meaning, often revealing subtle details otherwise missed. The UW paper [Barford/Plonka et al.] is one of my favorites and often referenced in other publications.

Along similar lines, I presented a lightning talk at NANOG that demonstrates using windowed Ft's (mostly Gaussian or Hamming) in three-axis graphs (i.e. 'waterfalls') available in common tools (baudline, sigview, labview, etc) for characterizing round trip times through various network queues and queue states. Unexpectedly, interesting details regarding host IP stacks and OS scheduler behavior became visible.

I want to suggest that time windowed Ft might be a reasonable middle ground, certainly for Crist's case. Naturally, the trade-offs will be in frequency accuracy (i.e. longer window) vs. temporal accuracy (i.e. short window). Another solution for your needs might be cascaded FIR "bandpass" filters, but again, you're subject to time/frequency error trade-offs as related to a filter's bandwidth.
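The windowed transform described above can be sketched in a few lines. This is only an illustration under assumptions: the signal is synthetic and already has its mean removed, the window length (64) and hop (32) are arbitrary, and the function names are made up. Each frame is multiplied by a Hamming window and transformed separately, which gives one row of a 'waterfall'.

```python
import cmath
import math

def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def windowed_ft(x, window_len, hop):
    """Magnitude spectrum of each Hamming-windowed frame (naive DFT)."""
    w = hamming(window_len)
    frames = []
    for start in range(0, len(x) - window_len + 1, hop):
        seg = [x[start + i] * w[i] for i in range(window_len)]
        frames.append([
            abs(sum(seg[t] * cmath.exp(-2j * math.pi * k * t / window_len)
                    for t in range(window_len)))
            for k in range(window_len // 2 + 1)
        ])
    return frames

# Hypothetical zero-mean signal: silent for 128 samples, then an
# 8-sample-period oscillation for the remaining 128.
signal = [0.0 if t < 128 else 10.0 * math.sin(2 * math.pi * t / 8)
          for t in range(256)]

WINDOW, HOP = 64, 32
frames = windowed_ft(signal, WINDOW, HOP)

bin_of_interest = WINDOW // 8          # an 8-sample period falls in bin 8
track = [f[bin_of_interest] for f in frames]
peak_bin_last = max(range(len(frames[-1])), key=frames[-1].__getitem__)
print([round(v, 1) for v in track])    # energy in bin 8, frame by frame
```

The trade-off mentioned in the message shows up directly in the two parameters: the bin spacing is 1/WINDOW cycles per sample, so a longer window separates nearby frequencies better, but each frame then blurs together WINDOW samples of time.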

While you're at it, consider processing your time series data into histogram stacks, or nested histograms. I haven't specifically seen a paper covering this, but another UW gent (DW, are you reading this?) used to process their 30 second ifmib data into a raw .ps file, and printed this out weekly/daily. The trends visible here were quite interesting, but I don't think much further work was done to see if anything super-interesting was more/less visible in this form than traditional ones.
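The message does not spell out what a histogram stack looks like, so the following is only one possible reading: bin each day's samples into a histogram and stack the rows, so that shifts in the distribution from day to day stand out. The day labels, sample values and bin edges are all invented.

```python
def histogram(values, lo, hi, bins):
    """Counts of values in `bins` equal-width buckets spanning [lo, hi).

    Out-of-range values are clamped into the first or last bucket.
    """
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / width)
        counts[max(0, min(idx, bins - 1))] += 1
    return counts

# Hypothetical utilization samples (percent), a handful per day.
samples_by_day = {
    "Mon": [5, 12, 18, 22, 35, 38, 7, 14],
    "Tue": [6, 11, 25, 27, 31, 39, 33, 36],
}

# One histogram row per day; stacking the rows gives the "histogram stack".
stack = {day: histogram(vals, lo=0, hi=40, bins=4)
         for day, vals in samples_by_day.items()}

for day, row in stack.items():
    print(day, row)
```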

... one point - since packets/bits/etc data is more monotonic than not (math wizards, please debate/chime in) and since it's not a 'signal' in the continuous sense, you might find value in differentially filtering the input data *before* FT or wavelet processing. This would serve to remove the weird-looking "DC" offset in the output simply by creating a semi-even distribution of both positive and negative input sample values.
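A quick sketch of the differencing idea, assuming a made-up series with a large baseline and a 24-sample cycle. Taking first differences before the transform removes the baseline, so the "DC" bin no longer swamps everything else. The function names are illustrative, and the naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math

def dft_magnitudes(x):
    """Naive O(n^2) DFT; returns |X[k]| for k = 0 .. n//2."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def first_difference(x):
    """x[t+1] - x[t]; the output is one sample shorter than the input."""
    return [b - a for a, b in zip(x, x[1:])]

# Hypothetical input: baseline of 1000 with a 24-sample cycle on top.
series = [1000.0 + 50.0 * math.sin(2 * math.pi * t / 24) for t in range(97)]

raw = dft_magnitudes(series[:-1])                  # 96 samples, untouched
diffed = dft_magnitudes(first_difference(series))  # 96 samples, differenced

raw_peak = max(range(len(raw)), key=raw.__getitem__)
diff_peak = max(range(len(diffed)), key=diffed.__getitem__)
print(raw_peak, diff_peak)  # raw spectrum peaks at DC, differenced at the cycle
```

One thing to keep in mind: differencing is a high-pass filter, so it also attenuates slow cycles relative to fast ones. Peak heights in the differenced spectrum are therefore not directly comparable across frequencies.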