To demonstrate one approach to spike detection, we are exploring website visitor data from the BigQuery Public Google Analytics sample data bigquery-public-data.google_analytics_sample.ga_sessions_*
.
More specifically, the number of daily visits.
Now looking at the above graph, you don't need to be a statistician to work out where the spikes occurred. Visually obvious signs lead us to the conclusion that there have been two major spikes across the time series data.
When we talk about spikes what do we actually mean? In simple terms, the value for that day is out side of the expected range for what would be normal.
Now visually spotting spikes is great, but in the real world not actually that useful. Spikes often mean something is happening that you should know about. Now waiting until you have had a chance to manually look at some data isn't a great way to monitor anything.
So, how do we write a simple query to detect spikes in our time series data?
First we aggregate the data we need, in this case daily visits.
SELECT
date
, SUM(totals.visits) AS visits
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
GROUP BY date
We then find the moving average and standard deviation across the data using a 7 day window prior to the current day. The time window could be larger or smaller depending on your data.
SELECT
*
, AVG(visits) OVER(ORDER BY date ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING) AS visits_ma
, STDDEV(visits) OVER(ORDER BY date ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING) AS visits_stddev
FROM aggregation
Finally we use all of the above to detect spikes for a given day. Pay attention to this particular line
IF(visits > visits_ma + 2 * visits_stddev, 1, 0) AS spike
What we are saying here is if the current days visits is greater than the moving average + 2 times the standard deviation, we flag it as a spike.
The full query can be found below.
Resulting in:
Thats its, see its quite simple. The output of this query could trigger notifications or email alerts or sound alarms. Whatever the needs this simple technique can be applied to any time series data.
Adjusting the sensitivity of the spike detection can be done by changing the number of standard deviations, increase the value for detection of more extreme values.
Finally with regards to the Google Analytics data, vishalg19 on Reddit kindly pointed out that on real GA data using totals.visits
doesn't match up to the numbers you see in the dashboard for sessions. Instead he suggested the work around of COUNT(DISTINCT CONCAT(fullvisitorid, CAST(visitstarttime AS string))) AS sessions
as the aggregation metric and also include WHERE totals.visit > 0
as well.