In simple terms Z-scores tell you how far away from the mean a value is, more specifically how many standard deviations away it is.
Often Z-scores can be used as an easy method of outlier detection, i.e. spotting the unusual.
In todays simple example we are looking at London Cycle hire data from the open public data sets on BigQuery
We will be using Z-scores to identify bikes that have much higher average duration than normal. We are specifically looking for bikes which are at least 3 standard deviations away from the mean.
This is of course an example use case, however if you were a Cycle hire operator you might want to have a method to identify which bikes should have more regular service intervals based on the average duration they travel for each trip.
We are first taking the
AVG(duration) of each
, AVG(duration) AS avg_duration
We then calculate the Z-score for each
bike_id, this is easy using WINDOW functions. The
OVER() clauses after
STDDEV allow us to calculate the respective values across the entire dataset against each row of data.
(avg_duration - AVG(avg_duration) OVER ()) / STDDEV(avg_duration) OVER ()
Finally we filter based on the
WHERE z_score >= 3
When we run the query there are a total of 71 bikes out of 13k that match our criteria.