Working with time series data in SQL Server 2022 and Azure SQL
Part of the SQL Server 2022 blog series.
Time series data is a set of values organized in the order in which they occur and arrive for processing. Unlike transactional data in SQL Server, which is not time-based and may be updated often, time series data is typically written once and rarely, if ever, updated.
Some examples of time series data include stock prices, telemetry from equipment sensors on a manufacturing floor, and performance metrics from SQL Server, such as CPU, memory, I/O, and network utilization.
Time series data is often used for historical comparisons, anomaly detection and alerting, predictive analysis, and reporting, where time is a meaningful axis for viewing or analyzing data.
Time series capabilities in SQL Server were introduced in Azure SQL Edge, Microsoft's version of SQL Server for the Internet of Things (IoT), which combines capabilities such as data streaming and time series with built-in machine learning and graph features.
With SQL Server 2022 and Azure SQL, we've brought time series capabilities to the entire SQL Server family. Time series capabilities in SQL Server consist of enhancements to existing T-SQL functions for handling NULL values, plus two new functions that make working with temporal-based data easier than ever.
Create contiguous ranges with GENERATE_SERIES
When analyzing time series data, it’s common to create a contiguous set of datetime values in evenly spaced intervals (for example, every second) within a specific range. One way to accomplish this is by creating a numbers table, also known as a tally table, which contains a set of consecutive numbers between a lower and upper bound. The numbers in the table can then be used in combination with the DATEADD function to create the range of datetime values.
Prior to SQL Server 2022, creating a numbers table usually involved some form of common table expressions, a CROSS JOIN of system objects, looping, or some other creative T-SQL. These solutions are neither elegant nor efficient at scale, with additional complexity when the step between interval values is larger than 1.
The GENERATE_SERIES relational operator in SQL Server 2022 makes creating a numbers table simple by returning a single-column table of numbers between a start and stop value, with an optional step parameter defining the increment (or decrement) between consecutive values in the series:
GENERATE_SERIES (start, stop [, step ])
This example creates a series of numbers between 1 and 100 in steps of 5:
SELECT value
FROM GENERATE_SERIES(1, 100, 5);
Taking this concept one step further, the next example shows how GENERATE_SERIES is used with DATEADD to create a set of values between 1:00 PM and 2:00 PM in 1-minute intervals:
SELECT DATEADD(minute, s.value, 'Dec 10, 2022 1:00 PM') AS [Interval]
FROM GENERATE_SERIES(0, 59, 1) AS s;
If the step argument is omitted, a default value of 1 is used when computing interval values. GENERATE_SERIES also works with decimal values, provided that the start, stop, and step arguments are all the same data type. If start is greater than stop and the step is a negative value, the resulting series will be a decrementing set of values. If start is greater than stop and the step is positive, an empty table will be returned.
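For example, these (illustrative) queries return a decimal series and a descending series:

-- Decimal series from 0.0 to 1.0 in steps of 0.1; all three arguments share the same data type
SELECT value
FROM GENERATE_SERIES(CAST(0.0 AS DECIMAL(2,1)), CAST(1.0 AS DECIMAL(2,1)), CAST(0.1 AS DECIMAL(2,1)));

-- Descending series: 100, 95, ..., 5
SELECT value
FROM GENERATE_SERIES(100, 5, -5);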
Finally, GENERATE_SERIES requires a database compatibility level of 160 or higher.
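If you are working in a database created on an earlier version, you can raise the compatibility level with ALTER DATABASE (the database name below is a placeholder):

-- Set the database compatibility level to 160 (SQL Server 2022); replace YourDatabase with your database name
ALTER DATABASE YourDatabase SET COMPATIBILITY_LEVEL = 160;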
Group data in intervals with DATE_BUCKET
Time series data is often grouped into fixed intervals, or buckets, for analytical purposes. For example, sensor measurements taken every minute may be averaged over 15-minute or 1-hour intervals. While GENERATE_SERIES and DATEADD are used to create the buckets, we need a way to determine which bucket/interval a measurement belongs to.
The DATE_BUCKET function returns the datetime value corresponding to the start of each datetime bucket for an arbitrary bucket size, with an optional parameter to define the origin from which to calculate each bucket. If no origin is provided, a default of Jan 1, 1900, is used as the origin date:
DATE_BUCKET (datepart, number, date [, origin ])
The following example shows the buckets for Dec 10, 2022, for several date parts with a bucket size of 1 and an origin date of Jan 1, 2022:
DECLARE @date DATETIME = 'Dec 10, 2022 12:05 PM';
DECLARE @origin DATETIME = 'Jan 1, 2022 12:00 AM';
SELECT 'Now' AS [BucketName], @date AS [DateBucketValue]
UNION ALL
SELECT 'Year', DATE_BUCKET (YEAR, 1, @date, @origin)
UNION ALL
SELECT 'Quarter', DATE_BUCKET (QUARTER, 1, @date, @origin)
UNION ALL
SELECT 'Month', DATE_BUCKET (MONTH, 1, @date, @origin)
UNION ALL
SELECT 'Week', DATE_BUCKET (WEEK, 1, @date, @origin)
UNION ALL
SELECT 'Day', DATE_BUCKET (DAY, 1, @date, @origin)
UNION ALL
SELECT 'Hour', DATE_BUCKET (HOUR, 1, @date, @origin)
UNION ALL
SELECT 'Minutes', DATE_BUCKET (MINUTE, 1, @date, @origin)
UNION ALL
SELECT 'Seconds', DATE_BUCKET (SECOND, 1, @date, @origin)
Notice how the date bucket value for the Week date part is Dec 10, 2022, which is a Saturday. That’s because the provided origin date (Jan 1, 2022) is also a Saturday. (Note that the default origin date of Jan 1, 1900, is a Monday.) Therefore, when working with the Week date part, if you want your Week bucket to begin on a Sunday, be sure to use a known origin that falls on a Sunday, as in the example below.
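As an illustration (the variable names are arbitrary), using an origin that falls on a Sunday, such as Jan 2, 2022, shifts the Week bucket to start on Sunday:

-- Week buckets that start on Sunday (Jan 2, 2022 was a Sunday)
DECLARE @date DATETIME = 'Dec 10, 2022 12:05 PM';
DECLARE @sundayOrigin DATETIME = 'Jan 2, 2022 12:00 AM';
SELECT DATE_BUCKET (WEEK, 1, @date, @sundayOrigin) AS [WeekBucketStartingSunday];
-- Returns Dec 4, 2022, the Sunday of the week containing Dec 10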
Where DATE_BUCKET becomes especially useful is for bucket sizes larger than 1, for example when grouping data in 5-minute or 15-minute buckets:
SELECT 'Now' AS [BucketName], GETDATE() AS [BucketDate]
UNION ALL
SELECT '5 Minute Buckets', DATE_BUCKET (MINUTE, 5, GETDATE())
UNION ALL
SELECT 'Quarter Hour', DATE_BUCKET (MINUTE, 15, GETDATE());
DATE_BUCKET provides an easy way to determine which time-based interval a timestamped measurement belongs to, using any arbitrary-sized interval.
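As a practical sketch, bucketing is typically combined with GROUP BY to aggregate measurements per interval. The query below assumes the illustrative MachineTelemetry table used in the next section:

-- Average sensor readings over 15-minute buckets (illustrative schema)
SELECT DATE_BUCKET(MINUTE, 15, [timestamp]) AS [FifteenMinuteBucket]
, AVG(SensorReading) AS [AvgSensorReading]
FROM MachineTelemetry
GROUP BY DATE_BUCKET(MINUTE, 15, [timestamp])
ORDER BY [FifteenMinuteBucket];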
Gap analysis with FIRST_VALUE and LAST_VALUE
FIRST_VALUE and LAST_VALUE are not new to SQL Server 2022; what is new is how NULL values are handled. In previous versions of SQL Server, NULL values are always preserved.
When working with time series data, it’s possible to have gaps between measurements. Ideally, gaps are filled in with an imputed value. When using FIRST_VALUE and LAST_VALUE to compute the value corresponding to an interval, preserving NULL values isn’t ideal.
In the following example, a series of sensor readings taken at 15-second intervals has some gaps:
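The original post shows this data as a table of sample readings; a minimal setup script that produces a comparable table, assuming a simple schema with a timestamp column and a nullable reading (the values below are illustrative, not the original data), is:

-- Illustrative sample data: readings every 15 seconds, with NULL gaps
CREATE TABLE MachineTelemetry
(
[timestamp] DATETIME
, SensorReading DECIMAL(10, 2)
);

INSERT INTO MachineTelemetry ([timestamp], SensorReading)
VALUES ('2022-12-10 13:00:00', 61.80)
, ('2022-12-10 13:00:15', 62.25)
, ('2022-12-10 13:00:30', NULL) -- gap
, ('2022-12-10 13:00:45', 63.10)
, ('2022-12-10 13:01:00', NULL) -- gap
, ('2022-12-10 13:01:15', 61.95);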
If analyzing the data in 1-minute intervals (using DATE_BUCKET), the default value returned by FIRST_VALUE will include the NULL values:
SELECT [timestamp]
, DATE_BUCKET(MINUTE, 1, [timestamp]) AS [timestamp_bucket]
, SensorReading
, FIRST_VALUE (SensorReading) OVER (
PARTITION BY DATE_BUCKET(MINUTE, 1, [timestamp])
ORDER BY [timestamp]
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS [Default (RESPECT NULLS)]
FROM MachineTelemetry
ORDER BY [timestamp];
FIRST_VALUE and LAST_VALUE include new syntax (IGNORE NULLS or RESPECT NULLS) in SQL Server 2022 that allows you to decide how NULL values should be handled:
FIRST_VALUE ( [scalar_expression ] ) [ IGNORE NULLS | RESPECT NULLS ]
OVER ( [ partition_by_clause ] order_by_clause [ rows_range_clause ] )
RESPECT NULLS is the default behavior and will include NULL values in the result when computing the first or last value within a partition. Specifying IGNORE NULLS will cause NULL values to be excluded when computing the first or last value over a partition.
SELECT [timestamp]
, DATE_BUCKET(MINUTE, 1, [timestamp]) AS [timestamp_bucket]
, SensorReading
, FIRST_VALUE (SensorReading) IGNORE NULLS OVER (
PARTITION BY DATE_BUCKET(MINUTE, 1, [timestamp])
ORDER BY [timestamp]
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS [First Value (IGNORE NULLS)]
FROM MachineTelemetry
ORDER BY [timestamp];
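LAST_VALUE supports the same IGNORE NULLS syntax; a variant of the query above that returns the last non-NULL reading in each 1-minute bucket could look like this:

SELECT [timestamp]
, DATE_BUCKET(MINUTE, 1, [timestamp]) AS [timestamp_bucket]
, SensorReading
, LAST_VALUE (SensorReading) IGNORE NULLS OVER (
PARTITION BY DATE_BUCKET(MINUTE, 1, [timestamp])
ORDER BY [timestamp]
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS [Last Value (IGNORE NULLS)]
FROM MachineTelemetry
ORDER BY [timestamp];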
The new IGNORE NULLS and RESPECT NULLS options allow you to decide how NULL values should be treated when analyzing your data.
Learn more
Get started with SQL Server 2022 today. For more information and additional examples, visit the following resources: