Working with time series data in SQL Server 2022 and Azure SQL
Part of the SQL Server 2022 blog series.
Time series data is a set of values organized in the order in which they occur and arrive for processing. Unlike transactional data in SQL Server, which is not time-based and may be updated often, time series data is typically written once and rarely, if ever, updated.
Some examples of time series data include stock prices, telemetry from equipment sensors on a manufacturing floor, and performance metrics from SQL Server, such as CPU, memory, I/O, and network utilization.
Time series data is often used for historical comparisons, anomaly detection and alerting, predictive analysis, and reporting, where time is a meaningful axis for viewing or analyzing data.
Time series capabilities in SQL Server were introduced in Azure SQL Edge, Microsoft's version of SQL Server for the Internet of Things (IoT), which combines capabilities such as data streaming and time series with built-in machine learning and graph features.
With SQL Server 2022 and Azure SQL, we've brought time series capabilities to the entire SQL Server family. Time series capabilities in SQL Server consist of enhancements to existing T-SQL functions for handling NULL values, plus two new functions that make working with temporal-based data easier than ever.
Create contiguous ranges with GENERATE_SERIES
When analyzing time series data, it’s common to create a contiguous set of datetime values in evenly spaced intervals (for example, every second) within a specific range. One way to accomplish this is by creating a numbers table, also known as a tally table, which contains a set of consecutive numbers between a lower and upper bound. The numbers in the table can then be used in combination with the DATEADD function to create the range of datetime values.
Prior to SQL Server 2022, creating a numbers table usually involved some form of common table expressions, a CROSS JOIN of system objects, looping, or some other creative T-SQL. These solutions are neither elegant nor efficient at scale, with additional complexity when the step between interval values is larger than 1.
The GENERATE_SERIES relational operator in SQL Server 2022 makes creating a numbers table simple by returning a single-column table of numbers between a start and stop value, with an optional step parameter defining the increment (or decrement) between consecutive values in the series:
GENERATE_SERIES (start, stop [, step ])
This example creates a series of numbers between 1 and 100 in steps of 5:
SELECT value
FROM GENERATE_SERIES(1, 100, 5);
Taking this concept one step further, the next example shows how GENERATE_SERIES is used with DATEADD to create a set of values between 1:00 PM and 2:00 PM in 1-minute intervals:
SELECT DATEADD(minute, s.value, 'Dec 10, 2022 1:00 PM') AS [Interval]
FROM GENERATE_SERIES(0, 59, 1) AS s;
If the step argument is omitted, a default value of 1 is used when computing interval values. GENERATE_SERIES also works with decimal values, provided that the start, stop, and step arguments are all the same data type. If start is greater than stop and the step is a negative value, the resulting series will be a decrementing set of values. If start is greater than stop and the step is positive, an empty table will be returned.
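For example, these (illustrative) queries return a decimal series and a descending series:

-- Decimal series from 0.0 to 1.0 in steps of 0.1; all three arguments share the same data type
SELECT value
FROM GENERATE_SERIES(CAST(0.0 AS DECIMAL(2,1)), CAST(1.0 AS DECIMAL(2,1)), CAST(0.1 AS DECIMAL(2,1)));

-- Descending series: 100, 95, ..., 5
SELECT value
FROM GENERATE_SERIES(100, 5, -5);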
Finally, GENERATE_SERIES requires a database compatibility level of 160 or higher.
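If you are working in a database created on an earlier version, you can raise the compatibility level with ALTER DATABASE (the database name below is a placeholder):

-- Set the database compatibility level to 160 (SQL Server 2022); replace YourDatabase with your database name
ALTER DATABASE YourDatabase SET COMPATIBILITY_LEVEL = 160;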
Group data in intervals with DATE_BUCKET
Time series data is often grouped into fixed intervals, or buckets, for analytical purposes. For example, sensor measurements taken every minute may be averaged over 15-minute or 1-hour intervals. While GENERATE_SERIES and DATEADD are used to create the buckets, we need a way to determine which bucket/interval a measurement belongs to.
The DATE_BUCKET function returns the datetime value corresponding to the start of each datetime bucket for an arbitrary bucket size, with an optional parameter to define the origin from which to calculate each bucket. If no origin is provided, a default of Jan 1, 1900, is used as the origin date:
DATE_BUCKET (datepart, number, date [, origin ])
The following example shows the buckets for Dec 10, 2022, for several date parts with a bucket size of 1 and an origin date of Jan 1, 2022:
DECLARE @date DATETIME = 'Dec 10, 2022 12:05 PM';
DECLARE @origin DATETIME = 'Jan 1, 2022 12:00 AM';
SELECT 'Now' AS [BucketName], @date AS [DateBucketValue]
UNION ALL
SELECT 'Year', DATE_BUCKET (YEAR, 1, @date, @origin)
UNION ALL
SELECT 'Quarter', DATE_BUCKET (QUARTER, 1, @date, @origin)
UNION ALL
SELECT 'Month', DATE_BUCKET (MONTH, 1, @date, @origin)
UNION ALL
SELECT 'Week', DATE_BUCKET (WEEK, 1, @date, @origin)
UNION ALL
SELECT 'Day', DATE_BUCKET (DAY, 1, @date, @origin)
UNION ALL
SELECT 'Hour', DATE_BUCKET (HOUR, 1, @date, @origin)
UNION ALL
SELECT 'Minutes', DATE_BUCKET (MINUTE, 1, @date, @origin)
UNION ALL
SELECT 'Seconds', DATE_BUCKET (SECOND, 1, @date, @origin)
Notice how the date bucket value for the Week date part is Dec 10, 2022, which is a Saturday. That’s because the provided origin date (Jan 1, 2022) is also a Saturday. (Note that the default origin date of Jan 1, 1900, is a Monday.) Therefore, when working with the Week date part, if you want your Week bucket to begin on a Sunday, be sure to use a known origin that falls on a Sunday, as in the example below.
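As an illustration (the variable names are arbitrary), using an origin that falls on a Sunday, such as Jan 2, 2022, shifts the Week bucket to start on Sunday:

-- Week buckets that start on Sunday (Jan 2, 2022 was a Sunday)
DECLARE @date DATETIME = 'Dec 10, 2022 12:05 PM';
DECLARE @sundayOrigin DATETIME = 'Jan 2, 2022 12:00 AM';
SELECT DATE_BUCKET (WEEK, 1, @date, @sundayOrigin) AS [WeekBucketStartingSunday];
-- Returns Dec 4, 2022, the Sunday of the week containing Dec 10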
Where DATE_BUCKET becomes especially useful is for bucket sizes larger than 1, for example when grouping data in 5-minute or 15-minute buckets:
SELECT 'Now' AS [BucketName], GETDATE() AS [BucketDate]
UNION ALL
SELECT '5 Minute Buckets', DATE_BUCKET (MINUTE, 5, GETDATE())
UNION ALL
SELECT 'Quarter Hour', DATE_BUCKET (MINUTE, 15, GETDATE());
DATE_BUCKET provides an easy way to determine which time-based interval a timestamped measurement belongs to, using any arbitrary-sized interval.
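As a practical sketch, bucketing is typically combined with GROUP BY to aggregate measurements per interval. The query below assumes the illustrative MachineTelemetry table used in the next section:

-- Average sensor readings over 15-minute buckets (illustrative schema)
SELECT DATE_BUCKET(MINUTE, 15, [timestamp]) AS [FifteenMinuteBucket]
, AVG(SensorReading) AS [AvgSensorReading]
FROM MachineTelemetry
GROUP BY DATE_BUCKET(MINUTE, 15, [timestamp])
ORDER BY [FifteenMinuteBucket];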
Gap analysis with FIRST_VALUE and LAST_VALUE
FIRST_VALUE and LAST_VALUE are not new to SQL Server 2022; what is new is how NULL values are handled. In previous versions of SQL Server, NULL values are always preserved.
When working with time series data, it’s possible to have gaps between measurements. Ideally, gaps are filled in with an imputed value. When using FIRST_VALUE and LAST_VALUE to compute the value corresponding to an interval, preserving NULL values isn’t ideal.
In the following example, a series of sensor readings taken at 15-second intervals has some gaps:
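The original post shows this data as a table of sample readings; a minimal setup script that produces a comparable table, assuming a simple schema with a timestamp column and a nullable reading (the values below are illustrative, not the original data), is:

-- Illustrative sample data: readings every 15 seconds, with NULL gaps
CREATE TABLE MachineTelemetry
(
[timestamp] DATETIME
, SensorReading DECIMAL(10, 2)
);

INSERT INTO MachineTelemetry ([timestamp], SensorReading)
VALUES ('2022-12-10 13:00:00', 61.80)
, ('2022-12-10 13:00:15', 62.25)
, ('2022-12-10 13:00:30', NULL) -- gap
, ('2022-12-10 13:00:45', 63.10)
, ('2022-12-10 13:01:00', NULL) -- gap
, ('2022-12-10 13:01:15', 61.95);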
If analyzing the data in 1-minute intervals (using DATE_BUCKET), the default value returned by FIRST_VALUE will include the NULL values:
SELECT [timestamp]
, DATE_BUCKET(MINUTE, 1, [timestamp]) AS [timestamp_bucket]
, SensorReading
, FIRST_VALUE (SensorReading) OVER (
PARTITION BY DATE_BUCKET(MINUTE, 1, [timestamp])
ORDER BY [timestamp]
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS [Default (RESPECT NULLS)]
FROM MachineTelemetry
ORDER BY [timestamp];
FIRST_VALUE and LAST_VALUE include new syntax (IGNORE NULLS or RESPECT NULLS) in SQL Server 2022 that allows you to decide how NULL values should be handled:
FIRST_VALUE ( [scalar_expression ] ) [ IGNORE NULLS | RESPECT NULLS ]
OVER ( [ partition_by_clause ] order_by_clause [ rows_range_clause ] )
RESPECT NULLS is the default behavior and will include NULL values in the result when computing the first or last value within a partition. Specifying IGNORE NULLS will cause NULL values to be excluded when computing the first or last value over a partition.
SELECT [timestamp]
, DATE_BUCKET(MINUTE, 1, [timestamp]) AS [timestamp_bucket]
, SensorReading
, FIRST_VALUE (SensorReading) IGNORE NULLS OVER (
PARTITION BY DATE_BUCKET(MINUTE, 1, [timestamp])
ORDER BY [timestamp]
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS [First Value (IGNORE NULLS)]
FROM MachineTelemetry
ORDER BY [timestamp];
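LAST_VALUE supports the same IGNORE NULLS syntax; a variant of the query above that returns the last non-NULL reading in each 1-minute bucket could look like this:

SELECT [timestamp]
, DATE_BUCKET(MINUTE, 1, [timestamp]) AS [timestamp_bucket]
, SensorReading
, LAST_VALUE (SensorReading) IGNORE NULLS OVER (
PARTITION BY DATE_BUCKET(MINUTE, 1, [timestamp])
ORDER BY [timestamp]
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS [Last Value (IGNORE NULLS)]
FROM MachineTelemetry
ORDER BY [timestamp];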
The new IGNORE NULLS and RESPECT NULLS options allow you to decide how NULL values should be treated when analyzing your data.
Learn more
Get started with SQL Server 2022 today. For more information and additional examples, visit the following resources: