What to Know About Time-series Database Management Systems

Time-series database management systems (DBMS) are not new. For a few reasons, however, time-series technology is suddenly hot. Today, for example, there is no shortage of purpose-built time-series DBMS platforms, including several open-source software (OSS) platforms. Similarly, most commercial DBMSes now incorporate time-series capabilities, while vendors such as IBM, MongoDB, Microsoft, Oracle, Redis, and Snowflake, just to name a few, actively promote their platforms for time-series use cases.

Nor is that all. Prometheus, the OSS platform that provides core observability and analysis capabilities for the Cloud Native Computing Foundation’s cloud-native software stack, also incorporates a purpose-built time-series database engine.

Think about it: A time-series DBMS -- or, more precisely, a DBMS engine that can ingest, engineer, and query time-series data -- is an essential piece of next-gen software architecture.

The larger point is that time-series DBMS platforms are hot because time-series data is hot. And time-series data is hot for a few reasons, not least because it gives organizations a different kind of view -- a new lens, so to speak -- into the real-world behavior of their organic, physical, and virtual resources.

The Case for a Purpose-built Time-series DBMS

These observations invite a couple of questions. First, what distinguishes a purpose-built time-series DBMS from a DBMS that incorporates time-series capabilities? Second, when do you need a purpose-built time-series DBMS vs. a DBMS that provides time-series capabilities?

The high-level answers to these questions are as follows:

A purpose-built time-series database is designed to ingest and manage very large volumes of data. This stems from the fact that time-series records are usually written as append-only -- i.e., as new rows appended to a database table. For this reason, and owing to the sheer preponderance of time-series sources, time-series data volumes can mushroom rapidly.
A purpose-built time-series DBMS is designed to perform certain types of operations in or close to real-time (e.g., calculating min, mean, median, and max values for sensor data). A conventional DBMS can do this, too; however, it is easier to scale a time-series DBMS to perform concurrent ingest and analytical operations in real time on very large volumes of data.
A purpose-built time-series DBMS usually exposes APIs and/or a SQL-like query interface that consumers can use to retrieve and manipulate data. Again, this combination of large-volume data ingestion and analytical/query processing in real time is extremely demanding.
A purpose-built time-series DBMS uses algorithms and functions to resolve the set of conflicts, problems, and anomalies that is peculiar to time-series data: e.g., issues with clock-synchronization, unit-conversion, and correlation across different types of units, among others.
A purpose-built time-series database may eschew a traditional database schema to accommodate the real-time ingest and storage of very large volumes of data. So, for example, some time-series DBMS platforms use schema-on-read and/or schemaless data models.
In most cases, it is possible to configure a non-purpose-built DBMS, such as a key-value store or relational database management system (RDBMS), to ingest, manage, and perform operations on time-series data. In view of the requirements and problems discussed above, however, a purpose-built time-series DBMS tends to perform and scale much better, and will, on the whole, require less maintenance.
If your use case requires you to collect, store, and analyze time-series data generated by connected endpoints -- be they physical devices (such as sensors) or virtual instruments (e.g., software) -- you should consider using a purpose-built time-series database.
Even if a time-series use case starts out small (e.g., a finite group of sensors monitoring a single process), there is no guarantee it will remain small. Time-series use cases tend to compound as a function of economic and technological trends (IoT, interest in modeling and analyzing more specific or granular activities), and of the successful use of time-series data.
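
As a rough illustration of the summary operations mentioned above (min, mean, median, and max values for sensor data), the following Python sketch keeps a sliding window of recent readings and summarizes it on demand. The class name `RollingStats`, the window size, and the sample readings are all hypothetical; a real time-series DBMS would do this at far greater scale and with its own internals.

```python
from collections import deque
from statistics import mean, median


class RollingStats:
    """Keep the most recent `size` readings and summarize them on demand."""

    def __init__(self, size):
        # A bounded deque discards the oldest reading automatically.
        self.window = deque(maxlen=size)

    def add(self, value):
        self.window.append(value)

    def summary(self):
        values = list(self.window)
        return {
            "min": min(values),
            "mean": mean(values),
            "median": median(values),
            "max": max(values),
        }


stats = RollingStats(size=5)
for reading in [20.0, 21.0, 19.0, 22.0, 23.0, 30.0]:
    stats.add(reading)

summary = stats.summary()  # covers only the five most recent readings
print(summary)
```

The window is bounded on purpose: the summary reflects recent behavior, not the full history.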

This is just a précis of what is distinctive about a purpose-built time-series DBMS. Read on for more.

What Do a Purpose-built Time-series DBMS and a General-purpose DBMS Have in Common?

With respect to how it models, stores, and retrieves data, a purpose-built time-series database need not do anything special: A key-value store or RDBMS can store and retrieve time-series data, too. (In fact, almost all RDBMSes now implement a specific data type -- e.g., TIMESTAMP -- to classify time-series data.) Like these platforms, purpose-built time-series databases usually expose API endpoints designed to permit data access and manipulation. In the same way, almost all time-series DBMSes also expose SQL interfaces. In most cases, then, you would access and retrieve data from a time-series DBMS much as you would access and retrieve data from a key-value store or an RDBMS. (Coders gonna kvetch, but SQL is a useful, established language for accessing and manipulating data.) Given the ubiquity of SQL, it is also useful if a time-series DBMS can perform relational-like operations, such as joins, as it processes queries.
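
To make the point concrete, here is a minimal Python sketch that stores time-stamped readings in an ordinary relational engine (SQLite, via the standard-library `sqlite3` module) and summarizes them per minute with plain SQL. The table name `readings`, its columns, and the sample values are assumptions made for illustration; they are not drawn from any particular product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id TEXT, ts TIMESTAMP, value REAL)")

# Each observation is appended as a new row; nothing is updated in place.
rows = [
    ("vib-1", "2024-01-01 00:00:05", 0.25),
    ("vib-1", "2024-01-01 00:00:35", 0.75),
    ("vib-1", "2024-01-01 00:01:10", 1.0),
]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)

# Bucket the readings by minute and summarize each bucket.
per_minute = conn.execute(
    """
    SELECT strftime('%Y-%m-%d %H:%M', ts) AS minute,
           MIN(value), AVG(value), MAX(value)
    FROM readings
    WHERE sensor_id = ?
    GROUP BY minute
    ORDER BY minute
    """,
    ("vib-1",),
).fetchall()

for row in per_minute:
    print(row)
```

This works well at small scale. The sections that follow explain why the same approach becomes harder to sustain as ingest volume grows.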

Again, this might be why virtually all commercial RDBMS platforms (and most OSS ones) now support a specific time-series data type. Indeed, at least one OSS time-series database (TimescaleDB) is based on an underlying OSS relational platform (PostgreSQL). But TimescaleDB is not vanilla PostgreSQL; it was engineered to address the unique requirements of time-series use cases.

In fact, the phrase “need not do anything special” is doing quite a bit of heavy lifting in the first paragraph in this section. In other words, while it is true that a time-series database need not do anything special with respect to how it models and stores data, a purpose-built time-series DBMS must do several things differently to deal with the unique set of conflicts, dissimilarities, and anomalies that may occur as the DBMS ingests, retrieves, and performs operations on time-series data -- often in real time.

What Is Different About a Time-series DBMS

A Radical Difference in Scale?

Time-series data nominally consists of two columns and a key. This means that if database schema were the only factor, time-series data could easily be managed using a key-value store or RDBMS.

The curveball comes by way of the volume of data that a time-series DBMS must ingest, manage, and perform operations on, sometimes in support of applications that require real- or right-time results. A conventional DBMS platform can usually be configured to store and manage time-series data. In production use cases, however, a general-purpose DBMS will not be able to keep up with demand.[1]

So, for example, in OLTP-like applications, an RDBMS records data in response to specific, usually aperiodic events. In most cases, too, the RDBMS updates an existing record, instead of appending a new one. Typically, the volume of OLTP data increases only marginally, if at all, usually in response to predictable changes (a new database schema, additional tables, indexes, materialized views, etc.).

Time-series data is different. First of all, time-series events are usually periodic. The “event” that triggers a database write is usually the sampling interval for which the DBMS records data: The interval itself is the event. Second, time-series records do not get updated: The DBMS appends each event as a new record. Absent automated pruning, time-series data volumes can mushroom at a very rapid rate.
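
The append-only pattern and the role of automated pruning can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular DBMS implements retention: the class name `AppendOnlySeries` and the `retention_seconds` parameter are hypothetical, and timestamps are assumed to arrive in increasing order.

```python
from collections import deque


class AppendOnlySeries:
    """Append-only store that discards records older than a retention window."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.records = deque()  # (timestamp, value) pairs, oldest first

    def append(self, timestamp, value):
        # Records are only ever added at the end, never updated in place.
        self.records.append((timestamp, value))
        self._prune(now=timestamp)

    def _prune(self, now):
        # Drop records that have fallen out of the retention window.
        cutoff = now - self.retention_seconds
        while self.records and self.records[0][0] < cutoff:
            self.records.popleft()


series = AppendOnlySeries(retention_seconds=10)
for second in range(30):  # one sample per second: the interval is the event
    series.append(second, 0.5)

print(len(series.records), "records retained out of 30 written")
```

Without the pruning step, the store would grow by one record per sampling interval indefinitely.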

In real-world, production use cases, time-series databases are used to support data-intensive applications, such as telemetry monitoring, that require the databases to record dozens, hundreds, or potentially even thousands of observations per second. That isn’t all. In production use cases, time-series event data is rarely generated by a single, isolated signaler -- i.e., a single photosensor or vibration sensor on an individual turbine. Rather, a time-series DBMS must record and, if necessary, correlate the event data generated by hundreds, even thousands, of distributed sensors. If you have 10,000 sensors that each generate 100 events per second, the DBMS must record one million transactions per second. Moreover, it must do this in addition to supporting its concurrent query-processing workload. (In this respect, time-series workloads are part OLTP, part analytical -- much like a translytical DBMS platform.) On the plus side, because time-series events are usually appended as new records, rather than as updates, some of the ACID safeguards that -- in OLTP applications -- tend to impair DBMS performance are not required. So, for example, the database need not lock a row as it writes data.
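
The arithmetic in the example above is easy to extend to daily volumes. In the Python sketch below, the sensor count and event rate come from the text; the 16-byte record size is purely an assumption (an 8-byte time stamp plus an 8-byte value, ignoring keys, indexes, and compression), so treat the storage figure as a back-of-the-envelope estimate only.

```python
SENSORS = 10_000
EVENTS_PER_SENSOR_PER_SECOND = 100
SECONDS_PER_DAY = 86_400
BYTES_PER_EVENT = 16  # assumption: 8-byte time stamp + 8-byte value, no overhead

events_per_second = SENSORS * EVENTS_PER_SENSOR_PER_SECOND
events_per_day = events_per_second * SECONDS_PER_DAY
raw_bytes_per_day = events_per_day * BYTES_PER_EVENT

print(f"{events_per_second:,} events per second")
print(f"{events_per_day:,} events per day")
print(f"{raw_bytes_per_day / 1e12:.2f} TB per day before compression")
```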

Timing Is (Quite Literally) Everything?

The adjective “distributed” in the prior paragraph is significant. The flood of data that is generated by a time-series signaler (a vibration sensor, for example) is always time-stamped. This time stamp is derived from a clock. However, in distributed applications, clock synchronization is a nontrivial problem. Phenomena such as clock skew and clock drift may distort the trends captured in a time series.

Moreover, as previously noted, an individual sensor is relatively rare: It is more likely that (e.g.) a turbine will have n vibration sensors mounted at critical points. In the same way, n turbines will have n sensors mounted at critical points. By definition, these sensors are distributed. Moreover, each has its own clock, although, in production use cases, this clock is usually synchronized to a master clock. The problem is that time synchronization is only so effective. With this in mind, it is helpful to understand the totality of n sensors in n turbines as also comprising a network of quasi-autonomous clocks.

So, for example, minor differences between the clocks in sensors a1, a2, a3, and a4 may increase over time. (This is clock drift.) And even if a system uses a mechanism to synchronize the clocks among distributed devices, it must also control for clock skew. For these reasons, a purpose-built time-series DBMS incorporates error-correction algorithms to resolve common time-related conflicts, anomalies, etc.
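
One simple way to picture this kind of correction is to fit a straight line between a sensor's clock and a reference clock at a few synchronization points, then use that line to map sensor time stamps back to reference time. The Python sketch below assumes a linear model (a fixed offset plus a constant rate error); the function names and sample values are hypothetical, and production systems may use quite different techniques.

```python
def fit_clock(sensor_times, reference_times):
    """Least-squares fit of reference_time = rate * sensor_time + offset."""
    n = len(sensor_times)
    mean_s = sum(sensor_times) / n
    mean_r = sum(reference_times) / n
    cov = sum(
        (s - mean_s) * (r - mean_r)
        for s, r in zip(sensor_times, reference_times)
    )
    var = sum((s - mean_s) ** 2 for s in sensor_times)
    rate = cov / var                  # corrects the drift (rate error)
    offset = mean_r - rate * mean_s   # corrects the fixed difference
    return rate, offset


def to_reference_time(sensor_timestamp, rate, offset):
    return rate * sensor_timestamp + offset


# Hypothetical sync points: the sensor clock starts 2 s ahead and runs 1% fast.
reference = [0.0, 100.0, 200.0]
sensor = [2.0, 103.0, 204.0]

rate, offset = fit_clock(sensor, reference)
corrected = to_reference_time(103.0, rate, offset)
print(round(corrected, 6))
```

Here the sensor reading stamped 103.0 maps back to roughly 100.0 on the reference clock. With noisy sync points, the fit would be approximate rather than exact.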

Querying Across Dimensions?

A related issue is that the time-series data associated with a single source or type of source (e.g., an individual vibration sensor, or vibration sensors as a source-type) is only so useful in isolation. It is most useful when combined with data belonging to other time-series source-types or to other, non-time-series applications. Say, for example, that a turbine has both temperature and vibration sensors: Is there a relationship between vibration and temperature? Does vibration increase or decrease as temperature increases? Is there an optimal temperature (say, 51° C) at which vibration is minimized?

An industrial engineer, data scientist, etc., can query a time-series database to answer a question such as this. But queries of this kind can be more or less complex, especially if answering them requires that the database apply functions/algorithms to correct dissimilarities in measurement, or to resolve other anomalies. (For example, to convert between different units of pressure -- say, Pascals to PSI -- or from the time to the frequency domain, etc.) As discussed in the prior section, the DBMS may have to enforce logic to resolve clock-drift problems if querying in close to real time against (e.g.) sensor data.
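
The two ideas in this passage, converting units and relating one measurement to another, can be sketched in Python as follows. The pascal-to-PSI factor is the standard one; the helper names, the one-degree temperature buckets, and the sample readings are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

PASCALS_PER_PSI = 6894.757293168  # 1 psi expressed in pascals


def pascals_to_psi(pascals):
    return pascals / PASCALS_PER_PSI


def temperature_with_least_vibration(samples):
    """samples: (temperature_c, vibration) pairs taken at matching times."""
    by_temp = defaultdict(list)
    for temp, vibration in samples:
        by_temp[round(temp)].append(vibration)  # bucket to the nearest degree
    # Pick the bucket whose average vibration is lowest.
    return min(by_temp, key=lambda t: mean(by_temp[t]))


# Hypothetical paired readings from one turbine.
samples = [(49.0, 0.50), (49.2, 0.54), (51.0, 0.30), (50.9, 0.34), (53.1, 0.62)]

best_temp = temperature_with_least_vibration(samples)
atmosphere_psi = pascals_to_psi(101_325)

print(best_temp, "degrees C shows the lowest average vibration")
print(f"{atmosphere_psi:.3f} psi")
```

The sketch assumes the temperature and vibration readings have already been paired by time, which is precisely where the clock issues discussed earlier come into play.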

Querying Across Data Models?

The time-series DBMS may also need to query other types of databases, such as an RDBMS that records different kinds of metrics as part of a manufacturing process. Ideally, an organization would persist precalculated answers to useful questions relating to time series in a data warehouse or similar platform. This would simplify the task of contextualizing time-series data with other useful business data.

However, this is not always practicable. After all, engineers or data scientists must first identify useful time-series queries to precalculate them for the warehouse! Ergo: For experimental use cases in which an industrial engineer, process engineer, etc., is just asking questions, or for real-time queries in response to emergent problems, it is useful if a time-series DBMS can (a) query against different types of external databases and (b) contextualize the results of these queries with time-series data.

Most purpose-built time-series databases can perform multidimensional/multimodel queries of this kind. Yes, an RDBMS with time-series capabilities can perform this workload, too -- i.e., the DBMS can query against both relational and time-series data. In real-world use cases, however, it likely will struggle to support both real-time ingest and complex time-series analytical-processing workloads.
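
Contextualizing time-series data with records from another system often comes down to an "as-of" lookup: for each reading, find the most recent record that was in effect at that time. The Python sketch below uses hypothetical production-batch records standing in for the relational side; the names `batches`, `readings`, and `batch_for` are illustrative only.

```python
from bisect import bisect_right

# Hypothetical relational data: (start_time, batch_id), sorted by start_time.
batches = [(0, "batch-A"), (100, "batch-B"), (250, "batch-C")]

# Hypothetical time-series data: (timestamp, vibration).
readings = [(20, 0.31), (120, 0.44), (249, 0.47), (300, 0.29)]

batch_starts = [start for start, _ in batches]


def batch_for(timestamp):
    """Return the batch that was running at `timestamp`, or None if none was."""
    index = bisect_right(batch_starts, timestamp) - 1
    return batches[index][1] if index >= 0 else None


contextualized = [(ts, value, batch_for(ts)) for ts, value in readings]

for row in contextualized:
    print(row)
```

Because the batch list is sorted, each lookup is a binary search, which keeps this kind of join cheap even as the number of readings grows.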

Coda: Time-series Data and the (Increasingly) Observable World

The prior examples have focused on connected sensors. But the fact is that a sensor (or any other connected device, for that matter) is just one type of signaler. Another, increasingly common signaler is software itself: i.e., platforms, systems, middleware, applications, services, microservices, and so on.

Software architects and engineers are prioritizing the development not only of observable software -- i.e., software that generates useful data which can be used to anticipate, diagnose, and correct problems -- but of observable systems, too. For the purposes of this article, what matters is that vendors and enterprises alike are building observability instrumentation into their software. The time-series data generated by this instrumentation is an essential ingredient in the observability cocktail. It is, however, just one of several essential ingredients; time-series data is most useful if contextualized with multidimensional data derived from other producers. (See the previous section for more on this.)

Observability is not just a concept in software architecture and engineering, however. Rather, it is a metaphor that gets at a new way of thinking about how we apprehend and model different facets or shards of our world. At a minimum, the concept of observability gives us both a program and a set of patterns we can use to assemble the facets or shards of our models into hyper-realistic mosaics: i.e., holistic representations of complex behaviors in the world. At its most audacious, the concept of observability aspires to model reality itself. This, for example, is the logic of the so-called digital twin.

The digital twin is audacious to the point of naïveté. What is important, however, is that observability also aims to enable us to manipulate events (e.g., tasks, activities, or interventions that we apply to app workflows or digital business services) at higher levels of abstraction. Architects, software engineers, etc., aim to design observable software abstractions that are easier to operate, manipulate, and change.

So, for example, observability in the context of an abstracted business service (say, an e-tailer’s virtual catalog service) makes it easier for operations personnel to provision extra resources in response to an observed service impairment -- or, if necessary, to redirect shoppers to resources that engineers have provisioned in a separate cloud region or data/co-location center. True, it is possible to perform these tasks today, but the logic of observability aims to create higher-level abstractions (e.g., an observable customer-onboarding workflow or an observable ecommerce virtual catalog service) that are susceptible to control. In this respect, observability is inseparable from data collection, analytics, and (rule-based) automation: On the one hand, the data generated by observability instrumentation provides a lens into what is happening at certain pre-determined levels of abstraction; on the other hand, software architects and engineers can design orchestration workflows that permit manipulation at these same levels of abstraction. Observability instrumentation and the data it generates likewise permit IT to anticipate and proactively respond to other potential problems. Lastly, it gives decision makers better insight into the impact of service impairment or disruption on operations, revenues, etc.

The sine qua non of this is data: more data, with different characteristics, derived from more and varied producers. Looked at in isolation, the flood of sensor data seems incoherent. Framed in a context of some kind (e.g., improved engine performance and/or longevity), this flood can be modeled in such a way as to represent a useful shard or facet of the world. Of course, sensor data is not exclusively time series in character. Nevertheless, as with observability instrumentation of any type, the time dimension constitutes an essential lens for an enormous set of possible use cases. The upshot is that time-series data is poised to become indispensable: If you are not already using a purpose-built time-series database -- or the time-series features of a general-purpose DBMS -- you will at some point.

Thanks to Tim Hall of InfluxData, Phil Harvey of Microsoft, and Mark Madsen of Teradata for context and criticism.

____________________________________________________________

[1] Processing time-series data at this scale should not be a problem for a massively parallel processing data warehouse, especially if it uses an append-only schema -- i.e., it appends each new TIMESTAMPed time-series event as a new record in a column. There are other problems with this scheme, however. Read on in the article above to discover what they are.
