By Michael Risse, VP/CMO at Seeq Corporation
Recently I wrote about the explosion of interest and innovation in time-series data storage offerings, including historians, open source and data-lake options, and cloud-based services. With so many choices, industrial-manufacturing customers can expect to find a data-management option that fits their needs. Whatever the priority (data governance, data consolidation, security, analytics or a cloud-first initiative), customers will have many good choices for where to store data.
At the same time, an organization planning to consolidate its manufacturing data in an enterprise historian or data lake may find that it does not accrue the same benefits as it does with other data types. In fact, organizations that justify aggregating their manufacturing data with expectations drawn from relational data will be disappointed with the results. With some types of data, consolidation in a single system provides advantages for analytics and insights as compared to distributed data sets, but it’s not the same with time-series data. Whether it’s a data historian, lake, platform, pond, puddle or silo, time-series data won’t necessarily yield better insights just because it’s all in one place.
To understand this, let’s consider scenarios where centralizing data does benefit the user, starting with relational data. Relational data has keys that work as handles to the data (tables, fields and column names), so aggregating or centralizing data yields more possible relationships among the tables, fields, and databases. This isn’t new; business intelligence solutions (Cognos, PowerBuilder, etc.) gained significant traction with this approach starting in the 1990s. Today, data storage is so inexpensive that vendors can offer platforms providing “any to any” indexes, enabling complete self-service for a business analyst.
Another example is platforms that index all data contained within a semi-structured data set: think web pages, machine logs, and various forms of “digital exhaust.” Two variations of this approach are used by Google and by “document” NoSQL databases such as MongoDB. The idea is that the structure of the data doesn’t have to be consistent or defined in advance, as it does in a relational table. Instead, a schema is overlaid on the fly, or after the fact, which enables the user to work with any “handle” created by the index. Again, this means that the more data is centralized and indexed, the better. Users see more insights across larger data sets, and the data is pre-indexed, organized and ready to work with.
Keep your data where it is
With structured (relational) and semi-structured (log files, web pages) data both serving as success stories for centralization, it’s easy to see why one might assume that consolidating time-series data in one place would yield equal benefits for end users. It doesn’t. Vendors of IT-centric data solutions may convince themselves that their centralization models apply to time-series data, but the attempt is like climbing a greasy flagpole: it doesn’t work without handles.
Why is this? Time-series data simply doesn’t lend itself to pre-processing the way structured data (relationships) or semi-structured data (indexes) does. There are no “handles” in a time-series signal, so there is no way to add value by pre-processing the data for analytics. This is a key issue for engineers working with the data because, at the time of analysis, they have to find a way to integrate “What am I measuring?” (the sensor data) with “What am I doing?” (what an asset or process is doing at the time) and even “What part of the data is important to me?”
As an example of the challenges in working with time-series data, let’s consider a simple time-series data set that has sensor data recorded every second for a year, or roughly 31.5M samples (365 days × 86,400 seconds per day), in the form timestamp:value.
Most likely, the user doesn’t want all the signal data for their analysis; instead they want to identify time periods of interest “within” the signal. For example, perhaps the user needs handles to periods of time within the data for analysis defined by:
- Time period, such as by day, by shift, by Tuesdays, by weekdays vs. weekends, etc.
- Asset state: on, off, warm up, shutdown, etc.
- A calculation: for example, time periods when the second derivative of a moving average is negative
- Context from a manufacturing application such as an MES, for example when a plant or process line is making a certain product
- Data from a business application, for example when energy price is greater than x
- A multi-dimensional combination of any or all of these time periods of interest (like where they overlap, or where they don’t)
- An event, for example if a user wants to see data for the 90-minute period prior to an alarm
In other words, time periods of interest are when a defined condition is true, and the rest of the data can be ignored for the analysis.
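To make the idea of “handles” concrete, here is a minimal sketch in Python (standard library only) of how time periods of interest might be derived from a raw signal at analysis time. Everything in it is an illustrative assumption rather than any product’s method: the synthetic one-sample-per-minute signal, the threshold of 70, the 01:00 to 03:00 “shift,” the alarm time, and the helper names `periods_where`, `intersect` and `window_before`.

```python
from datetime import datetime, timedelta

# Hypothetical signal: one (timestamp, value) sample per minute for four hours.
# The value ramps from 20 to 79 within each hour, then resets.
start = datetime(2024, 1, 1, 0, 0)
samples = [(start + timedelta(minutes=i), 20.0 + (i % 60)) for i in range(240)]


def periods_where(data, condition):
    """Return (start, end) periods over which condition(ts, value) holds.

    A period ends at the first sample where the condition is false, or at
    the last sample if the signal ends while the condition is still true.
    """
    periods, open_start, last_ts = [], None, None
    for ts, value in data:
        if condition(ts, value):
            if open_start is None:
                open_start = ts
        elif open_start is not None:
            periods.append((open_start, ts))
            open_start = None
        last_ts = ts
    if open_start is not None:
        periods.append((open_start, last_ts))
    return periods


def intersect(a, b):
    """Return the overlap of two sorted lists of (start, end) periods."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance whichever period finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def window_before(event_ts, minutes=90):
    """Return the period of the given length that ends at an event."""
    return (event_ts - timedelta(minutes=minutes), event_ts)


# Condition on the measurement: value at or above an assumed threshold.
high = periods_where(samples, lambda ts, v: v >= 70.0)
# Condition on the clock: an assumed "shift" running from 01:00 to 03:00.
shift = periods_where(samples, lambda ts, v: ts.hour in (1, 2))
# Multi-dimensional combination: high readings that occur during the shift.
high_in_shift = intersect(high, shift)
# Event-based period: the 90 minutes before an assumed alarm at 03:30.
pre_alarm = window_before(start + timedelta(hours=3, minutes=30))

for period_start, period_end in high_in_shift:
    print(period_start.time(), "to", period_end.time())
```

None of these periods exists in the stored timestamp:value pairs. Each one is created by a condition the engineer defines at the moment of analysis, which is why the same signal can be sliced in entirely different ways for different questions.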