As the hardware required to collect, synthesize, and analyze data becomes cheaper, the opportunity to apply Machine Learning solutions to complex environments grows. One area ripe for disruption is industrial production, where constant streams of data record measurements from different sections of the production process. However, the promise of fresh insights that these analytics unlock is not without its pitfalls. As we recently discovered through work with an industrial company, any effort to distill this influx of information requires a balanced approach: looking at the data through the lenses of both a Data Scientist and an Industrial Engineer. We have split this project review into two posts for readability: the first covers data assumptions and preprocessing, while the second focuses on modeling approaches and architecture.
Initial Query: Is it possible to accurately forecast the stability of the process using current and historical temperature, pressure, and flow metrics from across the plant to better understand what causes negative outcomes?
This query aligns well with standard multivariate time-series methodology. Still, as we confronted the intricacies of the project, we found that many adjustments were required to overcome the assumptions made about the data-generating process.
Using Microsoft Azure and Databricks, we configured an architecture to handle the velocity and volume of the dataset. Azure Data Lake Storage Gen2 served as the storage system for the raw and preprocessed datasets, with Databricks serving as our main ETL and analytics engine. To balance cost considerations against computational complexity, we configured an auto-scaling Standard_DS3_v2 cluster (2 workers minimum, 10 maximum) on the Databricks 8.3 ML Runtime, which comes pre-installed with the required Machine Learning packages. Most of the preprocessing and modeling steps used Spark implementations to minimize the memory requirements of each step, except where that couldn't be avoided. To ingest new streaming data from the plant sensors, an Event Hub was configured, with a Databricks notebook moving data from raw to processed by applying the required transformations and generating predictions from a serialized version of the saved model. The scalability and flexibility that Databricks provided throughout this project proved why it is one of the premier services in the cloud computing landscape today.
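As a rough sketch of what this ingestion path can look like in a Databricks notebook (where `spark` and `sc` are predefined), the snippet below reads the Event Hub stream with the azure-event-hubs-spark connector and lands parsed records in Delta on ADLS Gen2. The connection string, storage paths, and payload schema are placeholders, not the project's actual configuration:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, TimestampType, DoubleType, StringType

# Hypothetical connection string; in practice this comes from a secret scope.
connection_string = "Endpoint=sb://<namespace>.servicebus.windows.net/;..."
eh_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string)
}

# Assumed payload shape: one JSON record per sensor reading.
schema = StructType([
    StructField("ts", TimestampType()),
    StructField("sensor_id", StringType()),
    StructField("value", DoubleType()),
])

# The Event Hubs connector delivers the payload in a binary `body` column.
raw = spark.readStream.format("eventhubs").options(**eh_conf).load()

parsed = (raw
          .select(F.from_json(F.col("body").cast("string"), schema).alias("r"))
          .select("r.*"))

# Append the parsed stream to a Delta table on ADLS Gen2 (hypothetical paths).
(parsed.writeStream
       .format("delta")
       .option("checkpointLocation", "abfss://processed@<account>.dfs.core.windows.net/_chk/readings")
       .outputMode("append")
       .start("abfss://processed@<account>.dfs.core.windows.net/readings"))
```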
Our dataset was an aggregation of sixty to seventy sensors measuring Pressure, Temperature, Flow Rate, and other physical attributes of the production process. Readings were taken at thirty-second intervals and sent to Microsoft Azure to capture the current state across the plant's target area, yielding a consistent flow of data from the source through to a processed dataset that also contained multiple years of historical readings. Our project group partnered with the company's engineering team to get a holistic view of the process, combining Data Science best practices with assumptions backed by Chemical Engineering principles to extract the most insight from the data.
Before we could start modeling for predictive stability, we first had to cleanse impurities in the source data. When working with data from sensors and other IoT devices, it is imperative that noise and faulty or incoherent values be handled in the preprocessing stage itself. Because each metric has a different valid range of readings (Pressure measured in millibar, for example, has a different valid range than Temperature measured in Celsius), we used distribution plots of the data in tandem with feedback from the engineering team to define plausible bounds for each sensor. Values that fell outside these bounds were added to the subset of data considered bad or missing, and a multi-step replacement procedure was implemented to approximate the most likely actual reading at each point in time. Since we were trying to capture anomalous behavior, uniformly replacing all of these values was not viable, because each error could represent something different for the system. Where a measurement existed but fell outside the bounds laid down by the engineering team, we replaced it with the nearest boundary value, which limits the effect of these outliers while still capturing the potentially anomalous behavior. Missing values were replaced by a five-minute moving average of past and future valid readings, which assumes process continuity while covering for possible sensor downtime or errors.
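A minimal PySpark sketch of this two-step replacement, assuming a wide table `plant_readings` with a `ts` timestamp column; the sensor names and bounds below are illustrative, not the real limits set by the engineering team:

```python
from pyspark.sql import functions as F, Window

# Hypothetical per-sensor valid ranges (agreed with the engineering team).
bounds = {"pressure_01": (850.0, 1150.0), "temp_01": (-10.0, 250.0)}

df = spark.table("plant_readings")  # assumed source table

for col, (lo, hi) in bounds.items():
    # Step 1: clip out-of-bounds readings to the nearest boundary,
    # limiting outlier impact while retaining the anomalous signal.
    df = df.withColumn(
        col,
        F.when(F.col(col) < lo, F.lit(lo))
         .when(F.col(col) > hi, F.lit(hi))
         .otherwise(F.col(col)),
    )

# Step 2: fill missing readings with a centered five-minute moving average
# (+/- 300 seconds of valid neighbors), assuming process continuity.
# A production pipeline would partition this window (e.g., by day) to
# avoid pulling all rows into a single partition.
w = Window.orderBy(F.col("ts").cast("long")).rangeBetween(-300, 300)
for col in bounds:
    df = df.withColumn(col, F.coalesce(F.col(col), F.avg(col).over(w)))
```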
One last factor we had to consider for this project was the noise that typically accompanies streaming data from IoT devices. Turning a continuous problem into a discrete one requires sampling at regular intervals, in our case once every thirty seconds. Because each data point is a sample from the continuous flow inside the plant, every reading likely includes minor fluctuations around the true operating point at the time of measurement. To reduce the effect of these deviations on model training, we aggregated the process at a few different time scales, taking the moving average of the past 2N thirty-second readings to produce one value every N minutes. For this problem, models fit on one- to five-minute intervals worked best: the moving-average process stabilized the fluctuations while still capturing areas where an individual metric shifts dramatically away from its recent values.
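One way to realize this aggregation is a tumbling N-minute window, which averages the 2N thirty-second readings in each bucket; `sensor_cols` and `N` below are illustrative:

```python
from pyspark.sql import functions as F

N = 5  # aggregation scale in minutes (one to five worked best for us)
sensor_cols = ["pressure_01", "temp_01"]  # illustrative column names

# Each N-minute bucket averages its 2N thirty-second readings, smoothing
# sampling noise while preserving genuine shifts in the process level.
smoothed = (
    df.groupBy(F.window("ts", f"{N} minutes").alias("bucket"))
      .agg(*[F.avg(c).alias(c) for c in sensor_cols])
      .withColumn("ts", F.col("bucket.end"))
      .drop("bucket")
)
```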
Most research we found regarding predictive maintenance (often the type of problem solved with IoT time-series datasets) relies on outcomes from many similar devices with a clearly defined outcome: either the item completely broke and stopped working within the given timeline, or it didn't. Our dataset, by contrast, represented only a few hundred distinct runtimes from a single process, with just a handful of examples of the process completely failing. There was also large variation in the runtimes of each process run, which, analyzed alongside the outcome (failure or safe shutdown), pointed to external factors contributing to a shutdown that we couldn't measure: other inputs or manual intervention. Given these limitations, we worked with the engineers to create a metric that measured future stability, approximating the internal dynamics of the process in the single target we needed for modeling. While this approach is not perfect, it converts a binary classification problem with limited data into something that can feed either a regression or a classification problem, one that focuses on using recent history to predict the immediate future rather than some potentially distant outcome.
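The exact stability metric was defined jointly with the engineering team and isn't reproduced here, but as a minimal sketch of the labeling idea, assume the target is the variability of a hypothetical `key_metric` over the next thirty minutes; the horizon and cutoff below are illustrative only:

```python
from pyspark.sql import functions as F, Window

HORIZON_S = 30 * 60  # illustrative look-ahead horizon in seconds

# Forward-looking window: every row sees the next HORIZON_S seconds.
w_future = Window.orderBy(F.col("ts").cast("long")).rangeBetween(1, HORIZON_S)

# Proxy target: how much a key process metric varies in the near future.
labeled = df.withColumn("future_stability", F.stddev("key_metric").over(w_future))

# The same target feeds a regression model directly, or a classifier
# after thresholding (illustrative cutoff below).
labeled = labeled.withColumn(
    "unstable", (F.col("future_stability") > F.lit(0.5)).cast("int")
)
```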
Read the next part of the blog: Machine Learning with Industrial Production – Data Modeling