This is the second of two posts covering a recent Machine Learning project at an industrial company. The first post covered the dataset, preprocessing, and features used to analyze the data. With those data assumptions established, we can now dig into the algorithms attempted in the project’s next phases.
We won’t cover any specific findings or outcomes in this blog post, but it is still useful to discuss the models we trained and explain what worked and what did not. On top of the Regression vs. Classification vs. Time Series paradigm differences, it feels like Machine Learning is quickly splitting between the more traditional methods (Generalized Linear Models, ensemble tree-based methods, etc.) and the more nebulous but currently in vogue Neural Networks (Keras, PyTorch, TensorFlow, etc.). We tried both camps to see if we could tease increased accuracy out of the added complexity that comes with Neural Nets, and found that while there were some moderate improvements, they did not seem to outweigh the lengthened training times.
For the regression problem, we settled on the PySpark distribution of XGBoost. It had a similar MAE/MSE to the Python distribution but with the increased efficiency afforded by Spark. We also tested other models such as Linear Regression and Random Forest but, as is so often the case, XGBoost outperformed both. One recent Databricks improvement worth leveraging is the distribution of the XGBoost training workload across workers, as prior versions put all model training on the driver node. We adjusted several parameters to find the optimal configuration for the PySpark model, balancing increased depth, complexity, and training time against increased output accuracy. One aspect where the Python approach is superior to Spark is the ability to quickly and easily run a Randomized Grid Search from the sklearn package; we accomplished the same functionality but had to write a custom function. Randomized hyperparameter tuning often explores the parameter space more quickly than a standard grid, and it was essential to limit the number of models attempted for cost and performance reasons.
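To illustrate, here is a minimal sketch of that kind of custom randomized search, assuming the distributed SparkXGBRegressor estimator shipped with recent XGBoost releases; the parameter ranges, column names, trial count, and worker count are illustrative rather than the values we actually used:

```python
import random
from xgboost.spark import SparkXGBRegressor
from pyspark.ml.evaluation import RegressionEvaluator

# Illustrative search space -- the ranges we tuned differed.
param_space = {
    "max_depth": [4, 6, 8, 10],
    "n_estimators": [100, 250, 500],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 0.85, 1.0],
}

evaluator = RegressionEvaluator(labelCol="label",
                                predictionCol="prediction",
                                metricName="mae")

def randomized_search(train_df, val_df, n_trials=10, seed=42):
    """Sample random configs instead of exhausting the full grid,
    keeping the number of fitted models (and cluster cost) bounded."""
    rng = random.Random(seed)
    best_mae, best_model, best_params = float("inf"), None, None
    for _ in range(n_trials):
        params = {k: rng.choice(v) for k, v in param_space.items()}
        model = SparkXGBRegressor(
            features_col="features",
            label_col="label",
            num_workers=4,  # distributes training across workers
            **params,
        ).fit(train_df)
        mae = evaluator.evaluate(model.transform(val_df))
        if mae < best_mae:
            best_mae, best_model, best_params = mae, model, params
    return best_model, best_params, best_mae
```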
Upon closer investigation, we also found that the distribution of the regression variable led to dampened instability predictions. As we will cover in more detail in the classification section, the dataset was imbalanced, with many more stable periods than unstable ones. In our case, zero reflected extreme future stability, whereas larger values pointed to periods of extreme process fluctuation. To make the model more sensitive to these higher readings, we tried transforming the response variable with a few different link functions to gauge the impact. We found that using a log function with an offset (zero values were valid in our response variable) or a Box-Cox transformation improved accuracy while also increasing sensitivity to the spikes we were trying to predict.
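As a rough sketch of the two transformations, assuming NumPy and SciPy; the offset of 1 and the sample values are illustrative:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

y = np.array([0.0, 0.2, 0.1, 4.7, 0.0, 12.3])  # illustrative response values

# Log with an offset: log1p handles the valid zero values without -inf.
y_log = np.log1p(y)

# Box-Cox requires strictly positive inputs, so shift by the same offset;
# scipy's boxcox also estimates the optimal lambda via maximum likelihood.
y_boxcox, lmbda = stats.boxcox(y + 1.0)

# Predictions must be mapped back to the original scale for reporting.
def invert_log(pred):
    return np.expm1(pred)

def invert_boxcox(pred, lmbda):
    return inv_boxcox(pred, lmbda) - 1.0
```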
From this regression variable, we also created a classification problem using a threshold, discussed with the engineering team, that represented the point at which they felt the process was going askew. With most of the data points reflecting periods of stability, we ran into a class imbalance problem. In my experience, applying a weight factor during model training is often a better approach than up/downsampling, so we focused on this when fitting our models. After training a few different classification algorithms, the headline accuracy metrics were promising, but upon further investigation this was due more to the class imbalance than to the models extracting meaningful discrepancies between the input features. We cared much more about instability and its causes, so the low precision and recall scores on those predictions meant the models had little value to the business.
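In the XGBoost family, the usual knob for such a weight factor in a binary problem is scale_pos_weight; a minimal sketch, again assuming the distributed Spark estimator and a hypothetical 0/1 label column:

```python
from xgboost.spark import SparkXGBClassifier

# Weight the minority (unstable) class instead of resampling.
# `train_df` is the hypothetical training DataFrame; label 1 = unstable.
n_stable = train_df.filter("label = 0").count()
n_unstable = train_df.filter("label = 1").count()

clf = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    scale_pos_weight=n_stable / n_unstable,  # upweight rare unstable periods
    num_workers=4,
)
model = clf.fit(train_df)
```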
Regression and Classification problems often remove temporal considerations from the model structure. For our last approach, we wanted to see if there were any time-dependent components affecting process stability that were missed by mapping these features to a lower-dimensional representation. Since we were working with data from a complex multidimensional system, we decided to implement a more advanced modeling technique, Recurrent Neural Networks, which use artificial memory to factor in the temporal connections between data points. Before we could test the new methodology, we first needed to define the model structure and then alter the data input shape to fit the needs of the model. For our model, we employed a bidirectional LSTM layer with a variable number of hidden nodes depending on the length of the window we fed as input. The number of sensors per time step remained static at around 75, but as we expanded the time frame, the input size grew linearly, drastically increasing the number of trainable parameters and thus training time. Unfortunately, we could not access a GPU-enabled cluster for this project, so finding the right balance between model complexity and training time remained a factor throughout the entire process.
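To make that input shape concrete, here is a sketch of the kind of windowing this requires, with an illustrative window length, our roughly 75 sensors, and a simplified label alignment:

```python
import numpy as np

def make_windows(readings, labels, timesteps):
    """Slice a (time, features) matrix of sensor readings into
    overlapping windows of shape (samples, timesteps, features)."""
    X, y = [], []
    for end in range(timesteps, len(readings)):
        X.append(readings[end - timesteps:end])
        y.append(labels[end])  # illustrative: label taken at the window's end
    return np.stack(X), np.array(y)

readings = np.random.rand(10_000, 75)  # ~75 sensors per time step
labels = np.random.rand(10_000)
X, y = make_windows(readings, labels, timesteps=60)
print(X.shape)  # (9940, 60, 75)
```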
While the initial bidirectional LSTM layer remained constant across all our model constructs, we tried adding one to three hidden layers with varying numbers of nodes to increase accuracy. For this problem, the added complexity did not seem to have much impact on the output, though we by no means performed an exhaustive analysis of different model structures. Other important configurations were the Adam optimization algorithm in conjunction with a step-wise learning rate decay and early stopping. The weights were initialized across each layer using the Glorot Normal method, which is my personal preference! We alternated between MAE and MSE as the loss function because MAE is generally more robust to outliers, but there was no significant difference between the two in final accuracy. The output layer consisted of a single node with either a linear or a sigmoid activation function, depending on the type of problem. The business did not necessarily care to forecast multiple timesteps ahead; it was more concerned with the potential for a drastic spike within some undefined future time frame. In this way, we could boil down a multi-step future window to the singular metric they cared about and model against that.
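A minimal Keras sketch of this setup, assuming TensorFlow; the layer widths, decay schedule, patience, and the classification loss are illustrative choices rather than the exact configuration we ran:

```python
import tensorflow as tf

TIMESTEPS, N_SENSORS = 60, 75  # illustrative window length, ~75 sensors

def build_model(hidden_layers=1, units=64, classification=False):
    init = tf.keras.initializers.GlorotNormal()
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(TIMESTEPS, N_SENSORS)))
    model.add(tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(units, kernel_initializer=init)))
    for _ in range(hidden_layers):  # we tried one to three of these
        model.add(tf.keras.layers.Dense(units, activation="relu",
                                        kernel_initializer=init))
    # Single output node: sigmoid for classification, linear for regression.
    model.add(tf.keras.layers.Dense(
        1, activation="sigmoid" if classification else "linear",
        kernel_initializer=init))
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="binary_crossentropy" if classification else "mae")
    return model

# Step-wise learning rate decay (halve every 10 epochs) and early stopping.
callbacks = [
    tf.keras.callbacks.LearningRateScheduler(
        lambda epoch, lr: lr * 0.5 if epoch and epoch % 10 == 0 else lr),
    tf.keras.callbacks.EarlyStopping(monitor="loss", patience=5,
                                     restore_best_weights=True),
]
```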
To iterate quickly over the different model configurations, we created functions to preprocess the data, reshape the input matrix, and train the models. To help with training efficiency, we fed the readings from each sensor through a StandardScaler before reformatting each data point into the shape specified by the model’s input layer: [samples, timesteps, features]. The response variable for the regression problems was also transformed using the Box-Cox methodology covered above, due to model convergence issues with the untransformed values that I believe stemmed more from the distribution of the response variable than from the absolute range of the values.

To fit the model, we wrapped the training data with the Petastorm library’s spark converter, which enables distributed training and evaluation of deep learning models. With the preprocessed data stored in partitioned Parquet files, this allowed us to batch train over the entire dataset without running into the memory constraints that typically accompany fitting these types of TensorFlow models. Petastorm also ensured that we could scale the process regardless of any increase in historical or sensor data. As data becomes more entrenched in distributed systems, I would highly recommend checking out Petastorm to train Neural Networks on big data; a minimal sketch of the pattern is included at the end of this post.

With IoT devices becoming more pervasive across different industries, new solutions will be required to unlock business insights from these datasets. Crafting a solution that best meets business expectations always involves close collaboration with industry experts to translate statistical outputs into meaningful outcomes for the team. While this may not be the only way to approach these types of problems, we felt it was the best given the realities of our situation and enjoyed tackling this unique problem. This project was an excellent exercise in understanding the difficulties presented by dirty real-world datasets and the assumptions and procedures required to overcome them.
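As promised, here is a minimal sketch of the Petastorm pattern, assuming its SparkDatasetConverter API on a cluster with an active SparkSession named spark, and a hypothetical preprocessed_df whose features column is stored as a plain array; the cache path, batch size, and epoch count are illustrative:

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Directory where Petastorm caches the DataFrame as Parquet (illustrative path).
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///dbfs/tmp/petastorm_cache")

# Wrap the preprocessed Spark DataFrame (hypothetical `preprocessed_df`
# with `features` and `label` columns) for streaming into TensorFlow.
converter = make_spark_converter(preprocessed_df)
batch_size = 256

with converter.make_tf_dataset(batch_size=batch_size) as dataset:
    # Each batch arrives as a namedtuple of columns; map to (features, label).
    dataset = dataset.map(lambda batch: (batch.features, batch.label))
    model.fit(dataset,
              epochs=20,
              steps_per_epoch=len(converter) // batch_size,
              callbacks=callbacks)
```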