Notebook: Dataset analysis¶
Sensor data quality plays a vital role in Internet of Things (IoT) applications, particularly in predictive maintenance and Remaining Useful Life (RUL) estimation. Poor data quality can lead to unreliable predictions and decisions, which can cause significant operational and safety issues.
One of the most commonly encountered problems in sensor data is missing data. Missing data can result from various factors, including unstable wireless connections due to network congestion and sensor device outages caused by limited battery life or environmental interference. Prolonged periods of missing data can lead to inaccurate RUL predictions.
In addition to addressing missing data, understanding the correlation between different sensor readings and the distribution of values in the dataset is crucial for accurate RUL estimation. Correlation analysis helps identify relationships between variables, which can be used to enhance predictive models. Analyzing value distribution provides insights into the data's behavior and highlights any anomalies or biases.
In this notebook, we will perform an analysis of an RUL dataset. Our focus will be on:
- Missing Data: Identifying and handling missing data to ensure the dataset is complete and reliable for analysis.
- Monotonicity: Checking the monotonicity of sensor readings to ensure the data follows a logical pattern.
- Correlation Analysis: Examining the relationships between different sensor readings to improve the accuracy of predictive models.
- Value Distribution: Analyzing the distribution of sensor values to understand the data's behavior and identify any anomalies.
%load_ext autoreload
%autoreload 2
import matplotlib.pyplot as plt
import seaborn as sbn
sbn.set_theme()
Load the PHMDataset2018 dataset¶
from ceruleo.dataset.catalog.PHMDataset2018 import PHMDataset2018, FailureType
dataset = PHMDataset2018(tools=["01M01", "04M01"])
Create a transformer for a dataset¶
from ceruleo.dataset.analysis.numerical_features import analyze_as_dataframe, analyze
from ceruleo.transformation.features.selection import ByNameFeatureSelector
from ceruleo.transformation.functional.pipeline.pipeline import make_pipeline
from ceruleo.transformation.functional.transformers import Transformer
from ceruleo.transformation.features.cast import ToDateTime
from ceruleo.transformation.features.scalers import MinMaxScaler
FEATURES = [
    "IONGAUGEPRESSURE",
    "ETCHBEAMVOLTAGE",
    "ETCHBEAMCURRENT",
    "ETCHSUPPRESSORVOLTAGE",
    "ETCHSUPPRESSORCURRENT",
    "FLOWCOOLFLOWRATE",
    "FLOWCOOLPRESSURE",
    "ETCHGASCHANNEL1READBACK",
    "ETCHPBNGASREADBACK",
    "FIXTURETILTANGLE",
    "ROTATIONSPEED",
    "ACTUALROTATIONANGLE",
    "FIXTURESHUTTERPOSITION",
    "ETCHSOURCEUSAGE",
    "ETCHAUXSOURCETIMER",
    "ETCHAUX2SOURCETIMER",
    "ACTUALSTEPDURATION",
]
transformer = Transformer(
    pipelineX=make_pipeline(
        ToDateTime(index=True),
        ByNameFeatureSelector(features=FEATURES),
        MinMaxScaler(range=(-1, 1)),
    ),
    pipelineY=make_pipeline(
        ToDateTime(index=True),
        ByNameFeatureSelector(features=["RUL"]),
    ),
)
transformed_dataset = transformer.fit_map(dataset)
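To make the scaling step concrete, here is a minimal pure-Python sketch of min-max scaling to the range (-1, 1). The helper `min_max_scale` is hypothetical and is not part of ceruleo; it only illustrates the arithmetic, assuming the minimum and maximum are taken over the values being scaled.

```python
def min_max_scale(values, feature_range=(-1.0, 1.0)):
    """Linearly map values so that min(values) -> low and max(values) -> high."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # A constant input has no spread; map every value to the midpoint.
        return [(low + high) / 2.0 for _ in values]
    return [low + (v - v_min) * (high - low) / span for v in values]


# Made-up readings, used only for illustration
scaled = min_max_scale([0.0, 5.0, 10.0])
print(scaled)
```

Because every feature ends up in the same range, the per-feature standard deviations reported later in the notebook are directly comparable.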
Sample rate¶
We can evaluate the sample rate of the raw dataset, before any transformation. The sampling interval shows large outliers, but the vast majority of points are sampled about 4 seconds apart.
from ceruleo.dataset.analysis.sample_rate import sample_rate, sample_rate_summary
sample_rates = sample_rate(dataset)
fig, ax = plt.subplots(1, 2, figsize=(17, 5))
ax[0].boxplot(sample_rates, labels=["Sample rate"])
ax[0].set_ylabel("Seconds [s]")
ax[1].boxplot(sample_rates, labels=["Sample rate"])
ax[1].set_ylabel("Seconds [s]")
ax[1].set_ylim(0, 10)
sample_rate_summary(dataset)
Median: 4.0 [s]
Mean +- Std: 3.980 +- 0.219 [s]
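As a rough illustration of what such a summary measures, the sketch below computes the gaps between consecutive timestamps and their median and mean. The helper `sampling_intervals` and the timestamps are made up for this example; this is not how ceruleo implements `sample_rate`.

```python
from datetime import datetime, timedelta
from statistics import mean, median


def sampling_intervals(timestamps):
    """Return the gaps, in seconds, between consecutive timestamps."""
    return [(b - a).total_seconds() for a, b in zip(timestamps, timestamps[1:])]


# Made-up timestamps: regular 4 s sampling with a single 30 s gap
start = datetime(2018, 1, 1)
offsets = [0, 4, 8, 12, 16, 46, 50]
timestamps = [start + timedelta(seconds=s) for s in offsets]

intervals = sampling_intervals(timestamps)
print(intervals)
print(median(intervals), mean(intervals))
```

The median is unaffected by the single long gap, while the mean is pulled upwards, which is why both statistics are worth reporting.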
Numeric feature analysis¶
The analyze function of the ceruleo.dataset.analysis.numerical_features module provides an overview of the numeric features in the dataset. It calculates metrics for each feature and for each run-to-failure cycle. In the following example, the result is a dictionary that maps each feature to a NumericalFeaturesAnalysis object, which holds the metrics computed for each run-to-failure cycle.
In this case we request two metrics, null and std, and obtain 24 values for each of them: one value per run-to-failure cycle. The null metric is the proportion of null values of the feature in a cycle, and the std metric is the standard deviation of the feature in a cycle.
rr = analyze(transformed_dataset, show_progress=False, what_to_compute=["null", "std"])
rr["IONGAUGEPRESSURE"]["std"]
[0.1538784215135119, 0.16151160115580462, 0.15799453196073904, 0.13861749490903919, 0.16566794042040264, 0.16425937954116393, 0.1703799320940931, 0.17711007507379947, 0.16845221328428622, 0.1703432255813474, 0.17112279778174258, 0.16960183601086146, 0.1644573963958884, 0.1667180196658114, 0.14469551314577048, 0.16112822632189672, 0.15470216323219949, 0.19875805450944148, 0.1573624712039302, 0.18292432023762176, 0.17527184147763705, 0.1867674364630246, 0.17406487007437757, 0.16549787029020377]
We can also compute a summary that aggregates each metric across the run-to-failure cycles: its mean, standard deviation, maximum, and minimum. This summary can be used to identify patterns and anomalies in the data. For the null metric, the mean is the proportion of missing values averaged over all the cycles, so it increases when a feature is missing in several cycles.
feature_summary = rr["IONGAUGEPRESSURE"].summarize()
feature_summary["null"]
MetricValuesSummary(mean=0.0, std=0.0, max=0.0, min=0.0)
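The two levels of aggregation can be sketched in plain Python: first one value per cycle, then a summary across cycles. The helpers `null_proportion` and `summarize` are hypothetical, the data is made up, and the use of the population standard deviation is an assumption rather than ceruleo's documented behaviour.

```python
import math
from statistics import mean, pstdev


def null_proportion(values):
    """Fraction of entries that are missing (None or NaN)."""
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return missing / len(values)


def summarize(per_cycle_values):
    """Aggregate one metric across cycles (population std is an assumption)."""
    return {
        "mean": mean(per_cycle_values),
        "std": pstdev(per_cycle_values),
        "max": max(per_cycle_values),
        "min": min(per_cycle_values),
    }


# Three made-up cycles of a single feature
cycles = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, None, 3.0, 4.0],
    [float("nan"), None, 2.0, 5.0],
]
per_cycle = [null_proportion(c) for c in cycles]
summary = summarize(per_cycle)
print(per_cycle)
print(summary)
```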
Finally, the whole summary can be obtained as a DataFrame.
analyze_as_dataframe(transformed_dataset, what_to_compute=["null", "std"])
Feature | null: Mean value across the cycles | null: Standard deviation across the cycles | null: Maximum value found in a cycle | null: Minimum value found in a cycle | std: Mean value across the cycles | std: Standard deviation across the cycles | std: Maximum value found in a cycle | std: Minimum value found in a cycle |
---|---|---|---|---|---|---|---|---|
IONGAUGEPRESSURE | 0.000000 | 0.000000 | 0.0000 | 0.0 | 0.166720 | 0.012588 | 0.198758 | 1.386175e-01 |
ETCHBEAMVOLTAGE | 0.000000 | 0.000000 | 0.0000 | 0.0 | 0.380300 | 0.019524 | 0.409515 | 3.239813e-01 |
ETCHBEAMCURRENT | 0.000000 | 0.000000 | 0.0000 | 0.0 | 0.414080 | 0.020573 | 0.441704 | 3.509712e-01 |
ETCHSUPPRESSORVOLTAGE | 0.000000 | 0.000000 | 0.0000 | 0.0 | 0.557341 | 0.050995 | 0.716162 | 4.599301e-01 |
ETCHSUPPRESSORCURRENT | 0.000000 | 0.000000 | 0.0000 | 0.0 | 0.077735 | 0.005504 | 0.088157 | 6.658743e-02 |
FLOWCOOLFLOWRATE | 0.000000 | 0.000000 | 0.0000 | 0.0 | 0.412940 | 0.035307 | 0.503905 | 3.417936e-01 |
FLOWCOOLPRESSURE | 0.000000 | 0.000000 | 0.0000 | 0.0 | 0.130684 | 0.078748 | 0.401497 | 8.276058e-02 |
ETCHGASCHANNEL1READBACK | 0.000000 | 0.000000 | 0.0000 | 0.0 | 0.447334 | 0.027615 | 0.516869 | 3.771744e-01 |
ETCHPBNGASREADBACK | 0.000000 | 0.000000 | 0.0000 | 0.0 | 0.431853 | 0.030993 | 0.513432 | 3.650782e-01 |
FIXTURETILTANGLE | 0.000000 | 0.000000 | 0.0000 | 0.0 | 0.089086 | 0.023832 | 0.148195 | 5.551115e-16 |
ROTATIONSPEED | 0.000000 | 0.000000 | 0.0000 | 0.0 | 0.038895 | 0.027090 | 0.085676 | 0.000000e+00 |
ACTUALROTATIONANGLE | 0.000000 | 0.000000 | 0.0000 | 0.0 | 0.049563 | 0.020486 | 0.087564 | 8.974330e-03 |
FIXTURESHUTTERPOSITION | 0.000083 | 0.000241 | 0.0011 | 0.0 | 0.019717 | 0.018796 | 0.057795 | 3.994086e-03 |
ETCHSOURCEUSAGE | 0.000000 | 0.000000 | 0.0000 | 0.0 | 0.154297 | 0.191935 | 0.732096 | 9.668065e-05 |
ETCHAUXSOURCETIMER | 0.000000 | 0.000000 | 0.0000 | 0.0 | 0.125170 | 0.164681 | 0.495712 | 4.452867e-05 |
ETCHAUX2SOURCETIMER | 0.000000 | 0.000000 | 0.0000 | 0.0 | 0.191197 | 0.206324 | 0.620918 | 9.676074e-05 |
ACTUALSTEPDURATION | 0.000000 | 0.000000 | 0.0000 | 0.0 | 0.111877 | 0.042060 | 0.192831 | 4.440892e-16 |
Missing values¶
Sensor information is often incomplete, which causes missing values in the features. This library provides functions to analyse the proportion of missing values of each feature in each life. If a feature is missing in a large part of many lives, it can be discarded. Here, FIXTURESHUTTERPOSITION has a cycle where the proportion of missing values reaches 0.0011; since this is so small, we can simply impute those missing values.
analyze_as_dataframe(transformed_dataset, what_to_compute=["null"])
Feature | null: Mean value across the cycles | null: Standard deviation across the cycles | null: Maximum value found in a cycle | null: Minimum value found in a cycle |
---|---|---|---|---|
IONGAUGEPRESSURE | 0.000000 | 0.000000 | 0.0000 | 0.0 |
ETCHBEAMVOLTAGE | 0.000000 | 0.000000 | 0.0000 | 0.0 |
ETCHBEAMCURRENT | 0.000000 | 0.000000 | 0.0000 | 0.0 |
ETCHSUPPRESSORVOLTAGE | 0.000000 | 0.000000 | 0.0000 | 0.0 |
ETCHSUPPRESSORCURRENT | 0.000000 | 0.000000 | 0.0000 | 0.0 |
FLOWCOOLFLOWRATE | 0.000000 | 0.000000 | 0.0000 | 0.0 |
FLOWCOOLPRESSURE | 0.000000 | 0.000000 | 0.0000 | 0.0 |
ETCHGASCHANNEL1READBACK | 0.000000 | 0.000000 | 0.0000 | 0.0 |
ETCHPBNGASREADBACK | 0.000000 | 0.000000 | 0.0000 | 0.0 |
FIXTURETILTANGLE | 0.000000 | 0.000000 | 0.0000 | 0.0 |
ROTATIONSPEED | 0.000000 | 0.000000 | 0.0000 | 0.0 |
ACTUALROTATIONANGLE | 0.000000 | 0.000000 | 0.0000 | 0.0 |
FIXTURESHUTTERPOSITION | 0.000083 | 0.000241 | 0.0011 | 0.0 |
ETCHSOURCEUSAGE | 0.000000 | 0.000000 | 0.0000 | 0.0 |
ETCHAUXSOURCETIMER | 0.000000 | 0.000000 | 0.0000 | 0.0 |
ETCHAUX2SOURCETIMER | 0.000000 | 0.000000 | 0.0000 | 0.0 |
ACTUALSTEPDURATION | 0.000000 | 0.000000 | 0.0000 | 0.0 |
We can see that 4 cycles have a proportion of missing values greater than 0 for this feature, but in all of them the proportion is small.
rr = analyze(transformed_dataset, show_progress=False, what_to_compute=["null"])
rr["FIXTURESHUTTERPOSITION"]["null"]
[0.0, 0.0011000495022276003, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0001789318484375671, 0.0005336307435343965, 0.00017414200235439987, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
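With so few gaps, a simple imputation is usually enough. The sketch below shows one possible strategy, forward filling, in plain Python. The helper `forward_fill` is hypothetical and the notebook does not state which imputation method should be used, so treat the choice as an assumption.

```python
def forward_fill(values):
    """Replace each missing entry (None) with the last observed value.

    Leading gaps have no previous value, so they are filled with the
    first observed value instead.
    """
    filled = []
    last = None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    first_observed = next((v for v in filled if v is not None), None)
    return [first_observed if v is None else v for v in filled]


# Made-up readings with gaps at the start, in the middle, and at the end
readings = [None, 0.2, None, None, 0.5, None]
imputed = forward_fill(readings)
print(imputed)
```

Forward filling suits signals that hold their value between updates; for fast-changing signals, interpolation may be a better fit.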
Feature standard deviation¶
The standard deviation of the features can be used to identify features with low variance, which may not provide useful information for predictive maintenance tasks. Features with low variance can be removed from the dataset to reduce complexity and improve model performance.
We can see that ROTATIONSPEED has a cycle whose standard deviation is 0.0. This can be a problem, because a feature that is constant within a cycle provides no information to the model.
analyze_as_dataframe(transformed_dataset, what_to_compute=["std"])
Feature | std: Mean value across the cycles | std: Standard deviation across the cycles | std: Maximum value found in a cycle | std: Minimum value found in a cycle |
---|---|---|---|---|
IONGAUGEPRESSURE | 0.166720 | 0.012588 | 0.198758 | 1.386175e-01 |
ETCHBEAMVOLTAGE | 0.380300 | 0.019524 | 0.409515 | 3.239813e-01 |
ETCHBEAMCURRENT | 0.414080 | 0.020573 | 0.441704 | 3.509712e-01 |
ETCHSUPPRESSORVOLTAGE | 0.557341 | 0.050995 | 0.716162 | 4.599301e-01 |
ETCHSUPPRESSORCURRENT | 0.077735 | 0.005504 | 0.088157 | 6.658743e-02 |
FLOWCOOLFLOWRATE | 0.412940 | 0.035307 | 0.503905 | 3.417936e-01 |
FLOWCOOLPRESSURE | 0.130684 | 0.078748 | 0.401497 | 8.276058e-02 |
ETCHGASCHANNEL1READBACK | 0.447334 | 0.027615 | 0.516869 | 3.771744e-01 |
ETCHPBNGASREADBACK | 0.431853 | 0.030993 | 0.513432 | 3.650782e-01 |
FIXTURETILTANGLE | 0.089086 | 0.023832 | 0.148195 | 5.551115e-16 |
ROTATIONSPEED | 0.038895 | 0.027090 | 0.085676 | 0.000000e+00 |
ACTUALROTATIONANGLE | 0.049563 | 0.020486 | 0.087564 | 8.974330e-03 |
FIXTURESHUTTERPOSITION | 0.019717 | 0.018796 | 0.057795 | 3.994086e-03 |
ETCHSOURCEUSAGE | 0.154297 | 0.191935 | 0.732096 | 9.668065e-05 |
ETCHAUXSOURCETIMER | 0.125170 | 0.164681 | 0.495712 | 4.452867e-05 |
ETCHAUX2SOURCETIMER | 0.191197 | 0.206324 | 0.620918 | 9.676074e-05 |
ACTUALSTEPDURATION | 0.111877 | 0.042060 | 0.192831 | 4.440892e-16 |
Looking at the per-cycle values, six of the 24 cycles have a standard deviation that is zero up to floating-point error.
rr = analyze(transformed_dataset, show_progress=False, what_to_compute=["std"])
rr["ROTATIONSPEED"]["std"]
[0.0195906016364457, 0.04336851470681955, 0.05799620898185587, 1.1102230246251565e-16, 0.0360890061340275, 0.038201237893065954, 1.1102230246251565e-16, 0.06606608560204476, 0.04426971331940611, 0.06907071252869629, 0.08567629006783237, 0.05527193794092481, 0.048860280663576686, 0.048314164539748315, 0.0630512429349472, 0.038968725642062135, 0.06539096888505944, 0.07326999721034945, 0.01789179320551508, 5.551115123125783e-17, 1.1102230246251565e-16, 0.0, 5.551115123125783e-17, 0.06212923988601241]
If we inspect some of these cycles, we can see that the feature does in fact have a constant value.
fig, ax = plt.subplots(3, 1, figsize=(17, 5))
ax[0].plot(transformed_dataset[3][0]["ROTATIONSPEED"])
ax[1].plot(transformed_dataset[6][0]["ROTATIONSPEED"])
ax[2].plot(transformed_dataset[-2][0]["ROTATIONSPEED"])
fig.tight_layout()
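The constant cycles can also be located programmatically. The sketch below reuses the per-cycle ROTATIONSPEED standard deviations printed above (rounded here) and flags those below a small tolerance. The tolerance value is an assumption; it is needed because values such as 1.1e-16 are floating-point noise rather than real variation.

```python
# Per-cycle standard deviations of ROTATIONSPEED, rounded from the output above
rotation_speed_std = [
    0.01959, 0.04337, 0.05800, 1.11e-16, 0.03609, 0.03820,
    1.11e-16, 0.06607, 0.04427, 0.06907, 0.08568, 0.05527,
    0.04886, 0.04831, 0.06305, 0.03897, 0.06539, 0.07327,
    0.01789, 5.55e-17, 1.11e-16, 0.0, 5.55e-17, 0.06213,
]

TOLERANCE = 1e-12  # assumed threshold separating noise from real variation

constant_cycles = [
    i for i, std in enumerate(rotation_speed_std) if std < TOLERANCE
]
print(constant_cycles)
```

Cycles 3, 6 and 22 (index -2) are among the flagged ones, which matches the cycles plotted above.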
Monotonicity¶
The monotonicity of the features can be used to identify features that exhibit a consistent trend over time. Monotonic features can provide valuable information for predictive maintenance tasks, as they capture the gradual degradation of equipment.
We can see that the timers are monotonic, which is expected because time is always increasing. ETCHSOURCEUSAGE is also monotonic, again as expected, because the etch source usage only accumulates. The presence of several correlated monotonic features can be a sign of multicollinearity, which can affect the performance of predictive models.
analyze_as_dataframe(transformed_dataset, what_to_compute=["monotonicity"]).sort_values(
    by=("monotonicity", "Mean value across the cycles"), ascending=False
)
Feature | monotonicity: Mean value across the cycles | monotonicity: Standard deviation across the cycles | monotonicity: Maximum value found in a cycle | monotonicity: Minimum value found in a cycle |
---|---|---|---|---|
ETCHAUXSOURCETIMER | 0.629001 | 0.081907 | 0.767106 | 0.420611 |
ETCHAUX2SOURCETIMER | 0.628923 | 0.081898 | 0.766641 | 0.420611 |
ETCHSOURCEUSAGE | 0.591408 | 0.108967 | 0.781429 | 0.325093 |
IONGAUGEPRESSURE | 0.131580 | 0.037162 | 0.246032 | 0.081020 |
ETCHBEAMCURRENT | 0.079963 | 0.022276 | 0.122318 | 0.039683 |
ETCHSUPPRESSORVOLTAGE | 0.049960 | 0.021527 | 0.119048 | 0.021107 |
FLOWCOOLPRESSURE | 0.036722 | 0.060791 | 0.317460 | 0.001423 |
ETCHSUPPRESSORCURRENT | 0.027669 | 0.016906 | 0.085868 | 0.000000 |
ETCHPBNGASREADBACK | 0.010919 | 0.006596 | 0.025808 | 0.005138 |
FLOWCOOLFLOWRATE | 0.008001 | 0.003239 | 0.014125 | 0.002467 |
ETCHGASCHANNEL1READBACK | 0.007628 | 0.004352 | 0.023810 | 0.001789 |
ETCHBEAMVOLTAGE | 0.004334 | 0.005614 | 0.023810 | 0.000193 |
FIXTURETILTANGLE | 0.000713 | 0.000339 | 0.001484 | 0.000000 |
ACTUALROTATIONANGLE | 0.000430 | 0.001567 | 0.007937 | 0.000000 |
ACTUALSTEPDURATION | 0.000287 | 0.000339 | 0.001789 | 0.000000 |
FIXTURESHUTTERPOSITION | 0.000087 | 0.000067 | 0.000194 | 0.000000 |
ROTATIONSPEED | 0.000005 | 0.000021 | 0.000108 | 0.000000 |
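For intuition about the scores, here is a minimal sketch of a common monotonicity definition: the absolute difference between the number of positive and negative steps, divided by the total number of steps. Whether ceruleo uses exactly this formula is an assumption, and the helper `monotonicity` and the sample series are made up.

```python
def monotonicity(values):
    """|#positive steps - #negative steps| / #steps, a value in [0, 1]."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    if not diffs:
        return 0.0
    positive = sum(1 for d in diffs if d > 0)
    negative = sum(1 for d in diffs if d < 0)
    return abs(positive - negative) / len(diffs)


timer = [0, 1, 2, 3, 4, 5]        # strictly increasing
noisy = [0, 1, 0, 1, 0, 1, 0]     # oscillating around a fixed level
plateau = [0, 0, 1, 1, 2, 2, 3]   # increases on only half of the steps

print(monotonicity(timer), monotonicity(noisy), monotonicity(plateau))
```

Under this definition, flat segments lower the score even when a signal never decreases, which is one possible reason why a counter can score well below 1.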
fig, ax = plt.subplots(1, 1, figsize=(17, 5))
ax.plot(transformed_dataset[-4][0]["ETCHSOURCEUSAGE"].values)
Relation with the target¶
The relation between the features and the target variable can be analyzed using correlation coefficients. The correlation analysis can help identify features that are strongly related to the target variable and can be used to build predictive models.
We can see that only the timers and the usage are highly correlated with the target variable. This is expected because the target variable is the RUL, and the timers and the usage are the only features directly related to elapsed time. The other features are not strongly correlated with the target variable.
analyze_as_dataframe(transformed_dataset, what_to_compute=["correlation"]).sort_values(
    by=("correlation", "Mean value across the cycles"), ascending=False
)
Feature | correlation: Mean value across the cycles | correlation: Standard deviation across the cycles | correlation: Maximum value found in a cycle | correlation: Minimum value found in a cycle |
---|---|---|---|---|
ETCHAUXSOURCETIMER | 0.844595 | 0.371078 | 1.000000 | -0.264857 |
ETCHSOURCEUSAGE | 0.783816 | 0.437367 | 0.999995 | -0.432908 |
ETCHAUX2SOURCETIMER | 0.743158 | 0.469988 | 1.000000 | -0.432908 |
FLOWCOOLPRESSURE | 0.034503 | 0.305846 | 0.603682 | -0.916936 |
ACTUALSTEPDURATION | 0.015444 | 0.183581 | 0.412032 | -0.459476 |
FIXTURETILTANGLE | 0.003311 | 0.122210 | 0.318482 | -0.223626 |
ETCHGASCHANNEL1READBACK | -0.010918 | 0.202759 | 0.307965 | -0.688285 |
ETCHBEAMCURRENT | -0.011158 | 0.209831 | 0.419246 | -0.606570 |
ROTATIONSPEED | -0.013429 | 0.074998 | 0.118146 | -0.187191 |
ACTUALROTATIONANGLE | -0.024491 | 0.271434 | 0.751982 | -0.558149 |
FLOWCOOLFLOWRATE | -0.025738 | 0.209273 | 0.352479 | -0.575904 |
ETCHSUPPRESSORVOLTAGE | -0.028417 | 0.236531 | 0.407153 | -0.609866 |
ETCHBEAMVOLTAGE | -0.029537 | 0.226633 | 0.475763 | -0.677671 |
FIXTURESHUTTERPOSITION | -0.032806 | 0.199978 | 0.295252 | -0.701687 |
IONGAUGEPRESSURE | -0.036029 | 0.276078 | 0.562739 | -0.771059 |
ETCHSUPPRESSORCURRENT | -0.040677 | 0.216098 | 0.275416 | -0.845663 |
ETCHPBNGASREADBACK | -0.045505 | 0.206210 | 0.268059 | -0.593234 |
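As a reminder of what each per-cycle value represents, the sketch below computes the Pearson correlation coefficient between a feature and a target in plain Python. The helper `pearson` and the data are made up, and it is an assumption that the library's correlation metric is the Pearson coefficient.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


feature = [0.0, 1.0, 2.0, 3.0, 4.0]
target = [10.0, 12.0, 14.0, 16.0, 18.0]   # exactly linear in the feature
constant = [0.3, 0.3, 0.3, 0.3, 0.3]      # no variation within the cycle

print(pearson(feature, target))
print(pearson(constant, target))
```

The second call shows why constant cycles matter: the correlation is undefined when a feature does not vary within a cycle.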
Feature pairwise correlation¶
The pairwise correlation between features can be used to identify relationships between variables and detect multicollinearity. Multicollinearity occurs when two or more features are highly correlated, which can lead to unstable model coefficients and inaccurate predictions.
We can see that ETCHGASCHANNEL1READBACK and ETCHPBNGASREADBACK are highly correlated. This is expected because these features are related to the same process.
Another example is ETCHBEAMCURRENT and ETCHSUPPRESSORCURRENT.
from ceruleo.dataset.analysis.correlation import correlation_analysis
(
    correlation_analysis(transformed_dataset)
    .to_pandas()
    .sort_values(by="abs_mean_correlation", ascending=False)
    .head(15)
)
index | feature_1 | feature_2 | mean_correlation | std_correlation | max_correlation | min_correlation | abs_mean_correlation | std_abs_mean_correlation |
---|---|---|---|---|---|---|---|---|
81 | ETCHGASCHANNEL1READBACK | ETCHPBNGASREADBACK | 0.974040 | 0.015899 | 0.998851 | 0.909530 | 0.974040 | 0.015899 |
62 | ETCHBEAMCURRENT | ETCHSUPPRESSORCURRENT | 0.969555 | 0.017441 | 0.999941 | 0.926967 | 0.969555 | 0.017441 |
73 | ETCHBEAMVOLTAGE | ETCHSUPPRESSORCURRENT | 0.961725 | 0.019763 | 0.999911 | 0.907479 | 0.961725 | 0.019763 |
89 | ETCHGASCHANNEL1READBACK | IONGAUGEPRESSURE | 0.961469 | 0.012155 | 0.983084 | 0.935980 | 0.961469 | 0.012155 |
58 | ETCHBEAMCURRENT | ETCHBEAMVOLTAGE | 0.961391 | 0.018625 | 0.999966 | 0.896134 | 0.961391 | 0.018625 |
98 | ETCHPBNGASREADBACK | IONGAUGEPRESSURE | 0.951679 | 0.033223 | 0.987659 | 0.807029 | 0.951679 | 0.033223 |
31 | ETCHAUX2SOURCETIMER | ETCHAUXSOURCETIMER | 0.833866 | 0.481966 | 1.000000 | -0.772497 | 0.951071 | 0.118294 |
96 | ETCHPBNGASREADBACK | FLOWCOOLFLOWRATE | 0.933385 | 0.020826 | 0.991879 | 0.885555 | 0.933385 | 0.020826 |
87 | ETCHGASCHANNEL1READBACK | FLOWCOOLFLOWRATE | 0.914023 | 0.027393 | 0.996820 | 0.867820 | 0.914023 | 0.027393 |
68 | ETCHBEAMCURRENT | IONGAUGEPRESSURE | 0.903116 | 0.023172 | 0.970433 | 0.873936 | 0.903116 | 0.023172 |
113 | ETCHSUPPRESSORCURRENT | IONGAUGEPRESSURE | 0.897708 | 0.025894 | 0.970394 | 0.842836 | 0.897708 | 0.025894 |
36 | ETCHAUX2SOURCETIMER | ETCHSOURCEUSAGE | 0.878960 | 0.299466 | 1.000000 | -0.172562 | 0.893340 | 0.251169 |
131 | FLOWCOOLFLOWRATE | IONGAUGEPRESSURE | 0.887545 | 0.038323 | 0.970444 | 0.777352 | 0.887545 | 0.038323 |
93 | ETCHPBNGASREADBACK | ETCHSUPPRESSORVOLTAGE | 0.885015 | 0.035778 | 0.992222 | 0.818617 | 0.885015 | 0.035778 |
59 | ETCHBEAMCURRENT | ETCHGASCHANNEL1READBACK | 0.880769 | 0.033310 | 0.996809 | 0.839003 | 0.880769 | 0.033310 |
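The table reports both mean_correlation and abs_mean_correlation, and the two differ whenever the sign of the correlation flips between cycles. The sketch below illustrates this with made-up per-cycle values, assuming (from the column names only) that abs_mean_correlation is the mean of the absolute per-cycle correlations.

```python
from statistics import mean

# Made-up per-cycle correlations for one feature pair:
# strongly related in every cycle, but with a flipped sign in one of them
per_cycle_corr = [1.0, 0.9, -0.8, 1.0]

mean_corr = mean(per_cycle_corr)
abs_mean_corr = mean(abs(c) for c in per_cycle_corr)

print(mean_corr, abs_mean_corr)
```

Sorting by the absolute mean keeps such pairs near the top, even though their signed mean is diluted by the cycle with the opposite sign.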
ETCHBEAMCURRENT and ETCHSUPPRESSORCURRENT are highly correlated¶
fig, ax = plt.subplots(figsize=(17, 5))
ax.plot(
    transformed_dataset.get_features_of_life(12)["ETCHBEAMCURRENT"].values[-1000:],
    label="ETCHBEAMCURRENT",
)
ax.plot(
    transformed_dataset.get_features_of_life(12)["ETCHSUPPRESSORCURRENT"].values[-1000:],
    label="ETCHSUPPRESSORCURRENT",
)
ax.legend()
Feature distribution¶
The distribution of feature values can provide insights into the data's behavior and identify any anomalies or biases. Understanding the distribution of features is crucial for building accurate predictive models and making informed decisions. Also, differences in the value distributions across cycles may indicate different operating conditions or equipment states, which can affect the predictive model's performance.
We compare only cycles of similar length to avoid problems when computing the distributions. The top offenders show different distributions, which may indicate a change in the operating conditions.
from ceruleo.dataset.analysis.distribution import features_divergeces
d = features_divergeces(transformed_dataset, number_of_bins=5)
# Keep cycles with length > 1000
d = d[ (d["Cycle 1 length"] > 1000) & (d["Cycle 2 length"] > 1000)]
# Keep Pairs of cycles with similar length
d = d[d["Abs Length difference"] < 1000]
d.sort_values(by=[ "KL", "Abs Length difference",], ascending=[False, True]).head(15)
index | Cycle 1 | Cycle 2 | Cycle 1 length | Cycle 2 length | Abs Length difference | Wasserstein | KL | feature |
---|---|---|---|---|---|---|---|---|
615 | 2 | 21 | 22801 | 22588 | 213 | 0.000000 | 23.025851 | ETCHAUX2SOURCETIMER |
891 | 2 | 21 | 22801 | 22588 | 213 | 0.000000 | 23.025851 | ETCHAUXSOURCETIMER |
689 | 6 | 21 | 21591 | 22588 | 997 | 0.000000 | 23.025851 | ETCHAUX2SOURCETIMER |
2345 | 6 | 21 | 21591 | 22588 | 997 | 0.000000 | 23.025851 | ETCHSOURCEUSAGE |
965 | 6 | 21 | 21591 | 22588 | 997 | 0.028845 | 10.919734 | ETCHAUXSOURCETIMER |
1995 | 2 | 21 | 22801 | 22588 | 213 | 0.037269 | 0.895635 | ETCHPBNGASREADBACK |
4479 | 2 | 21 | 22801 | 22588 | 213 | 0.027420 | 0.771321 | ROTATIONSPEED |
2069 | 6 | 21 | 21591 | 22588 | 997 | 0.005030 | 0.686197 | ETCHPBNGASREADBACK |
63 | 2 | 21 | 22801 | 22588 | 213 | 0.018315 | 0.513243 | ACTUALROTATIONANGLE |
339 | 2 | 21 | 22801 | 22588 | 213 | 0.013813 | 0.386058 | ACTUALSTEPDURATION |
413 | 6 | 21 | 21591 | 22588 | 997 | 0.013813 | 0.386058 | ACTUALSTEPDURATION |
1443 | 2 | 21 | 22801 | 22588 | 213 | 0.157799 | 0.138403 | ETCHBEAMVOLTAGE |
1719 | 2 | 21 | 22801 | 22588 | 213 | 0.093787 | 0.138191 | ETCHGASCHANNEL1READBACK |
1793 | 6 | 21 | 21591 | 22588 | 997 | 0.082148 | 0.096925 | ETCHGASCHANNEL1READBACK |
1517 | 6 | 21 | 21591 | 22588 | 997 | 0.077764 | 0.072103 | ETCHBEAMVOLTAGE |
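To clarify how a histogram-based KL divergence behaves, here is a minimal sketch in plain Python. The helpers `histogram` and `kl_divergence`, the clipping constant `eps`, and the data are assumptions made for illustration; they are not taken from ceruleo. With completely disjoint histograms the result is approximately ln(10^10) ≈ 23.0259, so the repeated value 23.025851 in the table would be consistent with this kind of clipping, although the source does not confirm it.

```python
import math


def histogram(values, bins, low, high):
    """Normalized histogram over `bins` equal-width bins in [low, high]."""
    counts = [0] * bins
    width = (high - low) / bins
    for v in values:
        idx = min(int((v - low) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def kl_divergence(p, q, eps=1e-10):
    """KL(p || q), with empty bins clipped to eps to avoid log(0)."""
    p = [max(v, eps) for v in p]
    q = [max(v, eps) for v in q]
    return sum(a * math.log(a / b) for a, b in zip(p, q))


# Two made-up cycles whose values fall into disjoint bins
cycle_a = [0.05, 0.10, 0.15, 0.12]
cycle_b = [0.90, 0.95, 0.85, 0.99]

p = histogram(cycle_a, bins=5, low=0.0, high=1.0)
q = histogram(cycle_b, bins=5, low=0.0, high=1.0)
kl = kl_divergence(p, q)
print(p, q, kl)
```

A value this large therefore signals that two cycles share no histogram bin, for example when a counter starts from a different offset, and not only that their shapes differ.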
def plot_distributions(ax, ds, life1, life2, feature, bins=10):
    # Overlay the normalized histograms of one feature for two cycles
    for life in (life1, life2):
        ax.hist(
            ds.get_features_of_life(life)[feature],
            label=f"{feature} (cycle {life})",
            density=True,
            alpha=0.8,
            bins=bins,
        )
    ax.legend()


def plot_timeseries(ax, ds, life, feature):
    # Plot the time series of one feature for a single cycle
    ax.plot(
        ds.get_features_of_life(life)[feature].values,
        alpha=0.5,
        label=f"{feature} (cycle {life})",
    )
KL Divergence¶
Looking at the top offenders, the timers and the usage rank first, probably because they started at different values at the beginning of the cycles. The differences are between cycles 2 and 21, and between cycles 6 and 21. We can also see that ETCHPBNGASREADBACK presents a bias in its values.
# Row 5 of the KL ranking is the first entry that is not a timer or usage feature
first_row = d.sort_values(by=[ "KL", "Abs Length difference",], ascending=[False, True]).iloc[5, :]
feature = first_row["feature"]
life1 = first_row["Cycle 1"]
life2 = first_row["Cycle 2"]
fig, ax = plt.subplots(2, 1, figsize=(17, 9))
plot_distributions(ax[0], transformed_dataset, life1, life2, feature, bins=25)
plot_timeseries(ax[1], transformed_dataset, life1, feature)
plot_timeseries(ax[1], transformed_dataset, life2, feature)
ax[1].legend()
Wasserstein¶
We can also see that ETCHBEAMVOLTAGE presents a bias in its values.
fig, ax = plt.subplots(2, 1, figsize=(17, 9))
first_row = d.sort_values(by=[ "Wasserstein", "Abs Length difference",], ascending=[False, True]).iloc[0, :]
feature = first_row["feature"]
life1 = first_row["Cycle 1"]
life2 = first_row["Cycle 2"]
plot_distributions(ax[0], transformed_dataset, life1, life2, feature, bins=25)
plot_timeseries(ax[1], transformed_dataset, life1, feature)
plot_timeseries(ax[1], transformed_dataset, life2, feature)
ax[1].legend()
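The link between a bias and the Wasserstein distance can be sketched in plain Python. For two one-dimensional samples of equal size, the Wasserstein-1 distance is the mean absolute difference between the sorted values. The helper `wasserstein_1d` and the data are made up for illustration, and the equal-size restriction is a simplification that the library does not necessarily share.

```python
def wasserstein_1d(sample_a, sample_b):
    """Wasserstein-1 distance between two equal-size 1-D samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes samples of equal size")
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)


# Made-up readings: the second cycle is the first one plus a constant offset
base = [0.10, 0.30, 0.20, 0.40]
shifted = [v + 0.15 for v in base]

distance = wasserstein_1d(base, shifted)
print(distance)
```

A constant offset between two cycles shows up directly as a distance equal to the offset, which is why this metric is well suited to spotting biased readings.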