Association Rules and Sequential Patterns#
This notebook focuses on mining association rules and discovering sequential patterns in a time-series dataset. We’ll extract relevant features, discretize them, and then apply algorithms to uncover interesting relationships and sequences among those features.
Objectives#
This tutorial covers:
How to extract features from time-series data.
Mining association rules to identify relationships between discretized features.
Discovering sequential patterns to analyze the most frequent sequences among features.
Key differences:
Association Rules: Show which feature categories frequently occur together and which features often follow or are dependent on others.
Sequential Patterns: Highlight the most frequent sequences and temporal order among feature categories.
Installation and Imports#
First, we ensure that the required libraries are installed
[ ]:
%pip install mlxtend prefixspan
Import necessary libraries
[ ]:
import pandas as pd
from interpreTS.utils.data_validation import validate_time_series_data
from interpreTS.utils.data_conversion import convert_to_time_series
from interpreTS.core.feature_extractor import FeatureExtractor, Features
from mlxtend.frequent_patterns import apriori, association_rules
from prefixspan import PrefixSpan
Version check for interpreTS
[2]:
import interpreTS
print(f"Version: {interpreTS.__version__}")
Version: 0.5.0
Load and Preview Data#
Load the data
[3]:
df = pd.read_csv('../data/radiator.csv')
Convert the ‘timestamp’ column to datetime and set it as the index
[4]:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)
Preview the data
[5]:
display(df.head())
| timestamp | power |
| --- | --- |
| 2020-12-23 16:42:05+00:00 | 1.0 |
| 2020-12-23 16:42:06+00:00 | 1.0 |
| 2020-12-23 16:42:07+00:00 | 1.0 |
| 2020-12-23 16:42:08+00:00 | 2.5 |
| 2020-12-23 16:42:09+00:00 | 3.0 |
Check dataset information
[6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2592001 entries, 2020-12-23 16:42:05+00:00 to 2021-01-22 16:42:05+00:00
Data columns (total 1 columns):
# Column Dtype
--- ------ -----
0 power float64
dtypes: float64(1)
memory usage: 39.6 MB
Validate and Convert the data#
To ensure the dataset is suitable for time-series analysis, we validate it using the `validate_time_series_data` function.
[7]:
try:
validate_time_series_data(df)
print("Time series data validation passed.")
except (TypeError, ValueError) as e:
print(f"Validation error: {e}")
Time series data validation passed.
The data is converted into an interpreTS `TimeSeriesData` object for further analysis.
[8]:
time_series_data = convert_to_time_series(df)
[9]:
# Print the converted TimeSeriesData object and its underlying data
print(time_series_data)
display(time_series_data.data)
<interpreTS.core.time_series_data.TimeSeriesData object at 0x0000018C55299BB0>
| timestamp | power |
| --- | --- |
| 2020-12-23 16:42:05+00:00 | 1.0 |
| 2020-12-23 16:42:06+00:00 | 1.0 |
| 2020-12-23 16:42:07+00:00 | 1.0 |
| 2020-12-23 16:42:08+00:00 | 2.5 |
| 2020-12-23 16:42:09+00:00 | 3.0 |
| ... | ... |
| 2021-01-22 16:42:01+00:00 | 1178.0 |
| 2021-01-22 16:42:02+00:00 | 1167.0 |
| 2021-01-22 16:42:03+00:00 | 1178.0 |
| 2021-01-22 16:42:04+00:00 | 1190.0 |
| 2021-01-22 16:42:05+00:00 | 1190.0 |
2592001 rows × 1 columns
Feature Extraction#
We extract statistical features from the time-series data, such as mean, peak, trough, variance, and spikeness, using a sliding window approach.
[10]:
extractor = FeatureExtractor(
features=[
Features.MEAN,
Features.PEAK,
Features.TROUGH,
Features.VARIANCE,
Features.SPIKENESS
],
window_size=60,
stride=30
)
features = extractor.extract_features(time_series_data.data)
Display the extracted features
[11]:
display(features)
| | mean_power | peak_power | trough_power | variance_power | spikeness_power |
| --- | --- | --- | --- | --- | --- |
| 0 | 601.708333 | 1314.0 | 1.0 | 414014.519097 | 0.156262 |
| 1 | 775.850000 | 1314.0 | 1.0 | 396195.760833 | -0.398615 |
| 2 | 176.033333 | 1303.0 | 1.0 | 191088.632222 | 2.212143 |
| 3 | 380.816667 | 1314.0 | 1.0 | 341707.616389 | 0.942842 |
| 4 | 808.200000 | 1314.0 | 2.0 | 353516.293333 | -0.441064 |
| ... | ... | ... | ... | ... | ... |
| 86394 | 901.433333 | 1212.0 | 1.0 | 246873.478889 | -1.182936 |
| 86395 | 1003.950000 | 1212.0 | 1.0 | 186048.780833 | -1.834241 |
| 86396 | 1193.233333 | 1201.0 | 1178.0 | 43.512222 | -0.296779 |
| 86397 | 828.750000 | 1201.0 | 1.0 | 247346.620833 | -0.703579 |
| 86398 | 821.283333 | 1201.0 | 1.0 | 241983.030139 | -0.704786 |
86399 rows × 5 columns
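As a sanity check, the number of feature rows follows directly from the window and stride settings. Assuming the common convention that every window must fit entirely inside the series, a series of N points with window size w and stride s yields floor((N - w) / s) + 1 windows:

```python
# Count sliding windows for a series of n_samples points,
# assuming each window must fit entirely within the series.
def n_windows(n_samples: int, window_size: int, stride: int) -> int:
    return (n_samples - window_size) // stride + 1

print(n_windows(2_592_001, 60, 30))  # 86399, matching the row count above
```

This matches the 86,399 rows in the extracted feature table, confirming how `window_size=60` and `stride=30` interact.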
Feature Discretization#
The extracted features are discretized into three equal-width bins (`low`, `medium`, and `high`) to simplify analysis.
[12]:
columns_to_discretize = [
'mean_power',
'peak_power',
'trough_power',
'variance_power',
'spikeness_power'
]
Discretize features into bins
[13]:
binned_features = pd.DataFrame()
for col in columns_to_discretize:
binned_features[f"{col}_bin"] = pd.cut(features[col], bins=3, labels=["low", "medium", "high"])
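Note that `pd.cut` with `bins=3` splits the observed value range into three equal-width intervals, so bin edges depend on each feature's min and max rather than on quantiles. A toy illustration (values here are made up for demonstration):

```python
import pandas as pd

# Range is 1..9, so equal-width edges fall near 3.67 and 6.33
s = pd.Series([1.0, 2.0, 5.0, 9.0])
binned = pd.cut(s, bins=3, labels=["low", "medium", "high"])
print(list(binned))  # ['low', 'low', 'medium', 'high']
```

If roughly equal-sized bins were preferred instead, `pd.qcut` (quantile-based binning) would be the drop-in alternative.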
Encode the binned features into one-hot encoding
[14]:
encoded_features = pd.get_dummies(binned_features, prefix=binned_features.columns)
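`pd.get_dummies` expands each bin label into its own boolean column, which is the one-hot, transaction-style format that Apriori expects. A minimal illustration with made-up labels:

```python
import pandas as pd

bins = pd.DataFrame({"mean_power_bin": ["low", "high", "low"]})
onehot = pd.get_dummies(bins, prefix=bins.columns)
# Columns are named <prefix>_<label>, sorted by label
print(list(onehot.columns))  # ['mean_power_bin_high', 'mean_power_bin_low']
```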
Association Rule Mining#
Using the Apriori algorithm, we extract frequent itemsets with a minimum support threshold of 0.35.
[15]:
frequent_itemsets = apriori(encoded_features, min_support=0.35, use_colnames=True)
num_itemsets = len(frequent_itemsets)
We generate association rules using `lift` as the evaluation metric, keeping only rules with a lift of at least 1.0 (i.e., the antecedent and consequent occur together at least as often as expected under independence).
[16]:
rules = association_rules(frequent_itemsets, num_itemsets, metric="lift", min_threshold=1.0)
Display the results
[17]:
print("Association Rules:")
display(rules.head())
Association Rules:
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | (peak_power_bin_high) | (mean_power_bin_medium) | 0.936643 | 0.362527 | 0.362527 | 0.387050 | 1.067643 | 1.0 | 0.022969 | 1.040007 | 1.000000 | 0.387050 | 0.038468 | 0.693525 |
| 1 | (mean_power_bin_medium) | (peak_power_bin_high) | 0.362527 | 0.936643 | 0.362527 | 1.000000 | 1.067643 | 1.0 | 0.022969 | inf | 0.099388 | 0.387050 | 1.000000 | 0.693525 |
| 2 | (trough_power_bin_low) | (mean_power_bin_medium) | 0.883922 | 0.362527 | 0.362527 | 0.410135 | 1.131321 | 1.0 | 0.042081 | 1.080709 | 1.000000 | 0.410135 | 0.074682 | 0.705067 |
| 3 | (mean_power_bin_medium) | (trough_power_bin_low) | 0.362527 | 0.883922 | 0.362527 | 1.000000 | 1.131321 | 1.0 | 0.042081 | inf | 0.182091 | 0.410135 | 1.000000 | 0.705067 |
| 4 | (spikeness_power_bin_medium) | (mean_power_bin_medium) | 0.880774 | 0.362527 | 0.362527 | 0.411601 | 1.135365 | 1.0 | 0.043223 | 1.083402 | 1.000000 | 0.411601 | 0.076981 | 0.705800 |
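The core metrics can be verified by hand: confidence is support(A∪C) / support(A), and lift is confidence / support(C). Recomputing for rule 1 from the numbers in the table above:

```python
# Rule 1: mean_power_bin_medium -> peak_power_bin_high
support_a = 0.362527   # antecedent support
support_c = 0.936643   # consequent support
support_ac = 0.362527  # joint support

confidence = support_ac / support_a  # fraction of antecedent rows containing consequent
lift = confidence / support_c        # ratio vs. expected co-occurrence under independence
print(round(confidence, 6), round(lift, 6))  # 1.0 1.067643
```

The confidence of 1.0 means every window with a medium mean also has a high peak; the lift barely above 1 reflects that high peaks are already nearly ubiquitous (support 0.94).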
Sequential Pattern Mining#
We prepare the sequence data by combining feature names with their discretized categories (e.g., `mean_power: low`).
[18]:
sequences = []
for index, row in binned_features.iterrows():
sequence = []
for i, value in enumerate(row):
feature_name = columns_to_discretize[i]
sequence.append(f"{feature_name}: {value}")
sequences.append(sequence)
Display a sample of the sequences
[19]:
display(sequences[:5])
[['mean_power: medium',
'peak_power: high',
'trough_power: low',
'variance_power: high',
'spikeness_power: medium'],
['mean_power: medium',
'peak_power: high',
'trough_power: low',
'variance_power: high',
'spikeness_power: medium'],
['mean_power: low',
'peak_power: high',
'trough_power: low',
'variance_power: medium',
'spikeness_power: medium'],
['mean_power: low',
'peak_power: high',
'trough_power: low',
'variance_power: high',
'spikeness_power: medium'],
['mean_power: medium',
'peak_power: high',
'trough_power: low',
'variance_power: high',
'spikeness_power: medium']]
The PrefixSpan algorithm is used to find frequent sequential patterns. Note that in the `prefixspan` library, the `minsup` argument to `ps.frequent()` is an absolute occurrence count, not a fraction: `minsup=0.9` therefore keeps every pattern occurring at least once (as the low-count patterns in the output below confirm). For a 90% relative threshold, pass `int(0.9 * len(sequences))` instead.
[20]:
ps = PrefixSpan(sequences)
patterns = ps.frequent(minsup=0.9)
Display the top 20 patterns
[21]:
for pattern in patterns[:20]:
print(pattern)
(31322, ['mean_power: medium'])
(31322, ['mean_power: medium', 'peak_power: high'])
(31322, ['mean_power: medium', 'peak_power: high', 'trough_power: low'])
(25850, ['mean_power: medium', 'peak_power: high', 'trough_power: low', 'variance_power: high'])
(25850, ['mean_power: medium', 'peak_power: high', 'trough_power: low', 'variance_power: high', 'spikeness_power: medium'])
(31322, ['mean_power: medium', 'peak_power: high', 'trough_power: low', 'spikeness_power: medium'])
(5472, ['mean_power: medium', 'peak_power: high', 'trough_power: low', 'variance_power: medium'])
(5472, ['mean_power: medium', 'peak_power: high', 'trough_power: low', 'variance_power: medium', 'spikeness_power: medium'])
(25850, ['mean_power: medium', 'peak_power: high', 'variance_power: high'])
(25850, ['mean_power: medium', 'peak_power: high', 'variance_power: high', 'spikeness_power: medium'])
(31322, ['mean_power: medium', 'peak_power: high', 'spikeness_power: medium'])
(5472, ['mean_power: medium', 'peak_power: high', 'variance_power: medium'])
(5472, ['mean_power: medium', 'peak_power: high', 'variance_power: medium', 'spikeness_power: medium'])
(31322, ['mean_power: medium', 'trough_power: low'])
(25850, ['mean_power: medium', 'trough_power: low', 'variance_power: high'])
(25850, ['mean_power: medium', 'trough_power: low', 'variance_power: high', 'spikeness_power: medium'])
(31322, ['mean_power: medium', 'trough_power: low', 'spikeness_power: medium'])
(5472, ['mean_power: medium', 'trough_power: low', 'variance_power: medium'])
(5472, ['mean_power: medium', 'trough_power: low', 'variance_power: medium', 'spikeness_power: medium'])
(25850, ['mean_power: medium', 'variance_power: high'])
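The counts in the output above are sequence supports: the number of input sequences that contain the pattern as an ordered, not necessarily contiguous, subsequence. A small pure-Python sketch of that containment check (illustration only, not the PrefixSpan algorithm itself):

```python
def contains_subsequence(seq, pattern):
    """True if pattern occurs in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(item in it for item in pattern)  # each `in` advances the iterator

toy = [["a", "b", "c"], ["a", "c"], ["b", "a", "c"]]
pattern = ["a", "c"]
support = sum(contains_subsequence(s, pattern) for s in toy)
print(support)  # 3: every toy sequence contains 'a' before 'c'
```

Here each window contributes one fixed-length sequence, so the ordering within a pattern reflects the feature order (mean, peak, trough, variance, spikeness) rather than time between windows.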
Results and Analysis#
Association Rule Results
Association rules reveal which feature categories often occur together or imply each other.
Key metrics such as support, confidence, and lift are used to evaluate the rules.
Sequential Pattern Results
Sequential patterns show the most common sequences among feature categories, helping to identify temporal dependencies and trends.
Conclusion#
Association Rules:
Apriori successfully identified frequent itemsets and meaningful rules.
The rules provide insights into co-occurring feature behaviors.
Sequential Patterns:
PrefixSpan uncovered frequent temporal patterns, which can be leveraged to understand the sequence of events in the data.
This approach demonstrates the power of feature engineering and pattern discovery for time-series data.