Association Rules and Sequential Patterns#

This notebook focuses on mining association rules and discovering sequential patterns using a time-series dataset. We’ll extract relevant features, discretize them, and then apply algorithms to uncover interesting relationships and frequent sequences among the extracted features.

Table of Contents#

  1. Objectives

  2. Installation and Imports

  3. Load and Preview Data

  4. Validate and Convert the Data

  5. Feature Extraction

  6. Feature Discretization

  7. Association Rule Mining

  8. Sequential Pattern Mining

  9. Results and Analysis

  10. Conclusion

Objectives#

This tutorial covers:

  • How to extract features from time-series data.

  • Mining association rules to identify relationships between discretized features.

  • Discovering sequential patterns to analyze the most frequent sequences among features.

  • Key differences:

    • Association Rules: Show which feature categories frequently occur together in the same window and which categories imply others.

    • Sequential Patterns: Highlight the most frequent sequences and temporal order among feature categories.

Installation and Imports#

First, we ensure that the required libraries are installed

[ ]:
%pip install mlxtend prefixspan

Import necessary libraries

[ ]:
import pandas as pd
from interpreTS.utils.data_validation import validate_time_series_data
from interpreTS.utils.data_conversion import convert_to_time_series
from interpreTS.core.feature_extractor import FeatureExtractor, Features
from mlxtend.frequent_patterns import apriori, association_rules
from prefixspan import PrefixSpan

Version check for interpreTS

[2]:
import interpreTS
print(f"Version: {interpreTS.__version__}")
Version: 0.5.0

Load and Preview Data#

Load the data

[3]:
df = pd.read_csv('../data/radiator.csv')

Convert the ‘timestamp’ column to datetime and set it as the index

[4]:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)

Preview the data

[5]:
display(df.head())
power
timestamp
2020-12-23 16:42:05+00:00 1.0
2020-12-23 16:42:06+00:00 1.0
2020-12-23 16:42:07+00:00 1.0
2020-12-23 16:42:08+00:00 2.5
2020-12-23 16:42:09+00:00 3.0

Check dataset information

[6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2592001 entries, 2020-12-23 16:42:05+00:00 to 2021-01-22 16:42:05+00:00
Data columns (total 1 columns):
 #   Column  Dtype
---  ------  -----
 0   power   float64
dtypes: float64(1)
memory usage: 39.6 MB

Validate and Convert the Data#

To ensure the dataset is suitable for time-series analysis, we validate it using the validate_time_series_data function.

[7]:
try:
    validate_time_series_data(df)
    print("Time series data validation passed.")
except (TypeError, ValueError) as e:
    print(f"Validation error: {e}")
Time series data validation passed.

The data is converted into an interpreTS TimeSeriesData object for further analysis.

[8]:
time_series_data = convert_to_time_series(df)
[9]:
# Print the converted TimeSeriesData object and its underlying data
print(time_series_data)
display(time_series_data.data)
<interpreTS.core.time_series_data.TimeSeriesData object at 0x0000018C55299BB0>
power
timestamp
2020-12-23 16:42:05+00:00 1.0
2020-12-23 16:42:06+00:00 1.0
2020-12-23 16:42:07+00:00 1.0
2020-12-23 16:42:08+00:00 2.5
2020-12-23 16:42:09+00:00 3.0
... ...
2021-01-22 16:42:01+00:00 1178.0
2021-01-22 16:42:02+00:00 1167.0
2021-01-22 16:42:03+00:00 1178.0
2021-01-22 16:42:04+00:00 1190.0
2021-01-22 16:42:05+00:00 1190.0

2592001 rows × 1 columns

Feature Extraction#

We extract statistical features from the time-series data, such as mean, peak, trough, variance, and spikeness, using a sliding window approach.

[10]:
extractor = FeatureExtractor(
    features=[
        Features.MEAN,
        Features.PEAK,
        Features.TROUGH,
        Features.VARIANCE,
        Features.SPIKENESS
    ],
    window_size=60,
    stride=30
)
features = extractor.extract_features(time_series_data.data)

Display the extracted features

[11]:
display(features)
mean_power peak_power trough_power variance_power spikeness_power
0 601.708333 1314.0 1.0 414014.519097 0.156262
1 775.850000 1314.0 1.0 396195.760833 -0.398615
2 176.033333 1303.0 1.0 191088.632222 2.212143
3 380.816667 1314.0 1.0 341707.616389 0.942842
4 808.200000 1314.0 2.0 353516.293333 -0.441064
... ... ... ... ... ...
86394 901.433333 1212.0 1.0 246873.478889 -1.182936
86395 1003.950000 1212.0 1.0 186048.780833 -1.834241
86396 1193.233333 1201.0 1178.0 43.512222 -0.296779
86397 828.750000 1201.0 1.0 247346.620833 -0.703579
86398 821.283333 1201.0 1.0 241983.030139 -0.704786

86399 rows × 5 columns
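
As a quick sanity check on the windowing, the number of strided windows implied by window_size=60 and stride=30 matches the 86399 feature rows above (a sketch assuming simple non-padded strided windows):

[ ]:
n_samples = len(time_series_data.data)          # 2,592,001 one-second readings
window_size, stride = 60, 30
n_windows = (n_samples - window_size) // stride + 1
print(n_windows)                                # 86399, matching the feature table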

Feature Discretization#

The extracted features are discretized into three bins (low, medium, and high) to simplify analysis.

[12]:
columns_to_discretize = [
    'mean_power',
    'peak_power',
    'trough_power',
    'variance_power',
    'spikeness_power'
    ]

Discretize features into bins

[13]:
binned_features = pd.DataFrame()
for col in columns_to_discretize:
    binned_features[f"{col}_bin"] = pd.cut(features[col], bins=3, labels=["low", "medium", "high"])
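
To see where the equal-width cut points fall for a given feature, pd.cut can also return the bin edges (an optional check):

[ ]:
# Inspect the three equal-width bin edges chosen for mean_power
_, edges = pd.cut(features['mean_power'], bins=3, labels=["low", "medium", "high"], retbins=True)
print(edges)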

One-hot encode the binned features

[14]:
encoded_features = pd.get_dummies(binned_features, prefix=binned_features.columns)
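
Apriori expects a one-hot (boolean) transaction matrix, which is why we encode the bins. A minimal illustration on a hypothetical three-row frame:

[ ]:
# Hypothetical example: each bin label becomes its own boolean column
demo = pd.DataFrame({"mean_power_bin": ["low", "high", "low"]})
print(pd.get_dummies(demo, prefix=demo.columns))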

Association Rule Mining#

Using the Apriori algorithm, we extract frequent itemsets with a minimum support threshold of 0.35.

[15]:
frequent_itemsets = apriori(encoded_features, min_support=0.35, use_colnames=True)
# Recent mlxtend versions require the number of transactions (rows of the
# one-hot input DataFrame) as the second argument to association_rules
num_itemsets = len(encoded_features)
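
Before generating rules, it can be useful to glance at the most frequent itemsets (an optional check):

[ ]:
# Peek at the itemsets with the highest support
display(frequent_itemsets.sort_values("support", ascending=False).head())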

We generate association rules using lift as the evaluation metric, keeping rules with lift of at least 1.0 (i.e., the antecedent and consequent co-occur at least as often as independence would predict).

[16]:
rules = association_rules(frequent_itemsets, num_itemsets, metric="lift", min_threshold=1.0)

Display the results

[17]:
print("Association Rules:")
display(rules.head())
Association Rules:
antecedents consequents antecedent support consequent support support confidence lift representativity leverage conviction zhangs_metric jaccard certainty kulczynski
0 (peak_power_bin_high) (mean_power_bin_medium) 0.936643 0.362527 0.362527 0.387050 1.067643 1.0 0.022969 1.040007 1.000000 0.387050 0.038468 0.693525
1 (mean_power_bin_medium) (peak_power_bin_high) 0.362527 0.936643 0.362527 1.000000 1.067643 1.0 0.022969 inf 0.099388 0.387050 1.000000 0.693525
2 (trough_power_bin_low) (mean_power_bin_medium) 0.883922 0.362527 0.362527 0.410135 1.131321 1.0 0.042081 1.080709 1.000000 0.410135 0.074682 0.705067
3 (mean_power_bin_medium) (trough_power_bin_low) 0.362527 0.883922 0.362527 1.000000 1.131321 1.0 0.042081 inf 0.182091 0.410135 1.000000 0.705067
4 (spikeness_power_bin_medium) (mean_power_bin_medium) 0.880774 0.362527 0.362527 0.411601 1.135365 1.0 0.043223 1.083402 1.000000 0.411601 0.076981 0.705800
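
The full rule table is wide; one way to narrow it is to filter on confidence and lift (the thresholds below are illustrative, not prescriptive):

[ ]:
# Keep only high-confidence rules with clearly above-independence lift
strong_rules = rules[(rules["confidence"] >= 0.99) & (rules["lift"] > 1.05)]
display(strong_rules[["antecedents", "consequents", "support", "confidence", "lift"]])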

Sequential Pattern Mining#

We prepare the sequence data by combining feature names with their discretized categories (e.g., mean_power: low).

[18]:
sequences = []
for index, row in binned_features.iterrows():
    sequence = []
    for i, value in enumerate(row):
        feature_name = columns_to_discretize[i]
        sequence.append(f"{feature_name}: {value}")
    sequences.append(sequence)
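
Equivalently, the same sequences can be built without the row-wise iterrows loop, which is noticeably faster on ~86k rows (a sketch producing identical output):

[ ]:
sequences_alt = [
    [f"{name}: {value}" for name, value in zip(columns_to_discretize, row)]
    for row in binned_features.itertuples(index=False)
]
assert sequences_alt == sequences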

Display a sample of the sequences

[19]:
display(sequences[:5])
[['mean_power: medium',
  'peak_power: high',
  'trough_power: low',
  'variance_power: high',
  'spikeness_power: medium'],
 ['mean_power: medium',
  'peak_power: high',
  'trough_power: low',
  'variance_power: high',
  'spikeness_power: medium'],
 ['mean_power: low',
  'peak_power: high',
  'trough_power: low',
  'variance_power: medium',
  'spikeness_power: medium'],
 ['mean_power: low',
  'peak_power: high',
  'trough_power: low',
  'variance_power: high',
  'spikeness_power: medium'],
 ['mean_power: medium',
  'peak_power: high',
  'trough_power: low',
  'variance_power: high',
  'spikeness_power: medium']]

The PrefixSpan algorithm is used to find frequent sequential patterns. Note that PrefixSpan’s minsup is an absolute number of sequences rather than a fraction, so minsup=0.9 effectively keeps every pattern that occurs in at least one window.

[20]:
ps = PrefixSpan(sequences)
patterns = ps.frequent(minsup=0.9)
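
If you prefer a relative threshold, convert it to an absolute count first (the 30% below is purely illustrative):

[ ]:
# Keep only patterns occurring in at least 30% of the windows
min_count = int(0.30 * len(sequences))
frequent_patterns_30 = ps.frequent(min_count)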

Display the top 20 patterns

[21]:
for pattern in patterns[:20]:
    print(pattern)
(31322, ['mean_power: medium'])
(31322, ['mean_power: medium', 'peak_power: high'])
(31322, ['mean_power: medium', 'peak_power: high', 'trough_power: low'])
(25850, ['mean_power: medium', 'peak_power: high', 'trough_power: low', 'variance_power: high'])
(25850, ['mean_power: medium', 'peak_power: high', 'trough_power: low', 'variance_power: high', 'spikeness_power: medium'])
(31322, ['mean_power: medium', 'peak_power: high', 'trough_power: low', 'spikeness_power: medium'])
(5472, ['mean_power: medium', 'peak_power: high', 'trough_power: low', 'variance_power: medium'])
(5472, ['mean_power: medium', 'peak_power: high', 'trough_power: low', 'variance_power: medium', 'spikeness_power: medium'])
(25850, ['mean_power: medium', 'peak_power: high', 'variance_power: high'])
(25850, ['mean_power: medium', 'peak_power: high', 'variance_power: high', 'spikeness_power: medium'])
(31322, ['mean_power: medium', 'peak_power: high', 'spikeness_power: medium'])
(5472, ['mean_power: medium', 'peak_power: high', 'variance_power: medium'])
(5472, ['mean_power: medium', 'peak_power: high', 'variance_power: medium', 'spikeness_power: medium'])
(31322, ['mean_power: medium', 'trough_power: low'])
(25850, ['mean_power: medium', 'trough_power: low', 'variance_power: high'])
(25850, ['mean_power: medium', 'trough_power: low', 'variance_power: high', 'spikeness_power: medium'])
(31322, ['mean_power: medium', 'trough_power: low', 'spikeness_power: medium'])
(5472, ['mean_power: medium', 'trough_power: low', 'variance_power: medium'])
(5472, ['mean_power: medium', 'trough_power: low', 'variance_power: medium', 'spikeness_power: medium'])
(25850, ['mean_power: medium', 'variance_power: high'])
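
The counts PrefixSpan returns are absolute; dividing by the number of sequences gives relative support. For example, 31322 / 86399 ≈ 0.3625, which matches the apriori support of mean_power_bin_medium above:

[ ]:
# Show the five most frequent patterns with their relative support
for count, pattern in sorted(patterns, key=lambda p: p[0], reverse=True)[:5]:
    print(f"{count / len(sequences):.4f}", pattern)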

Results and Analysis#

Association Rule Results

  • Association rules reveal which feature categories often occur together or imply each other.

  • Key metrics such as support, confidence, and lift are used to evaluate the rules (see the quick check below).
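
A quick recomputation on the first rule in the table confirms how these metrics relate (values taken from the output above):

[ ]:
# confidence = support / antecedent support; lift = confidence / consequent support
support, antecedent_support, consequent_support = 0.362527, 0.936643, 0.362527
confidence = support / antecedent_support
lift = confidence / consequent_support
print(round(confidence, 6), round(lift, 6))   # ≈ 0.38705 and ≈ 1.06764, as in the table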

Sequential Pattern Results

  • Sequential patterns show the most common sequences among feature categories, helping to identify temporal dependencies and trends.

Conclusion#

  1. Association Rules:

  • Apriori successfully identified frequent itemsets and meaningful rules.

  • The rules provide insights into co-occurring feature behaviors.

  2. Sequential Patterns:

  • PrefixSpan uncovered frequent temporal patterns, which can be leveraged to understand the sequence of events in the data.

This approach demonstrates the power of feature engineering and pattern discovery for time-series data.