Association Rules and Sequential Patterns#
This notebook focuses on mining association rules and discovering sequential patterns in a time-series dataset. We’ll extract relevant features, discretize them, and then apply algorithms to uncover interesting relationships and sequences among those features.
Objectives#
This tutorial covers:
How to extract features from time-series data.
Mining association rules to identify relationships between discretized features.
Discovering sequential patterns to analyze the most frequent sequences among features.
Key differences:
Association Rules: Show which feature categories frequently occur together and which features often follow or are dependent on others.
Sequential Patterns: Highlight the most frequent sequences and temporal order among feature categories.
Installation and Imports#
First, we ensure that the required libraries are installed
[ ]:
%pip install mlxtend prefixspan
Import necessary libraries
[ ]:
import pandas as pd
from interpreTS.utils.data_validation import validate_time_series_data
from interpreTS.utils.data_conversion import convert_to_time_series
from interpreTS.core.feature_extractor import FeatureExtractor, Features
from mlxtend.frequent_patterns import apriori, association_rules
from prefixspan import PrefixSpan
Version check for interpreTS
[2]:
import interpreTS
print(f"Version: {interpreTS.__version__}")
Version: 0.5.0
Load and Preview Data#
Load the data
[3]:
df = pd.read_csv('../data/radiator.csv')
Convert the ‘timestamp’ column to datetime and set it as the index
[4]:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)
Preview the data
[5]:
display(df.head())
| timestamp | power |
| --- | --- |
| 2020-12-23 16:42:05+00:00 | 1.0 |
| 2020-12-23 16:42:06+00:00 | 1.0 |
| 2020-12-23 16:42:07+00:00 | 1.0 |
| 2020-12-23 16:42:08+00:00 | 2.5 |
| 2020-12-23 16:42:09+00:00 | 3.0 |
Check dataset information
[6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2592001 entries, 2020-12-23 16:42:05+00:00 to 2021-01-22 16:42:05+00:00
Data columns (total 1 columns):
# Column Dtype
--- ------ -----
0 power float64
dtypes: float64(1)
memory usage: 39.6 MB
Validate and Convert the data#
To ensure the dataset is suitable for time-series analysis, we validate it using the `validate_time_series_data` function.
[7]:
try:
validate_time_series_data(df)
print("Time series data validation passed.")
except (TypeError, ValueError) as e:
print(f"Validation error: {e}")
Time series data validation passed.
The data is converted into an interpreTS `TimeSeriesData` object for further analysis.
[8]:
time_series_data = convert_to_time_series(df)
[9]:
# Print the converted TimeSeriesData object and its underlying data
print(time_series_data)
display(time_series_data.data)
<interpreTS.core.time_series_data.TimeSeriesData object at 0x0000018C55299BB0>
| timestamp | power |
| --- | --- |
| 2020-12-23 16:42:05+00:00 | 1.0 |
| 2020-12-23 16:42:06+00:00 | 1.0 |
| 2020-12-23 16:42:07+00:00 | 1.0 |
| 2020-12-23 16:42:08+00:00 | 2.5 |
| 2020-12-23 16:42:09+00:00 | 3.0 |
| ... | ... |
| 2021-01-22 16:42:01+00:00 | 1178.0 |
| 2021-01-22 16:42:02+00:00 | 1167.0 |
| 2021-01-22 16:42:03+00:00 | 1178.0 |
| 2021-01-22 16:42:04+00:00 | 1190.0 |
| 2021-01-22 16:42:05+00:00 | 1190.0 |
2592001 rows × 1 columns
Feature Extraction#
We extract statistical features from the time-series data, such as mean, peak, trough, variance, and spikeness, using a sliding window approach.
[10]:
extractor = FeatureExtractor(
features=[
Features.MEAN,
Features.PEAK,
Features.TROUGH,
Features.VARIANCE,
Features.SPIKENESS
],
window_size=60,
stride=30
)
features = extractor.extract_features(time_series_data.data)
Display the extracted features
[11]:
display(features)
| | mean_power | peak_power | trough_power | variance_power | spikeness_power |
| --- | --- | --- | --- | --- | --- |
| 0 | 601.708333 | 1314.0 | 1.0 | 414014.519097 | 0.156262 |
| 1 | 775.850000 | 1314.0 | 1.0 | 396195.760833 | -0.398615 |
| 2 | 176.033333 | 1303.0 | 1.0 | 191088.632222 | 2.212143 |
| 3 | 380.816667 | 1314.0 | 1.0 | 341707.616389 | 0.942842 |
| 4 | 808.200000 | 1314.0 | 2.0 | 353516.293333 | -0.441064 |
| ... | ... | ... | ... | ... | ... |
| 86394 | 901.433333 | 1212.0 | 1.0 | 246873.478889 | -1.182936 |
| 86395 | 1003.950000 | 1212.0 | 1.0 | 186048.780833 | -1.834241 |
| 86396 | 1193.233333 | 1201.0 | 1178.0 | 43.512222 | -0.296779 |
| 86397 | 828.750000 | 1201.0 | 1.0 | 247346.620833 | -0.703579 |
| 86398 | 821.283333 | 1201.0 | 1.0 | 241983.030139 | -0.704786 |
86399 rows × 5 columns
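As a sanity check, the number of feature rows follows directly from the window and stride settings. Assuming the common convention that every window must fit entirely inside the series, a series of N points with window size w and stride s yields floor((N - w) / s) + 1 windows:

```python
# Count sliding windows for a series of n_samples points,
# assuming each window must fit entirely within the series.
def n_windows(n_samples: int, window_size: int, stride: int) -> int:
    return (n_samples - window_size) // stride + 1

print(n_windows(2_592_001, 60, 30))  # 86399, matching the row count above
```

This matches the 86,399 rows in the extracted feature table, confirming how `window_size=60` and `stride=30` interact.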
Feature Discretization#
The extracted features are discretized into three equal-width bins (`low`, `medium`, and `high`) to simplify analysis.
[12]:
columns_to_discretize = [
'mean_power',
'peak_power',
'trough_power',
'variance_power',
'spikeness_power'
]
Discretize features into bins
[13]:
binned_features = pd.DataFrame()
for col in columns_to_discretize:
binned_features[f"{col}_bin"] = pd.cut(features[col], bins=3, labels=["low", "medium", "high"])
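Note that `pd.cut` with `bins=3` splits the observed value range into three equal-width intervals, so bin edges depend on each feature's min and max rather than on quantiles. A toy illustration (values here are made up for demonstration):

```python
import pandas as pd

# Range is 1..9, so equal-width edges fall near 3.67 and 6.33
s = pd.Series([1.0, 2.0, 5.0, 9.0])
binned = pd.cut(s, bins=3, labels=["low", "medium", "high"])
print(list(binned))  # ['low', 'low', 'medium', 'high']
```

If roughly equal-sized bins were preferred instead, `pd.qcut` (quantile-based binning) would be the drop-in alternative.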
Encode the binned features into one-hot encoding
[14]:
encoded_features = pd.get_dummies(binned_features, prefix=binned_features.columns)
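`pd.get_dummies` expands each bin label into its own boolean column, which is the one-hot, transaction-style format that Apriori expects. A minimal illustration with made-up labels:

```python
import pandas as pd

bins = pd.DataFrame({"mean_power_bin": ["low", "high", "low"]})
onehot = pd.get_dummies(bins, prefix=bins.columns)
# Columns are named <prefix>_<label>, sorted by label
print(list(onehot.columns))  # ['mean_power_bin_high', 'mean_power_bin_low']
```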
Association Rule Mining#
Using the Apriori algorithm, we extract frequent itemsets with a minimum support threshold of 0.35.
[15]:
frequent_itemsets = apriori(encoded_features, min_support=0.35, use_colnames=True)
num_itemsets = len(frequent_itemsets)
We generate association rules using `lift` as the evaluation metric, keeping only rules with a lift of at least 1.0 (i.e., the antecedent and consequent occur together at least as often as expected under independence).
[16]:
rules = association_rules(frequent_itemsets, num_itemsets, metric="lift", min_threshold=1.0)
Display the results
[17]:
print("Association Rules:")
display(rules.head())
Association Rules:
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | (peak_power_bin_high) | (mean_power_bin_medium) | 0.936643 | 0.362527 | 0.362527 | 0.387050 | 1.067643 | 1.0 | 0.022969 | 1.040007 | 1.000000 | 0.387050 | 0.038468 | 0.693525 |
| 1 | (mean_power_bin_medium) | (peak_power_bin_high) | 0.362527 | 0.936643 | 0.362527 | 1.000000 | 1.067643 | 1.0 | 0.022969 | inf | 0.099388 | 0.387050 | 1.000000 | 0.693525 |
| 2 | (trough_power_bin_low) | (mean_power_bin_medium) | 0.883922 | 0.362527 | 0.362527 | 0.410135 | 1.131321 | 1.0 | 0.042081 | 1.080709 | 1.000000 | 0.410135 | 0.074682 | 0.705067 |
| 3 | (mean_power_bin_medium) | (trough_power_bin_low) | 0.362527 | 0.883922 | 0.362527 | 1.000000 | 1.131321 | 1.0 | 0.042081 | inf | 0.182091 | 0.410135 | 1.000000 | 0.705067 |
| 4 | (spikeness_power_bin_medium) | (mean_power_bin_medium) | 0.880774 | 0.362527 | 0.362527 | 0.411601 | 1.135365 | 1.0 | 0.043223 | 1.083402 | 1.000000 | 0.411601 | 0.076981 | 0.705800 |
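The core metrics can be verified by hand: confidence is support(A∪C) / support(A), and lift is confidence / support(C). Recomputing for rule 1 from the numbers in the table above:

```python
# Rule 1: mean_power_bin_medium -> peak_power_bin_high
support_a = 0.362527   # antecedent support
support_c = 0.936643   # consequent support
support_ac = 0.362527  # joint support

confidence = support_ac / support_a  # fraction of antecedent rows containing consequent
lift = confidence / support_c        # ratio vs. expected co-occurrence under independence
print(round(confidence, 6), round(lift, 6))  # 1.0 1.067643
```

The confidence of 1.0 means every window with a medium mean also has a high peak; the lift barely above 1 reflects that high peaks are already nearly ubiquitous (support 0.94).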
Sequential Pattern Mining#
We prepare the sequence data by combining feature names with their discretized categories (e.g., `mean_power: low`).
[18]:
sequences = []
for index, row in binned_features.iterrows():
sequence = []
for i, value in enumerate(row):
feature_name = columns_to_discretize[i]
sequence.append(f"{feature_name}: {value}")
sequences.append(sequence)
Display a sample of the sequences
[19]:
display(sequences[:5])
[['mean_power: medium',
'peak_power: high',
'trough_power: low',
'variance_power: high',
'spikeness_power: medium'],
['mean_power: medium',
'peak_power: high',
'trough_power: low',
'variance_power: high',
'spikeness_power: medium'],
['mean_power: low',
'peak_power: high',
'trough_power: low',
'variance_power: medium',
'spikeness_power: medium'],
['mean_power: low',
'peak_power: high',
'trough_power: low',
'variance_power: high',
'spikeness_power: medium'],
['mean_power: medium',
'peak_power: high',
'trough_power: low',
'variance_power: high',
'spikeness_power: medium']]
The PrefixSpan algorithm is used to find frequent sequential patterns. Note that in the `prefixspan` library, the `minsup` argument to `ps.frequent()` is an absolute occurrence count, not a fraction: `minsup=0.9` therefore keeps every pattern occurring at least once (as the low-count patterns in the output below confirm). For a 90% relative threshold, pass `int(0.9 * len(sequences))` instead.
[20]:
ps = PrefixSpan(sequences)
patterns = ps.frequent(minsup=0.9)
Display the top 20 patterns
[21]:
for pattern in patterns[:20]:
print(pattern)
(31322, ['mean_power: medium'])
(31322, ['mean_power: medium', 'peak_power: high'])
(31322, ['mean_power: medium', 'peak_power: high', 'trough_power: low'])
(25850, ['mean_power: medium', 'peak_power: high', 'trough_power: low', 'variance_power: high'])
(25850, ['mean_power: medium', 'peak_power: high', 'trough_power: low', 'variance_power: high', 'spikeness_power: medium'])
(31322, ['mean_power: medium', 'peak_power: high', 'trough_power: low', 'spikeness_power: medium'])
(5472, ['mean_power: medium', 'peak_power: high', 'trough_power: low', 'variance_power: medium'])
(5472, ['mean_power: medium', 'peak_power: high', 'trough_power: low', 'variance_power: medium', 'spikeness_power: medium'])
(25850, ['mean_power: medium', 'peak_power: high', 'variance_power: high'])
(25850, ['mean_power: medium', 'peak_power: high', 'variance_power: high', 'spikeness_power: medium'])
(31322, ['mean_power: medium', 'peak_power: high', 'spikeness_power: medium'])
(5472, ['mean_power: medium', 'peak_power: high', 'variance_power: medium'])
(5472, ['mean_power: medium', 'peak_power: high', 'variance_power: medium', 'spikeness_power: medium'])
(31322, ['mean_power: medium', 'trough_power: low'])
(25850, ['mean_power: medium', 'trough_power: low', 'variance_power: high'])
(25850, ['mean_power: medium', 'trough_power: low', 'variance_power: high', 'spikeness_power: medium'])
(31322, ['mean_power: medium', 'trough_power: low', 'spikeness_power: medium'])
(5472, ['mean_power: medium', 'trough_power: low', 'variance_power: medium'])
(5472, ['mean_power: medium', 'trough_power: low', 'variance_power: medium', 'spikeness_power: medium'])
(25850, ['mean_power: medium', 'variance_power: high'])
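The counts in the output above are sequence supports: the number of input sequences that contain the pattern as an ordered, not necessarily contiguous, subsequence. A small pure-Python sketch of that containment check (illustration only, not the PrefixSpan algorithm itself):

```python
def contains_subsequence(seq, pattern):
    """True if pattern occurs in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(item in it for item in pattern)  # each `in` advances the iterator

toy = [["a", "b", "c"], ["a", "c"], ["b", "a", "c"]]
pattern = ["a", "c"]
support = sum(contains_subsequence(s, pattern) for s in toy)
print(support)  # 3: every toy sequence contains 'a' before 'c'
```

Here each window contributes one fixed-length sequence, so the ordering within a pattern reflects the feature order (mean, peak, trough, variance, spikeness) rather than time between windows.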
Results and Analysis#
Association Rule Results
Association rules reveal which feature categories often occur together or imply each other.
Key metrics such as support, confidence, and lift are used to evaluate the rules.
Sequential Pattern Results
Sequential patterns show the most common sequences among feature categories, helping to identify temporal dependencies and trends.
Conclusion#
Association Rules:
Apriori successfully identified frequent itemsets and meaningful rules.
The rules provide insights into co-occurring feature behaviors.
Sequential Patterns:
PrefixSpan uncovered frequent temporal patterns, which can be leveraged to understand the sequence of events in the data.
This approach demonstrates the power of feature engineering and pattern discovery for time-series data.