Extract Features Using Dask for Large Time Series Data#
This notebook demonstrates how to use the interpreTS library with the Dask framework to efficiently process and extract features from large time series datasets.
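Both interpreTS and Dask need to be available in your environment. If they are not installed yet, an install along these lines should work (a sketch; the package names are assumed to match their PyPI releases):

%pip install interpreTS dask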
Step 1: Import Libraries#
[1]:
import pandas as pd
import numpy as np
from interpreTS.core.feature_extractor import FeatureExtractor, Features
Step 2: Generate Large Time Series Data#
Here, we create a large dataset with 100 unique time series (id) and 1,000 data points for each, resulting in a total of 100,000 rows. Each id represents a distinct time series.
[2]:
data = pd.DataFrame({
    'id': np.repeat(range(100), 1000),   # 100 time series
    'time': np.tile(range(1000), 100),   # 1,000 time steps per series
    'value': np.random.randn(100000)     # Random values
})
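Before extracting features, a quick look at the generated frame confirms the layout (purely an optional sanity check):

print(data.shape)   # (100000, 3)
data.head()         # first few rows of the series with id 0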
Step 3: Initialize the FeatureExtractor#
We specify the following parameters for feature extraction:
- features: Extracting only the mean (Features.MEAN) from the value column.
- feature_column: The column from which to calculate the feature.
- id_column: Grouping the data by the unique id column.
- window_size: Rolling windows of 3 samples.
- stride: Sliding by 5 samples per step (see the window-count sketch after the next cell).
[3]:
feature_extractor = FeatureExtractor(
    features=[Features.MEAN],   # Extract mean feature
    feature_column="value",     # Target column
    id_column="id",             # Unique identifier for time series
    window_size=3,              # Rolling window size
    stride=5                    # Sliding step size
)
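With window_size=3 and stride=5, consecutive windows do not overlap: each window covers 3 samples and the next one starts 5 samples later. A minimal sketch of the expected output size, assuming windows are anchored at the start of each series and incomplete trailing windows are dropped:

# Window start positions per series: t = 0, 5, 10, ..., 995
n_windows = len(range(0, 1000 - 3 + 1, 5))  # 200 windows per series
print(n_windows * 100)                      # 20,000 feature rows in total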
Step 4: Extract Features Using Dask#
To handle the large dataset efficiently, we pass mode='dask' to the extract_features method. This processes the data in parallel using Dask.
[4]:
features_df = feature_extractor.extract_features(data, mode='dask')
[########################################] | 100% Completed | 3.59 s
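As a quick sanity check, the number of extracted rows should line up with the window arithmetic sketched in Step 3 (again assuming incomplete trailing windows are dropped):

print(len(features_df))  # expected: 20,000 (200 windows x 100 series)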
Step 5: Display the Extracted Features#
Finally, we display the first few rows of the extracted features.
[5]:
display(features_df.head())
|   | mean_value |
|---|---|
| 0 | 0.147607 |
| 1 | -1.034064 |
| 2 | 0.846525 |
| 3 | -0.319443 |
| 4 | -0.763688 |
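Since the result behaves like a pandas DataFrame (as the head() call above shows), it can also be persisted for later analysis; the file name here is just an example:

features_df.to_csv("extracted_features.csv")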