{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dimensionality Reduction and Clustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, we explore the process of analyzing and clustering time-series data for power consumption of a radiator. We will leverage methods such as Principal Component Analysis (PCA) for dimensionality reduction and K-means & DBSCAN for clustering. Along the way, we will visualize the data using scatter plots." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of Contents\n", "1. [Objectives](#objectives)\n", "2. [Import Required Libraries](#import-required-libraries)\n", "3. [Load and Preview the Data](#load-and-preview-the-data)\n", "4. [Validate and Convert the Data](#validate-and-convert-the-data)\n", "5. [Feature Extraction](#feature-extraction)\n", "6. [Standardize the Features](#standardize-the-features)\n", "7. [PCA for Dimensionality Reduction](#pca-for-dimensionality-reduction)\n", " - [Determine the Number of Components](#determine-the-number-of-components)\n", " - [Apply PCA with Optimal Components](#apply-pca-with-optimal-components)\n", "8. [K-means Clustering](#k-means-clustering)\n", " - [Determine the Optimal Number of Clusters](#determine-the-optimal-number-of-clusters)\n", " - [Apply K-means with Optimal Clusters](#apply-k-means-with-optimal-clusters)\n", " - [Visualize K-means Clustering](#visualize-k-means-clustering)\n", "9. [DBSCAN Clustering](#dbscan-clustering)\n", " - [Apply DBSCAN](#apply-dbscan)\n", " - [Visualize DBSCAN clustering](#visualize-dbscan-clustering)\n", "10. 
[Conclusion](#conclusion)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Objectives " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial covers:\n", "\n", "- Loading and preprocessing time-series data.\n", "- Extracting features from the time-series data.\n", "- Performing dimensionality reduction using Principal Component Analysis (PCA).\n", "- Clustering data using K-means and DBSCAN.\n", "- Visualizing results of dimensionality reduction and clustering.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import Required Libraries " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The libraries below are used for this tutorial:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from interpreTS.utils.data_validation import validate_time_series_data\n", "from interpreTS.utils.data_conversion import convert_to_time_series\n", "from interpreTS.core.feature_extractor import FeatureExtractor, Features\n", "from sklearn.decomposition import PCA\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.cluster import KMeans, DBSCAN\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Additionally, check the version of the `interpreTS` library:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "interpreTS version: 0.5.0\n" ] } ], "source": [ "import interpreTS\n", "print(f\"interpreTS version: {interpreTS.__version__}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load and Preview the Data " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the data containing time-series data for power consumption from the provided CSV file." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('../data/radiator.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convert the `timestamp` column to datetime format and set it as the index:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "df['timestamp'] = pd.to_datetime(df['timestamp'])\n", "df.set_index('timestamp', inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Preview the dataset:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| timestamp | power |
|---|---|
| 2020-12-23 16:42:05+00:00 | 1.0 |
| 2020-12-23 16:42:06+00:00 | 1.0 |
| 2020-12-23 16:42:07+00:00 | 1.0 |
| 2020-12-23 16:42:08+00:00 | 2.5 |
| 2020-12-23 16:42:09+00:00 | 3.0 |
| ... | ... |
| 2021-01-22 16:42:01+00:00 | 1178.0 |
| 2021-01-22 16:42:02+00:00 | 1167.0 |
| 2021-01-22 16:42:03+00:00 | 1178.0 |
| 2021-01-22 16:42:04+00:00 | 1190.0 |
| 2021-01-22 16:42:05+00:00 | 1190.0 |

2592001 rows × 1 columns
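
The feature-extraction step turns the raw power readings into one row of summary statistics per window. As a rough illustration only, the sketch below computes three of those statistics (mean, peak, variance) with plain NumPy and pandas. It is not the interpreTS API. The helper name `window_features` and the `window_size` and `stride` values are hypothetical, and the library may define variance differently (this sketch uses population variance).

```python
import numpy as np
import pandas as pd


def window_features(series, window_size, stride):
    # Hypothetical helper: slide a fixed-size window over the series
    # and summarise each full window with three statistics.
    values = series.to_numpy(dtype=float)
    rows = []
    for start in range(0, len(values) - window_size + 1, stride):
        window = values[start:start + window_size]
        rows.append({
            'mean_power': window.mean(),
            'peak_power': window.max(),
            'variance_power': window.var(),  # population variance (ddof=0)
        })
    return pd.DataFrame(rows)


# Toy series standing in for the 1 Hz power readings.
toy = pd.Series(np.arange(10, dtype=float), name='power')
toy_features = window_features(toy, window_size=4, stride=2)
print(toy_features)
```

With overlapping windows (a stride smaller than the window size), consecutive rows share samples, so neighbouring feature rows are correlated by construction.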
|  | mean_power | dominant_power | trend_strength_power | peak_power | variance_power |
|---|---|---|---|---|---|
| 0 | 601.708333 | 1.0 | 0.755769 | 1314.0 | 414014.519097 |
| 1 | 775.850000 | 1182.7 | 0.484017 | 1314.0 | 396195.760833 |
| 2 | 176.033333 | 1.0 | 0.355786 | 1303.0 | 191088.632222 |
| 3 | 380.816667 | 1.0 | 0.633966 | 1314.0 | 341707.616389 |
| 4 | 808.200000 | 1182.8 | 0.009364 | 1314.0 | 353516.293333 |
| ... | ... | ... | ... | ... | ... |
| 86394 | 901.433333 | 1090.9 | 0.004244 | 1212.0 | 246873.478889 |
| 86395 | 1003.950000 | 1090.9 | 0.407723 | 1212.0 | 186048.780833 |
| 86396 | 1193.233333 | 1189.5 | 0.002166 | 1201.0 | 43.512222 |
| 86397 | 828.750000 | 1081.0 | 0.217282 | 1201.0 | 247346.620833 |
| 86398 | 821.283333 | 1081.0 | 0.626586 | 1201.0 | 241983.030139 |

86399 rows × 5 columns
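
The extracted features live on very different scales: `variance_power` reaches the hundreds of thousands while `trend_strength_power` stays below 1. PCA is driven by variance, so the features are standardized first. The sketch below shows one common way to standardize and then pick the number of components from the cumulative explained variance. The synthetic data and the 0.95 threshold are assumptions for illustration, not values taken from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five extracted features, with deliberately
# mismatched scales (assumed values, not the radiator data).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5)) * np.array([500.0, 500.0, 0.3, 100.0, 1e5])

# Standardize each column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(features)

# Fit PCA with all components and accumulate the explained variance ratios.
pca = PCA().fit(scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95  # assumed cut-off
n_components = int(np.searchsorted(cumulative, threshold) + 1)
print(cumulative, n_components)
```

Plotting `cumulative` against the component index gives the usual explained-variance curve. The threshold is a judgment call and should be chosen for the task.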
|  | PC1 | PC2 | PC3 |
|---|---|---|---|
| mean_power | -0.629927 | 0.181104 | 0.174236 |
| dominant_power | -0.597785 | 0.165984 | 0.157005 |
| trend_strength_power | 0.008113 | -0.590912 | 0.800670 |
| peak_power | -0.492595 | -0.366028 | -0.351975 |
| variance_power | -0.055940 | -0.675646 | -0.424303 |
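
One way to read the loadings table is to ask which feature carries the largest absolute weight in each component. The sketch below copies the values from the table into a DataFrame (the variable names are illustrative) and reports the dominant feature per component. The sign of a principal component is arbitrary, so only magnitudes are compared.

```python
import numpy as np
import pandas as pd

# Loadings copied from the table above: rows are features, columns are components.
loadings = pd.DataFrame(
    {
        'PC1': [-0.629927, -0.597785, 0.008113, -0.492595, -0.055940],
        'PC2': [0.181104, 0.165984, -0.590912, -0.366028, -0.675646],
        'PC3': [0.174236, 0.157005, 0.800670, -0.351975, -0.424303],
    },
    index=[
        'mean_power',
        'dominant_power',
        'trend_strength_power',
        'peak_power',
        'variance_power',
    ],
)

# Feature with the largest absolute loading in each component.
dominant = loadings.abs().idxmax()

# Each component is a unit vector, so its squared loadings sum to about 1.
squared_norms = (loadings ** 2).sum()

print(dominant)
print(squared_norms.round(4))
```

On these numbers, PC1 is weighted mostly toward the power-level features (`mean_power`, `dominant_power`, `peak_power`), PC2 toward `variance_power` and `trend_strength_power`, and PC3 toward `trend_strength_power`.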
KMeans(n_clusters=3, random_state=42)
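
A common way to choose `n_clusters` is to fit K-means for several candidate values and compare inertia (the elbow heuristic) with the silhouette score. The sketch below does this on synthetic blobs. The data, the candidate range, and `n_init=10` are assumptions for illustration. It does not reproduce the selection procedure used in the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs (assumed data, not the PCA scores).
points, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

inertias = {}
silhouettes = {}
for k in range(2, 6):
    model = KMeans(n_clusters=k, random_state=42, n_init=10).fit(points)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances
    silhouettes[k] = silhouette_score(points, model.labels_)

# Candidate with the highest silhouette score.
best_k = max(silhouettes, key=silhouettes.get)
print(inertias)
print(silhouettes)
print(best_k)
```

Inertia generally decreases as `k` grows, so it is read for the point where the curve flattens rather than for its minimum. The silhouette score can be compared directly across values of `k`.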
KMeans(n_clusters=2, random_state=42)
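
DBSCAN, the second clustering method this tutorial names, needs no preset number of clusters. It groups points that lie in dense regions and labels isolated points as noise (`-1`). The sketch below is a minimal illustration on synthetic data. The `eps` and `min_samples` values are assumptions and would need tuning for the radiator features.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight synthetic blobs plus one far-away point (assumed data).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 2))
outlier = np.array([[20.0, 20.0]])
points = np.vstack([blob_a, blob_b, outlier])

# eps is the neighbourhood radius; min_samples is the density threshold.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)

# Noise points get the label -1 and are not counted as a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```

Because `eps` is a distance, DBSCAN is sensitive to feature scale, which is one more reason to cluster on standardized features or on PCA scores derived from them.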