{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Generating rules using the CN2 algorithm\n", "\n", "In this tutorial we will learn how to integrate decision rules with other rule induction\n", "packages. We will cover multiple topics such as:\n", "* creating decision-rules rulesets programiclly,\n", "* writing custom rule quality measures,\n", "* programmatic creation of decision rule sets.\n", "\n", "We will use the implementation of the CN2 algorithm provided by the [Orange](https://orange3.readthedocs.io/projects/orange-data-mining-library/en/latest/) package. In this tutorial\n", "we will use the popular [titanic](https://www.kaggle.com/c/titanic/data) dataset. Later, we will write a custom rule set factory class that will transform an instance of the Orange rule set to an instance of the `decision_rules.classification.ClassificationRuleSet` class. Finally, we will briefly introduce the various operations that can be performed using such an object. \n", "\n", "Custom classes and methods presented in tutorial are already implemented and available in the **decision-rules** package, but this tutorial will teach you how to implement them yourself. This knowledge will enable you to add your own custom factories, quality measures, or prediction strategies in the future if needed.\n", "\n", "We will start by loading the [titanic](https://www.kaggle.com/c/titanic/data) dataset. We \n", "will use the `submodule` of Orange data here." ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Column typeColumn nameData typePossible values
0featurestatusdiscrete(crew, first, second, third)
1featureagediscrete(adult, child)
2featuresexdiscrete(female, male)
3labelsurviveddiscrete(no, yes)
\n", "
" ], "text/plain": [ " Column type Column name Data type Possible values\n", "0 feature status discrete (crew, first, second, third)\n", "1 feature age discrete (adult, child)\n", "2 feature sex discrete (female, male)\n", "3 label survived discrete (no, yes)" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from Orange.data import Table\n", "\n", "titanic: Table = Table(\"titanic\")\n", "\n", "# print some information about columns\n", "pd.DataFrame(\n", " [\n", " {\n", " \"Column type\": (\n", " \"label\" if index == len(titanic.domain.attributes) else \"feature\"\n", " ),\n", " \"Column name\": attr.name,\n", " \"Data type\": \"discrete\" if attr.is_discrete else \"continuous\",\n", " \"Possible values\": attr.values,\n", " }\n", " for index, attr in enumerate(\n", " list(titanic.domain.attributes) + [titanic.domain.class_var]\n", " )\n", " ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that Orange models do not operate on padnas data frames, as is the case with decision_rules, \n", "but use a custom representation of `Orange.data.Table`. It automatically encodes nominal columns\n", "using floats. We need to convert the data to a pandas data frame in order to use it with the\n", "decision_rules." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
statusagesex
_o88041.00.01.0
_o88051.00.01.0
_o88061.00.01.0
_o88071.00.01.0
_o88081.00.01.0
............
_o110000.00.00.0
_o110010.00.00.0
_o110020.00.00.0
_o110030.00.00.0
_o110040.00.00.0
\n", "

2201 rows × 3 columns

\n", "
" ], "text/plain": [ " status age sex\n", "_o8804 1.0 0.0 1.0\n", "_o8805 1.0 0.0 1.0\n", "_o8806 1.0 0.0 1.0\n", "_o8807 1.0 0.0 1.0\n", "_o8808 1.0 0.0 1.0\n", "... ... ... ...\n", "_o11000 0.0 0.0 0.0\n", "_o11001 0.0 0.0 0.0\n", "_o11002 0.0 0.0 0.0\n", "_o11003 0.0 0.0 0.0\n", "_o11004 0.0 0.0 0.0\n", "\n", "[2201 rows x 3 columns]" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "X = pd.DataFrame(titanic.X_df)\n", "y = pd.Series(titanic.Y_df.values[:, 0])\n", "\n", "X" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that now all functions are of type float, although they are still discrete. \n", "\n", "Now we will continue to train the rule set using the CN2 algorithm. We will then print out the generated rules." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "IF sex==female AND status==first AND age!=adult THEN survived=yes \n", "IF sex==female AND status!=third AND age!=adult THEN survived=yes \n", "IF sex!=female AND status==second AND age!=adult THEN survived=yes \n", "IF sex==female AND status==first THEN survived=yes \n", "IF status!=third AND age!=adult THEN survived=yes \n", "IF sex!=female AND status==second THEN survived=no \n", "IF status==crew AND sex!=male THEN survived=yes \n", "IF status==second THEN survived=yes \n", "IF sex!=female AND status==third AND age!=child THEN survived=no \n", "IF status==crew THEN survived=no \n", "IF sex!=female AND status!=first THEN survived=no \n", "IF status==first THEN survived=no \n", "IF age!=adult THEN survived=no \n", "IF status==third THEN survived=no \n", "IF TRUE THEN survived=no \n" ] } ], "source": [ "from Orange.classification.rules import CN2Learner, CN2Classifier\n", "\n", "# train the ruleset using CN2 algorithm\n", "cn2_clasifier: CN2Classifier = CN2Learner()(titanic)\n", "\n", "for r in cn2_clasifier.rule_list:\n", 
" print(str(r))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use our ruleset from decision-rules, we need to convert it to decision-rules\n", "ruleset. Here we are dealing with a classification problem, so we need a `decision_rules.classification.ClassificationRuleSet` object.\n", "\n", "To convert CN2 classifiers into a `ClassificationRuleSet` object we will write a simple factory class." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from decision_rules.classification import (\n", " ClassificationConclusion,\n", " ClassificationRule,\n", " ClassificationRuleSet,\n", ")\n", "from decision_rules.core.condition import AbstractCondition\n", "from decision_rules.conditions import (\n", " CompoundCondition,\n", " ElementaryCondition,\n", " NominalCondition,\n", ")\n", "from Orange.classification.rules import CN2Classifier\n", "from Orange.classification.rules import Rule as CN2Rule\n", "from Orange.classification.rules import Selector\n", "\n", "\n", "class OrangeCN2RuleSetFactory:\n", "\n", " def make(\n", " self, model: CN2Classifier, X_train: pd.DataFrame\n", " ) -> ClassificationRuleSet:\n", " ruleset: ClassificationRuleSet = ClassificationRuleSet(\n", " rules=[\n", " self._make_rule(\n", " rule,\n", " column_names=list(X_train.columns),\n", " )\n", " for rule in model.rule_list[:-1]\n", " ]\n", " )\n", " # last rule is a default one\n", " ruleset.default_conclusion = self._make_rule_conclusion(model.rule_list[-1])\n", "\n", " return ruleset\n", "\n", " def _make_rule(\n", " self, cn2_rule: CN2Rule, column_names: list[str]\n", " ) -> ClassificationRule:\n", " return ClassificationRule(\n", " premise=self._make_rule_premise(cn2_rule),\n", " conclusion=self._make_rule_conclusion(cn2_rule),\n", " column_names=column_names,\n", " )\n", "\n", " def _make_rule_premise(self, cn2_rule: CN2Rule) -> CompoundCondition:\n", " return CompoundCondition(\n", " subconditions=[\n", " 
self._make_subcondition(selector) for selector in cn2_rule.selectors\n", " ]\n", " )\n", "\n", " def _make_subcondition(self, selector: Selector) -> AbstractCondition:\n", " # tiny wrapping function to return negated version of given condition\n", " def negated(condition: AbstractCondition) -> AbstractCondition:\n", " condition.negated = not condition.negated\n", " return condition\n", "\n", " # maps different selectors types for decision-rules conditions\n", " return {\n", " \"==\": lambda c_index, c_value: NominalCondition(\n", " column_index=c_index, value=c_value\n", " ),\n", " \"!=\": lambda c_index, c_value: negated(\n", " NominalCondition(column_index=c_index, value=c_value)\n", " ),\n", " \"<=\": lambda c_index, c_value: ElementaryCondition(\n", " column_index=c_index, right=float(c_value), right_closed=True\n", " ),\n", " \">=\": lambda c_index, c_value: ElementaryCondition(\n", " column_index=c_index, left=float(c_value), left_closed=True\n", " ),\n", " }[selector.op](selector.column, selector.value)\n", "\n", " def _make_rule_conclusion(self, cn2_rule: CN2Rule) -> ClassificationConclusion:\n", " return ClassificationConclusion(\n", " column_name=cn2_rule.domain.class_var.name, value=cn2_rule.prediction\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we will use our class with our ruleset." 
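The dictionary-based operator dispatch in `_make_subcondition` is worth a closer look. Below is a minimal, standalone sketch of the same pattern; the `Sel` tuple and the tuple-based "conditions" are hypothetical stand-ins used only for illustration, not decision-rules or Orange APIs:

```python
from collections import namedtuple

# stand-in for Orange's Selector: (column index, operator, value)
Sel = namedtuple("Sel", ["column", "op", "value"])


def make_condition(sel):
    # map operator strings to small constructor functions,
    # then immediately call the matching one
    return {
        "==": lambda c, v: ("nominal", c, v, False),
        "!=": lambda c, v: ("nominal", c, v, True),  # negated flag set
        "<=": lambda c, v: ("elementary", c, float(v), "right"),
        ">=": lambda c, v: ("elementary", c, float(v), "left"),
    }[sel.op](sel.column, sel.value)


print(make_condition(Sel(2, "!=", "male")))  # ('nominal', 2, 'male', True)
```

Compared with an `if`/`elif` chain, this keeps the operator-to-condition mapping in one place and raises a `KeyError` for any unsupported operator.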
] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "IF sex = {0.0} AND status = {1.0} AND age != {0.0} THEN survived = 1\n", "IF sex = {0.0} AND status != {3.0} AND age != {0.0} THEN survived = 1\n", "IF sex != {0.0} AND status = {2.0} AND age != {0.0} THEN survived = 1\n", "IF sex = {0.0} AND status = {1.0} THEN survived = 1\n", "IF status != {3.0} AND age != {0.0} THEN survived = 1\n", "IF sex != {0.0} AND status = {2.0} THEN survived = 0\n", "IF status = {0.0} AND sex != {1.0} THEN survived = 1\n", "IF status = {2.0} THEN survived = 1\n", "IF sex != {0.0} AND status = {3.0} AND age != {1.0} THEN survived = 0\n", "IF status = {0.0} THEN survived = 0\n", "IF sex != {0.0} AND status != {1.0} THEN survived = 0\n", "IF status = {1.0} THEN survived = 0\n", "IF age != {0.0} THEN survived = 0\n", "IF status = {3.0} THEN survived = 0\n", "Default rule: survived = 0\n" ] } ], "source": [ "ruleset: ClassificationRuleSet = OrangeCN2RuleSetFactory().make(cn2_clasifier, X)\n", "\n", "for rule in ruleset.rules:\n", " print(rule)\n", "print('Default rule: ', ruleset.default_conclusion)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the rules are semantically identical, even though their textual representation is different.\n", "\n", "Now we can check whether both models predict in the same way" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "ename": "InvalidStateError", "evalue": "Rule coverages have to be calculated before performing prediction.Did you forget to call update(...) 
method?", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mInvalidStateError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn[57], line 6\u001b[0m\n\u001b[0;32m 4\u001b[0m cn2_pred: np\u001b[38;5;241m.\u001b[39mndarray \u001b[38;5;241m=\u001b[39m np\u001b[38;5;241m.\u001b[39margmax(cn2_clasifier\u001b[38;5;241m.\u001b[39mpredict(titanic\u001b[38;5;241m.\u001b[39mX), axis\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m1\u001b[39m)\n\u001b[0;32m 5\u001b[0m \u001b[38;5;66;03m# decision-rules ruleset predictions\u001b[39;00m\n\u001b[1;32m----> 6\u001b[0m ruleset_pred: np\u001b[38;5;241m.\u001b[39mndarray \u001b[38;5;241m=\u001b[39m \u001b[43mruleset\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mpredict\u001b[49m\u001b[43m(\u001b[49m\u001b[43mX\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 8\u001b[0m np\u001b[38;5;241m.\u001b[39marray_equal(cn2_pred, ruleset_pred)\n", "File \u001b[1;32mc:\\Users\\cezar\\OneDrive\\Pulpit\\EMAG\\GIT\\decision-rules\\cn2\\..\\decision_rules\\core\\ruleset.py:340\u001b[0m, in \u001b[0;36mAbstractRuleSet.predict\u001b[1;34m(self, X)\u001b[0m\n\u001b[0;32m 338\u001b[0m X: np\u001b[38;5;241m.\u001b[39mndarray \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_sanitize_dataset(X)\n\u001b[0;32m 339\u001b[0m coverage_matrix: np\u001b[38;5;241m.\u001b[39mndarray \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mcalculate_coverage_matrix(X)\n\u001b[1;32m--> 340\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_validate_object_state_before_prediction\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 341\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mpredict_using_coverage_matrix(coverage_matrix)\n", "File 
\u001b[1;32mc:\\Users\\cezar\\OneDrive\\Pulpit\\EMAG\\GIT\\decision-rules\\cn2\\..\\decision_rules\\core\\ruleset.py:55\u001b[0m, in \u001b[0;36mAbstractRuleSet._validate_object_state_before_prediction\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 51\u001b[0m voting_weights_calculated: \u001b[38;5;28mbool\u001b[39m \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mall\u001b[39m(\n\u001b[0;32m 52\u001b[0m [rule\u001b[38;5;241m.\u001b[39mcoverage \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;28;01mfor\u001b[39;00m rule \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mrules]\n\u001b[0;32m 53\u001b[0m )\n\u001b[0;32m 54\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m voting_weights_calculated:\n\u001b[1;32m---> 55\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m InvalidStateError(\n\u001b[0;32m 56\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mRule coverages have to be calculated before performing prediction.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m 57\u001b[0m \u001b[38;5;241m+\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mDid you forget to call update(...) method?\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m 58\u001b[0m )\n", "\u001b[1;31mInvalidStateError\u001b[0m: Rule coverages have to be calculated before performing prediction.Did you forget to call update(...) method?" ] } ], "source": [ "import numpy as np\n", "\n", "# original model predicitons (we need to transform class probabilities to class labels)\n", "cn2_pred: np.ndarray = np.argmax(cn2_clasifier.predict(titanic.X), axis=1)\n", "# decision-rules ruleset predictions\n", "ruleset_pred: np.ndarray = ruleset.predict(X)\n", "\n", "np.array_equal(cn2_pred, ruleset_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the error message indicates, we forgot to call the update(...) method on the ruleset object. 
This method requires a\n", "dataset and a quality measure; the rule qualities it calculates are later used to resolve conflicts during prediction.\n", "\n", "The Orange CN2 algorithm uses the confidence quality measure to resolve conflicts during prediction.\n", "The **decision-rules** package already provides a built-in implementation of this measure, but in this section we will demonstrate how you can easily define your own custom measure. This is useful for extending the functionality or tailoring it to specific needs.\n", "Below is the equation for calculating confidence based on the confusion matrix.\n", "```text\n", "\n", " Confidence = TP / (TP + FP)\n", "\n", "```\n", "However, in decision-rules, quality is measured by a function taking a single parameter of type\n", "`decision_rules.core.coverage.Coverage`. A `Coverage` object describes the examples covered by a rule and has four fields: p, n, P and N.\n", "p is the number of positive examples covered by the rule, n is the number of negative examples covered by the rule,\n", "and P and N are the total numbers of positive and negative examples in the entire dataset.\n", "As we can see, `Coverage` objects are simply another representation of the confusion matrix:\n", "p corresponds to TP and n to FP.\n", "Knowing this, we can write our confidence quality measure as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from decision_rules.core.coverage import Coverage\n", "\n", "def confidence(c: Coverage) -> float:\n", "    return c.p / (c.p + c.n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use our quality function with the `update(...)` method of the rule set to calculate\n", "the quality of the rules."
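As a quick sanity check, we can evaluate the measure on a hand-made coverage. The `Cov` namedtuple below is a stand-in with the same four fields described above, so the snippet runs without the decision-rules package installed:

```python
from collections import namedtuple

# stand-in with the same four fields as decision-rules' Coverage
Cov = namedtuple("Cov", ["p", "n", "P", "N"])


def confidence(c):
    # TP / (TP + FP) expressed with coverage fields
    return c.p / (c.p + c.n)


# a rule covering 40 positives and 10 negatives has confidence 0.8
print(confidence(Cov(p=40, n=10, P=100, N=100)))  # 0.8
```

Note that P and N do not appear in the formula; confidence only looks at the examples the rule actually covers.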
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "_ = ruleset.update(X, y, measure=lambda c: (c.P + c.N) / (c.P + c.N + c.n + c.p))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's check again that both models predicts the same." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ruleset_pred: np.ndarray = ruleset.predict(X)\n", "\n", "np.array_equal(cn2_pred, ruleset_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is still a difference in the predictions. It is because both models are using\n", "different conflict resolution strategies. Decision-rules models use voting strategy by\n", "default. It means that when more than one rules from different classes covers the same example,\n", "they \"vote\" for their class. Each vote of the specific rule is multiplied by its quality (voting_weight). \n", "Finally, the class with the highest score is selected. However, this is not the way conflicts\n", "are resolved in the CN2 algorithm. Instead of conducting a vote, it selects the first rule from a set of rules that\n", "that covers the given examples and uses it to get the predicted value. \n", "\n", "Decision-rules package provides us with three prediction strategies: `vote` and `best_rule`, `first_rule` . The first one we've already\n", "discussed. The second one select the rule with the highest quality from all the rules covering\n", "given examples and use it to predict for this example. 
While the package includes a ready-to-use `FirstRuleCoveringStrategy` that does exactly what the original algorithm does, we will write it from scratch to demonstrate how prediction strategies can be easily implemented and extended.\n", "\n", "To implement our own prediction strategy we need to write a class inheriting from the `decision_rules.core.prediction.PredictionStrategy`\n", "class. It has to implement a single method, `_perform_prediction` (and may optionally override others). This method takes a single\n", "parameter called `voting_matrix`. This is a numpy array with as many rows as there are examples in the predicted dataset and\n", "as many columns as there are rules in our ruleset. For each row (example) and column (rule), it stores zero (if the rule\n", "does not cover the example) or the quality of the rule (if the rule covers the example). Such an array is very convenient for\n", "implementing various prediction strategies. Below we can see the finished implementation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from decision_rules.core.prediction import PredictionStrategy\n", "\n", "\n", "class FirstRuleCoveringStrategy(PredictionStrategy):\n", "\n", "    def _perform_prediction(self, voting_matrix: np.ndarray) -> np.ndarray:\n", "        coverage_matrix = self.coverage_matrix\n", "        predictions = np.array(\n", "            [\n", "                self.rules[i].conclusion.value\n", "                # numpy argmax returns the first occurrence of the maximum\n", "                for i in coverage_matrix.argmax(axis=1)\n", "            ]\n", "        )\n", "        # handle examples not covered by any rule using the default rule/conclusion\n", "        predictions[coverage_matrix.sum(axis=1) == 0] = self.default_conclusion.value\n", "        return predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we configure our rule set to use our custom prediction strategy. Then we check again\n", "whether the predictions of both models are the same."
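To see why `argmax` combined with the zero-coverage mask implements the first-covering-rule behaviour, here is a tiny standalone sketch on a toy coverage matrix (the matrix and the rule conclusions are made-up values, not taken from our titanic ruleset):

```python
import numpy as np

# toy coverage matrix: 3 examples (rows) x 2 rules (columns); 1 = rule covers example
coverage = np.array([
    [0, 1],  # only rule 1 covers -> its conclusion wins
    [1, 1],  # both cover -> argmax returns the FIRST covering rule (rule 0)
    [0, 0],  # no rule covers -> fall back to the default conclusion
])
conclusions = np.array(["yes", "no"])  # hypothetical rule conclusions
default = "no"

# argmax picks the first column holding the row maximum, i.e. the first covering rule
pred = conclusions[coverage.argmax(axis=1)]
# rows with no coverage also get argmax == 0, so they must be overridden explicitly
pred[coverage.sum(axis=1) == 0] = default
print(pred.tolist())  # ['no', 'yes', 'no']
```

The explicit mask in the last step is essential: without it, uncovered examples would silently receive the first rule's conclusion.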
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ruleset.set_prediction_strategy(FirstRuleCoveringStrategy)\n", "\n", "# original model predicitons (we need to transform class probabilities to class labels)\n", "cn2_pred: np.ndarray = np.argmax(cn2_clasifier.predict(titanic.X), axis=1)\n", "# decision-rules ruleset predictions\n", "ruleset_pred: np.ndarray = ruleset.predict(X)\n", "\n", "np.array_equal(cn2_pred, ruleset_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally we achieved the same predictions. \n", "\n", "This example showed us how the decision rule package can be used with different rule induction algorithms.\n", "Even if it does not implement all possible prediction mechanisms\n", "we can easily extend it to fit many different cases." ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.0" } }, "nbformat": 4, "nbformat_minor": 2 }