Generating rules using the CN2 algorithm

In this tutorial we will learn how to integrate decision-rules with other rule induction packages. We will cover topics such as: * creating decision-rules rule sets programmatically, * writing custom rule quality measures, * writing custom prediction strategies.

We will use the implementation of the CN2 algorithm provided by the Orange package, together with the popular Titanic dataset. Later, we will write a custom rule set factory class that transforms an Orange rule set into an instance of the decision_rules.classification.ClassificationRuleSet class. Finally, we will briefly introduce the various operations that can be performed on such an object.

Custom classes and methods presented in this tutorial are already implemented and available in the decision-rules package, but this tutorial will teach you how to implement them yourself. This knowledge will enable you to add your own custom factories, quality measures, or prediction strategies in the future if needed.

We will start by loading the Titanic dataset using Orange's data submodule.

[58]:
import pandas as pd
from Orange.data import Table

titanic: Table = Table("titanic")

# print some information about columns
pd.DataFrame(
    [
        {
            "Column type": (
                "label" if index == len(titanic.domain.attributes) else "feature"
            ),
            "Column name": attr.name,
            "Data type": "discrete" if attr.is_discrete else "continuous",
            "Possible values": attr.values,
        }
        for index, attr in enumerate(
            list(titanic.domain.attributes) + [titanic.domain.class_var]
        )
    ]
)
[58]:
Column type Column name Data type Possible values
0 feature status discrete (crew, first, second, third)
1 feature age discrete (adult, child)
2 feature sex discrete (female, male)
3 label survived discrete (no, yes)

Note that Orange models do not operate on pandas data frames, as decision_rules does; instead they use a custom Orange.data.Table representation, which automatically encodes nominal columns as floats. We need to convert the data to a pandas data frame in order to use it with decision_rules.
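To make this float encoding concrete, here is a minimal sketch (using plain pandas, independent of Orange) of how such codes map back to nominal labels. The column and value tuple below are hypothetical stand-ins for what Orange stores, e.g. in `titanic.domain["age"].values`:

```python
import pandas as pd

# hypothetical float-encoded column, as in Orange's internal representation:
# each float is an index into the attribute's tuple of possible values
codes = pd.Series([0.0, 0.0, 1.0, 0.0])
values = ("adult", "child")  # assumed value tuple for the "age" attribute

# map each float code back to its nominal label
labels = codes.map(lambda c: values[int(c)])
print(labels.tolist())  # → ['adult', 'adult', 'child', 'adult']
```

This is only an illustration of the encoding scheme, not code from either package.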

[53]:
import pandas as pd

X = pd.DataFrame(titanic.X_df)
y = pd.Series(titanic.Y_df.values[:, 0])

X
[53]:
status age sex
_o8804 1.0 0.0 1.0
_o8805 1.0 0.0 1.0
_o8806 1.0 0.0 1.0
_o8807 1.0 0.0 1.0
_o8808 1.0 0.0 1.0
... ... ... ...
_o11000 0.0 0.0 0.0
_o11001 0.0 0.0 0.0
_o11002 0.0 0.0 0.0
_o11003 0.0 0.0 0.0
_o11004 0.0 0.0 0.0

2201 rows × 3 columns

Notice that now all features are of type float, although they are still discrete.

Now we will train the rule set using the CN2 algorithm and print out the generated rules.

[59]:
from Orange.classification.rules import CN2Learner, CN2Classifier

# train the ruleset using CN2 algorithm
cn2_clasifier: CN2Classifier = CN2Learner()(titanic)

for r in cn2_clasifier.rule_list:
    print(str(r))
IF sex==female AND status==first AND age!=adult THEN survived=yes
IF sex==female AND status!=third AND age!=adult THEN survived=yes
IF sex!=female AND status==second AND age!=adult THEN survived=yes
IF sex==female AND status==first THEN survived=yes
IF status!=third AND age!=adult THEN survived=yes
IF sex!=female AND status==second THEN survived=no
IF status==crew AND sex!=male THEN survived=yes
IF status==second THEN survived=yes
IF sex!=female AND status==third AND age!=child THEN survived=no
IF status==crew THEN survived=no
IF sex!=female AND status!=first THEN survived=no
IF status==first THEN survived=no
IF age!=adult THEN survived=no
IF status==third THEN survived=no
IF TRUE THEN survived=no

To use our rule set with decision-rules, we need to convert it into a decision-rules rule set. Since we are dealing with a classification problem here, we need a decision_rules.classification.ClassificationRuleSet object.

To convert the CN2 classifier into a ClassificationRuleSet object, we will write a simple factory class.

[55]:
import pandas as pd
from decision_rules.classification import (
    ClassificationConclusion,
    ClassificationRule,
    ClassificationRuleSet,
)
from decision_rules.core.condition import AbstractCondition
from decision_rules.conditions import (
    CompoundCondition,
    ElementaryCondition,
    NominalCondition,
)
from Orange.classification.rules import CN2Classifier
from Orange.classification.rules import Rule as CN2Rule
from Orange.classification.rules import Selector


class OrangeCN2RuleSetFactory:

    def make(
        self, model: CN2Classifier, X_train: pd.DataFrame
    ) -> ClassificationRuleSet:
        ruleset: ClassificationRuleSet = ClassificationRuleSet(
            rules=[
                self._make_rule(
                    rule,
                    column_names=list(X_train.columns),
                )
                for rule in model.rule_list[:-1]
            ]
        )
        # last rule is a default one
        ruleset.default_conclusion = self._make_rule_conclusion(model.rule_list[-1])

        return ruleset

    def _make_rule(
        self, cn2_rule: CN2Rule, column_names: list[str]
    ) -> ClassificationRule:
        return ClassificationRule(
            premise=self._make_rule_premise(cn2_rule),
            conclusion=self._make_rule_conclusion(cn2_rule),
            column_names=column_names,
        )

    def _make_rule_premise(self, cn2_rule: CN2Rule) -> CompoundCondition:
        return CompoundCondition(
            subconditions=[
                self._make_subcondition(selector) for selector in cn2_rule.selectors
            ]
        )

    def _make_subcondition(self, selector: Selector) -> AbstractCondition:
        # tiny wrapping function to return negated version of given condition
        def negated(condition: AbstractCondition) -> AbstractCondition:
            condition.negated = not condition.negated
            return condition

        # maps different selectors types for decision-rules conditions
        return {
            "==": lambda c_index, c_value: NominalCondition(
                column_index=c_index, value=c_value
            ),
            "!=": lambda c_index, c_value: negated(
                NominalCondition(column_index=c_index, value=c_value)
            ),
            "<=": lambda c_index, c_value: ElementaryCondition(
                column_index=c_index, right=float(c_value), right_closed=True
            ),
            ">=": lambda c_index, c_value: ElementaryCondition(
                column_index=c_index, left=float(c_value), left_closed=True
            ),
        }[selector.op](selector.column, selector.value)

    def _make_rule_conclusion(self, cn2_rule: CN2Rule) -> ClassificationConclusion:
        return ClassificationConclusion(
            column_name=cn2_rule.domain.class_var.name, value=cn2_rule.prediction
        )

Now we will use our factory class on the trained classifier.

[56]:
ruleset: ClassificationRuleSet = OrangeCN2RuleSetFactory().make(cn2_clasifier, X)

for rule in ruleset.rules:
    print(rule)
print('Default rule: ', ruleset.default_conclusion)
IF sex = {0.0} AND status = {1.0} AND age != {0.0} THEN survived = 1
IF sex = {0.0} AND status != {3.0} AND age != {0.0} THEN survived = 1
IF sex != {0.0} AND status = {2.0} AND age != {0.0} THEN survived = 1
IF sex = {0.0} AND status = {1.0} THEN survived = 1
IF status != {3.0} AND age != {0.0} THEN survived = 1
IF sex != {0.0} AND status = {2.0} THEN survived = 0
IF status = {0.0} AND sex != {1.0} THEN survived = 1
IF status = {2.0} THEN survived = 1
IF sex != {0.0} AND status = {3.0} AND age != {1.0} THEN survived = 0
IF status = {0.0} THEN survived = 0
IF sex != {0.0} AND status != {1.0} THEN survived = 0
IF status = {1.0} THEN survived = 0
IF age != {0.0} THEN survived = 0
IF status = {3.0} THEN survived = 0
Default rule:  survived = 0

We can see that the rules are semantically identical, even though their textual representation is different.

Now we can check whether both models predict in the same way.

[57]:
import numpy as np

# original model predictions (we need to transform class probabilities to class labels)
cn2_pred: np.ndarray = np.argmax(cn2_clasifier.predict(titanic.X), axis=1)
# decision-rules ruleset predictions
ruleset_pred: np.ndarray = ruleset.predict(X)

np.array_equal(cn2_pred, ruleset_pred)
---------------------------------------------------------------------------
InvalidStateError                         Traceback (most recent call last)
Cell In[57], line 6
      4 cn2_pred: np.ndarray = np.argmax(cn2_clasifier.predict(titanic.X), axis=1)
      5 # decision-rules ruleset predictions
----> 6 ruleset_pred: np.ndarray = ruleset.predict(X)
      8 np.array_equal(cn2_pred, ruleset_pred)

File c:\Users\cezar\OneDrive\Pulpit\EMAG\GIT\decision-rules\cn2\..\decision_rules\core\ruleset.py:340, in AbstractRuleSet.predict(self, X)
    338 X: np.ndarray = self._sanitize_dataset(X)
    339 coverage_matrix: np.ndarray = self.calculate_coverage_matrix(X)
--> 340 self._validate_object_state_before_prediction()
    341 return self.predict_using_coverage_matrix(coverage_matrix)

File c:\Users\cezar\OneDrive\Pulpit\EMAG\GIT\decision-rules\cn2\..\decision_rules\core\ruleset.py:55, in AbstractRuleSet._validate_object_state_before_prediction(self)
     51 voting_weights_calculated: bool = all(
     52     [rule.coverage is not None for rule in self.rules]
     53 )
     54 if not voting_weights_calculated:
---> 55     raise InvalidStateError(
     56         "Rule coverages have to be calculated before performing prediction."
     57         + "Did you forget to call update(...) method?"
     58     )

InvalidStateError: Rule coverages have to be calculated before performing prediction.Did you forget to call update(...) method?

As the error message indicates, we forgot to call the update(…) method on the ruleset object. This method requires a dataset and a quality measure to calculate the qualities of the rules, which are later used to resolve conflicts during prediction.

The Orange CN2 algorithm uses a confidence quality measure to resolve conflicts during prediction. The decision-rules package already provides a built-in implementation of this measure, but in this section, we will demonstrate how you can easily define your own custom measure. This is useful for extending the functionality or tailoring it to specific needs. Below we can see the equation for calculating confidence based on the confusion matrix.

Confidence = TP / (TP + FP)

However, in decision-rules, a quality measure is a function taking a single parameter of type decision_rules.core.coverage.Coverage. A Coverage object contains information about the examples covered by a rule and has four fields: p, n, P and N. p is the number of positive examples covered by the rule (true positives), n is the number of negative examples covered by the rule (false positives), while P and N are the total numbers of positive and negative examples in the entire dataset. As we can see, Coverage objects are simply another representation of the confusion matrix. Knowing this, we can write our confidence quality measure function as follows:

[ ]:
from decision_rules.core.coverage import Coverage

def confidence(c: Coverage) -> float:
    # p = positive examples covered by the rule (TP), n = negative examples covered (FP)
    return c.p / (c.p + c.n)

Now we can use our quality function with the update(...) method of the rule set to calculate the quality of the rules.

[ ]:
_ = ruleset.update(X, y, measure=confidence)

Now let’s check again whether both models predict the same.

[ ]:
ruleset_pred: np.ndarray = ruleset.predict(X)

np.array_equal(cn2_pred, ruleset_pred)
False

There is still a difference in the predictions. This is because the two models use different conflict resolution strategies. Decision-rules models use a voting strategy by default: when more than one rule, from different classes, covers the same example, the rules “vote” for their classes. Each rule’s vote is weighted by its quality (voting_weight), and the class with the highest total score is selected. However, this is not how conflicts are resolved in the CN2 algorithm. Instead of conducting a vote, it selects the first rule that covers the given example and uses it to obtain the predicted value.
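The voting idea can be sketched with plain NumPy. This is a simplified illustration with made-up rule qualities and classes, not the actual decision-rules implementation:

```python
import numpy as np

# hypothetical setup: 3 rules with known qualities and predicted classes
rule_qualities = np.array([0.9, 0.6, 0.8])
rule_classes = np.array([1, 0, 0])  # class each rule concludes

# coverage matrix for 2 examples: 1 if the rule covers the example
coverage = np.array([
    [1, 1, 0],  # example 0: covered by rules 0 and 1
    [0, 1, 1],  # example 1: covered by rules 1 and 2
])

# each covering rule votes with a weight equal to its quality
votes = coverage * rule_qualities
predictions = []
for row in votes:
    score_class_0 = row[rule_classes == 0].sum()  # 0.6 / 1.4
    score_class_1 = row[rule_classes == 1].sum()  # 0.9 / 0.0
    predictions.append(1 if score_class_1 > score_class_0 else 0)
print(predictions)  # → [1, 0]
```

For example 0 the single class-1 rule (quality 0.9) outvotes the class-0 rule (0.6); for example 1 the two class-0 rules win together.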

The decision-rules package provides three prediction strategies: vote, best_rule, and first_rule. The first one we’ve already discussed. The second one selects the rule with the highest quality among all the rules covering a given example and uses it to make the prediction for that example. While the package includes a ready-to-use FirstRuleCoveringStrategy that does exactly what our original algorithm does, we will write it from scratch to demonstrate how prediction strategies can be easily implemented and extended.
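The best_rule idea can likewise be sketched in a few lines of NumPy (again a simplified illustration with made-up qualities, not the package's actual code): for each example, pick the covering rule with the highest quality and use its conclusion:

```python
import numpy as np

# voting matrix: zero where a rule does not cover the example,
# the rule's quality where it does (2 examples x 3 rules)
voting_matrix = np.array([
    [0.9, 0.6, 0.0],
    [0.0, 0.6, 0.8],
])
rule_classes = np.array([1, 0, 0])  # class each rule concludes

# argmax picks the covering rule with the highest quality per example
best_rule_idx = voting_matrix.argmax(axis=1)
predictions = rule_classes[best_rule_idx]
print(predictions.tolist())  # → [1, 0]
```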

To implement our own prediction strategy we need to write a class inheriting from the decision_rules.core.prediction.PredictionStrategy class. It has to implement a single method, _perform_prediction (and may optionally override others). This method takes a single parameter called voting_matrix: a numpy array with as many rows as there are examples in the predicted dataset and as many columns as there are rules in our rule set. For each row (example) and column (rule), it stores zero (if the rule does not cover the example) or the quality of the rule (if it does). Such an array is very convenient for implementing various prediction strategies. Below we can see the finished implementation.

[ ]:
from decision_rules.core.prediction import PredictionStrategy


class FirstRuleCoveringStrategy(PredictionStrategy):

    def _perform_prediction(self, voting_matrix: np.ndarray) -> np.ndarray:
        coverage_matrix = self.coverage_matrix
        predictions = np.array(
            [
                self.rules[i].conclusion.value
                # numpy argmax will return the first occurrence of the maximum
                for i in coverage_matrix.argmax(axis=1)
            ]
        )
        # we need to handle examples uncovered by any rule using default rule/conclusion
        predictions[coverage_matrix.sum(axis=1) == 0] = self.default_conclusion.value
        return predictions

Now we configure our rule set to use our custom prediction strategy. Then we check again if the predictions are the same for both models.

[ ]:
ruleset.set_prediction_strategy(FirstRuleCoveringStrategy)

# original model predictions (we need to transform class probabilities to class labels)
cn2_pred: np.ndarray = np.argmax(cn2_clasifier.predict(titanic.X), axis=1)
# decision-rules ruleset predictions
ruleset_pred: np.ndarray = ruleset.predict(X)

np.array_equal(cn2_pred, ruleset_pred)
True

Finally, we achieved the same predictions.

This example showed how the decision-rules package can be used with different rule induction algorithms. Even though it does not implement every possible prediction mechanism out of the box, we can easily extend it to fit many different cases.