Use rules in textual form

In this tutorial, we will load a set of classification rules in textual form and evaluate them.

Load and prepare dataset

We begin by loading the Titanic dataset into a DataFrame.

[4]:
import pandas as pd
TITANIC_PATH = (
    'https://raw.githubusercontent.com/ruleminer/decision-rules/'
    'refs/heads/docs/docs-src/source/tutorials/resources/titanic.csv'
)
titanic_df = pd.read_csv(TITANIC_PATH)
display(titanic_df)
print('Columns: ', titanic_df.columns.values)
print('Class names:', titanic_df['class'].unique())
X = titanic_df.drop("class", axis=1)
y = titanic_df["class"]
pclass age sex class
0 1st adult male yes
1 1st adult male yes
2 1st adult male yes
3 1st adult male yes
4 1st adult male yes
... ... ... ... ...
2196 crew adult female yes
2197 crew adult female yes
2198 crew adult female no
2199 crew adult female no
2200 crew adult female no

2201 rows × 4 columns

Columns:  ['pclass' 'age' 'sex' 'class']
Class names: ['yes' 'no']
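Before going further, it is worth checking the class balance. A minimal sketch using plain pandas (run here on a tiny in-memory sample with the same schema so it works without the download; the notebook itself uses the full 2201-row CSV):

```python
import pandas as pd

# Tiny in-memory sample with the same columns as titanic.csv
sample = pd.DataFrame({
    'pclass': ['1st', '3rd', 'crew', '3rd'],
    'age':    ['adult', 'adult', 'adult', 'child'],
    'sex':    ['male', 'female', 'male', 'male'],
    'class':  ['yes', 'no', 'no', 'yes'],
})

X = sample.drop('class', axis=1)
y = sample['class']

# Class balance as fractions of the dataset
print(y.value_counts(normalize=True).to_dict())
```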

Load the ruleset in textual form

Now we need to load the ruleset provided in a text file.

[5]:
import urllib.request

FILE_PATH: str = (
    'https://raw.githubusercontent.com/ruleminer/decision-rules/'
    'refs/heads/docs/docs-src/source/tutorials/resources/classification/'
    'text_ruleset.txt'
)

with urllib.request.urlopen(FILE_PATH) as response:
    text_rules_model = response.read().decode('utf-8').splitlines()


text_rules_model
[5]:
['IF sex = {male} AND age = {adult} THEN class = {no}',
 'IF sex = {male} AND pclass != {1st} THEN class = {no}',
 'IF sex = {female} THEN class = {yes}',
 'IF sex = {male} AND age = {adult} AND pclass != {1st} THEN class = {no}',
 'IF pclass != {3rd} AND sex = {female} THEN class = {yes}']
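The grammar of these lines is simple: a premise made of `attribute op {value}` conditions joined by `AND`, followed by a conclusion after `THEN`. The decision-rules factory parses them for us; the regex below is only a simplified illustration of the format, not the library's actual parser:

```python
import re

rule = 'IF sex = {male} AND age = {adult} THEN class = {no}'

# Strip the IF keyword, then separate premise from conclusion
premise, conclusion = rule[len('IF '):].split(' THEN ')

# Each condition has the shape "<attribute> <= or !=> {<value>}"
pattern = re.compile(r'(\w+)\s*(!?=)\s*\{([^}]+)\}')

print(pattern.findall(premise))     # [('sex', '=', 'male'), ('age', '=', 'adult')]
print(pattern.findall(conclusion))  # [('class', '=', 'no')]
```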

Convert the textual ruleset to a decision-rules model

Now that the rules are loaded, we convert them into a decision-rules model using the TextRuleSetFactory from the decision-rules library. This conversion enables us to evaluate and modify the ruleset programmatically.

[6]:
from decision_rules.ruleset_factories._factories.classification import TextRuleSetFactory

factory = TextRuleSetFactory()
ruleset = factory.make(text_rules_model, X, y)

After conversion, the decision-rules model can be easily displayed.

[7]:
for rule in ruleset.rules:
    print(rule)
IF sex = {male} AND age = {adult} THEN class = no (p=1329, n=338, P=1490, N=711)
IF sex = {male} AND pclass != {1st} THEN class = no (p=1246, n=305, P=1490, N=711)
IF sex = {female} THEN class = yes (p=344, n=126, P=711, N=1490)
IF sex = {male} AND age = {adult} AND pclass != {1st} THEN class = no (p=1211, n=281, P=1490, N=711)
IF pclass != {3rd} AND sex = {female} THEN class = yes (p=254, n=20, P=711, N=1490)
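Each printed rule ends with its coverage counts: p and n are the covered examples that do and do not have the conclusion class, while P and N are the totals of those classes in the dataset. The basic quality measures follow directly from these four numbers; a quick sketch using the counts of the first rule above:

```python
def precision(p, n):
    # Fraction of covered examples that actually have the conclusion class
    return p / (p + n)

def coverage(p, P):
    # Fraction of the conclusion class that the rule covers
    return p / P

# Counts of "IF sex = {male} AND age = {adult} THEN class = no"
p, n, P, N = 1329, 338, 1490, 711
print(round(precision(p, n), 3))  # 0.797
print(round(coverage(p, P), 3))   # 0.892
```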

Analyze the ruleset statistics

We can compute various metrics for the ruleset, such as average precision, coverage, and lift. This step involves retrieving statistical information about the rules.

We start by calculating and displaying the general characteristics of the ruleset.

[8]:
ruleset_stats = ruleset.calculate_ruleset_stats(X, y)

print(ruleset_stats)
{'rules_count': 5, 'avg_conditions_count': 2.0, 'avg_precision': 0.81, 'avg_coverage': 0.68, 'total_conditions_count': 10}

Now let’s calculate metrics for each rule. To make the output more readable and easier to interpret, we will organize the metrics into a DataFrame.

[9]:
rule_metrics = ruleset.calculate_rules_metrics(X, y)
print(rule_metrics)
rule_metrics_df = pd.DataFrame([
    {
        'Rule': f"r{i+1}",
        'p': metrics['p'],
        'n': metrics['n'],
        'P': metrics['P'],
        'N': metrics['N'],
        'Unique in Positive': metrics.get('unique_in_pos', 0),
        'Unique in Negative': metrics.get('unique_in_neg', 0),
        'P Unique': metrics.get('p_unique', 0),
        'N Unique': metrics.get('n_unique', 0),
        'All Unique': metrics.get('all_unique', 0),
        'Support': round(metrics.get('support', 0), 3),
        'Conditions Count': metrics.get('conditions_count', 0),
        'Precision': round(metrics.get('precision', 0), 3),
        'Coverage': round(metrics.get('coverage', 0), 3),
        'C2': round(metrics.get('C2', 0), 3),
        'RSS': round(metrics.get('RSS', 0), 3),
        'Correlation': round(metrics.get('correlation', 0), 3),
        'Lift': round(metrics.get('lift', 0), 3),
        'P Value': metrics.get('p_value', 0),
        'TP': metrics.get('TP', 0),
        'FP': metrics.get('FP', 0),
        'TN': metrics.get('TN', 0),
        'FN': metrics.get('FN', 0),
        'Sensitivity': round(metrics.get('sensitivity', 0), 3),
        'Specificity': round(metrics.get('specificity', 0), 3),
        'Negative Predictive Value': round(metrics.get('negative_predictive_value', 0), 3),
        'Odds Ratio': round(metrics.get('odds_ratio', 0), 3),
        'Relative Risk': round(metrics.get('relative_risk', 0), 3),
        'LR+': round(metrics.get('lr+', 0), 3),
        'LR-': round(metrics.get('lr-', 0), 3),
    }
    for i, (_, metrics) in enumerate(rule_metrics.items())
])
display(rule_metrics_df)
{'800057da-2fc6-436b-8058-ec6903015c6f': {'p': 1329, 'n': 338, 'P': 1490, 'N': 711, 'unique_in_pos': 118, 'unique_in_neg': 57, 'p_unique': 118, 'n_unique': 57, 'all_unique': 175, 'support': 0.7573830077237619, 'conditions_count': 2, 'precision': 0.7972405518896221, 'coverage': 0.8919463087248322, 'C2': 0.3522139513422039, 'RSS': 0.4165595295405846, 'correlation': 0.45442956675744167, 'lift': 1.1776687615497035, 'p_value': 2.627480562242127e-96, 'TP': 1329, 'FP': 338, 'TN': 373, 'FN': 161, 'sensitivity': 0.8919463087248322, 'specificity': 0.5246132208157525, 'negative_predictive_value': 0.6985018726591761, 'odds_ratio': 9.109263308770833, 'relative_risk': 2.6279410784509762, 'lr+': 1.876253921607561, 'lr-': 0.20596829623765237}, '96746bc9-7e93-4158-b7f3-39c8709cd355': {'p': 1246, 'n': 305, 'P': 1490, 'N': 711, 'unique_in_pos': 35, 'unique_in_neg': 24, 'p_unique': 35, 'n_unique': 24, 'all_unique': 59, 'support': 0.704679691049523, 'conditions_count': 2, 'precision': 0.8033526756931012, 'coverage': 0.836241610738255, 'C2': 0.35921539680977327, 'RSS': 0.40726833366371207, 'correlation': 0.4174899265648551, 'lift': 1.1866974759735005, 'p_value': 1.2499563232509209e-82, 'TP': 1246, 'FP': 305, 'TN': 406, 'FN': 244, 'sensitivity': 0.836241610738255, 'specificity': 0.5710267229254571, 'negative_predictive_value': 0.6246153846153846, 'odds_ratio': 6.797489955792048, 'relative_risk': 2.131343833471493, 'lr+': 1.9494025745406534, 'lr-': 0.28677885410123327}, 'b774df16-9114-4df0-a86d-586edda9d696': {'p': 344, 'n': 126, 'P': 711, 'N': 1490, 'unique_in_pos': 90, 'unique_in_neg': 106, 'p_unique': 90, 'n_unique': 106, 'all_unique': 196, 'support': 0.21353930031803725, 'conditions_count': 1, 'precision': 0.7319148936170212, 'coverage': 0.4838255977496484, 'C2': 0.4481077026863914, 'RSS': 0.3992618393603866, 'correlation': 0.45560478314893393, 'lift': 2.265744980099949, 'p_value': 2.6906937468626293e-96, 'TP': 344, 'FP': 126, 'TN': 1364, 'FN': 367, 'sensitivity': 0.4838255977496484, 
'specificity': 0.9154362416107382, 'negative_predictive_value': 0.7879838243789717, 'odds_ratio': 10.146746534610644, 'relative_risk': 3.4427844588344123, 'lr+': 5.721429687674411, 'lr-': 0.5638562018717184}, '44f9bfbc-1095-46be-b726-333cba0bed6f': {'p': 1211, 'n': 281, 'P': 1490, 'N': 711, 'unique_in_pos': 0, 'unique_in_neg': 0, 'p_unique': 0, 'n_unique': 0, 'all_unique': 0, 'support': 0.6778736937755566, 'conditions_count': 3, 'precision': 0.811662198391421, 'coverage': 0.812751677852349, 'C2': 0.3779351395045058, 'RSS': 0.4175336750394095, 'correlation': 0.41784182864340963, 'lift': 1.1989721467513539, 'p_value': 2.251021760405628e-83, 'TP': 1211, 'FP': 281, 'TN': 430, 'FN': 279, 'sensitivity': 0.812751677852349, 'specificity': 0.6047819971870605, 'negative_predictive_value': 0.6064880112834978, 'odds_ratio': 6.641964285714286, 'relative_risk': 2.055244638069705, 'lr+': 2.056464209797225, 'lr-': 0.3096129233650694}, '082dbf24-8be4-47f6-a228-808c1c69b1a9': {'p': 254, 'n': 20, 'P': 711, 'N': 1490, 'unique_in_pos': 0, 'unique_in_neg': 0, 'p_unique': 0, 'n_unique': 0, 'all_unique': 0, 'support': 0.12448886869604725, 'conditions_count': 2, 'precision': 0.927007299270073, 'coverage': 0.35724331926863573, 'C2': 0.6054503338686228, 'RSS': 0.34382050047668944, 'correlation': 0.48701637522934677, 'lift': 2.8696808237600995, 'p_value': 1.5489192105023278e-113, 'TP': 254, 'FP': 20, 'TN': 1470, 'FN': 457, 'sensitivity': 0.35724331926863573, 'specificity': 0.9865771812080537, 'negative_predictive_value': 0.7628437986507525, 'odds_ratio': 40.84673449294388, 'relative_risk': 3.9003123705096736, 'lr+': 26.6146272855134, 'lr-': 0.6515016695848522}}
Rule p n P N Unique in Positive Unique in Negative P Unique N Unique All Unique ... FP TN FN Sensitivity Specificity Negative Predictive Value Odds Ratio Relative Risk LR+ LR-
0 r1 1329 338 1490 711 118 57 118 57 175 ... 338 373 161 0.892 0.525 0.699 9.109 2.628 1.876 0.206
1 r2 1246 305 1490 711 35 24 35 24 59 ... 305 406 244 0.836 0.571 0.625 6.797 2.131 1.949 0.287
2 r3 344 126 711 1490 90 106 90 106 196 ... 126 1364 367 0.484 0.915 0.788 10.147 3.443 5.721 0.564
3 r4 1211 281 1490 711 0 0 0 0 0 ... 281 430 279 0.813 0.605 0.606 6.642 2.055 2.056 0.310
4 r5 254 20 711 1490 0 0 0 0 0 ... 20 1470 457 0.357 0.987 0.763 40.847 3.900 26.615 0.652

5 rows × 30 columns
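Most of the columns above can likewise be derived from the p/n/P/N counts. For instance, lift compares the rule's precision with the prior probability of the predicted class, and the odds ratio is computed from the rule's confusion matrix. A sketch (not the library's implementation) reproducing the values of r1:

```python
def lift(p, n, P, N):
    # Precision relative to the prior probability of the predicted class
    return (p / (p + n)) / (P / (P + N))

def odds_ratio(p, n, P, N):
    # (TP * TN) / (FP * FN) from the rule's confusion matrix
    tp, fp, fn, tn = p, n, P - p, N - n
    return (tp * tn) / (fp * fn)

# Counts of r1 from the table above
print(round(lift(1329, 338, 1490, 711), 3))        # 1.178
print(round(odds_ratio(1329, 338, 1490, 711), 3))  # 9.109
```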

We can also calculate statistics like condition importances.

[10]:
from decision_rules.measures import c2
condition_importances = ruleset.calculate_condition_importances(X, y, measure=c2)
condition_importances
[10]:
{'no': [{'condition': 'sex = {male}',
   'attributes': ['sex'],
   'importance': 0.7637259326667143},
  {'condition': 'pclass != {1st}',
   'attributes': ['pclass'],
   'importance': 0.15287127806103967},
  {'condition': 'age = {adult}',
   'attributes': ['age'],
   'importance': 0.04417532222905417}],
 'yes': [{'condition': 'sex = {female}',
   'attributes': ['sex'],
   'importance': 0.9532496919435065},
  {'condition': 'pclass != {3rd}',
   'attributes': ['pclass'],
   'importance': 0.10030834461150767}]}
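The importances are returned as a dictionary keyed by class. If a flat, sortable view is more convenient, the nested structure can be reshaped into a DataFrame; a small sketch using abbreviated values copied from the output above:

```python
import pandas as pd

# Abbreviated copy of the calculate_condition_importances output above
condition_importances = {
    'no': [
        {'condition': 'sex = {male}', 'attributes': ['sex'], 'importance': 0.7637},
        {'condition': 'pclass != {1st}', 'attributes': ['pclass'], 'importance': 0.1529},
        {'condition': 'age = {adult}', 'attributes': ['age'], 'importance': 0.0442},
    ],
    'yes': [
        {'condition': 'sex = {female}', 'attributes': ['sex'], 'importance': 0.9532},
        {'condition': 'pclass != {3rd}', 'attributes': ['pclass'], 'importance': 0.1003},
    ],
}

# Flatten to one row per (class, condition) pair, sorted by importance
rows = [
    {'class': cls, 'condition': item['condition'], 'importance': item['importance']}
    for cls, items in condition_importances.items()
    for item in items
]
importances_df = pd.DataFrame(rows).sort_values('importance', ascending=False)
print(importances_df.to_string(index=False))
```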

Modify the ruleset

The decision-rules model can be easily edited. For example, we will create a new rule stating “IF age = {child} THEN class = yes” and then add it to the ruleset.

[11]:
from decision_rules.classification.rule import ClassificationConclusion
from decision_rules.classification.rule import ClassificationRule
from decision_rules.conditions import NominalCondition, CompoundCondition

rule = ClassificationRule(
    premise=CompoundCondition(
        subconditions=[
            NominalCondition(
                column_index=X.columns.get_loc('age'),
                value = "child"
            )
        ]
    ),
    conclusion=ClassificationConclusion(
        value='yes',
        column_name='class',
    ),
    column_names=X.columns,
)
print(rule)
IF age = {child} THEN class = yes
[12]:
rule.coverage = rule.calculate_coverage(X.to_numpy(), y.to_numpy())
print(rule.coverage)
(p=57, n=52, P=711, N=1490)
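Conceptually, the coverage of a rule is just a pair of counts over boolean masks: covered examples with the conclusion class (p) and without it (n), plus the class totals P and N. The same numbers can be reproduced with plain pandas; a minimal sketch on a tiny hand-made sample (not the Titanic data, so the counts differ from the output above):

```python
import pandas as pd

df = pd.DataFrame({
    'age':   ['child', 'child', 'adult', 'adult', 'child'],
    'class': ['yes',   'no',    'yes',   'no',    'yes'],
})

covered = df['age'] == 'child'       # premise: age = {child}
positive = df['class'] == 'yes'      # conclusion: class = {yes}

p = int((covered & positive).sum())  # covered and correctly classified
n = int((covered & ~positive).sum()) # covered but of the other class
P = int(positive.sum())              # all examples of the conclusion class
N = int((~positive).sum())           # all remaining examples
print(p, n, P, N)  # 2 1 3 2
```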
[13]:
ruleset.rules.append(rule)

print("Updated Ruleset:")
for rule in ruleset.rules:
    print(rule)
Updated Ruleset:
IF sex = {male} AND age = {adult} THEN class = no (p=1329, n=338, P=1490, N=711)
IF sex = {male} AND pclass != {1st} THEN class = no (p=1246, n=305, P=1490, N=711)
IF sex = {female} THEN class = yes (p=344, n=126, P=711, N=1490)
IF sex = {male} AND age = {adult} AND pclass != {1st} THEN class = no (p=1211, n=281, P=1490, N=711)
IF pclass != {3rd} AND sex = {female} THEN class = yes (p=254, n=20, P=711, N=1490)
IF age = {child} THEN class = yes (p=57, n=52, P=711, N=1490)

Now let’s remove the condition “pclass != {1st}” from the rule “IF sex = {male} AND pclass != {1st} THEN class = no”.

[14]:
condition_to_remove = ruleset.rules[1].premise.subconditions[1]
ruleset.rules[1].premise.subconditions.remove(condition_to_remove)
ruleset.rules[1].coverage = ruleset.rules[1].calculate_coverage(X.to_numpy(), y.to_numpy())

print("Updated Ruleset:")
for rule in ruleset.rules:
    print(rule)
Updated Ruleset:
IF sex = {male} AND age = {adult} THEN class = no (p=1329, n=338, P=1490, N=711)
IF sex = {male} THEN class = no (p=1364, n=367, P=1490, N=711)
IF sex = {female} THEN class = yes (p=344, n=126, P=711, N=1490)
IF sex = {male} AND age = {adult} AND pclass != {1st} THEN class = no (p=1211, n=281, P=1490, N=711)
IF pclass != {3rd} AND sex = {female} THEN class = yes (p=254, n=20, P=711, N=1490)
IF age = {child} THEN class = yes (p=57, n=52, P=711, N=1490)

We can also modify the value of a condition. In the rule “IF sex = {male} AND age = {adult} AND pclass != {1st} THEN class = no” we will update the condition “pclass != {1st}” to “pclass = {3rd}”.

[15]:
ruleset.rules[3].premise.subconditions[2].value = "3rd"
ruleset.rules[3].premise.subconditions[2].negated = False
ruleset.rules[3].coverage = ruleset.rules[3].calculate_coverage(X.to_numpy(), y.to_numpy())

print("Updated Ruleset:")
for rule in ruleset.rules:
    print(rule)
Updated Ruleset:
IF sex = {male} AND age = {adult} THEN class = no (p=1329, n=338, P=1490, N=711)
IF sex = {male} THEN class = no (p=1364, n=367, P=1490, N=711)
IF sex = {female} THEN class = yes (p=344, n=126, P=711, N=1490)
IF sex = {male} AND age = {adult} AND pclass = {3rd} THEN class = no (p=387, n=75, P=1490, N=711)
IF pclass != {3rd} AND sex = {female} THEN class = yes (p=254, n=20, P=711, N=1490)
IF age = {child} THEN class = yes (p=57, n=52, P=711, N=1490)