Use rules in textual form
In this tutorial, we will load a set of classification rules written in textual form and evaluate them on the Titanic dataset.
Load and prepare dataset
We begin by loading the Titanic dataset into a DataFrame.
[4]:
import pandas as pd
TITANIC_PATH = (
'https://raw.githubusercontent.com/ruleminer/decision-rules/'
'refs/heads/docs/docs-src/source/tutorials/resources/titanic.csv'
)
titanic_df = pd.read_csv(TITANIC_PATH)
display(titanic_df)
print('Columns: ', titanic_df.columns.values)
print('Class names:', titanic_df['class'].unique())
X = titanic_df.drop("class", axis=1)
y = titanic_df["class"]
|  | pclass | age | sex | class |
| --- | --- | --- | --- | --- |
| 0 | 1st | adult | male | yes |
| 1 | 1st | adult | male | yes |
| 2 | 1st | adult | male | yes |
| 3 | 1st | adult | male | yes |
| 4 | 1st | adult | male | yes |
| ... | ... | ... | ... | ... |
| 2196 | crew | adult | female | yes |
| 2197 | crew | adult | female | yes |
| 2198 | crew | adult | female | no |
| 2199 | crew | adult | female | no |
| 2200 | crew | adult | female | no |
2201 rows × 4 columns
Columns: ['pclass' 'age' 'sex' 'class']
Class names: ['yes' 'no']
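Before moving on, it can also be useful to inspect how the two classes are distributed. A minimal extra check using plain pandas (the expected counts follow from the P/N values reported later in this tutorial):
# Inspect the class balance of the target column
print(y.value_counts())  # expected: no -> 1490, yes -> 711 (2201 rows in total)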
Load the ruleset in textual form
Now we load the ruleset, which is provided in a text file.
[5]:
import urllib.request
FILE_PATH: str = (
'https://raw.githubusercontent.com/ruleminer/decision-rules/'
'refs/heads/docs/docs-src/source/tutorials/resources/classification/'
'text_ruleset.txt'
)
with urllib.request.urlopen(FILE_PATH) as response:
text_rules_model = response.read().decode('utf-8').splitlines()
text_rules_model
[5]:
['IF sex = {male} AND age = {adult} THEN class = {no}',
'IF sex = {male} AND pclass != {1st} THEN class = {no}',
'IF sex = {female} THEN class = {yes}',
'IF sex = {male} AND age = {adult} AND pclass != {1st} THEN class = {no}',
'IF pclass != {3rd} AND sex = {female} THEN class = {yes}']
Convert the textual ruleset to a decision-rules model
Now that the rules are loaded, we convert them into a decision-rules model using the TextRuleSetFactory class from the decision-rules library. This conversion enables us to evaluate and modify the ruleset programmatically.
[6]:
from decision_rules.ruleset_factories._factories.classification import TextRuleSetFactory
factory = TextRuleSetFactory()
ruleset = factory.make(text_rules_model, X, y)
After the conversion, we can easily display the model.
[7]:
for rule in ruleset.rules:
print(rule)
IF sex = {male} AND age = {adult} THEN class = no (p=1329, n=338, P=1490, N=711)
IF sex = {male} AND pclass != {1st} THEN class = no (p=1246, n=305, P=1490, N=711)
IF sex = {female} THEN class = yes (p=344, n=126, P=711, N=1490)
IF sex = {male} AND age = {adult} AND pclass != {1st} THEN class = no (p=1211, n=281, P=1490, N=711)
IF pclass != {3rd} AND sex = {female} THEN class = yes (p=254, n=20, P=711, N=1490)
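Here p and n denote the numbers of positive and negative examples covered by a rule, while P and N are the total numbers of positive and negative examples of that rule's decision class. As a quick illustration, the precision and coverage of the first rule can be recomputed by hand from these counts (a minimal sketch, not using the library):
# Counts of the first rule: p=1329, n=338, P=1490, N=711
p, n, P, N = 1329, 338, 1490, 711
precision = p / (p + n)  # share of covered examples that belong to the decision class
coverage = p / P         # share of the decision class covered by the rule
print(round(precision, 3), round(coverage, 3))  # 0.797 0.892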
Analyze the ruleset statistics
We can compute various metrics for the ruleset and for its individual rules, such as precision, coverage, and lift.
We start by calculating and displaying the general characteristics of the whole ruleset.
[8]:
ruleset_stats = ruleset.calculate_ruleset_stats(X, y)
print(ruleset_stats)
{'rules_count': 5, 'avg_conditions_count': 2.0, 'avg_precision': 0.81, 'avg_coverage': 0.68, 'total_conditions_count': 10}
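The avg_precision and avg_coverage values correspond to unweighted means over the individual rules, which we can verify against the p/n/P counts printed above (a small sanity-check sketch):
# Recompute the averages from the per-rule (p, n, P) counts shown earlier
counts = [(1329, 338, 1490), (1246, 305, 1490), (344, 126, 711), (1211, 281, 1490), (254, 20, 711)]
avg_precision = sum(p / (p + n) for p, n, P in counts) / len(counts)
avg_coverage = sum(p / P for p, n, P in counts) / len(counts)
print(round(avg_precision, 2), round(avg_coverage, 2))  # 0.81 0.68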
Now let’s calculate metrics for each rule. To make the output more readable and easier to interpret, we will organize the metrics into a DataFrame.
[9]:
rule_metrics = ruleset.calculate_rules_metrics(X, y)
print(rule_metrics)
rule_metrics_df = pd.DataFrame([
{
'Rule': f"r{i+1}",
'p': metrics['p'],
'n': metrics['n'],
'P': metrics['P'],
'N': metrics['N'],
'Unique in Positive': metrics.get('unique_in_pos', 0),
'Unique in Negative': metrics.get('unique_in_neg', 0),
'P Unique': metrics.get('p_unique', 0),
'N Unique': metrics.get('n_unique', 0),
'All Unique': metrics.get('all_unique', 0),
'Support': round(metrics.get('support', 0), 3),
'Conditions Count': metrics.get('conditions_count', 0),
'Precision': round(metrics.get('precision', 0), 3),
'Coverage': round(metrics.get('coverage', 0), 3),
'C2': round(metrics.get('C2', 0), 3),
'RSS': round(metrics.get('RSS', 0), 3),
'Correlation': round(metrics.get('correlation', 0), 3),
'Lift': round(metrics.get('lift', 0), 3),
'P Value': metrics.get('p_value', 0),
'TP': metrics.get('TP', 0),
'FP': metrics.get('FP', 0),
'TN': metrics.get('TN', 0),
'FN': metrics.get('FN', 0),
'Sensitivity': round(metrics.get('sensitivity', 0), 3),
'Specificity': round(metrics.get('specificity', 0), 3),
'Negative Predictive Value': round(metrics.get('negative_predictive_value', 0), 3),
'Odds Ratio': round(metrics.get('odds_ratio', 0), 3),
'Relative Risk': round(metrics.get('relative_risk', 0), 3),
'LR+': round(metrics.get('lr+', 0), 3),
'LR-': round(metrics.get('lr-', 0), 3),
}
for i, (_, metrics) in enumerate(rule_metrics.items())
])
display(rule_metrics_df)
{'800057da-2fc6-436b-8058-ec6903015c6f': {'p': 1329, 'n': 338, 'P': 1490, 'N': 711, 'unique_in_pos': 118, 'unique_in_neg': 57, 'p_unique': 118, 'n_unique': 57, 'all_unique': 175, 'support': 0.7573830077237619, 'conditions_count': 2, 'precision': 0.7972405518896221, 'coverage': 0.8919463087248322, 'C2': 0.3522139513422039, 'RSS': 0.4165595295405846, 'correlation': 0.45442956675744167, 'lift': 1.1776687615497035, 'p_value': 2.627480562242127e-96, 'TP': 1329, 'FP': 338, 'TN': 373, 'FN': 161, 'sensitivity': 0.8919463087248322, 'specificity': 0.5246132208157525, 'negative_predictive_value': 0.6985018726591761, 'odds_ratio': 9.109263308770833, 'relative_risk': 2.6279410784509762, 'lr+': 1.876253921607561, 'lr-': 0.20596829623765237},
'96746bc9-7e93-4158-b7f3-39c8709cd355': {'p': 1246, 'n': 305, 'P': 1490, 'N': 711, 'unique_in_pos': 35, 'unique_in_neg': 24, 'p_unique': 35, 'n_unique': 24, 'all_unique': 59, 'support': 0.704679691049523, 'conditions_count': 2, 'precision': 0.8033526756931012, 'coverage': 0.836241610738255, 'C2': 0.35921539680977327, 'RSS': 0.40726833366371207, 'correlation': 0.4174899265648551, 'lift': 1.1866974759735005, 'p_value': 1.2499563232509209e-82, 'TP': 1246, 'FP': 305, 'TN': 406, 'FN': 244, 'sensitivity': 0.836241610738255, 'specificity': 0.5710267229254571, 'negative_predictive_value': 0.6246153846153846, 'odds_ratio': 6.797489955792048, 'relative_risk': 2.131343833471493, 'lr+': 1.9494025745406534, 'lr-': 0.28677885410123327},
'b774df16-9114-4df0-a86d-586edda9d696': {'p': 344, 'n': 126, 'P': 711, 'N': 1490, 'unique_in_pos': 90, 'unique_in_neg': 106, 'p_unique': 90, 'n_unique': 106, 'all_unique': 196, 'support': 0.21353930031803725, 'conditions_count': 1, 'precision': 0.7319148936170212, 'coverage': 0.4838255977496484, 'C2': 0.4481077026863914, 'RSS': 0.3992618393603866, 'correlation': 0.45560478314893393, 'lift': 2.265744980099949, 'p_value': 2.6906937468626293e-96, 'TP': 344, 'FP': 126, 'TN': 1364, 'FN': 367, 'sensitivity': 0.4838255977496484, 'specificity': 0.9154362416107382, 'negative_predictive_value': 0.7879838243789717, 'odds_ratio': 10.146746534610644, 'relative_risk': 3.4427844588344123, 'lr+': 5.721429687674411, 'lr-': 0.5638562018717184},
'44f9bfbc-1095-46be-b726-333cba0bed6f': {'p': 1211, 'n': 281, 'P': 1490, 'N': 711, 'unique_in_pos': 0, 'unique_in_neg': 0, 'p_unique': 0, 'n_unique': 0, 'all_unique': 0, 'support': 0.6778736937755566, 'conditions_count': 3, 'precision': 0.811662198391421, 'coverage': 0.812751677852349, 'C2': 0.3779351395045058, 'RSS': 0.4175336750394095, 'correlation': 0.41784182864340963, 'lift': 1.1989721467513539, 'p_value': 2.251021760405628e-83, 'TP': 1211, 'FP': 281, 'TN': 430, 'FN': 279, 'sensitivity': 0.812751677852349, 'specificity': 0.6047819971870605, 'negative_predictive_value': 0.6064880112834978, 'odds_ratio': 6.641964285714286, 'relative_risk': 2.055244638069705, 'lr+': 2.056464209797225, 'lr-': 0.3096129233650694},
'082dbf24-8be4-47f6-a228-808c1c69b1a9': {'p': 254, 'n': 20, 'P': 711, 'N': 1490, 'unique_in_pos': 0, 'unique_in_neg': 0, 'p_unique': 0, 'n_unique': 0, 'all_unique': 0, 'support': 0.12448886869604725, 'conditions_count': 2, 'precision': 0.927007299270073, 'coverage': 0.35724331926863573, 'C2': 0.6054503338686228, 'RSS': 0.34382050047668944, 'correlation': 0.48701637522934677, 'lift': 2.8696808237600995, 'p_value': 1.5489192105023278e-113, 'TP': 254, 'FP': 20, 'TN': 1470, 'FN': 457, 'sensitivity': 0.35724331926863573, 'specificity': 0.9865771812080537, 'negative_predictive_value': 0.7628437986507525, 'odds_ratio': 40.84673449294388, 'relative_risk': 3.9003123705096736, 'lr+': 26.6146272855134, 'lr-': 0.6515016695848522}}
|  | Rule | p | n | P | N | Unique in Positive | Unique in Negative | P Unique | N Unique | All Unique | ... | FP | TN | FN | Sensitivity | Specificity | Negative Predictive Value | Odds Ratio | Relative Risk | LR+ | LR- |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | r1 | 1329 | 338 | 1490 | 711 | 118 | 57 | 118 | 57 | 175 | ... | 338 | 373 | 161 | 0.892 | 0.525 | 0.699 | 9.109 | 2.628 | 1.876 | 0.206 |
| 1 | r2 | 1246 | 305 | 1490 | 711 | 35 | 24 | 35 | 24 | 59 | ... | 305 | 406 | 244 | 0.836 | 0.571 | 0.625 | 6.797 | 2.131 | 1.949 | 0.287 |
| 2 | r3 | 344 | 126 | 711 | 1490 | 90 | 106 | 90 | 106 | 196 | ... | 126 | 1364 | 367 | 0.484 | 0.915 | 0.788 | 10.147 | 3.443 | 5.721 | 0.564 |
| 3 | r4 | 1211 | 281 | 1490 | 711 | 0 | 0 | 0 | 0 | 0 | ... | 281 | 430 | 279 | 0.813 | 0.605 | 0.606 | 6.642 | 2.055 | 2.056 | 0.310 |
| 4 | r5 | 254 | 20 | 711 | 1490 | 0 | 0 | 0 | 0 | 0 | ... | 20 | 1470 | 457 | 0.357 | 0.987 | 0.763 | 40.847 | 3.900 | 26.615 | 0.652 |
5 rows × 30 columns
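With the metrics collected in a DataFrame, standard pandas operations can be used to rank or filter the rules. For example (a small follow-up sketch):
# Rank the rules by precision
print(rule_metrics_df.sort_values('Precision', ascending=False)[['Rule', 'Precision', 'Coverage', 'Lift']])
# Keep only highly specific rules
print(rule_metrics_df[rule_metrics_df['Specificity'] > 0.9][['Rule', 'Sensitivity', 'Specificity']])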
We can also calculate statistics such as condition importances.
[10]:
from decision_rules.measures import c2
condition_importances = ruleset.calculate_condition_importances(X, y, measure=c2)
condition_importances
[10]:
{'no': [{'condition': 'sex = {male}',
'attributes': ['sex'],
'importance': 0.7637259326667143},
{'condition': 'pclass != {1st}',
'attributes': ['pclass'],
'importance': 0.15287127806103967},
{'condition': 'age = {adult}',
'attributes': ['age'],
'importance': 0.04417532222905417}],
'yes': [{'condition': 'sex = {female}',
'attributes': ['sex'],
'importance': 0.9532496919435065},
{'condition': 'pclass != {3rd}',
'attributes': ['pclass'],
'importance': 0.10030834461150767}]}
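The returned structure is a per-class dictionary; for easier inspection it can be flattened into a DataFrame (a small convenience sketch using plain pandas):
# Flatten the per-class condition importances into a single table
importances_df = pd.DataFrame([
    {'class': class_name, 'condition': item['condition'], 'importance': item['importance']}
    for class_name, items in condition_importances.items()
    for item in items
])
display(importances_df.sort_values('importance', ascending=False))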
Modify the ruleset
The decision-rules model can be easily edited. For example, we will create a new rule, “IF age = child THEN class = yes”, and then add it to the ruleset.
[11]:
from decision_rules.classification.rule import ClassificationConclusion
from decision_rules.classification.rule import ClassificationRule
from decision_rules.conditions import NominalCondition, CompoundCondition
rule = ClassificationRule(
premise=CompoundCondition(
subconditions=[
NominalCondition(
column_index=X.columns.get_loc('age'),
value = "child"
)
]
),
conclusion=ClassificationConclusion(
value='yes',
column_name='class',
),
column_names=X.columns,
)
print(rule)
IF age = {child} THEN class = yes
[12]:
rule.coverage = rule.calculate_coverage(X.to_numpy(), y.to_numpy())
print(rule.coverage)
(p=57, n=52, P=711, N=1490)
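These counts can be cross-checked directly against the data with plain pandas boolean masks:
# Verify the coverage of the new rule by counting matching rows directly
covered = X['age'] == 'child'
print((covered & (y == 'yes')).sum(), (covered & (y == 'no')).sum())  # 57 52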
[13]:
ruleset.rules.append(rule)
print("Updated Ruleset:")
for rule in ruleset.rules:
print(rule)
Updated Ruleset:
IF sex = {male} AND age = {adult} THEN class = no (p=1329, n=338, P=1490, N=711)
IF sex = {male} AND pclass != {1st} THEN class = no (p=1246, n=305, P=1490, N=711)
IF sex = {female} THEN class = yes (p=344, n=126, P=711, N=1490)
IF sex = {male} AND age = {adult} AND pclass != {1st} THEN class = no (p=1211, n=281, P=1490, N=711)
IF pclass != {3rd} AND sex = {female} THEN class = yes (p=254, n=20, P=711, N=1490)
IF age = {child} THEN class = yes (p=57, n=52, P=711, N=1490)
Now let’s remove the condition “pclass != 1st” from the rule “IF sex = male AND pclass != 1st THEN class = no”.
[14]:
condition_to_remove = ruleset.rules[1].premise.subconditions[1]
ruleset.rules[1].premise.subconditions.remove(condition_to_remove)
ruleset.rules[1].coverage = ruleset.rules[1].calculate_coverage(X.to_numpy(), y.to_numpy())
print("Updated Ruleset:")
for rule in ruleset.rules:
print(rule)
Updated Ruleset:
IF sex = {male} AND age = {adult} THEN class = no (p=1329, n=338, P=1490, N=711)
IF sex = {male} THEN class = no (p=1364, n=367, P=1490, N=711)
IF sex = {female} THEN class = yes (p=344, n=126, P=711, N=1490)
IF sex = {male} AND age = {adult} AND pclass != {1st} THEN class = no (p=1211, n=281, P=1490, N=711)
IF pclass != {3rd} AND sex = {female} THEN class = yes (p=254, n=20, P=711, N=1490)
IF age = {child} THEN class = yes (p=57, n=52, P=711, N=1490)
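The recalculated coverage of the simplified rule “IF sex = {male} THEN class = no” can again be confirmed with a direct count:
# The simplified second rule should now cover p=1364 positives and n=367 negatives
male = X['sex'] == 'male'
print((male & (y == 'no')).sum(), (male & (y == 'yes')).sum())  # 1364 367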
We can also modify the value of a condition. In the rule “IF sex = male AND age = adult AND pclass != 1st THEN class = no” we will change the condition “pclass != 1st” to the equality condition “pclass = 3st”.
[15]:
ruleset.rules[3].premise.subconditions[2].value = "3st"
ruleset.rules[3].premise.subconditions[2].negated = False
ruleset.rules[3].coverage = ruleset.rules[3].calculate_coverage(X.to_numpy(), y.to_numpy())
print("Updated Ruleset:")
for rule in ruleset.rules:
print(rule)
Updated Ruleset:
IF sex = {male} AND age = {adult} THEN class = no (p=1329, n=338, P=1490, N=711)
IF sex = {male} THEN class = no (p=1364, n=367, P=1490, N=711)
IF sex = {female} THEN class = yes (p=344, n=126, P=711, N=1490)
IF sex = {male} AND age = {adult} AND pclass = {3st} THEN class = no (p=0, n=0, P=1490, N=711)
IF pclass != {3rd} AND sex = {female} THEN class = yes (p=254, n=20, P=711, N=1490)
IF age = {child} THEN class = yes (p=57, n=52, P=711, N=1490)
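Finally, after all these edits it is good practice to recompute the ruleset-level statistics so that they reflect the modified rules (the call is the same as before; its output is not shown here):
# Recalculate the summary statistics for the edited ruleset
updated_stats = ruleset.calculate_ruleset_stats(X, y)
print(updated_stats)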