Use rules in textual form
In this tutorial, we load a set of regression rules in textual form, convert them into a decision-rules model, and evaluate them.
Load and prepare dataset
We begin by loading the boston-housing dataset into a DataFrame.
[24]:
import pandas as pd
BOSTON_HOUSING_PATH = (
    'https://raw.githubusercontent.com/ruleminer/decision-rules/'
    'refs/heads/docs/docs-src/source/tutorials/resources/boston-housing.csv'
)
boston_housing_df = pd.read_csv(BOSTON_HOUSING_PATH)
display(boston_housing_df)
print('Columns: ', boston_housing_df.columns.values)
X = boston_housing_df.drop("MEDV", axis=1)
y = boston_housing_df["MEDV"]
 | index | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0.00632 | 18 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15 | 396.90 | 4.98 | 24.0 |
1 | 1 | 0.02731 | 0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17 | 396.90 | 9.14 | 21.6 |
2 | 2 | 0.02729 | 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17 | 392.83 | 4.03 | 34.7 |
3 | 3 | 0.03237 | 0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18 | 394.63 | 2.94 | 33.4 |
4 | 4 | 0.06905 | 0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18 | 396.90 | 5.33 | 36.2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
501 | 501 | 0.06263 | 0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273 | 21 | 391.99 | 9.67 | 22.4 |
502 | 502 | 0.04527 | 0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273 | 21 | 396.90 | 9.08 | 20.6 |
503 | 503 | 0.06076 | 0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21 | 396.90 | 5.64 | 23.9 |
504 | 504 | 0.10959 | 0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273 | 21 | 393.45 | 6.48 | 22.0 |
505 | 505 | 0.04741 | 0 | 11.93 | 0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1 | 273 | 21 | 396.90 | 7.88 | 11.9 |
506 rows × 15 columns
Columns: ['index' 'CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX'
'PTRATIO' 'B' 'LSTAT' 'MEDV']
Load the ruleset in textual form
Next, we load the ruleset from a text file that contains one rule per line.
[ ]:
import urllib.request

FILE_PATH: str = (
    'https://raw.githubusercontent.com/ruleminer/decision-rules/'
    'refs/heads/docs/docs-src/source/tutorials/resources/regression/text_ruleset.txt'
)
# Download the rule file and split it into one rule per line
with urllib.request.urlopen(FILE_PATH) as response:
    text_rules_model = response.read().decode('utf-8').splitlines()
text_rules_model
['IF AGE >= 80.05 AND RM < 7.20 AND LSTAT >= 14.74 AND CRIM >= 1.06',
'IF LSTAT >= 14.43 AND AGE >= 77.95 AND CRIM >= 0.24',
'IF TAX >= 300.00 AND CRIM < 15.72 AND RM >= 5.06 AND LSTAT < 32.00 AND LSTAT >= 14.73',
'IF RM < 6.45 AND RM >= 5.75 AND AGE < 91.05 AND LSTAT < 14.16',
'IF RM < 6.59 AND B >= 198.44 AND LSTAT < 16.12 AND RM >= 5.64 AND DIS >= 1.15']
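To make the textual format concrete, here is a minimal sketch that splits one rule into (attribute, operator, threshold) triples using only the standard library. The helper `parse_rule_premise` and its regular expression are hypothetical and are not part of decision-rules; the library's own parser may accept a richer syntax.

```python
import re

# Hypothetical pattern (not from decision-rules): attribute, comparison operator, number.
CONDITION_PATTERN = re.compile(r"^\s*(\w+)\s*(>=|<=|<|>|=)\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_rule_premise(rule_text: str) -> list[tuple[str, str, float]]:
    """Split 'IF a >= 1 AND b < 2' into [('a', '>=', 1.0), ('b', '<', 2.0)]."""
    body = rule_text.strip()
    if not body.startswith("IF "):
        raise ValueError(f"Rule does not start with 'IF': {rule_text!r}")
    conditions = []
    # Drop the leading 'IF ' and treat every ' AND '-separated part as one condition.
    for part in body[len("IF "):].split(" AND "):
        match = CONDITION_PATTERN.match(part)
        if match is None:
            raise ValueError(f"Cannot parse condition: {part!r}")
        attribute, operator, threshold = match.groups()
        conditions.append((attribute, operator, float(threshold)))
    return conditions


parsed = parse_rule_premise("IF LSTAT >= 14.43 AND AGE >= 77.95 AND CRIM >= 0.24")
print(parsed)
```

Note that the rules in the file contain only a premise; the conclusion is filled in later from the data.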
Convert the textual ruleset to a decision-rules model
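Now that the rules are loaded, we convert them into a decision-rules model using the TextRuleSetFactory from the decision-rules library. The factory takes the rule strings together with X and y, so each rule's conclusion and coverage can be computed from the data. This conversion enables us to evaluate and modify the ruleset programmatically.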
[26]:
from decision_rules.ruleset_factories._factories.regression import TextRuleSetFactory
factory = TextRuleSetFactory()
ruleset = factory.make(text_rules_model, X, y)
After the conversion, we can easily display the model.
[27]:
for rule in ruleset.rules:
    print(rule)
IF AGE >= 80.05 AND RM < 7.20 AND LSTAT >= 14.74 AND CRIM >= 1.06 THEN MEDV = {13.13} [9.41, 16.84] (p=72, n=31, P=105, N=401)
IF LSTAT >= 14.43 AND AGE >= 77.95 AND CRIM >= 0.24 THEN MEDV = {14.08} [10.07, 18.10] (p=100, n=36, P=125, N=381)
IF TAX >= 300.00 AND CRIM < 15.72 AND RM >= 5.06 AND LSTAT < 32.00 AND LSTAT >= 14.73 THEN MEDV = {15.23} [11.46, 19.00] (p=89, n=32, P=139, N=367)
IF RM < 6.45 AND RM >= 5.75 AND AGE < 91.05 AND LSTAT < 14.16 THEN MEDV = {22.05} [18.49, 25.62] (p=129, n=11, P=227, N=279)
IF RM < 6.59 AND B >= 198.44 AND LSTAT < 16.12 AND RM >= 5.64 AND DIS >= 1.15 THEN MEDV = {21.75} [18.05, 25.45] (p=184, n=33, P=236, N=270)
Analyze the ruleset statistics
We can compute metrics both for the ruleset as a whole and for its individual rules.
We start by calculating and displaying the general characteristics of the ruleset.
[28]:
ruleset_stats = ruleset.calculate_ruleset_stats(X, y)
print(ruleset_stats)
{'rules_count': 5, 'avg_conditions_count': 4.2, 'avg_precision': 0.79, 'avg_coverage': 0.69, 'total_conditions_count': 21}
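As a quick sanity check, the two condition counts can be reproduced directly from the rule strings shown earlier. This is only a sketch that counts ` AND `-separated parts; it does not call the library.

```python
# The five rule strings exactly as loaded from the text file.
rules_text = [
    "IF AGE >= 80.05 AND RM < 7.20 AND LSTAT >= 14.74 AND CRIM >= 1.06",
    "IF LSTAT >= 14.43 AND AGE >= 77.95 AND CRIM >= 0.24",
    "IF TAX >= 300.00 AND CRIM < 15.72 AND RM >= 5.06 AND LSTAT < 32.00 AND LSTAT >= 14.73",
    "IF RM < 6.45 AND RM >= 5.75 AND AGE < 91.05 AND LSTAT < 14.16",
    "IF RM < 6.59 AND B >= 198.44 AND LSTAT < 16.12 AND RM >= 5.64 AND DIS >= 1.15",
]

# Each ' AND '-separated part of the premise is one elementary condition.
conditions_per_rule = [len(rule[len("IF "):].split(" AND ")) for rule in rules_text]
total_conditions_count = sum(conditions_per_rule)
avg_conditions_count = total_conditions_count / len(rules_text)

print(conditions_per_rule)
print(total_conditions_count, avg_conditions_count)
```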
Now let’s calculate metrics for each rule. To make the output easier to read, we organize the metrics into a DataFrame.
[29]:
rule_metrics = ruleset.calculate_rules_metrics(X, y)
# Build one table row per rule; floating-point metrics are rounded for display
rule_metrics_df = pd.DataFrame([
    {
        'Rule': f"r{i+1}",
        'p': metrics['p'],
        'n': metrics['n'],
        'P': metrics['P'],
        'N': metrics['N'],
        'unique_in_pos': metrics['unique_in_pos'],
        'unique_in_neg': metrics['unique_in_neg'],
        'p_unique': metrics['p_unique'],
        'n_unique': metrics['n_unique'],
        'all_unique': metrics['all_unique'],
        'support': round(metrics['support'], 3),
        'conditions_count': metrics['conditions_count'],
        'y_covered_avg': round(metrics['y_covered_avg'], 3),
        'y_covered_median': round(metrics['y_covered_median'], 3),
        'y_covered_min': metrics['y_covered_min'],
        'y_covered_max': metrics['y_covered_max'],
        'mae': round(metrics['mae'], 3),
        'rmse': round(metrics['rmse'], 3),
        'mape': round(metrics['mape'], 3),
        'p-value': round(metrics['p-value'], 3)
    }
    for i, (_, metrics) in enumerate(rule_metrics.items())
])
display(rule_metrics_df)
 | Rule | p | n | P | N | unique_in_pos | unique_in_neg | p_unique | n_unique | all_unique | support | conditions_count | y_covered_avg | y_covered_median | y_covered_min | y_covered_max | mae | rmse | mape | p-value |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | r1 | 72 | 31 | 105 | 401 | 3 | 6 | 0 | 0 | 0 | 0.204 | 4 | 13.128 | 13.40 | 5.0 | 27.5 | 10.096 | 13.148 | 0.398 | 0.0 |
1 | r2 | 100 | 36 | 125 | 381 | 10 | 7 | 3 | 3 | 6 | 0.269 | 3 | 14.081 | 13.95 | 5.0 | 30.7 | 9.397 | 12.484 | 0.374 | 0.0 |
2 | r3 | 89 | 32 | 139 | 367 | 13 | 18 | 10 | 11 | 21 | 0.239 | 5 | 15.233 | 14.90 | 6.3 | 27.5 | 8.650 | 11.735 | 0.351 | 0.0 |
3 | r4 | 129 | 11 | 227 | 279 | 1 | 4 | 1 | 2 | 3 | 0.277 | 4 | 22.053 | 21.70 | 11.9 | 50.0 | 6.577 | 9.201 | 0.352 | 0.0 |
4 | r5 | 184 | 33 | 236 | 270 | 53 | 25 | 43 | 15 | 58 | 0.429 | 5 | 21.747 | 21.40 | 11.9 | 50.0 | 6.549 | 9.222 | 0.346 | 0.0 |
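The counts p, n, P and N are enough to recompute several of the reported numbers by hand. The sketch below assumes the usual definitions precision = p / (p + n), coverage = p / P and support = (p + n) / (P + N); these formulas are an assumption rather than a quote from the library, but they reproduce the values printed above.

```python
# (p, n, P, N) for each rule, copied from the metrics table.
coverages = {
    "r1": (72, 31, 105, 401),
    "r2": (100, 36, 125, 381),
    "r3": (89, 32, 139, 367),
    "r4": (129, 11, 227, 279),
    "r5": (184, 33, 236, 270),
}

derived = {}
for name, (p, n, P, N) in coverages.items():
    derived[name] = {
        "precision": p / (p + n),      # share of covered examples that are positive
        "coverage": p / P,             # share of all positives that the rule covers
        "support": (p + n) / (P + N),  # share of the whole dataset that the rule covers
    }

avg_precision = sum(m["precision"] for m in derived.values()) / len(derived)
avg_coverage = sum(m["coverage"] for m in derived.values()) / len(derived)

print(round(avg_precision, 2), round(avg_coverage, 2))
print({name: round(m["support"], 3) for name, m in derived.items()})
```

The rounded averages match `avg_precision` and `avg_coverage` from the ruleset statistics, and the per-rule supports match the `support` column.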
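We can also calculate statistics such as condition importances. The result is a list of conditions sorted from the most to the least important according to the chosen measure (here c2).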
[30]:
from decision_rules.measures import c2
condition_importances = ruleset.calculate_condition_importances(X, y, measure=c2)
condition_importances
[30]:
[{'condition': 'LSTAT >= 14.43',
'attributes': ['LSTAT'],
'importance': 0.32767459373187646},
{'condition': 'LSTAT >= 14.74',
'attributes': ['LSTAT'],
'importance': 0.16689142510435745},
{'condition': 'LSTAT >= 14.73',
'attributes': ['LSTAT'],
'importance': 0.15660227492592682},
{'condition': 'RM < 6.45',
'attributes': ['RM'],
'importance': 0.1464841812280583},
{'condition': 'LSTAT < 14.16',
'attributes': ['LSTAT'],
'importance': 0.14473028867871895},
{'condition': 'RM < 6.59',
'attributes': ['RM'],
'importance': 0.1174108026185524},
{'condition': 'LSTAT < 16.12',
'attributes': ['LSTAT'],
'importance': 0.1090749271904711},
{'condition': 'AGE < 91.05',
'attributes': ['AGE'],
'importance': 0.08332534928637518},
{'condition': 'CRIM >= 1.06',
'attributes': ['CRIM'],
'importance': 0.07594939786770724},
{'condition': 'AGE >= 77.95',
'attributes': ['AGE'],
'importance': 0.051180099750744834},
{'condition': 'RM < 7.20',
'attributes': ['RM'],
'importance': 0.04416566101660634},
{'condition': 'AGE >= 80.05',
'attributes': ['AGE'],
'importance': 0.03676886132555876},
{'condition': 'CRIM < 15.72',
'attributes': ['CRIM'],
'importance': 0.030665695828604723},
{'condition': 'RM >= 5.75',
'attributes': ['RM'],
'importance': 0.030479048373156292},
{'condition': 'B >= 198.44',
'attributes': ['B'],
'importance': 0.028619349699875198},
{'condition': 'RM >= 5.64',
'attributes': ['RM'],
'importance': 0.02736509389238543},
{'condition': 'TAX >= 300.00',
'attributes': ['TAX'],
'importance': 0.02714594443889693},
{'condition': 'CRIM >= 0.24',
'attributes': ['CRIM'],
'importance': 0.020564473495815715},
{'condition': 'RM >= 5.06',
'attributes': ['RM'],
'importance': 0.01044814267795384},
{'condition': 'LSTAT < 32.00',
'attributes': ['LSTAT'],
'importance': 0.003366620183006248},
{'condition': 'DIS >= 1.15',
'attributes': ['DIS'],
'importance': 0.003149404172213438}]
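Because several conditions refer to the same attribute, it can be handy to aggregate importances per attribute. The sketch below sums them with plain Python on a small excerpt of the list above; summing is just one possible aggregation chosen for illustration, not something the library prescribes.

```python
from collections import defaultdict

# Excerpt of the output above (same structure: condition, attributes, importance).
condition_importances_excerpt = [
    {"condition": "LSTAT >= 14.43", "attributes": ["LSTAT"], "importance": 0.32767459373187646},
    {"condition": "LSTAT >= 14.74", "attributes": ["LSTAT"], "importance": 0.16689142510435745},
    {"condition": "RM < 6.45", "attributes": ["RM"], "importance": 0.1464841812280583},
    {"condition": "DIS >= 1.15", "attributes": ["DIS"], "importance": 0.003149404172213438},
]

# Sum the importance of every condition into each attribute it mentions.
attribute_totals = defaultdict(float)
for entry in condition_importances_excerpt:
    for attribute in entry["attributes"]:
        attribute_totals[attribute] += entry["importance"]

# Rank attributes from the most to the least important.
ranked = sorted(attribute_totals.items(), key=lambda item: item[1], reverse=True)
for attribute, total in ranked:
    print(f"{attribute}: {total:.4f}")
```

To aggregate the full result, pass `condition_importances` instead of the excerpt.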
Modify the ruleset
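The decision-rules model can be easily edited. As an example, we create a new rule stating “IF RM < 6.95 AND TAX >= 219.00 AND LSTAT < 14.17 THEN MEDV = 23.35” and then add it to the ruleset. Each elementary condition is expressed as an interval over a single column, with open or closed bounds.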
[31]:
from decision_rules.regression.rule import RegressionConclusion
from decision_rules.regression.rule import RegressionRule
from decision_rules.conditions import ElementaryCondition, CompoundCondition
rule = RegressionRule(
    premise=CompoundCondition(
        subconditions=[
            # Condition: RM < 6.95
            ElementaryCondition(
                column_index=X.columns.get_loc('RM'),
                left=float('-inf'),
                right=6.95,
                left_closed=False,
                right_closed=False
            ),
            # Condition: TAX >= 219.00
            ElementaryCondition(
                column_index=X.columns.get_loc('TAX'),
                left=219.00,
                right=float('inf'),
                left_closed=True,
                right_closed=False
            ),
            # Condition: LSTAT < 14.17
            ElementaryCondition(
                column_index=X.columns.get_loc('LSTAT'),
                left=float('-inf'),
                right=14.17,
                left_closed=False,
                right_closed=False
            ),
        ]
    ),
    conclusion=RegressionConclusion(
        value=23.35,
        column_name='MEDV',
        low=20.0,
        high=25.0
    ),
    column_names=X.columns,
)
print(rule)
IF RM < 6.95 AND TAX >= 219.00 AND LSTAT < 14.17 THEN MEDV = {23.35} [20.00, 25.00]
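The `left`, `right`, `left_closed` and `right_closed` arguments describe an interval. As a rough illustration of how such an interval maps to comparison operators, here is a standalone sketch; `in_interval` is a hypothetical helper written for this tutorial and is not the library's implementation.

```python
import math


def in_interval(value: float, left: float, right: float,
                left_closed: bool, right_closed: bool) -> bool:
    """Return True when value lies in the interval described by the four arguments."""
    above_left = value >= left if left_closed else value > left
    below_right = value <= right if right_closed else value < right
    return above_left and below_right


# TAX >= 219.00 corresponds to [219, +inf): the left bound is included.
print(in_interval(219.0, 219.0, math.inf, left_closed=True, right_closed=False))
# RM < 6.95 corresponds to (-inf, 6.95): the right bound is excluded.
print(in_interval(6.95, -math.inf, 6.95, left_closed=False, right_closed=False))
```

Setting `right_closed=True` on the second interval would turn “RM < 6.95” into “RM <= 6.95”.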
[32]:
rule.coverage = rule.calculate_coverage(X.to_numpy(), y.to_numpy())
print(rule.coverage)
(p=203, n=42, P=255, N=251)
[33]:
ruleset.rules.append(rule)
print("Updated Ruleset:")
for rule in ruleset.rules:
    print(rule)
Updated Ruleset:
IF AGE >= 80.05 AND RM < 7.20 AND LSTAT >= 14.74 AND CRIM >= 1.06 THEN MEDV = {13.13} [9.41, 16.84] (p=72, n=31, P=105, N=401)
IF LSTAT >= 14.43 AND AGE >= 77.95 AND CRIM >= 0.24 THEN MEDV = {14.08} [10.07, 18.10] (p=100, n=36, P=125, N=381)
IF TAX >= 300.00 AND CRIM < 15.72 AND RM >= 5.06 AND LSTAT < 32.00 AND LSTAT >= 14.73 THEN MEDV = {15.23} [11.46, 19.00] (p=89, n=32, P=139, N=367)
IF RM < 6.45 AND RM >= 5.75 AND AGE < 91.05 AND LSTAT < 14.16 THEN MEDV = {22.05} [18.49, 25.62] (p=129, n=11, P=227, N=279)
IF RM < 6.59 AND B >= 198.44 AND LSTAT < 16.12 AND RM >= 5.64 AND DIS >= 1.15 THEN MEDV = {21.75} [18.05, 25.45] (p=184, n=33, P=236, N=270)
IF RM < 6.95 AND TAX >= 219.00 AND LSTAT < 14.17 THEN MEDV = {23.28} [18.17, 28.39] (p=203, n=42, P=255, N=251)
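Notice that the new rule is printed with {23.28} [18.17, 28.39] rather than the hand-set {23.35} [20.00, 25.00], so the conclusion appears to be re-estimated from the data when coverage is calculated. Every printed interval is symmetric about its value, which is consistent with "mean of the covered targets plus or minus one standard deviation". The sketch below illustrates that reading on toy numbers; the formula, the choice of population standard deviation, and the data are all assumptions, not the library's documented behaviour.

```python
from statistics import fmean, pstdev

# Toy target values (not from the Boston Housing data).
covered_targets = [20.0, 22.0, 24.0, 26.0]   # examples satisfying the premise
uncovered_targets = [21.0, 30.0, 10.0]       # examples that do not satisfy it
all_targets = covered_targets + uncovered_targets

# Assumed re-estimation: value = mean, interval = mean +/- standard deviation.
value = fmean(covered_targets)
spread = pstdev(covered_targets)
low, high = value - spread, value + spread


def is_positive(y: float) -> bool:
    """An example counts as positive when its target falls inside [low, high]."""
    return low <= y <= high


p = sum(is_positive(y) for y in covered_targets)  # covered and positive
n = len(covered_targets) - p                      # covered but negative
P = sum(is_positive(y) for y in all_targets)      # all positives in the data
N = len(all_targets) - P                          # all negatives in the data

print(f"value={value:.2f} interval=[{low:.2f}, {high:.2f}]")
print(f"(p={p}, n={n}, P={P}, N={N})")
```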
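Now let’s remove the condition “AGE >= 77.95” from the rule “IF LSTAT >= 14.43 AND AGE >= 77.95 AND CRIM >= 0.24 THEN MEDV = {14.08} [10.07, 18.10]”. After editing the premise, we recalculate the rule's coverage so that the printed statistics reflect the change.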
[34]:
condition_to_remove = ruleset.rules[1].premise.subconditions[1]
ruleset.rules[1].premise.subconditions.remove(condition_to_remove)
ruleset.rules[1].coverage = ruleset.rules[1].calculate_coverage(X.to_numpy(), y.to_numpy())
print("Updated Ruleset:")
for rule in ruleset.rules:
    print(rule)
Updated Ruleset:
IF AGE >= 80.05 AND RM < 7.20 AND LSTAT >= 14.74 AND CRIM >= 1.06 THEN MEDV = {13.13} [9.41, 16.84] (p=72, n=31, P=105, N=401)
IF LSTAT >= 14.43 AND CRIM >= 0.24 THEN MEDV = {14.22} [10.01, 18.44] (p=105, n=41, P=134, N=372)
IF TAX >= 300.00 AND CRIM < 15.72 AND RM >= 5.06 AND LSTAT < 32.00 AND LSTAT >= 14.73 THEN MEDV = {15.23} [11.46, 19.00] (p=89, n=32, P=139, N=367)
IF RM < 6.45 AND RM >= 5.75 AND AGE < 91.05 AND LSTAT < 14.16 THEN MEDV = {22.05} [18.49, 25.62] (p=129, n=11, P=227, N=279)
IF RM < 6.59 AND B >= 198.44 AND LSTAT < 16.12 AND RM >= 5.64 AND DIS >= 1.15 THEN MEDV = {21.75} [18.05, 25.45] (p=184, n=33, P=236, N=270)
IF RM < 6.95 AND TAX >= 219.00 AND LSTAT < 14.17 THEN MEDV = {23.28} [18.17, 28.39] (p=203, n=42, P=255, N=251)
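Dropping a condition from a conjunction can only keep or enlarge the set of covered examples, which is why p + n for the second rule grows from 136 to 146. The following sketch shows the effect on a few made-up rows using plain Python; the rows and the `covered` helper are illustrative only.

```python
# Made-up rows (not taken from the dataset).
rows = [
    {"LSTAT": 15.0, "AGE": 80.0, "CRIM": 0.5},
    {"LSTAT": 16.0, "AGE": 60.0, "CRIM": 0.3},
    {"LSTAT": 10.0, "AGE": 90.0, "CRIM": 0.4},
    {"LSTAT": 18.0, "AGE": 85.0, "CRIM": 0.1},
]

# Each condition is a predicate over one row.
conditions = {
    "LSTAT >= 14.43": lambda row: row["LSTAT"] >= 14.43,
    "AGE >= 77.95": lambda row: row["AGE"] >= 77.95,
    "CRIM >= 0.24": lambda row: row["CRIM"] >= 0.24,
}


def covered(names: list[str]) -> list[int]:
    """Indices of rows that satisfy every named condition (a conjunction)."""
    return [i for i, row in enumerate(rows) if all(conditions[name](row) for name in names)]


full_premise = covered(["LSTAT >= 14.43", "AGE >= 77.95", "CRIM >= 0.24"])
reduced_premise = covered(["LSTAT >= 14.43", "CRIM >= 0.24"])

print(full_premise)
print(reduced_premise)
```

Row 1 fails only the AGE condition, so it becomes covered once that condition is removed.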
We can also modify the value of a condition. In the rule “IF RM < 6.45 AND RM >= 5.75 AND AGE < 91.05 AND LSTAT < 14.16 THEN MEDV = {22.05} [18.49, 25.62]”, we change the condition “AGE < 91.05” to “AGE <= 71.5” by lowering the right bound and making it closed.
[36]:
ruleset.rules[3].premise.subconditions[2].right = 71.5
ruleset.rules[3].premise.subconditions[2].right_closed = True
ruleset.rules[3].coverage = ruleset.rules[3].calculate_coverage(X.to_numpy(), y.to_numpy())
print("Updated Ruleset:")
for rule in ruleset.rules:
    print(rule)
Updated Ruleset:
IF AGE >= 80.05 AND RM < 7.20 AND LSTAT >= 14.74 AND CRIM >= 1.06 THEN MEDV = {13.13} [9.41, 16.84] (p=72, n=31, P=105, N=401)
IF LSTAT >= 14.43 AND CRIM >= 0.24 THEN MEDV = {14.22} [10.01, 18.44] (p=105, n=41, P=134, N=372)
IF TAX >= 300.00 AND CRIM < 15.72 AND RM >= 5.06 AND LSTAT < 32.00 AND LSTAT >= 14.73 THEN MEDV = {15.23} [11.46, 19.00] (p=89, n=32, P=139, N=367)
IF RM < 6.45 AND RM >= 5.75 AND AGE <= 71.50 AND LSTAT < 14.16 THEN MEDV = {22.28} [19.69, 24.87] (p=85, n=22, P=173, N=333)
IF RM < 6.59 AND B >= 198.44 AND LSTAT < 16.12 AND RM >= 5.64 AND DIS >= 1.15 THEN MEDV = {21.75} [18.05, 25.45] (p=184, n=33, P=236, N=270)
IF RM < 6.95 AND TAX >= 219.00 AND LSTAT < 14.17 THEN MEDV = {23.28} [18.17, 28.39] (p=203, n=42, P=255, N=251)