Use rules in textual form
In this tutorial, we load a set of survival rules in textual form, convert them into a decision-rules model, and then evaluate and modify that model.
Load and prepare dataset
We begin by loading the bone-marrow dataset into a DataFrame.
[1]:
import pandas as pd
BONE_MARROW_PATH: str = (
    'https://raw.githubusercontent.com/ruleminer/decision-rules/'
    'refs/heads/docs/docs-src/source/tutorials/resources/bone-marrow.csv'
)
bone_marrow_df = pd.read_csv(BONE_MARROW_PATH, index_col='index')
display(bone_marrow_df)
print('Columns: ', bone_marrow_df.columns.values)
X = bone_marrow_df.drop("survival_status", axis=1)
y = bone_marrow_df["survival_status"].astype(str)
index | donor_age | donor_age_below_35 | donor_ABO | donor_CMV | recipient_age | recipient_age_below_10 | recipient_age_int | recipient_gender | recipient_body_mass | recipient_ABO | ... | CD3_to_CD34_ratio | ANC_recovery | PLT_recovery | acute_GvHD_II_III_IV | acute_GvHD_III_IV | time_to_acute_GvHD_III_IV | extensive_chronic_GvHD | relapse | survival_time | survival_status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.830137 | yes | A | present | 9.6 | yes | 5_10 | male | 35.0 | A | ... | 1.338760 | 19.0 | 51.0 | yes | yes | 32.0 | no | no | 999.0 | 0 |
1 | 23.342466 | yes | B | absent | 4.0 | yes | 0_5 | male | 20.6 | B | ... | 11.078295 | 16.0 | 37.0 | yes | no | 1000000.0 | no | yes | 163.0 | 1 |
2 | 26.394521 | yes | B | absent | 6.6 | yes | 5_10 | male | 23.4 | B | ... | 19.013230 | 23.0 | 20.0 | yes | no | 1000000.0 | no | yes | 435.0 | 1 |
3 | 39.684932 | no | A | present | 18.1 | no | 10_20 | female | 50.0 | AB | ... | 29.481647 | 23.0 | 29.0 | yes | yes | 19.0 | NaN | no | 53.0 | 1 |
4 | 33.358904 | yes | A | absent | 1.3 | yes | 0_5 | female | 9.0 | AB | ... | 3.972255 | 14.0 | 14.0 | no | no | 1000000.0 | no | no | 2043.0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
182 | 37.575342 | no | A | present | 12.9 | no | 10_20 | male | 44.0 | A | ... | 2.522750 | 15.0 | 22.0 | yes | yes | 16.0 | no | yes | 385.0 | 1 |
183 | 22.895890 | yes | A | absent | 13.9 | no | 10_20 | female | 44.5 | 0 | ... | 1.038858 | 12.0 | 30.0 | no | no | 1000000.0 | no | no | 634.0 | 1 |
184 | 27.347945 | yes | A | present | 10.4 | no | 10_20 | female | 33.0 | B | ... | 1.635559 | 16.0 | 16.0 | yes | no | 1000000.0 | no | no | 1895.0 | 0 |
185 | 27.780822 | yes | A | absent | 8.0 | yes | 5_10 | male | 24.0 | 0 | ... | 8.077770 | 13.0 | 14.0 | yes | yes | 54.0 | yes | no | 382.0 | 1 |
186 | 55.553425 | no | A | present | 9.5 | yes | 5_10 | male | 37.0 | AB | ... | 0.948135 | 18.0 | 20.0 | yes | no | 1000000.0 | no | no | 1109.0 | 0 |
187 rows × 37 columns
Columns: ['donor_age' 'donor_age_below_35' 'donor_ABO' 'donor_CMV' 'recipient_age'
'recipient_age_below_10' 'recipient_age_int' 'recipient_gender'
'recipient_body_mass' 'recipient_ABO' 'recipient_rh' 'recipient_CMV'
'disease' 'disease_group' 'gender_match' 'ABO_match' 'CMV_status'
'HLA_match' 'HLA_mismatch' 'antigen' 'allel' 'HLA_group_1' 'risk_group'
'stem_cell_source' 'tx_post_relapse' 'CD34_x1e6_per_kg' 'CD3_x1e8_per_kg'
'CD3_to_CD34_ratio' 'ANC_recovery' 'PLT_recovery' 'acute_GvHD_II_III_IV'
'acute_GvHD_III_IV' 'time_to_acute_GvHD_III_IV' 'extensive_chronic_GvHD'
'relapse' 'survival_time' 'survival_status']
Load the ruleset in textual form
Next, we load the ruleset from a text file that stores one rule per line.
[2]:
import urllib.request
FILE_PATH: str = (
    'https://raw.githubusercontent.com/ruleminer/decision-rules/'
    'refs/heads/docs/docs-src/source/tutorials/resources/survival/text_ruleset.txt'
)
# Download the file and split it into one rule string per line
with urllib.request.urlopen(FILE_PATH) as response:
    text_rules_model = response.read().decode('utf-8').splitlines()
text_rules_model
[2]:
['IF recipient_age < 17.45 AND relapse = {no} AND donor_age < 45.16',
'IF HLA_mismatch = {matched} AND gender_match = {other} AND recipient_rh = {plus} AND recipient_age >= 3.30 AND donor_age < 42.14 AND donor_age >= 33.34',
'IF recipient_age < 18.00 AND recipient_body_mass < 69.00',
'IF CD34_x1e6_per_kg < 8.14 AND donor_age >= 27.02 AND gender_match = {other} AND PLT_recovery >= 26.00 AND recipient_age >= 17.75']
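Each string in the list is a premise: a conjunction of elementary conditions joined by `AND`. To make that structure concrete, here is a minimal sketch that splits one rule string into its conditions. `split_premise` is a hypothetical helper written for this illustration; it is not the parser used by decision-rules, which is handled by the factory in the next step.

```python
# Illustrative only: split_premise is a hypothetical helper, not the
# parser used by the decision-rules library.
def split_premise(rule_text: str) -> list[str]:
    """Return the elementary conditions of an 'IF a AND b ...' rule string."""
    premise = rule_text.strip()
    if premise.startswith("IF "):
        premise = premise[len("IF "):]
    # Drop an optional conclusion so that only the premise is split
    premise = premise.split(" THEN ")[0]
    return [part.strip() for part in premise.split(" AND ")]


rule_text = "IF recipient_age < 17.45 AND relapse = {no} AND donor_age < 45.16"
conditions = split_premise(rule_text)
print(conditions)
# ['recipient_age < 17.45', 'relapse = {no}', 'donor_age < 45.16']
```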
Convert the textual ruleset to a decision-rules model
Now that the rules are loaded, we convert them into a decision-rules model using the TextRuleSetFactory from the decision-rules library. This conversion enables us to evaluate and modify the ruleset programmatically.
[3]:
from decision_rules.ruleset_factories._factories.survival import TextRuleSetFactory
factory = TextRuleSetFactory()
ruleset = factory.make(text_rules_model, X, y)
After the conversion, we can easily display the model.
[4]:
for rule in ruleset.rules:
    print(rule)
IF recipient_age < 17.45 AND relapse = {no} AND donor_age < 45.16 THEN survival_status = {inf} (p=134, n=0, P=187, N=0)
IF HLA_mismatch = {matched} AND gender_match = {other} AND recipient_rh = {plus} AND recipient_age >= 3.30 AND donor_age < 42.14 AND donor_age >= 33.34 THEN survival_status = {261.0} (p=35, n=0, P=187, N=0)
IF recipient_age < 18.00 AND recipient_body_mass < 69.00 THEN survival_status = {inf} (p=167, n=0, P=187, N=0)
IF CD34_x1e6_per_kg < 8.14 AND donor_age >= 27.02 AND gender_match = {other} AND PLT_recovery >= 26.00 AND recipient_age >= 17.75 THEN survival_status = {41.0} (p=6, n=0, P=187, N=0)
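In the printed rules, `p` appears to be the number of examples that satisfy the premise, out of `P` examples in total; `n` and `N` are 0 throughout this output. As a rough illustration of what "satisfying the premise" means, the sketch below checks the first rule by hand against the first five rows of the table shown earlier. `rule_1_covers` is a hypothetical function written for this example and does not use the library.

```python
# First five rows of the dataset table above, restricted to the
# columns used by the first rule.
rows = [
    {"recipient_age": 9.6, "relapse": "no", "donor_age": 22.830137},
    {"recipient_age": 4.0, "relapse": "yes", "donor_age": 23.342466},
    {"recipient_age": 6.6, "relapse": "yes", "donor_age": 26.394521},
    {"recipient_age": 18.1, "relapse": "no", "donor_age": 39.684932},
    {"recipient_age": 1.3, "relapse": "no", "donor_age": 33.358904},
]


def rule_1_covers(row: dict) -> bool:
    """Hand-written check of the premise
    recipient_age < 17.45 AND relapse = {no} AND donor_age < 45.16."""
    return (
        row["recipient_age"] < 17.45
        and row["relapse"] == "no"
        and row["donor_age"] < 45.16
    )


covered = [i for i, row in enumerate(rows) if rule_1_covers(row)]
print(covered)  # [0, 4]
```

Rows 1 and 2 fail on `relapse`, and row 3 fails on `recipient_age`, so only rows 0 and 4 are covered.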
Analyze the ruleset statistics
We can compute various metrics for the ruleset. This step involves retrieving statistical information about the rules.
We start with the general characteristics of the whole ruleset: the number of rules and conditions, and the average precision and coverage.
[5]:
ruleset_stats = ruleset.calculate_ruleset_stats(X, y)
print(ruleset_stats)
{'rules_count': 4, 'avg_conditions_count': 4.0, 'avg_precision': 1.0, 'avg_coverage': 0.46, 'total_conditions_count': 16}
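These numbers can be cross-checked against the printed rules. The sketch below assumes that a rule's coverage is `p / P` and that the averages are plain means over the four rules; under those assumptions it reproduces the reported `avg_coverage` and `avg_conditions_count`.

```python
# (p, P) pairs copied from the printed rules above
coverages = [(134, 187), (35, 187), (167, 187), (6, 187)]

# Assumption: coverage of a single rule is p / P
avg_coverage = sum(p / P for p, P in coverages) / len(coverages)
print(round(avg_coverage, 2))  # 0.46

# Number of elementary conditions in each of the four premises
conditions_per_rule = [3, 6, 2, 5]
avg_conditions = sum(conditions_per_rule) / len(conditions_per_rule)
print(sum(conditions_per_rule), avg_conditions)  # 16 4.0
```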
Now let’s calculate metrics for each rule. To make the output easier to read and interpret, we organize the metrics into a DataFrame.
[6]:
rule_metrics = ruleset.calculate_rules_metrics(X, y)
rule_metrics_df = pd.DataFrame([
    {
        'Rule': f"r{i+1}",
        'p': metrics['p'],
        'n': metrics['n'],
        'P': metrics['P'],
        'N': metrics['N'],
        'Unique': metrics['unique'],
        'Median Survival Time': metrics['median_survival_time'],
        'Median Survival Time CI Lower': metrics['median_survival_time_ci_lower'],
        'Median Survival Time CI Upper': metrics['median_survival_time_ci_upper'],
        'Events Count': metrics['events_count'],
        'Censored Count': metrics['censored_count'],
        'Log Rank': round(metrics['log_rank'], 6)
    }
    for i, (_, metrics) in enumerate(rule_metrics.items())
])
display(rule_metrics_df)
Rule | p | n | P | N | Unique | Median Survival Time | Median Survival Time CI Lower | Median Survival Time CI Upper | Events Count | Censored Count | Log Rank | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | r1 | 134 | 0 | 187 | 0 | 4 | inf | inf | inf | 44 | 90 | 1.000000 |
1 | r2 | 35 | 0 | 187 | 0 | 2 | 261.0 | 66.0 | 996.0 | 24 | 11 | 0.999369 |
2 | r3 | 167 | 0 | 187 | 0 | 31 | inf | 1243.0 | inf | 68 | 99 | 1.000000 |
3 | r4 | 6 | 0 | 187 | 0 | 5 | 41.0 | 15.0 | 202.0 | 6 | 0 | 1.000000 |
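A median survival time of `inf` may look surprising. One common way to obtain a median survival time is the Kaplan-Meier estimator, where the median is the first time at which the estimated survival probability drops to 0.5 or below. Assuming the metrics above are computed along those lines (the source does not state this explicitly), `inf` would mean that the curve never reaches 0.5, which is plausible when most covered examples are censored, as for r1 and r3. The sketch below is a minimal stand-alone implementation on toy inputs; `km_median` is a hypothetical helper, not a decision-rules function.

```python
import math
from fractions import Fraction


def km_median(times, events):
    """Smallest time at which the Kaplan-Meier survival estimate is <= 0.5.

    events[i] is 1 for an observed event and 0 for a censored observation.
    Returns math.inf if the estimate never drops to 0.5.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = Fraction(1)  # exact arithmetic avoids rounding at the 0.5 boundary
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = leaving = 0
        # Group all observations that share the same time
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        survival *= Fraction(at_risk - deaths, at_risk)
        if survival <= Fraction(1, 2):
            return t
        at_risk -= leaving
    return math.inf


# Toy inputs, not taken from the bone marrow data
all_events = km_median([1, 2, 3, 4], [1, 1, 1, 1])
mostly_censored = km_median([5, 10, 15, 20], [1, 0, 0, 0])
print(all_events)       # 2
print(mostly_censored)  # inf
```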
We can also calculate condition importances. The result is a list of conditions sorted by decreasing importance.
[7]:
condition_importances = ruleset.calculate_condition_importances(X, y)
condition_importances
[7]:
[{'condition': 'recipient_age < 18.00',
'attributes': ['recipient_age'],
'importance': 0.4999998982831645},
{'condition': 'recipient_body_mass < 69.00',
'attributes': ['recipient_body_mass'],
'importance': 0.49999961193118536},
{'condition': 'recipient_age < 17.45',
'attributes': ['recipient_age'],
'importance': 0.3332428003216177},
{'condition': 'relapse = {no}',
'attributes': ['relapse'],
'importance': 0.33297498018547933},
{'condition': 'donor_age < 45.16',
'attributes': ['donor_age'],
'importance': 0.3325001557811621},
{'condition': 'recipient_age >= 17.75',
'attributes': ['recipient_age'],
'importance': 0.2127866552730239},
{'condition': 'donor_age >= 33.34',
'attributes': ['donor_age'],
'importance': 0.20513529109132178},
{'condition': 'PLT_recovery >= 26.00',
'attributes': ['PLT_recovery'],
'importance': 0.19995047949703404},
{'condition': 'CD34_x1e6_per_kg < 8.14',
'attributes': ['CD34_x1e6_per_kg'],
'importance': 0.1971795671241863},
{'condition': 'recipient_rh = {plus}',
'attributes': ['recipient_rh'],
'importance': 0.1530846349015695},
{'condition': 'recipient_age >= 3.30',
'attributes': ['recipient_age'],
'importance': 0.13088032818287365},
{'condition': 'donor_age >= 27.02',
'attributes': ['donor_age'],
'importance': 0.06192325509108898},
{'condition': 'gender_match = {other}',
'attributes': ['gender_match'],
'importance': 0.05737520104853856},
{'condition': 'donor_age < 42.14',
'attributes': ['donor_age'],
'importance': 0.05514007434486943},
{'condition': 'HLA_mismatch = {matched}',
'attributes': ['HLA_mismatch'],
'importance': 0.0512592007733366}]
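Because the result is a plain list of dictionaries, it is easy to post-process. As one possible follow-up, the sketch below sums the importances per attribute to get a rough attribute-level ranking. Summing is an assumption made for this illustration and is not necessarily how decision-rules defines attribute importance; the sample contains only the first five entries of the output above.

```python
from collections import defaultdict

# First five entries of the output above, copied verbatim
sample_importances = [
    {'condition': 'recipient_age < 18.00',
     'attributes': ['recipient_age'],
     'importance': 0.4999998982831645},
    {'condition': 'recipient_body_mass < 69.00',
     'attributes': ['recipient_body_mass'],
     'importance': 0.49999961193118536},
    {'condition': 'recipient_age < 17.45',
     'attributes': ['recipient_age'],
     'importance': 0.3332428003216177},
    {'condition': 'relapse = {no}',
     'attributes': ['relapse'],
     'importance': 0.33297498018547933},
    {'condition': 'donor_age < 45.16',
     'attributes': ['donor_age'],
     'importance': 0.3325001557811621},
]

# Assumption: aggregate by summing the importances of all conditions
# that mention a given attribute.
per_attribute = defaultdict(float)
for entry in sample_importances:
    for attribute in entry['attributes']:
        per_attribute[attribute] += entry['importance']

ranking = sorted(per_attribute.items(), key=lambda item: item[1], reverse=True)
for attribute, total in ranking:
    print(f"{attribute}: {total:.4f}")
```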
Modify the ruleset
The decision-rules model can be easily edited. For example, we will create a new rule, “IF relapse = {yes} AND HLA_mismatch = {matched} THEN survival_status = 413”, and then add it to the ruleset.
[8]:
from decision_rules.survival.rule import SurvivalConclusion
from decision_rules.survival.rule import SurvivalRule
from decision_rules.conditions import NominalCondition, CompoundCondition
rule = SurvivalRule(
    premise=CompoundCondition(
        subconditions=[
            NominalCondition(
                column_index=X.columns.get_loc('relapse'),
                value='yes'
            ),
            NominalCondition(
                column_index=X.columns.get_loc('HLA_mismatch'),
                value='matched'
            )
        ]
    ),
    conclusion=SurvivalConclusion(
        value=413,
        column_name='survival_status'
    ),
    column_names=X.columns.tolist(),
    survival_time_attr='survival_time'
)
print(rule)
IF relapse = {yes} AND HLA_mismatch = {matched} THEN survival_status = {413}
[9]:
rule.coverage = rule.calculate_coverage(X.to_numpy(), y.to_numpy())
print(rule.coverage)
(p=24, n=0, P=187, N=0)
[10]:
ruleset.rules.append(rule)
print("Updated Ruleset:")
for rule in ruleset.rules:
    print(rule)
Updated Ruleset:
IF recipient_age < 17.45 AND relapse = {no} AND donor_age < 45.16 THEN survival_status = {inf} (p=134, n=0, P=187, N=0)
IF HLA_mismatch = {matched} AND gender_match = {other} AND recipient_rh = {plus} AND recipient_age >= 3.30 AND donor_age < 42.14 AND donor_age >= 33.34 THEN survival_status = {261.0} (p=35, n=0, P=187, N=0)
IF recipient_age < 18.00 AND recipient_body_mass < 69.00 THEN survival_status = {inf} (p=167, n=0, P=187, N=0)
IF CD34_x1e6_per_kg < 8.14 AND donor_age >= 27.02 AND gender_match = {other} AND PLT_recovery >= 26.00 AND recipient_age >= 17.75 THEN survival_status = {41.0} (p=6, n=0, P=187, N=0)
IF relapse = {yes} AND HLA_mismatch = {matched} THEN survival_status = {413.0} (p=24, n=0, P=187, N=0)
Now let’s remove the condition “gender_match = {other}” from the rule “IF HLA_mismatch = {matched} AND gender_match = {other} AND recipient_rh = {plus} AND recipient_age >= 3.30 AND donor_age < 42.14 AND donor_age >= 33.34 THEN survival_status = {261.0}” and recalculate the rule's coverage.
[11]:
# Remove the second condition (gender_match = {other}) from the second rule
condition_to_remove = ruleset.rules[1].premise.subconditions[1]
ruleset.rules[1].premise.subconditions.remove(condition_to_remove)
# Recalculate the coverage of the modified rule
ruleset.rules[1].coverage = ruleset.rules[1].calculate_coverage(X.to_numpy(), y.to_numpy())
print("Updated Ruleset:")
for rule in ruleset.rules:
    print(rule)
Updated Ruleset:
IF recipient_age < 17.45 AND relapse = {no} AND donor_age < 45.16 THEN survival_status = {inf} (p=134, n=0, P=187, N=0)
IF HLA_mismatch = {matched} AND recipient_rh = {plus} AND recipient_age >= 3.30 AND donor_age < 42.14 AND donor_age >= 33.34 THEN survival_status = {403.0} (p=41, n=0, P=187, N=0)
IF recipient_age < 18.00 AND recipient_body_mass < 69.00 THEN survival_status = {inf} (p=167, n=0, P=187, N=0)
IF CD34_x1e6_per_kg < 8.14 AND donor_age >= 27.02 AND gender_match = {other} AND PLT_recovery >= 26.00 AND recipient_age >= 17.75 THEN survival_status = {41.0} (p=6, n=0, P=187, N=0)
IF relapse = {yes} AND HLA_mismatch = {matched} THEN survival_status = {413.0} (p=24, n=0, P=187, N=0)
We can also modify the value of a condition. In the rule “IF CD34_x1e6_per_kg < 8.14 AND donor_age >= 27.02 AND gender_match = {other} AND PLT_recovery >= 26.00 AND recipient_age >= 17.75 THEN survival_status = {41.0}” we will update the condition “donor_age >= 27.02” to “donor_age > 25.0”
[12]:
ruleset.rules[3].premise.subconditions[1].left = 25.0
ruleset.rules[3].premise.subconditions[1].left_closed = False
ruleset.rules[3].coverage = ruleset.rules[3].calculate_coverage(X.to_numpy(), y.to_numpy())
print("Updated Ruleset:")
for rule in ruleset.rules:
    print(rule)
Updated Ruleset:
IF recipient_age < 17.45 AND relapse = {no} AND donor_age < 45.16 THEN survival_status = {inf} (p=134, n=0, P=187, N=0)
IF HLA_mismatch = {matched} AND recipient_rh = {plus} AND recipient_age >= 3.30 AND donor_age < 42.14 AND donor_age >= 33.34 THEN survival_status = {403.0} (p=41, n=0, P=187, N=0)
IF recipient_age < 18.00 AND recipient_body_mass < 69.00 THEN survival_status = {inf} (p=167, n=0, P=187, N=0)
IF CD34_x1e6_per_kg < 8.14 AND donor_age > 25.00 AND gender_match = {other} AND PLT_recovery >= 26.00 AND recipient_age >= 17.75 THEN survival_status = {53.0} (p=7, n=0, P=187, N=0)
IF relapse = {yes} AND HLA_mismatch = {matched} THEN survival_status = {413.0} (p=24, n=0, P=187, N=0)
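The last edit changed two attributes: `left` (the lower bound) and `left_closed` (whether the bound itself is included). To show how such a pair of attributes maps to `>=` versus `>`, here is a small stand-alone sketch. `Interval` is a hypothetical class written for this illustration; it only mirrors the attribute names used above and is not the condition class from decision-rules.

```python
import math
from dataclasses import dataclass


@dataclass
class Interval:
    """Hypothetical stand-in for a numerical condition (not the library class)."""
    left: float = -math.inf
    right: float = math.inf
    left_closed: bool = False
    right_closed: bool = False

    def covers(self, x: float) -> bool:
        # A closed bound includes the boundary value, an open bound excludes it
        above = x >= self.left if self.left_closed else x > self.left
        below = x <= self.right if self.right_closed else x < self.right
        return above and below


cond = Interval(left=27.02, left_closed=True)  # donor_age >= 27.02
before = cond.covers(26.0)

cond.left = 25.0
cond.left_closed = False                       # donor_age > 25.0
after = cond.covers(26.0)
at_bound = cond.covers(25.0)

print(before, after, at_bound)  # False True False
```

Lowering the bound widens the interval, which is consistent with the rule's `p` growing from 6 to 7 in the output above.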