Use rules in textual form

In this tutorial, we will load a set of survival rules in textual form, convert them into a decision-rules model, and evaluate them.

Load and prepare dataset

We begin by loading the bone-marrow dataset into a DataFrame.

[1]:
import pandas as pd
BONE_MARROW_PATH: str = (
    'https://raw.githubusercontent.com/ruleminer/decision-rules/'
    'refs/heads/docs/docs-src/source/tutorials/resources/bone-marrow.csv'
)
bone_marrow_df = pd.read_csv(BONE_MARROW_PATH, index_col='index')
display(bone_marrow_df)
print('Columns: ', bone_marrow_df.columns.values)
X = bone_marrow_df.drop("survival_status", axis=1)
y = bone_marrow_df["survival_status"].astype(str)
donor_age donor_age_below_35 donor_ABO donor_CMV recipient_age recipient_age_below_10 recipient_age_int recipient_gender recipient_body_mass recipient_ABO ... CD3_to_CD34_ratio ANC_recovery PLT_recovery acute_GvHD_II_III_IV acute_GvHD_III_IV time_to_acute_GvHD_III_IV extensive_chronic_GvHD relapse survival_time survival_status
index
0 22.830137 yes A present 9.6 yes 5_10 male 35.0 A ... 1.338760 19.0 51.0 yes yes 32.0 no no 999.0 0
1 23.342466 yes B absent 4.0 yes 0_5 male 20.6 B ... 11.078295 16.0 37.0 yes no 1000000.0 no yes 163.0 1
2 26.394521 yes B absent 6.6 yes 5_10 male 23.4 B ... 19.013230 23.0 20.0 yes no 1000000.0 no yes 435.0 1
3 39.684932 no A present 18.1 no 10_20 female 50.0 AB ... 29.481647 23.0 29.0 yes yes 19.0 NaN no 53.0 1
4 33.358904 yes A absent 1.3 yes 0_5 female 9.0 AB ... 3.972255 14.0 14.0 no no 1000000.0 no no 2043.0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
182 37.575342 no A present 12.9 no 10_20 male 44.0 A ... 2.522750 15.0 22.0 yes yes 16.0 no yes 385.0 1
183 22.895890 yes A absent 13.9 no 10_20 female 44.5 0 ... 1.038858 12.0 30.0 no no 1000000.0 no no 634.0 1
184 27.347945 yes A present 10.4 no 10_20 female 33.0 B ... 1.635559 16.0 16.0 yes no 1000000.0 no no 1895.0 0
185 27.780822 yes A absent 8.0 yes 5_10 male 24.0 0 ... 8.077770 13.0 14.0 yes yes 54.0 yes no 382.0 1
186 55.553425 no A present 9.5 yes 5_10 male 37.0 AB ... 0.948135 18.0 20.0 yes no 1000000.0 no no 1109.0 0

187 rows × 37 columns

Columns:  ['donor_age' 'donor_age_below_35' 'donor_ABO' 'donor_CMV' 'recipient_age'
 'recipient_age_below_10' 'recipient_age_int' 'recipient_gender'
 'recipient_body_mass' 'recipient_ABO' 'recipient_rh' 'recipient_CMV'
 'disease' 'disease_group' 'gender_match' 'ABO_match' 'CMV_status'
 'HLA_match' 'HLA_mismatch' 'antigen' 'allel' 'HLA_group_1' 'risk_group'
 'stem_cell_source' 'tx_post_relapse' 'CD34_x1e6_per_kg' 'CD3_x1e8_per_kg'
 'CD3_to_CD34_ratio' 'ANC_recovery' 'PLT_recovery' 'acute_GvHD_II_III_IV'
 'acute_GvHD_III_IV' 'time_to_acute_GvHD_III_IV' 'extensive_chronic_GvHD'
 'relapse' 'survival_time' 'survival_status']

Load the ruleset in textual form

Next, we load the ruleset from a text file in which each line holds the premise of one rule.

[2]:
import urllib.request  # importing plain `urllib` does not guarantee that the `request` submodule is loaded

FILE_PATH: str = (
    'https://raw.githubusercontent.com/ruleminer/decision-rules/'
    'refs/heads/docs/docs-src/source/tutorials/resources/survival/text_ruleset.txt'
)

with urllib.request.urlopen(FILE_PATH) as response:
    text_rules_model = response.read().decode('utf-8').splitlines()

text_rules_model
[2]:
['IF recipient_age < 17.45 AND relapse = {no} AND donor_age < 45.16',
 'IF HLA_mismatch = {matched} AND gender_match = {other} AND recipient_rh = {plus} AND recipient_age >= 3.30 AND donor_age < 42.14 AND donor_age >= 33.34',
 'IF recipient_age < 18.00 AND recipient_body_mass < 69.00',
 'IF CD34_x1e6_per_kg < 8.14 AND donor_age >= 27.02 AND gender_match = {other} AND PLT_recovery >= 26.00 AND recipient_age >= 17.75']
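
To make the textual format concrete, here is a minimal sketch that splits one rule line into its conditions. It is not the parser used by decision-rules; the helper name `parse_rule_text` and the regular expression are assumptions made for illustration, and they only cover the condition shapes visible in the output above (numeric comparisons and nominal values in braces).

```python
import re
from typing import List, Tuple, Union

# Hypothetical helper, NOT part of decision-rules: it only illustrates the
# "IF <condition> AND <condition> ..." format shown above.
CONDITION_PATTERN = re.compile(
    r"^(?P<attribute>\w+)\s*(?P<operator><=|>=|<|>|=)\s*(?P<value>.+)$"
)


def parse_rule_text(rule_text: str) -> List[Tuple[str, str, Union[str, float]]]:
    """Split a textual premise into (attribute, operator, value) triples."""
    premise = rule_text.strip()
    if premise.startswith("IF "):
        premise = premise[len("IF "):]
    conditions = []
    for part in premise.split(" AND "):
        match = CONDITION_PATTERN.match(part.strip())
        if match is None:
            raise ValueError(f"Cannot parse condition: {part!r}")
        raw_value = match["value"].strip()
        if raw_value.startswith("{") and raw_value.endswith("}"):
            value: Union[str, float] = raw_value[1:-1]  # nominal value, e.g. {no}
        else:
            value = float(raw_value)  # numeric threshold, e.g. 17.45
        conditions.append((match["attribute"], match["operator"], value))
    return conditions


parsed = parse_rule_text(
    "IF recipient_age < 17.45 AND relapse = {no} AND donor_age < 45.16"
)
for condition in parsed:
    print(condition)
```

A rule line carries only the premise; the conclusion and coverage are computed from the data when the model is built.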

Convert the textual ruleset to a decision-rules model

Now that the rules are loaded, we convert them into a decision-rules model using the TextRuleSetFactory from the decision-rules library. This conversion lets us evaluate and modify the ruleset programmatically.

[3]:
from decision_rules.ruleset_factories._factories.survival import TextRuleSetFactory

factory = TextRuleSetFactory()
ruleset = factory.make(text_rules_model, X, y)

After conversion into a decision-rules model, we can easily display the rules.

[4]:
for rule in ruleset.rules:
    print(rule)
IF recipient_age < 17.45 AND relapse = {no} AND donor_age < 45.16 THEN survival_status = {inf} (p=134, n=0, P=187, N=0)
IF HLA_mismatch = {matched} AND gender_match = {other} AND recipient_rh = {plus} AND recipient_age >= 3.30 AND donor_age < 42.14 AND donor_age >= 33.34 THEN survival_status = {261.0} (p=35, n=0, P=187, N=0)
IF recipient_age < 18.00 AND recipient_body_mass < 69.00 THEN survival_status = {inf} (p=167, n=0, P=187, N=0)
IF CD34_x1e6_per_kg < 8.14 AND donor_age >= 27.02 AND gender_match = {other} AND PLT_recovery >= 26.00 AND recipient_age >= 17.75 THEN survival_status = {41.0} (p=6, n=0, P=187, N=0)

Analyze the ruleset statistics

We can compute various metrics for the ruleset. This step involves retrieving statistical information about the rules.

We start by calculating and displaying the general characteristics of the ruleset.

[5]:
ruleset_stats = ruleset.calculate_ruleset_stats(X, y)

print(ruleset_stats)
{'rules_count': 4, 'avg_conditions_count': 4.0, 'avg_precision': 1.0, 'avg_coverage': 0.46, 'total_conditions_count': 16}
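
The summary values can be cross-checked by hand from the rules and coverages printed earlier. The sketch below assumes that precision is p / (p + n) and coverage is p / P, averaged over the rules; these formulas are an assumption rather than the library's documented definition, although they are consistent with the printed numbers.

```python
# Premises copied from the textual ruleset shown earlier in this tutorial.
rule_texts = [
    "IF recipient_age < 17.45 AND relapse = {no} AND donor_age < 45.16",
    "IF HLA_mismatch = {matched} AND gender_match = {other} AND recipient_rh = {plus} "
    "AND recipient_age >= 3.30 AND donor_age < 42.14 AND donor_age >= 33.34",
    "IF recipient_age < 18.00 AND recipient_body_mass < 69.00",
    "IF CD34_x1e6_per_kg < 8.14 AND donor_age >= 27.02 AND gender_match = {other} "
    "AND PLT_recovery >= 26.00 AND recipient_age >= 17.75",
]
# (p, n) pairs copied from the printed rules; P is 187 for every rule.
coverages = [(134, 0), (35, 0), (167, 0), (6, 0)]
total_examples = 187


def average(values):
    """Arithmetic mean that always returns a float."""
    values = list(values)
    return sum(values) / len(values)


conditions_per_rule = [len(text.split(" AND ")) for text in rule_texts]

stats_check = {
    "rules_count": len(rule_texts),
    "avg_conditions_count": average(conditions_per_rule),
    # Assumed definition: precision = p / (p + n).
    "avg_precision": average(p / (p + n) for p, n in coverages),
    # Assumed definition: coverage = p / P.
    "avg_coverage": round(average(p / total_examples for p, _ in coverages), 2),
    "total_conditions_count": sum(conditions_per_rule),
}
print(stats_check)
```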

Now let’s calculate metrics for each rule. To make the output easier to read, we will organize the metrics into a DataFrame.

[6]:
rule_metrics = ruleset.calculate_rules_metrics(X, y)
rule_metrics_df = pd.DataFrame([
    {
        'Rule': f"r{i+1}",
        'p': metrics['p'],
        'n': metrics['n'],
        'P': metrics['P'],
        'N': metrics['N'],
        'Unique': metrics['unique'],
        'Median Survival Time': metrics['median_survival_time'],
        'Median Survival Time CI Lower': metrics['median_survival_time_ci_lower'],
        'Median Survival Time CI Upper': metrics['median_survival_time_ci_upper'],
        'Events Count': metrics['events_count'],
        'Censored Count': metrics['censored_count'],
        'Log Rank': round(metrics['log_rank'], 6),
    }
    for i, metrics in enumerate(rule_metrics.values())
])
display(rule_metrics_df)
Rule p n P N Unique Median Survival Time Median Survival Time CI Lower Median Survival Time CI Upper Events Count Censored Count Log Rank
0 r1 134 0 187 0 4 inf inf inf 44 90 1.000000
1 r2 35 0 187 0 2 261.0 66.0 996.0 24 11 0.999369
2 r3 167 0 187 0 31 inf 1243.0 inf 68 99 1.000000
3 r4 6 0 187 0 5 41.0 15.0 202.0 6 0 1.000000
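
The median survival time of a rule equals the value in its conclusion (for example 261.0 for r2), and it is `inf` when the estimated survival curve never drops to 50%. The sketch below assumes the median is read off a Kaplan-Meier estimate, which is the usual approach for censored data; it is a plain-Python illustration, not the library's implementation, and the function name and toy inputs are hypothetical.

```python
import math
from fractions import Fraction
from typing import Sequence


def kaplan_meier_median(times: Sequence[float], events: Sequence[int]) -> float:
    """Return the first time at which the Kaplan-Meier survival estimate is <= 0.5.

    events[i] is 1 when the event was observed and 0 when the example is censored.
    Returns math.inf when the estimate never reaches 0.5.
    """
    survival = Fraction(1)  # exact arithmetic avoids rounding issues near 0.5
    at_risk = len(times)
    for t in sorted(set(times)):
        observed = sum(1 for time, event in zip(times, events) if time == t and event == 1)
        leaving = sum(1 for time in times if time == t)
        if observed:
            survival *= 1 - Fraction(observed, at_risk)
            if survival <= Fraction(1, 2):
                return t
        # Both observed events and censored examples leave the risk set.
        at_risk -= leaving
    return math.inf


# Toy data (hypothetical, not taken from the bone-marrow dataset).
median_all_events = kaplan_meier_median([2, 3, 5, 8, 13], [1, 1, 1, 1, 1])
median_mostly_censored = kaplan_meier_median([5, 6, 7, 8], [1, 0, 0, 0])
print(median_all_events, median_mostly_censored)
```

With heavy censoring the survival curve can stay above 0.5, which would explain medians reported as `inf` for rules such as r1 and r3.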

We can also calculate statistics such as condition importances.

[7]:
condition_importances = ruleset.calculate_condition_importances(X, y)
condition_importances
[7]:
[{'condition': 'recipient_age < 18.00',
  'attributes': ['recipient_age'],
  'importance': 0.4999998982831645},
 {'condition': 'recipient_body_mass < 69.00',
  'attributes': ['recipient_body_mass'],
  'importance': 0.49999961193118536},
 {'condition': 'recipient_age < 17.45',
  'attributes': ['recipient_age'],
  'importance': 0.3332428003216177},
 {'condition': 'relapse = {no}',
  'attributes': ['relapse'],
  'importance': 0.33297498018547933},
 {'condition': 'donor_age < 45.16',
  'attributes': ['donor_age'],
  'importance': 0.3325001557811621},
 {'condition': 'recipient_age >= 17.75',
  'attributes': ['recipient_age'],
  'importance': 0.2127866552730239},
 {'condition': 'donor_age >= 33.34',
  'attributes': ['donor_age'],
  'importance': 0.20513529109132178},
 {'condition': 'PLT_recovery >= 26.00',
  'attributes': ['PLT_recovery'],
  'importance': 0.19995047949703404},
 {'condition': 'CD34_x1e6_per_kg < 8.14',
  'attributes': ['CD34_x1e6_per_kg'],
  'importance': 0.1971795671241863},
 {'condition': 'recipient_rh = {plus}',
  'attributes': ['recipient_rh'],
  'importance': 0.1530846349015695},
 {'condition': 'recipient_age >= 3.30',
  'attributes': ['recipient_age'],
  'importance': 0.13088032818287365},
 {'condition': 'donor_age >= 27.02',
  'attributes': ['donor_age'],
  'importance': 0.06192325509108898},
 {'condition': 'gender_match = {other}',
  'attributes': ['gender_match'],
  'importance': 0.05737520104853856},
 {'condition': 'donor_age < 42.14',
  'attributes': ['donor_age'],
  'importance': 0.05514007434486943},
 {'condition': 'HLA_mismatch = {matched}',
  'attributes': ['HLA_mismatch'],
  'importance': 0.0512592007733366}]
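
Several conditions refer to the same attribute, so it can be useful to sum the importances per attribute. The helper below is a hypothetical post-processing step, not a decision-rules function; it relies only on the list-of-dicts structure shown above and is demonstrated on the first five entries of that output.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# First five entries of the output above (the remaining entries are omitted for brevity).
condition_importances_sample = [
    {'condition': 'recipient_age < 18.00', 'attributes': ['recipient_age'],
     'importance': 0.4999998982831645},
    {'condition': 'recipient_body_mass < 69.00', 'attributes': ['recipient_body_mass'],
     'importance': 0.49999961193118536},
    {'condition': 'recipient_age < 17.45', 'attributes': ['recipient_age'],
     'importance': 0.3332428003216177},
    {'condition': 'relapse = {no}', 'attributes': ['relapse'],
     'importance': 0.33297498018547933},
    {'condition': 'donor_age < 45.16', 'attributes': ['donor_age'],
     'importance': 0.3325001557811621},
]


def importance_by_attribute(importances: List[dict]) -> List[Tuple[str, float]]:
    """Sum condition importances per attribute, sorted from highest to lowest."""
    totals: Dict[str, float] = defaultdict(float)
    for entry in importances:
        for attribute in entry['attributes']:
            totals[attribute] += entry['importance']
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


attribute_ranking = importance_by_attribute(condition_importances_sample)
for attribute, total in attribute_ranking:
    print(f"{attribute}: {total:.4f}")
```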

Modify the ruleset

A decision-rules model can be edited easily. For example, we will create a new rule, “IF relapse = {yes} AND HLA_mismatch = {matched} THEN survival_status = 413”, and add it to the ruleset.

[8]:
from decision_rules.survival.rule import SurvivalConclusion
from decision_rules.survival.rule import SurvivalRule
from decision_rules.conditions import NominalCondition, CompoundCondition

rule = SurvivalRule(
    premise=CompoundCondition(
        subconditions=[
            NominalCondition(
                column_index=X.columns.get_loc('relapse'),
                value='yes'
            ),
            NominalCondition(
                column_index=X.columns.get_loc('HLA_mismatch'),
                value='matched'
            )
        ]
    ),
    conclusion=SurvivalConclusion(
        value=413,
        column_name='survival_status'
    ),
    column_names=X.columns.tolist(),
    survival_time_attr='survival_time'
)
print(rule)

IF relapse = {yes} AND HLA_mismatch = {matched} THEN survival_status = {413}
[9]:
rule.coverage = rule.calculate_coverage(X.to_numpy(), y.to_numpy())
print(rule.coverage)
(p=24, n=0, P=187, N=0)
[10]:
ruleset.rules.append(rule)

print("Updated Ruleset:")
for rule in ruleset.rules:
    print(rule)
Updated Ruleset:
IF recipient_age < 17.45 AND relapse = {no} AND donor_age < 45.16 THEN survival_status = {inf} (p=134, n=0, P=187, N=0)
IF HLA_mismatch = {matched} AND gender_match = {other} AND recipient_rh = {plus} AND recipient_age >= 3.30 AND donor_age < 42.14 AND donor_age >= 33.34 THEN survival_status = {261.0} (p=35, n=0, P=187, N=0)
IF recipient_age < 18.00 AND recipient_body_mass < 69.00 THEN survival_status = {inf} (p=167, n=0, P=187, N=0)
IF CD34_x1e6_per_kg < 8.14 AND donor_age >= 27.02 AND gender_match = {other} AND PLT_recovery >= 26.00 AND recipient_age >= 17.75 THEN survival_status = {41.0} (p=6, n=0, P=187, N=0)
IF relapse = {yes} AND HLA_mismatch = {matched} THEN survival_status = {413.0} (p=24, n=0, P=187, N=0)
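
In every coverage printed in this tutorial, n and N are 0 and P equals the number of rows (187). This suggests that, for survival rules, p is simply the number of examples satisfying the whole premise. The sketch below illustrates that counting on toy rows; the function name and the data are hypothetical, and it does not reproduce how the library evaluates conditions internally.

```python
from typing import Dict, List, Tuple


def count_covered(rows: List[Dict[str, str]], conditions: List[Tuple[str, str]]) -> int:
    """Count rows that satisfy every (attribute, expected_value) pair of a premise."""
    return sum(
        all(row[attribute] == expected for attribute, expected in conditions)
        for row in rows
    )


# Toy rows (hypothetical values, not taken from the bone-marrow dataset).
toy_rows = [
    {"relapse": "yes", "HLA_mismatch": "matched"},
    {"relapse": "yes", "HLA_mismatch": "mismatched"},
    {"relapse": "no", "HLA_mismatch": "matched"},
    {"relapse": "yes", "HLA_mismatch": "matched"},
]
premise = [("relapse", "yes"), ("HLA_mismatch", "matched")]

p = count_covered(toy_rows, premise)  # examples covered by the premise
P = len(toy_rows)                     # assumed: all examples count as positive
print(f"(p={p}, n=0, P={P}, N=0)")
```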

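Now let’s remove the condition “gender_match = {other}” from the rule “IF HLA_mismatch = {matched} AND gender_match = {other} AND recipient_rh = {plus} AND recipient_age >= 3.30 AND donor_age < 42.14 AND donor_age >= 33.34 THEN survival_status = {261.0}”. Because the premise changes, we recalculate the rule's coverage afterwards; in the output, the rule's conclusion and coverage change accordingly.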

[11]:
condition_to_remove = ruleset.rules[1].premise.subconditions[1]
ruleset.rules[1].premise.subconditions.remove(condition_to_remove)
# Recalculate coverage once, because the premise has changed
ruleset.rules[1].coverage = ruleset.rules[1].calculate_coverage(X.to_numpy(), y.to_numpy())

print("Updated Ruleset:")
for rule in ruleset.rules:
    print(rule)
Updated Ruleset:
IF recipient_age < 17.45 AND relapse = {no} AND donor_age < 45.16 THEN survival_status = {inf} (p=134, n=0, P=187, N=0)
IF HLA_mismatch = {matched} AND recipient_rh = {plus} AND recipient_age >= 3.30 AND donor_age < 42.14 AND donor_age >= 33.34 THEN survival_status = {403.0} (p=41, n=0, P=187, N=0)
IF recipient_age < 18.00 AND recipient_body_mass < 69.00 THEN survival_status = {inf} (p=167, n=0, P=187, N=0)
IF CD34_x1e6_per_kg < 8.14 AND donor_age >= 27.02 AND gender_match = {other} AND PLT_recovery >= 26.00 AND recipient_age >= 17.75 THEN survival_status = {41.0} (p=6, n=0, P=187, N=0)
IF relapse = {yes} AND HLA_mismatch = {matched} THEN survival_status = {413.0} (p=24, n=0, P=187, N=0)

We can also modify the value of a condition. In the rule “IF CD34_x1e6_per_kg < 8.14 AND donor_age >= 27.02 AND gender_match = {other} AND PLT_recovery >= 26.00 AND recipient_age >= 17.75 THEN survival_status = {41.0}”, we will change the condition “donor_age >= 27.02” to “donor_age > 25.0” by lowering the bound and making it open.

[12]:
ruleset.rules[3].premise.subconditions[1].left = 25.0
ruleset.rules[3].premise.subconditions[1].left_closed = False
ruleset.rules[3].coverage = ruleset.rules[3].calculate_coverage(X.to_numpy(), y.to_numpy())

print("Updated Ruleset:")
for rule in ruleset.rules:
    print(rule)
Updated Ruleset:
IF recipient_age < 17.45 AND relapse = {no} AND donor_age < 45.16 THEN survival_status = {inf} (p=134, n=0, P=187, N=0)
IF HLA_mismatch = {matched} AND recipient_rh = {plus} AND recipient_age >= 3.30 AND donor_age < 42.14 AND donor_age >= 33.34 THEN survival_status = {403.0} (p=41, n=0, P=187, N=0)
IF recipient_age < 18.00 AND recipient_body_mass < 69.00 THEN survival_status = {inf} (p=167, n=0, P=187, N=0)
IF CD34_x1e6_per_kg < 8.14 AND donor_age > 25.00 AND gender_match = {other} AND PLT_recovery >= 26.00 AND recipient_age >= 17.75 THEN survival_status = {53.0} (p=7, n=0, P=187, N=0)
IF relapse = {yes} AND HLA_mismatch = {matched} THEN survival_status = {413.0} (p=24, n=0, P=187, N=0)
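
The last edit changes two attributes of a numeric condition: `left` (the lower bound) and `left_closed` (whether the bound itself is included). The sketch below mimics these semantics with a small stand-alone class. `IntervalSketch` is hypothetical and is not the library's condition class; the `right` and `right_closed` fields are assumed by symmetry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IntervalSketch:
    """Hypothetical stand-in for a numeric condition with optional bounds."""
    attribute: str
    left: Optional[float] = None
    right: Optional[float] = None      # assumed by symmetry with `left`
    left_closed: bool = False
    right_closed: bool = False         # assumed by symmetry with `left_closed`

    def covers(self, value: float) -> bool:
        """Check whether a value lies inside the interval."""
        if self.left is not None:
            if value < self.left or (value == self.left and not self.left_closed):
                return False
        if self.right is not None:
            if value > self.right or (value == self.right and not self.right_closed):
                return False
        return True

    def __str__(self) -> str:
        parts = []
        if self.left is not None:
            parts.append(f"{self.attribute} {'>=' if self.left_closed else '>'} {self.left:.2f}")
        if self.right is not None:
            parts.append(f"{self.attribute} {'<=' if self.right_closed else '<'} {self.right:.2f}")
        return " AND ".join(parts)


condition = IntervalSketch("donor_age", left=27.02, left_closed=True)
before = str(condition)

# Same two assignments as in the tutorial cell: lower the bound and open it.
condition.left = 25.0
condition.left_closed = False
after = str(condition)

print(before, "->", after)
```

An open bound excludes the boundary value itself, so a donor aged exactly 25.0 is not covered after the change.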