Breast Cancer Classification with Machine Learning



This project builds an end-to-end machine learning pipeline to classify breast cancer diagnoses as benign or malignant using the Wisconsin Breast Cancer Dataset.

The focus is not only on predictive performance but also on model evaluation under medical constraints, where false negatives (missed cancers) are significantly more costly than false positives.

In [ ]:
%pip install pandas scikit-learn optuna matplotlib seaborn -q
In [231]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report, ConfusionMatrixDisplay

import optuna
from optuna.samplers import TPESampler
In [232]:
RANDOM_STATE = 42
TEST_SIZE = 0.2
CV_FOLDS = 5
 

Dataset Description

In [ ]:
# Load Breast Cancer data
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")
 
The dataset contains 569 samples with 30 numeric features, extracted from digitized images of breast tissue. The target variable is binary:
- 0 → Malignant
- 1 → Benign

Class distribution is moderately imbalanced (~63% benign / 37% malignant), motivating the use of stratified cross-validation and ROC-AUC–based evaluation.
In [234]:
print("Dataset shape:", X.shape)
print("Class distribution:\n", y.value_counts(normalize=True))
Dataset shape: (569, 30)
Class distribution:
 target
1    0.627417
0    0.372583
Name: proportion, dtype: float64
 
The dataset contains 30 features extracted from tumor cell nuclei. For each property (e.g., radius, texture, perimeter), three statistics are recorded:
- mean: average value across all cells
- error: standard error, measuring variability
- worst: largest observed value, often most indicative of malignancy
In [235]:
# Visualization of radius feature
features = ['mean radius', 'radius error', 'worst radius']

plt.figure(figsize=(6,5))
sns.boxplot(data=X[features])
plt.title("Comparison of Mean, Error, and Worst Radius")
plt.ylabel("Value")
plt.show()
Output: boxplot comparing mean, error, and worst radius values.
 

Data Preprocessing

 
A stratified train/test split ensures class proportions remain consistent.
Model selection and tuning are performed only on the training set using stratified k-fold cross-validation to prevent data leakage.
In [236]:
# Split data into training and testing sets with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=TEST_SIZE,
    stratify=y,
    random_state=RANDOM_STATE
)

cv = StratifiedKFold(n_splits=CV_FOLDS, shuffle=True, random_state=RANDOM_STATE)
 
All preprocessing steps are encapsulated in a scikit-learn pipeline to ensure:
- No data leakage
- Reproducibility
- Consistent preprocessing across cross-validation folds
In [237]:
# Apply data transformations using a sklearn Pipeline
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, X.columns)
])
 

Model Training and Evaluation


We evaluate several models using stratified cross-validation:
In [238]:
# Evaluate multiple models using cross-validation
models = {
    "Logistic Regression": LogisticRegression(max_iter=2000),
    "SVC": SVC(probability=True),
    "Random Forest": RandomForestClassifier(random_state=RANDOM_STATE),
    "Gradient Boosting": GradientBoostingClassifier(random_state=RANDOM_STATE)
}

results = []

for name, clf in models.items():
    pipe = Pipeline([
        ("prep", preprocessor),
        ("clf", clf)
    ])
    roc_auc = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="roc_auc").mean()
    recall = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="recall").mean()
    results.append([name, roc_auc, recall])

pd.DataFrame(results, columns=["Model", "ROC-AUC", "Recall"]).sort_values("ROC-AUC", ascending=False)
Out[238]:
Model ROC-AUC Recall
0 Logistic Regression 0.995872 0.985965
1 SVC 0.995562 0.978947
3 Gradient Boosting 0.991847 0.961404
2 Random Forest 0.989577 0.964912
 
Logistic Regression and SVC achieved the strongest cross-validated ROC-AUC.

We use ROC-AUC because it summarizes, across all decision thresholds, the trade-off between missing malignant cancers (false negatives) and triggering unnecessary biopsies (false positives).

Logistic Regression was selected for further optimization due to its:
- Comparable performance
- Better interpretability
- Well-calibrated probability outputs
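The calibration claim can be checked directly with scikit-learn's `calibration_curve`, which compares predicted probabilities against observed frequencies. The sketch below is an illustrative addition, not part of the original notebook: it re-fits a plain scaled logistic regression rather than reusing the tuned pipeline.

```python
# Sketch: sanity-check probability calibration of a scaled logistic
# regression on held-out data (illustrative stand-in for the notebook's
# pipeline; `frac_pos` / `mean_pred` are our names, not the notebook's).
from sklearn.calibration import calibration_curve
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)).fit(X_tr, y_tr)
prob = clf.predict_proba(X_te)[:, 1]

# Bin predictions and compare mean predicted probability to the
# observed fraction of positives in each bin.
frac_pos, mean_pred = calibration_curve(y_te, prob, n_bins=5)
for fp_, mp_ in zip(frac_pos, mean_pred):
    print(f"predicted {mp_:.2f} -> observed {fp_:.2f}")
```

A well-calibrated model keeps the two columns close; large gaps would argue for calibration (e.g. `CalibratedClassifierCV`) before interpreting probabilities clinically.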

 

Hyperparameter Optimization


Bayesian optimization via Optuna was used to tune the C parameter efficiently:
In [239]:
# Function to tell Optuna what to optimize
def objective(trial):
    C = trial.suggest_float("C", 0.001, 10.0, log=True)
    model = LogisticRegression(C=C, max_iter=3000)

    pipe = Pipeline([
        ("prep", preprocessor),
        ("clf", model)
    ])

    return cross_val_score(pipe, X_train, y_train, cv=cv, scoring="roc_auc").mean()

study = optuna.create_study(
    direction="maximize",
    sampler=TPESampler(seed=RANDOM_STATE),
    study_name="Logistic_Regression_Optimization"
)
study.optimize(objective, n_trials=30)

study.best_params
[I 2026-01-07 16:05:33,933] A new study created in memory with name: Logistic_Regression_Optimization
[I 2026-01-07 16:05:34,083] Trial 0 finished with value: 0.9941176470588236 and parameters: {'C': 0.03148911647956861}. Best is trial 0 with value: 0.9941176470588236.
...
[I 2026-01-07 16:05:37,849] Trial 23 finished with value: 0.9958720330237357 and parameters: {'C': 0.9914156315561331}. Best is trial 23 with value: 0.9958720330237357.
...
[I 2026-01-07 16:05:39,065] Trial 29 finished with value: 0.9957688338493291 and parameters: {'C': 1.0366565726204244}. Best is trial 23 with value: 0.9958720330237357.
Out[239]:
{'C': 0.9914156315561331}
In [240]:
# Fitting the model with the best C parameter from the study
best_model = Pipeline([
    ("prep", preprocessor),
    ("clf", LogisticRegression(**study.best_params, max_iter=3000))
])

best_model.fit(X_train, y_train)

y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]
In [241]:
print("Classification Report:\n", classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_prob))

# Plot the confusion matrix (from_estimator draws the figure; printing
# the returned display object would only show its repr)
ConfusionMatrixDisplay.from_estimator(best_model, X_test, y_test)
plt.show()
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.98      0.98        42
           1       0.99      0.99      0.99        72

    accuracy                           0.98       114
   macro avg       0.98      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114

ROC-AUC: 0.9953703703703703
Output: confusion matrix plot for the tuned model on the test set.
 
The linear model identifies the most influential features for prediction: the largest-magnitude coefficients belong to worst texture, radius error, and worst concave points. These features contributed most strongly to the model's output.
In [ ]:
# Extract coefficients and sort by absolute value
coef = pd.Series(
    best_model.named_steps["clf"].coef_[0],
    index=X.columns
).sort_values(key=abs, ascending=False)

# Take top 10 features
top_features = coef.head(10)

# Plot
plt.figure(figsize=(8,3))
colors = ['green' if c > 0 else 'red' for c in top_features]  # positive = green, negative = red
plt.barh(top_features.index[::-1], top_features[::-1], color=colors)
plt.xlabel("Coefficient value")
plt.title("Top 10 influential features according to logistic regression coefficients")
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.show()
Output: horizontal bar chart of the top 10 coefficients (green positive, red negative).
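Coefficient magnitudes can be cross-checked with a model-agnostic view. The sketch below is an illustrative addition (it re-fits a plain scaled logistic regression as a stand-in for the tuned pipeline) using scikit-learn's `permutation_importance`, which measures how much shuffling each feature degrades held-out ROC-AUC:

```python
# Sketch: model-agnostic feature importance via permutation on held-out
# data (illustrative stand-in for the notebook's best_model).
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)).fit(X_tr, y_tr)

# Shuffle each feature 10 times and record the mean drop in ROC-AUC
result = permutation_importance(
    model, X_te, y_te, n_repeats=10, random_state=42, scoring="roc_auc"
)

top = pd.Series(result.importances_mean, index=X.columns)
print(top.sort_values(ascending=False).head(5))
```

Agreement between the two rankings strengthens the interpretability argument; disagreement would suggest correlated features sharing coefficient weight.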
 
Threshold analysis demonstrates how the same model can be adapted to:
- High-sensitivity screening contexts
- Balanced diagnostic confirmation

We can adjust the decision threshold to trade recall against specificity. Note that because the positive class (1) is benign, "recall" below measures benign detection; sensitivity to malignancy corresponds to the specificity column:
In [243]:
for threshold in [0.35, 0.5, 0.65]:
    preds = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    print(f"\nThreshold {threshold}")
    print(f"Recall: {tp/(tp+fn):.3f}, Specificity: {tn/(tn+fp):.3f}")
Threshold 0.35
Recall: 1.000, Specificity: 0.952

Threshold 0.5
Recall: 0.986, Specificity: 0.976

Threshold 0.65
Recall: 0.931, Specificity: 0.976
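Rather than scanning a fixed grid of thresholds, a target sensitivity for the malignant class can be converted into a threshold via the ROC curve. A minimal sketch, assuming a freshly fitted scaled logistic regression in place of the tuned pipeline, treating class 0 (malignant) as the positive class, and with `target_sens = 0.99` as an illustrative choice:

```python
# Sketch: find the smallest probability threshold that reaches a target
# sensitivity for malignancy. Since class 1 is benign, we score
# p(malignant) = 1 - p(benign).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)).fit(X_tr, y_tr)

p_malig = 1 - clf.predict_proba(X_te)[:, 1]
# ROC with malignant (y == 0) as the positive class
fpr, tpr, thresholds = roc_curve(y_te == 0, p_malig)

target_sens = 0.99
idx = np.argmax(tpr >= target_sens)  # first index reaching the target
print(f"threshold on p(malignant): {thresholds[idx]:.3f}, sensitivity: {tpr[idx]:.3f}")
```

In practice the threshold should be chosen on a validation fold, not the test set, to avoid an optimistic estimate.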
 

Conclusions


- Built a production-style ML pipeline with robust preprocessing and cross-validation.
- Achieved ROC-AUC ≈ 0.99 with high sensitivity on held-out data.
- Demonstrated how medical risk considerations influence model evaluation and threshold selection.
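A natural next step for a production-style pipeline is persisting the fitted estimator. A minimal sketch using `joblib` (the filename is an illustrative choice, and the pipeline here is a simplified stand-in for the tuned model):

```python
# Sketch: persist a fitted pipeline so preprocessing and model travel
# together, then reload it and confirm predictions are unchanged.
import os
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)).fit(X, y)

path = "breast_cancer_model.joblib"  # hypothetical filename
joblib.dump(pipe, path)
restored = joblib.load(path)

# The restored pipeline must reproduce the original's predictions
assert (restored.predict(X[:5]) == pipe.predict(X[:5])).all()
os.remove(path)  # clean up the demo artifact
```

Because the scaler is inside the pipeline, the saved artifact applies exactly the training-time preprocessing at inference, avoiding train/serve skew.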