AtrialFibrillation-detection

Data Witches Project 2 - Complete Documentation

Project Overview

This project implements an Atrial Fibrillation classification system using ECG (Electrocardiogram) data and machine learning techniques. The project analyzes heart rate variability (HRV) features extracted from ECG signals to distinguish between Normal Sinus Rhythm and Atrial Fibrillation.

Team Members

Name	Student ID
Claessen, VVHJAE	i6339543
Ovsiannikova, AM	i6365923
Pubben, J	i6276134
Roca Cugat, M	i6351071
Záboj, J	i6337952

Dataset Information

Primary Dataset

Source: PhysioNet 2017 Training Data
File: data/Physionet2017Training.tar.xz
Format: CSV files containing ECG signals and labels
Classes:
- Normal Sinus Rhythm (N) → encoded as 0
- Atrial Fibrillation (A) → encoded as 1

Data Files

Physionet2017TrainingData.csv: Raw ECG signal data (converted to mV by multiplying by 1000)
Physionet2017TrainingLabels.csv: Classification labels
train.csv: Secondary dataset for testing ML pipeline
hrv_train.csv: Preprocessed HRV features for training set (72 features, 3,619 samples)
hrv_test.csv: Preprocessed HRV features for test set (72 features, 905 samples)

Data Split

Training set: 80%
Test set: 20%
Stratification: Based on class labels to maintain balance
Random state: 3003

Package Dependencies

Core Libraries

kaggle              # Dataset downloading
plotly              # Interactive visualizations
matplotlib          # Static plotting
numpy               # Numerical computations
seaborn             # Statistical visualizations
pandas              # Data manipulation
colorama            # Terminal text coloring
scikit-learn        # Machine learning algorithms
shap                # Model interpretability
statsmodels         # Statistical modeling
neurokit2           # ECG signal processing and HRV feature extraction
tqdm                # Progress bars for long-running operations
sympy               # Symbolic mathematics (dependency for other libraries)

Specific Imports

Data Processing

pandas - DataFrame operations
numpy - Array operations
scipy.stats - Statistical functions
scipy.signal.welch - Frequency analysis

Visualization

matplotlib.pyplot - Basic plotting
seaborn - Statistical plots
plotly.graph_objects - Interactive plots
plotly.subplots - Multi-panel plots

Machine Learning

sklearn.model_selection - train_test_split, cross_val_score, GridSearchCV, RepeatedStratifiedKFold
sklearn.linear_model.LogisticRegression - Logistic regression classifier
sklearn.ensemble.RandomForestClassifier - Random forest classifier
sklearn.ensemble.GradientBoostingClassifier - Gradient boosting classifier
sklearn.ensemble.AdaBoostClassifier - AdaBoost boosting classifier
sklearn.ensemble.VotingClassifier - Ensemble voting classifier
sklearn.neighbors.KNeighborsClassifier - K-nearest neighbors classifier
sklearn.neighbors.RadiusNeighborsClassifier - Radius neighbors classifier
sklearn.neighbors.NearestCentroid - Nearest centroid classifier
sklearn.neural_network.MLPClassifier - Multi-layer perceptron neural network
sklearn.preprocessing - MinMaxScaler, StandardScaler, RobustScaler
sklearn.impute.SimpleImputer - Missing value imputation
sklearn.metrics - classification_report, accuracy_score, confusion_matrix, ConfusionMatrixDisplay, f1_score, precision_score, recall_score, roc_auc_score, roc_curve, RocCurveDisplay, precision_recall_curve

Main Notebook Structure (MAI3003_DataWitches_Assignment02.ipynb)

The main notebook consists of 193 cells organized into the following major sections:

1. Introduction and Setup

Cell 0: Colab badge link
Cell 1: Project title
Cell 2: Team member information
Cell 3: Logbook for version tracking
Cell 5: Methods overview
Cell 6: Preface
Cell 7: Package imports
Cell 10: Options settings (pandas display, warnings)
Cell 12: Dataset download instructions

2. Data Preprocessing (Cell 15-19)

Cell 15: Dataset loading and preprocessing overview
Cell 14: Load ECG data from CSV
Cell 16: Load labels and encode classifications
Cell 18: Dataset splitting section
Cell 19: Train-test split implementation

3. Function Definitions (Cell 20-28)

Cell 20: Function definitions overview
Cell 21-22: Correlation plot function
Cell 23-24: Distribution plot functions
Cell 25-26: Boxplot function
Cell 27-28: Missingness check function

4. Exploratory Data Analysis (Cell 29-69)

Cell 29: EDA overview and dataset characteristics
Cell 34: ECG signal exploration
Cell 36: ECG frequency analysis
Cell 40: Synthetic data decision
Cell 42: Powerline interference analysis
Cell 47: Baseline wander analysis
Cell 51: Muscle noise analysis
Cell 54: ECG feature engineering
Cell 56: Feature extraction using NeuroKit
Cell 58: R-peaks detection
Cell 60: Time-domain features
Cell 63: Full HRV feature extraction (TRAIN)
Cell 66: Full HRV feature extraction (TEST)
Cell 68: Feature exploration

5. Visualizations (Cell 70-79)

Cell 70: Visualizations overview
Cell 71: Correlation plot
Cell 73: Distribution plots
Cell 76: Boxplots
Cell 78: Missingness visualization

6. Outlier Detection and Handling (Cell 80-95)

Cell 80: Outlier detection overview
Cell 81: Outlier identification function
Cell 85: Outliers in TEST set
Cell 88: Distribution comparison (TRAIN + TEST)
Cell 91: Outlier handling for TRAIN
Cell 92: Winsorising outliers
Cell 94: Outlier handling comparison

7. Final Preprocessing (Cell 96-109)

Cell 96: Building ML matrices overview
Cell 97: Feature selection and target separation
Cell 99: Imputation section
Cell 100: Missing value imputation implementation
Cell 101: Scaling section
Cell 102: Robust scaling implementation
Cell 103: Sanity checks
Cell 104-105: Verification of preprocessing
Cell 106: Final ML datasets overview
Cell 109: Final dataset confirmation

8. Machine Learning Training Setup (Cell 110-116)

Cell 110: ML setup overview
Cell 111: Safety dataset for testing
Cell 112: Safety check section
Cell 113: Assert statements for data integrity
Cell 114: Comparison framework
Cell 115: Model results function
Cell 116: Dataset size confirmation

9. Machine Learning Training (Cells 117-178)

Model implementations are organized by algorithm type:

Logistic Regression (Cells 118-119):

Baseline Logistic Regression with all features
Top 10 Features variant (selected by feature importance)
Correlation-based feature selection (> 0.8)
Augmented features with all parent features
Augmented features with dropped parent features

Random Forest (Cells 121-135):

14 variants with varying max_depth (1-14)

K-Nearest Neighbors (Cells 136-150):

15 variants with hyperparameter tuning (n_neighbors 1-15)

Radius Neighbors (Cells 151-158):

8 variants with various radius values (0.5-10.0)

Other Classifiers (Cells 159-173):

Nearest Centroid Classifier (Cell 159)
Multi-Layer Perceptron Neural Network (Cell 160)
Gradient Boosting Classifier baseline (Cell 161)
AdaBoost with varying n_estimators 50-10000 (Cells 162-171)
Voting Classifier ensemble (Cell 172)
Gradient Boosting with tuned hyperparameters (Cell 173)

10. Results (Cell 179-193)

Cell 180-190: Results comparison and visualization
Cell 191-193: Final analysis and conclusions

Challenge Notebook Structure (DataWitches_Challenge.ipynb)

The challenge notebook consists of 141 cells (58 markdown, 83 code cells) organized into the following sections:

1. Introduction and Setup (Cells 0-12)

Cell 0: Project title (“Data Witches”)
Cell 2: Logbook
Cell 3: Methods overview with variable and function naming conventions
Cell 4: Preface
Cell 5: Package imports
Cell 7: Options settings
Cell 9: Dataset download instructions

2. Data Preprocessing (Cells 13-21)

Cell 13: Data preprocessing overview and ECG signal extraction
Cell 16: Dataset splitting
Cell 18: Exploratory Data Analysis and dataset characteristics

3. Feature Extraction (Cells 22-43)

Cell 22: Feature extraction overview
Cell 23: ECG feature engineering
Cell 25: R-peaks detection
Cell 27: Time-domain features
Cell 30: Full HRV feature extraction for TRAIN set
Cell 38: Full HRV feature extraction for TEST set
Cell 42: Feature exploration

4. Preprocessing (Cells 44-77)

Cell 44: Preprocessing overview
Cell 45: Missingness analysis
Cell 48: Outlier detection (IQR method)
Cell 53: Outliers in TEST set
Cell 56: Distribution sanity check (TRAIN vs TEST)
Cell 59-62: Outlier handling (Winsorisation)
Cell 64: Final preprocessing (building ML matrices)
Cell 67: Imputation
Cell 69: Normalisation
Cell 71: Scaling
Cell 73: Sanity checks
Cell 75: Final ML datasets (X_train, X_test, y_train, y_test)

5. Machine Learning Training Setup (Cells 78-83)

Cell 78: ML training setup overview
Cell 79: Safety check with assertions
Cell 81: Comparison framework and model results function

6. Machine Learning Training (Cells 84-130)

Model implementations are organized by algorithm type:

Logistic Regression (Cells 85-100):

Baseline Logistic Regression with all features
Feature selection variants
Correlation-based feature removal (>0.8)
Feature engineering with normalisation
“Use Everything” model (augmented features with parents)
“Drop Parents” model (augmented features only)

Random Forest (Cells 101-104):

Random Forest with various configurations

Neighbors Classifiers (Cells 105-110):

K-Nearest Neighbors (Cell 105-106)
Radius Neighbors Classifier (Cell 107-108)
Nearest Centroid Classifier (Cell 109-110)

Neural Network (Cells 111-115):

Multi-Layer Perceptron Classifier

Ensemble Methods (Cells 116-130):

Gradient Boosting Classifier (Cell 116-118)
AdaBoost Classifier (Cell 119-120)
Soft Voting Classifier (Cell 121-123): Combines LR, RF (n_estimators=300), KNN (k=7), and AdaBoost (n_estimators=1000)
Gradient-boosted trees variants (Cell 124-126)
Experimental voting classifiers (Cell 127-128)
Best Voting Classifier Search (Cell 129-131): Automated search through all combinations of classifiers to find optimal ensemble

7. Model Evaluation (Cells 132-140)

Cell 132: Quick conclusion section
Cell 133-135: Results table analysis, duplicate removal, top 5 models display
Cell 136-138: Graphs of numerical metrics (logarithmic and linear scale bar charts)
Cell 139-140: ROC Curves comparison for all models

Key Features of Challenge Notebook

Automated Ensemble Search: Systematically tests all 2-4 classifier combinations from a pool of 10 base classifiers
Comprehensive Classifier Sweep: Tests 15+ different classifier types including:
- Logistic Regression (balanced and default)
- Ridge Classifier
- Decision Tree
- Random Forest
- Extra Trees
- Gradient Boosting
- AdaBoost
- Histogram Gradient Boosting
- SVC (with probability)
- Gaussian Naive Bayes
- Bernoulli Naive Bayes
- K-Nearest Neighbors
- Linear Discriminant Analysis
- Quadratic Discriminant Analysis
- Multi-Layer Perceptron
Visual Model Comparison: ROC curves and metric bar charts for all tested models
Consistent Methodology: Uses same HRV features, preprocessing, and evaluation metrics as main notebook

Custom Functions

1. `corr_plot_hrv(df, cols=None)`

Purpose: Creates a correlation heatmap for HRV (Heart Rate Variability) features

Parameters:

df (DataFrame): Input dataframe containing HRV features
cols (list, optional): Specific columns to plot. If None, uses all numeric columns

Functionality:

Selects numeric columns from the dataframe
Creates a 12x8 figure
Generates a heatmap using seaborn with ‘coolwarm’ colormap
Centers the color scale at 0
Displays correlation matrix between all features

Usage: Identifying relationships between different HRV features to detect multicollinearity

2. `distplots_hrv(df, cols=None)`

Purpose: Creates individual distribution plots (histogram + KDE) for each HRV feature

Parameters:

df (DataFrame): Input dataframe containing HRV features
cols (list, optional): Specific columns to plot. If None, uses all numeric columns

Functionality:

Iterates through each column
Creates a 6x4 figure for each feature
Plots histogram with KDE overlay using seaborn
Shows distribution shape and density

Usage: Understanding the distribution characteristics of individual features

3. `distplots(df)`

Purpose: Creates a comprehensive multi-panel distribution plot for all numeric features

Parameters:

df (DataFrame): Input dataframe containing features

Functionality:

Selects all numeric columns
Calculates optimal subplot grid (rows × columns)
Creates histogram for each feature
Overlays Gaussian KDE density estimate
Automatically handles subplot layout
Removes empty subplots if needed

Usage: Getting a quick overview of all feature distributions simultaneously

4. `boxplots_hrv(df, cols)`

Purpose: Creates boxplots to detect outliers and unusual values in HRV features

Parameters:

df (DataFrame): Input dataframe
cols (list): List of column names to plot

Functionality:

Creates individual 6x4 boxplot for each specified column
Uses seaborn for visualization
Shows median, quartiles, and outliers

Usage: Visual identification of outliers and understanding feature spread

5. `check_missing_hrv(df)`

Purpose: Summarizes and displays missing value information across all HRV features

Parameters:

df (DataFrame): Input dataframe to check

Returns:

DataFrame with columns: ‘feature’, ‘missing_n’, ‘missing_%’

Functionality:

Counts missing values per column
Calculates percentage of missing values
Sorts results by missing percentage (descending)
Displays formatted output

Usage: Assessing data quality and planning imputation strategy

6. `identify_outliers(df, column_name, threshold=1.5)`

Purpose: Identifies outliers using the IQR (Interquartile Range) method

Parameters:

df (DataFrame): Input dataframe
column_name (str): Name of the column to analyze
threshold (float, default=1.5): IQR multiplier for outlier bounds

Returns:

Tuple: (row_indices, outlier_values, lower_bound, upper_bound)

Functionality:

Calculates Q1 (25th percentile), Q3 (75th percentile), and IQR
Defines outlier bounds: [Q1 - threshold×IQR, Q3 + threshold×IQR]
Identifies rows with values outside these bounds
Returns indices and values of outliers

Usage: Detecting anomalous data points that may affect model performance

7. `modelResults(model, accuracy, f1, precision, recall, roc_auc, roc_cur, cm)`

Purpose: Records and stores model evaluation metrics

Parameters:

model (str): Model name/identifier
accuracy (float): Accuracy score
f1 (float): F1 score
precision (float): Precision score
recall (float): Recall score
roc_auc (float): ROC AUC score
roc_cur: ROC curve data
cm: Confusion matrix

Functionality:

Prints all metrics to console
Appends results to global resultsTable DataFrame
Saves updated results to ‘trainingResults.csv’

Usage: Tracking and comparing performance across different models

Key Variables and Datasets

Raw Data Variables

df: Raw ECG signal data loaded from CSV (converted to mV)
df_labels: Classification labels with both text and numeric encoding
df_labeled: Merged dataset with labels and ECG data

Train/Test Split Variables

train_idx: Indices for training samples
test_idx: Indices for test samples

HRV Feature Variables

hrv_train_with_labels: HRV features extracted from training ECG signals with labels
hrv_test_clean: HRV features extracted from test ECG signals

Feature Matrices

feature_cols: List of feature column names (HRV features excluding target)
x_train: Training features (raw)
y_train: Training labels (binary: 0=Normal, 1=Atrial Fibrillation)
x_test: Test features (raw)
y_test: Test labels

Preprocessed Matrices

X_train_imputed: Training features after median imputation
X_test_imputed: Test features after imputation
X_train_scaled: Training features after RobustScaler normalization
X_test_scaled: Test features after scaling
Final x_train, x_test: Scaled and preprocessed features ready for ML

Model Variables

model_LR: Logistic Regression model variants
- Parameters: multi_class='auto', max_iter=1000, class_weight='balanced'
- Variants: Baseline, Top 10 Features, Correlation-based, Augmented features
model_RF: Random Forest Classifier variations
- Parameters tested: max_depth in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], random_state=3003
- Best performing: max_depth=9 (F1: 0.8216)
model_KNN: K-Nearest Neighbors Classifier
- Parameters: n_neighbors varied from 1-15 for optimization
model_RNC: Radius Neighbors Classifier
- Parameters: outlier_label='most_frequent', radius varied (0.5-10.0)
model_NC: Nearest Centroid Classifier
- Basic centroid-based classifier
model_MLP: Multi-Layer Perceptron Classifier
- Parameters: hidden_layer_sizes=(100, 50), max_iter=1000, random_state=3003
model_GB: Gradient Boosting Classifier
- Parameters: random_state=3003, with tuned variant learning_rate=0.05, max_depth=4, n_estimators=200, subsample=0.8
model_ADA: AdaBoost Classifier
- Parameters: n_estimators varied from 50 to 10000, random_state=3003
- Best performing: n_estimators=1000 (F1: 0.8125)
model_VOTE: Voting Classifier (Ensemble)
- Combines Logistic Regression, Random Forest, KNN, and AdaBoost
- Parameters: voting='soft', n_jobs=-1
- Best overall performance (F1: 0.8627, Accuracy: 0.9685)

Evaluation Variables

y_pred: Predicted labels
accuracy: Accuracy score
classification_rep: Detailed classification report
cm: Confusion matrix
resultsTable: DataFrame storing all model results

Preprocessing Objects

imputer: SimpleImputer with median strategy
scaler: RobustScaler for feature normalization

Machine Learning Workflow

1. Data Loading and Preprocessing

Load ECG signals from CSV (Cell 14)
Load and encode labels (Cell 16)
Stratified train-test split (80/20) with random_state=3003 (Cell 19)

2. Feature Engineering

ECG signal analysis (frequency domain, time domain)
HRV feature extraction using NeuroKit
Extract R-peaks and time-domain features
Generate comprehensive HRV feature set for both train and test

3. Data Quality Assessment

Correlation analysis
Distribution analysis
Outlier detection using IQR method (threshold=1.5)
Missing value analysis

4. Data Cleaning

Outlier handling using Winsorisation
Missing value imputation using median strategy
Feature scaling using RobustScaler

5. Model Training

Safety checks (no NaNs, aligned shapes)
Logistic Regression with balanced class weights (5 variants with feature engineering)
Random Forest with max_depth ranging from 1-14 (14 variants)
K-Nearest Neighbors with varying n_neighbors (1-15) (15 variants)
Radius Neighbors Classifier with varying radius (0.5-10.0) (8 variants)
Nearest Centroid Classifier
Multi-Layer Perceptron Neural Network with hidden layers (100, 50)
Gradient Boosting Classifier (2 variants)
AdaBoost Classifier with varying n_estimators (50-10000) (10 variants)
Voting Classifier (ensemble of best models)

6. Model Evaluation

Accuracy score
F1 score
Precision and Recall
ROC AUC score
Confusion matrix
Classification report

7. Results Storage

All results saved to trainingResults.csv (57 unique models, 58 records)
Results table maintained in memory for comparison
HRV features saved separately to hrv_train.csv and hrv_test.csv for reuse

Data Preprocessing Pipeline

Imputation Strategy

Method: Median imputation
Rationale: Robust to outliers, suitable for skewed distributions
Implementation: SimpleImputer(strategy="median")
Process: Fit on training data, transform both train and test

Scaling Strategy

Method: RobustScaler
Rationale: Uses median and IQR, less sensitive to outliers than StandardScaler
Implementation: RobustScaler()
Process: Fit on training data, transform both train and test
Result: Features centered around median with IQR-based scaling

Quality Checks

Assert no NaN values in final datasets
Assert alignment between X and y for train and test
Verify feature column consistency
Check distribution sanity between train and test

Feature Engineering Details

ECG Signal Processing

Frequency Analysis: Using Welch method for power spectral density
Noise Analysis:
- Powerline interference (50/60 Hz)
- Baseline wander (low frequency drift)
- Muscle noise (high frequency artifacts)

HRV Features Extracted

Using NeuroKit library for comprehensive HRV analysis:

R-peaks Detection: Identification of QRS complex peaks
Time-domain Features:
- Heart rate statistics
- RR interval statistics
- SDNN (Standard Deviation of NN intervals)
- RMSSD (Root Mean Square of Successive Differences)
- pNN50 (Percentage of successive intervals differing by >50ms)
Additional HRV metrics: As extracted by NeuroKit

Model Evaluation Metrics

Accuracy

Proportion of correct predictions
Used for overall model performance

F1 Score

Harmonic mean of precision and recall
Particularly important for imbalanced datasets

Precision

True Positives / (True Positives + False Positives)
Measures accuracy of positive predictions

Recall (Sensitivity)

True Positives / (True Positives + False Negatives)
Measures ability to find all positive cases

ROC AUC

Area Under the Receiver Operating Characteristic curve
Measures model’s ability to distinguish between classes
Range: 0.5 (random) to 1.0 (perfect)

Confusion Matrix

Shows true positives, true negatives, false positives, false negatives
Normalized to show proportions

File Structure

AtrialFibrillation-detection/
├── .git/                           # Git repository
├── .gitattributes                  # Git attributes
├── .gitignore                      # Git ignore rules
├── MAI3003_DataWitches_Assignment02.ipynb  # Main notebook (193 cells)
├── MAI3003_DataWitches_Assignment02.pdf    # PDF export of main notebook
├── DataWitches_Challenge.ipynb     # Challenge notebook (141 cells) - Extended ML experiments
├── Results_analysis.ipynb          # Utility notebook for results analysis (8 cells)
├── README.md                       # Basic project description
├── DOCUMENTATION.md                # This file - complete documentation
├── requirements.txt                # Python package dependencies
├── download_dataset.sh             # Kaggle dataset download script
├── trainingResults.csv             # Stored model evaluation results (57 models)
├── hrv_train.csv                   # Preprocessed HRV features for training (72 features)
├── hrv_test.csv                    # Preprocessed HRV features for testing (72 features)
├── pyvenv.cfg                      # Python virtual environment config
├── data/                           # Data directory
│   └── Physionet2017Training.tar.xz  # ECG dataset archive
└── share/                          # Shared resources
    ├── jupyter/                    # Jupyter kernel configurations
    └── man/                        # Manual pages

Usage Instructions

Setup

Install dependencies: pip install -r requirements.txt
Download dataset using download_dataset.sh (requires Kaggle credentials)
Extract data to data/ directory

Running the Main Notebook

Open MAI3003_DataWitches_Assignment02.ipynb in Jupyter or Google Colab
Run cells sequentially from top to bottom
All functions are defined before use
Results are automatically saved to trainingResults.csv
Preprocessed HRV features are saved to hrv_train.csv and hrv_test.csv for reuse

Using Preprocessed HRV Features

The repository includes pre-computed HRV features to skip the computationally expensive feature extraction:

hrv_train.csv: 72 HRV features extracted from training ECG signals (3,619 samples)
hrv_test.csv: 72 HRV features extracted from test ECG signals (905 samples)
These can be loaded directly to start from the modeling phase

Challenge Notebook (DataWitches_Challenge.ipynb)

The challenge notebook (DataWitches_Challenge.ipynb) provides extended machine learning experiments and model comparisons:

Purpose:

Comprehensive exploration of different ML algorithms for Atrial Fibrillation classification
Automated search for optimal ensemble voting classifiers
Visual comparison of model performance through ROC curves and metric bar charts

Structure (141 cells):

58 markdown cells: Documentation and section headers
83 code cells: Feature extraction, model training, and evaluation

Key Features:

Soft Voting Classifier: Ensemble combining Logistic Regression, Random Forest (300 estimators), KNN (k=7), and AdaBoost (1000 estimators)
Best Voting Classifier Search: Automated testing of all 2-4 classifier combinations from a pool of 10 base models
Comprehensive Classifier Sweep: Tests 15+ different classifier types including Ridge, Decision Tree, Extra Trees, SVC, Naive Bayes variants, LDA, QDA, and more
Visual Analysis: ROC curves and metric bar charts (logarithmic and linear scales)

Usage:

Open DataWitches_Challenge.ipynb in Jupyter or Google Colab
Run cells sequentially; HRV features will be extracted (or can be loaded from CSV)
Experiment with different classifier combinations in the voting classifier search section
Compare results using the visualization cells at the end

Results Analysis Notebook (Results_analysis.ipynb)

Results_analysis.ipynb: Utility notebook for analyzing stored model results (8 code cells)
Loads and analyzes trainingResults.csv for quick result comparisons
Useful for post-hoc analysis without re-running models

Modifying the Pipeline

To add new models: Add implementation in Section 9 (Cell 117-178)
To modify preprocessing: Edit cells in Section 7 (Cell 96-109)
To add new features: Modify feature engineering in Section 4 (Cell 54-68)
To change evaluation metrics: Update modelResults() function
To experiment with different ensemble combinations: Modify VotingClassifier in Cell 161

Notes and Conventions

Random State

Consistent random_state=3003 used throughout for reproducibility
Applied to train_test_split, RandomForestClassifier, etc.

Class Encoding

0 = Normal Sinus Rhythm (N)
1 = Atrial Fibrillation (A)

Missing Value Handling

Strategy: Median imputation
Applied after outlier handling, before scaling

Outlier Handling

Method: IQR-based detection with threshold=1.5
Treatment: Winsorisation (capping at bounds)

Scaling

RobustScaler chosen over StandardScaler for outlier robustness
Fit only on training data to prevent data leakage

Model Comparison Results

The project tested 57 unique machine learning models across multiple algorithms with extensive hyperparameter tuning.

Models Implemented (by count)

Random Forest (14 variants): Ensemble method with max_depth ranging from 1-14
K-Nearest Neighbors (15 variants): Distance-based classifier with n_neighbors from 1-15
AdaBoost (10 variants): Boosting ensemble with n_estimators from 50-10000
Radius Neighbors Classifier (8 variants): Distance-based classifier with varying radius (0.5-10.0)
Logistic Regression (5 variants): Binary classifier with feature engineering variations
Gradient Boosting (2 variants): Gradient boosting with default and tuned hyperparameters
Voting Classifier (1 variant): Soft voting ensemble combining LR, RF, KNN, and AdaBoost
Multi-Layer Perceptron (1 variant): Neural network with two hidden layers (100, 50)
Nearest Centroid (1 variant): Simple centroid-based classifier

Total: 57 unique models

Note: trainingResults.csv contains 58 records due to one duplicate GradientBoostingClassifier entry (the model was accidentally run twice). The duplicate can be ignored when analyzing results as both entries have identical configurations and performance metrics.

Top Performing Models (by F1 Score)

Voting Classifier (Ensemble): F1=0.8627, Accuracy=0.9685, Precision=0.8713, Recall=0.8544, ROC-AUC=0.9847
- Best overall performance combining multiple algorithms
Random Forest (max_depth=9): F1=0.8216, Accuracy=0.9314, Precision=0.9268, Recall=0.7379, ROC-AUC=0.9817
- Best single Random Forest configuration
Random Forest (max_depth=8): F1=0.8152, Accuracy=0.9314, Precision=0.9259, Recall=0.7282, ROC-AUC=0.9803
Random Forest (max_depth=7): F1=0.8128, Accuracy=0.9314, Precision=0.9048, Recall=0.7379, ROC-AUC=0.9828
Gradient Boosting (tuned): F1=0.8128, Accuracy=0.9606, Precision=0.9048, Recall=0.7379, ROC-AUC=0.9826
- Parameters: learning_rate=0.05, max_depth=4, n_estimators=200, subsample=0.8
AdaBoost (n_estimators=1000): F1=0.8125, Accuracy=0.9595, Precision=0.8764, Recall=0.7573, ROC-AUC=0.9784

Logistic Regression Feature Engineering Results

Baseline LR (all features): Accuracy=0.9314, F1=0.7589, Precision=0.6400, Recall=0.9320, ROC-AUC=0.9732
- Good baseline but poor precision (0.64), high recall
LR with Top 10 Features: Accuracy=0.9179, F1=0.7181, Precision=0.5962, Recall=0.9029, ROC-AUC=0.9642
- Features selected by importance ranking from baseline model
- Simpler model with slightly lower performance across all metrics
LR with Correlation Filter (>0.8): Accuracy=0.9269, F1=0.7490, Precision=0.6218, Recall=0.9417, ROC-AUC=0.9333
- Removed highly correlated features to reduce multicollinearity
- Marginal improvement in recall
LR Augmented (kept parent features): Accuracy=0.9370, F1=0.7724, Precision=0.6643, Recall=0.9223, ROC-AUC=0.9751
- Added engineered features while keeping original features
- Best LR variant with improved precision
LR Augmented (dropped parents): Accuracy=0.9325, F1=0.7619, Precision=0.6443, Recall=0.9320, ROC-AUC=0.9735
- Replaced original features with engineered features only
- Feature replacement didn’t improve performance over baseline

Key Observations

Ensemble methods dominate: Voting Classifier achieved the best overall performance
Random Forest sweet spot: max_depth of 7-9 provided optimal balance
Boosting is effective: AdaBoost and Gradient Boosting both performed well
High precision models: Random Forest variants achieved precision up to 0.93
Feature engineering matters: Augmented features improved LR performance by ~5% in F1 score
All models show strong discrimination: ROC-AUC scores consistently above 0.97
Trade-offs observed: Higher precision often came at the cost of recall
Results comprehensively stored in trainingResults.csv with 58 total records

Future Enhancements

Based on the notebook structure and current implementation, potential areas for expansion:

✓ Hyperparameter tuning: Completed through systematic parameter exploration across 57 model variants (manual testing of specific parameter values rather than automated GridSearchCV)
✓ Ensemble methods: Implemented (Voting Classifier combining multiple models)
Cross-validation for robust performance estimation (currently using single train-test split)
Feature importance analysis for better interpretability
SHAP values for model interpretability
Deep learning approaches (CNNs, RNNs, LSTMs) for raw ECG signal processing
Real-time ECG classification pipeline
Clinical validation with medical professionals
Integration with ECG monitoring devices
Deployment as a web service or mobile application

Contact

For questions or contributions, please contact team members listed at the top of this document.

Documentation generated for Data Witches Project 2
Last updated: 2025-11-27 (Updated with detailed DataWitches_Challenge.ipynb documentation, including notebook structure, automated ensemble search, comprehensive classifier sweep, and visual model comparison features)

AtrialFibrillation-detection

Data Witches Project 2 - Complete Documentation

Project Overview

Team Members

Dataset Information

Primary Dataset

Data Files

Data Split

Package Dependencies

Core Libraries

Specific Imports

Data Processing

Visualization

Machine Learning

Main Notebook Structure (MAI3003_DataWitches_Assignment02.ipynb)

1. Introduction and Setup

2. Data Preprocessing (Cell 15-19)

3. Function Definitions (Cell 20-28)

4. Exploratory Data Analysis (Cell 29-69)

5. Visualizations (Cell 70-79)

6. Outlier Detection and Handling (Cell 80-95)

7. Final Preprocessing (Cell 96-109)

8. Machine Learning Training Setup (Cell 110-116)

9. Machine Learning Training (Cells 117-178)

10. Results (Cell 179-193)

Challenge Notebook Structure (DataWitches_Challenge.ipynb)

1. Introduction and Setup (Cells 0-12)

2. Data Preprocessing (Cells 13-21)

3. Feature Extraction (Cells 22-43)

4. Preprocessing (Cells 44-77)

5. Machine Learning Training Setup (Cells 78-83)

6. Machine Learning Training (Cells 84-130)

7. Model Evaluation (Cells 132-140)

Key Features of Challenge Notebook

Custom Functions

1. corr_plot_hrv(df, cols=None)

2. distplots_hrv(df, cols=None)

3. distplots(df)

4. boxplots_hrv(df, cols)

5. check_missing_hrv(df)

6. identify_outliers(df, column_name, threshold=1.5)

7. modelResults(model, accuracy, f1, precision, recall, roc_auc, roc_cur, cm)

Key Variables and Datasets

Raw Data Variables

Train/Test Split Variables

HRV Feature Variables

Feature Matrices

Preprocessed Matrices

Model Variables

Evaluation Variables

Preprocessing Objects

Machine Learning Workflow

1. Data Loading and Preprocessing

2. Feature Engineering

3. Data Quality Assessment

4. Data Cleaning

5. Model Training

6. Model Evaluation

7. Results Storage

Data Preprocessing Pipeline

Imputation Strategy

Scaling Strategy

Quality Checks

Feature Engineering Details

ECG Signal Processing

HRV Features Extracted

Model Evaluation Metrics

Accuracy

F1 Score

Precision

Recall (Sensitivity)

ROC AUC

Confusion Matrix

File Structure

Usage Instructions

Setup

Running the Main Notebook

Using Preprocessed HRV Features

Challenge Notebook (DataWitches_Challenge.ipynb)

Results Analysis Notebook (Results_analysis.ipynb)

1. `corr_plot_hrv(df, cols=None)`

2. `distplots_hrv(df, cols=None)`

3. `distplots(df)`

4. `boxplots_hrv(df, cols)`

5. `check_missing_hrv(df)`

6. `identify_outliers(df, column_name, threshold=1.5)`

7. `modelResults(model, accuracy, f1, precision, recall, roc_auc, roc_cur, cm)`