AtrialFibrillation-detection

Data Witches Project 2 - Complete Documentation

Project Overview

This project implements an Atrial Fibrillation classification system using ECG (Electrocardiogram) data and machine learning techniques. The project analyzes heart rate variability (HRV) features extracted from ECG signals to distinguish between Normal Sinus Rhythm and Atrial Fibrillation.

Team Members

Name                 Student ID
Claessen, VVHJAE     i6339543
Ovsiannikova, AM     i6365923
Pubben, J            i6276134
Roca Cugat, M        i6351071
Záboj, J             i6337952

Dataset Information

Primary Dataset

Data Files

Data Split


Package Dependencies

Core Libraries

kaggle              # Dataset downloading
plotly              # Interactive visualizations
matplotlib          # Static plotting
numpy               # Numerical computations
seaborn             # Statistical visualizations
pandas              # Data manipulation
colorama            # Terminal text coloring
scikit-learn        # Machine learning algorithms
shap                # Model interpretability
statsmodels         # Statistical modeling
neurokit2           # ECG signal processing and HRV feature extraction
tqdm                # Progress bars for long-running operations
sympy               # Symbolic mathematics (dependency for other libraries)

Specific Imports

Data Processing

Visualization

Machine Learning


Main Notebook Structure (MAI3003_DataWitches_Assignment02.ipynb)

The main notebook consists of 193 cells organized into the following major sections:

1. Introduction and Setup

2. Data Preprocessing (Cells 15-19)

3. Function Definitions (Cells 20-28)

4. Exploratory Data Analysis (Cells 29-69)

5. Visualizations (Cells 70-79)

6. Outlier Detection and Handling (Cells 80-95)

7. Final Preprocessing (Cells 96-109)

8. Machine Learning Training Setup (Cells 110-116)

9. Machine Learning Training (Cells 117-178)

Model implementations are organized by algorithm type:

Logistic Regression (Cells 118-119):

Random Forest (Cells 121-135):

K-Nearest Neighbors (Cells 136-150):

Radius Neighbors (Cells 151-158):

Other Classifiers (Cells 159-173):

10. Results (Cells 179-193)


Challenge Notebook Structure (DataWitches_Challenge.ipynb)

The challenge notebook consists of 141 cells (58 markdown, 83 code) organized into the following sections:

1. Introduction and Setup (Cells 0-12)

2. Data Preprocessing (Cells 13-21)

3. Feature Extraction (Cells 22-43)

4. Preprocessing (Cells 44-77)

5. Machine Learning Training Setup (Cells 78-83)

6. Machine Learning Training (Cells 84-130)

Model implementations are organized by algorithm type:

Logistic Regression (Cells 85-100):

Random Forest (Cells 101-104):

Neighbors Classifiers (Cells 105-110):

Neural Network (Cells 111-115):

Ensemble Methods (Cells 116-130):

7. Model Evaluation (Cells 132-140)

Key Features of Challenge Notebook


Custom Functions

1. corr_plot_hrv(df, cols=None)

Purpose: Creates a correlation heatmap for HRV (Heart Rate Variability) features

Parameters:

Functionality:

Usage: Identifying relationships between different HRV features to detect multicollinearity
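The function body is not reproduced in this document; a minimal sketch of such a helper, assuming a pandas DataFrame of HRV features and the seaborn heatmap style listed in the dependencies (details of the notebook's actual implementation may differ):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption for headless use
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def corr_plot_hrv(df: pd.DataFrame, cols=None):
    """Plot a correlation heatmap for the selected HRV feature columns.

    Returns the correlation matrix so callers can inspect it numerically.
    """
    cols = cols if cols is not None else df.select_dtypes("number").columns
    corr = df[cols].corr()
    fig, ax = plt.subplots(figsize=(10, 8))
    sns.heatmap(corr, cmap="coolwarm", center=0, ax=ax)
    ax.set_title("HRV feature correlations")
    return corr
```

Pairs with |r| close to 1 in the returned matrix flag multicollinear features.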


2. distplots_hrv(df, cols=None)

Purpose: Creates individual distribution plots (histogram + KDE) for each HRV feature

Parameters:

Functionality:

Usage: Understanding the distribution characteristics of individual features


3. distplots(df)

Purpose: Creates a comprehensive multi-panel distribution plot for all numeric features

Parameters:

Functionality:

Usage: Getting a quick overview of all feature distributions simultaneously


4. boxplots_hrv(df, cols)

Purpose: Creates boxplots to detect outliers and unusual values in HRV features

Parameters:

Functionality:

Usage: Visual identification of outliers and understanding feature spread


5. check_missing_hrv(df)

Purpose: Summarizes and displays missing value information across all HRV features

Parameters:

Returns:

Functionality:

Usage: Assessing data quality and planning imputation strategy


6. identify_outliers(df, column_name, threshold=1.5)

Purpose: Identifies outliers using the IQR (Interquartile Range) method

Parameters:

Returns:

Functionality:

Usage: Detecting anomalous data points that may affect model performance
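A plausible reconstruction of the IQR method described above, matching the documented signature; the notebook's actual return value (outlier rows vs. their indices) may differ:

```python
import pandas as pd

def identify_outliers(df: pd.DataFrame, column_name: str, threshold: float = 1.5):
    """Return the rows whose value in `column_name` falls outside
    [Q1 - threshold*IQR, Q3 + threshold*IQR]."""
    q1 = df[column_name].quantile(0.25)
    q3 = df[column_name].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - threshold * iqr, q3 + threshold * iqr
    mask = (df[column_name] < lower) | (df[column_name] > upper)
    return df[mask]
```

The default threshold of 1.5 corresponds to Tukey's classic fence for boxplot outliers.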


7. modelResults(model, accuracy, f1, precision, recall, roc_auc, roc_cur, cm)

Purpose: Records and stores model evaluation metrics

Parameters:

Functionality:

Usage: Tracking and comparing performance across different models
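Since the notebook saves results to trainingResults.csv, a sketch of how such a recorder could work — the shared `results` list, the column names, and the write-after-every-call behaviour are assumptions, not the notebook's confirmed implementation:

```python
import pandas as pd

results = []  # module-level store; assumption about how the notebook accumulates rows

def modelResults(model, accuracy, f1, precision, recall, roc_auc, roc_cur, cm,
                 path="trainingResults.csv"):
    """Append one model's evaluation metrics and persist the results table."""
    results.append({
        "model": str(model),
        "accuracy": accuracy,
        "f1": f1,
        "precision": precision,
        "recall": recall,
        "roc_auc": roc_auc,
        "roc_curve": roc_cur,        # e.g. the (fpr, tpr, thresholds) tuple
        "confusion_matrix": cm,
    })
    # re-writing on every call keeps the CSV up to date after each model
    pd.DataFrame(results).to_csv(path, index=False)
```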


Key Variables and Datasets

Raw Data Variables

Train/Test Split Variables

HRV Feature Variables

Feature Matrices

Preprocessed Matrices

Model Variables

Evaluation Variables

Preprocessing Objects


Machine Learning Workflow

1. Data Loading and Preprocessing

  1. Load ECG signals from CSV (Cell 14)
  2. Load and encode labels (Cell 16)
  3. Stratified train-test split (80/20) with random_state=3003 (Cell 19)
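Step 3 maps directly onto scikit-learn's train_test_split. A self-contained sketch with placeholder data (the real X and y come from the loaded ECG signals and encoded labels; the 0/1 encoding shown here is an assumption):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# placeholder data: one recording per row, labels 0 = Normal, 1 = AFib (assumed encoding)
X = pd.DataFrame({"signal_id": range(100)})
y = pd.Series([0] * 80 + [1] * 20)

# 80/20 stratified split with the project's fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=3003
)
```

Stratification preserves the class ratio in both halves, which matters here because AFib is the minority class.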

2. Feature Engineering

  1. ECG signal analysis (frequency domain, time domain)
  2. HRV feature extraction using NeuroKit
  3. Extract R-peaks and time-domain features
  4. Generate comprehensive HRV feature set for both train and test
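The notebook delegates this step to NeuroKit2. As a library-free illustration of the time-domain part only, a sketch using SciPy's peak finder — the crude detection rule and the 300 Hz default are assumptions, and NeuroKit2's own R-peak detection is far more robust:

```python
import numpy as np
from scipy.signal import find_peaks

def time_domain_hrv(ecg, sampling_rate=300):
    """Detect R-peaks and compute basic time-domain HRV features."""
    # crude R-peak detection: prominent maxima at least 0.3 s apart
    peaks, _ = find_peaks(ecg, distance=int(0.3 * sampling_rate),
                          prominence=np.std(ecg))
    rr = np.diff(peaks) / sampling_rate * 1000.0  # RR intervals in ms
    return {
        "HRV_MeanNN": rr.mean(),                       # mean RR interval
        "HRV_SDNN": rr.std(ddof=1),                    # overall variability
        "HRV_RMSSD": np.sqrt(np.mean(np.diff(rr)**2)), # beat-to-beat variability
    }
```

A regular rhythm yields low SDNN/RMSSD, while the irregular RR intervals of AFib inflate both — which is why these features discriminate the two classes.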

3. Data Quality Assessment

  1. Correlation analysis
  2. Distribution analysis
  3. Outlier detection using IQR method (threshold=1.5)
  4. Missing value analysis

4. Data Cleaning

  1. Outlier handling using Winsorisation
  2. Missing value imputation using median strategy
  3. Feature scaling using RobustScaler
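The three cleaning steps above can be sketched as one leakage-safe helper. The 5%/95% winsorisation bounds are an assumption (the notebook may clip at different quantiles), but the ordering and the fit-on-train-only discipline match the pipeline described:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler

def clean_features(train: pd.DataFrame, test: pd.DataFrame, limits=(0.05, 0.95)):
    """Winsorise tails, impute missing values with the median, robust-scale.

    All statistics (clip bounds, medians, scaling) are fitted on the
    training set only, then applied unchanged to the test set.
    """
    lo, hi = train.quantile(limits[0]), train.quantile(limits[1])
    train_w = train.clip(lower=lo, upper=hi, axis=1)
    test_w = test.clip(lower=lo, upper=hi, axis=1)

    imputer = SimpleImputer(strategy="median")
    scaler = RobustScaler()
    X_train = scaler.fit_transform(imputer.fit_transform(train_w))
    X_test = scaler.transform(imputer.transform(test_w))
    return X_train, X_test
```

RobustScaler centres on the median and scales by the IQR, so any outliers that survive winsorisation cannot dominate the scaling.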

5. Model Training

  1. Safety checks (no NaNs, aligned shapes)
  2. Logistic Regression with balanced class weights (5 variants with feature engineering)
  3. Random Forest with max_depth ranging from 1 to 14 (14 variants)
  4. K-Nearest Neighbors with n_neighbors from 1 to 15 (15 variants)
  5. Radius Neighbors Classifier with radius from 0.5 to 10.0 (8 variants)
  6. Nearest Centroid Classifier
  7. Multi-Layer Perceptron Neural Network with hidden layers (100, 50)
  8. Gradient Boosting Classifier (2 variants)
  9. AdaBoost Classifier with n_estimators from 50 to 10000 (10 variants)
  10. Voting Classifier (ensemble of best models)
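The hyperparameter sweeps above follow one pattern: loop over a parameter range, fit, score. A sketch of the Random Forest depth sweep on stand-in data (in the notebook the inputs are the scaled HRV matrices; the n_estimators value and scoring choice here are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# stand-in imbalanced data in place of the HRV feature matrices
X, y = make_classification(n_samples=400, n_features=20, weights=[0.8],
                           random_state=3003)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=3003)

scores = {}
for depth in range(1, 15):  # max_depth 1..14, as in the Random Forest sweep
    rf = RandomForestClassifier(max_depth=depth, n_estimators=50,
                                random_state=3003)
    rf.fit(X_train, y_train)
    scores[depth] = f1_score(y_test, rf.predict(X_test))

best_depth = max(scores, key=scores.get)
```

The same loop shape covers the KNN, Radius Neighbors, and AdaBoost sweeps by swapping the estimator and the parameter name.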

6. Model Evaluation

  1. Accuracy score
  2. F1 score
  3. Precision and Recall
  4. ROC AUC score
  5. Confusion matrix
  6. Classification report
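All six quantities come from sklearn.metrics and can be computed from one fitted model in a few lines. A self-contained sketch on synthetic data (the notebook computes these for each of the 57 models via its results-recording helper):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=3003)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=3003)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]  # ROC AUC needs scores, not labels

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_prob),
}
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
```

With the class imbalance in this dataset, F1 and ROC AUC are more informative than raw accuracy, which is why the results tables rank models by F1.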

7. Results Storage


Data Preprocessing Pipeline

Imputation Strategy

Scaling Strategy

Quality Checks


Feature Engineering Details

ECG Signal Processing

  1. Frequency Analysis: Using Welch method for power spectral density
  2. Noise Analysis:
    • Powerline interference (50/60 Hz)
    • Baseline wander (low frequency drift)
    • Muscle noise (high frequency artifacts)
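The Welch-based noise checks can be expressed as band-power comparisons. A sketch assuming a 300 Hz sampling rate and 50 Hz mains; the band edges and the spike-vs-baseline comparison are illustrative choices, not the notebook's exact thresholds:

```python
import numpy as np
from scipy.signal import welch

def band_power(ecg, sampling_rate, f_lo, f_hi):
    """Summed Welch PSD within [f_lo, f_hi] Hz (relative power for comparisons)."""
    freqs, psd = welch(ecg, fs=sampling_rate, nperseg=min(len(ecg), 1024))
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return psd[band].sum()

# example: a slow "cardiac" component plus simulated 50 Hz powerline interference
fs = 300
t = np.arange(0, 10, 1 / fs)
sig = np.sin(2 * np.pi * 1.2 * t) + 0.5 * np.sin(2 * np.pi * 50 * t)

mains = band_power(sig, fs, 45, 55)       # powerline band
baseline = band_power(sig, fs, 0.0, 0.5)  # baseline-wander band
```

Comparing power in the 45-55 Hz band (or 55-65 Hz for 60 Hz mains) against neighbouring bands flags powerline interference; the same helper applied below 0.5 Hz flags baseline wander.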

HRV Features Extracted

Using the NeuroKit2 library for comprehensive HRV analysis:


Model Evaluation Metrics

Accuracy

F1 Score

Precision

Recall (Sensitivity)

ROC AUC

Confusion Matrix


File Structure

AtrialFibrillation-detection/
├── .git/                           # Git repository
├── .gitattributes                  # Git attributes
├── .gitignore                      # Git ignore rules
├── MAI3003_DataWitches_Assignment02.ipynb  # Main notebook (193 cells)
├── MAI3003_DataWitches_Assignment02.pdf    # PDF export of main notebook
├── DataWitches_Challenge.ipynb     # Challenge notebook (141 cells) - Extended ML experiments
├── Results_analysis.ipynb          # Utility notebook for results analysis (8 cells)
├── README.md                       # Basic project description
├── DOCUMENTATION.md                # This file - complete documentation
├── requirements.txt                # Python package dependencies
├── download_dataset.sh             # Kaggle dataset download script
├── trainingResults.csv             # Stored model evaluation results (57 models)
├── hrv_train.csv                   # Preprocessed HRV features for training (72 features)
├── hrv_test.csv                    # Preprocessed HRV features for testing (72 features)
├── pyvenv.cfg                      # Python virtual environment config
├── data/                           # Data directory
│   └── Physionet2017Training.tar.xz  # ECG dataset archive
└── share/                          # Shared resources
    ├── jupyter/                    # Jupyter kernel configurations
    └── man/                        # Manual pages

Usage Instructions

Setup

  1. Install dependencies: pip install -r requirements.txt
  2. Download dataset using download_dataset.sh (requires Kaggle credentials)
  3. Extract data to data/ directory

Running the Main Notebook

  1. Open MAI3003_DataWitches_Assignment02.ipynb in Jupyter or Google Colab
  2. Run cells sequentially from top to bottom
  3. All functions are defined before use
  4. Results are automatically saved to trainingResults.csv
  5. Preprocessed HRV features are saved to hrv_train.csv and hrv_test.csv for reuse

Using Preprocessed HRV Features

The repository includes pre-computed HRV features to skip the computationally expensive feature extraction:
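A small loader sketch for reusing those files; the exact column schema of hrv_train.csv and hrv_test.csv is not reproduced here, so the schema check is the only assumption:

```python
import pandas as pd

def load_hrv_features(train_path="hrv_train.csv", test_path="hrv_test.csv"):
    """Load the pre-computed HRV feature tables, skipping re-extraction."""
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)
    # guard against schema drift between the two files
    assert list(train.columns) == list(test.columns)
    return train, test
```

Loading the CSVs takes seconds, versus the minutes-long NeuroKit2 extraction over every recording.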

Challenge Notebook (DataWitches_Challenge.ipynb)

The challenge notebook provides extended machine learning experiments and model comparisons:

Purpose:

Structure (141 cells):

Key Features:

  1. Soft Voting Classifier: Ensemble combining Logistic Regression, Random Forest (300 estimators), KNN (k=7), and AdaBoost (1000 estimators)
  2. Best Voting Classifier Search: Automated testing of all 2-, 3-, and 4-classifier combinations from a pool of 10 base models
  3. Comprehensive Classifier Sweep: Tests 15+ different classifier types including Ridge, Decision Tree, Extra Trees, SVC, Naive Bayes variants, LDA, QDA, and more
  4. Visual Analysis: ROC curves and metric bar charts (logarithmic and linear scales)
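The automated ensemble search (feature 2 above) amounts to iterating over classifier combinations and scoring each soft-voting ensemble. A scaled-down sketch with a 4-model pool in place of the notebook's 10, on stand-in data:

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=3003)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=3003)

# small illustrative pool; all members support predict_proba (needed for soft voting)
pool = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=3003)),
    ("knn", KNeighborsClassifier(n_neighbors=7)),
    ("nb", GaussianNB()),
]

best = (None, -1.0)
for r in (2, 3, 4):                         # combination sizes, as in the notebook
    for combo in combinations(pool, r):
        vc = VotingClassifier(estimators=list(combo), voting="soft")
        vc.fit(X_train, y_train)
        score = f1_score(y_test, vc.predict(X_test))
        if score > best[1]:
            best = ([name for name, _ in combo], score)
```

With a 10-model pool this evaluates C(10,2)+C(10,3)+C(10,4) = 375 ensembles, so fast base estimators keep the search tractable.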

Usage:

  1. Open DataWitches_Challenge.ipynb in Jupyter or Google Colab
  2. Run cells sequentially; HRV features will be extracted (or can be loaded from CSV)
  3. Experiment with different classifier combinations in the voting classifier search section
  4. Compare results using the visualization cells at the end

Results Analysis Notebook (Results_analysis.ipynb)

Modifying the Pipeline


Notes and Conventions

Random State

Class Encoding

Missing Value Handling

Outlier Handling

Scaling


Model Comparison Results

The project tested 57 unique machine learning models across multiple algorithms with extensive hyperparameter tuning.

Models Implemented (by count)

  1. Random Forest (14 variants): Ensemble method with max_depth from 1 to 14
  2. K-Nearest Neighbors (15 variants): Distance-based classifier with n_neighbors from 1 to 15
  3. AdaBoost (10 variants): Boosting ensemble with n_estimators from 50 to 10000
  4. Radius Neighbors Classifier (8 variants): Distance-based classifier with radius from 0.5 to 10.0
  5. Logistic Regression (5 variants): Binary classifier with feature engineering variations
  6. Gradient Boosting (2 variants): Gradient boosting with default and tuned hyperparameters
  7. Voting Classifier (1 variant): Soft voting ensemble combining LR, RF, KNN, and AdaBoost
  8. Multi-Layer Perceptron (1 variant): Neural network with two hidden layers (100, 50)
  9. Nearest Centroid (1 variant): Simple centroid-based classifier

Total: 57 unique models

Note: trainingResults.csv contains 58 records due to one duplicate GradientBoostingClassifier entry (the model was accidentally run twice). The duplicate can be ignored when analyzing results as both entries have identical configurations and performance metrics.

Top Performing Models (by F1 Score)

  1. Voting Classifier (Ensemble): F1=0.8627, Accuracy=0.9685, Precision=0.8713, Recall=0.8544, ROC-AUC=0.9847
    • Best overall performance combining multiple algorithms
  2. Random Forest (max_depth=9): F1=0.8216, Accuracy=0.9314, Precision=0.9268, Recall=0.7379, ROC-AUC=0.9817
    • Best single Random Forest configuration
  3. Random Forest (max_depth=8): F1=0.8152, Accuracy=0.9314, Precision=0.9259, Recall=0.7282, ROC-AUC=0.9803
  4. Random Forest (max_depth=7): F1=0.8128, Accuracy=0.9314, Precision=0.9048, Recall=0.7379, ROC-AUC=0.9828
  5. Gradient Boosting (tuned): F1=0.8128, Accuracy=0.9606, Precision=0.9048, Recall=0.7379, ROC-AUC=0.9826
    • Parameters: learning_rate=0.05, max_depth=4, n_estimators=200, subsample=0.8
  6. AdaBoost (n_estimators=1000): F1=0.8125, Accuracy=0.9595, Precision=0.8764, Recall=0.7573, ROC-AUC=0.9784

Logistic Regression Feature Engineering Results

Key Observations

Future Enhancements

Based on the notebook structure and current implementation, potential areas for expansion:

  1. Hyperparameter tuning: Completed through systematic parameter exploration across 57 model variants (manual testing of specific parameter values rather than automated GridSearchCV)
  2. Ensemble methods: Implemented (Voting Classifier combining multiple models)
  3. Cross-validation for robust performance estimation (currently using single train-test split)
  4. Feature importance analysis for better interpretability
  5. SHAP values for model interpretability
  6. Deep learning approaches (CNNs, RNNs, LSTMs) for raw ECG signal processing
  7. Real-time ECG classification pipeline
  8. Clinical validation with medical professionals
  9. Integration with ECG monitoring devices
  10. Deployment as a web service or mobile application

Contact

For questions or contributions, please contact team members listed at the top of this document.


Documentation generated for Data Witches Project 2
Last updated: 2025-11-27 (Updated with detailed DataWitches_Challenge.ipynb documentation, including notebook structure, automated ensemble search, comprehensive classifier sweep, and visual model comparison features)