Detecting the presence or absence of cardiac arrhythmia and classifying it into one of 16 groups using classical ML algorithms and PCA-based dimensionality reduction on ECG signal data.
This project is for educational and research purposes only. It is not a substitute for clinical ECG interpretation or professional medical diagnosis.
- About the Project
- What is Arrhythmia?
- Dataset
- Class Distribution
- Methodology
- Model Performance
- Key Findings
- Project Structure
- Getting Started
- Tech Stack
- References
ECG (electrocardiogram) signals are the primary clinical tool for diagnosing heart conditions. Manual interpretation of large ECG datasets is time-consuming and error-prone. This project applies classical ML algorithms to automatically distinguish normal ECG readings from 15 abnormal classes (14 arrhythmia types plus one unclassified group) using the well-known UCI Arrhythmia dataset.
A key challenge here is high dimensionality: 279 features with only 452 samples. The project addresses this with PCA and handles the accompanying class imbalance with SMOTE oversampling. Together these raised the reported accuracy of every model (for example, KNN from ~65% to ~72%).
What this project covers:
- Extensive EDA on a heavily imbalanced, high-dimensional tabular dataset
- Handling missing values and feature engineering from ECG signal attributes
- Dimensionality reduction with PCA
- Class imbalance handling with SMOTE oversampling
- Training and comparing 6 classifiers with and without PCA
An arrhythmia is an irregular heartbeat — too fast, too slow, or with an irregular pattern. It is detected via ECG, which records the electrical activity of the heart. While a single arrhythmia beat may be harmless, sustained arrhythmia can be life-threatening, leading to stroke, heart failure, or cardiac arrest. Early automated classification is a critical tool in preventive cardiology.
| Property | Details |
|---|---|
| Source | UCI Machine Learning Repository — Arrhythmia Dataset |
| Samples | 452 patient records |
| Features | 279 (age, sex, weight, height + ECG signal attributes) |
| Classes | 16 (1 Normal + 14 arrhythmia types + 1 unclassified group; classes 11–13 have no instances) |
| Missing Values | Yes — primarily in the J feature column |
| Challenge | High dimensionality (279 features, 452 samples), severe class imbalance |
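
A minimal loading sketch, assuming the raw file is comma-separated with no header row, marks missing entries with `?` (as the UCI distribution of `arrhythmia.data` does), and stores the class code in the last column. The helper name `load_arrhythmia` is hypothetical, and the three-row sample is made up so the snippet runs without the real file.

```python
import io

import pandas as pd


def load_arrhythmia(source):
    """Read the raw UCI file; `source` may be a path or a file-like object."""
    # No header row; "?" marks a missing value in the raw file.
    df = pd.read_csv(source, header=None, na_values="?")
    X = df.iloc[:, :-1]               # every column except the last: features
    y = df.iloc[:, -1].astype(int)    # last column: class code (1..16)
    return X, y


# Tiny in-memory stand-in for the real file (3 rows, 4 features + class).
sample = io.StringIO("75,0,190,?,8\n56,1,165,64,6\n54,0,172,95,10\n")
X, y = load_arrhythmia(sample)

# Count missing values per feature column and show only the affected ones.
missing_per_column = X.isna().sum()
print(missing_per_column[missing_per_column > 0])
```

For the real data, the same call would take the path `Data/arrhythmia.data` from the project structure instead of the in-memory sample.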
| Code | Class | Instances |
|---|---|---|
| 01 | Normal | 245 |
| 02 | Ischemic Changes (Coronary Artery Disease) | 44 |
| 03 | Old Anterior Myocardial Infarction | 15 |
| 04 | Old Inferior Myocardial Infarction | 15 |
| 05 | Sinus Tachycardia | 13 |
| 06 | Sinus Bradycardia | 25 |
| 07 | Ventricular Premature Contraction (PVC) | 3 |
| 08 | Supraventricular Premature Contraction | 2 |
| 09 | Left Bundle Branch Block | 9 |
| 10 | Right Bundle Branch Block | 50 |
| 11 | 1° Atrioventricular Block | 0 |
| 12 | 2° Atrioventricular Block | 0 |
| 13 | 3° Atrioventricular Block | 0 |
| 14 | Left Ventricular Hypertrophy | 4 |
| 15 | Atrial Fibrillation or Flutter | 5 |
| 16 | Others (Unclassified) | 22 |
| | Total | 452 |
Note: 245 of 452 samples (~54%) are normal. Three classes (11 to 13) have no instances at all, and several others have very few (as low as 2 or 3), making this a severely imbalanced multi-class problem.
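
The imbalance can be quantified directly from the table. The short sketch below copies the counts from the table above. The "fewer than 6 samples" threshold is an illustrative assumption, chosen because SMOTE in imbalanced-learn uses 5 nearest neighbors by default.

```python
# Instance counts per class code, copied from the class distribution table.
counts = {1: 245, 2: 44, 3: 15, 4: 15, 5: 13, 6: 25, 7: 3, 8: 2,
          9: 9, 10: 50, 11: 0, 12: 0, 13: 0, 14: 4, 15: 5, 16: 22}

total = sum(counts.values())
present = {c: n for c, n in counts.items() if n > 0}
empty = sorted(c for c, n in counts.items() if n == 0)
majority_share = counts[1] / total

# Ratio between the largest and the smallest non-empty class.
imbalance_ratio = max(present.values()) / min(present.values())

# Non-empty classes with fewer than 6 samples (assumed threshold, see above).
tiny = sorted(c for c, n in present.items() if n < 6)

print(f"total={total}, classes present={len(present)}, empty={empty}")
print(f"normal share={majority_share:.1%}, imbalance ratio={imbalance_ratio:.1f}")
print(f"classes with fewer than 6 samples: {tiny}")
```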
The project follows a structured ML pipeline:
Raw UCI Data (452 × 279)
│
▼
Data Preprocessing
├── Handle missing values (median imputation)
├── Drop zero-variance features
└── Encode categorical variables (sex)
│
▼
Exploratory Data Analysis
├── Class distribution analysis
├── Correlation heatmaps
└── Feature distribution plots
│
▼
Class Imbalance Handling
└── SMOTE Oversampling on training set
│
▼
Dimensionality Reduction
└── PCA (retaining 95% variance)
│
▼
Model Training & Evaluation
├── KNN
├── Logistic Regression
├── Decision Tree
├── Linear SVC
├── Kernelized SVC ← Best Model
└── Random Forest
│
▼
Evaluation: Accuracy, Precision, Recall, F1-Score
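
The preprocessing and modelling stages in the diagram above can be sketched with scikit-learn alone. This is a hedged sketch on synthetic data, not the notebook's code: the SMOTE step is left out to keep the example free of extra dependencies, scaling before PCA is an assumption (the project structure mentions scaling in the preprocessing notebook), and all hyperparameters are library defaults.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the ECG table: 200 samples, 40 features, 2 classes.
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 40)) + y[:, None] * 1.5
X[:, 5] = 0.0                             # one zero-variance column
X[rng.random(X.shape) < 0.02] = np.nan    # scatter some missing values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),         # median imputation
    ("drop_constant", VarianceThreshold()),               # drop zero-variance features
    ("scale", StandardScaler()),                          # assumed: scale before PCA
    ("pca", PCA(n_components=0.95, svd_solver="full")),   # retain 95% of the variance
    ("svc", SVC(kernel="rbf")),                           # kernelized SVC
])
model.fit(X_train, y_train)

n_after_drop = int(model.named_steps["drop_constant"].get_support().sum())
n_components = int(model.named_steps["pca"].n_components_)
test_accuracy = model.score(X_test, y_test)
print(n_after_drop, n_components, round(test_accuracy, 3))
```

Wrapping the steps in a single `Pipeline` means the imputer, scaler, and PCA are fitted on the training split only, which mirrors the leakage concern raised for SMOTE.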
Without PCA:

| Model | Accuracy |
|---|---|
| KNN Classifier | ~65% |
| Logistic Regression | ~70% |
| Decision Tree | ~63% |
| Linear SVC | ~72% |
| Kernelized SVC | ~74% |
| Random Forest | ~73% |
With PCA:

| Model | Accuracy | Notes |
|---|---|---|
| KNN Classifier | ~72% | Improved significantly |
| Logistic Regression | ~75% | Stable across classes |
| Decision Tree | ~68% | Prone to overfitting |
| Linear SVC | ~76% | Good on majority classes |
| Kernelized SVC ✅ | 80.21% | Best model (this figure is the reported recall score) |
| Random Forest | ~78% | Good overall balance |
✅ Kernelized SVM with PCA was selected as the best model based on its highest recall score of 80.21%. Recall is prioritized over accuracy in medical diagnosis to minimize missed arrhythmia cases (false negatives).
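
To see why recall and accuracy can diverge on imbalanced data, consider a toy example. The labels below are made up and are not project results. Macro averaging is an assumption here, since the source does not state which recall average was used.

```python
def recall_per_class(y_true, y_pred):
    """Recall for each class: correctly predicted members / actual members."""
    recalls = {}
    for cls in sorted(set(y_true)):
        predictions_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
        hits = sum(p == cls for p in predictions_for_cls)
        recalls[cls] = hits / len(predictions_for_cls)
    return recalls


# Toy labels: 0 = normal (90 samples), 1 = arrhythmia (10 samples).
y_true = [0] * 90 + [1] * 10
# A classifier that catches only 2 of the 10 arrhythmia cases.
y_pred = [0] * 90 + [1] * 2 + [0] * 8

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = recall_per_class(y_true, y_pred)
macro_recall = sum(recalls.values()) / len(recalls)

print(f"accuracy={accuracy:.2f}")          # 0.92
print(f"recall per class={recalls}")       # {0: 1.0, 1: 0.2}
print(f"macro recall={macro_recall:.2f}")  # 0.60
```

The classifier looks strong on accuracy (0.92) while missing 8 of 10 arrhythmia cases, which is the failure mode that selecting on recall is meant to expose.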
Why PCA helped so much:
- With 279 features and only 452 samples, models suffered from the curse of dimensionality
- PCA reduces complexity by creating uncorrelated components ranked by explained variance
- It eliminates multicollinearity — a major issue when ECG signal features are highly correlated
- The resulting lower-dimensional space improves both model accuracy and training speed
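
The points above can be illustrated on synthetic data. In this sketch (not the project's data), 30 strongly correlated features are built from only 3 latent signals. PCA configured to retain 95% of the variance recovers a handful of components whose mutual covariance is numerically zero. The noise level and sizes are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# 300 samples of 30 features that are noisy mixtures of only 3 latent signals,
# so the features are strongly correlated with one another.
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 30))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 30))

X_scaled = StandardScaler().fit_transform(X)
# A float n_components keeps just enough components to reach that variance share.
pca = PCA(n_components=0.95, svd_solver="full")
Z = pca.fit_transform(X_scaled)

# Covariance of the components: off-diagonal entries should be ~0 (uncorrelated).
cov = Z.T @ Z / (len(Z) - 1)
max_off_diagonal = np.abs(cov - np.diag(np.diag(cov))).max()

print("components kept:", pca.n_components_)
print("variance explained:", pca.explained_variance_ratio_.sum().round(4))
print("largest off-diagonal covariance:", max_off_diagonal)
```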
Why SMOTE was necessary:
- Several arrhythmia classes had only 2–5 samples, far too few for models to learn their patterns reliably
- SMOTE generates synthetic samples for minority classes by interpolating between existing instances
- Applied only to training data to prevent data leakage
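
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation the project uses. The function name `smote_like_samples` and its parameters are hypothetical, and it assumes the minority samples contain no duplicate rows.

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating toward nearest neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    if n <= k:
        raise ValueError("need more minority samples than neighbors (n > k)")

    # Pairwise Euclidean distances between minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # k nearest neighbors of each sample; column 0 is the sample itself
    # (distance 0, assuming no duplicate rows), so it is skipped.
    neighbors = np.argsort(dists, axis=1)[:, 1:k + 1]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                # pick a minority sample
        other = rng.choice(neighbors[base])   # pick one of its neighbors
        gap = rng.random()                    # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[other] - X_minority[base])
    return synthetic


# Four minority samples in 2-D; generate six synthetic ones.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_points = smote_like_samples(minority, n_new=6, k=2)
print(new_points.round(3))
```

Every synthetic point lies on a segment between two real minority samples, which is why the neighbor count has to stay below the size of the smallest class being oversampled.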
Why Kernelized SVM performed best:
- The RBF kernel implicitly maps the PCA-transformed features into a higher-dimensional space where the classes can become linearly separable, so it captures non-linear boundaries that the linear models miss
- Its soft margin limits the influence of individual outliers on the decision boundary
- Handles the reduced but still moderately high-dimensional PCA output well
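
The effect of the kernel is easiest to see on a toy problem. The sketch below uses scikit-learn's `make_circles` (synthetic data, not the ECG features) to compare a linear kernel with an RBF kernel on two concentric rings, which no straight line can separate. Dataset parameters are arbitrary choices.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: the classes are not linearly separable in 2-D.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Same classifier, different kernels, default hyperparameters.
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```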
Classification of Arrhythmia [ECG DATA]/
│
├── 📂 Data/
│ ├── arrhythmia.data # Raw UCI dataset
│ └── arrhythmia.names # Feature descriptions
│
├── 📂 Preprocessing and EDA/
│ ├── Data preprocessing.ipynb # Missing value handling, encoding, scaling
│ └── EDA.ipynb # Distribution plots, correlation analysis
│
├── 📂 Model/
│ └── oversampled and pca.ipynb # SMOTE + PCA + all model comparisons
│
├── 📂 Image/
│ └── result.png # Model comparison results screenshot
│
├── 📂 1- Reports and presentations/ # Project report, slides, reference papers
│
├── final with pca.ipynb # Final consolidated notebook (main entry point)
├── requirements.txt # Python dependencies
└── README.md # You are here
git clone https://github.com/shsarv/Machine-Learning-Projects.git
cd "Machine-Learning-Projects/Classification of Arrhythmia [ECG DATA]"

python -m venv venv
source venv/bin/activate # Linux / macOS
venv\Scripts\activate # Windows

pip install -r requirements.txt

# Step 1: Preprocess the data
jupyter notebook "Preprocessing and EDA/Data preprocessing.ipynb"
# Step 2: Explore the data
jupyter notebook "Preprocessing and EDA/EDA.ipynb"
# Step 3: Train and evaluate all models
jupyter notebook "final with pca.ipynb"

| Layer | Technology |
|---|---|
| Language | Python 3.7+ |
| ML Library | scikit-learn |
| Imbalance Handling | imbalanced-learn (SMOTE) |
| Dimensionality Reduction | PCA (scikit-learn) |
| Data Processing | Pandas, NumPy |
| Visualization | Matplotlib, Seaborn |
| Notebook | Jupyter |
- UCI ML Repository — Arrhythmia Dataset
- Guvenir, H.A., et al. (1997). A Supervised Machine Learning Algorithm for Arrhythmia Analysis. Computers in Cardiology.
- imbalanced-learn SMOTE Documentation
- scikit-learn PCA Documentation
Part of the Machine Learning Projects collection by Sarvesh Kumar Sharma
⭐ Star the main repo if this helped you!