
Commit 95b072c

Add detailed README for ML pipeline contribution
1 parent 15e3398 commit 95b072c

1 file changed

Lines changed: 152 additions & 0 deletions

File tree

  • Diabetes Prediction [END 2 END]/diabetes_pipeline
# Diabetes Prediction – Machine Learning Pipeline

> ⚠️ This repository is a **forked project**.
> The work below represents my **independent contribution and extension** to the original codebase.

This project implements a complete **end-to-end machine learning pipeline** for predicting diabetes using the Pima Indians Diabetes dataset.
The pipeline covers **data preprocessing, model training, evaluation, experimentation, and inference via CLI**.

---

## 📁 Project Structure
diabetes_pipeline/

├── dataset/
│ └── kaggle_diabetes.csv

├── model/
│ ├── diabetes_model.pkl
│ └── scaler.pkl

├── experiments/
│ └── experiment_runner.py

├── data_preprocessing.py
├── train.py
├── predict.py
├── evaluate.py
└── README.md

---

## 🚀 My Contributions

I independently designed and implemented the following components:

### 1. Data Preprocessing Pipeline
- Handled missing values in medical features:
  - `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, `BMI`
- Replaced invalid zeros in these columns with `NaN`
- Applied **mean / median imputation** to fill the resulting `NaN` values
- Standardized features using `StandardScaler`
- Ensured consistent feature names across training and inference

📄 `data_preprocessing.py`

---

### 2. Model Training
- Implemented a reproducible training pipeline
- Trained and persisted:
  - Random Forest classifier
  - Feature scaler
- Stored trained artifacts for reuse and deployment

📄 `train.py`

---

### 3. Model Evaluation
- Added evaluation logic with:
  - Accuracy
  - Precision, Recall, F1-score
- Measured generalization on a held-out test set

📄 `evaluate.py`
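
To make the four metrics concrete, here is a dependency-free sketch that computes them from true and predicted labels for the binary case. The function name `classification_metrics` is hypothetical; scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` compute the same quantities.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Hypothetical helper: accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For an imbalanced medical dataset, recall and F1 on the positive (diabetic) class are usually more informative than accuracy alone, which is presumably why the README reports both.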
66+
67+
---
68+
69+
### 4. Experimentation Framework
70+
- Benchmarked multiple ML models:
71+
- Logistic Regression
72+
- Decision Tree
73+
- Random Forest
74+
- Support Vector Machine (SVM)
75+
- Automatically reports accuracy and F1-score
76+
77+
📄 `experiments/experiment_runner.py`
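
A benchmark loop over the four model families could be structured as below. This is a sketch under assumptions: the function name `run_experiments`, the default hyperparameters, and the 80/20 stratified split are illustrative and are not taken from `experiment_runner.py`.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def run_experiments(X, y, seed: int = 42) -> pd.DataFrame:
    """Hypothetical helper: fit each model on one shared split and tabulate its scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(random_state=seed),
        "SVM": SVC(),
    }
    rows = []
    for name, estimator in models.items():
        # Scaling lives inside the pipeline, so it is fitted on the training split only.
        pipeline = make_pipeline(StandardScaler(), estimator)
        pipeline.fit(X_train, y_train)
        predictions = pipeline.predict(X_test)
        rows.append({
            "Model": name,
            "Accuracy": accuracy_score(y_test, predictions),
            "F1 Score": f1_score(y_test, predictions),
        })
    return pd.DataFrame(rows)
```

Evaluating every model on the same held-out split keeps the comparison fair; if the dataset contains duplicated rows, de-duplicating before splitting avoids the same patient record appearing in both train and test.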

#### Sample Results

| Model               | Accuracy | F1 Score |
|---------------------|----------|----------|
| Logistic Regression | 0.7875   | 0.6320   |
| Decision Tree       | 0.9875   | 0.9805   |
| Random Forest       | 0.9950   | 0.9921   |
| SVM                 | 0.8450   | 0.7328   |

✔️ **Random Forest performs best on this dataset**

---

### 5. Command-Line Prediction Interface
- Built a CLI-based inference script
- Ensures:
  - Correct feature order
  - Feature-name alignment with trained scaler
- Predicts diabetes for a single patient input

📄 `predict.py`
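
One way to guarantee feature order and name alignment is to drive both the argument parser and the output row from a single ordered table. The sketch below uses only the standard library; the flag names come from the example command in this README, while the column names, the `FEATURES` table, and the helper names are assumptions for illustration.

```python
import argparse

# (CLI flag, assumed training column name, type), listed in the assumed training column order.
FEATURES = [
    ("pregnancies", "Pregnancies", int),
    ("glucose", "Glucose", float),
    ("bp", "BloodPressure", float),
    ("skin", "SkinThickness", float),
    ("insulin", "Insulin", float),
    ("bmi", "BMI", float),
    ("dpf", "DiabetesPedigreeFunction", float),
    ("age", "Age", int),
]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper: one required flag per feature."""
    parser = argparse.ArgumentParser(description="Predict diabetes for a single patient.")
    for flag, column, cast in FEATURES:
        parser.add_argument(f"--{flag}", type=cast, required=True, help=column)
    return parser


def args_to_row(argv=None) -> dict:
    """Parse flags and return {column: value} in training order, whatever order the flags came in."""
    args = build_parser().parse_args(argv)
    # Dicts preserve insertion order, so iterating FEATURES fixes the column order.
    return {column: getattr(args, flag) for flag, column, _ in FEATURES}
```

A real script would then wrap the row in a one-row `pandas.DataFrame`, apply the scaler loaded from `model/scaler.pkl`, and call `predict` on the model loaded from `model/diabetes_model.pkl`.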

Example:
```bash
python predict.py \
  --pregnancies 2 \
  --glucose 120 \
  --bp 70 \
  --skin 20 \
  --insulin 80 \
  --bmi 25 \
  --dpf 0.5 \
  --age 35
```

---

## 🛠️ Tech Stack

- Python 3.10+
- pandas
- numpy
- scikit-learn
- joblib

---

## 🧩 Notes

- Project is modular (separate scripts for preprocessing, training, evaluation, and inference) and deployment-ready
- Structured to support FastAPI / Flask integration
- Generated files are excluded from version control via `.gitignore`
- Suitable for internship-level ML engineering evaluation

---

## 👩‍💻 Author Contribution

**Contributor:** Tandrita Mukherjee

**Contribution Scope:**
- ML pipeline design
- Data preprocessing
- Model training & evaluation
- Experimentation framework
- CLI-based inference system

---

## 📌 Disclaimer

This repository is a fork of an existing project.
All enhancements, restructuring, and ML pipeline components listed above were implemented independently as part of my learning and internship preparation.
