|
<div align="center">

# 🏃 Human Activity Recognition — 2D Pose + LSTM RNN

[Python](https://www.python.org/) · [TensorFlow](https://www.tensorflow.org/) · [License](../LICENSE.md)

> Classifies **6 human activities** from **2D pose time series** (OpenPose keypoints) using a **2-layer stacked LSTM RNN** built in TensorFlow 1.x — achieving **>90% accuracy** in ~7 minutes of training. Deployed as a Flask web app tunnelled through ngrok, with `sample_video.mp4` as a live demo.

[🔙 Back to Main Repository](https://github.com/shsarv/Machine-Learning-Projects)

</div>

---

## 📌 Table of Contents

- [About the Project](#-about-the-project)
- [Key Idea — Why 2D Pose?](#-key-idea--why-2d-pose)
- [Dataset](#-dataset)
- [LSTM Architecture](#-lstm-architecture)
- [Training Configuration](#-training-configuration)
- [Results & Findings](#-results--findings)
- [Project Structure](#-project-structure)
- [Getting Started](#-getting-started)
- [Tech Stack](#-tech-stack)
- [References](#-references)

---

## 🔬 About the Project

This experiment classifies human activities using **2D pose time series data** and a **stacked LSTM RNN**. Rather than feeding raw RGB images or expensive 3D pose data into the network, it uses **2D (x, y) keypoints** extracted from video frames via OpenPose — a much lighter and more accessible input representation.

The core research questions:

- Can **2D pose** match **3D pose** accuracy for activity recognition? (removes the need for RGBD cameras)
- Can **2D pose** match **raw RGB image** accuracy? (smaller input = smaller model = better with limited data)
- Does the approach generalize to **animal** behaviour classification for robotics applications?

The network architecture is based on Guillaume Chevalier's *LSTMs for Human Activity Recognition (2016)*, with one key modification: because this dataset is large and ordered by class, training batches are drawn by **random sampling without replacement** rather than sequentially.

---

## 🧠 Key Idea — Why 2D Pose?

```
Raw Video Frame (640×480 RGB)
          │
          ▼
   OpenPose Inference
   18 body keypoints × (x, y) coords
          │
          ▼
   36-dimensional feature vector per frame
          │
          ▼   (32 frames = 1 time window)
   LSTM RNN → Activity Class
```

| Input Type | Pros | Cons |
|------------|------|------|
| Raw RGB images | High information | Large models, lots of data needed |
| 3D pose (RGBD) | Rich spatial info | Needs depth sensors |
| **2D pose (x,y)** ✅ | Lightweight, RGB-only camera, small model | Some spatial ambiguity |

> Limiting the feature vector to 2D pose keypoints allows for a **smaller LSTM model** that generalises better on limited datasets — particularly relevant for future animal behaviour recognition tasks.
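
To make the shapes concrete, here is a minimal NumPy sketch of how one input window is formed (random values stand in for real OpenPose output):

```python
import numpy as np

n_keypoints, n_steps = 18, 32

# Stand-in for OpenPose output over 32 consecutive frames:
# one (x, y) pair per keypoint per frame.
frames = np.random.rand(n_steps, n_keypoints, 2)

# Flatten each frame's 18 keypoints into a 36-dim feature vector,
# giving one (32, 36) window, the input for a single LSTM example.
window = frames.reshape(n_steps, n_keypoints * 2)
print(window.shape)  # (32, 36)
```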
| 72 | +
|
---

## 📊 Dataset

| Property | Details |
|----------|---------|
| **Source** | Berkeley Multimodal Human Action Database (MHAD) — 2D poses extracted via OpenPose |
| **Download** | `RNN-HAR-2D-Pose-database.zip` (~19.2 MB, Google Drive) |
| **Subjects** | 12 |
| **Angles** | 4 camera angles |
| **Repetitions** | 5 per subject per action |
| **Total videos** | 1,438 (2 missing from the original 1,440) |
| **Total frames** | 211,200 |
| **Training windows** | 22,625 (32 timesteps each, 50% overlap) |
| **Test windows** | 5,751 |
| **Input shape** | `(22625, 32, 36)` → windows × timesteps × features |
| **Preprocessing** | ❌ None — raw, unnormalized pose coordinates |

### Activity Classes (6)

| Label | Activity |
|-------|----------|
| `JUMPING` | Vertical jumps |
| `JUMPING_JACKS` | Jumping jacks |
| `BOXING` | Boxing motions |
| `WAVING_2HANDS` | Waving with both hands |
| `WAVING_1HAND` | Waving with one hand |
| `CLAPPING_HANDS` | Clapping hands |

### Data Files

```
RNN-HAR-2D-Pose-database/
├── X_train.txt   # 22,625 training windows — one frame per row, 36 comma-separated floats
├── X_test.txt    # 5,751 test windows, same format
├── Y_train.txt   # Training labels (0–5), one per window
└── Y_test.txt    # Test labels (0–5), one per window
```

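A minimal loader sketch consistent with the shapes above (assuming one frame per row and 32 consecutive rows per window; the function names are illustrative, not the notebook's exact code):

```python
import numpy as np

N_STEPS = 32  # frames per window

def load_X(path):
    """Read pose frames (36 comma-separated floats per row) and regroup
    them into windows of 32 consecutive frames."""
    frames = np.loadtxt(path, delimiter=',', dtype=np.float32)
    return frames.reshape(-1, N_STEPS, frames.shape[1])

def load_y(path):
    """Read one integer label (0-5) per window."""
    return np.loadtxt(path, dtype=np.int32).reshape(-1)

X_train = load_X('RNN-HAR-2D-Pose-database/X_train.txt')  # (22625, 32, 36)
y_train = load_y('RNN-HAR-2D-Pose-database/Y_train.txt')  # (22625,)
```
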
---

## 🏗️ LSTM Architecture

```
Input: (batch_size, 32 timesteps, 36 features)
          │
          ▼
   Linear projection: 36 → 34 (ReLU)
          │
          ▼
 ┌──────────────────────────────────┐
 │ BasicLSTMCell(34, forget_bias=1) │ ← Layer 1
 ├──────────────────────────────────┤
 │ BasicLSTMCell(34, forget_bias=1) │ ← Layer 2
 └──────────────────────────────────┘
   tf.contrib.rnn.MultiRNNCell (stacked)
   tf.contrib.rnn.static_rnn (many-to-one)
          │
    Last output only
          │
          ▼
   Linear: 34 → 6
   Softmax → Activity class
```

> **Why n_hidden = 34?** Testing across a range of hidden unit counts showed the best generalisation when hidden units ≈ n_input (36). 34 was found to be optimal.

> **Many-to-one classifier** — only the last LSTM output (timestep 32) is used for classification, not the full sequence output.
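
The diagram above condenses to roughly the following TF1 sketch (variable names are illustrative; the notebook follows Chevalier's original code more closely):

```python
import tensorflow as tf  # TensorFlow 1.x

n_steps, n_input, n_hidden, n_classes = 32, 36, 34, 6

x = tf.placeholder(tf.float32, [None, n_steps, n_input])

# static_rnn expects a list of n_steps tensors of shape (batch, n_input).
steps = tf.unstack(x, n_steps, axis=1)

# Linear projection 36 -> 34 with ReLU, applied at every timestep.
W_in = tf.Variable(tf.random_normal([n_input, n_hidden]))
b_in = tf.Variable(tf.random_normal([n_hidden]))
steps = [tf.nn.relu(tf.matmul(s, W_in) + b_in) for s in steps]

# Two stacked LSTM layers.
cells = [tf.contrib.rnn.BasicLSTMCell(n_hidden, forget_bias=1.0) for _ in range(2)]
outputs, _ = tf.contrib.rnn.static_rnn(
    tf.contrib.rnn.MultiRNNCell(cells), steps, dtype=tf.float32)

# Many-to-one: classify from the last timestep's output only.
W_out = tf.Variable(tf.random_normal([n_hidden, n_classes]))
b_out = tf.Variable(tf.random_normal([n_classes]))
pred = tf.matmul(outputs[-1], W_out) + b_out  # logits; softmax is applied in the loss
```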
| 141 | +
|
---

## ⚙️ Training Configuration

| Parameter | Value |
|-----------|-------|
| Framework | TensorFlow 1.x (`%tensorflow_version 1.x`) |
| Timesteps (`n_steps`) | 32 |
| Input features (`n_input`) | 36 (18 keypoints × x, y) |
| Hidden units (`n_hidden`) | 34 |
| Classes (`n_classes`) | 6 |
| Epochs | 300 |
| Batch size | 512 |
| Optimizer | Adam |
| Initial learning rate | 0.005 |
| LR decay | Exponential — `0.96` per 100,000 steps |
| Loss | Softmax cross-entropy + L2 regularization |
| L2 lambda | 0.0015 |
| Batch strategy | Random sampling **without replacement** (prevents class-order bias; see the sketch below) |
| Training time | ~7 minutes (Google Colab) |

**L2 regularization formula:**
```python
# L2 penalty over all trainable weights, added to the cross-entropy loss.
# y is the one-hot label placeholder; pred holds the network's logits.
l2 = lambda_loss_amount * sum(
    tf.nn.l2_loss(tf_var) for tf_var in tf.trainable_variables()
)
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=pred)
) + l2
```

**Decayed learning rate:**
```python
# lr = 0.005 * 0.96 ** (global_step / 100000), decayed continuously (staircase=False)
learning_rate = tf.train.exponential_decay(
    0.005, global_step, decay_steps=100000, decay_rate=0.96, staircase=False
)
```
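
The batch strategy from the table, sketched as a simplified version of the idea (not the notebook's exact function, and reusing `X_train`/`y_train` from the Dataset loader sketch): indices are drawn without replacement, so every window is seen once per epoch in random order even though the files are ordered by class.

```python
import numpy as np

def extract_batch(X, y, batch_size, unsampled):
    """Draw one random batch without replacement from the index pool."""
    chosen = np.random.choice(unsampled, batch_size, replace=False)
    remaining = np.setdiff1d(unsampled, chosen)  # shrink the pool
    return X[chosen], y[chosen], remaining

# Reset the pool at the start of every epoch:
unsampled = np.arange(len(X_train))
batch_X, batch_y, unsampled = extract_batch(X_train, y_train, 512, unsampled)
```

The remainder at the end of an epoch (fewer than 512 indices left in the pool) needs special-casing, omitted here for brevity.
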
---

## 📈 Results & Findings

| Metric | Value |
|--------|:-----:|
| **Final Accuracy** | **> 90%** |
| Training time | ~7 minutes |

**Confusion pairs observed** (see the confusion-matrix sketch below):
- `CLAPPING_HANDS` ↔ `BOXING` — similar upper-body motion patterns
- `JUMPING_JACKS` ↔ `WAVING_2HANDS` — symmetric arm movements
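
A row-normalized confusion matrix makes such pairs visible. In this sketch, `y_true` and `y_pred` are random stand-ins for the test labels and the model's argmax predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

LABELS = ["JUMPING", "JUMPING_JACKS", "BOXING",
          "WAVING_2HANDS", "WAVING_1HAND", "CLAPPING_HANDS"]

# Stand-ins: in the notebook these come from Y_test.txt and the trained model.
y_true = np.random.randint(0, 6, size=5751)
y_pred = np.random.randint(0, 6, size=5751)

cm = confusion_matrix(y_true, y_pred)                # 6x6 counts
cm_pct = 100.0 * cm / cm.sum(axis=1, keepdims=True)  # % of each true class
for label, row in zip(LABELS, cm_pct):
    print(f"{label:>14}: " + " ".join(f"{v:5.1f}" for v in row))
```

Rows are true classes, so an off-diagonal peak in a row shows which activity that class is most often mistaken for.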

**Key conclusions:**
- 2D pose achieves >90% accuracy, validating its use over more expensive 3D pose or raw RGB inputs
- Hidden units ≈ n_input (34 ≈ 36) gives the best generalisation
- Random batch sampling without replacement is **critical** — batches ordered by class degrade training significantly
- The approach is promising for future animal behaviour estimation with autonomous mobile robots

---

## 📁 Project Structure

```
Human Activity Detection/
│
├── 📂 images/       # Result plots and visualizations
├── 📂 models/       # Saved LSTM model weights
├── 📂 src/          # Helper source scripts
├── 📂 templates/    # HTML templates (Flask app)
│
├── Human_Activity_Recogination.ipynb                        # Main notebook — dataset, LSTM, training
├── Human_Action_Classification_deployment_with_ngrok.ipynb  # Flask + ngrok deployment notebook
├── lstm_train.ipynb                                         # Standalone LSTM training notebook
├── app.py                                                   # Flask web application
├── sample_video.mp4                                         # Sample video for live demo
└── requirements.txt                                         # Python dependencies
```

---

## 🚀 Getting Started

### 1. Clone the repository

```bash
git clone https://github.com/shsarv/Machine-Learning-Projects.git
cd "Machine-Learning-Projects/Human Activity Detection"
```

### 2. Set up environment

```bash
python -m venv venv
source venv/bin/activate   # Linux / macOS
venv\Scripts\activate      # Windows

pip install -r requirements.txt
```

> ⚠️ **TensorFlow 1.x required.** The LSTM uses `tf.contrib.rnn` and `tf.placeholder` APIs from TF1.
> ```bash
> pip install tensorflow==1.15.0
> ```
| 241 | +
|
### 3. Download the dataset

The dataset is downloaded automatically in the notebook:

```python
!wget -O RNN-HAR-2D-Pose-database.zip \
    https://drive.google.com/u/1/uc?id=1IuZlyNjg6DMQE3iaO1Px6h1yLKgatynt
!unzip RNN-HAR-2D-Pose-database.zip
```
| 250 | +
|
### 4. Run on Google Colab (recommended)

1. Open `Human_Activity_Recogination.ipynb` in Google Colab
2. Runtime → Change runtime type → GPU (optional, speeds up training)
3. Run all cells — training completes in ~7 minutes

### 5. Deploy with ngrok

1. Open `Human_Action_Classification_deployment_with_ngrok.ipynb`
2. Follow the ngrok setup cells to expose the Flask app publicly

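A minimal sketch of the idea behind the deployment notebook, assuming the `flask-ngrok` helper (an assumption; the notebook's actual routes wire the video upload and LSTM prediction into the app):

```python
# pip install flask flask-ngrok   (the ngrok binary must be available)
from flask import Flask
from flask_ngrok import run_with_ngrok

app = Flask(__name__)
run_with_ngrok(app)  # opens a public ngrok tunnel when app.run() is called

@app.route("/")
def index():
    return "Human Activity Recognition demo is live"

if __name__ == "__main__":
    app.run()
```

The notebook may instead drive the `ngrok` binary or `pyngrok` directly; the tunnelling step is the same either way.
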
---

## 🛠️ Tech Stack

| Layer | Technology |
|-------|-----------|
| Language | Python 3.7+ |
| Deep Learning | TensorFlow 1.x (`tf.contrib.rnn`) |
| Model | 2-layer stacked LSTM (`BasicLSTMCell`) |
| Pose Extraction | OpenPose (CMU Perceptual Computing Lab) |
| Data Processing | NumPy |
| Visualization | Matplotlib |
| Web Framework | Flask |
| Deployment | ngrok (tunnel) |
| Notebook | Jupyter / Google Colab |

---

## 📚 References

- Guillaume Chevalier (2016). *LSTMs for Human Activity Recognition.* [github.com/guillaume-chevalier](https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition) — MIT License
- [Berkeley MHAD Dataset](http://tele-immersion.citris-uc.org/berkeley_mhad)
- [OpenPose — CMU Perceptual Computing Lab](https://github.com/CMU-Perceptual-Computing-Lab/openpose)
- Keskar et al. (2016). *On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima* — source of the quote *"It has been observed in practice that when using a larger batch there is a significant degradation in the quality of the model..."* and basis for the batch strategy
- [Andrej Karpathy — The Unreasonable Effectiveness of RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) — referenced for the many-to-one classifier design

---

<div align="center">

Part of the [Machine Learning Projects](https://github.com/shsarv/Machine-Learning-Projects) collection by [Sarvesh Kumar Sharma](https://github.com/shsarv)

⭐ Star the main repo if this helped you!

</div>