SemanticSounds: Lyric Semantic Meaning Recommendation System


This project analyzes the relationship between audio features and lyrical content using large Spotify datasets from Kaggle. By integrating machine learning, we build a user-tailored recommender system, explore the semantic meaning of song lyrics, and use those semantics to improve recommendations. The goal is to prototype a recommender that is more personalized than those found in mainstream music apps. The repository contains the source code along with the poster presentation and research paper.

Below is a YouTube video that explains the project and includes a demo:

📺 Semantic Sounds Demo | 📄 Read the Research Paper




🎯 Overview

Traditional music recommendation systems filter songs by artist, genre, or audio similarity. SemanticSounds goes beyond surface-level features by analyzing the semantic meaning behind song lyrics using sentence-BERT (sBERT) embeddings.

Research Questions

  1. Feature-Popularity Correlation: Which song features (tempo, energy, danceability, lyrical complexity) correlate with popularity over time?
  2. Recommender System Development: Can we build a system that recommends songs based on both audio features AND lyrical meaning?
  3. Semantic Analysis: How do lyrical themes evolve across decades (1950s-2010s)?

Motivation

  • For Listeners: Discover songs with similar emotional and thematic content, not just similar beats
  • For Artists: Understand which lyrical themes resonate with audiences
  • For Industry: Identify cyclical trends and optimize marketing strategies

✨ Key Features

| Feature | Description |
| --- | --- |
| Dual Recommendation Engines | Base (audio features) + Enhanced (semantic lyrics) |
| Fuzzy Matching | Handles misspelled song/artist names gracefully |
| Semantic Clustering | Groups songs by meaning using UMAP + clustering algorithms |
| Temporal Analysis | Word clouds and trend analysis across musical eras (1950s-2010s) |
| Interactive Visualizations | Plotly-based interactive embedding visualizations |

Example: Semantic vs. Base Recommendations

Input: "Judas" by Lady Gaga (a highly religious-themed song)

| Base Recommender (Audio Features) | Semantic Recommender (sBERT) |
| --- | --- |
| Generic pop songs with similar tempo/energy | "Edge of Heaven" |
| Songs matching mood/vibe | "When You Were Young" |
| Similar danceability scores | "Original Sin", "Devil Inside" |

The semantic recommender captures the religious motifs in the lyrics!


๐Ÿ“ Project Structure

```
SemanticSounds/
│
├── 📄 README.md                          # Main project documentation
├── 📄 paper.pdf                          # Research paper
├── 📄 Rhythms Through Time_ Hu.pdf       # Poster presentation
│
├── 📁 demo/                              # Demo materials
│
└── 📁 recommender_src/                   # Source code and data
    │
    ├── 📓 music_recommender_base.ipynb   # Base recommender (audio features)
    │   └── Contains:
    │       • EDA & data preprocessing
    │       • Feature engineering pipeline
    │       • Regression models (Linear, Ridge, Lasso, RF, XGBoost)
    │       • Neural network implementation
    │       • SHAP feature importance analysis
    │       • Base recommendation system
    │
    ├── 📓 music_recommender_sbert.ipynb  # SBERT-enhanced recommender
    │   └── Contains:
    │       • Fuzzy matching & record linkage (merging datasets)
    │       • Lyric preprocessing pipeline
    │       • SBERT embedding generation
    │       • UMAP dimensionality reduction
    │       • Clustering (KMeans, DBSCAN, HDBSCAN, Agglomerative)
    │       • Word cloud generation by decade
    │       • Semantic recommendation system
    │
    ├── 📊 top_10000_1950-now.csv         # Spotify audio features dataset
    ├── 📊 spotify_60000_songs.csv        # Song lyrics dataset (Git LFS)
    ├── 📊 merged_data2.csv               # Merged dataset after fuzzy matching
    │
    └── 📄 README.md                      # Source code documentation
```

📊 Datasets

Dataset 1: Spotify Top 10,000 Songs (1950-2024)

| Attribute | Details |
| --- | --- |
| File | top_10000_1950-now.csv |
| Size | 10,000 songs, 35 features |
| Key Features | Danceability, Energy, Acousticness, Valence, Speechiness, Liveness, Loudness, Tempo, Popularity, Artist Genres, Album Release Date |

Dataset 2: Song Lyrics (57,650 songs)

| Attribute | Details |
| --- | --- |
| File | spotify_60000_songs.csv |
| Size | 57,650 songs |
| Columns | Artist, Song, Link, Text (full lyrics) |

Dataset 3: Merged Dataset

| Attribute | Details |
| --- | --- |
| File | merged_data2.csv |
| Matching Method | Fuzzy matching with RapidFuzz + RecordLinkage (threshold: 80%) |
| Final Size | ~2,011 matched entries |

🚀 Installation

Option 1: Google Colab (Recommended)

The notebooks are designed to run in Google Colab with Google Drive integration:

  1. Upload the data files to your Google Drive under Colab Notebooks/
  2. Open the notebooks in Colab
  3. Run the setup cells to install dependencies

Option 2: Local Installation

```bash
# Clone the repository
git clone https://github.com/aqn96/SemanticSounds.git
cd SemanticSounds/recommender_src

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies (quote extras so the shell doesn't expand the brackets)
pip install pandas numpy scikit-learn tensorflow sentence-transformers
pip install umap-learn hdbscan plotly matplotlib seaborn
pip install "thefuzz[speedup]" rapidfuzz recordlinkage
pip install nltk gensim shap xgboost lightgbm wordcloud openpyxl

# Download NLTK resources
python -c "import nltk; nltk.download('stopwords'); nltk.download('wordnet'); nltk.download('omw-1.4')"
```

Key Dependencies

pandas, numpy, scikit-learn
tensorflow, sentence-transformers
umap-learn, hdbscan
thefuzz, rapidfuzz, recordlinkage
plotly, matplotlib, seaborn
shap, xgboost, lightgbm
nltk, gensim, wordcloud

💻 Usage

Running the Base Recommender

Open music_recommender_base.ipynb and run all cells. At the end, use the interactive recommender:

```python
# Interactive mode - prompts for user input
recommend_similar_songs()
```

Example Session:

```
Enter the name of the song (required): toxic
Enter the artist name (optional): britney spears
Enter the number of recommendations you want (default 10): 15

Matched Song: 'Toxic' with score 100
Matched Artist: 'Britney Spears' with score 77

Recommended Songs (Top 15):
- 'Burning Up' by Madonna from the album 'Madonna'
- 'Turn Around (5,4,3,2,1)' by Flo Rida from the album 'Only One Flo (Part 1)'
- 'Sorry' by Joel Corry from the album 'Sorry'
- 'Me Against the Music' by Britney Spears, Madonna from the album 'In The Zone'
...
```

Running the SBERT-Enhanced Recommender

Open music_recommender_sbert.ipynb and run all cells. The semantic recommender considers lyrical meaning:

```python
# Recommender with SBERT embeddings
recommend_similar_songs()  # Same interface, different results!
```

Example: "Firework" by Katy Perry

  • Because the lyrics are about "explosiveness," "fire," and "burning"
  • Recommends: "Burn," "Firefly," songs with passion/fire themes

🔬 Methodology

Data Pipeline

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Spotify 10K    │     │  Lyrics 60K     │     │  Fuzzy Match    │
│  Audio Features │────▶│  Song Lyrics    │────▶│  & Record Link  │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
                                                         ▼
                                               ┌─────────────────┐
                                               │  Merged Dataset │
                                               │  (~2,011 songs) │
                                               └────────┬────────┘
                                                        │
                        ┌───────────────────────────────┼───────────────────────────────┐
                        ▼                               ▼                               ▼
              ┌─────────────────┐            ┌─────────────────┐            ┌─────────────────┐
              │ Audio Features  │            │ SBERT Embeddings│            │ Time Period     │
              │ Engineering     │            │ (768-dim)       │            │ Binning         │
              └────────┬────────┘            └────────┬────────┘            └────────┬────────┘
                       │                              │                              │
                       └──────────────────────────────┼──────────────────────────────┘
                                                      ▼
                                            ┌─────────────────┐
                                            │  Combined       │
                                            │  Feature Vector │
                                            └────────┬────────┘
                                                     │
                                    ┌────────────────┼────────────────┐
                                    ▼                                 ▼
                          ┌─────────────────┐               ┌─────────────────┐
                          │ Clustering      │               │ Recommendation  │
                          │ (UMAP + KMeans) │               │ (Euclidean Dist)│
                          └─────────────────┘               └─────────────────┘
```

Feature Engineering (Base Notebook)

| Feature Type | Examples | Purpose |
| --- | --- | --- |
| Interaction Terms | energy × valence, dance × energy | Capture feature relationships |
| Binned Features | tempo_slow/moderate/fast, duration_short/medium/long | Discretize continuous variables |
| Derived Features | is_instrumental, overall_mood, Age_of_Song | Domain-specific indicators |
| One-Hot Encoded | Genre_pop, Genre_rock, Genre_dance pop | Categorical representation |
| Artist Proxy | Artist_Popularity (mean popularity per artist) | Artist influence factor |
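To make the table concrete, here is a hedged sketch of these transforms on a single track dict. The bin thresholds and the instrumentalness cutoff are invented for illustration, not the notebook's actual cut-points:

```python
def engineer_features(track):
    """Add interaction, binned, and derived features to a raw track dict
    (illustrative thresholds; the notebook's exact bins may differ)."""
    f = dict(track)
    f["energy_x_valence"] = track["energy"] * track["valence"]      # interaction term
    f["tempo_bin"] = ("slow" if track["tempo"] < 90                 # binned feature
                      else "moderate" if track["tempo"] < 140 else "fast")
    f["is_instrumental"] = track["instrumentalness"] > 0.5          # derived indicator
    return f

track = {"energy": 0.8, "valence": 0.5, "tempo": 120, "instrumentalness": 0.02}
engineer_features(track)["tempo_bin"]  # → "moderate"
```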

SBERT Embedding Pipeline (SBERT Notebook)

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from umap import UMAP

# 1. Load SBERT model (GPU-accelerated)
model = SentenceTransformer('all-mpnet-base-v2', device='cuda')

# 2. Generate 768-dimensional embeddings from lyrics
embeddings = model.encode(lyrics_list, batch_size=32)

# 3. Reduce dimensions: PCA (768 → 300) → UMAP (300 → 2)
pca = PCA(n_components=300)
embeddings_pca = pca.fit_transform(embeddings)

umap_model = UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
embeddings_2d = umap_model.fit_transform(embeddings_pca)
```

Lyric Preprocessing

```python
# A runnable sketch reconstructed from the notebook's commented steps
# (exact regexes may differ):
import re
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # requires nltk.download('wordnet')

def preprocess_lyrics(text):
    text = re.sub(r'\([^)]*\)', '', text).lower()          # 1-2. drop (...) parts, lowercase
    lines = [re.sub(r"[^a-z0-9' ]", ' ', ln) for ln in text.splitlines()]   # 4. strip specials
    lines = [re.sub(r"\b(\w+)( \1\b){2,}", r'\1 \1', ln) for ln in lines]   # 5. cap repeats at 2
    lines = [' '.join(lemmatizer.lemmatize(w) for w in ln.split()) for ln in lines]  # 6. lemmatize
    lines = list(dict.fromkeys(ln for ln in lines if ln))  # 7. drop duplicate lines
    return ' / '.join(lines)                               # 3. rejoin lines with ' / '
```

Fuzzy Matching for Dataset Merge

  • Libraries: RapidFuzz + RecordLinkage
  • Blocking on the first 2 characters plus a phonetic key (Double Metaphone)
  • Jaro-Winkler similarity threshold: 85%
  • Final match threshold: 80%
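As a rough illustration of the blocking-then-scoring idea, here is a stdlib sketch that uses `difflib.SequenceMatcher` as a stand-in for RapidFuzz's Jaro-Winkler scorer, on hypothetical toy rows:

```python
from difflib import SequenceMatcher

# Toy (artist, title) rows standing in for the two datasets
audio_rows = [("Lady Gaga", "Judas"), ("Britney Spears", "Toxic")]
lyric_rows = [("lady gaga", "judas"), ("britney  spears", "toxic"), ("INXS", "Devil Inside")]

def match_rows(left, right, threshold=0.80):
    """Block on the first 2 characters of the title, then keep pairs whose
    combined artist+title similarity clears the threshold."""
    matches = []
    for l_artist, l_title in left:
        block = l_title[:2].lower()
        for r_artist, r_title in right:
            if r_title.lower()[:2] != block:
                continue  # blocking: skip pairs that cannot plausibly match
            a = f"{l_artist} {l_title}".lower()
            b = f"{r_artist} {r_title}".lower()
            score = SequenceMatcher(None, a, b).ratio()
            if score >= threshold:
                matches.append((l_title, r_title, round(score, 2)))
    return matches
```

Blocking keeps the comparison count tractable (here 2×3 pairs shrink to 2 scored pairs); the notebooks do the same at scale with phonetic keys and RecordLinkage's indexers.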

📈 Results

Model Performance Comparison

Regression Models (Predicting Popularity)

| Model | MSE | R² | Improvement vs Baseline |
| --- | --- | --- | --- |
| Linear Regression | 368.78 | 0.512 | +103× |
| Ridge Regression | 368.77 | 0.512 | +103× |
| Lasso Regression | 368.76 | 0.512 | +103× |
| Random Forest | 473.25 | 0.373 | +75× |
| Gradient Boosting | 380.86 | 0.496 | +100× |
| XGBoost (Tuned) | 377.18 | 0.501 | +101× |
| Keras Neural Net | 471.68 | 0.376 | +76× |
| Baseline [1] | 883.80 | -0.005 | — |

Key Finding: Our models achieved R² = 0.512 vs the baseline's R² = -0.005

Clustering Performance (Silhouette Scores)

| Configuration | Silhouette Score | Interpretation |
| --- | --- | --- |
| Time Period Only (Benchmark) | 0.8084 | High separation, low semantic richness |
| sBERT Only | 0.4286 | Semantic grouping, more overlap |
| sBERT + Audio Features | 0.7464 | Best balance of meaning & separation |
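For reference, the silhouette score of a point is (b − a) / max(a, b), where a is its mean distance to its own cluster and b its mean distance to the nearest other cluster. A from-scratch sketch of the metric behind the table (the notebooks presumably call scikit-learn's `silhouette_score`; this assumes every cluster has at least 2 points):

```python
from math import dist

def silhouette(points, labels):
    """Mean silhouette score over all points, computed from scratch."""
    total = 0.0
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of p's own cluster
        own = [q for j, (q, l) in enumerate(zip(points, labels)) if l == lab and j != i]
        a = sum(dist(p, q) for q in own) / len(own)
        # b: mean distance to the closest foreign cluster
        b = min(sum(dist(p, q) for q in grp) / len(grp)
                for grp in ([q for q, l in zip(points, labels) if l == other]
                            for other in set(labels) if other != lab))
        total += (b - a) / max(a, b)
    return total / len(points)

# Two tight, well-separated clusters score close to 1
silhouette([(0, 0), (0, 1), (10, 0), (10, 1)], [0, 0, 1, 1])  # ≈ 0.90
```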

Feature Importance (SHAP Analysis)

```
Top Features Influencing Popularity:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. Artist_Popularity    ████████████████████ 58.3%
2. Speechiness          ████████░░░░░░░░░░░░  9.9%
3. Acousticness         ████████░░░░░░░░░░░░  9.8%
4. Loudness             ████████░░░░░░░░░░░░  9.8%
5. Liveness             ████████░░░░░░░░░░░░  9.5%
```

Time Period Analysis

Songs were binned into musical eras for temporal analysis:

| Period | Era Name | Song Count |
| --- | --- | --- |
| 1950-1959 | Birth of Rock 'n' Roll | ~200 |
| 1960-1969 | Cultural Revolution | ~400 |
| 1970-1979 | Rise of Diverse Genres | ~600 |
| 1980-1989 | MTV Era and Electronic Explosion | ~1,200 |
| 1990-1993 | The End of an Era | ~800 |
| 1994-1996 | Expansion and Mainstream Success | ~900 |
| 1997-1999 | Technological Advancements | ~1,100 |
| 2000-2009 | Digital Revolution | ~2,500 |
| 2010-2019 | Streaming and Global Connectivity | ~2,300 |
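The binning itself is a lookup from release year to era. A stdlib sketch using the boundaries above (the notebooks may implement this differently, e.g. with pandas' `pd.cut`):

```python
from bisect import bisect_right

# Era start years and names, taken from the table above
ERA_STARTS = [1950, 1960, 1970, 1980, 1990, 1994, 1997, 2000, 2010]
ERA_NAMES = [
    "Birth of Rock 'n' Roll", "Cultural Revolution", "Rise of Diverse Genres",
    "MTV Era and Electronic Explosion", "The End of an Era",
    "Expansion and Mainstream Success", "Technological Advancements",
    "Digital Revolution", "Streaming and Global Connectivity",
]

def era_for_year(year):
    """Map a release year to its era via binary search over the start years."""
    return ERA_NAMES[bisect_right(ERA_STARTS, year) - 1]

era_for_year(1995)  # → "Expansion and Mainstream Success"
```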

🔮 Future Work

  1. Real-time Streaming Integration: Spotify API for live recommendations
  2. Enhanced NLP: Sentiment analysis, emotion detection per verse/chorus
  3. Production Deployment: FastAPI REST API, Docker containerization
  4. Advanced Modeling: Graph Neural Networks for artist relationships
  5. Multi-modal Fusion: Combine audio waveforms + lyrics + album art

👥 Contributors

  • An Nguyen (@aqn96) - Lead Developer, ML Engineering

📚 References

  1. Joe Beach Capital. (2023). Top 10,000 Songs: EDA and Models. Kaggle. Link

  2. Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP.

  3. McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv.

  4. Spotify Web API Documentation. Audio Features. Link


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


โญ If you found this project useful, please consider giving it a star! โญ

📺 View Demo • 📄 Read the Paper • 🐛 Report Bug
