This study analyzes the relationship between audio features and lyrical content using extensive Spotify datasets from Kaggle. By integrating machine learning, we aim to build a user-tailored recommender system. We also explore the semantic meaning of song lyrics and use it to improve the recommender. The goal is a prototype that is more personalized than the recommenders built into music apps. The source code, poster presentation, and research paper are included in this repository.
The YouTube video below explains the project and includes a demo.
- Overview
- Key Features
- Project Structure
- Datasets
- Installation
- Usage
- Methodology
- Results
- Future Work
- Contributors
- References
Traditional music recommendation systems filter songs by artist, genre, or audio similarity. SemanticSounds goes beyond surface-level features by analyzing the semantic meaning behind song lyrics using sentence-BERT (sBERT) embeddings.
- Feature-Popularity Correlation: Which song features (tempo, energy, danceability, lyrical complexity) correlate with popularity over time?
- Recommender System Development: Can we build a system that recommends songs based on both audio features AND lyrical meaning?
- Semantic Analysis: How do lyrical themes evolve across decades (1950s-2010s)?
- For Listeners: Discover songs with similar emotional and thematic content, not just similar beats
- For Artists: Understand which lyrical themes resonate with audiences
- For Industry: Identify cyclical trends and optimize marketing strategies
| Feature | Description |
|---|---|
| Dual Recommendation Engines | Base (audio features) + Enhanced (semantic lyrics) |
| Fuzzy Matching | Handles misspelled song/artist names gracefully |
| Semantic Clustering | Groups songs by meaning using UMAP + clustering algorithms |
| Temporal Analysis | Word clouds and trend analysis across musical eras (1950s-2010s) |
| Interactive Visualizations | Plotly-based interactive embedding visualizations |
Input: "Judas" by Lady Gaga (a highly religious-themed song)
| Base Recommender (Audio Features) | Semantic Recommender (sBERT) |
|---|---|
| Generic pop songs with similar tempo/energy | "Edge of Heaven" |
| Songs matching mood/vibe | "When You Were Young" |
| Similar danceability scores | "Original Sin", "Devil Inside" |
The semantic recommender captures the religious motifs in the lyrics!
SemanticSounds/
│
├── README.md                          # Main project documentation
├── paper.pdf                          # Research paper
├── Rhythms Through Time_ Hu.pdf       # Poster presentation
│
├── demo/                              # Demo materials
│
└── recommender_src/                   # Source code and data
    │
    ├── music_recommender_base.ipynb   # Base recommender (audio features)
    │   └── Contains:
    │       • EDA & data preprocessing
    │       • Feature engineering pipeline
    │       • Regression models (Linear, Ridge, Lasso, RF, XGBoost)
    │       • Neural network implementation
    │       • SHAP feature importance analysis
    │       • Base recommendation system
    │
    ├── music_recommender_sbert.ipynb  # SBERT-enhanced recommender
    │   └── Contains:
    │       • Fuzzy matching & record linkage (merging datasets)
    │       • Lyric preprocessing pipeline
    │       • SBERT embedding generation
    │       • UMAP dimensionality reduction
    │       • Clustering (KMeans, DBSCAN, HDBSCAN, Agglomerative)
    │       • Word cloud generation by decade
    │       • Semantic recommendation system
    │
    ├── top_10000_1950-now.csv         # Spotify audio features dataset
    ├── spotify_60000_songs.csv        # Song lyrics dataset (Git LFS)
    ├── merged_data2.csv               # Merged dataset after fuzzy matching
    │
    └── README.md                      # Source code documentation
| Attribute | Details |
|---|---|
| File | top_10000_1950-now.csv |
| Size | 10,000 songs, 35 features |
| Key Features | Danceability, Energy, Acousticness, Valence, Speechiness, Liveness, Loudness, Tempo, Popularity, Artist Genres, Album Release Date |
| Attribute | Details |
|---|---|
| File | spotify_60000_songs.csv |
| Size | 57,650 songs |
| Columns | Artist, Song, Link, Text (full lyrics) |
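The Text column holds raw lyrics, which the SBERT notebook cleans before embedding. The sketch below is a minimal, standard-library approximation of such a cleaning pass, not the notebook's code: `clean_lyrics` is a hypothetical name, lemmatization is omitted because it needs NLTK, "excessive repetitions" is assumed to mean repeated characters, and duplicate lines are dropped before the lines are joined with ' / '.

```python
import re

def clean_lyrics(text: str) -> str:
    """Hypothetical lyric cleaner (illustrative; no tokenization or lemmatization)."""
    text = re.sub(r"\([^)]*\)", "", text)   # remove text within parentheses
    text = text.lower()                      # convert to lowercase
    seen, lines = set(), []
    for line in text.splitlines():
        line = re.sub(r"[^a-z0-9\s.,!?']", "", line)  # keep letters, digits, basic punctuation
        line = re.sub(r"(.)\1{2,}", r"\1\1", line)    # limit character runs to 2 ("sooo" -> "soo")
        line = re.sub(r"\s+", " ", line).strip()      # collapse whitespace
        if line and line not in seen:                  # skip blank and duplicate lines
            seen.add(line)
            lines.append(line)
    return " / ".join(lines)                 # line breaks become ' / '
```

For example, `clean_lyrics("Baby baby (yeah)\nBaby baby\nI looooove you!!!!")` collapses the repeated line and the stretched characters into a single cleaned string.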
| Attribute | Details |
|---|---|
| File | merged_data2.csv |
| Matching Method | Fuzzy matching with RapidFuzz + RecordLinkage (threshold: 80%) |
| Final Size | ~2,011 matched entries |
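To illustrate how threshold-based linkage between the two datasets can work, here is a minimal sketch. It is an assumption-laden stand-in: the standard library's `difflib.SequenceMatcher` replaces the RapidFuzz and RecordLinkage scorers so the example has no dependencies, blocking uses only the first two characters of the title (no phonetic key), and the function names and the (title, artist) tuple layout are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (difflib ratio, standing in for a RapidFuzz scorer)."""
    return 100.0 * SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def link_records(audio_rows, lyric_rows, threshold=80.0):
    """Pair (title, artist) tuples whose mean title/artist similarity meets the threshold."""
    # Blocking: only compare titles that share their first two characters.
    blocks = {}
    for title, artist in lyric_rows:
        blocks.setdefault(title.lower().strip()[:2], []).append((title, artist))
    matches = []
    for title, artist in audio_rows:
        best, best_score = None, threshold
        for cand_title, cand_artist in blocks.get(title.lower().strip()[:2], []):
            score = (similarity(title, cand_title) + similarity(artist, cand_artist)) / 2
            if score >= best_score:  # keep the highest-scoring candidate at or above the threshold
                best, best_score = (cand_title, cand_artist), score
        if best is not None:
            matches.append(((title, artist), best, round(best_score, 1)))
    return matches
```

Blocking keeps the number of pairwise comparisons manageable, at the cost of missing pairs whose titles differ in their first characters.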
The notebooks are designed to run in Google Colab with Google Drive integration:
- Upload the data files to your Google Drive under Colab Notebooks/
- Open the notebooks in Colab
- Run the setup cells to install dependencies
# Clone the repository
git clone https://github.com/aqn96/SemanticSounds.git
cd SemanticSounds/recommender_src
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install pandas numpy scikit-learn tensorflow sentence-transformers
pip install umap-learn hdbscan plotly matplotlib seaborn
pip install thefuzz[speedup] rapidfuzz recordlinkage
pip install nltk gensim shap xgboost lightgbm wordcloud openpyxl
# Download NLTK resources
python -c "import nltk; nltk.download('stopwords'); nltk.download('wordnet'); nltk.download('omw-1.4')"

pandas, numpy, scikit-learn
tensorflow, sentence-transformers
umap-learn, hdbscan
thefuzz, rapidfuzz, recordlinkage
plotly, matplotlib, seaborn
shap, xgboost, lightgbm
nltk, gensim, wordcloud
Open music_recommender_base.ipynb and run all cells. At the end, use the interactive recommender:
# Interactive mode - prompts for user input
recommend_similar_songs()

Example Session:
Enter the name of the song (required): toxic
Enter the artist name (optional): britney spears
Enter the number of recommendations you want (default 10): 15
Matched Song: 'Toxic' with score 100
Matched Artist: 'Britney Spears' with score 77
Recommended Songs (Top 15):
- 'Burning Up' by Madonna from the album 'Madonna'
- 'Turn Around (5,4,3,2,1)' by Flo Rida from the album 'Only One Flo (Part 1)'
- 'Sorry' by Joel Corry from the album 'Sorry'
- 'Me Against the Music' by Britney Spears, Madonna from the album 'In The Zone'
...
Open music_recommender_sbert.ipynb and run all cells. The semantic recommender considers lyrical meaning:
# Recommender with SBERT embeddings
recommend_similar_songs()  # Same interface, different results!

Example: "Firework" by Katy Perry
- Because the lyrics are about "explosiveness," "fire," and "burning"
- Recommends: "Burn," "Firefly," songs with passion/fire themes
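The two recommenders share an interface and differ in which features feed the distance calculation. The sketch below shows the idea with Euclidean distance over toy vectors. Everything in it is illustrative: the song keys are placeholders, the numbers are invented, the first two dimensions stand for audio features and the last two for lyric-embedding dimensions, and `recommend` here is a hypothetical helper rather than the notebooks' `recommend_similar_songs`.

```python
import math

def recommend(query, catalog, k=1, dims=slice(None)):
    """Rank other songs by Euclidean distance to the query over the selected dimensions."""
    q = catalog[query][dims]
    ranked = sorted(
        (math.dist(q, vec[dims]), title)
        for title, vec in catalog.items()
        if title != query
    )
    return [title for _, title in ranked[:k]]

# Invented vectors: [energy, danceability, lyric_dim_1, lyric_dim_2]
catalog = {
    "query":      [0.9, 0.7, 0.95, 0.10],
    "same_theme": [0.6, 0.5, 0.90, 0.15],  # close in lyrics, further in audio
    "same_beat":  [0.9, 0.7, 0.10, 0.90],  # identical audio, different lyrics
}

audio_only = recommend("query", catalog, dims=slice(0, 2))  # base recommender view
combined = recommend("query", catalog)                      # audio + lyric dimensions
```

With audio dimensions alone the nearest neighbour is `same_beat`; once the lyric dimensions are included it becomes `same_theme`. In practice the lyric part would be a 768-dimensional SBERT embedding, and the audio features would likely need scaling so that neither group dominates the distance.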
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Spotify 10K    │     │  Lyrics 60K     │     │  Fuzzy Match    │
│  Audio Features │────▶│  Song Lyrics    │────▶│  & Record Link  │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
                                                         ▼
                                                ┌─────────────────┐
                                                │  Merged Dataset │
                                                │  (~2,011 songs) │
                                                └────────┬────────┘
                                                         │
         ┌───────────────────────┬───────────────────────┤
         ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Audio Features │     │ SBERT Embeddings│     │  Time Period    │
│  Engineering    │     │    (768-dim)    │     │  Binning        │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 ▼
                        ┌─────────────────┐
                        │    Combined     │
                        │  Feature Vector │
                        └────────┬────────┘
                                 │
                     ┌───────────┴───────────┐
                     ▼                       ▼
            ┌─────────────────┐     ┌─────────────────┐
            │   Clustering    │     │ Recommendation  │
            │ (UMAP + KMeans) │     │ (Euclidean Dist)│
            └─────────────────┘     └─────────────────┘
| Feature Type | Examples | Purpose |
|---|---|---|
| Interaction Terms | energy × valence, dance × energy | Capture feature relationships |
| Binned Features | tempo_slow/moderate/fast, duration_short/medium/long | Discretize continuous variables |
| Derived Features | is_instrumental, overall_mood, Age_of_Song | Domain-specific indicators |
| One-Hot Encoded | Genre_pop, Genre_rock, Genre_dance pop | Categorical representation |
| Artist Proxy | Artist_Popularity (mean popularity per artist) | Artist influence factor |
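A few of the feature types in the table can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebook's pipeline: the record keys (`release_year`, `artist`, `popularity`), the function names, and the 90/120 BPM tempo cut points are hypothetical.

```python
from collections import defaultdict

def engineer_features(song, current_year):
    """Add interaction, binned, and derived columns to one song record (a dict)."""
    out = dict(song)
    # Interaction terms: products of two audio features.
    out["energy_x_valence"] = song["energy"] * song["valence"]
    out["dance_x_energy"] = song["danceability"] * song["energy"]
    # Binned tempo, one-hot encoded. The 90/120 BPM cut points are assumptions.
    tempo = song["tempo"]
    bucket = "slow" if tempo < 90 else "moderate" if tempo < 120 else "fast"
    for name in ("slow", "moderate", "fast"):
        out[f"tempo_{name}"] = int(name == bucket)
    # Derived feature: years since release.
    out["Age_of_Song"] = current_year - song["release_year"]
    return out

def artist_popularity(songs):
    """Mean popularity per artist, used as a proxy for artist influence."""
    by_artist = defaultdict(list)
    for s in songs:
        by_artist[s["artist"]].append(s["popularity"])
    return {artist: sum(vals) / len(vals) for artist, vals in by_artist.items()}
```

Because Artist_Popularity is derived from the prediction target, it would typically be computed on training rows only so that test-set popularity does not leak into the feature.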
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from umap import UMAP

# 1. Load SBERT model (GPU-accelerated; pass device='cpu' if no GPU is available)
model = SentenceTransformer('all-mpnet-base-v2', device='cuda')
# 2. Generate 768-dimensional embeddings from lyrics (lyrics_list is a list of lyric strings)
embeddings = model.encode(lyrics_list, batch_size=32)
# 3. Reduce dimensions: PCA (768 → 300) → UMAP (300 → 2)
pca = PCA(n_components=300)
embeddings_pca = pca.fit_transform(embeddings)
umap_model = UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
embeddings_2d = umap_model.fit_transform(embeddings_pca)

def preprocess_lyrics(text):
    # 1. Remove text within parentheses
    # 2. Convert to lowercase
    # 3. Replace line breaks with ' / '
    # 4. Remove special characters (keep essential punctuation)
    # 5. Remove excessive repetitions (limit to 2)
    # 6. Tokenize and lemmatize
    # 7. Remove duplicate lines
    return cleaned_text

# Using RapidFuzz + RecordLinkage
# Blocking on first 2 characters + phonetic (Double Metaphone)
# Jaro-Winkler similarity threshold: 85%
# Final match threshold: 80%

| Model | MSE | R² | Improvement vs Baseline |
|---|---|---|---|
| Linear Regression | 368.78 | 0.512 | +103× |
| Ridge Regression | 368.77 | 0.512 | +103× |
| Lasso Regression | 368.76 | 0.512 | +103× |
| Random Forest | 473.25 | 0.373 | +75× |
| Gradient Boosting | 380.86 | 0.496 | +100× |
| XGBoost (Tuned) | 377.18 | 0.501 | +101× |
| Keras Neural Net | 471.68 | 0.376 | +76× |
| Baseline [1] | 883.80 | -0.005 | n/a |
Key Finding: Our best models (Linear, Ridge, and Lasso regression) achieved R² = 0.512, versus R² = -0.005 for the baseline.
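For readers unfamiliar with the two metrics, the sketch below computes MSE and R² by hand on invented numbers. It is meant only to show why a baseline can score an R² slightly below zero: a model that predicts a constant which differs from the test-set mean does worse than predicting that mean. The values are illustrative and unrelated to the project's data.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_test = [20, 40, 60, 80]                      # invented popularity scores, mean 50
r2_test_mean = r2_score(y_test, [50] * 4)      # predicting the test mean gives 0
r2_other_mean = r2_score(y_test, [45] * 4)     # a different constant gives a negative value
```

An R² of 0.512 therefore means the model's squared error is roughly half of what predicting the mean popularity would produce on the same test set.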
| Configuration | Silhouette Score | Interpretation |
|---|---|---|
| Time Period Only (Benchmark) | 0.8084 | High separation, low semantic richness |
| sBERT Only | 0.4286 | Semantic grouping, more overlap |
| sBERT + Audio Features | 0.7464 | Best balance of meaning & separation |
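The silhouette score ranges from -1 to 1 and compares, for each point, the mean distance to its own cluster with the mean distance to the nearest other cluster. The following is a small pure-Python sketch of that definition with Euclidean distance, included for intuition only; the project notebooks presumably rely on a library implementation, and the points below are invented.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least two clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    total = 0.0
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
well_separated = silhouette_score(points, [0, 0, 1, 1])  # clusters match the geometry
mixed_up = silhouette_score(points, [0, 1, 0, 1])        # clusters cut across it
```

Tight, well-separated groups score close to 1, while labels that cut across the natural grouping push the score below zero.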
Top Features Influencing Popularity:
1. Artist_Popularity: 58.3%
2. Speechiness: 9.9%
3. Acousticness: 9.8%
4. Loudness: 9.8%
5. Liveness: 9.5%
Songs were binned into musical eras for temporal analysis:
| Period | Era Name | Song Count |
|---|---|---|
| 1950-1959 | Birth of Rock 'n' Roll | ~200 |
| 1960-1969 | Cultural Revolution | ~400 |
| 1970-1979 | Rise of Diverse Genres | ~600 |
| 1980-1989 | MTV Era and Electronic Explosion | ~1,200 |
| 1990-1993 | The End of an Era | ~800 |
| 1994-1996 | Expansion and Mainstream Success | ~900 |
| 1997-1999 | Technological Advancements | ~1,100 |
| 2000-2009 | Digital Revolution | ~2,500 |
| 2010-2019 | Streaming and Global Connectivity | ~2,300 |
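The era binning in the table can be expressed as a lookup on the release year. This is a minimal sketch; `era_for_year` is a hypothetical helper name, and returning `None` for years outside 1950-2019 is an assumption, since the table does not say how such songs are handled.

```python
from bisect import bisect_right

# First year of each era, in ascending order, with the matching era names from the table.
ERA_STARTS = [1950, 1960, 1970, 1980, 1990, 1994, 1997, 2000, 2010]
ERA_NAMES = [
    "Birth of Rock 'n' Roll",
    "Cultural Revolution",
    "Rise of Diverse Genres",
    "MTV Era and Electronic Explosion",
    "The End of an Era",
    "Expansion and Mainstream Success",
    "Technological Advancements",
    "Digital Revolution",
    "Streaming and Global Connectivity",
]

def era_for_year(year):
    """Return the era name for a release year, or None outside 1950-2019."""
    if not 1950 <= year <= 2019:
        return None
    # bisect_right counts how many era start years are <= year.
    return ERA_NAMES[bisect_right(ERA_STARTS, year) - 1]
```

Note that the 1990s are split into three shorter bins, so the lookup uses explicit start years rather than integer division by decade.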
- Real-time Streaming Integration: Spotify API for live recommendations
- Enhanced NLP: Sentiment analysis, emotion detection per verse/chorus
- Production Deployment: FastAPI REST API, Docker containerization
- Advanced Modeling: Graph Neural Networks for artist relationships
- Multi-modal Fusion: Combine audio waveforms + lyrics + album art
- An Nguyen (@aqn96) - Lead Developer, ML Engineering
1. Joe Beach Capital. (2023). Top 10,000 Songs: EDA and Models. Kaggle.
2. Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP.
3. McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv.
4. Spotify Web API Documentation. Audio Features.
This project is licensed under the MIT License - see the LICENSE file for details.
⭐ If you found this project useful, please consider giving it a star! ⭐