Review-Based Recommender Systems: Modular optimization to improve ranking while reducing computational and time costs
This repository contains the official implementation of CARU (Cross-Attention Review-to-User) and the TIL (Triplet Importance Learning) framework, as presented in the Master's Thesis at Politecnico di Bari (2024/2025).
This project introduces a modular ranking-based approach designed to overcome the limitations of traditional error-based models (MSE) and standard BPR. It focuses on computational efficiency and ranking quality using two main components:
- CARU (Cross-Attention Review-to-User): A lightweight encoder implementing the Contextualized BPR (cBPR) paradigm. It utilizes Multi-Head Cross-Attention to enrich user and item embeddings with semantic information from reviews (or a learnable null token).
- TIL (Triplet Importance Learning): An auxiliary module trained via Bilevel Optimization. It dynamically learns importance weights ($w_{uij}$) for training triplets to mitigate noise and improve the generalization of the target recommender system (see the objective sketched below).

Reference: Wu, H. (n.d.). A Triplet Importance Learning Framework for Recommendation with Implicit Feedback.
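One plausible reading of how these weights enter training (an assumption based on the standard BPR formulation, not a transcription of the thesis): TIL scales each BPR term by its learned importance.

$$
\mathcal{L}_{\text{cBPR}} = -\sum_{(u,i,j)} w_{uij} \, \log \sigma\!\left(\hat{x}_{ui} - \hat{x}_{uj}\right)
$$

where $\hat{x}_{ui}$ is the score for the observed (positive) item, $\hat{x}_{uj}$ the score for the sampled negative, and $w_{uij}$ the per-triplet weight produced by TIL through bilevel optimization.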
- `main_NEI.py`: The entry point for training, grid search, and testing. It uses Google Fire for CLI interaction.
- `CARU.py`: Contains the `CrossAttentionReview2User` encoder architecture.
- `TIL.py`: Contains the `TIL` module for generating triplet importance weights.
- `context_BPR.py`: The orchestrator model that combines encoders and computes the cBPR scores.
- `dataset.py`: Handles data loading for the Amazon 5-core datasets.
- `prepare.py`: Script for converting JSON datasets into optimized binary memory-mapped files.
- Python 3.11+
- PyTorch (CUDA recommended)
- Pandas, NumPy, Scikit-learn, Tqdm, Fire
The experiments are conducted on the Amazon 5-core datasets. To replicate the study, the data must be pre-processed following a strict pipeline to ensure compatibility with the prepare.py script.
To ensure fair comparison with baselines, you must first split the dataset using the native splitting function provided by the DIRECT framework (MIT License).
- Reference: JacksonWuxs/DIRECT
- Output: This will generate Train/Validation/Test split files (JSON Lines).
The model requires semantic information from reviews. You must enrich the split files by adding review embeddings generated by OpenAI text-embedding-3-small (1536-dim).
Crucial Formatting Requirement:
The prepare.py script is optimized to read binary data directly. You must format the embedding vector as a Base64 encoded string within the JSON object.
- Key Name: `"tokens"`
- Value Format: Base64-encoded string of the float32 array.
Handling Missing Reviews (Null Token Strategy): If a review is missing or unavailable, you must insert the Base64 representation of a 1536-dimensional vector of zeros (float32).
- Visual Check: This string will appear essentially as a long sequence of "A" characters.
- Model Logic: When the data loader encounters this zero-vector, the model detects the absence of context and automatically substitutes a specific, learnable "null token" for the missing review information (a minimal sketch follows).
Tip: You can generate the null token string in Python using:
```python
import base64
import numpy as np

null_token = base64.b64encode(np.zeros(1536, dtype=np.float32).tobytes()).decode('utf-8')
```
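For the enrichment step itself, a minimal sketch along these lines produces records in the format shown below. The file names and the `encode_review` helper are placeholders, and the embedding call assumes the current `openai` Python SDK; adapt it to your own pipeline.

```python
import base64
import json

import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
DIM = 1536
NULL_TOKEN = base64.b64encode(np.zeros(DIM, dtype=np.float32).tobytes()).decode("utf-8")

def encode_review(text: str) -> str:
    """Return the Base64 float32 embedding of a review, or the null token if it is empty."""
    if not text or not text.strip():
        return NULL_TOKEN
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    vec = np.asarray(resp.data[0].embedding, dtype=np.float32)
    return base64.b64encode(vec.tobytes()).decode("utf-8")

with open("train_split.json", encoding="utf-8") as fin, \
        open("train_enriched.json", "w", encoding="utf-8") as fout:
    for line in fin:
        obj = json.loads(line)
        obj["tokens"] = encode_review(obj.get("reviewText", ""))
        fout.write(json.dumps(obj) + "\n")
```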
Example of expected JSON line structure:
```json
{
  "reviewerID": "A1HK2FQW6KXQB2",
  "asin": "097293751X",
  "overall": 5.0,
  "reviewText": "Perfect for new parents. We were able to keep...",
  "tokens": "KRmRPNlvnjtvdq+8718KPJ..."
}
```

Example of a Missing Review (Null Token Case):
In this scenario, reviewText is empty (or ignored), and tokens contains the zero-vector.
```json
{
  "reviewerID": "A1J1GBJREDREP",
  "asin": "B00563XRYM",
  "overall": 1.0,
  "reviewText": "",
  "tokens": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA..."
}
```

Why Base64? As seen in `prepare.py`, the loader uses `np.frombuffer(base64.b64decode(...))` to parse the binary data directly. Storing 1536 floats as plain-text JSON lists would drastically increase file size and parsing time.
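To sanity-check an enriched file, you can reproduce the decoding side described above (the file name is a placeholder):

```python
import base64
import json

import numpy as np

with open("train_enriched.json", encoding="utf-8") as f:
    obj = json.loads(f.readline())

vec = np.frombuffer(base64.b64decode(obj["tokens"]), dtype=np.float32)
assert vec.shape == (1536,), vec.shape  # every record must decode to a 1536-dim float32 vector
```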
Clone the repository and install the required dependencies:
```bash
git clone https://github.com/PeppeJerry/Cross-Attention-Review-to-User-CARU-.git CARU_TIL
cd CARU_TIL
pip install -r requirements.txt
```