EN.601.773 Machine Social Intelligence, Spring 2026 — Johns Hopkins University
I investigate whether a persistent caregiver–child relationship improves knowledge transfer in language agents. A Qwen3-235B caregiver teaches a Qwen3-8B child across 160 household tasks using a cognitively inspired memory architecture and a salience-gated consolidation mechanism. Caregiver-assisted agents achieve 100% training success with 25% fewer turns, but this advantage does not transfer to independent evaluation — mirroring the scaffolding-dependency phenomenon from developmental psychology.
- H2: Habit Acceleration — Caregiver conditions achieve 100% success and 160/160 LoRA updates vs. ~130 for Solo/Peer.
- Teaching Efficiency — Caregiver: 6.1 turns avg vs. Solo/Peer: 8.4 turns (a 25% reduction).
| Metric | Solo | Sym. Peer | Role-Labeled | Relational |
|---|---|---|---|---|
| H1 Transfer Accuracy | 0.686 ± 0.046 | 0.686 ± 0.078 | 0.708 ± 0.033 | 0.682 ± 0.087 |
| Training Success Rate | 81.5% | 84.4% | 100% | 99.8% |
| Avg. Turns to Complete | 8.36 | 8.52 | 6.10 | 6.42 |
| Total LoRA Updates | 130 | 135 | 160 | 160 |
- Python 3.10+
- Tinker API key
git clone https://github.com/rishi-more-2003/asym-rel-eff-kt.git
cd asym-rel-eff-kt
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate on Windows
pip install -r requirements.txt

Create a .env file in the project root:
TINKER_API_KEY="your-tinker-api-key-here"
All hyperparameters are centralized in config.py.
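As an illustration of this layout, here is a hypothetical sketch of what config.py might contain; the variable names below are assumptions, and only the quoted values (LoRA rank 16, lr 2e-5, k=8 working-memory turns) come from elsewhere in this README. It assumes the key has been exported into the environment (e.g. by python-dotenv or your shell loading .env):

```python
# Hypothetical sketch of config.py; the real file may differ.
import os

# Read the Tinker API key from the environment (populated from .env).
TINKER_API_KEY = os.environ.get("TINKER_API_KEY", "")

# Hyperparameters quoted elsewhere in this README.
LORA_RANK = 16
LEARNING_RATE = 2e-5
WORKING_MEMORY_TURNS = 8
```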
python run_generate_tasks.py

Runs the multi-stage generation pipeline (ontology → skeletons → expansion → verification → filtering) to produce 197 household tasks. Output: data/task_database.json.
# Run full pipeline (training + evaluation)
python run_experiment.py all
# Or run phases separately
python run_experiment.py train
python run_experiment.py eval

This launches 12 concurrent training runs (4 conditions × 3 seeds) via asyncio, then evaluates on 40 held-out tasks.
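The concurrent launch can be pictured as the following minimal asyncio sketch. The function and variable names are stand-ins, not the repo's actual API, and the stub body replaces the real training loop (which calls the Tinker API and writes metrics):

```python
# Hypothetical sketch of the 4 conditions x 3 seeds concurrent launch.
import asyncio

CONDITIONS = ["solo", "symmetric_peer", "role_labeled", "relational"]
SEEDS = [0, 1, 2]

async def run_training(condition: str, seed: int) -> str:
    # Stand-in for one full training run (160 tasks, LoRA updates, logging).
    await asyncio.sleep(0)
    return f"{condition}_seed{seed}"

async def run_all() -> list[str]:
    # 4 conditions x 3 seeds = 12 runs, gathered concurrently.
    jobs = [run_training(c, s) for c in CONDITIONS for s in SEEDS]
    return await asyncio.gather(*jobs)

runs = asyncio.run(run_all())
```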
python analyze_results.py

Produces publication-quality figures in documentation/figures/.
├── config.py # Centralized hyperparameters
├── run_generate_tasks.py # Task generation entry point
├── run_experiment.py # Training + evaluation entry point
├── analyze_results.py # Figure generation
├── requirements.txt
│
├── src/
│ ├── agents/
│ │ ├── caregiver.py # 235B caregiver agent
│ │ └── child.py # 8B child agent with LoRA
│ ├── memory/
│ │ ├── instinct.py # M1: Fixed behavioral priors
│ │ ├── working_memory.py # M2: Sliding window (k=8)
│ │ └── long_term.py # M3: Episodic store + BM25 retrieval
│ ├── salience.py # Salience signal (novelty + pred. error + teaching)
│ ├── trainer.py # M4: Salience-weighted LoRA SFT
│ ├── episode.py # Async episode runner
│ ├── training.py # Concurrent training loop
│ ├── evaluation.py # H1/H2/H3 evaluation suite
│ ├── judge.py # Semantic action judge (235B)
│ ├── scaffolding.py # Adaptive scaffolding controller
│ ├── child_model.py # Caregiver's model of the child
│ ├── curriculum.py # Difficulty-ordered curriculum
│ ├── conditions.py # Experimental condition configs
│ ├── bm25.py # Dependency-free BM25
│ ├── reward.py # Reward computation
│ ├── metrics_logger.py # JSONL metrics logging
│ ├── tinker_utils.py # Tinker API utilities
│ ├── ontology.py # Object ontology generation
│ ├── generate_skeletons.py # Task skeleton generation
│ ├── expand_tasks.py # Full task expansion
│ ├── verify_tasks.py # Self-verification loop
│ └── filter_tasks.py # Dedup + balancing + train/eval split
│
├── data/
│ ├── task_database.json # Generated task database (197 tasks)
│ ├── object_ontology.json # Household object ontology
│ ├── task_skeletons.json # Intermediate skeletons
│ └── runs/ # Experiment outputs
│ ├── evaluation_results.json
│ ├── {condition}_seed{n}/
│ │ ├── metrics.jsonl # Per-episode metrics
│ │ ├── ltm.json # Long-term memory state
│ │ └── transcripts/ # Full episode dialogues
│ └── ...
│
└── documentation/
├── final_report.tex # Final report
├── presentation.tex # Beamer slides
├── bibliography.bib # References
└── figures/ # Generated figures (PDF + PNG)
| Module | Implementation | Updated? |
|---|---|---|
| M1. Instinct Buffer | Fixed system prompt (role priors) | Never |
| M2. Working Memory | Last k=8 dialogue turns | Every turn |
| M3. Long-Term Memory | Episodic store, BM25 retrieval, 235B compression | If salience > τ |
| M4. Habit Store | LoRA adapter (rank 16, lr=2e-5, batch ≥ 4) | Salience-weighted SFT |
- Novelty (α=0.3): Category frequency decay + BM25 distance to LTM
- Prediction error (β=0.4): Rescorla-Wagner surprise + ZPD match (Gaussian at r=0.5)
- Teaching signal (γ=0.3): Productive struggle patterns + effort ratio; γ=0 without caregiver
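The gating of M3 and the weighting of M4 both hinge on this composite scalar. A minimal sketch of how the three components might be combined and thresholded, assuming the weights above; the component inputs and the ZPD `width` parameter are stand-ins for what src/salience.py actually computes:

```python
# Hypothetical sketch of salience combination and salience-gated consolidation.
import math

ALPHA, BETA, GAMMA = 0.3, 0.4, 0.3  # weights from the list above

def zpd_match(reward: float, center: float = 0.5, width: float = 0.2) -> float:
    # Gaussian bump peaking at intermediate reward r=0.5 (the ZPD term);
    # `width` is an assumed parameter, not from the repo.
    return math.exp(-((reward - center) ** 2) / (2 * width ** 2))

def salience(novelty: float, pred_error: float, teaching: float,
             has_caregiver: bool) -> float:
    # Teaching term is switched off (gamma = 0) in Solo and Peer conditions.
    gamma = GAMMA if has_caregiver else 0.0
    return ALPHA * novelty + BETA * pred_error + gamma * teaching

def consolidate(episode: dict, ltm: list, s: float, tau: float = 0.5) -> None:
    # Salience-gated write into long-term memory (M3): store only if s > tau.
    if s > tau:
        ltm.append(episode)
```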
| Condition | Agents | Teaching Signal |
|---|---|---|
| Solo | 8B child alone | γ = 0 |
| Symmetric Peer | Two 8B agents | γ = 0 |
| Role-Labeled | 235B caregiver + 8B child | γ = 0 |
| Relational | 235B caregiver + 8B child | γ = 0.3 |
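The table above can be mirrored as a small configuration mapping, presumably close in spirit to src/conditions.py; the dictionary and field names here are assumptions for illustration:

```python
# Hypothetical mirror of the experimental conditions table.
CONDITIONS = {
    "solo":           {"caregiver": None,         "gamma": 0.0},
    "symmetric_peer": {"caregiver": "Qwen3-8B",   "gamma": 0.0},
    "role_labeled":   {"caregiver": "Qwen3-235B", "gamma": 0.0},
    "relational":     {"caregiver": "Qwen3-235B", "gamma": 0.3},
}
```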
- Transfer by Difficulty — All conditions handle easy tasks well; performance degrades similarly on hard tasks.
- Curriculum Progression
- Adaptive Scaffolding
- Salience & LTM Growth — Salience decays as tasks become familiar; caregiver conditions accumulate more LTM entries.
- Category Heatmap — Transfer accuracy by condition and task category. No condition dominates all categories.
This project uses the Tinker API for LLM inference and LoRA fine-tuning. Experiments were run on the JHU CS research compute cluster. Total API cost: ~$45.