
Graph persistence race condition: concurrent walker commits cause node/edge loss #5446

@ahzan-dev

Description

Summary

Concurrent walker invocations that modify the same graph node cause silent data loss due to a last-writer-wins race condition in the persistence layer. The entire node document (including its edge list) is replaced unconditionally at commit time, so the last walker to commit overwrites changes made by earlier concurrent walkers.

Reproduction

  1. Start a Jac application with jac start (or jac start --scale)
  2. Create a walker that adds an edge to the user root: profile ++> MyNode(...)
  3. Immediately fire 10+ concurrent walker requests that read/mutate the same profile node (e.g., updating a last_accessed timestamp)
  4. After all requests complete, query [profile -->(?:MyNode)]
  5. Result: The newly created MyNode edge is missing — it was overwritten by a concurrent walker's stale snapshot

Concrete example (Jac Builder IDE)

# User creates a project (adds edge to profile node)
POST /walker/project_ops  {action: "create", name: "my-project"}
# → success, Project node created, edge added to profile

# Frontend navigates to IDE, fires ~12 concurrent requests:
POST /walker/project_ops  {action: "open", project_id: "prj_123"}   # mutates profile.updated_at
POST /walker/git_ops      {action: "status", project_id: "prj_123"}
POST /walker/ide_file_ops {action: "setup", project_id: "prj_123"}
POST /walker/ai_chat      {action: "load_history", ...}
POST /walker/git_ops      {action: "branches", ...}
# ... 7 more concurrent requests

# User navigates back to dashboard
POST /walker/project_ops  {action: "list"}
# → Project is GONE. Only previously-existing projects remain.

This reproduces reliably: the more concurrent requests in flight, the higher the probability that any given write is lost.

Root Cause

The read-modify-write cycle (no isolation)

Each HTTP request gets its own ExecutionContext with a private L1 memory dict (__mem__). At request end, commit() flushes changed anchors to the persistent store. There is no concurrency control at any level.

The race

Request A (creates edge)              Request B (updates timestamp)
────────────────────────              ──────────────────────────────
1. Load profile from DB
   edges = [existing_proj]
                                      2. Load profile from DB
                                         edges = [existing_proj]  ← SAME snapshot

3. profile ++> NewProject()
   L1 edges = [existing_proj, new_proj]

4. commit() → WRITE to DB
   edges = [existing_proj, new_proj]  ✓
                                      5. profile.updated_at = now
                                         L1 edges = [existing_proj]  ← STALE

                                      6. commit() → WRITE to DB
                                         edges = [existing_proj]  ← OVERWRITES step 4

Result: new_proj edge is permanently LOST
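The interleaving above can be replayed deterministically in plain Python. This is a sketch only; `RequestContext`, `store`, and the method names are illustrative stand-ins, not the actual jaclang classes:

```python
import copy

# Illustrative stand-in for the persistent store (DB).
store = {"profile": {"updated_at": "t0", "edges": ["existing_proj"]}}

class RequestContext:
    """Each request loads a private L1 snapshot and commits it wholesale."""
    def __init__(self):
        self.l1 = {}

    def load(self, node_id):
        # Deep copy: the L1 snapshot is isolated from the store.
        self.l1[node_id] = copy.deepcopy(store[node_id])
        return self.l1[node_id]

    def commit(self):
        # Last-writer-wins: unconditional full-document replace.
        for node_id, doc in self.l1.items():
            store[node_id] = doc

# Steps 1-2: both requests load the same snapshot.
a, b = RequestContext(), RequestContext()
profile_a = a.load("profile")
profile_b = b.load("profile")

# Steps 3-4: A adds an edge and commits.
profile_a["edges"].append("new_proj")
a.commit()
assert "new_proj" in store["profile"]["edges"]  # A's write landed

# Steps 5-6: B mutates only a scalar field, but commits its stale edge list.
profile_b["updated_at"] = "t1"
b.commit()

print(store["profile"]["edges"])  # → ['existing_proj'] — new_proj is lost
```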

Affected persistence backends

MongoDB (jac-scale):

# memory_hierarchy.mongo.impl.jac, MongoBackend.put(), lines 71-87
self.collection.update_one(
    {'_id': str(_id)},
    {'$set': {'data': data, 'type': type(anchor).__name__}},
    upsert=True    # ← unconditional full-document replace
)

SQLite (jaclang):

# memory.impl.jac, SqliteMemory.sync(), line ~380
INSERT OR REPLACE INTO anchors (id, type, data) VALUES (?, ?, ?)
# ← unconditional full-row replace

Both backends replace the entire serialized anchor (including embedded edge list) with whatever the committing request had in its private L1 memory. No version check, no merge, no conflict detection.

Why there's no protection

  • No optimistic locking: No version field on anchors, no conditional write (WHERE version = ?)
  • No pessimistic locking: No threading.Lock, asyncio.Lock, or database-level advisory locks anywhere in the persistence path
  • No atomic edge operations: Edges are embedded in the parent node's serialized blob, so any edge mutation requires replacing the entire document
  • Double commit (doesn't help): Both the middleware (jfast_api.impl.jac:865) and spawn_walker (server.impl.jac:305) call commit(), but both use the same stale L1 snapshot

Relevant Files

  • jac/jaclang/runtimelib/impl/memory.impl.jac (line ~340): SqliteMemory.sync, INSERT OR REPLACE with no version check
  • jac-scale/jac_scale/impl/memory_hierarchy.mongo.impl.jac (line 71): MongoBackend.put, unconditional $set upsert
  • jac-scale/jac_scale/jserver/impl/jfast_api.impl.jac (line 846): request_context_middleware, creates per-request context and commits at end
  • jac/jaclang/runtimelib/impl/server.impl.jac (line 257): spawn_walker, loads root, runs walker, commits
  • jac/jaclang/jac0core/runtime.jac (line 82): _request_context ContextVar, per-coroutine isolation (correct, but insufficient)

Minimal Reproducing Example + Validated Result

Files

server.jac — 3 walkers: add_item (creates edge), touch (concurrent read-modify-write), list_items (verify)

  """Minimal reproduction of graph persistence race condition."""

  import from datetime { datetime }

  node Item {
      has name: str = "";
      has created_at: str = "";
  }

  """Create a new Item node connected to user root."""
  walker add_item {
      has name: str = "test";

      can do with Root entry {
          now = datetime.utcnow().isoformat() + "Z";
          item = here ++> Item(name=self.name, created_at=now);
          report {"success": True, "name": self.name, "created_at": now};
      }
  }

  """List all Item nodes on user root."""
  walker list_items {
      can do with Root entry {
          items: list = [];
          for item in [-->(?:Item)] {
              items.append({"name": item.name, "created_at": item.created_at});
          }
          report {"success": True, "count": len(items), "items": items};
      }
  }

  node Counter {
      has value: int = 0;
  }

  """Simulate a concurrent read-modify-write on the root node.
  This walker reads children (loading root's edge list into L1),
  mutates a counter node, then commits — writing the stale edge list back."""
  walker touch {
      has ts: str = "";

      can do with Root entry {
          items = [-->(?:Item)];
          counters = [-->(?:Counter)];
          if len(counters) == 0 {
              c = here ++> Counter(value=1);
          } else {
              counters[0].value = counters[0].value + 1;
          }
          report {"success": True, "item_count": len(items), "ts": self.ts};
      }
  }

test_race.py — registers a user, runs 5 rounds of (create item + 15 concurrent touch requests), checks for data loss

  import requests, concurrent.futures, time, sys, uuid

  BASE = "http://localhost:8000"  # adjust port if needed
  CONCURRENT_TOUCHES = 15
  ROUNDS = 5

  def register_and_login():
      username = f"test_{uuid.uuid4().hex[:8]}@example.com"
      password = "TestPass123!"
      requests.post(f"{BASE}/user/register", json={"username": username, "password": password})
      r = requests.post(f"{BASE}/user/login", json={"username": username, "password": password})
      data = r.json()
      return data.get("token") or data.get("access_token") or ""

  def walker_call(token, walker, payload=None):
      r = requests.post(f"{BASE}/walker/{walker}", json=payload or {},
                        headers={"Authorization": f"Bearer {token}"})
      return r.json()

  def extract_report(body):
      if isinstance(body, dict):
          reports = (body.get("data") or {}).get("reports") or body.get("reports") or []
          if reports: return reports[0]
          if "success" in body: return body
      return body

  def main():
      print("=" * 60)
      print("Graph Persistence Race Condition Reproducer")
      print("=" * 60)
      token = register_and_login()
      print(f"Token: {token[:20]}...")
      r = extract_report(walker_call(token, "list_items"))
      initial_count = r.get("count", 0)
      lost_count = 0
      total_created = 0

      for round_num in range(1, ROUNDS + 1):
          item_name = f"item_round_{round_num}"
          print(f"\n[Round {round_num}] add_item('{item_name}') + {CONCURRENT_TOUCHES} concurrent touches...")
          total_created += 1

          with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENT_TOUCHES + 1) as pool:
              create_future = pool.submit(walker_call, token, "add_item", {"name": item_name})
              touch_futures = [pool.submit(walker_call, token, "touch", {"ts": str(i)})
                             for i in range(CONCURRENT_TOUCHES)]
              concurrent.futures.wait([create_future] + touch_futures)

          create_resp = extract_report(create_future.result())
          print(f"  add_item: {create_resp}")
          time.sleep(0.5)

          list_resp = extract_report(walker_call(token, "list_items"))
          current_count = list_resp.get("count", 0)
          expected_count = initial_count + round_num
          items = [i.get("name") for i in list_resp.get("items", [])]

          if item_name not in items:
              lost_count += 1
              print(f"  *** LOST! '{item_name}' not in list ***")
          else:
              print(f"  OK - '{item_name}' survived")
          if current_count < expected_count:
              print(f"  Expected {expected_count} items, got {current_count} — {expected_count - current_count} missing!")

      print(f"\n{'=' * 60}")
      print(f"RESULTS: {total_created} created, {lost_count} lost")
      if lost_count > 0:
          print(f"*** BUG CONFIRMED: {lost_count}/{total_created} items silently lost ***")
      print(f"Final: {extract_report(walker_call(token, 'list_items'))}")

  if __name__ == "__main__":
      main()

How to reproduce

1. Create a directory with the files above

mkdir race-repro && cd race-repro

2. Start server

jac start server.jac

3. In another terminal, run test (adjust BASE port if needed)

pip install requests
python test_race.py

  Validated result

  ============================================================
  Graph Persistence Race Condition Reproducer
  ============================================================

  [Round 1] OK - 'item_round_1' survived
  [Round 2] OK - 'item_round_2' survived
  [Round 3] OK - 'item_round_3' survived
  [Round 4] *** LOST! 'item_round_4' not in list ***
            Expected 4 items, got 3 — 1 missing!
  [Round 5] OK - 'item_round_5' survived
            Expected 5 items, got 4 — 1 missing!

  ============================================================
  RESULTS: 5 created, 1 lost
  *** BUG CONFIRMED: 1/5 items silently lost ***

item_round_4 was created successfully (server returned success: True) but was permanently erased from the graph by a concurrent touch walker's commit overwriting the root node's edge list with a stale snapshot.

Suggested Fixes

Option 1: Per-user-root async lock (quick fix)

Add an asyncio.Lock keyed by user root ID so two walkers for the same user cannot commit simultaneously. Low complexity, prevents the race entirely for single-pod deployments.
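A minimal sketch of this approach, assuming a single-process async server (lock registry and function names are hypothetical, not existing jaclang APIs, and this does nothing for multi-worker deployments):

```python
import asyncio
from collections import defaultdict

# Hypothetical per-root lock registry: one asyncio.Lock per user root ID.
_root_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

# Illustrative stand-in for the persistent store.
store = {"root1": {"edges": []}}

async def run_walker(root_id: str, edge: str):
    # Hold the lock across the whole read-modify-write cycle, not just
    # the commit, so no two walkers ever work from the same snapshot.
    async with _root_locks[root_id]:
        snapshot = dict(store[root_id])
        snapshot["edges"] = snapshot["edges"] + [edge]
        await asyncio.sleep(0)      # yield point: another task would run here
        store[root_id] = snapshot   # commit

async def main():
    # 20 concurrent walkers against the same root.
    await asyncio.gather(*(run_walker("root1", f"e{i}") for i in range(20)))

asyncio.run(main())
print(len(store["root1"]["edges"]))  # → 20: no edge is lost
```

Without the `async with` block, the same 20 tasks would interleave at the `sleep(0)` yield point and most edges would vanish, exactly as in the bug.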

Option 2: Optimistic locking with retry (proper fix)

Add a _version field to every anchor. Change writes to:

# MongoDB
result = collection.update_one(
    {'_id': str(_id), '_v': old_version},
    {'$set': {'data': data}, '$inc': {'_v': 1}}
)
if result.modified_count == 0:
    # Conflict — reload, re-merge, retry
    ...

-- SQLite
UPDATE anchors SET data = ?, version = version + 1
WHERE id = ? AND version = ?
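A fuller retry loop around the conditional write could look like the sketch below (against SQLite for concreteness; the schema, `commit_anchor`, and the simulated rival writer are assumptions for illustration, not the actual jaclang code):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE anchors (id TEXT PRIMARY KEY, data TEXT, version INTEGER)")
conn.execute("INSERT INTO anchors VALUES ('profile', ?, 0)",
             (json.dumps({"edges": ["existing_proj"]}),))

def commit_anchor(anchor_id, mutate, max_retries=5):
    """Read-modify-write with optimistic locking: the UPDATE only
    succeeds if the row still has the version we read."""
    for _ in range(max_retries):
        row = conn.execute(
            "SELECT data, version FROM anchors WHERE id = ?", (anchor_id,)
        ).fetchone()
        doc, version = json.loads(row[0]), row[1]
        mutate(doc)
        cur = conn.execute(
            "UPDATE anchors SET data = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (json.dumps(doc), anchor_id, version),
        )
        if cur.rowcount == 1:
            return doc  # conditional write won
        # rowcount == 0: another writer committed first; reload and retry
    raise RuntimeError("too many conflicting writers")

# Simulate a rival commit landing between our read and our write,
# on the first attempt only:
conflicted = {"done": False}

def add_edge(doc):
    if not conflicted["done"]:
        conflicted["done"] = True
        # Rival writer bumps the version behind our back.
        conn.execute("UPDATE anchors SET version = version + 1 WHERE id = 'profile'")
    doc["edges"].append("new_proj")

result = commit_anchor("profile", add_edge)
print(result["edges"])  # → ['existing_proj', 'new_proj'] after one retry
```

The first attempt's conditional UPDATE matches zero rows (the version moved from 0 to 1 underneath it), so the loop reloads and retries; the second attempt commits cleanly. Unlike the current code path, the stale snapshot is detected instead of silently winning.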

Option 3: Separate edge storage (best fix)

Move edges out of the parent node's serialized blob into a separate collection/table. Use atomic operations ($addToSet, $pull in MongoDB; separate INSERT/DELETE in SQLite) for edge mutations. This eliminates the need to replace the entire parent document when edges change.
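A sketch of what this could look like on the SQLite side (the schema and helper names are illustrative, not a proposal for the exact jaclang tables):

```python
import sqlite3

# Edges live in their own table, so adding or removing an edge is a
# row-level atomic operation that never rewrites the parent node's blob.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id TEXT PRIMARY KEY, data TEXT);
    CREATE TABLE edges (
        src TEXT NOT NULL,
        dst TEXT NOT NULL,
        PRIMARY KEY (src, dst)   -- duplicate inserts become no-ops
    );
""")
conn.execute("INSERT INTO nodes VALUES ('profile', '{}')")

def add_edge(src, dst):
    # Atomic per row: a concurrent writer cannot clobber this edge,
    # because no writer ever reads and rewrites the full edge list.
    conn.execute("INSERT OR IGNORE INTO edges VALUES (?, ?)", (src, dst))

def remove_edge(src, dst):
    conn.execute("DELETE FROM edges WHERE src = ? AND dst = ?", (src, dst))

# Two independent writers each touch their own edge; neither can lose
# the other's write.
add_edge("profile", "prj_old")
add_edge("profile", "prj_new")
remove_edge("profile", "prj_old")

rows = conn.execute("SELECT dst FROM edges WHERE src = 'profile'").fetchall()
print([r[0] for r in rows])  # → ['prj_new']
```

The `touch` walker in the reproducer would then commit only the node's scalar fields; the edge created by `add_item` would be untouchable by its commit.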

Impact

  • Data loss: Graph nodes and edges silently disappear
  • Silent: No errors, no warnings — the overwriting commit succeeds
  • Affects all applications: Any Jac app that handles concurrent requests touching the same graph nodes
  • Worse at scale: Multi-worker deployments (--scale with Gunicorn) have completely independent in-memory state with zero inter-process coordination

Environment

  • jaclang: installed from jaseci-labs/jaseci@main
  • jac-scale: installed from jaseci-labs/jaseci@main
  • Python: 3.12
  • OS: Linux (Ubuntu)
  • Server: uvicorn (single worker, async)
