
Graph persistence race condition: concurrent walker commits cause node/edge loss #5446

@ahzan-dev

Description

Summary

Concurrent walker invocations that modify the same graph node cause silent data loss due to a last-writer-wins race condition in the persistence layer. The entire node document (including its edge list) is replaced unconditionally at commit time, so the last walker to commit overwrites changes made by earlier concurrent walkers.

Reproduction

  1. Start a Jac application with jac start (or jac start --scale)
  2. Create a walker that adds an edge to the user root: profile ++> MyNode(...)
  3. Immediately fire 10+ concurrent walker requests that read/mutate the same profile node (e.g., updating a last_accessed timestamp)
  4. After all requests complete, query [profile -->(?:MyNode)]
  5. Result: The newly created MyNode edge is missing — it was overwritten by a concurrent walker's stale snapshot

Concrete example (Jac Builder IDE)

# User creates a project (adds edge to profile node)
POST /walker/project_ops  {action: "create", name: "my-project"}
# → success, Project node created, edge added to profile

# Frontend navigates to IDE, fires ~12 concurrent requests:
POST /walker/project_ops  {action: "open", project_id: "prj_123"}   # mutates profile.updated_at
POST /walker/git_ops      {action: "status", project_id: "prj_123"}
POST /walker/ide_file_ops {action: "setup", project_id: "prj_123"}
POST /walker/ai_chat      {action: "load_history", ...}
POST /walker/git_ops      {action: "branches", ...}
# ... 7 more concurrent requests

# User navigates back to dashboard
POST /walker/project_ops  {action: "list"}
# → Project is GONE. Only previously-existing projects remain.

This reproduces reliably: the more concurrent requests in flight, the higher the probability that any given write is lost.

Root Cause

The read-modify-write cycle (no isolation)

Each HTTP request gets its own ExecutionContext with a private L1 memory dict (__mem__). At request end, commit() flushes changed anchors to the persistent store. There is no concurrency control at any level.

The race

Request A (creates edge)              Request B (updates timestamp)
────────────────────────              ──────────────────────────────
1. Load profile from DB
   edges = [existing_proj]
                                      2. Load profile from DB
                                         edges = [existing_proj]  ← SAME snapshot

3. profile ++> NewProject()
   L1 edges = [existing_proj, new_proj]

4. commit() → WRITE to DB
   edges = [existing_proj, new_proj]  ✓
                                      5. profile.updated_at = now
                                         L1 edges = [existing_proj]  ← STALE

                                      6. commit() → WRITE to DB
                                         edges = [existing_proj]  ← OVERWRITES step 4

Result: new_proj edge is permanently LOST
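The interleaving above can be replayed deterministically in plain Python. This is a sketch only; `RequestContext`, `store`, and the method names are illustrative stand-ins, not the actual jaclang classes:

```python
import copy

# Illustrative stand-in for the persistent store (DB).
store = {"profile": {"updated_at": "t0", "edges": ["existing_proj"]}}

class RequestContext:
    """Each request loads a private L1 snapshot and commits it wholesale."""
    def __init__(self):
        self.l1 = {}

    def load(self, node_id):
        # Deep copy: the L1 snapshot is isolated from the store.
        self.l1[node_id] = copy.deepcopy(store[node_id])
        return self.l1[node_id]

    def commit(self):
        # Last-writer-wins: unconditional full-document replace.
        for node_id, doc in self.l1.items():
            store[node_id] = doc

# Steps 1-2: both requests load the same snapshot.
a, b = RequestContext(), RequestContext()
profile_a = a.load("profile")
profile_b = b.load("profile")

# Steps 3-4: A adds an edge and commits.
profile_a["edges"].append("new_proj")
a.commit()
assert "new_proj" in store["profile"]["edges"]  # A's write landed

# Steps 5-6: B mutates only a scalar field, but commits its stale edge list.
profile_b["updated_at"] = "t1"
b.commit()

print(store["profile"]["edges"])  # → ['existing_proj'] — new_proj is lost
```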

Affected persistence backends

MongoDB (jac-scale):

# memory_hierarchy.mongo.impl.jac, MongoBackend.put(), lines 71-87
self.collection.update_one(
    {'_id': str(_id)},
    {'$set': {'data': data, 'type': type(anchor).__name__}},
    upsert=True    # ← unconditional full-document replace
)

SQLite (jaclang):

# memory.impl.jac, SqliteMemory.sync(), line ~380
INSERT OR REPLACE INTO anchors (id, type, data) VALUES (?, ?, ?)
# ← unconditional full-row replace

Both backends replace the entire serialized anchor (including embedded edge list) with whatever the committing request had in its private L1 memory. No version check, no merge, no conflict detection.

Why there's no protection

  • No optimistic locking: No version field on anchors, no conditional write (WHERE version = ?)
  • No pessimistic locking: No threading.Lock, asyncio.Lock, or database-level advisory locks anywhere in the persistence path
  • No atomic edge operations: Edges are embedded in the parent node's serialized blob, so any edge mutation requires replacing the entire document
  • Double commit (doesn't help): Both the middleware (jfast_api.impl.jac:865) and spawn_walker (server.impl.jac:305) call commit(), but both use the same stale L1 snapshot

Relevant Files

  • jac/jaclang/runtimelib/impl/memory.impl.jac (line ~340): SqliteMemory.sync, INSERT OR REPLACE with no version check
  • jac-scale/jac_scale/impl/memory_hierarchy.mongo.impl.jac (line 71): MongoBackend.put, unconditional $set upsert
  • jac-scale/jac_scale/jserver/impl/jfast_api.impl.jac (line 846): request_context_middleware, creates per-request context and commits at end
  • jac/jaclang/runtimelib/impl/server.impl.jac (line 257): spawn_walker, loads root, runs walker, commits
  • jac/jaclang/jac0core/runtime.jac (line 82): _request_context ContextVar, per-coroutine isolation (correct, but insufficient)

Minimal Reproducing Example + Validated Result

Files

server.jac — 3 walkers: add_item (creates edge), touch (concurrent read-modify-write), list_items (verify)

  """Minimal reproduction of graph persistence race condition."""

  import from datetime { datetime }

  node Item {
      has name: str = "";
      has created_at: str = "";
  }

  """Create a new Item node connected to user root."""
  walker add_item {
      has name: str = "test";

      can do with Root entry {
          now = datetime.utcnow().isoformat() + "Z";
          item = here ++> Item(name=self.name, created_at=now);
          report {"success": True, "name": self.name, "created_at": now};
      }
  }

  """List all Item nodes on user root."""
  walker list_items {
      can do with Root entry {
          items: list = [];
          for item in [-->(?:Item)] {
              items.append({"name": item.name, "created_at": item.created_at});
          }
          report {"success": True, "count": len(items), "items": items};
      }
  }

  node Counter {
      has value: int = 0;
  }

  """Simulate a concurrent read-modify-write on the root node.
  This walker reads children (loading root's edge list into L1),
  mutates a counter node, then commits — writing the stale edge list back."""
  walker touch {
      has ts: str = "";

      can do with Root entry {
          items = [-->(?:Item)];
          counters = [-->(?:Counter)];
          if len(counters) == 0 {
              c = here ++> Counter(value=1);
          } else {
              counters[0].value = counters[0].value + 1;
          }
          report {"success": True, "item_count": len(items), "ts": self.ts};
      }
  }

test_race.py — registers a user, runs 5 rounds of (create item + 15 concurrent touch requests), checks for data loss

  import requests, concurrent.futures, time, sys, uuid

  BASE = "http://localhost:8000"  # adjust port if needed
  CONCURRENT_TOUCHES = 15
  ROUNDS = 5

  def register_and_login():
      username = f"test_{uuid.uuid4().hex[:8]}@example.com"
      password = "TestPass123!"
      requests.post(f"{BASE}/user/register", json={"username": username, "password": password})
      r = requests.post(f"{BASE}/user/login", json={"username": username, "password": password})
      data = r.json()
      return data.get("token") or data.get("access_token") or ""

  def walker_call(token, walker, payload=None):
      r = requests.post(f"{BASE}/walker/{walker}", json=payload or {},
                        headers={"Authorization": f"Bearer {token}"})
      return r.json()

  def extract_report(body):
      if isinstance(body, dict):
          reports = (body.get("data") or {}).get("reports") or body.get("reports") or []
          if reports: return reports[0]
          if "success" in body: return body
      return body

  def main():
      print("=" * 60)
      print("Graph Persistence Race Condition Reproducer")
      print("=" * 60)
      token = register_and_login()
      print(f"Token: {token[:20]}...")
      r = extract_report(walker_call(token, "list_items"))
      initial_count = r.get("count", 0)
      lost_count = 0
      total_created = 0

      for round_num in range(1, ROUNDS + 1):
          item_name = f"item_round_{round_num}"
          print(f"\n[Round {round_num}] add_item('{item_name}') + {CONCURRENT_TOUCHES} concurrent touches...")
          total_created += 1

          with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENT_TOUCHES + 1) as pool:
              create_future = pool.submit(walker_call, token, "add_item", {"name": item_name})
              touch_futures = [pool.submit(walker_call, token, "touch", {"ts": str(i)})
                             for i in range(CONCURRENT_TOUCHES)]
              concurrent.futures.wait([create_future] + touch_futures)

          create_resp = extract_report(create_future.result())
          print(f"  add_item: {create_resp}")
          time.sleep(0.5)

          list_resp = extract_report(walker_call(token, "list_items"))
          current_count = list_resp.get("count", 0)
          expected_count = initial_count + round_num
          items = [i.get("name") for i in list_resp.get("items", [])]

          if item_name not in items:
              lost_count += 1
              print(f"  *** LOST! '{item_name}' not in list ***")
          else:
              print(f"  OK - '{item_name}' survived")
          if current_count < expected_count:
              print(f"  Expected {expected_count} items, got {current_count} — {expected_count - current_count} missing!")

      print(f"\n{'=' * 60}")
      print(f"RESULTS: {total_created} created, {lost_count} lost")
      if lost_count > 0:
          print(f"*** BUG CONFIRMED: {lost_count}/{total_created} items silently lost ***")
      print(f"Final: {extract_report(walker_call(token, 'list_items'))}")

  if __name__ == "__main__":
      main()

How to reproduce

1. Create a directory with the files above

mkdir race-repro && cd race-repro

2. Start server

jac start server.jac

3. In another terminal, run test (adjust BASE port if needed)

pip install requests
python test_race.py

  Validated result

  ============================================================
  Graph Persistence Race Condition Reproducer
  ============================================================

  [Round 1] OK - 'item_round_1' survived
  [Round 2] OK - 'item_round_2' survived
  [Round 3] OK - 'item_round_3' survived
  [Round 4] *** LOST! 'item_round_4' not in list ***
            Expected 4 items, got 3 — 1 missing!
  [Round 5] OK - 'item_round_5' survived
            Expected 5 items, got 4 — 1 missing!

  ============================================================
  RESULTS: 5 created, 1 lost
  *** BUG CONFIRMED: 1/5 items silently lost ***

item_round_4 was created successfully (server returned success: True) but was permanently erased from the graph by a concurrent touch walker's commit overwriting the root node's edge list with a stale snapshot.

Suggested Fixes

Option 1: Per-user-root async lock (quick fix)

Add an asyncio.Lock keyed by user root ID so two walkers for the same user cannot commit simultaneously. Low complexity, prevents the race entirely for single-pod deployments.
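A minimal sketch of this approach, assuming a single-process async server (lock registry and function names are hypothetical, not existing jaclang APIs, and this does nothing for multi-worker deployments):

```python
import asyncio
from collections import defaultdict

# Hypothetical per-root lock registry: one asyncio.Lock per user root ID.
_root_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

# Illustrative stand-in for the persistent store.
store = {"root1": {"edges": []}}

async def run_walker(root_id: str, edge: str):
    # Hold the lock across the whole read-modify-write cycle, not just
    # the commit, so no two walkers ever work from the same snapshot.
    async with _root_locks[root_id]:
        snapshot = dict(store[root_id])
        snapshot["edges"] = snapshot["edges"] + [edge]
        await asyncio.sleep(0)      # yield point: another task would run here
        store[root_id] = snapshot   # commit

async def main():
    # 20 concurrent walkers against the same root.
    await asyncio.gather(*(run_walker("root1", f"e{i}") for i in range(20)))

asyncio.run(main())
print(len(store["root1"]["edges"]))  # → 20: no edge is lost
```

Without the `async with` block, the same 20 tasks would interleave at the `sleep(0)` yield point and most edges would vanish, exactly as in the bug.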

Option 2: Optimistic locking with retry (proper fix)

Add a _version field to every anchor. Change writes to:

# MongoDB
result = collection.update_one(
    {'_id': str(_id), '_v': old_version},
    {'$set': {'data': data}, '$inc': {'_v': 1}}
)
if result.modified_count == 0:
    # Conflict — reload, re-merge, retry
    ...

-- SQLite
UPDATE anchors SET data = ?, version = version + 1
WHERE id = ? AND version = ?
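A fuller retry loop around the conditional write could look like the sketch below (against SQLite for concreteness; the schema, `commit_anchor`, and the simulated rival writer are assumptions for illustration, not the actual jaclang code):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE anchors (id TEXT PRIMARY KEY, data TEXT, version INTEGER)")
conn.execute("INSERT INTO anchors VALUES ('profile', ?, 0)",
             (json.dumps({"edges": ["existing_proj"]}),))

def commit_anchor(anchor_id, mutate, max_retries=5):
    """Read-modify-write with optimistic locking: the UPDATE only
    succeeds if the row still has the version we read."""
    for _ in range(max_retries):
        row = conn.execute(
            "SELECT data, version FROM anchors WHERE id = ?", (anchor_id,)
        ).fetchone()
        doc, version = json.loads(row[0]), row[1]
        mutate(doc)
        cur = conn.execute(
            "UPDATE anchors SET data = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (json.dumps(doc), anchor_id, version),
        )
        if cur.rowcount == 1:
            return doc  # conditional write won
        # rowcount == 0: another writer committed first; reload and retry
    raise RuntimeError("too many conflicting writers")

# Simulate a rival commit landing between our read and our write,
# on the first attempt only:
conflicted = {"done": False}

def add_edge(doc):
    if not conflicted["done"]:
        conflicted["done"] = True
        # Rival writer bumps the version behind our back.
        conn.execute("UPDATE anchors SET version = version + 1 WHERE id = 'profile'")
    doc["edges"].append("new_proj")

result = commit_anchor("profile", add_edge)
print(result["edges"])  # → ['existing_proj', 'new_proj'] after one retry
```

The first attempt's conditional UPDATE matches zero rows (the version moved from 0 to 1 underneath it), so the loop reloads and retries; the second attempt commits cleanly. Unlike the current code path, the stale snapshot is detected instead of silently winning.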

Option 3: Separate edge storage (best fix)

Move edges out of the parent node's serialized blob into a separate collection/table. Use atomic operations ($addToSet, $pull in MongoDB; separate INSERT/DELETE in SQLite) for edge mutations. This eliminates the need to replace the entire parent document when edges change.
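A sketch of what this could look like on the SQLite side (the schema and helper names are illustrative, not a proposal for the exact jaclang tables):

```python
import sqlite3

# Edges live in their own table, so adding or removing an edge is a
# row-level atomic operation that never rewrites the parent node's blob.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id TEXT PRIMARY KEY, data TEXT);
    CREATE TABLE edges (
        src TEXT NOT NULL,
        dst TEXT NOT NULL,
        PRIMARY KEY (src, dst)   -- duplicate inserts become no-ops
    );
""")
conn.execute("INSERT INTO nodes VALUES ('profile', '{}')")

def add_edge(src, dst):
    # Atomic per row: a concurrent writer cannot clobber this edge,
    # because no writer ever reads and rewrites the full edge list.
    conn.execute("INSERT OR IGNORE INTO edges VALUES (?, ?)", (src, dst))

def remove_edge(src, dst):
    conn.execute("DELETE FROM edges WHERE src = ? AND dst = ?", (src, dst))

# Two independent writers each touch their own edge; neither can lose
# the other's write.
add_edge("profile", "prj_old")
add_edge("profile", "prj_new")
remove_edge("profile", "prj_old")

rows = conn.execute("SELECT dst FROM edges WHERE src = 'profile'").fetchall()
print([r[0] for r in rows])  # → ['prj_new']
```

The `touch` walker in the reproducer would then commit only the node's scalar fields; the edge created by `add_item` would be untouchable by its commit.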

Impact

  • Data loss: Graph nodes and edges silently disappear
  • Silent: No errors, no warnings — the overwriting commit succeeds
  • Affects all applications: Any Jac app that handles concurrent requests touching the same graph nodes
  • Worse at scale: Multi-worker deployments (--scale with Gunicorn) have completely independent in-memory state with zero inter-process coordination

Environment

  • jaclang: installed from jaseci-labs/jaseci@main
  • jac-scale: installed from jaseci-labs/jaseci@main
  • Python: 3.12
  • OS: Linux (Ubuntu)
  • Server: uvicorn (single worker, async)
