Commit 52c3b09

add blog post for multiplayer
1 parent db5591d commit 52c3b09

3 files changed

Lines changed: 159 additions & 85 deletions

apps/sim/content/authors/vik.json

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
11
{
22
"id": "vik",
33
"name": "Vikhyath Mondreti",
4-
"url": "https://x.com/vikhyathm",
5-
"xHandle": "vikhyathm",
4+
"url": "https://github.com/icecrasher321",
5+
"xHandle": "icecrasher321",
66
"avatarUrl": "/studio/authors/vik.png"
77
}
Lines changed: 153 additions & 81 deletions
@@ -1,109 +1,181 @@
11
---
22
slug: multiplayer
3-
title: 'Realtime and Multiplayer in Sim — how it works under the hood'
4-
description: "A technical look at Sim's realtime and multiplayer architecture: presence, collaboration, conflict resolution, and scale."
5-
date: 2025-11-08
6-
updated: 2025-11-08
3+
title: 'Realtime Collaboration on Sim'
4+
description: A high-level look at Sim's realtime collaborative workflow builder, from operation queues to conflict resolution.
5+
date: 2025-11-12
6+
updated: 2025-11-12
77
authors:
88
- vik
9-
readingTime: 8
10-
tags: [Multiplayer, Realtime, Collaboration, CRDT, WebSockets, Sim]
11-
ogImage: /studio/multiplayer/cover.png
12-
ogAlt: 'Sim multiplayer architecture overview'
13-
about: ['Realtime Systems', 'Operational Transform / CRDT', 'Collaboration']
14-
timeRequired: PT8M
15-
canonical: https://sim.ai/studio/multiplayer
16-
featured: false
9+
readingTime: 12
10+
tags: [Multiplayer, Realtime, Collaboration, WebSockets, Architecture]
11+
ogImage: /blog/multiplayer/cover.png
12+
canonical: https://sim.ai/blog/multiplayer
1713
draft: false
1814
---
1915

20-
> This post outlines the key pieces of Sim’s realtime and multiplayer stack. It’s a scaffold we’ll keep enriching with diagrams, traces, and code snippets as we publish more details.
21-
22-
## Goals
23-
24-
- Low‑latency collaboration on shared canvases and workflows
25-
- Deterministic conflict resolution and auditability
26-
- Scales from small teams to enterprise orgs
27-
28-
## High‑level architecture
29-
30-
1. Transport
31-
- Secure WebSocket channels per workspace/session with fallbacks.
32-
2. Session and presence
33-
- Authenticated connections; presence, cursors, and selections broadcast on a lightweight channel.
34-
3. State model
35-
- Canonical workflow state stored in a durable DB; clients hold ephemeral working copies.
36-
4. Conflict resolution
37-
- Operation‑based CRDT/OT hybrid for block changes; idempotent ops with causal timestamps.
38-
5. Persistence and snapshots
39-
- Append‑only operation log; periodic compaction into snapshots for fast loads.
40-
6. Observability
41-
- Per‑op metrics, client RTT, and reconnection traces; room health dashboards.
42-
43-
```mermaid
44-
sequenceDiagram
45-
participant C1 as Client A
46-
participant C2 as Client B
47-
participant GW as Realtime Gateway
48-
participant S as State Service
49-
C1->>GW: connect(ws, auth)
50-
C2->>GW: connect(ws, auth)
51-
C1->>GW: op(block.update)
52-
GW->>S: validate & persist(op)
53-
S-->>GW: ack(op, version)
54-
GW-->>C1: ack(op)
55-
GW-->>C2: broadcast(op)
16+
When we started building Sim, we noticed that AI workflow development looked a lot like the design process [Figma](https://www.figma.com/blog/how-figmas-multiplayer-technology-works/) had already solved for. Product managers need to sketch out user-facing flows, engineers need to configure integrations and APIs, and domain experts need to validate business logic—often all at the same time. Traditional workflow builders force serial collaboration: one person edits, saves, exports, and notifies the next person. This creates unnecessary friction.
17+
18+
We decided multiplayer editing was the right approach, even though workflow platforms like n8n and Make do not currently offer it. This post explains how we built it. We'll cover the operation queue, conflict resolution, how we handle blocks/edges/subflows separately, undo/redo as a wrapper around this, and why our system is a lot simpler than you'd expect.
19+
20+
## Architecture Overview: Client-Server with WebSockets
21+
22+
Sim uses a client-server architecture where browser clients communicate with a standalone Node.js WebSocket server over persistent connections. When you open a workflow, your client joins a "workflow room" on the server. All subsequent operations—adding blocks, connecting edges, updating configurations—are synchronized through this connection.
23+
24+
### Server-Side: The Source of Truth
25+
26+
The server maintains authoritative state in PostgreSQL across three normalized tables:
27+
28+
- `workflow_blocks`: Block metadata, positions, configurations, and subblock values
29+
- `workflow_edges`: Connections between blocks with source/target handles
30+
- `workflow_subflows`: Loop and parallel container configurations with child node lists
31+
32+
This separation is deliberate. Blocks, edges, and subflows have different update patterns and conflict characteristics. Storing them separately gives us:
33+
34+
1. **Targeted updates**: Moving a block only updates `positionX` and `positionY` fields for that specific block row. We don't load or lock the entire workflow.
35+
2. **Query optimization**: Different operations hit different tables with appropriate indexes. Updating edge connections only touches `workflow_edges`, leaving blocks untouched.
36+
3. **Separate channels**: Structural operations (adding blocks, connecting edges) go through the main operation handler with persistence-first logic. Value updates (editing text in a subblock) go through a separate debounced channel with server-side coalescing—reducing database writes from hundreds to dozens for a typical typing session.
37+
38+
The server uses different broadcast strategies: position updates are broadcast immediately for smooth collaborative dragging (optimistic), while structural operations (adding blocks, connecting edges) persist first to ensure consistency (pessimistic).
39+
40+
### Client-Side: Optimistic Updates with Reconciliation
41+
42+
Clients maintain local copies of workflow state in [Zustand](https://github.com/pmndrs/zustand) stores. When you drag a block or type in a text field, the UI updates immediately—this is optimistic rendering. Simultaneously, the client queues an operation in a separate operation queue store to send to the server.
43+
44+
The client doesn't wait for server confirmation to render changes. Instead, it assumes success and continues. If the server rejects an operation (permissions failure, conflict, validation error), the client reconciles by either retrying or reverting the local change.
45+
46+
This is why workflow editing feels instantaneous—you never wait for a network round-trip to see your changes. The downside is added complexity around handling reconciliation, retries, and conflict resolution.
47+
48+
## The Operation Queue: Reliability Through Retries
49+
50+
At the heart of Sim's multiplayer system is the **Operation Queue**—a client-side abstraction that ensures no operation is lost, even under poor network conditions.
51+
52+
### How It Works
53+
54+
Every user action that modifies workflow state generates an operation object:
55+
56+
```typescript
57+
{
58+
id: 'op-uuid',
59+
operation: {
60+
operation: 'update', // or 'add', 'remove', 'move'
61+
target: 'block', // or 'edge', 'subblock', 'variable'
62+
payload: { /* change data */ }
63+
},
64+
workflowId: 'workflow-id',
65+
userId: 'user-id',
66+
status: 'pending'
67+
}
68+
```
69+
70+
Operations are enqueued in FIFO order. The queue processor sends one operation at a time over the WebSocket, waiting for server confirmation before proceeding to the next. Text edits (subblock values, variable fields) are debounced client-side and coalesced server-side—a user typing a 500-character prompt generates ~10 operations instead of 500.
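
The FIFO, one-at-a-time behavior described above can be sketched as follows. This is a minimal illustration and not Sim's actual queue: the `OperationQueue` class, its method names, and the `'processing'` status are assumptions made for the example; only `id` and the `'pending'` status come from the operation object shown earlier.

```typescript
// Hypothetical shapes: only `id` and the 'pending' status mirror the operation object above.
interface QueuedOperation {
  id: string;
  status: 'pending' | 'processing';
}

class OperationQueue {
  private ops: QueuedOperation[] = [];

  // New operations always join the back of the queue (FIFO).
  enqueue(id: string): void {
    this.ops.push({ id, status: 'pending' });
  }

  // Returns the operation to send next, or null while one is still awaiting confirmation.
  next(): QueuedOperation | null {
    const head = this.ops[0];
    if (!head || head.status === 'processing') return null;
    head.status = 'processing';
    return head;
  }

  // Called when the server confirms; only the in-flight head can be confirmed.
  confirm(id: string): boolean {
    const head = this.ops[0];
    if (!head || head.id !== id || head.status !== 'processing') return false;
    this.ops.shift();
    return true;
  }

  get size(): number {
    return this.ops.length;
  }
}
```

Because `next()` returns `null` while an operation is in flight, a second operation cannot overtake the first, which is what preserves ordering across the socket.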
71+
72+
Failed operations retry with exponential backoff (structural changes get 3 attempts, text edits get 5). If all retries fail, the system enters offline mode—the queue is cleared and the UI becomes read-only until the user manually refreshes.
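
As a rough sketch of the retry policy, assuming a doubling backoff: the attempt limits (3 and 5) come from the paragraph above, while the base delay, the `OpKind` type, and the `nextRetryDelay` helper are hypothetical names and values chosen for illustration.

```typescript
type OpKind = 'structural' | 'text';

// Attempt limits are taken from the post; the base delay is an assumed value.
const MAX_ATTEMPTS: Record<OpKind, number> = { structural: 3, text: 5 };
const BASE_DELAY_MS = 500; // hypothetical

// `attemptsMade` counts attempts already made (1 after the first failure).
// Returns the delay before the next attempt, or null when retries are exhausted
// and the client should fall back to offline mode.
function nextRetryDelay(kind: OpKind, attemptsMade: number): number | null {
  if (attemptsMade >= MAX_ATTEMPTS[kind]) return null;
  return BASE_DELAY_MS * 2 ** (attemptsMade - 1);
}
```

With these assumed values a structural change would be retried after 500 ms and 1000 ms before giving up, while a text edit gets two further retries.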
73+
74+
### Handling Dependent Operations
75+
76+
The operation queue's real power emerges when handling conflicts between collaborators. Consider this scenario:
77+
78+
**User A** deletes a block while **User B** has a pending subblock update for that same block in their operation queue.
79+
5680
```
81+
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
82+
│ User A │ │ Server │ │ User B │
83+
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
84+
│ │ │
85+
│ Delete Block X │ │
86+
├─────────────────────────────────>│ │
87+
│ │ │
88+
│ │ Persist deletion │
89+
│ │ ────────────┐ │
90+
│ │ │ │
91+
│ │<─────────────┘ │
92+
│ │ │
93+
│ │ Broadcast: Block X deleted │
94+
│ ├─────────────────────────────────>│
95+
│ │ │
96+
│ │ Cancel all ops for X │
97+
│ │ (including subblock) │
98+
│ │ ────────┤
99+
│ │ │
100+
│ │ Remove Block X │
101+
│ │ ────────┤
102+
│ │ │
103+
```
104+
105+
Here's what happens:
106+
107+
1. User A's delete operation reaches the server and persists successfully
108+
2. The server broadcasts the deletion to all clients, including User B
109+
3. User B's client receives the broadcast and **immediately cancels all pending operations** for Block X (including the subblock update)
110+
4. Then User B's client removes Block X from local state
111+
112+
No operations are sent to the server for a block that no longer exists. The client proactively removes all related operations from the queue—both block-level operations and subblock operations. User B never sees an error because the stale operation is silently discarded before it's sent.
113+
114+
This is more efficient than relying on server-side validation alone. By canceling dependent operations locally when receiving a deletion broadcast, we avoid wasting network requests on operations that would fail anyway.
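
The client-side cancellation step might look something like the sketch below. The payload field names (`id` for block operations, `blockId` for subblock operations) and the `cancelOperationsForBlock` helper are assumptions for illustration; the post does not show the real payload shape.

```typescript
// Hypothetical payload fields: `id` for block operations, `blockId` for subblock operations.
interface PendingOp {
  id: string;
  operation: {
    operation: string;
    target: 'block' | 'edge' | 'subblock' | 'variable';
    payload: { id?: string; blockId?: string };
  };
}

// Returns the queue without any operation that targets the deleted block.
function cancelOperationsForBlock(queue: PendingOp[], blockId: string): PendingOp[] {
  return queue.filter(({ operation }) => {
    const { target, payload } = operation;
    if (target === 'block') return payload.id !== blockId;
    if (target === 'subblock') return payload.blockId !== blockId;
    return true; // edges and variables are left alone in this sketch
  });
}
```

A client would run this when the deletion broadcast arrives, before removing the block from local state, so the stale subblock update is never sent.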
57115

58-
## Presence and awareness
116+
## Conflict Resolution: Timestamps and Determinism
59117

60-
- Presence channel carries user metadata (name, color), cursor positions, and ephemeral selections.
61-
- Heartbeats + timeouts remove stale presence; reconnects recover presence state.
118+
In line with our goal of keeping things simple, Sim uses a **last-writer-wins** strategy with timestamp-based ordering. Every operation carries a client-generated timestamp. When conflicts occur, the operation with the latest timestamp takes precedence.
62119

63-
## Operations and versions
120+
This is simpler than operational transform (OT) or a full CRDT, but sufficient for our use case. Workflow building has lower conflict density than text editing: users typically work on different parts of the canvas or different blocks.
64121

65-
- Every mutating action becomes an operation with: opId, actorId, version, path, payload.
66-
- Servers validate permissions and consistency (version checks) before persisting.
67-
- Clients apply local‑first (optimistic) and reconcile on ack or transform.
122+
**Position conflicts** are handled with timestamp ordering. If two users simultaneously drag the same block, both clients render their local positions optimistically. The server persists both updates based on timestamps, broadcasting each in sequence. Clients receive the conflicting positions and converge to the latest timestamp.
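
To illustrate why timestamp ordering makes clients converge, here is a small sketch. The `PositionUpdate` shape and both function names are hypothetical, and the tie-break (the incoming update wins on equal timestamps) is an assumption the post does not specify.

```typescript
interface PositionUpdate {
  blockId: string;
  x: number;
  y: number;
  timestamp: number; // client-generated, as described in the post
}

// Keeps the update with the later timestamp; on a tie the incoming update wins (an assumption).
function resolvePosition(
  current: PositionUpdate | undefined,
  incoming: PositionUpdate,
): PositionUpdate {
  if (!current || incoming.timestamp >= current.timestamp) return incoming;
  return current;
}

// Folds a stream of updates into the latest known position per block.
function applyUpdates(updates: PositionUpdate[]): Map<string, PositionUpdate> {
  const state = new Map<string, PositionUpdate>();
  for (const update of updates) {
    state.set(update.blockId, resolvePosition(state.get(update.blockId), update));
  }
  return state;
}
```

Two clients that receive the same pair of updates in opposite orders end up with the same position, provided the timestamps differ.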
68123

69-
## Conflict handling
124+
**Value conflicts** (editing the same text field) are rarer and are resolved by arrival order rather than timestamp. Subblock updates are coalesced server-side within a 25ms window: whichever update reaches the server last within that window is persisted, regardless of client timestamp.
70125

71-
- Commutative ops where possible; otherwise use a simple priority rule (timestamp + actor tie‑break).
72-
- Path‑scoped transforms for list inserts/deletes to prevent positional drift.
126+
## Undo/Redo: A Thin Wrapper Over Sockets
73127

74-
## Latency compensation
128+
Undo/redo in multiplayer environments is notoriously complex. Should undoing overwrite others' changes? What happens when you undo something someone else modified?
75129

76-
- Local optimistic apply → render immediately.
77-
- On ack mismatch, transform local queue and rebase.
78-
- Visual hints for pending vs. confirmed states.
130+
Sim takes a pragmatic approach: **undo/redo is a local, per-user stack that generates inverse operations sent through the same socket system as regular edits.**
79131

80-
## Scale and sharding
132+
### How It Works
81133

82-
- Rooms keyed by workspace + resource; sticky routing ensures op ordering.
83-
- Horizontal gateway workers; state service partitions by workspace.
84-
- Backpressure and fan‑out limits on large rooms.
134+
Every operation you perform is recorded in a local undo stack with its inverse:
85135

86-
## Security model
136+
- **Add block** → Inverse: **Remove block** (with full block snapshot)
137+
- **Remove block** → Inverse: **Add block** (restoring from snapshot)
138+
- **Move block** → Inverse: **Move block** (with original position)
139+
- **Update subblock** → Inverse: **Update subblock** (with previous value)
87140

88-
- Auth tokens scoped to workspace and resources; server‑side permission checks per op.
89-
- Rate limits per actor and per room; anomaly detection for spammy clients.
141+
When you press Cmd+Z:
90142

91-
## Benchmarks (placeholder)
143+
1. Pop the latest operation from your undo stack
144+
2. Push it to your redo stack
145+
3. Execute the inverse operation by queuing it through the operation queue
146+
4. The inverse operation flows through the normal socket system: validation, persistence, broadcast
92147

93-
| Metric | Result (p50) | Result (p95) |
94-
| ------------------------------ | -----------: | -----------: |
95-
| Round‑trip op latency | 60ms | 140ms |
96-
| Broadcast fan‑out (100 users) | 8ms | 22ms |
97-
| Reconnect time | 120ms | 280ms |
148+
This means **undo is just another edit**. If you undo adding a block, Sim sends a "remove block" operation through the queue. Other users see the block disappear in real-time, as if you manually deleted it.
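
The four steps above can be sketched as a pair of stacks wrapped around an enqueue callback. This is an illustrative sketch only: the `UndoRedoStack` class, the `Op` shape, and the choice to clear the redo stack on a fresh edit are assumptions rather than details confirmed by the post.

```typescript
interface Op {
  operation: 'add' | 'remove' | 'move' | 'update';
  target: string;
  payload: Record<string, unknown>;
}

interface UndoEntry {
  op: Op;      // what the user did
  inverse: Op; // what reverts it
}

class UndoRedoStack {
  private undoStack: UndoEntry[] = [];
  private redoStack: UndoEntry[] = [];
  private enqueue: (op: Op) => void;

  // `enqueue` stands in for the operation queue's entry point.
  constructor(enqueue: (op: Op) => void) {
    this.enqueue = enqueue;
  }

  record(op: Op, inverse: Op): void {
    this.undoStack.push({ op, inverse });
    this.redoStack = []; // assumed: a fresh edit discards redo history
  }

  undo(): boolean {
    const entry = this.undoStack.pop();
    if (!entry) return false;
    this.redoStack.push(entry);
    this.enqueue(entry.inverse); // flows through the normal queue like any edit
    return true;
  }

  redo(): boolean {
    const entry = this.redoStack.pop();
    if (!entry) return false;
    this.undoStack.push(entry);
    this.enqueue(entry.op);
    return true;
  }
}
```

Because `undo()` and `redo()` call `enqueue` directly and never `record`, replaying an inverse does not create a new undo entry.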
98149

99-
We’ll publish a full methodology and open telemetry traces as we finalize numbers.
150+
### Coalescing and Snapshots
100151

101-
## Roadmap
152+
Consecutive operations of the same type are coalesced. If you drag a block across the canvas in 50 small movements, only the starting and ending positions are recorded—pressing undo moves the block back to where you started dragging, not through every intermediate position.
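
A minimal sketch of move coalescing, assuming the undo stack stores a `from` and `to` position per entry. The `MoveEntry` shape and `recordMove` helper are hypothetical. Note that this version merges any consecutive moves of the same block; a real implementation would probably also bound the merge by drag gesture or time window.

```typescript
interface Point {
  x: number;
  y: number;
}

interface MoveEntry {
  blockId: string;
  from: Point;
  to: Point;
}

// Records a move, merging it into the top entry when it continues the same block's move.
function recordMove(stack: MoveEntry[], move: MoveEntry): void {
  const top = stack[stack.length - 1];
  if (top && top.blockId === move.blockId) {
    top.to = move.to; // keep the original `from`, extend to the new end position
    return;
  }
  stack.push(move);
}
```

Undoing the single merged entry moves the block back to `from`, the position where the drag began.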
102153

103-
- Presence enrichments (inline comments, threads)
104-
- Partial‑document subscriptions for massive canvases
105-
- Time‑travel and per‑block history
154+
For removal operations, we snapshot the complete state of the removed entity (including all subblock values and connected edges) at the time of removal. This snapshot travels with the undo entry. When you undo a deletion, we restore from the snapshot, ensuring perfect reconstruction even if the workflow structure changed in the interim.
106155

107-
— Team Sim
156+
### Multiplayer Undo Semantics
108157

158+
Undo stacks are **per-user**. Your undo history doesn't include others' changes. This matches user expectations: Cmd+Z undoes *your* recent actions, not your collaborators'.
159+
160+
The system prunes invalid operations from your stack when entities are deleted by collaborators. If User B has "add edge to Block X" in their undo stack, but User A deletes Block X, that undo entry becomes invalid and is automatically removed since the target block no longer exists.
161+
162+
An interesting case: you add a block, someone else connects an edge to it, and then you undo your addition. The block disappears along with their edge (because of foreign key constraints). This is correct—your block no longer exists, so edges referencing it can't exist either. Both users see the block and edge vanish.
163+
164+
During execution, undo operations are marked in-progress to prevent circular recording—undoing shouldn't create a new undo entry for the inverse operation itself.
165+
166+
## Conclusion
167+
168+
Building multiplayer workflow editing required rethinking assumptions about how workflow builders should work. By applying lessons from Figma's collaborative design tool to the domain of AI agent workflows, we created a system that feels fast, reliable, and natural for teams building together.
169+
170+
If you're building collaborative editing for structured data (not just text), consider:
171+
172+
- Whether OT/CRDT complexity is necessary for your conflict density
173+
- How to separate high-frequency value updates from structural changes
174+
- What guarantees your users need around data persistence and offline editing
175+
- Whether exposing operation status builds trust in the system
176+
177+
Multiplayer workflow building is no longer a technical curiosity—it's how teams should work together to build AI agents. And the infrastructure to make it reliable and fast is more approachable than you might think.
178+
179+
---
109180

181+
*Interested in how Sim's multiplayer system works in practice? [Try building a workflow](https://sim.ai) with a collaborator in real-time.*

apps/sim/lib/environment.ts

Lines changed: 4 additions & 2 deletions
@@ -1,7 +1,7 @@
11
/**
22
* Environment utility functions for consistent environment detection across the application
33
*/
4-
import { env, isTruthy } from './env'
4+
import { env, getEnv, isTruthy } from './env'
55

66
/**
77
* Is the application running in production mode
@@ -21,7 +21,9 @@ export const isTest = env.NODE_ENV === 'test'
2121
/**
2222
* Is this the hosted version of the application
2323
*/
24-
export const isHosted = true
24+
export const isHosted =
25+
getEnv('NEXT_PUBLIC_APP_URL') === 'https://www.sim.ai' ||
26+
getEnv('NEXT_PUBLIC_APP_URL') === 'https://www.staging.sim.ai'
2527

2628
/**
2729
* Is billing enforcement enabled
