fix: improve concurrency handling of users #2014

Closed
MarcosNicolau wants to merge 20 commits into staging from fix/verification-concurrency

Conversation

@MarcosNicolau
Member

Description

Previously, we were performing proof pre-verification as soon as a new proof message arrived. This verification could take a few hundred milliseconds, and since it was done before even validating the message, the lock would be held far too long.

We’ve moved the proof verification step after the basic validation checks, such as nonce, max fee, and replacement. This way, only well-formed messages proceed to the verification step.

However, this introduced a new problem: proof verification still needs access to the shared state batch_state, and locking this state for the duration of the verification step could block all other processing for too long. To solve this, we introduced a mutex per user:

  • user_proof_processing_mutexes: this allows us to process proofs from different users in parallel, without locking the whole batch_state.
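To make the idea concrete, here is a minimal std-only sketch of a per-user mutex map in the spirit of user_proof_processing_mutexes. All names here are illustrative: the real batcher uses tokio mutexes and Ethereum addresses, while this sketch uses `String` and `std::sync`:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Stand-in for an Ethereum address (illustrative only).
type Address = String;

struct UserMutexes {
    // The outer mutex only guards the map itself and is held briefly,
    // so looking up one user's mutex never blocks other users for long.
    inner: Mutex<HashMap<Address, Arc<Mutex<()>>>>,
}

impl UserMutexes {
    fn new() -> Self {
        Self { inner: Mutex::new(HashMap::new()) }
    }

    // Fetch (or lazily create) the mutex for a given user, then drop the
    // map lock immediately; callers lock only their own user's mutex.
    fn for_user(&self, addr: &Address) -> Arc<Mutex<()>> {
        let mut map = self.inner.lock().unwrap();
        map.entry(addr.clone())
            .or_insert_with(|| Arc::new(Mutex::new(())))
            .clone()
    }
}

fn main() {
    let mutexes = UserMutexes::new();
    let a = mutexes.for_user(&"0xabc".to_string());
    let b = mutexes.for_user(&"0xabc".to_string());
    // Same user -> same mutex instance, so their proofs serialize.
    assert!(Arc::ptr_eq(&a, &b));
    let c = mutexes.for_user(&"0xdef".to_string());
    // Different users get independent mutexes and can proceed in parallel.
    assert!(!Arc::ptr_eq(&a, &c));
    println!("ok");
}
```

The key property is that the expensive per-proof work happens while holding only that user's mutex, never the map lock or the whole batch_state.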

This new mutex setup means that batch building and proof processing can run in parallel. Before, the proof message handler kept the batch_state_lock for the entire processing, making it more sequential. Now, they’ll only block each other when they need access to the batch_state.

As long as we ensure the batch_state_lock is held while modifying shared state (adding/removing from the batch queue + updating the user state), things are safe. For example, if a replacement is validated but the proof is no longer in the state by the time we lock and check (because it was removed by the batch builder), we can just return an error to the user.

The TL;DR is: there's no need to synchronize the two processes explicitly, the batch_state_lock already ensures consistency. At worst, we'll hit a race that results in an error on one side, which might lead to a poorer UX, but I prefer that over unnecessary mutex complexity.
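As a sketch of that "at worst an error" behaviour, here is a minimal illustration of re-checking shared state under the lock after lock-free validation. Every name here is invented for the example; the real code operates on the batcher's own state types:

```rust
use std::collections::HashSet;
use std::sync::Mutex;

// Illustrative stand-in for the shared batch state.
struct BatchState {
    batch_queue: Mutex<HashSet<u64>>, // proof nonces currently queued
}

#[derive(Debug, PartialEq)]
enum ReplaceError {
    NoLongerInBatch,
}

impl BatchState {
    // Expensive validation runs outside the lock; we then take the lock
    // and re-check that the proof is still queued. If the batch builder
    // removed it in the meantime, we simply return an error to the user.
    fn try_replace(&self, nonce: u64) -> Result<(), ReplaceError> {
        // ... expensive pre-verification would happen here, lock-free ...
        let queue = self.batch_queue.lock().unwrap();
        if !queue.contains(&nonce) {
            return Err(ReplaceError::NoLongerInBatch);
        }
        // ... mutate shared state while still holding the lock ...
        Ok(())
    }
}

fn main() {
    let state = BatchState { batch_queue: Mutex::new(HashSet::from([1, 2])) };
    assert_eq!(state.try_replace(1), Ok(()));
    // Nonce 9 is gone (e.g. drained by the batch builder): error, not corruption.
    assert_eq!(state.try_replace(9), Err(ReplaceError::NoLongerInBatch));
    println!("ok");
}
```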

Another improvement: the batch_posting mutex was used to synchronize the tasks spawned to submit a batch when a new block arrived. This could be solved much more easily by simply not spawning a task and awaiting the submission directly.
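A thread-based sketch of the same idea (the real code uses tokio tasks and async; these names are illustrative): running the submission inline serializes it naturally, so no extra mutex is needed for ordering.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Illustrative batch submission: just records which batch was posted.
fn submit_batch(log: &Mutex<Vec<u32>>, id: u32) {
    log.lock().unwrap().push(id);
}

fn main() {
    let log = Arc::new(Mutex::new(Vec::new()));

    // Before: spawn a worker per new block, then need extra synchronization
    // (the old batch_posting mutex) to keep submissions ordered.
    let spawned = {
        let log = Arc::clone(&log);
        thread::spawn(move || submit_batch(&log, 1))
    };
    spawned.join().unwrap();

    // After: run the submission inline; ordering is implicit and the
    // dedicated mutex disappears.
    submit_batch(&log, 2);

    assert_eq!(*log.lock().unwrap(), vec![1, 2]);
    println!("ok");
}
```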

Finally, other improvements/refactors were made:

  • Showing debug logs in dev mode and filtering spamming crate logs
  • Better mutex handling in general, to avoid holding locks for too long and starving other tasks.

Moving forward
There are two other checks we could add to better prevent DoS:

  • Min bump fee: we should enforce a min fee bump of at least n% with respect to the previous one.
  • Min max fee: we should enforce a minimum max fee that can cover a batch of m proofs at least.
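A minimal sketch of what those two checks could look like. The percentage and the batch size below are placeholders I chose for illustration, not values from the codebase:

```rust
// Illustrative constants: a real implementation would make these configurable.
const MIN_BUMP_PERCENT: u128 = 10; // require >= 10% fee bump on replacement
const MIN_PROOFS_COVERED: u128 = 16; // max_fee must cover a batch of 16 proofs

// Min bump fee: a replacement must raise the fee by at least n%.
fn bump_is_sufficient(prev_fee: u128, new_fee: u128) -> bool {
    new_fee >= prev_fee + prev_fee * MIN_BUMP_PERCENT / 100
}

// Min max fee: the declared max_fee must cover a batch of m proofs.
fn max_fee_is_sufficient(max_fee: u128, fee_per_proof: u128) -> bool {
    max_fee >= fee_per_proof * MIN_PROOFS_COVERED
}

fn main() {
    assert!(bump_is_sufficient(100, 110));   // exactly +10%: accepted
    assert!(!bump_is_sufficient(100, 105));  // +5%: rejected
    assert!(max_fee_is_sufficient(1600, 100));
    assert!(!max_fee_is_sufficient(1599, 100));
    println!("ok");
}
```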

How to test

To test this, I suggest two runs: one with a few proofs so you can see in the logs how everything is synchronized, and another with a high load of proofs to make sure it can handle it and nothing breaks.

Set up

  1. Locate seconds_per_slot in network_params.yaml and set it to 2 seconds. This creates more frequent contention between the batch building process and proof processing.
  2. Start ethereum package:
make ethereum_package_start
  3. Start the aggregator:
make aggregator_start_ethereum_package ENVIRONMENT=devnet
  4. Start an operator:
make operator_full_registration CONFIG_FILE=config-files/config-operator-1-ethereum-package.yaml
make operator_start CONFIG_FILE=config-files/config-operator-1-ethereum-package.yaml ENVIRONMENT=devnet
  5. Start the explorer:
make explorer_build_db
make explorer_start

Few proofs test

The idea of this test is to see the mutex behaviour step by step. For this I suggest:

  1. Add a sleep in the verify_proof function in crates/batcher/src/lib.rs
    /// Takes [`VerificationData`] and spawns a blocking task to verify the proof
    async fn verify_proof(
        &self,
        verification_data: &VerificationData,
    ) -> Result<(), ProofInvalidReason> {
        tokio::time::sleep(tokio::time::Duration::from_secs(5)).await;
...
  2. Start the batcher:
make batcher_start_ethereum_package
  3. Fund two wallets in Aligned; here you can use the default rich accounts:
make aligned_install_compiling

# Fund wallet 1
aligned deposit-to-batcher --amount 1ether --private_key 0x4bbbf85ce3377467afe5d46f804f221813b2bb87f24d81f60f1fcdbf7cbf4356  --network devnet --rpc_url http://localhost:8545

# Fund wallet 2
aligned deposit-to-batcher --amount 1ether --private_key 0xdbda1821b80551c9d65939329250298aa3472ba22feea921c0cf5d620ea67b97  --network devnet --rpc_url http://localhost:8545
  4. Send proofs and analyze the lock behaviour:
# With account 1
aligned submit --proving_system SP1 \
		--proof scripts/test_files/sp1/sp1_fibonacci_5_0_0.proof \
		--vm_program scripts/test_files/sp1/sp1_fibonacci_5_0_0.elf \
		--private_key 0x4bbbf85ce3377467afe5d46f804f221813b2bb87f24d81f60f1fcdbf7cbf4356 \
		--repetitions 10 \
		--random_address \
		--custom_fee_estimate 5 \
		--rpc_url  http://localhost:8545 \
		--network devnet

# With account 2
aligned submit --proving_system SP1 \
		--proof scripts/test_files/sp1/sp1_fibonacci_5_0_0.proof \
		--vm_program scripts/test_files/sp1/sp1_fibonacci_5_0_0.elf \
		--private_key 0xdbda1821b80551c9d65939329250298aa3472ba22feea921c0cf5d620ea67b97 \
		--repetitions 10 \
		--random_address \
		--custom_fee_estimate 5 \
		--rpc_url  http://localhost:8545 \
		--network devnet

# And again with account 1 to open two connections with the same address
aligned submit --proving_system SP1 \
		--proof scripts/test_files/sp1/sp1_fibonacci_5_0_0.proof \
		--vm_program scripts/test_files/sp1/sp1_fibonacci_5_0_0.elf \
		--private_key 0x4bbbf85ce3377467afe5d46f804f221813b2bb87f24d81f60f1fcdbf7cbf4356 \
		--repetitions 10 \
		--random_address \
		--custom_fee_estimate 5 \
		--rpc_url  http://localhost:8545 \
		--network devnet

Big load test

For this test, we use the task-sender to spam the batcher with different wallets and make sure it doesn't lock up and continues to build batches.

  1. Go to crates/task-sender/wallets/devnet and copy the 10 rich accounts of ethereum-package:
0xac0974bec39a17e36ba4a6b4d238ff944bacb478cbed5efcae784d7bf4f2ff80
0x59c6995e998f97a5a0044966f0945389dc9e86dae88c7a8412f4603b6b78690d
0x5de4111afa1a4b94908f83103eb1f1706367c2e68ca870fc3fb9a804cdab365a
0x7c852118294e51e653712a81e05800f419141751be58f605c371e15141b007a6
0x47e179ec197488593b187f80a00eb0da91f1b9d0b13f8733639f19c30a34926a
0x8b3a350cf5c34c9194ca85829a2df0ec3153be0318b5e2d3348e872092edffba
0x92db14e403b83dfe3df233f83dfa3a0d7096f21ca9b0d6d6b8d88b2b4ec1564e
0x4bbbf85ce3377467afe5d46f804f221813b2bb87f24d81f60f1fcdbf7cbf4356
0xdbda1821b80551c9d65939329250298aa3472ba22feea921c0cf5d620ea67b97
0x2a871d0798f97d79848a013d4936a73bf4cc922c825d33c1cf7073dff6d409c6
  2. Start the batcher:
make batcher_start_ethereum_package
  3. Generate proofs for the task-sender and fund wallets:
make task_sender_generate_gnark_groth16_proofs
make task_sender_fund_wallets_devnet
  4. Send proofs. I suggest doing this in several different processes so you have many different WebSocket connections per address; the per-user mutex should synchronize them correctly:
# Modify this params as you wish
make task_sender_send_infinite_proofs_devnet BURST_SIZE=50 BURST_TIME_SECS=1

Type of change

  • Bug fix
  • Optimization
  • Refactor

Checklist

  • “Hotfix” to testnet, everything else to staging
  • Linked to Github Issue
  • This change depends on code or research by an external entity
    • Acknowledgements were updated to give credit
  • Unit tests added
  • This change requires new documentation.
    • Documentation has been added/updated.
  • This change is an Optimization
    • Benchmarks added/run
  • Has a known issue
  • If your PR changes the Operator compatibility (Ex: Upgrade prover versions)
    • This PR adds compatibility for operators on both versions and does not change crates/docs/examples
    • This PR updates the batcher and docs/examples to the newer version. This requires that operators are already updated to be compatible

Comment thread Makefile
@cd crates/task-sender && \
cargo run --release -- generate-proofs \
--number-of-proofs $(NUMBER_OF_PROOFS) --proof-type gnark_groth16 \
--number-of-proofs $(NUMBER_OF_PROOFS) --proof-type groth16 \
Collaborator


This should be reverted

Member Author


Actually, gnark_groth16 wasn't added to the task-sender. We should update the command args in the task-sender first in another PR. I fixed that here because I used the task-sender to test.

Comment thread crates/batcher/src/lib.rs
.clone()
};
let _user_mutex = user_mutex.lock().await;
debug!("User mutex for {:?} acquired...", addr_in_msg);
Collaborator


Would be useful to log when the lock is released

Collaborator


Debug level is okay imo

Member Author


I agree, but given all the early returns it would add a lot of noise to the code. For now I would avoid it, and after #2015 we can easily add this, as all the errors will be handled at one level (so we can put the log there).

Comment thread crates/batcher/src/main.rs Outdated
env_logger::Builder::from_env(Env::default().default_filter_or("info")).init();
env_logger::Builder::from_env(Env::default().filter_or(
"RUST_LOG",
"info,aligned_batcher=debug,ethers_providers=off",
Collaborator


If we want to reduce ethers_providers logs, I'd set it to WARN at least, to avoid omitting errors or warnings

Member Author

@MarcosNicolau MarcosNicolau Jul 10, 2025


Sure, the default logs for ethers_providers were very spammy. WARN makes more sense. Solved in a2c05bf.

Comment thread crates/batcher/src/lib.rs Outdated
// * ---------------------------------------------------------------------*

if batch_state_lock.is_queue_full() {
if self.batch_state.lock().await.is_queue_full() {
Collaborator


This check may be done in the add_to_batch function, because otherwise you can have a race condition where the batch queue limit is exceeded, as we discussed in the huddle

Member Author


It is a case I had in mind, but because triggering it was really hard, I preferred to avoid the complexity of handling it. As discussed, the best option was to move it to the add_to_batch function; see 8787958.

Contributor

@MauroToscano MauroToscano left a comment


The state should probably be split according to the locks, instead of having locks coming out of nowhere. The batch state shouldn't be locked all the time by proof processors.

@MauroToscano
Contributor

This PR #2017 replaces this one

@JuArce JuArce deleted the fix/verification-concurrency branch January 13, 2026 18:59