fix: improve concurrency handling of users #2014

Closed
MarcosNicolau wants to merge 20 commits into staging from fix/verification-concurrency

Conversation

@MarcosNicolau
Member

Description

Previously, we were performing proof pre-verification as soon as a new proof message arrived. This verification could take a few hundred milliseconds, and since it was done before even validating the message, the lock would be held far too long.

We’ve moved the proof verification step after the basic validation checks, such as nonce, max fee, and replacement. This way, only well-formed messages proceed to the verification step.

However, this introduced a new problem: proof verification still needs access to the shared state batch_state, and locking this state for the duration of the verification step could block all other processing for too long. To solve this, we introduced a mutex per user:

  • user_proof_processing_mutexes: this allows us to process proofs from different users in parallel, without locking the whole batch_state.
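To make the idea concrete, here is a minimal std-only sketch of a per-user mutex map in the spirit of user_proof_processing_mutexes. All names here are illustrative: the real batcher uses tokio mutexes and Ethereum addresses, while this sketch uses `String` and `std::sync`:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Stand-in for an Ethereum address (illustrative only).
type Address = String;

struct UserMutexes {
    // The outer mutex only guards the map itself and is held briefly,
    // so looking up one user's mutex never blocks other users for long.
    inner: Mutex<HashMap<Address, Arc<Mutex<()>>>>,
}

impl UserMutexes {
    fn new() -> Self {
        Self { inner: Mutex::new(HashMap::new()) }
    }

    // Fetch (or lazily create) the mutex for a given user, then drop the
    // map lock immediately; callers lock only their own user's mutex.
    fn for_user(&self, addr: &Address) -> Arc<Mutex<()>> {
        let mut map = self.inner.lock().unwrap();
        map.entry(addr.clone())
            .or_insert_with(|| Arc::new(Mutex::new(())))
            .clone()
    }
}

fn main() {
    let mutexes = UserMutexes::new();
    let a = mutexes.for_user(&"0xabc".to_string());
    let b = mutexes.for_user(&"0xabc".to_string());
    // Same user -> same mutex instance, so their proofs serialize.
    assert!(Arc::ptr_eq(&a, &b));
    let c = mutexes.for_user(&"0xdef".to_string());
    // Different users get independent mutexes and can proceed in parallel.
    assert!(!Arc::ptr_eq(&a, &c));
    println!("ok");
}
```

The key property is that the expensive per-proof work happens while holding only that user's mutex, never the map lock or the whole batch_state.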

This new mutex setup means that batch building and proof processing can run in parallel. Before, the proof message handler kept the batch_state_lock for the entire processing, making it more sequential. Now, they’ll only block each other when they need access to the batch_state.

As long as we ensure the batch_state_lock is held while modifying shared state (adding/removing from the batch queue + updating the user state), things are safe. For example, if a replacement is validated but the proof is no longer in the state by the time we lock and check (because it was removed by the batch builder), we can just return an error to the user.

The TL;DR is: there's no need to synchronize the two processes explicitly, the batch_state_lock already ensures consistency. At worst, we'll hit a race that results in an error on one side, which might lead to a poorer UX, but I prefer that over unnecessary mutex complexity.
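As a sketch of that "at worst an error" behaviour, here is a minimal illustration of re-checking shared state under the lock after lock-free validation. Every name here is invented for the example; the real code operates on the batcher's own state types:

```rust
use std::collections::HashSet;
use std::sync::Mutex;

// Illustrative stand-in for the shared batch state.
struct BatchState {
    batch_queue: Mutex<HashSet<u64>>, // proof nonces currently queued
}

#[derive(Debug, PartialEq)]
enum ReplaceError {
    NoLongerInBatch,
}

impl BatchState {
    // Expensive validation runs outside the lock; we then take the lock
    // and re-check that the proof is still queued. If the batch builder
    // removed it in the meantime, we simply return an error to the user.
    fn try_replace(&self, nonce: u64) -> Result<(), ReplaceError> {
        // ... expensive pre-verification would happen here, lock-free ...
        let queue = self.batch_queue.lock().unwrap();
        if !queue.contains(&nonce) {
            return Err(ReplaceError::NoLongerInBatch);
        }
        // ... mutate shared state while still holding the lock ...
        Ok(())
    }
}

fn main() {
    let state = BatchState { batch_queue: Mutex::new(HashSet::from([1, 2])) };
    assert_eq!(state.try_replace(1), Ok(()));
    // Nonce 9 is gone (e.g. drained by the batch builder): error, not corruption.
    assert_eq!(state.try_replace(9), Err(ReplaceError::NoLongerInBatch));
    println!("ok");
}
```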

Another improvement: the batch_posting mutex was used to synchronize the tasks spawned to submit a batch when a new block arrived. This could be solved much more easily by simply not spawning a task and awaiting the submission directly.
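A thread-based sketch of the same idea (the real code uses tokio tasks and async; these names are illustrative): running the submission inline serializes it naturally, so no extra mutex is needed for ordering.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Illustrative batch submission: just records which batch was posted.
fn submit_batch(log: &Mutex<Vec<u32>>, id: u32) {
    log.lock().unwrap().push(id);
}

fn main() {
    let log = Arc::new(Mutex::new(Vec::new()));

    // Before: spawn a worker per new block, then need extra synchronization
    // (the old batch_posting mutex) to keep submissions ordered.
    let spawned = {
        let log = Arc::clone(&log);
        thread::spawn(move || submit_batch(&log, 1))
    };
    spawned.join().unwrap();

    // After: run the submission inline; ordering is implicit and the
    // dedicated mutex disappears.
    submit_batch(&log, 2);

    assert_eq!(*log.lock().unwrap(), vec![1, 2]);
    println!("ok");
}
```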

Finally, other improvements/refactors were made:

  • Showing debug logs in dev mode and filtering spamming crate logs
  • Better mutex handling in general, to avoid holding locks for too long and starving other tasks.

Moving forward
There are two other checks we could add to better prevent DoS:

  • Min bump fee: we should enforce a min fee bump of at least n% with respect to the previous one.
  • Min max fee: we should enforce a minimum max fee that can cover a batch of m proofs at least.
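A minimal sketch of what those two checks could look like. The percentage and the batch size below are placeholders I chose for illustration, not values from the codebase:

```rust
// Illustrative constants: a real implementation would make these configurable.
const MIN_BUMP_PERCENT: u128 = 10; // require >= 10% fee bump on replacement
const MIN_PROOFS_COVERED: u128 = 16; // max_fee must cover a batch of 16 proofs

// Min bump fee: a replacement must raise the fee by at least n%.
fn bump_is_sufficient(prev_fee: u128, new_fee: u128) -> bool {
    new_fee >= prev_fee + prev_fee * MIN_BUMP_PERCENT / 100
}

// Min max fee: the declared max_fee must cover a batch of m proofs.
fn max_fee_is_sufficient(max_fee: u128, fee_per_proof: u128) -> bool {
    max_fee >= fee_per_proof * MIN_PROOFS_COVERED
}

fn main() {
    assert!(bump_is_sufficient(100, 110));   // exactly +10%: accepted
    assert!(!bump_is_sufficient(100, 105));  // +5%: rejected
    assert!(max_fee_is_sufficient(1600, 100));
    assert!(!max_fee_is_sufficient(1599, 100));
    println!("ok");
}
```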

How to test

To test this, I suggest two runs: one with a few proofs so you can see in the logs how everything is synchronized, and another with a high load of proofs to make sure it can handle it and nothing breaks.

Set up

  1. Locate seconds_per_slot in network_params.yaml and set it to 2 seconds. This creates more frequent contention between the batch building process and proof processing.
  2. Start ethereum package:
make ethereum_package_start
  3. Start the aggregator:
make aggregator_start_ethereum_package ENVIRONMENT=devnet
  4. Start an operator:
make operator_full_registration CONFIG_FILE=config-files/config-operator-1-ethereum-package.yaml
make operator_start CONFIG_FILE=config-files/config-operator-1-ethereum-package.yaml ENVIRONMENT=devnet
  5. Start the explorer:
make explorer_build_db
make explorer_start

Few proofs test

The idea of this test is to see the mutex behaviour step by step. For this I suggest:

  1. Add a sleep in the verify_proof function in crates/batcher/src/lib.rs
    /// Takes [`VerificationData`] and spawns a blocking task to verify the proof
    async fn verify_proof(
        &self,
        verification_data: &VerificationData,
    ) -> Result<(), ProofInvalidReason> {
        tokio::time::sleep(tokio::time::Duration::from_secs(5)).await;
...
  2. Start the batcher:
make batcher_start_ethereum_package
  3. Fund two wallets in Aligned; here you can use the default rich accounts:
make aligned_install_compiling

# Fund wallet 1
aligned deposit-to-batcher --amount 1ether --private_key 0x4bbbf85ce3377467afe5d46f804f221813b2bb87f24d81f60f1fcdbf7cbf4356  --network devnet --rpc_url http://localhost:8545

# Fund wallet 2
aligned deposit-to-batcher --amount 1ether --private_key 0xdbda1821b80551c9d65939329250298aa3472ba22feea921c0cf5d620ea67b97  --network devnet --rpc_url http://localhost:8545
  4. Send proofs and analyze the lock behaviour:
# With account 1
aligned submit --proving_system SP1 \
		--proof scripts/test_files/sp1/sp1_fibonacci_5_0_0.proof \
		--vm_program scripts/test_files/sp1/sp1_fibonacci_5_0_0.elf \
		--private_key 0x4bbbf85ce3377467afe5d46f804f221813b2bb87f24d81f60f1fcdbf7cbf4356 \
		--repetitions 10 \
		--random_address \
		--custom_fee_estimate 5 \
		--rpc_url  http://localhost:8545 \
		--network devnet

# With account 2
aligned submit --proving_system SP1 \
		--proof scripts/test_files/sp1/sp1_fibonacci_5_0_0.proof \
		--vm_program scripts/test_files/sp1/sp1_fibonacci_5_0_0.elf \
		--private_key 0xdbda1821b80551c9d65939329250298aa3472ba22feea921c0cf5d620ea67b97 \
		--repetitions 10 \
		--random_address \
		--custom_fee_estimate 5 \
		--rpc_url  http://localhost:8545 \
		--network devnet

# And again with account 1 to open two connections with the same address
aligned submit --proving_system SP1 \
		--proof scripts/test_files/sp1/sp1_fibonacci_5_0_0.proof \
		--vm_program scripts/test_files/sp1/sp1_fibonacci_5_0_0.elf \
		--private_key 0x4bbbf85ce3377467afe5d46f804f221813b2bb87f24d81f60f1fcdbf7cbf4356 \
		--repetitions 10 \
		--random_address \
		--custom_fee_estimate 5 \
		--rpc_url  http://localhost:8545 \
		--network devnet

Big load test

For this test, we use the task-sender to spam the batcher with different wallets and make sure it doesn't lock up and continues to build batches.

  1. Go to crates/task-sender/wallets/devnet and copy the 10 rich accounts of ethereum-package:
0xac0974bec39a17e36ba4a6b4d238ff944bacb478cbed5efcae784d7bf4f2ff80
0x59c6995e998f97a5a0044966f0945389dc9e86dae88c7a8412f4603b6b78690d
0x5de4111afa1a4b94908f83103eb1f1706367c2e68ca870fc3fb9a804cdab365a
0x7c852118294e51e653712a81e05800f419141751be58f605c371e15141b007a6
0x47e179ec197488593b187f80a00eb0da91f1b9d0b13f8733639f19c30a34926a
0x8b3a350cf5c34c9194ca85829a2df0ec3153be0318b5e2d3348e872092edffba
0x92db14e403b83dfe3df233f83dfa3a0d7096f21ca9b0d6d6b8d88b2b4ec1564e
0x4bbbf85ce3377467afe5d46f804f221813b2bb87f24d81f60f1fcdbf7cbf4356
0xdbda1821b80551c9d65939329250298aa3472ba22feea921c0cf5d620ea67b97
0x2a871d0798f97d79848a013d4936a73bf4cc922c825d33c1cf7073dff6d409c6
  2. Start the batcher:
make batcher_start_ethereum_package
  3. Generate proofs for the task-sender and fund wallets:
make task_sender_generate_gnark_groth16_proofs
make task_sender_fund_wallets_devnet
  4. Send proofs. I suggest doing this in several different processes so you have many different WebSocket connections per address; the per-user mutex should synchronize them correctly:
# Modify this params as you wish
make task_sender_send_infinite_proofs_devnet BURST_SIZE=50 BURST_TIME_SECS=1

Type of change

  • Bug fix
  • Optimization
  • Refactor

Checklist

  • “Hotfix” to testnet, everything else to staging
  • Linked to Github Issue
  • This change depends on code or research by an external entity
    • Acknowledgements were updated to give credit
  • Unit tests added
  • This change requires new documentation.
    • Documentation has been added/updated.
  • This change is an Optimization
    • Benchmarks added/run
  • Has a known issue
  • If your PR changes the Operator compatibility (Ex: Upgrade prover versions)
    • This PR adds compatibility for operators on both versions and does not change crates/docs/examples
    • This PR updates the batcher and docs/examples to the newer version. This requires that operators are already updated to be compatible

Comment thread Makefile
@cd crates/task-sender && \
cargo run --release -- generate-proofs \
--number-of-proofs $(NUMBER_OF_PROOFS) --proof-type gnark_groth16 \
--number-of-proofs $(NUMBER_OF_PROOFS) --proof-type groth16 \
Collaborator


This should be reverted

Member Author


Actually, gnark_groth16 wasn't added to the task-sender. We should update the command args in the task-sender first in another PR. I fixed that here because I used the task-sender to test.

Comment thread crates/batcher/src/lib.rs
.clone()
};
let _user_mutex = user_mutex.lock().await;
debug!("User mutex for {:?} acquired...", addr_in_msg);
Collaborator


Would be useful to log when the lock is released

Collaborator


Debug level is okay imo

Member Author


I agree, but given all the early returns it would add a lot of noise to the code. For now I would avoid it, and after #2015 we can easily add this, as all the errors will be handled at one level (so we can put the log there).

Comment thread crates/batcher/src/main.rs Outdated
env_logger::Builder::from_env(Env::default().default_filter_or("info")).init();
env_logger::Builder::from_env(Env::default().filter_or(
"RUST_LOG",
"info,aligned_batcher=debug,ethers_providers=off",
Collaborator


If we want to reduce ethers_providers logs, I'd set it to WARN at least, to avoid omitting errors or warnings

Member Author

@MarcosNicolau MarcosNicolau Jul 10, 2025


Sure, the default logs for ethers_providers were very spammy. WARN makes more sense. Solved in a2c05bf.

Comment thread crates/batcher/src/lib.rs Outdated
// * ---------------------------------------------------------------------*

if batch_state_lock.is_queue_full() {
if self.batch_state.lock().await.is_queue_full() {
Collaborator


This check may be done in the add_to_batch function, because otherwise you can have a race condition where the batch queue limit is exceeded, as we discussed in the huddle

Member Author


It is a case I had in mind, but because triggering it was really hard, I preferred to avoid the complexity of handling it. As discussed, the best option was to move it to the add_to_batch function; see 8787958.

Contributor

@MauroToscano MauroToscano left a comment


The state should probably be split according to the locks, instead of having locks coming out of nowhere. The batch state shouldn't be locked all the time by proof processors.

@MauroToscano
Contributor

This PR #2017 replaces this one

@JuArce JuArce deleted the fix/verification-concurrency branch January 13, 2026 18:59