fix: improve concurrency handling of users#2014
MarcosNicolau wants to merge 20 commits into staging.
Conversation
instead wait until it finishes and update the last block number with the receipt
```diff
 @cd crates/task-sender && \
 cargo run --release -- generate-proofs \
-  --number-of-proofs $(NUMBER_OF_PROOFS) --proof-type gnark_groth16 \
+  --number-of-proofs $(NUMBER_OF_PROOFS) --proof-type groth16 \
```
Actually, `gnark_groth16` wasn't added to the task-sender. We should update the command args in the task-sender in another PR first. I fixed it here because I used the task-sender to test.
```rust
        .clone()
};
let _user_mutex = user_mutex.lock().await;
debug!("User mutex for {:?} acquired...", addr_in_msg);
```
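The diff above clones a per-user mutex out of a shared map before locking it. A self-contained sketch of that pattern, using std's synchronous `Mutex` for illustration (the batcher itself uses tokio's async `Mutex`, and the names `UserLocks`/`lock_for` here are illustrative, not from the PR):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical per-user lock registry.
struct UserLocks {
    locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl UserLocks {
    fn new() -> Self {
        Self { locks: Mutex::new(HashMap::new()) }
    }

    /// Clone the Arc while holding the map lock only briefly, so locking
    /// one user's mutex never blocks lookups for other users.
    fn lock_for(&self, addr: &str) -> Arc<Mutex<()>> {
        self.locks
            .lock()
            .unwrap()
            .entry(addr.to_string())
            .or_insert_with(|| Arc::new(Mutex::new(())))
            .clone()
    }
}

fn main() {
    let registry = UserLocks::new();
    let a = registry.lock_for("0xabc");
    let b = registry.lock_for("0xabc");
    // The same underlying mutex is returned for the same user.
    assert!(Arc::ptr_eq(&a, &b));
    let _guard = a.lock().unwrap(); // held until dropped
    println!("per-user lock acquired");
}
```

The key point is that the map lock and the per-user lock are never held at the same time for long: the map guard is a temporary that drops at the end of the `lock_for` expression.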
It would be useful to log when the lock is released.
I agree, but given all the early returns it would add a lot of noise to the code. For now I'd avoid it; after #2015 we can easily add this, since all the errors will be handled at one level (so we can put the log there).
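One way to log the release on every early-return path without sprinkling log calls is an RAII wrapper whose `Drop` impl does the logging. A minimal sketch (the `LoggedGuard` type is hypothetical, and `println!` stands in for `debug!`):

```rust
use std::sync::{Mutex, MutexGuard};

/// Wraps a lock guard and logs when it is released. Because Drop runs
/// on every exit path, early returns are covered automatically.
struct LoggedGuard<'a, T> {
    guard: MutexGuard<'a, T>,
    addr: String,
}

impl<'a, T> Drop for LoggedGuard<'a, T> {
    fn drop(&mut self) {
        // In the batcher this would be a `debug!` call.
        println!("User mutex for {} released", self.addr);
    }
}

fn main() {
    let m = Mutex::new(0u32);
    {
        let g = LoggedGuard { guard: m.lock().unwrap(), addr: "0xabc".into() };
        println!("User mutex for {} acquired, value = {}", g.addr, *g.guard);
        // Any early return inside this scope would still trigger the Drop log.
    }
    println!("done");
}
```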
```diff
-env_logger::Builder::from_env(Env::default().default_filter_or("info")).init();
+env_logger::Builder::from_env(Env::default().filter_or(
+    "RUST_LOG",
+    "info,aligned_batcher=debug,ethers_providers=off",
```
If we want to reduce ethers_providers logs, I'd set it to warn at least, so we don't omit errors or warnings.
Sure, the default logs for ethers_providers were very spammy. A warn makes more sense. Solved in a2c05bf.
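Filter strings like the one in the diff are a comma-separated list of directives: a bare level sets the default, and `module=level` overrides it per module. A minimal illustrative parser of that grammar (this is not env_logger's actual implementation, just a sketch of how the directives are interpreted):

```rust
use std::collections::HashMap;

/// Parse an env_logger-style filter string such as
/// "info,aligned_batcher=debug,ethers_providers=warn" into a
/// default level plus per-module overrides. Illustrative only.
fn parse_filter(spec: &str) -> (String, HashMap<String, String>) {
    let mut default = "error".to_string(); // env_logger's own default
    let mut modules = HashMap::new();
    for part in spec.split(',') {
        match part.split_once('=') {
            Some((module, level)) => {
                modules.insert(module.to_string(), level.to_string());
            }
            None => default = part.to_string(),
        }
    }
    (default, modules)
}

fn main() {
    let (default, modules) =
        parse_filter("info,aligned_batcher=debug,ethers_providers=warn");
    assert_eq!(default, "info");
    assert_eq!(modules["aligned_batcher"], "debug");
    // Capping ethers_providers at warn still surfaces warnings and errors,
    // unlike "off", which silences the module entirely.
    assert_eq!(modules["ethers_providers"], "warn");
    println!("ok");
}
```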
```diff
-if batch_state_lock.is_queue_full() {
+if self.batch_state.lock().await.is_queue_full() {
```
This check may be done in the add_to_batch function, because otherwise you can have a race condition where the batch queue limit is exceeded, as we discussed in the huddle.
It's a case I had in mind, but because triggering it was really hard I preferred to avoid the complexity of handling it. As discussed, the best option was to move the check into the add_to_batch function; see 8787958.
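The race being discussed is a classic check-then-act bug: checking `is_queue_full` under one lock acquisition and pushing under a later one lets another task enqueue in between. Moving the check inside `add_to_batch`, under the same lock as the insertion, closes it. A sketch under assumed names (`BatchQueue`, a `u64` standing in for a queued proof):

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Illustrative batch queue with a capacity limit.
struct BatchQueue {
    queue: Mutex<VecDeque<u64>>,
    max_queue_len: usize,
}

impl BatchQueue {
    /// Check and insert under a single lock acquisition, so the limit
    /// can never be exceeded by a concurrent enqueue.
    fn add_to_batch(&self, proof: u64) -> Result<(), String> {
        let mut queue = self.queue.lock().unwrap();
        if queue.len() >= self.max_queue_len {
            return Err("batch queue is full".into());
        }
        queue.push_back(proof);
        Ok(())
    }
}

fn main() {
    let state = BatchQueue { queue: Mutex::new(VecDeque::new()), max_queue_len: 2 };
    assert!(state.add_to_batch(1).is_ok());
    assert!(state.add_to_batch(2).is_ok());
    assert!(state.add_to_batch(3).is_err()); // rejected: limit enforced atomically
    println!("ok");
}
```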
MauroToscano left a comment:
The state should probably be split according to the locks, instead of having locks that come from nowhere. The batch state shouldn't be locked all the time by the proof processors.
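The suggestion above amounts to giving each independently-accessed piece of state its own lock instead of one coarse lock over everything. A sketch with illustrative field names (not the batcher's actual struct):

```rust
use std::sync::Mutex;

/// State split per lock: each field is guarded by its own mutex, so the
/// batch builder and the proof processors contend only on the data they
/// actually share. Field names are illustrative.
struct BatcherState {
    batch_queue: Mutex<Vec<u64>>,            // touched by the batch builder
    user_nonces: Mutex<Vec<(String, u64)>>,  // touched by proof processors
}

fn main() {
    let state = BatcherState {
        batch_queue: Mutex::new(vec![]),
        user_nonces: Mutex::new(vec![]),
    };
    // A proof processor can update user state...
    state.user_nonces.lock().unwrap().push(("0xabc".into(), 1));
    // ...without ever taking the batch builder's queue lock.
    state.batch_queue.lock().unwrap().push(42);
    assert_eq!(state.batch_queue.lock().unwrap().len(), 1);
    println!("ok");
}
```

The trade-off is that operations spanning both fields must take both locks (in a fixed order, to avoid deadlock).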
PR #2017 replaces this one.
Description
Previously, we were performing proof pre-verification as soon as a new proof message arrived. This verification could take a few hundred milliseconds, and since it was done before even validating the message, the lock was held far too long.
We've moved the proof verification step to after the basic validation checks, such as nonce, max fee, and replacement. This way, only well-formed messages proceed to the verification step.
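The reordering above can be sketched as follows; all names and values (`Msg`, `handle`, the nonce/fee checks) are illustrative stand-ins for the batcher's real types:

```rust
/// Toy message carrying only the fields the cheap checks need.
struct Msg { nonce: u64, max_fee: u64 }

/// Cheap validation: rejects malformed messages in microseconds.
fn validate(msg: &Msg, expected_nonce: u64, min_fee: u64) -> Result<(), String> {
    if msg.nonce != expected_nonce { return Err("bad nonce".into()); }
    if msg.max_fee < min_fee { return Err("max fee too low".into()); }
    Ok(())
}

/// Placeholder for the expensive (~hundreds of ms) verification step.
fn verify_proof(_msg: &Msg) -> bool { true }

fn handle(msg: &Msg) -> Result<(), String> {
    validate(msg, 7, 100)?;  // cheap checks first; early return on failure
    if !verify_proof(msg) {  // expensive step runs only for valid messages
        return Err("invalid proof".into());
    }
    Ok(())
}

fn main() {
    assert!(handle(&Msg { nonce: 7, max_fee: 100 }).is_ok());
    assert!(handle(&Msg { nonce: 0, max_fee: 100 }).is_err()); // never reaches verify_proof
    println!("ok");
}
```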
However, this introduced a new problem: proof verification still needs access to shared state (`batch_state`), and locking that state for the duration of the verification step could block all other processing for too long. To solve this, we introduced a per-user mutex, `user_proof_processing_mutexes`: this lets us process proofs from different users in parallel without locking the whole `batch_state`.

This new mutex setup means that batch building and proof processing can run in parallel. Before, the proof message handler kept the `batch_state` lock for the entire processing, making it mostly sequential. Now, the two only block each other when they need access to the `batch_state`. As long as we ensure the `batch_state` lock is held while modifying shared state (adding/removing from the batch queue and updating the user state), things are safe. For example, if a replacement is validated but the proof is no longer in the state by the time we lock and check (because it was removed by the batch builder), we can just return an error to the user.
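The "re-check under the lock" safety argument can be sketched like this; the struct, field, and method names are illustrative, not the batcher's actual API:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Illustrative shared state mapping a nonce to its current max fee.
struct BatchEntries {
    entries: Mutex<HashMap<u64, u64>>,
}

impl BatchEntries {
    /// Apply an already-verified replacement. The entry is re-checked
    /// under the lock: if the batch builder removed it in the meantime,
    /// we return an error instead of corrupting state.
    fn apply_replacement(&self, nonce: u64, new_fee: u64) -> Result<(), String> {
        let mut entries = self.entries.lock().unwrap();
        match entries.get_mut(&nonce) {
            Some(fee) => { *fee = new_fee; Ok(()) }
            None => Err("entry was already batched".into()),
        }
    }
}

fn main() {
    let state = BatchEntries { entries: Mutex::new(HashMap::from([(1, 50)])) };
    assert!(state.apply_replacement(1, 80).is_ok());
    // Simulates the race: nonce 2 was removed by the batch builder
    // between verification and locking, so the caller just gets an error.
    assert!(state.apply_replacement(2, 80).is_err());
    println!("ok");
}
```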
The tl;dr is: there's no need to synchronize the two processes explicitly; the `batch_state` lock already ensures consistency. At worst, we'll hit a race that results in an error on one side, which might lead to a slightly poorer UX, but I prefer that over unnecessary mutex complexity.

Another improvement: the `batch_posting` mutex was used to synchronize the tasks spawned to submit a batch when a new block arrived. This could be solved much more easily by simply not spawning a task and awaiting the submission directly.

Finally, other improvements/refactors have been introduced:
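The `batch_posting` simplification is just "run to completion instead of spawn-and-guard". A sketch with plain function calls standing in for async tasks (names and the `Vec<u64>` log are illustrative):

```rust
/// Stand-in for the actual on-chain batch submission.
fn submit_batch(block: u64, log: &mut Vec<u64>) {
    log.push(block);
}

fn main() {
    let mut log = Vec::new();
    for block in [10, 11, 12] {
        // Before: spawn a task per new block, guarded by a batch_posting
        // mutex so two submissions could not run at once.
        // After: a plain sequential call; the next block naturally waits
        // until the previous submission finishes, no extra mutex needed.
        submit_batch(block, &mut log);
    }
    assert_eq!(log, vec![10, 11, 12]); // submissions never overlap or reorder
    println!("ok");
}
```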
Moving forward
There are two other checks that would better prevent DDoS:

- Only accept a replacement message if its max fee is higher by at least `n%` with respect to the previous one.
- … `m` proofs at least.

How to test
To test this, I would suggest doing it in two parts: one with a few proofs, so you can see in the logs how everything is synchronized, and another with a high load of proofs, to make sure it can handle the load and nothing breaks.
Set up
Modify `seconds_per_slot` in `network_params.yaml` and set it to `2` seconds. This is to create more frequent contention between the batch building process and the proof processing.

Few proofs test
The idea of this test is to observe the mutex behaviour step by step. For this I suggest:

- … the `verify_proof` function in `crates/batcher/src/lib.rs`

Big load test
For this test, we use the task-sender to spam the batcher from different wallets and make sure it doesn't lock up and keeps building batches.

In `crates/task-sender/wallets/devnet`, copy the 10 rich accounts of ethereum-package:

```shell
# Modify these params as you wish
make task_sender_send_infinite_proofs_devnet BURST_SIZE=50 BURST_TIME_SECS=1
```

Type of change
Checklist
… `testnet`, everything else to `staging`