This repository was archived by the owner on Feb 21, 2024. It is now read-only.

Commit 283b00c

hacky workaround: use locking to quiet race detector problems
So we have a problem which is triggered in part by the race detector, but which is actually deeper, and also possibly rare enough to be politely ignored. The real underlying issue is that when multiple tests run in CI, multiple instances of the CLI test can end up using the same postgres database backing some of their DAX stuff. We have workarounds for this in some places, but not others.

The *observed symptom*, though, is a trivial race detector issue: one call to `(*Resource).Lock()` and another call to `(*Resource).IsLocked()` aren't synchronized in any way, so if the race detector spots them overlapping, it complains. We can suppress that very easily by synchronizing them. That does not solve the other possibly-weird problems, so this may not actually address the issue, but it should reduce the rate of sporadic failures significantly, which buys us time to think about solving the deeper problem.

The underlying design issue is that we're reusing the database name in postgres for testing. This gives us bounded growth (one database) while leaving the database contents up after a failed test (so we can examine them), then truncating the database during startup if it already exists. That works fine if *only one thing runs at once*, which is true on a laptop, but in CI it's sometimes not. A real fix is complex and requires rethinking how we approach the test setup: we don't want unbounded growth, we don't want two concurrent copies of the test to see each other, and ensuring cleanup after a test failure is surprisingly hard.
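The pattern applied here — guarding the boolean with the same mutex in every accessor — can be sketched in isolation. This is an illustrative standalone type, not the actual `Resource` from `dax/storage`:

```go
package main

import (
	"fmt"
	"sync"
)

// resource mirrors the shape of the fix: a boolean flag that is only
// ever read or written while holding mu, so concurrent Lock/IsLocked
// calls no longer trip the race detector.
type resource struct {
	mu     sync.Mutex
	locked bool
}

func (r *resource) Lock() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.locked = true
}

func (r *resource) Unlock() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.locked = false
}

func (r *resource) IsLocked() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.locked
}

func main() {
	r := &resource{}
	var wg sync.WaitGroup
	// Concurrent writers and readers: without mu, running this under
	// `go test -race` (or `go run -race`) reports an unsynchronized
	// read/write of r.locked.
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); r.Lock() }()
		go func() { defer wg.Done(); _ = r.IsLocked() }()
	}
	wg.Wait()
	fmt.Println(r.IsLocked()) // prints "true": every writer set the flag
}
```

Note that this only makes the flag itself race-free; as the commit message says, it does nothing about two test processes sharing one database, which is the condition that produced the overlapping calls in the first place.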
1 parent ad5f1d4 commit 283b00c

File tree

1 file changed: +27 additions, -0 deletions


dax/storage/storage.go

Lines changed: 27 additions & 0 deletions
@@ -243,6 +243,12 @@ type Resource struct {
 	latestWLVersion int
 	lastWLPos       int
 
+	// temporary workaround: we use this to control access to locked
+	// because it can cause race detector failures in testing under
+	// circumstances. these circumstances are probably actually a
+	// different and more serious bug, but we want CI to run in the
+	// mean time.
+	mu     sync.Mutex
 	locked bool
 
 	dirty bool
@@ -259,6 +265,13 @@ func (m *Resource) initialize() *Resource {
 // believes it holds the lock. It does not look at the state of
 // underlying storage to verify the lock.
 func (m *Resource) IsLocked() bool {
+	// WARNING: This is probably wrong. The problem this immediately
+	// solves is race detector complaining about writes in Lock()
+	// racing against this. That's valid. But we shouldn't be getting
+	// there at all, so something else is also wrong. This is a
+	// WORKAROUND.
+	m.mu.Lock()
+	defer m.mu.Unlock()
 	return m.locked
 }
 
@@ -373,6 +386,13 @@ func (m *Resource) Lock() error {
 	if err := m.writelogger.Lock(m.bucket, m.key); err != nil {
 		return errors.Wrap(err, "acquiring lock")
 	}
+	// WARNING: This is probably wrong. The problem this immediately
+	// solves is race detector complaining about writes in Lock()
+	// racing against this. That's valid. But we shouldn't be getting
+	// there at all, so something else is also wrong. This is a
+	// WORKAROUND.
+	m.mu.Lock()
+	defer m.mu.Unlock()
 	m.locked = true
 	return nil
 }
@@ -453,6 +473,13 @@ func (m *Resource) Unlock() error {
 	if err := m.writelogger.Unlock(m.bucket, m.key); err != nil {
 		return errors.Wrap(err, "unlocking")
 	}
+	// WARNING: This is probably wrong. The problem this immediately
+	// solves is race detector complaining about writes in Lock()
+	// racing against this. That's valid. But we shouldn't be getting
+	// there at all, so something else is also wrong. This is a
+	// WORKAROUND.
+	m.mu.Lock()
+	defer m.mu.Unlock()
 	m.locked = false
 	return nil
 }
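The commit message notes the real fix — isolating concurrent CI runs from each other — is complex. One possible direction, sketched here as a purely hypothetical helper (the prefix and naming scheme are assumptions, not the repository's actual convention), is to give each test run its own database name:

```go
package main

import (
	"fmt"
	"math/rand"
)

// testDBName is a hypothetical helper: each test run derives a unique
// database name, so two CI jobs running concurrently never share
// postgres state. The cost is that growth is no longer bounded to one
// database, so leftover databases from failed runs would need separate
// cleanup — which is part of why the commit calls the real fix hard.
func testDBName(prefix string) string {
	return fmt.Sprintf("%s_%08x", prefix, rand.Uint32())
}

func main() {
	fmt.Println(testDBName("dax_test")) // prints something like dax_test_<8 hex digits>
}
```

This also gives up the property the commit message describes of keeping one reusable database around for post-failure inspection, so it is a trade-off rather than a strict improvement.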

0 commit comments
