|
|
DescriptionChange thread check to race check. Also, add comment to explain implementation of RaceChecker.
BUG=webrtc:6345
Committed: https://crrev.com/347ec5c72e9fc1698520b650f0515205999e39d0
Cr-Commit-Position: refs/heads/master@{#14369}
Patch Set 1 #Patch Set 2 : new impl #
Total comments: 2
Patch Set 3 : using RaceChecker instead #
Total comments: 6
Patch Set 4 : update comment #
Total comments: 1
Patch Set 5 : comment #Messages
Total messages: 40 (10 generated)
Description was changed from ========== Change thread check to non-re-entrancy check. BUG=webrtc:6345 ========== to ========== Change thread check to non-re-entrancy check. Introduces rtc::NonReentrantChecker. BUG=webrtc:6345 ==========
solenberg@webrtc.org changed reviewers: + kwiberg@webrtc.org, tommi@webrtc.org
Tommi, Karl, this is a proof of concept and I'd like your input before adding tests and documentation. In Chromium, the thread providing capture audio to WebRTC may change for legitimate reasons - "ownership" is transferred to a new thread, and the thread checker in WebRtcAudioSendStream::OnData() blows up. The thread checker did its job fine, but gave us a bit more than we asked for. Not only does it warn us about multi-thread access to this code path, it also locks us to one specific thread ever calling here. One way to work around it would be to add plumbing from Chromium all the way down, to signal when ownership is transferred. Another way is to just remove the thread checker and hope for the best. I've here opted to instead add a "NonReentrantChecker", which simply tries to acquire a critsect and DCHECKs if it couldn't. Since it is related in purpose to ThreadChecker I've added it to thread_checker.h. Why not just add that CS in this one place? Why invent a new primitive? Good question - since I wanted it to be completely stripped from release builds I needed to do the macro dance regardless, and it suddenly felt like too much to leave in webrtcvoiceengine.cc. WDYT?
On 2016/09/19 20:45:01, the sun wrote: > Tommi, Karl, this is a proof of concept and I'd like your input before adding > tests and documentation. > > In Chromium, the thread providing capture audio to WebRTC may change for > legitimate reasons - "ownership" is transferred to a new thread, and the thread > checker in WebRtcAudioSendStream::OnData() blows up. > > The thread checker did its job fine, but gave us a bit more than we asked for. > Not only does it warn us about multi-thread access to this code path, it also > locks us to one specific thread ever calling here. One way to work around it > would be to add plumbing from Chromium all the way down, to signal when > ownership is transferred. Another way is to just remove the thread checker and > hope for the best. > > I've here opted to instead add a "NonReentrantChecker", which simply tries to > acquire a critsect and DCHECKs if it couldn't. Since it is related in purpose to > ThreadChecker I've added it to thread_checker.h. > > Why not just add that CS in this one place? Why invent a new primitive? Good > question - since I wanted it to be completely stripped from release builds I > needed to do the macro dance regardless, and it suddenly felt like too much to > leave in webrtcvoiceengine.cc. > > WDYT? Would a SequencedTaskChecker help?
On 2016/09/20 08:51:58, tommi (webrtc) wrote: > On 2016/09/19 20:45:01, the sun wrote: > > Tommi, Karl, this is a proof of concept and I'd like your input before adding > > tests and documentation. > > > > In Chromium, the thread providing capture audio to WebRTC may change for > > legitimate reasons - "ownership" is transferred to a new thread, and the > thread > > checker in WebRtcAudioSendStream::OnData() blows up. > > > > The thread checker did its job fine, but gave us a bit more than we asked for. > > Not only does it warn us about multi-thread access to this code path, it also > > locks us to one specific thread ever calling here. One way to work around it > > would be to add plumbing from Chromium all the way down, to signal when > > ownership is transferred. Another way is to just remove the thread checker and > > hope for the best. > > > > I've here opted to instead add a "NonReentrantChecker", which simply tries to > > acquire a critsect and DCHECKs if it couldn't. Since it is related in purpose > to > > ThreadChecker I've added it to thread_checker.h. > > > > Why not just add that CS in this one place? Why invent a new primitive? Good > > question - since I wanted it to be completely stripped from release builds I > > needed to do the macro dance regardless, and it suddenly felt like too much to > > leave in webrtcvoiceengine.cc. > > > > WDYT? > > Would a SequencedTaskChecker help? Doesn't look like it. SequencedTaskChecker will fall back to TC::CalledOnValidThread() unless executing on a task queue.
On 2016/09/20 09:02:38, the sun wrote: > On 2016/09/20 08:51:58, tommi (webrtc) wrote: > > On 2016/09/19 20:45:01, the sun wrote: > > > Tommi, Karl, this is a proof of concept and I'd like your input before > adding > > > tests and documentation. > > > > > > In Chromium, the thread providing capture audio to WebRTC may change for > > > legitimate reasons - "ownership" is transferred to a new thread, and the > > thread > > > checker in WebRtcAudioSendStream::OnData() blows up. > > > > > > The thread checker did its job fine, but gave us a bit more than we asked > for. > > > Not only does it warn us about multi-thread access to this code path, it > also > > > locks us to one specific thread ever calling here. One way to work around it > > > would be to add plumbing from Chromium all the way down, to signal when > > > ownership is transferred. Another way is to just remove the thread checker > and > > > hope for the best. > > > > > > I've here opted to instead add a "NonReentrantChecker", which simply tries > to > > > acquire a critsect and DCHECKs if it couldn't. Since it is related in > purpose > > to > > > ThreadChecker I've added it to thread_checker.h. > > > > > > Why not just add that CS in this one place? Why invent a new primitive? Good > > > question - since I wanted it to be completely stripped from release builds I > > > needed to do the macro dance regardless, and it suddenly felt like too much > to > > > leave in webrtcvoiceengine.cc. > > > > > > WDYT? > > > > Would a SequencedTaskChecker help? > > Doesn't look like it. SequencedTaskChecker will fall back to > TC::CalledOnValidThread() unless executing on a task queue. There might be ways to better integrate with Chrome there. Can you talk to Per about the TaskQueue implementation we have in Chrome?
I like the idea. https://codereview.webrtc.org/2350663002/diff/20001/webrtc/base/thread_checker.h File webrtc/base/thread_checker.h (right): https://codereview.webrtc.org/2350663002/diff/20001/webrtc/base/thread_checke... webrtc/base/thread_checker.h:211: __RTC_DCHECK_NON_REENTRANT_NAME(checker, __LINE__)(checker); The semicolon shouldn't be here. https://codereview.webrtc.org/2350663002/diff/20001/webrtc/media/engine/webrt... File webrtc/media/engine/webrtcvoiceengine.cc (right): https://codereview.webrtc.org/2350663002/diff/20001/webrtc/media/engine/webrt... webrtc/media/engine/webrtcvoiceengine.cc:1319: rtc::NonReentrantChecker audio_capture_non_reentrant_checker_; That you can change the name of the checker without having to update GUARDED_BY or similar annotations for the protected member variables is a problem. I presume this class lacks such annotations for legacy reasons? Still, in a proof-of-concept, it would be useful to see annotations.
On 2016/09/20 10:55:22, kwiberg-webrtc wrote: > I like the idea. > > https://codereview.webrtc.org/2350663002/diff/20001/webrtc/base/thread_checker.h > File webrtc/base/thread_checker.h (right): > > https://codereview.webrtc.org/2350663002/diff/20001/webrtc/base/thread_checke... > webrtc/base/thread_checker.h:211: __RTC_DCHECK_NON_REENTRANT_NAME(checker, > __LINE__)(checker); > The semicolon shouldn't be here. > > https://codereview.webrtc.org/2350663002/diff/20001/webrtc/media/engine/webrt... > File webrtc/media/engine/webrtcvoiceengine.cc (right): > > https://codereview.webrtc.org/2350663002/diff/20001/webrtc/media/engine/webrt... > webrtc/media/engine/webrtcvoiceengine.cc:1319: rtc::NonReentrantChecker > audio_capture_non_reentrant_checker_; > That you can change the name of the checker without having to update GUARDED_BY > or similar annotations for the protected member variables is a problem. I > presume this class lacks such annotations for legacy reasons? Still, in a > proof-of-concept, it would be useful to see annotations. perkj@ made me aware of https://cs.chromium.org/chromium/src/third_party/webrtc/base/race_checker.h I'll update the CL and use that since it is almost exactly what I was trying to do here (% an int or two left in release builds).
On 2016/09/20 12:34:46, the sun wrote: > perkj@ made me aware of > https://cs.chromium.org/chromium/src/third_party/webrtc/base/race_checker.h Ew! That one uses volatile int instead of real atomics. I guess it makes sense, since it can only ever misbehave if there was indeed a race, but still... it's only a few lines of code, and five minutes later I'm still not sure that it can't misbehave in a way that wouldn't be OK. > I'll update the CL and use that since it is almost exactly what I was trying to > do here (% an int or two left in release builds). Yes, that implementation attempts to be usable both as CHECK and DCHECK, so the struct can't be empty in release builds. I would be much more comfortable with something that was a no-op in release builds, and used proper synchronization to detect races when DCHECK_IS_ON. I won't object if you write a CL to copy it into WebRTC, though.
On 2016/09/20 12:50:56, kwiberg-webrtc wrote: > On 2016/09/20 12:34:46, the sun wrote: > > perkj@ made me aware of > > https://cs.chromium.org/chromium/src/third_party/webrtc/base/race_checker.h > > Ew! That one uses volatile int instead of real atomics. I guess it makes sense, > since it can only ever misbehave if there was indeed a race, but still... it's > only a few lines of code, and five minutes later I'm still not sure that it > can't misbehave in a way that wouldn't be OK. Yeah, I think you're right. In C++, volatile only marks memory as "may change at any time", it does not say anything about memory ordering, like it does in Java, and is therefore not so useful for multithreaded situations. But these things are *tricky*. To me it looks like the "if (access_count_++ == 0)" line introduces a race. Loading the value and incrementing will happen in separate operations so the thread may be interrupted in between, and they will proceed to set accessing_thread_. If I'm right that'd be a race in the RaceChecker. > > I'll update the CL and use that since it is almost exactly what I was trying > to > > do here (% an int or two left in release builds). > > Yes, that implementation attempts to be usable both as CHECK and DCHECK, so the > struct can't be empty in release builds. I would be much more comfortable with > something that was a no-op in release builds, and used proper synchronization to > detect races when DCHECK_IS_ON. > > I won't object if you write a CL to copy it into WebRTC, though. It's already in webrtc/base/. I'll ask the author for an explanation and suggest we use a crit sect if we figure out it is indeed broken.
On 2016/09/20 17:53:31, the sun wrote: > On 2016/09/20 12:50:56, kwiberg-webrtc wrote: > > On 2016/09/20 12:34:46, the sun wrote: > > > perkj@ made me aware of > > > https://cs.chromium.org/chromium/src/third_party/webrtc/base/race_checker.h > > > > Ew! That one uses volatile int instead of real atomics. I guess it makes > sense, > > since it can only ever misbehave if there was indeed a race, but still... it's > > only a few lines of code, and five minutes later I'm still not sure that it > > can't misbehave in a way that wouldn't be OK. > > Yeah, I think you're right. In C++, volatile only marks memory as "may change at > any time", it does not say anything about memory ordering, like it does in Java, > and is therefore not so useful for multithreaded situations. But these things > are *tricky*. Yes. It's difficult enough to get right if you use real atomic operations. > To me it looks like the "if (access_count_++ == 0)" line introduces a race. > [...] If I'm right that'd be a race in the RaceChecker. Yes (trivially, because the code writes to memory without proper synchronization). It's clear enough that if there is no race, the checker will reliably fail to report a race while doing no damage to the system. It's also obvious that if we experience a race, the checker may (1) report the race without doing any damage, or (2) fail to report the race without doing any damage. What's in question is if there's a third possibility in case we experience a race: (3) the checker damages the system. If we just stick to the standard, then (3) may trivially occur, because a race is undefined behavior, so the compiler is free to e.g. insert its own race checker that starts downloading cat videos if it detects a race. It's very likely that (3) cannot occur in practice, but as I said I'm not sure. > > > > I'll update the CL and use that since it is almost exactly what I was trying > > to > > > do here (% an int or two left in release builds). > > > > Yes, that implementation attempts to be usable both as CHECK and DCHECK, so > the > > struct can't be empty in release builds. I would be much more comfortable with > > something that was a no-op in release builds, and used proper synchronization > to > > detect races when DCHECK_IS_ON. > > > > I won't object if you write a CL to copy it into WebRTC, though. > > It's already in webrtc/base/. I'll ask the author for an explanation and suggest > we use a crit sect if we figure out it is indeed broken. Oh, somewhy I got the impression that this was a Chromium thing, but it's a WebRTC-specific thing. Yes, unless the author can give a convincing argument why this code is unconditionally safe, we should fix it. One way to do so is to use real atomics, but as you said even those are tricky to use correctly. A real mutex would be much better, but the performance penalty would probably make this checker useless unless checking is disabled in release builds.
On 2016/09/20 18:26:18, kwiberg-webrtc wrote: > On 2016/09/20 17:53:31, the sun wrote: > > On 2016/09/20 12:50:56, kwiberg-webrtc wrote: > > > On 2016/09/20 12:34:46, the sun wrote: > > > > perkj@ made me aware of > > > > > https://cs.chromium.org/chromium/src/third_party/webrtc/base/race_checker.h > > > > > > Ew! That one uses volatile int instead of real atomics. I guess it makes > > sense, > > > since it can only ever misbehave if there was indeed a race, but still... > it's > > > only a few lines of code, and five minutes later I'm still not sure that it > > > can't misbehave in a way that wouldn't be OK. > > > > Yeah, I think you're right. In C++, volatile only marks memory as "may change > at > > any time", it does not say anything about memory ordering, like it does in > Java, > > and is therefore not so useful for multithreaded situations. But these things > > are *tricky*. > > Yes. It's difficult enough to get right if you use real atomic operations. > > > To me it looks like the "if (access_count_++ == 0)" line introduces a race. > > [...] If I'm right that'd be a race in the RaceChecker. > > Yes (trivially, because the code writes to memory without proper > synchronization). It's clear enough that if there is no race, the checker will > reliably fail to report a race while doing no damage to the system. It's also > obvious that if we experience a race, the checker may (1) report the race > without doing any damage, or (2) fail to report the race without doing any > damage. What's in question is if there's a third possibility in case we > experience a race: (3) the checker damages the system. > > If we just stick to the standard, then (3) may trivially occur, because a race > is undefined behavior, so the compiler is free to e.g. insert its own race > checker that starts downloading cat videos if it detects a race. It's very > likely that (3) cannot occur in practice, but as I said I'm not sure. > > > > > > > I'll update the CL and use that since it is almost exactly what I was > trying > > > to > > > > do here (% an int or two left in release builds). > > > > > > Yes, that implementation attempts to be usable both as CHECK and DCHECK, so > > the > > > struct can't be empty in release builds. I would be much more comfortable > with > > > something that was a no-op in release builds, and used proper > synchronization > > to > > > detect races when DCHECK_IS_ON. > > > > > > I won't object if you write a CL to copy it into WebRTC, though. > > > > It's already in webrtc/base/. I'll ask the author for an explanation and > suggest > > we use a crit sect if we figure out it is indeed broken. > > Oh, somewhy I got the impression that this was a Chromium thing, but it's a > WebRTC-specific thing. > > Yes, unless the author can give a convincing argument why this code is > unconditionally safe, we should fix it. One way to do so is to use real atomics, > but as you said even those are tricky to use correctly. A real mutex would be > much better, but the performance penalty would probably make this checker > useless unless checking is disabled in release builds. According to the author (pbos@ - https://codereview.webrtc.org/2097403002) the RaceChecker is intentionally racy for performance reasons. I've added a comment trying to explain why it will still reliably report the races it is there to find, but it does not address your (3) concern.
Description was changed from ========== Change thread check to non-re-entrancy check. Introduces rtc::NonReentrantChecker. BUG=webrtc:6345 ========== to ========== Change thread check to race check. Also, add comment to explain implementation of RaceChecker. BUG=webrtc:6345 ==========
solenberg@webrtc.org changed reviewers: + pbos@webrtc.org
https://codereview.webrtc.org/2350663002/diff/40001/webrtc/base/race_checker.cc File webrtc/base/race_checker.cc (right): https://codereview.webrtc.org/2350663002/diff/40001/webrtc/base/race_checker.... webrtc/base/race_checker.cc:20: // twice, causing the IsThreadRefEqual() check to return false at a later point. It may be incremented once instead of twice, possibly followed by both threads writing to accessing_thread_. A similar race may occur if two threads try to leave the critical section at the same time, or if one leaves at the same time as one enters. The only possible bad effect of a race should be that we fail to detect a race. However, this assumes that all reads of accessing_thread_ retrieve a value that was previously written. This is not guaranteed to be the case! E.g. if PlatformThreadRef is 64 bits and we have a 32-bit system, the write may happen in two parts. I'd be more comfortable if RaceChecker used atomics so that it itself was race free, or if it was only enabled in debug builds. I don't expect that problematic behavior is very likely, though, so fixing this isn't a high priority. https://codereview.webrtc.org/2350663002/diff/40001/webrtc/media/engine/webrt... File webrtc/media/engine/webrtcvoiceengine.cc (right): https://codereview.webrtc.org/2350663002/diff/40001/webrtc/media/engine/webrt... webrtc/media/engine/webrtcvoiceengine.cc:1320: rtc::RaceChecker audio_capture_race_checker_; Again, no annotations on the member variables prove that you're not accessing them without telling the checker. This was a pre-existing property of this code, but one which makes reviewing this CL harder.
https://codereview.webrtc.org/2350663002/diff/40001/webrtc/base/race_checker.cc File webrtc/base/race_checker.cc (right): https://codereview.webrtc.org/2350663002/diff/40001/webrtc/base/race_checker.... webrtc/base/race_checker.cc:20: // twice, causing the IsThreadRefEqual() check to return false at a later point. On 2016/09/20 19:29:07, kwiberg-webrtc wrote: > It may be incremented once instead of twice, possibly followed by both threads > writing to accessing_thread_. A similar race may occur if two threads try to > leave the critical section at the same time, or if one leaves at the same time > as one enters. The only possible bad effect of a race should be that we fail to > detect a race. > > However, this assumes that all reads of accessing_thread_ retrieve a value that > was previously written. This is not guaranteed to be the case! E.g. if > PlatformThreadRef is 64 bits and we have a 32-bit system, the write may happen > in two parts. > > I'd be more comfortable if RaceChecker used atomics so that it itself was race > free, or if it was only enabled in debug builds. I don't expect that problematic > behavior is very likely, though, so fixing this isn't a high priority. If two threads use it, and anyone at any point writes garbage into access_count_ (or generate -1 by double-decrementing or whatever) then the thread ref check will fail for either thread A or B in the future since they won't set accessing_thread_ even though they use the critical section. Atomics cause synchronization and could also potentially make a data race less likely to be found, plus it adds non-zero overhead, especially on ARM platforms etc. The idea was that this should be used in release as well, and the cost being that this is more probabilistic than deterministic in detection. Though I doubt it has a very high false negative rate, unless a compiler manages to completely miscompile it since it's not supposed to be racy.
ell-gee-tee-emm lgtm https://codereview.webrtc.org/2350663002/diff/40001/webrtc/media/engine/webrt... File webrtc/media/engine/webrtcvoiceengine.cc (right): https://codereview.webrtc.org/2350663002/diff/40001/webrtc/media/engine/webrt... webrtc/media/engine/webrtcvoiceengine.cc:1321: webrtc::AudioTransport* const voe_audio_transport_ = nullptr; GUARDED_BY(audio_capture_racer_checker_) if this is what it's intending to run synchronously.
https://codereview.webrtc.org/2350663002/diff/40001/webrtc/base/race_checker.cc File webrtc/base/race_checker.cc (right): https://codereview.webrtc.org/2350663002/diff/40001/webrtc/base/race_checker.... webrtc/base/race_checker.cc:20: // twice, causing the IsThreadRefEqual() check to return false at a later point. On 2016/09/20 19:36:02, pbos-webrtc wrote: > On 2016/09/20 19:29:07, kwiberg-webrtc wrote: > > It may be incremented once instead of twice, possibly followed by both threads > > writing to accessing_thread_. A similar race may occur if two threads try to > > leave the critical section at the same time, or if one leaves at the same time > > as one enters. The only possible bad effect of a race should be that we fail > to > > detect a race. > > > > However, this assumes that all reads of accessing_thread_ retrieve a value > that > > was previously written. This is not guaranteed to be the case! E.g. if > > PlatformThreadRef is 64 bits and we have a 32-bit system, the write may happen > > in two parts. > > > > I'd be more comfortable if RaceChecker used atomics so that it itself was race > > free, or if it was only enabled in debug builds. I don't expect that > problematic > > behavior is very likely, though, so fixing this isn't a high priority. > > If two threads use it, and anyone at any point writes garbage into access_count_ > (or generate -1 by double-decrementing or whatever) then the thread ref check > will fail for either thread A or B in the future since they won't set > accessing_thread_ even though they use the critical section. Yes, garbage in access_count_ shouldn't be a problem. At most, it'll cause us to fail to detect a race. It's garbage in accessing_thread_ that could be bad. On Windows, we end up comparing such values as plain integers, which is safe even if they contain garbage, but where we use pthreads we compare them with pthread_equal(). If that function e.g. interprets (parts of) its two pthread_t arguments as indexes and uses those indexes to look stuff up in memory, we have a problem. > Atomics cause synchronization and could also potentially make a data race less > likely to be found, plus it adds non-zero overhead, especially on ARM platforms > etc. Yes on both counts, although I wouldn't suspect that atomic instructions would cause data races to be all that much less likely to occur. Did you try it and find a nonnegligible effect? Implementing this with proper atomics would make it exactly equivalent to a mutex that explodes whenever there is contention (i.e. in exactly the circumstances where a real mutex would busy-wait or sleep). That seems like it ought to work just fine. > The idea was that this should be used in release as well, and the cost being > that this is more probabilistic than deterministic in detection. OK. That may serve as a convincing argument for why using a mutex or even atomics might be too expensive. > Though I doubt > it has a very high false negative rate, unless a compiler manages to completely > miscompile it since it's not supposed to be racy. The compiler doesn't need to miscompile this code---if you have a race, that's undefined behavior, so the compiler is fully within its rights to produce code that behaves very badly if a race occurs. This code relies on compiler implementations and hardware to be nice. As a result, you have to know a bunch of low-level details about the various processors, OSes, and compilers that we use in order to assess if it's safe.
https://codereview.webrtc.org/2350663002/diff/40001/webrtc/media/engine/webrt... File webrtc/media/engine/webrtcvoiceengine.cc (right): https://codereview.webrtc.org/2350663002/diff/40001/webrtc/media/engine/webrt... webrtc/media/engine/webrtcvoiceengine.cc:1321: webrtc::AudioTransport* const voe_audio_transport_ = nullptr; On 2016/09/20 19:38:03, pbos-webrtc wrote: > GUARDED_BY(audio_capture_racer_checker_) if this is what it's intending to run > synchronously. Yes. You really need to annotate all the members to be safe, though. Annotating just one is like locking just one of your doors when leaving home ("the others are already locked, I'm sure of it, so I don't need to double check").
On 2016/09/20 20:01:32, kwiberg-webrtc wrote: > https://codereview.webrtc.org/2350663002/diff/40001/webrtc/base/race_checker.cc > File webrtc/base/race_checker.cc (right): > > https://codereview.webrtc.org/2350663002/diff/40001/webrtc/base/race_checker.... > webrtc/base/race_checker.cc:20: // twice, causing the IsThreadRefEqual() check > to return false at a later point. > On 2016/09/20 19:36:02, pbos-webrtc wrote: > > On 2016/09/20 19:29:07, kwiberg-webrtc wrote: > > > It may be incremented once instead of twice, possibly followed by both > threads > > > writing to accessing_thread_. A similar race may occur if two threads try to > > > leave the critical section at the same time, or if one leaves at the same > time > > > as one enters. The only possible bad effect of a race should be that we fail > > to > > > detect a race. > > > > > > However, this assumes that all reads of accessing_thread_ retrieve a value > > that > > > was previously written. This is not guaranteed to be the case! E.g. if > > > PlatformThreadRef is 64 bits and we have a 32-bit system, the write may > happen > > > in two parts. > > > > > > I'd be more comfortable if RaceChecker used atomics so that it itself was > race > > > free, or if it was only enabled in debug builds. I don't expect that > > problematic > > > behavior is very likely, though, so fixing this isn't a high priority. > > > > If two threads use it, and anyone at any point writes garbage into > access_count_ > > (or generate -1 by double-decrementing or whatever) then the thread ref check > > will fail for either thread A or B in the future since they won't set > > accessing_thread_ even though they use the critical section. > > Yes, garbage in access_count_ shouldn't be a problem. At most, it'll cause us to > fail to detect a race. It's garbage in accessing_thread_ that could be bad. On > Windows, we end up comparing such values as plain integers, which is safe even > if they contain garbage, but where we use pthreads we compare them with > pthread_equal(). If that function e.g. interprets (parts of) its two pthread_t > arguments as indexes and uses those indexes to look stuff up in memory, we have > a problem. If this is a crash, we're fine. If it ends up reformatting the drive that's less fine. But remember that this code is *only* racy when the program using it is racy, and if the program running this is racy we have the same problems, just outside this class, right? > > Atomics cause synchronization and could also potentially make a data race less > > likely to be found, plus it adds non-zero overhead, especially on ARM > platforms > > etc. > > Yes on both counts, although I wouldn't suspect that atomic instructions would > cause data races to be all that much less likely to occur. Did you try it and > find a nonnegligible effect? Nah, just know that it might flush cache lines etc, which might align threads I guess. This is very hand-wavy, theoretical and mostly bogus. Possibly. > Implementing this with proper atomics would make it exactly equivalent to a > mutex that explodes whenever there is contention (i.e. in exactly the > circumstances where a real mutex would busy-wait or sleep). That seems like it > ought to work just fine. > > > The idea was that this should be used in release as well, and the cost being > > that this is more probabilistic than deterministic in detection. > > OK. That may serve as a convincing argument for why using a mutex or even > atomics might be too expensive. It was also shut down in the review landing this RaceChecker, I originally did it with atomics. :) > > Though I doubt > > it has a very high false negative rate, unless a compiler manages to > completely > > miscompile it since it's not supposed to be racy. > > The compiler doesn't need to miscompile this code---if you have a race, that's > undefined behavior, so the compiler is fully within its rights to produce code > that behaves very badly if a race occurs. This code relies on compiler > implementations and hardware to be nice. As a result, you have to know a bunch > of low-level details about the various processors, OSes, and compilers that we > use in order to assess if it's safe. These optimizations are what I mean as miscompiling. But they're only miscompilations if the underlying program is racy (regardless of the race checker), since it should protect thread-unsafe areas. In that hypothetical case we're not guaranteed to be able to detect races (since the whole class could be optimized away, theoretically).
On 2016/09/20 20:15:29, pbos-webrtc wrote: > On 2016/09/20 20:01:32, kwiberg-webrtc wrote: > > > https://codereview.webrtc.org/2350663002/diff/40001/webrtc/base/race_checker.cc > > File webrtc/base/race_checker.cc (right): > > > > > https://codereview.webrtc.org/2350663002/diff/40001/webrtc/base/race_checker.... > > webrtc/base/race_checker.cc:20: // twice, causing the IsThreadRefEqual() check > > to return false at a later point. > > On 2016/09/20 19:36:02, pbos-webrtc wrote: > > > On 2016/09/20 19:29:07, kwiberg-webrtc wrote: > > > > It may be incremented once instead of twice, possibly followed by both > > threads > > > > writing to accessing_thread_. A similar race may occur if two threads try > to > > > > leave the critical section at the same time, or if one leaves at the same > > time > > > > as one enters. The only possible bad effect of a race should be that we > fail > > > to > > > > detect a race. > > > > > > > > However, this assumes that all reads of accessing_thread_ retrieve a value > > > that > > > > was previously written. This is not guaranteed to be the case! E.g. if > > > > PlatformThreadRef is 64 bits and we have a 32-bit system, the write may > > happen > > > > in two parts. > > > > > > > > I'd be more comfortable if RaceChecker used atomics so that it itself was > > race > > > > free, or if it was only enabled in debug builds. I don't expect that > > > problematic > > > > behavior is very likely, though, so fixing this isn't a high priority. > > > > > > If two threads use it, and anyone at any point writes garbage into > > access_count_ > > > (or generate -1 by double-decrementing or whatever) then the thread ref > check > > > will fail for either thread A or B in the future since they won't set > > > accessing_thread_ even though they use the critical section. > > > > Yes, garbage in access_count_ shouldn't be a problem. At most, it'll cause us > to > > fail to detect a race. It's garbage in accessing_thread_ that could be bad. On > > Windows, we end up comparing such values as plain integers, which is safe even > > if they contain garbage, but where we use pthreads we compare them with > > pthread_equal(). If that function e.g. interprets (parts of) its two pthread_t > > arguments as indexes and uses those indexes to look stuff up in memory, we > have > > a problem. > > If this is a crash, we're fine. If it ends up reformatting the drive that's less > fine. But remember that this code is *only* racy when the program using it is > racy, and if the program running this is racy we have the same problems, just > outside this class, right? > > > > Atomics cause synchronization and could also potentially make a data race > less > > > likely to be found, plus it adds non-zero overhead, especially on ARM > > platforms > > > etc. > > > > Yes on both counts, although I wouldn't suspect that atomic instructions would > > cause data races to be all that much less likely to occur. Did you try it and > > find a nonnegligible effect? > > Nah, just know that it might flush cache lines etc, which might align threads I > guess. This is very hand-wavy, theoretical and mostly bogus. Possibly. > > > Implementing this with proper atomics would make it exactly equivalent to a > > mutex that explodes whenever there is contention (i.e. in exactly the > > circumstances where a real mutex would busy-wait or sleep). That seems like it > > ought to work just fine. > > > > > The idea was that this should be used in release as well, and the cost being > > > that this is more probabilistic than deterministic in detection. > > > > OK. That may serve as a convincing argument for why using a mutex or even > > atomics might be too expensive. > > It was also shut down in the review landing this RaceChecker, I originally did > it with atomics. :) > > > > Though I doubt > > > it has a very high false negative rate, unless a compiler manages to > > completely > > > miscompile it since it's not supposed to be racy. > > > > The compiler doesn't need to miscompile this code---if you have a race, that's > > undefined behavior, so the compiler is fully within its rights to produce code > > that behaves very badly if a race occurs. This code relies on compiler > > implementations and hardware to be nice. As a result, you have to know a bunch > > of low-level details about the various processors, OSes, and compilers that we > > use in order to assess if it's safe. > > These optimizations are what I mean as miscompiling. But they're only > miscompilations if the underlying program is racy (regardless of the race > checker), since it should protect thread-unsafe areas. In that hypothetical case > we're not guaranteed to be able to detect races (since the whole class could be > optimized away, theoretically). I'm mostly ok with the tradeoffs in the RaceChecker design. Finding races using run time primitives is by nature a probabilistic business, and the RaceChecker seems more likely to trip over and signal something, rather than hide it, whenever it is hit by its own raciness. So, yeah, it's a bit scary, but I believe it does the job. What do you guys think about the comment I added? Does it improve understanding to the extent that later users won't have to go through these similar discussions?
On 2016/09/20 20:15:29, pbos-webrtc wrote: > > Yes, garbage in access_count_ shouldn't be a problem. At most, > > it'll cause us to fail to detect a race. It's garbage in > > accessing_thread_ that could be bad. On Windows, we end up > > comparing such values as plain integers, which is safe even if > > they contain garbage, but where we use pthreads we compare them > > with pthread_equal(). If that function e.g. interprets (parts of) > > its two pthread_t arguments as indexes and uses those indexes to > > look stuff up in memory, we have a problem. > > If this is a crash, we're fine. If it ends up reformatting the drive > that's less fine. But remember that this code is *only* racy when > the program using it is racy, and if the program running this is > racy we have the same problems, just outside this class, right? Yes. But there's no guarantee that adding more racy code to an already racy situation won't make it worse. (Not saying that I think this implementation is *likely* to cause bad things to happen, just that I don't know it doesn't.) > > The compiler doesn't need to miscompile this code---if you have a > > race, that's undefined behavior, so the compiler is fully within > > its rights to produce code that behaves very badly if a race > > occurs. This code relies on compiler implementations and hardware > > to be nice. As a result, you have to know a bunch of low-level > > details about the various processors, OSes, and compilers that we > > use in order to assess if it's safe. > > These optimizations are what I mean as miscompiling. But they're > only miscompilations if the underlying program is racy (regardless > of the race checker), since it should protect thread-unsafe areas. > In that hypothetical case we're not guaranteed to be able to detect > races (since the whole class could be optimized away, > theoretically). Oh, OK. The meaning of "miscompilation" that I'm used to seeing is "the compiler produces code that violates the standard". The compiler is unlikely to optimize the whole class away. To do so, it would need to prove (at compile time) that a race is sure to occur. And maybe not even then---I don't think the code is allowed to do arbitrary stuff just because undefined behavior will occur in the future. Although technically, the compiler is allowed to produce code that will change the past if undefined behavior occurs... not sure if the standard mentions this. :-)
On 2016/09/20 20:33:02, kwiberg-webrtc wrote: > Yes. But there's no guarantee that adding more racy code to an already > racy situation won't make it worse. (Not saying that I think this > implementation is *likely* to cause bad things to happen, just that I > don't know it doesn't.) I think data races inside std::map would be way scarier, and places like that are the ones we're trying to protect from races. This code isn't racy if used properly, just like any other piece of single-threaded code (just that you're likelier to put this code near races). But I think the bikeshed is sufficiently red by now. :)
On 2016/09/20 20:37:33, pbos-webrtc wrote: > On 2016/09/20 20:33:02, kwiberg-webrtc wrote: > > Yes. But there's no guarantee that adding more racy code to an already > > racy situation won't make it worse. (Not saying that I think this > > implementation is *likely* to cause bad things to happen, just that I > > don't know it doesn't.) > > I think data races inside std::map would be way scarier, and places like that > are the ones we're trying to protect from races. This code isn't racy if used > properly, just like any other piece of single-threaded code (just that you're > likelier to put this code near races). > > But I think the bikeshed is sufficiently red by now. :) Sorry, I also think the comment is good. Thanks for adding it!
On 2016/09/20 20:25:00, the sun wrote: > On 2016/09/20 20:15:29, pbos-webrtc wrote: > > On 2016/09/20 20:01:32, kwiberg-webrtc wrote: > > > > > > https://codereview.webrtc.org/2350663002/diff/40001/webrtc/base/race_checker.cc > > > File webrtc/base/race_checker.cc (right): > > > > > > > > > https://codereview.webrtc.org/2350663002/diff/40001/webrtc/base/race_checker.... > > > webrtc/base/race_checker.cc:20: // twice, causing the IsThreadRefEqual() > check > > > to return false at a later point. > > > On 2016/09/20 19:36:02, pbos-webrtc wrote: > > > > On 2016/09/20 19:29:07, kwiberg-webrtc wrote: > > > > > It may be incremented once instead of twice, possibly followed by both > > > threads > > > > > writing to accessing_thread_. A similar race may occur if two threads > try > > to > > > > > leave the critical section at the same time, or if one leaves at the > same > > > time > > > > > as one enters. The only possible bad effect of a race should be that we > > fail > > > > to > > > > > detect a race. > > > > > > > > > > However, this assumes that all reads of accessing_thread_ retrieve a > value > > > > that > > > > > was previously written. This is not guaranteed to be the case! E.g. if > > > > > PlatformThreadRef is 64 bits and we have a 32-bit system, the write may > > > happen > > > > > in two parts. > > > > > > > > > > I'd be more comfortable if RaceChecker used atomics so that it itself > was > > > race > > > > > free, or if it was only enabled in debug builds. I don't expect that > > > > problematic > > > > > behavior is very likely, though, so fixing this isn't a high priority. > > > > > > > > If two threads use it, and anyone at any point writes garbage into > > > access_count_ > > > > (or generate -1 by double-decrementing or whatever) then the thread ref > > check > > > > will fail for either thread A or B in the future since they won't set > > > > accessing_thread_ even though they use the critical section. > > > > > > Yes, garbage in access_count_ shouldn't be a problem. At most, it'll cause > us > > to > > > fail to detect a race. It's garbage in accessing_thread_ that could be bad. > On > > > Windows, we end up comparing such values as plain integers, which is safe > even > > > if they contain garbage, but where we use pthreads we compare them with > > > pthread_equal(). If that function e.g. interprets (parts of) its two > pthread_t > > > arguments as indexes and uses those indexes to look stuff up in memory, we > > have > > > a problem. > > > > If this is a crash, we're fine. If it ends up reformatting the drive that's > less > > fine. But remember that this code is *only* racy when the program using it is > > racy, and if the program running this is racy we have the same problems, just > > outside this class, right? > > > > > > Atomics cause synchronization and could also potentially make a data race > > less > > > > likely to be found, plus it adds non-zero overhead, especially on ARM > > > platforms > > > > etc. > > > > > > Yes on both counts, although I wouldn't suspect that atomic instructions > would > > > cause data races to be all that much less likely to occur. Did you try it > and > > > find a nonnegligible effect? > > > > Nah, just know that it might flush cache lines etc, which might align threads > I > > guess. This is very hand-wavy, theoretical and mostly bogus. Possibly. > > > > > Implementing this with proper atomics would make it exactly equivalent to a > > > mutex that explodes whenever there is contention (i.e. in exactly the > > > circumstances where a real mutex would busy-wait or sleep). That seems like > it > > > ought to work just fine. > > > > > > > The idea was that this should be used in release as well, and the cost > being > > > > that this is more probabilistic than deterministic in detection. > > > > > > OK. That may serve as a convincing argument for why using a mutex or even > > > atomics might be too expensive. > > > > It was also shut down in the review landing this RaceChecker, I originally did > > it with atomics. :) > > > > > > Though I doubt > > > > it has a very high false negative rate, unless a compiler manages to > > > completely > > > > miscompile it since it's not supposed to be racy. > > > > > > The compiler doesn't need to miscompile this code---if you have a race, > that's > > > undefined behavior, so the compiler is fully within its rights to produce > code > > > that behaves very badly if a race occurs. This code relies on compiler > > > implementations and hardware to be nice. As a result, you have to know a > bunch > > > of low-level details about the various processors, OSes, and compilers that > we > > > use in order to assess if it's safe. > > > > These optimizations are what I mean as miscompiling. But they're only > > miscompilations if the underlying program is racy (regardless of the race > > checker), since it should protect thread-unsafe areas. In that hypothetical > case > > we're not guaranteed to be able to detect races (since the whole class could > be > > optimized away, theoretically). > > I'm mostly ok with the tradeoffs in the RaceChecker design. Finding races using > run time primitives is by nature a probabilistic business, and the RaceChecker > seems more likely to trip over and signal something, rather than hide it, > whenever it is hit by its own raciness. So, yeah, it's a bit scary, but I > believe it does the job. Me too. > What do you guys think about the comment I added? Does it improve understanding > to the extent that later users won't have to go through these similar > discussions? It's a bit too specific. Maybe something like // Note that the implementation here is in itself racy, but we pretend it does // not matter because we don't want to pay the cost of using atomics. A race // may cause the checker to scream bloody murder in an unexpected way, or (more // unlikely) cause it to fail to detect a race. Technically, it is possible // that a race would e.g. cause the checker to crash or even worse if there is // a race, but this doesn't seem likely to happen in practice.
On 2016/09/20 20:37:33, pbos-webrtc wrote: > But I think the bikeshed is sufficiently red by now. :) What!? It's supposed to be a *green* bikeshed...
Karl, Tommi, I've updated the comment.
lgtm, with a suggestion https://codereview.webrtc.org/2350663002/diff/60001/webrtc/base/race_checker.cc File webrtc/base/race_checker.cc (right): https://codereview.webrtc.org/2350663002/diff/60001/webrtc/base/race_checker.... webrtc/base/race_checker.cc:22: // spot where a race *first* appeared in the code we're trying to protect. Maybe not that it's known that a race may also cause us to fail to detect the race, or (very unlikely!) eat your dog, but that this is a deliberately chosen trade-off. Without such a note, we'd still have spent all this time digging through the code.
lgtm
The CQ bit was checked by solenberg@webrtc.org
The patchset sent to the CQ was uploaded after l-g-t-m from pbos@webrtc.org, tommi@webrtc.org, kwiberg@webrtc.org Link to the patchset: https://codereview.webrtc.org/2350663002/#ps80001 (title: "comment")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.webrtc.org/...
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: ios_arm64_rel on master.tryserver.webrtc (JOB_FAILED, http://build.chromium.org/p/tryserver.webrtc/builders/ios_arm64_rel/builds/13134)
The CQ bit was checked by solenberg@webrtc.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.webrtc.org/...
Message was sent while issue was closed.
Description was changed from ========== Change thread check to race check. Also, add comment to explain implementation of RaceChecker. BUG=webrtc:6345 ========== to ========== Change thread check to race check. Also, add comment to explain implementation of RaceChecker. BUG=webrtc:6345 ==========
Message was sent while issue was closed.
Committed patchset #5 (id:80001)
Message was sent while issue was closed.
Description was changed from ========== Change thread check to race check. Also, add comment to explain implementation of RaceChecker. BUG=webrtc:6345 ========== to ========== Change thread check to race check. Also, add comment to explain implementation of RaceChecker. BUG=webrtc:6345 Committed: https://crrev.com/347ec5c72e9fc1698520b650f0515205999e39d0 Cr-Commit-Position: refs/heads/master@{#14369} ==========
Message was sent while issue was closed.
Patchset 5 (id:??) landed as https://crrev.com/347ec5c72e9fc1698520b650f0515205999e39d0 Cr-Commit-Position: refs/heads/master@{#14369} |