|
|
Created:
3 years, 8 months ago by peah-webrtc Modified:
3 years, 8 months ago CC:
webrtc-reviews_webrtc.org, AleBzk, Andrew MacDonald, aleloi, tterriberry_mozilla.com, audio-team_agora.io, hlundin-webrtc, kwiberg-webrtc, minyue-webrtc, the sun, aluebs-webrtc, bjornv1 Target Ref:
refs/heads/master Project:
webrtc Visibility:
Public. |
DescriptionAdding new functionality for SIMD optimizations in AEC3
Most of the complex functionality in AEC3 is done using
vector maths. This CL adds a new functionality for
performing these using SIMD operations in a simple manner
whenever such are available.
The reason for putting the implementations in the header file
is to allow any possible inlining.
BUG=webrtc:6018
Review-Url: https://codereview.webrtc.org/2813823002
Cr-Commit-Position: refs/heads/master@{#17663}
Committed: https://chromium.googlesource.com/external/webrtc/+/5e79b293137f9022322331c9644743203d246ba3
Patch Set 1 #
Total comments: 17
Patch Set 2 : Changes in response to reviewer comments #Patch Set 3 : Fixed build error on windows #
Messages
Total messages: 25 (17 generated)
Patchset #1 (id:1) has been deleted
Description was changed from ========== MoreSSe2Optimizations MoreSSe2Optimizations BUG= ========== to ========== Adding new functionality for SIMD optimizations in AEC3 Most of the complex functionality in AEC3 is done using vector maths. This CL adds a new functionality for performing these using SIMD operations in a simple manner whenever such are available. BUG=webrtc:6018 ==========
Patchset #1 (id:20001) has been deleted
Description was changed from ========== Adding new functionality for SIMD optimizations in AEC3 Most of the complex functionality in AEC3 is done using vector maths. This CL adds a new functionality for performing these using SIMD operations in a simple manner whenever such are available. BUG=webrtc:6018 ========== to ========== Adding new functionality for SIMD optimizations in AEC3 Most of the complex functionality in AEC3 is done using vector maths. This CL adds a new functionality for performing these using SIMD operations in a simple manner whenever such are available. The reason for putting the implementations in the header file is to allow any possible inlining. BUG=webrtc:6018 ==========
peah@webrtc.org changed reviewers: + aleloi@webrtc.org, ivoc@webrtc.org
Hi, This is a CL that replaces a some of the computations in AEC3 using SIMD computations whenever that is available. This CL applies these changes to suppression_gain.cc as an example of how this can be done. Depending on the outcome of the review, however, the intention is that it should be applied to the rest of the AEC3 code. It would be great if you could take a look!
See some minor remarks below. Also, I didn't test this, but I think it might be good for performance to split the loop into 3 parts: 1. Process start of the array which is potentially unaligned, up to the first alignment boundary. 2. Process the main chunk of aligned memory (using _mm_load_ps instead of _mm_loadu_ps) 3. Process the remainder of unaligned memory. Also, I think these math functions have a pretty low number of computations per memory access, which means that the SSE speedup could be lower than expected (since memory is likely to be the bottleneck). https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... File webrtc/modules/audio_processing/aec3/suppression_gain.cc (right): https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... webrtc/modules/audio_processing/aec3/suppression_gain.cc:52: // TODO(peah): Add further optimizations, in particular for the divisions. Is this still relevant? https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... File webrtc/modules/audio_processing/aec3/vector_math.h (right): https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... webrtc/modules/audio_processing/aec3/vector_math.h:45: for (size_t k = 0; k < kVectorLimit; ++k, j += 4) { I think the 2 different loop variables are a bit confusing. An alternative is to do it like this: int j = 0; for (; j<kVectorLimit*4; j+=4) { ... } for (; j<x.size(); j++) { ... } https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... webrtc/modules/audio_processing/aec3/vector_math.h:74: for (size_t k = 0; k < kVectorLimit; ++k, j += 4) { Same here. https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... webrtc/modules/audio_processing/aec3/vector_math.h:102: for (size_t k = 0; k < kVectorLimit; ++k, j += 4) { And here. https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... File webrtc/modules/audio_processing/aec3/vector_math_unittest.cc (right): https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... webrtc/modules/audio_processing/aec3/vector_math_unittest.cc:28: x[k] = 2.f / 3.f * k; I would add some braces to make the order of operations more clear (although I think the code is correct). https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... webrtc/modules/audio_processing/aec3/vector_math_unittest.cc:35: EXPECT_EQ(z, z_sse2); Does this check the contents of the arrays as well? https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... webrtc/modules/audio_processing/aec3/vector_math_unittest.cc:51: y[k] = 2.f / 3.f * k; Braces would be nice.
Thanks for the comments! I've uploaded a new patch. https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... File webrtc/modules/audio_processing/aec3/suppression_gain.cc (right): https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... webrtc/modules/audio_processing/aec3/suppression_gain.cc:52: // TODO(peah): Add further optimizations, in particular for the divisions. On 2017/04/11 12:03:01, ivoc wrote: > Is this still relevant? Yes, I did not add the optimizations for the divisions on line 71. Furthermore, we probably want to do some more optimized methods for the rest of the vector operations. https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... File webrtc/modules/audio_processing/aec3/vector_math.h (right): https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... webrtc/modules/audio_processing/aec3/vector_math.h:45: for (size_t k = 0; k < kVectorLimit; ++k, j += 4) { On 2017/04/11 12:03:01, ivoc wrote: > I think the 2 different loop variables are a bit confusing. An alternative is to > do it like this: > > int j = 0; > for (; j<kVectorLimit*4; j+=4) { > ... > } > for (; j<x.size(); j++) { > ... > } Awesome! Thanks! Done. https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... webrtc/modules/audio_processing/aec3/vector_math.h:74: for (size_t k = 0; k < kVectorLimit; ++k, j += 4) { On 2017/04/11 12:03:01, ivoc wrote: > Same here. Done. https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... webrtc/modules/audio_processing/aec3/vector_math.h:102: for (size_t k = 0; k < kVectorLimit; ++k, j += 4) { On 2017/04/11 12:03:01, ivoc wrote: > And here. Done. https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... File webrtc/modules/audio_processing/aec3/vector_math_unittest.cc (right): https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... webrtc/modules/audio_processing/aec3/vector_math_unittest.cc:28: x[k] = 2.f / 3.f * k; On 2017/04/11 12:03:01, ivoc wrote: > I would add some braces to make the order of operations more clear (although I > think the code is correct). Done. https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... webrtc/modules/audio_processing/aec3/vector_math_unittest.cc:35: EXPECT_EQ(z, z_sse2); On 2017/04/11 12:03:01, ivoc wrote: > Does this check the contents of the arrays as well? No. That was not the intention, but since this is so easy to do, I'll add that. Good point. Done. https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... webrtc/modules/audio_processing/aec3/vector_math_unittest.cc:51: y[k] = 2.f / 3.f * k; On 2017/04/11 12:03:01, ivoc wrote: > Braces would be nice. Done.
lgtm! Nice that the complexity could be moved from suppression_gain.cc. https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... File webrtc/modules/audio_processing/aec3/vector_math.h (right): https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... webrtc/modules/audio_processing/aec3/vector_math.h:14: #include "webrtc/typedefs.h" Project headers should be after system headers. https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... webrtc/modules/audio_processing/aec3/vector_math.h:42: const size_t kRemaining = x.size() - kVectorLimit * 4; I think automatic variables are usually named with lower_case in webrtc. The style guide is ok with kConstantNaming though. https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... webrtc/modules/audio_processing/aec3/vector_math.h:88: std::multiplies<float>()); Are we sure std::transform doesn't do this kind of optimization already? I don't know what flags to pass to the compiler to enable SSE2 in godbolt.
On 2017/04/11 12:03:02, ivoc wrote: > See some minor remarks below. Also, I didn't test this, but I think it might be > good for performance to split the loop into 3 parts: > > 1. Process start of the array which is potentially unaligned, up to the first > alignment boundary. > 2. Process the main chunk of aligned memory (using _mm_load_ps instead of > _mm_loadu_ps) > 3. Process the remainder of unaligned memory. > > Also, I think these math functions have a pretty low number of computations per > memory access, which means that the SSE speedup could be lower than expected > (since memory is likely to be the bottleneck). > > https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... > File webrtc/modules/audio_processing/aec3/suppression_gain.cc (right): > > https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... > webrtc/modules/audio_processing/aec3/suppression_gain.cc:52: // TODO(peah): Add > further optimizations, in particular for the divisions. > Is this still relevant? > > https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... > File webrtc/modules/audio_processing/aec3/vector_math.h (right): > > https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... > webrtc/modules/audio_processing/aec3/vector_math.h:45: for (size_t k = 0; k < > kVectorLimit; ++k, j += 4) { > I think the 2 different loop variables are a bit confusing. An alternative is to > do it like this: > > int j = 0; > for (; j<kVectorLimit*4; j+=4) { > ... > } > for (; j<x.size(); j++) { > ... > } > > https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... > webrtc/modules/audio_processing/aec3/vector_math.h:74: for (size_t k = 0; k < > kVectorLimit; ++k, j += 4) { > Same here. > > https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... > webrtc/modules/audio_processing/aec3/vector_math.h:102: for (size_t k = 0; k < > kVectorLimit; ++k, j += 4) { > And here. > > https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... > File webrtc/modules/audio_processing/aec3/vector_math_unittest.cc (right): > > https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... > webrtc/modules/audio_processing/aec3/vector_math_unittest.cc:28: x[k] = 2.f / > 3.f * k; > I would add some braces to make the order of operations more clear (although I > think the code is correct). > > https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... > webrtc/modules/audio_processing/aec3/vector_math_unittest.cc:35: EXPECT_EQ(z, > z_sse2); > Does this check the contents of the arrays as well? > > https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... > webrtc/modules/audio_processing/aec3/vector_math_unittest.cc:51: y[k] = 2.f / > 3.f * k; > Braces would be nice. Sorry, I did not see the initial remarks: Yes, the code does not do any assumptions on alignment at all (uses loadu_ instead of load_). The reason for this is that a) on newer platforms, the alignment does not seem to matter (my benchmarks gave no difference). I think it matters on older platforms though, but this actually matches how this is done in AEC2. Therefore, since it was not straightforward to align the signals I did not do that and until the alignment is done I'd prefer to use 2 loops. Furthermore, I'm not sure how to really verify that the alignment is properly done if I cannot show any difference in the benchmarks. On the platforms where I tested, load_ gave no difference compared to load_ps. That is along what I read on the net would be the case. WDYT?
The CQ bit was checked by peah@webrtc.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.webrtc.org/...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: win_x64_clang_rel on master.tryserver.webrtc (JOB_FAILED, http://build.chromium.org/p/tryserver.webrtc/builders/win_x64_clang_rel/build...)
The CQ bit was checked by peah@webrtc.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.webrtc.org/...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
On 2017/04/11 15:12:02, peah-webrtc wrote: > On 2017/04/11 12:03:02, ivoc wrote: > > See some minor remarks below. Also, I didn't test this, but I think it might > be > > good for performance to split the loop into 3 parts: > > > > 1. Process start of the array which is potentially unaligned, up to the first > > alignment boundary. > > 2. Process the main chunk of aligned memory (using _mm_load_ps instead of > > _mm_loadu_ps) > > 3. Process the remainder of unaligned memory. > > > > Also, I think these math functions have a pretty low number of computations > per > > memory access, which means that the SSE speedup could be lower than expected > > (since memory is likely to be the bottleneck). > > > > > https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... > > File webrtc/modules/audio_processing/aec3/suppression_gain.cc (right): > > > > > https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... > > webrtc/modules/audio_processing/aec3/suppression_gain.cc:52: // TODO(peah): > Add > > further optimizations, in particular for the divisions. > > Is this still relevant? > > > > > https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... > > File webrtc/modules/audio_processing/aec3/vector_math.h (right): > > > > > https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... > > webrtc/modules/audio_processing/aec3/vector_math.h:45: for (size_t k = 0; k < > > kVectorLimit; ++k, j += 4) { > > I think the 2 different loop variables are a bit confusing. An alternative is > to > > do it like this: > > > > int j = 0; > > for (; j<kVectorLimit*4; j+=4) { > > ... > > } > > for (; j<x.size(); j++) { > > ... > > } > > > > > https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... > > webrtc/modules/audio_processing/aec3/vector_math.h:74: for (size_t k = 0; k < > > kVectorLimit; ++k, j += 4) { > > Same here. > > > > > https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... > > webrtc/modules/audio_processing/aec3/vector_math.h:102: for (size_t k = 0; k < > > kVectorLimit; ++k, j += 4) { > > And here. > > > > > https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... > > File webrtc/modules/audio_processing/aec3/vector_math_unittest.cc (right): > > > > > https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... > > webrtc/modules/audio_processing/aec3/vector_math_unittest.cc:28: x[k] = 2.f / > > 3.f * k; > > I would add some braces to make the order of operations more clear (although I > > think the code is correct). > > > > > https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... > > webrtc/modules/audio_processing/aec3/vector_math_unittest.cc:35: EXPECT_EQ(z, > > z_sse2); > > Does this check the contents of the arrays as well? > > > > > https://codereview.webrtc.org/2813823002/diff/40001/webrtc/modules/audio_proc... > > webrtc/modules/audio_processing/aec3/vector_math_unittest.cc:51: y[k] = 2.f / > > 3.f * k; > > Braces would be nice. > > Sorry, I did not see the initial remarks: > Yes, the code does not do any assumptions on alignment at all (uses loadu_ > instead of load_). The reason for this is that a) on newer platforms, the > alignment does not seem to matter (my benchmarks gave no difference). I think it > matters on older platforms though, but this actually matches how this is done in > AEC2. Therefore, since it was not straightforward to align the signals I did not > do that and until the alignment is done I'd prefer to use 2 loops. Furthermore, > I'm not sure how to really verify that the alignment is properly done if I > cannot show any difference in the benchmarks. > > On the platforms where I tested, load_ gave no difference compared to load_ps. > That is along what I read on the net would be the case. > > WDYT? Ok, let's keep it like this for now. LGTM.
The CQ bit was checked by peah@webrtc.org
The patchset sent to the CQ was uploaded after l-g-t-m from aleloi@webrtc.org Link to the patchset: https://codereview.webrtc.org/2813823002/#ps80001 (title: "Fixed build error on windows")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.webrtc.org/...
CQ is committing da patch. Bot data: {"patchset_id": 80001, "attempt_start_ts": 1491985106317250, "parent_rev": "9f28b1d354a2f457032fec5f1a0b3aeac31f7275", "commit_rev": "5e79b293137f9022322331c9644743203d246ba3"}
Message was sent while issue was closed.
Description was changed from ========== Adding new functionality for SIMD optimizations in AEC3 Most of the complex functionality in AEC3 is done using vector maths. This CL adds a new functionality for performing these using SIMD operations in a simple manner whenever such are available. The reason for putting the implementations in the header file is to allow any possible inlining. BUG=webrtc:6018 ========== to ========== Adding new functionality for SIMD optimizations in AEC3 Most of the complex functionality in AEC3 is done using vector maths. This CL adds a new functionality for performing these using SIMD operations in a simple manner whenever such are available. The reason for putting the implementations in the header file is to allow any possible inlining. BUG=webrtc:6018 Review-Url: https://codereview.webrtc.org/2813823002 Cr-Commit-Position: refs/heads/master@{#17663} Committed: https://chromium.googlesource.com/external/webrtc/+/5e79b293137f9022322331c96... ==========
Message was sent while issue was closed.
Committed patchset #3 (id:80001) as https://chromium.googlesource.com/external/webrtc/+/5e79b293137f9022322331c96... |