webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.h - Issue 1693823004: Use VAD to get a better speech power estimation in the IntelligibilityEnhancer

Side by Side Diff: webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.h

Issue 1693823004: Use VAD to get a better speech power estimation in the IntelligibilityEnhancer (Closed) Base URL: https://chromium.googlesource.com/external/webrtc.git@pow

Patch Set: Created 4 years, 10 months ago


OLD	NEW
1 /*	1 /*

2 * Copyright (c) 2014 The WebRTC project authors. All Rights Reserved.	2 * Copyright (c) 2014 The WebRTC project authors. All Rights Reserved.

3 *	3 *

4 * Use of this source code is governed by a BSD-style license	4 * Use of this source code is governed by a BSD-style license

5 * that can be found in the LICENSE file in the root of the source	5 * that can be found in the LICENSE file in the root of the source

6 * tree. An additional intellectual property rights grant can be found	6 * tree. An additional intellectual property rights grant can be found

7 * in the file PATENTS. All contributing project authors may	7 * in the file PATENTS. All contributing project authors may

8 * be found in the AUTHORS file in the root of the source tree.	8 * be found in the AUTHORS file in the root of the source tree.

9 */	9 */

10	10

11 #ifndef WEBRTC_MODULES_AUDIO_PROCESSING_INTELLIGIBILITY_INTELLIGIBILITY_ENHANCER_H_	11 #ifndef WEBRTC_MODULES_AUDIO_PROCESSING_INTELLIGIBILITY_INTELLIGIBILITY_ENHANCER_H_

12 #define WEBRTC_MODULES_AUDIO_PROCESSING_INTELLIGIBILITY_INTELLIGIBILITY_ENHANCER_H_	12 #define WEBRTC_MODULES_AUDIO_PROCESSING_INTELLIGIBILITY_INTELLIGIBILITY_ENHANCER_H_

13	13

14 #include <complex>	14 #include <complex>

15 #include <vector>	15 #include <vector>

16	16

17 #include "webrtc/base/scoped_ptr.h"	17 #include "webrtc/base/scoped_ptr.h"

18 #include "webrtc/common_audio/lapped_transform.h"	18 #include "webrtc/common_audio/lapped_transform.h"

19 #include "webrtc/common_audio/channel_buffer.h"	19 #include "webrtc/common_audio/channel_buffer.h"

20 #include "webrtc/modules/audio_processing/intelligibility/intelligibility_utils.h"	20 #include "webrtc/modules/audio_processing/intelligibility/intelligibility_utils.h"

	21 #include "webrtc/modules/audio_processing/vad/voice_activity_detector.h"

21	22

22 namespace webrtc {	23 namespace webrtc {

23	24

24 // Speech intelligibility enhancement module. Reads render and capture	25 // Speech intelligibility enhancement module. Reads render and capture

25 // audio streams and modifies the render stream with a set of gains per	26 // audio streams and modifies the render stream with a set of gains per

26 // frequency bin to enhance speech against the noise background.	27 // frequency bin to enhance speech against the noise background.

27 // Details of the model and algorithm can be found in the original paper:	28 // Details of the model and algorithm can be found in the original paper:

28 // http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6882788	29 // http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6882788

29 class IntelligibilityEnhancer {	30 class IntelligibilityEnhancer {

30 public:	31 public:

31 struct Config {	32 IntelligibilityEnhancer(int sample_rate_hz,

32 // TODO(bercic): the |decay_rate|, |analysis_rate| and |gain_limit|	33 size_t num_render_channels);

33 // parameters should probably go away once fine tuning is done.

34 Config()

35 : sample_rate_hz(16000),

36 num_capture_channels(1),

37 num_render_channels(1),

38 decay_rate(0.9f),

39 analysis_rate(60),

40 gain_change_limit(0.1f),

41 rho(0.02f) {}

42 int sample_rate_hz;

43 size_t num_capture_channels;

44 size_t num_render_channels;

45 float decay_rate;

46 int analysis_rate;

47 float gain_change_limit;

48 float rho;

49 };

50

51 explicit IntelligibilityEnhancer(const Config& config);
turaj 2016/02/13 00:09:43
A ctor with a config struct might come in handy for tuning, but it is totally up to you.

aluebs-webrtc 2016/02/19 03:56:31
I dropped num_capture_channels and analysis_rate and tuned decay_rate in this CL. That only leaves gain_change_limit and rho to be tuned, because sample_rate_hz and num_render_channels are settable parameters. I think in this case the constructor adds more complexity than it saves in tuning effort. But if you have a strong opinion I am happy to add it back.

turaj 2016/02/19 16:48:47
My point was that to tune rho or gain_change_limit (maybe the decay_factor for noise should differ from the one for clear speech) one needs to recompile, but if that is fine with you I have no complaint. I agree that the final code should look like what you have here.

aluebs-webrtc 2016/02/19 19:30:48
I will leave this as is and will add a more flexible constructor locally when tuning the remaining parameters.
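The trade-off discussed in this thread can be illustrated with a minimal sketch. It is not the code in this CL: `TuningParams` and `EnhancerSketch` are hypothetical names, and the defaults for `gain_change_limit` and `rho` are copied from the removed `Config`. It keeps the two-argument constructor from the NEW side and adds an overload that lets a caller change the remaining tunables without recompiling the class.

```cpp
#include <cstddef>

// Hypothetical tuning parameters; defaults taken from the removed Config.
struct TuningParams {
  float gain_change_limit = 0.1f;
  float rho = 0.02f;
};

// Hypothetical stand-in for the enhancer, showing only the constructors.
class EnhancerSketch {
 public:
  // Mirrors the two-argument constructor on the NEW side of the diff.
  EnhancerSketch(int sample_rate_hz, std::size_t num_render_channels)
      : EnhancerSketch(sample_rate_hz, num_render_channels, TuningParams()) {}

  // Extra overload for tuning; forwards the chosen values to the members.
  EnhancerSketch(int sample_rate_hz,
                 std::size_t num_render_channels,
                 const TuningParams& params)
      : sample_rate_hz_(sample_rate_hz),
        num_render_channels_(num_render_channels),
        gain_change_limit_(params.gain_change_limit),
        rho_(params.rho) {}

  int sample_rate_hz() const { return sample_rate_hz_; }
  std::size_t num_render_channels() const { return num_render_channels_; }
  float gain_change_limit() const { return gain_change_limit_; }
  float rho() const { return rho_; }

 private:
  const int sample_rate_hz_;
  const std::size_t num_render_channels_;
  const float gain_change_limit_;
  const float rho_;
};
```

With this shape, production callers keep using `EnhancerSketch(16000, 1)`, while a tuning harness can pass a modified `TuningParams` read from a flag or file.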
52 IntelligibilityEnhancer(); // Initialize with default config.

53	34

54 // Sets the capture noise magnitude spectrum estimate.	35 // Sets the capture noise magnitude spectrum estimate.

55 void SetCaptureNoiseEstimate(std::vector<float> noise);	36 void SetCaptureNoiseEstimate(std::vector<float> noise);

56	37

57 // Reads chunk of speech in time domain and updates with modified signal.	38 // Reads chunk of speech in time domain and updates with modified signal.

58 void ProcessRenderAudio(float* const* audio,	39 void ProcessRenderAudio(float* const* audio,

59 int sample_rate_hz,	40 int sample_rate_hz,

60 size_t num_channels);	41 size_t num_channels);

61 bool active() const;	42 bool active() const;

62	43

(...skipping 16 matching lines...)
79 };	60 };

80 friend class TransformCallback;	61 friend class TransformCallback;

81 FRIEND_TEST_ALL_PREFIXES(IntelligibilityEnhancerTest, TestErbCreation);	62 FRIEND_TEST_ALL_PREFIXES(IntelligibilityEnhancerTest, TestErbCreation);

82 FRIEND_TEST_ALL_PREFIXES(IntelligibilityEnhancerTest, TestSolveForGains);	63 FRIEND_TEST_ALL_PREFIXES(IntelligibilityEnhancerTest, TestSolveForGains);

83	64

84 // Updates power computation and analysis with |in_block_|,	65 // Updates power computation and analysis with |in_block_|,

85 // and writes modified speech to |out_block|.	66 // and writes modified speech to |out_block|.

86 void ProcessClearBlock(const std::complex<float>* in_block,	67 void ProcessClearBlock(const std::complex<float>* in_block,

87 std::complex<float>* out_block);	68 std::complex<float>* out_block);

88	69

89 // Computes and sets modified gains.

90 void AnalyzeClearBlock();

91

92 // Bisection search for optimal |lambda|.	70 // Bisection search for optimal |lambda|.

93 void SolveForLambda(float power_target, float power_bot, float power_top);	71 void SolveForLambda(float power_target, float power_bot, float power_top);

94	72

95 // Transforms freq gains to ERB gains.	73 // Transforms freq gains to ERB gains.

96 void UpdateErbGains();	74 void UpdateErbGains();

97	75

98 // Returns number of ERB filters.	76 // Returns number of ERB filters.

99 static size_t GetBankSize(int sample_rate, size_t erb_resolution);	77 static size_t GetBankSize(int sample_rate, size_t erb_resolution);

100	78

101 // Initializes ERB filterbank.	79 // Initializes ERB filterbank.

102 std::vector<std::vector<float>> CreateErbBank(size_t num_freqs);	80 std::vector<std::vector<float>> CreateErbBank(size_t num_freqs);

103	81

104 // Analytically solves quadratic for optimal gains given |lambda|.	82 // Analytically solves quadratic for optimal gains given |lambda|.

105 // Negative gains are set to 0. Stores the results in |sols|.	83 // Negative gains are set to 0. Stores the results in |sols|.

106 void SolveForGainsGivenLambda(float lambda, size_t start_freq, float* sols);	84 void SolveForGainsGivenLambda(float lambda, size_t start_freq, float* sols);

107	85

	86 // Returns true if the audio is speech.

	87 bool IsSpeech(const float* audio);

	88

108 const size_t freqs_; // Num frequencies in frequency domain.	89 const size_t freqs_; // Num frequencies in frequency domain.

109 const size_t window_size_; // Window size in samples; also the block size.

110 const size_t chunk_length_; // Chunk size in samples.	90 const size_t chunk_length_; // Chunk size in samples.

111 const size_t bank_size_; // Num ERB filters.	91 const size_t bank_size_; // Num ERB filters.

112 const int sample_rate_hz_;	92 const int sample_rate_hz_;

113 const int erb_resolution_;

114 const size_t num_capture_channels_;

115 const size_t num_render_channels_;	93 const size_t num_render_channels_;

116 const int analysis_rate_; // Num blocks before gains recalculated.

117	94

118 const bool active_; // Whether render gains are being updated.	95 PowerEstimator clear_power_estimator_;

119 // TODO(ekm): Add logic for updating \|active_\|.	96 rtc::scoped_ptr<PowerEstimator> noise_power_estimator_;

120

121 PowerEstimator clear_power_;

122 std::vector<float> noise_power_;

123 rtc::scoped_ptr<float[]> filtered_clear_pow_;	97 rtc::scoped_ptr<float[]> filtered_clear_pow_;

124 rtc::scoped_ptr<float[]> filtered_noise_pow_;	98 rtc::scoped_ptr<float[]> filtered_noise_pow_;

125 rtc::scoped_ptr<float[]> center_freqs_;	99 rtc::scoped_ptr<float[]> center_freqs_;

126 std::vector<std::vector<float>> capture_filter_bank_;	100 std::vector<std::vector<float>> capture_filter_bank_;

127 std::vector<std::vector<float>> render_filter_bank_;	101 std::vector<std::vector<float>> render_filter_bank_;

128 size_t start_freq_;	102 size_t start_freq_;

129 rtc::scoped_ptr<float[]> rho_; // Production and interpretation SNR.	103 rtc::scoped_ptr<float[]> rho_; // Production and interpretation SNR.

130 // for each ERB band.	104 // for each ERB band.

131 rtc::scoped_ptr<float[]> gains_eq_; // Pre-filter modified gains.	105 rtc::scoped_ptr<float[]> gains_eq_; // Pre-filter modified gains.

132 GainApplier gain_applier_;	106 GainApplier gain_applier_;

133	107

134 // Destination buffers used to reassemble blocked chunks before overwriting	108 // Destination buffers used to reassemble blocked chunks before overwriting

135 // the original input array with modifications.	109 // the original input array with modifications.

136 ChannelBuffer<float> temp_render_out_buffer_;	110 ChannelBuffer<float> temp_render_out_buffer_;

137	111

138 rtc::scoped_ptr<float[]> kbd_window_;

139 TransformCallback render_callback_;	112 TransformCallback render_callback_;

140 rtc::scoped_ptr<LappedTransform> render_mangler_;	113 rtc::scoped_ptr<LappedTransform> render_mangler_;

141 int block_count_;	114

142 int analysis_step_;	115 VoiceActivityDetector vad_;

	116 std::vector<int16_t> audio_s16_;
hlundin-webrtc 2016/02/15 13:05:11
What does "s" mean in audio_s16_?

aluebs-webrtc 2016/02/19 03:56:31
Not sure, but I am following the same naming as in audio_util. I am happy to rename it if you suggest a better name for it :)

turaj 2016/02/19 16:48:47
Perhaps it means signed; sox uses a similar convention.

hlundin-webrtc 2016/02/22 11:03:13
Acknowledged.
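For context on why an `int16_t` buffer sits next to the float pipeline, here is a minimal sketch of a float to signed 16-bit conversion. It is an illustration, not the helper from audio_util: `FloatS16ToS16Sketch` and `ToS16Sketch` are hypothetical names, and the code assumes the float samples are already scaled to the int16 range rather than to [-1, 1].

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: rounds a float sample (assumed to be scaled to the
// int16 range) to the nearest integer and saturates at the int16 limits.
inline int16_t FloatS16ToS16Sketch(float v) {
  if (v >= 32767.f) return 32767;
  if (v <= -32768.f) return -32768;
  return static_cast<int16_t>(v > 0.f ? v + 0.5f : v - 0.5f);
}

// Converts one chunk of float audio into a signed 16-bit buffer, similar in
// spirit to filling a member such as audio_s16_ before running a detector.
inline std::vector<int16_t> ToS16Sketch(const float* audio,
                                        std::size_t length) {
  std::vector<int16_t> out(length);
  for (std::size_t i = 0; i < length; ++i) {
    out[i] = FloatS16ToS16Sketch(audio[i]);
  }
  return out;
}
```

Reading `s16` as "signed 16-bit" matches turaj's suggestion in the thread above.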
	117 size_t chunks_since_voice_;

	118 bool is_speech_;

143 };	119 };

144	120

145 } // namespace webrtc	121 } // namespace webrtc

146	122

147 #endif // WEBRTC_MODULES_AUDIO_PROCESSING_INTELLIGIBILITY_INTELLIGIBILITY_ENHANCER_H_	123 #endif // WEBRTC_MODULES_AUDIO_PROCESSING_INTELLIGIBILITY_INTELLIGIBILITY_ENHANCER_H_

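The header only declares `IsSpeech()`, `vad_`, `chunks_since_voice_`, and `is_speech_`; the logic lives in the .cc file, which is not shown here. One plausible way such members could fit together is a hangover gate: a chunk counts as speech if the detector fired recently, so the speech power estimate is not updated during long pauses. The sketch below is an assumption-laden illustration, not the CL's implementation: `SpeechGateSketch`, the probability threshold, and the hangover length are all hypothetical, and the voice probability is passed in directly instead of coming from a real detector.

```cpp
#include <cstddef>

// Hypothetical hangover gate driven by a per-chunk voice probability.
class SpeechGateSketch {
 public:
  SpeechGateSketch(float probability_threshold, std::size_t hangover_chunks)
      : threshold_(probability_threshold),
        hangover_chunks_(hangover_chunks),
        chunks_since_voice_(hangover_chunks),  // Start in the non-speech state.
        is_speech_(false) {}

  // Feeds one chunk's voice probability and returns whether the chunk should
  // be treated as speech. Speech stays "on" for up to hangover_chunks - 1
  // chunks after the last chunk whose probability exceeded the threshold.
  bool Update(float voice_probability) {
    if (voice_probability > threshold_) {
      chunks_since_voice_ = 0;
    } else if (chunks_since_voice_ < hangover_chunks_) {
      ++chunks_since_voice_;
    }
    is_speech_ = chunks_since_voice_ < hangover_chunks_;
    return is_speech_;
  }

  bool is_speech() const { return is_speech_; }

 private:
  const float threshold_;
  const std::size_t hangover_chunks_;
  std::size_t chunks_since_voice_;
  bool is_speech_;
};
```

A caller would update the clear-speech power estimator only when `Update()` returns true, which is the intent stated in the issue title.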