Issue 1693823004: Use VAD to get a better speech power estimation in the IntelligibilityEnhancer

aluebs-webrtc

aluebs@webrtc.org changed reviewers: + henrik.lundin@webrtc.org, turaj@webrtc.org

4 years, 10 months ago (2016-02-12 19:40:31 UTC) #1

turaj

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc File webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc (right): https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc#newcode38 webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc:38: const float kGainChangeLimit = 0.1f; // Maximum change in ...

4 years, 10 months ago (2016-02-13 00:09:43 UTC) #3

hlundin-webrtc

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc File webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc (right): https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc#newcode37 webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc:37: const float kDecayRate = 0.97f; // Power estimation decay ...

4 years, 10 months ago (2016-02-15 13:05:12 UTC) #4

aluebs-webrtc

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc File webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc (right): https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc#newcode37 webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc:37: const float kDecayRate = 0.97f; // Power estimation decay ...

4 years, 10 months ago (2016-02-19 03:56:31 UTC) #5

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
File webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc
(right):

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc:37:
const float kDecayRate = 0.97f; // Power estimation decay rate.
On 2016/02/15 13:05:11, hlundin-webrtc wrote:
> Two spaces before comment.

Done.

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc:37:
const float kDecayRate = 0.97f; // Power estimation decay rate.
On 2016/02/15 13:05:11, hlundin-webrtc wrote:
> You change this value from 0.9 to 0.97. Can you explain why?

For this algorithm we care about the long-time psd of the speech and not
individual phonemes, which is what was happening before this patch. Using 0.97
and knowing that each step is 8ms long we can estimate the time constant in
182ms instead of 52ms. Now I increased it to 0.98, which gives us a time
constant of 275ms.

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc:38:
const float kGainChangeLimit = 0.1f; // Maximum change in gain.
On 2016/02/15 13:05:11, hlundin-webrtc wrote:
> Two spaces before comment.

Done.

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc:38:
const float kGainChangeLimit = 0.1f; // Maximum change in gain.
On 2016/02/13 00:09:42, turaj wrote:
> Is kGainChangeLimit relative to current value, or absolute? The naming
indicates
> absolute limit (I haven't checked the implementation), but doesn't relative
> limit make more sense? If so, please change the name.

It was an absolute limit, but I agree that a relative limit makes much more
sense, so I changed it to be that way and fixed the name.

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc:39:
const float kRho = 0.0004f;
On 2016/02/15 13:05:11, hlundin-webrtc wrote:
> This value is also changed...

It changed to be squared, which is the only way it is used. I didn't change the
naming because the variable containing the squared value before was also called
rho_.

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc:110:
float freqs_khz = kClipFreq / 1000.f;
On 2016/02/15 13:05:11, hlundin-webrtc wrote:
> This is const too. And it should probably be named kFreqsKhz, since it is
> compile-time defined.

I removed it and updated the constant directly.

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc:111:
size_t erb_index = static_cast<size_t>(ceilf(
On 2016/02/15 13:05:11, hlundin-webrtc wrote:
> const

Done.

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc:155:
MapToErbBands(noise_power_estimator_->power(),
On 2016/02/13 00:09:42, turaj wrote:
> I'm confused that why we are back to using PowerEstimator for captured signal.

To be consistent with the PSD estimation from the spectrum magnitude which looks
something like this: mean(x^2). Also, it adds an additional smoothing, which
will help  to avoid some spurious peaks in the noise estimation.

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc:323:
if (vad_.last_voice_probability() > kVoiceProbabilityThreshold) {
On 2016/02/13 00:09:42, turaj wrote:
> I thought we gonna use the energy-based VAD with has binary output.

As discussed offline, having the pitch-based VAD will have less false positives
and by adding a guard we can estimate the PSD for voiced and unvoiced speech,
while not adapting to the noise during long periods of silence.

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.cc:325:
} else if (chunks_since_voice_ < kSpeechOffsetDelay) {
On 2016/02/13 00:09:42, turaj wrote:
> If energy-based VAD is used, do we still need this guard? 

No, but I think we should use the pitch-based VAD.

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
File webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.h
(left):

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.h:51:
explicit IntelligibilityEnhancer(const Config& config);
On 2016/02/13 00:09:43, turaj wrote:
> ctor with a config struct might come handy for tuning, but it is totally up to
> you.

I dropped num_capture_channels and analysis_rate and tuned decay_rate in this
CL. That only leaves the gain_change_limit and rho to be tuned, because
sample_rate_hz and num_render_channels are settable parameters. I think in this
case this constructor adds more complexity than the simplicity for tuning. But
if you have a strong opinion I am happy to add it back.

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
File webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.h
(right):

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.h:116:
std::vector<int16_t> audio_s16_;
On 2016/02/15 13:05:11, hlundin-webrtc wrote:
> What does "s" mean in audio_s16_?

Not sure, but I am following the same naming than in audio_util. I am happy to
rename it if you suggest a better name for it :)

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
File webrtc/modules/audio_processing/intelligibility/intelligibility_utils.cc
(right):

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
webrtc/modules/audio_processing/intelligibility/intelligibility_utils.cc:39:
void PowerEstimator::Step(const float* data) {
On 2016/02/13 00:09:43, turaj wrote:
> This is a utility class so I guess it is fine to be implemented like this. But
> this is a bit dangerous implementation in the sense that one might call both
> Step() methods and mess up with averaging. I guess what you need here is a
> template class.

Done.

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
webrtc/modules/audio_processing/intelligibility/intelligibility_utils.cc:40: for
(size_t i = 0; i < power_.size(); ++i) {
On 2016/02/15 13:05:12, hlundin-webrtc wrote:
> On 2016/02/13 00:09:43, turaj wrote:
> > maybe a DCHECK that power_ and data are of the same size.
> 
> You don't know the size of |data|.

As Henrik points out, the size of data is unknown. If you think it is worth it I
can pass in the size as another parameter and DCHECK.

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
File webrtc/modules/audio_processing/intelligibility/intelligibility_utils.h
(right):

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
webrtc/modules/audio_processing/intelligibility/intelligibility_utils.h:35:
const float* power() { return &power_[0]; };
On 2016/02/15 13:05:12, hlundin-webrtc wrote:
> Why not return a const std::vector<float>&?

Done.

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
File
webrtc/modules/audio_processing/intelligibility/test/intelligibility_proc.cc
(right):

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
webrtc/modules/audio_processing/intelligibility/test/intelligibility_proc.cc:63:
size_t samples = std::min(in_stat.st_size, noise_stat.st_size) / 2;
On 2016/02/15 13:05:12, hlundin-webrtc wrote:
> const?

Done.

turaj

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.h File webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.h (left): https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.h#oldcode51 webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.h:51: explicit IntelligibilityEnhancer(const Config& config); On 2016/02/19 03:56:31, aluebs-webrtc wrote: ...

4 years, 10 months ago (2016-02-19 16:48:47 UTC) #6

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
File webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.h
(left):

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.h:51:
explicit IntelligibilityEnhancer(const Config& config);
On 2016/02/19 03:56:31, aluebs-webrtc wrote:
> On 2016/02/13 00:09:43, turaj wrote:
> > ctor with a config struct might come handy for tuning, but it is totally up
to
> > you.
> 
> I dropped num_capture_channels and analysis_rate and tuned decay_rate in this
> CL. That only leaves the gain_change_limit and rho to be tuned, because
> sample_rate_hz and num_render_channels are settable parameters. I think in
this
> case this constructor adds more complexity than the simplicity for tuning. But
> if you have a strong opinion I am happy to add it back.

My point was that to tune for rho or gain_change_limit (maybe decay_factor of
noise should be different than clear speech) one needs to recompile, but if that
is fine with you I have no complain. I agree that the final code should look
like what you have here.

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
File webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.h
(right):

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.h:116:
std::vector<int16_t> audio_s16_;
On 2016/02/19 03:56:31, aluebs-webrtc wrote:
> On 2016/02/15 13:05:11, hlundin-webrtc wrote:
> > What does "s" mean in audio_s16_?
> 
> Not sure, but I am following the same naming than in audio_util. I am happy to
> rename it if you suggest a better name for it :)

Perhaps it means signed, like sox uses similar convention.

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
File webrtc/modules/audio_processing/intelligibility/intelligibility_utils.cc
(right):

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
webrtc/modules/audio_processing/intelligibility/intelligibility_utils.cc:40: for
(size_t i = 0; i < power_.size(); ++i) {
On 2016/02/19 03:56:31, aluebs-webrtc wrote:
> On 2016/02/15 13:05:12, hlundin-webrtc wrote:
> > On 2016/02/13 00:09:43, turaj wrote:
> > > maybe a DCHECK that power_ and data are of the same size.
> > 
> > You don't know the size of |data|.
> 
> As Henrik points out, the size of data is unknown. If you think it is worth it
I
> can pass in the size as another parameter and DCHECK.

Sorry, my mistake, no I don't think it is necessary to pass the size.

https://codereview.webrtc.org/1693823004/diff/40001/webrtc/modules/audio_proc...
File webrtc/modules/audio_processing/intelligibility/intelligibility_utils.cc
(right):

https://codereview.webrtc.org/1693823004/diff/40001/webrtc/modules/audio_proc...
webrtc/modules/audio_processing/intelligibility/intelligibility_utils.cc:34:
return current * gain;
I'm not sure if |current| could ever be zero, but if it is then this way of
update will never recover it, how about just changing the last line of the
previous implementation to 

return current + sign * fminf(delta, fmaxf(limit * current, epsilon()))

https://codereview.webrtc.org/1693823004/diff/40001/webrtc/modules/audio_proc...
File webrtc/modules/audio_processing/intelligibility/intelligibility_utils.h
(right):

https://codereview.webrtc.org/1693823004/diff/40001/webrtc/modules/audio_proc...
webrtc/modules/audio_processing/intelligibility/intelligibility_utils.h:33:
template <typename T>
I'm sure have compiled this code, just out of my curiosity, usually templates
need instantiations, wasn't that necessary here? And why implementation is in
header file, while implementation of constructor is in .cc file?

I'm not experienced with templates, but I would have templated the class and
instantiate the possible templates. I guess in this case instantiations would be
<float> and <std::complex<float>>.

aluebs-webrtc

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.h File webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.h (left): https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.h#oldcode51 webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.h:51: explicit IntelligibilityEnhancer(const Config& config); On 2016/02/19 16:48:47, turaj wrote: ...

4 years, 10 months ago (2016-02-19 19:30:48 UTC) #7

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
File webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.h
(left):

https://codereview.webrtc.org/1693823004/diff/1/webrtc/modules/audio_processi...
webrtc/modules/audio_processing/intelligibility/intelligibility_enhancer.h:51:
explicit IntelligibilityEnhancer(const Config& config);
On 2016/02/19 16:48:47, turaj wrote:
> On 2016/02/19 03:56:31, aluebs-webrtc wrote:
> > On 2016/02/13 00:09:43, turaj wrote:
> > > ctor with a config struct might come handy for tuning, but it is totally
up
> to
> > > you.
> > 
> > I dropped num_capture_channels and analysis_rate and tuned decay_rate in
this
> > CL. That only leaves the gain_change_limit and rho to be tuned, because
> > sample_rate_hz and num_render_channels are settable parameters. I think in
> this
> > case this constructor adds more complexity than the simplicity for tuning.
But
> > if you have a strong opinion I am happy to add it back.
> 
> My point was that to tune for rho or gain_change_limit (maybe decay_factor of
> noise should be different than clear speech) one needs to recompile, but if
that
> is fine with you I have no complain. I agree that the final code should look
> like what you have here.

I will leave this as is and will add a more flexible constructor locally when
tuning the remaining parameters.

https://codereview.webrtc.org/1693823004/diff/40001/webrtc/modules/audio_proc...
File webrtc/modules/audio_processing/intelligibility/intelligibility_utils.cc
(right):

https://codereview.webrtc.org/1693823004/diff/40001/webrtc/modules/audio_proc...
webrtc/modules/audio_processing/intelligibility/intelligibility_utils.cc:34:
return current * gain;
On 2016/02/19 16:48:47, turaj wrote:
> I'm not sure if |current| could ever be zero, but if it is then this way of
> update will never recover it, how about just changing the last line of the
> previous implementation to 
> 
> return current + sign * fminf(delta, fmaxf(limit * current, epsilon()))

|current| should never be zero, since it starts in 1.f and gets multiplied by
gains between (1.f - limit) and (1.f + limit), but I see how it could get really
close. What I fear from your suggestion is that it could return negative numbers
when |target| is zero and |current| is smaller than epsilon. I added epsilon to
the return value to my proposal to make sure current never becomes zero. WDYT?

https://codereview.webrtc.org/1693823004/diff/40001/webrtc/modules/audio_proc...
File webrtc/modules/audio_processing/intelligibility/intelligibility_utils.h
(right):

https://codereview.webrtc.org/1693823004/diff/40001/webrtc/modules/audio_proc...
webrtc/modules/audio_processing/intelligibility/intelligibility_utils.h:33:
template <typename T>
On 2016/02/19 16:48:47, turaj wrote:
> I'm sure have compiled this code, just out of my curiosity, usually templates
> need instantiations, wasn't that necessary here? And why implementation is in
> header file, while implementation of constructor is in .cc file?
> 
> I'm not experienced with templates, but I would have templated the class and
> instantiate the possible templates. I guess in this case instantiations would
be
> <float> and <std::complex<float>>. 

I am not experienced with templates either, but I decided to only templatize the
function, since the implementation is the same and power_ has the same type,
changing only the input type of that function. But maybe it is preferred to
templatize the whole class, I am not sure.
Templates only need explicit instantiations when implemented in the cc file
instead of the header, so I implemented the template function in the header to
get the additional flexibility of not having to explicitly instantiate.
If you prefer a class template, I am happy to change it, since I am not sure
myself what the criteria is to chose between function and class template. Or if
you prefer to have the implementation in the cc file and restrict the template
with explicit instantations.

hlundin-webrtc

The diff got horrible by the rebase you did in patch set 2. Please, avoid ...

4 years, 10 months ago (2016-02-22 11:03:13 UTC) #8

turaj

LGTM, with an optional change. https://codereview.webrtc.org/1693823004/diff/40001/webrtc/modules/audio_processing/intelligibility/intelligibility_utils.h File webrtc/modules/audio_processing/intelligibility/intelligibility_utils.h (right): https://codereview.webrtc.org/1693823004/diff/40001/webrtc/modules/audio_processing/intelligibility/intelligibility_utils.h#newcode33 webrtc/modules/audio_processing/intelligibility/intelligibility_utils.h:33: template <typename T> Like ...

4 years, 10 months ago (2016-02-22 17:36:54 UTC) #9

aluebs-webrtc

Sorry if the diff got horrible, I find that as long as it is kept ...

4 years, 10 months ago (2016-02-22 20:41:03 UTC) #10

aluebs-webrtc

The patchset sent to the CQ was uploaded after l-g-t-m from henrik.lundin@webrtc.org, turaj@webrtc.org Link to ...

4 years, 10 months ago (2016-02-22 20:41:27 UTC) #12