
Speech Processing for Digital Home Assistants

Combining signal processing with deep-learning techniques

Reinhold Haeb-Umbach, Shinji Watanabe, Tomohiro Nakatani, Michiel Bacchiani, Björn Hoffmeister, Michael L. Seltzer, Heiga Zen, and Mehrez Souden

Digital Object Identifier 10.1109/MSP.2019.2918706
Date of current version: 29 October 2019

Once a popular theme of futuristic science fiction or far-fetched technology forecasts, digital home assistants with a spoken language interface have become a ubiquitous commodity today. This success has been made possible by major advancements in signal processing and machine learning for so-called far-field speech recognition, where the commands are spoken at a distance from the sound-capturing device. The challenges encountered are quite unique and different from many other use cases of automatic speech recognition (ASR). The purpose of this article is to describe, in a way that is amenable to the nonspecialist, the key speech processing algorithms that enable reliable, fully hands-free speech interaction with digital home assistants. These technologies include multichannel acoustic echo cancellation (MAEC), microphone array processing and dereverberation techniques for signal enhancement, reliable wake-up word and end-of-interaction detection, and
high-quality speech synthesis as well as sophisticated sta-
tistical models for speech and language, learned from large
amounts of heterogeneous training data. In all of these fields,
deep learning (DL) has played a critical role.

Evolution of digital home assistants
In the last several years, the smart speaker has emerged as a
rapidly growing new category of consumer electronic devic-
es. Smart speakers are Internet-connected loudspeakers con-
taining a digital assistant that can perform a variety of tasks
through a hands-free spoken language interface. In many cas-
es, these devices lack a screen and voice is the only input and
output modality. These digital home assis-
tants initially performed a small number of
tasks, such as playing music, retrieving the
time or weather, setting alarms, and basic
home automation. Over time, the capabili-
ties of these systems have grown dramati-
cally, as developers have created third-party
“skills” in much the same way that smartphones created an
ecosystem of apps.

The success of smart speakers in the marketplace can be
largely attributed to advances in all of the constituent tech-
nologies that comprise a digital assistant, including the digital
signal processing involved in capturing the user’s voice, the
speech recognition that turns said voice into text, the natural
language understanding that converts the text into a user’s
intent, the dialog system that decides how to respond, the natu-
ral language generation (NLG) that puts the system’s action
into natural language, and finally, the speech synthesis that
speaks this response to the user.

In this article, we describe in detail the signal process-
ing and speech technologies that are involved in captur-
ing the user’s voice and converting it to text in the context
of digital assistants for smart speakers. We focus on these
aspects of the system because they are the ones most differ-
ent from previous digital assistants, which reside on mobile
phones. Unlike smartphones, smart speakers are located at
a fixed location in a home environment, and thus need to
be capable of performing accurate speech recognition from
anywhere in the room. In these environments, the user may

be several meters from the device; as a result, the captured
speech signal can be significantly corrupted by ambient
noise and reverberation. In addition, smart speakers are typi-
cally screenless devices, so they need to support completely
hands-free interaction, including accurate voice activation to
wake up the device.

We present breakthroughs in the field of far-field ASR,
where reliable recognition is achieved despite significant sig-
nal degradations. We show how the DL paradigm has pen-
etrated virtually all components of the system and has played
a pivotal role in the success of digital home assistants.

Note that several of the technological advancements described in this article have been inspired or accompanied by efforts in the academic community, which have provided researchers the opportunity to carry out comprehensive evaluations of technologies for far-field robust speech recognition using shared data sets and a
common evaluation framework. Notably, the Computational
Hearing in Multisource Environments (CHiME) series of
challenges [1], [2], the Reverberant Voice Enhancement and
Recognition Benchmark (REVERB) Challenge [3], and the
Automatic Speech Recognition in Reverberant Environ-
ments (ASpIRE) Challenge [4] were met with considerable
enthusiasm by the research community.

While these challenges led to significant improvements in
the state of the art, they were focused primarily on speech
recognition accuracy in far-field conditions as the criterion
for success. Factors such as algorithmic latency or com-
putational efficiency were not considered. However, the
success of digital assistants in smart speakers can be attrib-
uted to not just the system’s accuracy but also its ability
to operate with low latency, which creates a positive user
experience by responding to the user’s query with an answer
shortly after the user stops speaking.

The acoustic environment in the home
In a typical home environment, the distance between the user
and the microphones on the smart loudspeaker is on the order
of a few meters. There are multiple ways in which this distance
negatively impacts the quality of the recorded signal, particu-
larly when compared to a voice signal captured on a mobile
phone or headset.

First, signal attenuation occurs as the sound propagates
from the source to the sensor. In free space, the power of the
signal per unit surface decreases by the square of the dis-
tance. This means that if the distance between the speaker
and microphone is increased from 2 cm to 1 m, the signal will
be attenuated by 34 dB. In reality, the user’s mouth is not an
omnidirectional source and, therefore, the attenuation will not
be this severe; however, it still results in a significant loss of
signal power.
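As a quick check of this figure, the free-field (inverse-square) attenuation between two distances can be computed directly; this small sketch is purely illustrative:

```python
import numpy as np

def free_field_attenuation_db(d_near: float, d_far: float) -> float:
    """Attenuation (in dB) when moving from distance d_near to d_far from a
    point source under the inverse-square law: power ~ 1/d^2."""
    return 10.0 * np.log10((d_far / d_near) ** 2)  # = 20*log10(d_far/d_near)

print(free_field_attenuation_db(0.02, 1.0))  # ~34 dB, as quoted in the text
```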

Second, the distance between the source and a sensor in
a contained space such as a living room or kitchen causes
reverberation as a consequence of multipath propagation.

FIGURE 1. An AIR consists of the direct sound, early reflections, and late reverberation (amplitude plotted over time in ms).


The wavefront of the speech signal repeatedly reflects off
the walls and objects in the room. Thus, the signal recorded
at the microphone consists of multiple copies of the source
signal, each with a different attenuation and time delay.
This effect is described by the acoustic impulse response
(AIR) or its equivalent representation in the frequency
domain, the acoustic transfer function (ATF). Reverber-
ant speech is thus modeled as the original source signal
filtered by the AIR.

An AIR can be broadly divided into direct sound, early
reflections (up to roughly the first 50 ms), and late rever-
beration, as shown in Figure 1. While early reflections are
actually known to improve the perceptual quality by increas-
ing the signal level compared to the “dry” direct path signal,
the late reverberation causes difficulty in perception—for
humans and machines alike—because it smears the signal
over time [5].

The degree of reverberation is often measured by the time it takes for the signal power to decrease to 60 dB below its original value; this is referred to as the reverberation time and is denoted by T_{60}. Its value depends on the size of the room, the materials comprising the walls, floor, and ceiling, as well as the furniture. A typical value for a living room is between 300 and 700 ms. Because the reverberation time is usually much longer than the typical short-time signal analysis window of 20–64 ms, its effect cannot be adequately described by considering a single speech frame in isolation. Thus, the convolution of the source signal with the AIR cannot be represented by multiplying their corresponding transforms in the short-time Fourier transform (STFT) domain; rather, it is approximated by a convolution over frames

x_{t,f} = \sum_{m=0}^{M-1} a_{m,f}\, s_{t-m,f}.   (1)

Here, x_{t,f}, s_{t,f}, and a_{m,f} are the STFT coefficients of the reverberated signal, source signal, and AIR, respectively, at (discrete) time frame t and frequency bin index f. The length M of the STFT of the AIR is approximately given by T_{60}/B, where B is the frame advance (e.g., 10 ms). Clearly, the effect of reverberation spans multiple consecutive time frames, leading to a temporal dispersion of a speech event over adjacent speech feature vectors.
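As an illustration of this frame-wise convolution, here is a minimal NumPy sketch of the model in (1); the array shapes and the toy exponentially decaying AIR are assumptions made purely for demonstration:

```python
import numpy as np

def reverberate_stft(S: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Apply the convolutive transfer function model of (1).

    S: (T, F) complex STFT of the dry source signal s_{t,f}
    A: (M, F) complex STFT-domain AIR coefficients a_{m,f}
    Returns X: (T, F) with x_{t,f} = sum_m a_{m,f} * s_{t-m,f}
    """
    T, F = S.shape
    M = A.shape[0]
    X = np.zeros((T, F), dtype=complex)
    for m in range(M):
        X[m:, :] += A[m, :] * S[:T - m, :]   # source shifted by m frames
    return X

# Example: T60 = 500 ms and frame advance B = 10 ms give M ~ T60 / B = 50 taps
T, F, M = 300, 257, 50
rng = np.random.default_rng(0)
S = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))
A = (rng.standard_normal((M, F)) + 1j * rng.standard_normal((M, F))) \
    * np.exp(-np.arange(M) / 10.0)[:, None]   # crude exponential decay
X = reverberate_stft(S, A)
```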

Third, in a distant-talking speech recognition scenario,
it is likely that the microphone will capture other interfer-
ing sounds, in addition to the desired speech signal. These
sources of acoustic interference can be diverse, hard to pre-
dict, and often nonstationary in nature, and thus, difficult
to compensate. In a home environment, common sources of
interference include TV or radio, home appliances, and other
people in the room.

These signal degradations can be observed in Figure 2, which shows the speech utterance “Alexa stop” in 1) a close-talk recording, 2) a distant-speech recording, and 3) a distant-speech recording with additional background speech. Keyword detection and speech recognition are much more challenging in the latter case.

The final major source of signal degradation is the cap-
ture of signals that originate from the loudspeaker itself
during playback. Because the loudspeaker and the micro-
phones are colocated on the device, the playback signal can
be as much as 30–40 dB louder than the user’s voice, ren-
dering the user’s command inaudible if no countermeasures
are taken.

System overview
Figure 3 shows the high-level overview of a digital home
assistant’s speech processing components. For sound ren-
dering, the loudspeaker system plays music or system re-
sponses. For sound capture, digital home assistants employ
an array of microphones (typically between two and eight).
Due to the form factor of the device, the array is compact
with distances between the microphones on the order of a
few centimeters. In the following section, techniques from
multichannel signal processing are described that can com-
pensate for many of the sources of signal degradation dis-
cussed previously.

The signal processing front end performs acoustic echo
cancellation, dereverberation, noise reduction (NR), and source
separation, all of which aim to clean up the captured signal for
input to the downstream speech recognizer. For a true hands-
free interface, the system must detect whether speech has been
directed to the device. This can be done using

• wake-word detectors (wake words are also called hotwords, keywords, or voice triggers), which decide whether a user has said the keyword (e.g., “OK Google”) that addresses the device
• end-of-query detectors, which are equally important for signaling that the user’s input is complete

FIGURE 2. A speech utterance starting with the wake word “Alexa” followed by “Stop” in close-talk, reverberated, and noisy, reverberated conditions. The red bars indicate the detected start and end times of the keyword “Alexa” and the end of the utterance.


• second-turn, device-directed speech classifiers, which eliminate the need to use the wake word when resuming an ongoing dialogue
• speaker identification modules, which make the system capable of interpreting a query in a user-dependent way.

Once device-directed speech is detected, it is forwarded to the
ASR component.

The recognized word sequence is then forwarded to
the natural language processing (NLP) and dialog man-
agement subsystem, which analyzes the user input and
decides on a response. The NLG component prepares
the desired system response, which is spoken out on the
device through the text-to-speech (TTS) component. Note
that NLP is beyond the scope of this article. The remain-
der of this article focuses on the various speech process-
ing tasks.

Some of the aforementioned processing tasks are carried out
on the device, typically those close to the input–output, while
others are done on the server. Although the division between
client and server may vary, it is common practice to run signal
enhancement and wake-word detection on the device, while the
primary ASR and NLP are done on the server.

Multichannel speech enhancement
The vector of the D microphone signals y_{t,f} = (y_1, \ldots, y_D)^T at time–frequency (tf) bin (t, f) can be written in the STFT domain [6] as

y_{t,f} = \underbrace{\sum_{i=1}^{N_s} \sum_{m=0}^{M-1} a_{m,f}^{(i)}\, s_{t-m,f}^{(i)}}_{\text{speech}} + \underbrace{\sum_{j=1}^{N_o} \sum_{m=0}^{M-1} w_{m,f}^{(j)}\, o_{t-m,f}^{(j)}}_{\text{playback}} + \underbrace{n_{t,f}}_{\text{noise}}.   (2)

The first sum is over the N_s speech sources s_{t,f}^{(i)}, i = 1, \ldots, N_s, where a_{m,f}^{(i)} is the vector of ATFs from the ith source to the microphones. The second sum describes the playback of the N_o loudspeaker signals o_{t,f}^{(j)}, j = 1, \ldots, N_o, which are inadvertently captured by the microphones via the ATF vectors w_{m,f}^{(j)} at frequency bin f. Additionally, n_{t,f} denotes additive noise; here, we assume for simplicity that the transfer functions are time invariant and of the same length.

Only one of these speech signals contains the user’s command, while all other components of the received signal are distortions. In the following sections, we describe how to extract this desired signal.
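To make the notation in (2) concrete, the following NumPy sketch assembles a multichannel observation from reverberant speech, captured playback, and noise; all array shapes and inputs are hypothetical placeholders chosen purely for illustration:

```python
import numpy as np

def mix_microphone_signals(speech, speech_airs, playback, echo_paths, noise):
    """Assemble y_{t,f} as in (2): reverberant speech + captured playback + noise.

    speech:      list of N_s arrays, each (T, F)     -- source STFTs s^(i)
    speech_airs: list of N_s arrays, each (M, F, D)  -- ATF taps a^(i) per mic
    playback:    list of N_o arrays, each (T, F)     -- loudspeaker STFTs o^(j)
    echo_paths:  list of N_o arrays, each (M, F, D)  -- ATF taps w^(j) per mic
    noise:       (T, F, D) array                     -- additive noise n
    Returns y:   (T, F, D) multichannel observation.
    """
    T, F, D = noise.shape
    y = noise.astype(complex).copy()
    for src, atf in list(zip(speech, speech_airs)) + list(zip(playback, echo_paths)):
        M = atf.shape[0]
        for m in range(M):
            # y_t += atf_m * src_{t-m}, broadcast over the D microphones
            y[m:, :, :] += atf[m] * src[:T - m, :, None]
    return y
```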

MAEC
MAEC is a signal processing approach that prevents signals generated by a device’s loudspeaker and captured by the device’s own microphones from confusing the system. MAEC is a well-established technology that relies on the use of adaptive filters [7]; these filters estimate the acoustic paths between loudspeakers and microphones to identify the part of the microphone signal that is caused by the system output and then subtract it from the captured microphone signal.
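As an illustration of the adaptive-filter principle (a textbook sketch, not the MAEC implementation of any particular product), a minimal single-channel, time-domain normalized LMS echo canceller could look as follows:

```python
import numpy as np

def nlms_echo_canceller(mic, ref, filter_len=512, mu=0.5, eps=1e-8):
    """Normalized LMS: estimate the loudspeaker-to-microphone path from the
    playback reference `ref` and subtract the predicted echo from `mic`.
    Returns the error signal e, which ideally contains only near-end speech."""
    h = np.zeros(filter_len)          # estimated echo path
    e = np.zeros(len(mic))
    buf = np.zeros(filter_len)        # most recent reference samples
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        echo_hat = h @ buf            # predicted echo sample
        e[n] = mic[n] - echo_hat      # echo-cancelled output
        h += mu * e[n] * buf / (buf @ buf + eps)   # NLMS update
    return e
```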

Linear adaptive filters can suppress the echoes by typically 10–20 dB, but they cannot remove them com-
pletely. One reason is the presence of nonlinear compo-
nents in the echo signal, which are caused by loudspeaker
nonlinearities and mechanical vibrations. Another reason
is that the filter lengths must not be chosen to be too large
to enable fast adaptation to changing echo paths. These
lengths are usually shorter than the true loudspeaker-to-
microphone impulse responses. Furthermore, there is a
well-known ambiguity issue with system identification in
MAEC [7].

Therefore, it is common practice in acoustic echo can-
cellation to employ a residual echo suppressor following
echo cancellation. In a modern digital home assistant, its
filter coefficients are determined with the help of a neu-
ral network (NN) [6]. The deep NN (DNN) is trained to
estimate, for each tf bin, a speech presence probability
(SPP). Details of this procedure are described in “Unsu-
pervised and Supervised Speech Presence Probability
Estimation.” From this SPP a mask can be computed,

which separates desired speech-dominated tf bins from those dominated by residual echoes, and from this information, the coefficients of a multichannel filter for residual echo suppression are computed.

FIGURE 3. An overview of the example architecture of signal processing tasks in a smart loudspeaker. (Blocks: MAEC; dereverberation, beamforming, source separation, and channel selection; device-directed speech detection; ASR; NLP, dialog management, and NLG; knowledge base; TTS.)

With MAEC in place, the device can listen to a command while the loudspeaker is in use, e.g., playing music. The user can barge in and still be understood, an important feature for user convenience. Once the wake-up keyword has been detected, the loudspeaker signal and MAEC are ducked or switched off, while the speech recognizer is activated.

Dereverberation
We now turn our attention to the first sum in (2). Assuming
for simplicity that a single speech source is present, this term
simplifies to (1).

As mentioned previously, it is the late reverberation that is harmful to speech recognition performance. Decomposing the reverberated signal into the direct sound and early reflections x_{t,f}^{(\mathrm{early})} and the late reverberation x_{t,f}^{(\mathrm{late})} according to

x_{t,f} = x_{t,f}^{(\mathrm{early})} + x_{t,f}^{(\mathrm{late})},   (3)

it is the late reverberation that a dereverberation algorithm aims to remove, while preserving the direct signal and early reflections.

There is a wealth of literature on signal dereverbera-
tion [5]. Approaches can be broadly categorized into lin-
ear filtering and magnitude or power spectrum-estimation
techniques. For ASR tasks, the linear filtering approach is
recommended because it does not introduce nonlinear dis-
tortions to the signal, which can be detrimental to speech
recognition performance.

Using the signal model in (1) where the AIR is a finite
impulse response, a Kalman filter can be derived as the sta-
tistically optimum linear estimator under a Gaussian source
assumption. Because the AIR is unknown and even time vary-
ing, the Kalman filter is embedded in an expectation maxi-
mization (EM) framework, where Kalman filtering and signal
parameter estimation alternate [8].

If the reverberated signal is modeled as an autoregressive
stochastic process instead, linear prediction-based derever-
beration filters can be derived. A particularly effective method
that has found widespread use in far-field speech recognition
is the weighted prediction error (WPE) approach [9]. WPE
can be formulated as a multiple-input, multiple-output filter,

Unsupervised and Supervised Speech Presence Probability Estimation

In the unsupervised learning approach, a spatial mixture model is used to describe the statistics of y_{t,f} or a quantity derived from it:

p(y_{t,f}) = \sum_{k=0}^{1} \pi_k\, p(y_{t,f} \mid \theta_k),   (S1)

where we assumed a single speech source and where \pi_k is the a priori probability that an observation belongs to mixture component k and p(y_{t,f} \mid \theta_k) is an appropriate component distribution with parameters \theta_k [17]–[19]. This model rests upon the well-known sparsity of speech in the short-time Fourier transform (STFT) domain [20]

y_{t,f} = \begin{cases} a_f\, s_{t,f} + n_{t,f}, & z_{t,f} = 1 \\ n_{t,f}, & z_{t,f} = 0, \end{cases}   (S2)

where z_{t,f} is the hidden class affiliation variable, which indicates speech presence. The model parameters are estimated via the expectation maximization (EM) algorithm, which delivers the speech presence probability (SPP) \gamma_{t,f} = \Pr(z_{t,f} = 1 \mid y_{t,f}) in the E-step [21].
The supervised learning approach to SPP estimation employs a neural network (NN). Given a set of features extracted from the microphone signals at its input and the true class affiliations z_{t,f} at the output, the network is trained to output the SPP \gamma_{t,f} [22], [23]. Because all of the STFT bins f = 0, \ldots, F-1 are used as inputs, the network is able to exploit interfrequency dependencies, while the mixture model-based SPP estimation operates on each frequency independently. If additional cross-channel features, such as interchannel phase differences, are used as inputs, spatial information can also be exploited for SPP estimation.

In a batch implementation, given the SPP, the spatial covariance matrices of speech plus noise and noise are estimated by

\Phi_f^{(x)} = \frac{\sum_t \gamma_{t,f}\, y_{t,f}\, y_{t,f}^{H}}{\sum_t \gamma_{t,f}}; \qquad \Phi_f^{(n)} = \frac{\sum_t (1-\gamma_{t,f})\, y_{t,f}\, y_{t,f}^{H}}{\sum_t (1-\gamma_{t,f})}.   (S3)

From these covariance matrices, the beamformer coeffi-
cients of most common beamformers can be readily com-
puted [21]. By an appropriate definition of the noise mask,
this concept can also be extended to noisy and reverber-
ant speech, leading to a significant dereverberation effect
of the beamformer [24], as shown in Figure 4.

Low latency in a smart loudspeaker is important and impacts the design of both the EM-based (or, more generally, statistical) and the NN-based approaches; see, e.g., [6], [15], and [25] for further discussion.
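As a concrete illustration of the batch recipe in (S3) and the subsequent beamformer computation, here is a minimal NumPy sketch; the eigenvector-based steering estimate and the MVDR form are common choices assumed here, not prescribed by this box:

```python
import numpy as np

def spatial_covariances(Y, gamma, eps=1e-10):
    """Batch estimates as in (S3).

    Y:     (T, F, D) multichannel STFT observations y_{t,f}
    gamma: (T, F) speech presence probability in [0, 1]
    Returns (Phi_x, Phi_n): per-frequency (F, D, D) covariance matrices of
    speech-plus-noise and noise, respectively.
    """
    w_x = gamma / np.maximum(gamma.sum(axis=0), eps)              # (T, F)
    w_n = (1 - gamma) / np.maximum((1 - gamma).sum(axis=0), eps)  # (T, F)
    Phi_x = np.einsum('tf,tfd,tfe->fde', w_x, Y, Y.conj())
    Phi_n = np.einsum('tf,tfd,tfe->fde', w_n, Y, Y.conj())
    return Phi_x, Phi_n

def mvdr_weights(Phi_x, Phi_n, ref_mic=0):
    """One common recipe (an assumption, not the only option): take the
    principal eigenvector of Phi_x - Phi_n as steering vector and form the
    MVDR weights w_f = Phi_n^{-1} a / (a^H Phi_n^{-1} a) per frequency."""
    F, D, _ = Phi_x.shape
    W = np.zeros((F, D), dtype=complex)
    for f in range(F):
        _, evecs = np.linalg.eigh(Phi_x[f] - Phi_n[f])
        a = evecs[:, -1]                 # dominant eigenvector as steering
        a = a / a[ref_mic]               # fix the scaling ambiguity (RTF-like)
        num = np.linalg.solve(Phi_n[f], a)
        W[f] = num / (a.conj() @ num)
    return W   # enhanced signal: X_hat[t, f] = W[f].conj() @ Y[t, f]
```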


allowing further multichannel processing, such as beamform-
ing, to follow it [10], [11]. The underlying idea of WPE is to
estimate the late reverberation x_{t,f}^{(\mathrm{late})} and subtract it from the observation to obtain a maximum likelihood estimate of the early-arriving speech

\hat{x}_{t,f}^{(\mathrm{early})} = x_{t,f} - G_{t,f}\, \tilde{x}_{t-\Delta,f}.   (4)

Here, G_{t,f} is a matrix containing the linear prediction coefficients for the different channels, and \tilde{x}_{t-\Delta,f} = (x_{t-\Delta,f}^T, \ldots, x_{t-\Delta-L+1,f}^T)^T is a stacked representation of past observations, where L is the length of the dereverberation filter. It is important to note that \hat{x}_{t,f}^{(\mathrm{early})} at time frame t is estimated from observations at least \Delta frames in the past. This ensures that the dereverberation filter does not destroy the inherent temporal correlation of a speech signal, which is not caused by the reverberation. The filter coefficient matrix cannot be estimated in closed form; the reason is that the driving process of the autoregressive model, \hat{x}_{t,f}^{(\mathrm{early})}, has an unknown and time-varying variance \lambda_{t,f}. However, an iterative procedure can be derived, which alternates between estimating the variance \lambda_{t,f} and the matrix of filter coefficients G_{t,f} on signal segments.

Because WPE is an iterative algorithm, it is not suitable for use in a digital home assistant, where low latency is important; however, the estimation of the filter coefficients can be cast as a recursive least squares problem [12]. Furthermore, using the average over a window of observed speech power spectra as an estimate of the signal variance \lambda_{t,f}, a very efficient low-latency version of the algorithm can be used [13].
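To make the alternating estimation concrete, the following is a compact offline NumPy sketch of WPE for a single frequency bin; the tap count, prediction delay, and iteration count are illustrative assumptions, and latency-critical deployments would instead use the recursive variants cited above:

```python
import numpy as np

def wpe_single_frequency(Y, taps=10, delay=3, iters=3, eps=1e-10):
    """Offline WPE sketch for one frequency bin (an illustration of the
    iterative procedure behind (4), not a production implementation).

    Y: (D, T) complex STFT observations for one frequency bin.
    Returns Z: (D, T) estimate of the early-arriving speech.
    """
    D, T = Y.shape
    Z = Y.copy()
    for _ in range(iters):
        # 1) Estimate the time-varying source variance lambda_t.
        lam = np.maximum(np.mean(np.abs(Z) ** 2, axis=0), eps)        # (T,)
        # 2) Build the stacked, delayed observation vectors x~_{t-delay}.
        Xt = np.zeros((D * taps, T), dtype=complex)
        for k in range(taps):
            shift = delay + k
            Xt[k * D:(k + 1) * D, shift:] = Y[:, :T - shift]
        # 3) Solve the variance-weighted least-squares problem for G.
        R = (Xt / lam) @ Xt.conj().T                                   # (DK, DK)
        P = (Xt / lam) @ Y.conj().T                                    # (DK, D)
        G = np.linalg.solve(R + eps * np.eye(D * taps), P)             # (DK, D)
        # 4) Subtract the predicted late reverberation, as in (4).
        Z = Y - G.conj().T @ Xt
    return Z
```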

Many authors reported that WPE leads to word error rate
(WER) reductions of a subsequent speech recognizer [13],
[14]. How much of a WER reduction is achieved by derever-
beration depends on many factors such as degree of rever-
beration, signal-to-noise ratio (SNR), difficulty of the ASR
task, robustness of the models in the ASR decoder, and
so on. In [13], relative WER improvements of 5–10% were
reported on simulated digital home assistant data with a
pair of microphones and a strong back-end ASR engine.

Multichannel NR and beamforming
Multichannel NR aims to remove additive distortions, denoted by n_{t,f} in (2). If the AIR from the desired source to the sensors is known, a spatial filter (i.e., a beamformer) can be designed that emphasizes the source signal over signals with different transfer characteristics. In its simplest form, this filter compensates for the different propagation delays that the signals at the individual sensors of the microphone array exhibit and that are caused by their slightly different distances to the source.

For the noisy and reverberant home environment, this approach, however, is too simplistic. The microphone signals differ not only in their relative delay; the whole reflection pattern they are exposed to is different. Assuming again a single speech source and good echo suppression and dereverberation, (2) reduces to

y_{t,f} = x_{t,f} + n_{t,f} \approx a_f\, s_{t,f} + n_{t,f},   (5)

where a_f is the vector form of the AIRs to the multiple microphones, and where we assume it to be time invariant under the condition that the source and microphone positions do not change during a speech segment (e.g., an utterance). Note that unlike (1) and (2), the multiplicative transfer function approximation is used here, which is justified by the preceding dereverberation component. Any signal component that deviates from this assumption can be viewed as captured by the noise term n_{t,f}. Similarly, residual echoes can be viewed as contributing to n_{t,f}, which results in a spatial filter for denoising, dereverberation, and residual echo suppression.

Looking at (5), it is obvious that s_{t,f} and a_f can only be identified up to a (complex-valued) scalar, because s_{t,f} \cdot a_f = (s_{t,f} \cdot C) \cdot (a_f / C). To fix this ambiguity, a scale factor is chosen such that, for a given reference channel, e.g., channel 1, the value of the transfer function is 1. This yields the so-called relative transfer function (RTF) vector \tilde{a}_f = a_f / a_{1,f}.

Spatial filtering for signal enhancement is a classic and
well-studied topic for which statistically optimal solutions
are known; however, these textbook solutions usually assume
that the RTF \tilde{a}_f, or its equivalent in anechoic environments (i.e., the vector of time differences of arrival), is known,
which is an unrealistic assumption. The key to spatial filter-
ing is, again, SPP estimation (see “Unsupervised and Super-
vised Speech Presence Probability Estimation.”) The SPP
tells us which tf bins are dominated by the desired speech
signal and which are dominated by noise. Given this infor-
mation, spatial covariance matrices for speech and noise
can be estimated, from which, in turn, the beamformer
coefficients are computed. An alternative approach is to
use the SPP to derive a tf mask, which multiplies tf bins
dominated by noise with zero, thus leading to an effective
mask-based NR.
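A minimal sketch of such mask-based NR, with a simple binary threshold on the SPP assumed purely for illustration:

```python
import numpy as np

def mask_based_nr(Y, gamma, threshold=0.5, ref_mic=0):
    """Crude mask-based noise reduction (a sketch, not the deployed system):
    keep tf bins whose SPP exceeds a threshold, zero out the rest.

    Y:     (T, F, D) multichannel STFT observations
    gamma: (T, F) speech presence probability
    Returns the masked (T, F) STFT of the chosen reference microphone.
    """
    mask = (gamma > threshold).astype(float)   # binary tf mask
    return mask * Y[:, :, ref_mic]
```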

Figure 4 shows the effectiveness of beamforming for an
example utterance. The spectrogram, i.e., the tf represen-
tation of a clean speech signal, is displayed in Figure 4(a), fol-
lowed in Figure 4(b) by the same utterance after convolution
with an AIR, and in Figure 4(c) after the addition of noise.
Figure 4(d) shows the output of the beamformer, which effec-
tively removed noise and reverberation.

The usefulness of acoustic beamforming for speech recog-
nition is well documented. On the CHiME 3 and 4 challenge
data, acoustic beamforming reduced the WER by nearly half.
On typical digital home assistant data, WER reductions on
the order of 10–30% relative were reported [6], [15], [16].

Source separation and stream selection
Now we assume that, in addition to the desired speech
source, there are other competing talkers, resulting in a
total of N_s speech signals; see (2). Blind source separation
(BSS) is a technique that can separate multiple audio sources
into individual audio streams autonomously. Traditionally, re-
searchers tackle speech source separation using either unsu-
pervised methods, e.g., independent component analysis and


clustering [26], or DL [27], [28]. With clustering in particular,
BSS using spatial mixture models is a powerful tool that de-
composes the microphone array signal into the individual talk-
ers’ signals [17]–[19]. The parameters and variables of those
mixture models are learned via the EM algorithm, as explained
in “Unsupervised and Supervised Speech Presence Probability
Estimation.” The only difference is that the mixture model now has as many components as there are concurrent speakers. During the EM, for each speaker, a source activity probability (SAP), which is the equivalent of the SPP in the multispeaker case, is estimated.
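As an illustration, the following sketch runs the EM iterations for one common choice of spatial mixture model, a complex angular central Gaussian mixture, in a single frequency bin; the model choice and initialization are assumptions, and permutation alignment across frequency bins is omitted:

```python
import numpy as np

def cacgmm_em(Y, n_sources, iters=20, eps=1e-10):
    """EM for a complex angular central Gaussian mixture model in one
    frequency bin (one possible spatial mixture model for BSS).

    Y: (D, T) complex STFT observations for one frequency bin.
    Returns gamma: (n_sources, T) source activity probabilities (SAPs).
    """
    D, T = Y.shape
    Z = Y / np.maximum(np.linalg.norm(Y, axis=0, keepdims=True), eps)  # unit norm
    pi = np.full(n_sources, 1.0 / n_sources)
    B = np.stack([np.eye(D, dtype=complex) for _ in range(n_sources)])
    for _ in range(iters):
        # E-step: posterior source activity per tf observation
        log_p = np.zeros((n_sources, T))
        quad = np.zeros((n_sources, T))
        for k in range(n_sources):
            Binv = np.linalg.inv(B[k] + eps * np.eye(D))
            quad[k] = np.maximum(
                np.real(np.einsum('dt,de,et->t', Z.conj(), Binv, Z)), eps)
            _, logdet = np.linalg.slogdet(B[k] + eps * np.eye(D))
            log_p[k] = np.log(pi[k] + eps) - logdet - D * np.log(quad[k])
        log_p -= log_p.max(axis=0, keepdims=True)
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=0, keepdims=True)
        # M-step: update mixture weights and spatial covariance shapes
        for k in range(n_sources):
            w = gamma[k] / quad[k]
            B[k] = D * np.einsum('t,dt,et->de', w, Z, Z.conj()) \
                   / max(gamma[k].sum(), eps)
            pi[k] = gamma[k].mean()
    return gamma
```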

Extraction of the individual source signals may be achieved
using the estimated SAP to derive for each speaker a mask, by
which all tf bins not dominated by this speaker are zeroed out.
An alternative to this is to use the SAP to compute beamform-
ers, one for each of the speakers, similar to what is …
