This article was first published in Software Developer's Journal 4/2006 and SDJ Extra 4/2006 Magazines by Software Wydawnictwo. Article reprinted online by original author in courtesy of Software Developer's Journal.
Available also in German as Audio-Zeit-und-Pitch-Skalierung.
by Olli Parviainen
Anyone who's used a nowadays obsolete tape recorder or a vinyl disc player is likely familiar with the effects of playing a recording back at a different speed than it was recorded at: playing the recording at double speed halves the playtime, at the same time causing the side-effect of the sound pitch jumping up by an octave, which amusingly makes human voices sound like cartoon figures. Similarly, playing a record back at a slower rate lengthens its duration and lowers the sound pitch by the same ratio.
In the old days of analog audio recording technology this modified playback speed/time/pitch effect was easy to produce by applying an incorrect playback speed setting. In digital signal processing the same effect can be achieved by interpolating between sound samples, which is referred to as resampling.
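As a rough illustration of the idea, the sketch below resamples a 16-bit mono buffer by linear interpolation between neighbouring samples. This is a minimal sketch only; the function name is illustrative, and real resamplers use band-limited interpolation with anti-alias filtering rather than this naive approach.

```c
#include <stddef.h>

/* Resample 'in' (in_len samples) by 'ratio' using linear interpolation.
 * ratio > 1.0 shortens the output (faster playback, higher pitch),
 * ratio < 1.0 lengthens it (slower playback, lower pitch).
 * Returns the number of output samples written; the caller must provide
 * an 'out' buffer of at least in_len / ratio samples. */
size_t resample_linear(const short *in, size_t in_len, short *out, double ratio)
{
    size_t o = 0;
    double pos = 0.0;
    while ((size_t)pos + 1 < in_len) {
        size_t i = (size_t)pos;
        double frac = pos - (double)i;   /* position between samples i, i+1 */
        out[o++] = (short)((1.0 - frac) * in[i] + frac * in[i + 1]);
        pos += ratio;
    }
    return o;
}
```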
While resampling affects both the playback time and pitch in the same ratio, it's occasionally desirable to control the sound duration or pitch independently of the other, modifying the pitch or time scale without the sometimes-undesirable cartoon-hero voice side-effect. Luckily, signal processing techniques exist for modifying the time and pitch scales independently of each other; these techniques are known as time/pitch scaling, time/pitch shifting or time stretching.
Time scaling that reduces the music playback speed or tempo eases practicing musical instruments and dancing. Slowing down recorded speech helps in transcribing recorded notes and in practicing spoken languages, while blind people with a developed hearing sense may prefer accelerated speech when listening to recorded audio books to save time. Further, video format conversions between different frame rates require fine-adjusting the sound duration: for example, converting the 24 frames/second cinema format into the 25 frames/second TV format reduces the movie playback time by 4%, implying a need for a corresponding sound track duration adjustment.
Similarly, altering the sound pitch or key to match the singer's voice eases karaoke singing and singing practice. When practicing an instrument along with music, adjusting the key of the background music can be easier than retuning the instrument for each song separately. Pitch correction is nowadays a standard method in record production for correcting imperfections in singing, handily compensating for possibly insufficient musical talent in the current celebrity idol industry.
Finally, people with questionable plots may wish to alter the pitch of their voice to conceal their true identity.
When considering general-purpose time and/or pitch scaling methods that work with any kind of sampled sound, be it speech, music or other sound, two fundamental approaches exist for such effects: doing the processing either in the time domain or in the frequency domain.
The time-domain methods, such as the SOLA algorithm introduced in this article, operate directly on the sampled data. The advantage of time-domain processing is that the implementation is rather straightforward, as the sound data is manipulated in the same sample format as it's recorded and played back in. The disadvantage is a tendency to produce reverberating artefacts that become more obvious with larger time modifications, i.e. when the scaled time differs from the original sound by roughly 15% or more.
The other approach is the frequency-domain methods, which convert the sampled sound into short-time frequency/amplitude components and do the scaling on the frequency information; the Phase Vocoder featured below is an example of such a method. The advantage of frequency-domain processing is that it enables more sophisticated sound modifications that better resemble the eventual human hearing experience, as human hearing is fundamentally based on frequency/amplitude sensing: we don't have magnetic voice coils with digital sampling circuitry in our ears; instead, the ear's hearing sense consists of a great number of tiny frequency sensors, each sensitive to a certain frequency alone, with our brains combining these together into a continuous frequency/amplitude hearing sensation - well, at least until our once-perfect ears become ruined by high-volume activities such as rock music, motor sports or fostering junior family members.
However, despite their apparent potency and elegance, frequency-domain methods are also vastly more complicated to program and compute than the time-domain methods, thus making them possibly impractical in applications constrained by computational resources such as CPU speed and memory.
SOLA, short for Synchronous-OverLap-Add, works by cutting the sound data into shortish sequences of tens to hundreds of milliseconds each, and then joining these sequences back together with a suitable amount of sound data either skipped or partially repeated in between, to achieve a shorter or longer playback time than originally. Algorithms based on the same basic approach include TDHS (Time-Domain Harmonic Scaling), WSOLA and PSOLA, which differ in the details of how the overlapping offset discussed below is sought.
To prevent too obvious discontinuities in the sound at the locations where two adjacent sequences are joined together, the sequences are partially overlapped so that the sound amplitude gradually shifts from one sequence to the other during the overlapping period - hence the "Overlap-Add" part of the name SOLA.
At its simplest, a SOLA implementation could use a common sequence length and pick the sequences at even intervals from the original sound data. Wish to make the sound duration 10% shorter? Use a processing sequence duration of e.g. 100 milliseconds (plus the overlapping duration), pick sequences from the original sound data at intervals of 110 milliseconds, join these sequences together by overlapping, and there you are. In a similar way, to achieve a 10% longer sound duration, choose the 100 millisecond sequences from the original sound at intervals of 90 milliseconds, and so forth.
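The interval arithmetic above boils down to a simple ratio between the effective sequence duration and the pick interval. A sketch, with illustrative function names not taken from the article's listing:

```c
/* Given the effective (non-overlapped) sequence duration and the desired time
 * scaling ratio (output duration / input duration), return the interval at
 * which sequences should be picked from the input sound. */
double sola_pick_interval_ms(double sequence_ms, double time_scale)
{
    return sequence_ms / time_scale;   /* e.g. scale 0.5 -> pick twice as far apart */
}

/* Inverse view: the time scaling ratio resulting from a given pick interval. */
double sola_time_scale(double sequence_ms, double interval_ms)
{
    return sequence_ms / interval_ms;
}
```

For instance, 100 ms sequences picked at 110 ms intervals give a ratio of 100/110 ≈ 0.91, i.e. roughly the 10% shortening described above.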
However, a practical SOLA implementation isn't quite that easy. Choosing the processing sequences without sufficient attention to their contents will result in sequences that don't fit together too well, consequently producing inferior sound quality with humming and clicking artefacts due to too large discontinuities in the sound, despite the gradual overlapping approach.
For satisfactory sound quality, a SOLA implementation needs to choose the processing sequences so that the waveforms of adjacent sequences to be joined together are as much alike as possible over the overlapping period.
In practice, the sound stream is processed one sequence at a time, always choosing the next processing sequence from a suitable window of possible offset candidates around the theoretical next sequence offset, so that the two sequences match together as well as possible. A good way of choosing the best-matching sequences is to calculate the cross-correlation function between the end of the previous sequence and the window of possible new sequence candidates; the offset at which the waveforms have the highest similarity also gives the highest cross-correlation value. The sequences are then joined together by overlapping to produce a new continuous sound stream with a different duration than the original sound.
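The offset search just described can be sketched as follows. This is a hedged illustration of the idea rather than the article's Listing 1; the function name and parameters are made up for the example.

```c
#include <stddef.h>

/* Find the offset in [0, seek_window) at which 'candidates' best matches
 * 'prev_tail' over 'overlap_len' samples, by maximizing the (unnormalized)
 * cross-correlation. 'candidates' must hold at least
 * seek_window + overlap_len samples. */
size_t best_overlap_offset(const short *prev_tail, const short *candidates,
                           size_t overlap_len, size_t seek_window)
{
    size_t best = 0;
    double best_corr = -1e30;
    for (size_t off = 0; off < seek_window; off++) {
        double corr = 0.0;
        for (size_t i = 0; i < overlap_len; i++)
            corr += (double)prev_tail[i] * (double)candidates[off + i];
        if (corr > best_corr) {   /* highest correlation = best waveform match */
            best_corr = corr;
            best = off;
        }
    }
    return best;
}
```

Normalization by the vector magnitudes is omitted here, since only the location of the maximum matters, not its absolute value.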
The overall SOLA algorithm is illustrated in the figure at the side. Notice that the time axis scales shown in the figure are arbitrary, the time units not necessarily corresponding to any real time units. During the algorithm execution, the original sound is cut into processing sequences of suitable duration, each new sequence being chosen at a suitable interval after the previous one so that the desired time scaling ratio is achieved.
In Figure 1, the first processing sequence begins at time offset 0, the sequence duration being 7 time units, including the two-unit-wide overlapping areas at the beginning and end of the sequence. Each further sequence is chosen at nominal intervals of 9 units, resulting in a time scaling ratio of (7-2) / 9 ≈ 0.555, meaning a ~44.5% reduction in duration compared to the original sound.
However, instead of taking the new sequence directly at the nominally calculated time interval, the actual beginning of a sequence is selected from a window around the nominal offset - in the case of the "New Sequence" in Figure 1, between offsets 8..10 - so that the new and previous sequences match together as well as possible once overlapped. The sequences are then overlap-added together to produce a continuous resulting sound. Notice the resulting time compression between the original and resulting sound waveform shapes shown in Figure 1.
The cross-correlation function works well for evaluating the similarity of the sequence waveforms and is easy to implement. Other similarity measures have been proposed though: one suggested approach is to match the sequence edges to the beats of the sound, which can reduce the reverberating artefact, assuming the beats are there and can be detected with good consistency. Another alternative is to evaluate frequency-domain spectral similarity instead of waveform similarity when finding the best-matching overlapping offset.
A SOLA implementation requires only basic add/multiply arithmetic operations, so the SOLA algorithm can be implemented with integer or fixed-point arithmetic if desired, in cases where floating-point support is not available or is undesirable to use for performance reasons.
If processing stereo sound, surround sound or other multi-channel sound, it's good to notice that the overlapping of the processing sequences has to be done at exactly the same offsets for all the channels to keep them in synchrony. While it might appear obvious to process N-channel audio simply as N separate mono channels, that's an infeasible approach, because each channel would then end up with slightly different overlapping offsets depending on its individual properties, and the channels would consequently lose the fine mutual synchronisation that is the very prerequisite for stereo and surround effects.
Instead, the overlapping offset selection has to be done as a compromise between all the sound channels at the same time. This can be achieved by summing (or averaging) all the channels together, finding the best common overlapping offset by cross-correlating on this combined data as described above, and then applying the same overlapping offset to each channel.
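For the stereo case, the channel-averaging step could be sketched as below (an illustrative helper, not from the article's listing); the resulting mono buffer is then fed to the cross-correlation search, and the offset it yields is used for both channels.

```c
#include <stddef.h>

/* Mix interleaved stereo samples (L R L R ...) down to mono by averaging,
 * so that a single common overlap offset can be sought for both channels.
 * Averaging in int avoids 16-bit overflow during the addition. */
void mix_stereo_to_mono(const short *interleaved, short *mono, size_t frames)
{
    for (size_t i = 0; i < frames; i++)
        mono[i] = (short)(((int)interleaved[2 * i] +
                           (int)interleaved[2 * i + 1]) / 2);
}
```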
SOLA as such produces a time scaling effect; however, combining the SOLA algorithm with resampling, which modifies both time and pitch scales in the same ratio, also allows producing a pitch shift effect at a little extra cost.
The pitch shifting effect implemented by combining SOLA + resampling works as follows:
First, use resampling to increase or decrease the sound pitch by the desired amount. Because resampling modifies both sound duration and pitch in the same ratio, the sound duration becomes different from the original in the process.
Then use the SOLA algorithm to adjust the duration modified by resampling back to the original duration. The result has the same duration as originally, but with a modified pitch.
Any combination of resampling and SOLA time scaling ratios can be applied to achieve arbitrary pitch scaling ratios, or even arbitrary simultaneous time and pitch scaling when desired.
For example, to increase the pitch by a ratio of two (i.e. an octave), first resample the sound duration down to half of the original, which at the same time doubles the pitch, then apply SOLA to expand the sound duration by the same factor of two, resulting in the original sound duration but with an octave higher pitch.
In a similar fashion, to achieve a lower sound pitch, first resample the sound for a longer duration and lower pitch, then apply SOLA with a corresponding time scaling ratio to reduce the duration back to the original, producing a sound with original duration but lower pitch.
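The ratio bookkeeping for the combined effect can be captured in a few lines. A sketch, with an illustrative function name:

```c
/* For a desired pitch ratio (2.0 = up an octave, 0.5 = down an octave) and
 * time ratio (1.0 = keep the original duration), compute the resampling ratio
 * and the SOLA time scaling ratio to apply in sequence. Resampling by
 * 'resample_ratio' divides the duration by that ratio while multiplying the
 * pitch by it; SOLA then scales the duration by 'sola_ratio' without
 * affecting the pitch. */
void pitch_time_ratios(double pitch_ratio, double time_ratio,
                       double *resample_ratio, double *sola_ratio)
{
    *resample_ratio = pitch_ratio;           /* pitch up => shorter duration */
    *sola_ratio = time_ratio * pitch_ratio;  /* stretch back to desired time */
}
```

For the octave-up example above, this yields a resampling ratio of 2.0 (halving the duration) and a SOLA ratio of 2.0 (doubling it back).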
The typical sound artefact produced by SOLA is an echoing or reverberating sound, which is insignificant at small time scale changes but becomes more obvious with larger-scale time slowdowns.
To reduce the reverberating artefact, the processing sequence duration is ideally chosen to be relatively close to a multiple of the fundamental period of the processed sound - for example, if there's a certain frequency component or a constant beat in the sound, it can be used as a reference for selecting the processing sequence duration.
The width of the offset seeking window also affects the sound quality. The wider the window, the higher the possibility of finding a good match for the overlapping sequences. However, if the window is set too wide, the result can sound unstable, as if it were drifting around.
For the overlapping duration it's usually sufficient that it's a fraction of the whole processing sequence length. A simple linear amplitude slide over the overlapping period usually works about as well as more sophisticated window functions.
The SOLA algorithm processing latency depends on the processing sequence duration, the width of overlap seeking window, and the width of the overlapping period, typically summing up to some 100 milliseconds. This processing latency isn't usually an issue when processing sound files offline, but if the application has strict requirements for sound synchronisation or if processing or playing back sound in real-time, then the latency may require attention.
An example of such a latency-critical application is a real-time application that records sound with a microphone, processes the recorded sound in real time with SOLA for a pitch shift effect, and then immediately plays back the processed sound through a loudspeaker. In such a case it's good to notice the additional approx. 100 millisecond delay that SOLA processing introduces on top of the basic latencies of the audio sampling and playback operations.
Listing 1 shows a simple example implementation of a SOLA routine. The function sola() processes the sampled sound data given in the array 'input' and stores the resulting time-scaled sound into the 'output' array, assuming that both arrays are sufficiently large for the desired processing. The algorithm assumes that the sound data is mono sound sampled at a 44100 Hz sampling rate with a 16-bit integer sample format.
The time scaling ratio is given in #define TIME_SCALE, the default value of 0.85 corresponding to (1.0 - 0.85) * 100% = 15% longer duration than the original. The user can modify this value to achieve other time scaling ratios.
The subroutine overlap() overlaps the two given input sequences by linearly sliding the amplitude weight from one sequence to the other over the overlapping period.
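A linear crossfade of the kind overlap() performs could look like the sketch below. Note that this is an illustrative reconstruction of the described behaviour, not the original Listing 1 code; the function name and signature are made up for the example.

```c
#include <stddef.h>

/* Crossfade from 'prev' to 'next' over 'len' samples: the amplitude weight
 * of 'prev' slides linearly from 1 toward 0 while the weight of 'next'
 * slides from 0 toward 1, hiding the seam between the two sequences. */
void overlap_add(const short *prev, const short *next, short *out, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        double w = (double)i / (double)len;   /* 0.0 ... just below 1.0 */
        out[i] = (short)((1.0 - w) * prev[i] + w * next[i]);
    }
}
```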
The subroutine seek_best_overlap() seeks the optimal overlapping offset by cross-correlating the sound sequences 'input_prev' and 'input_new' over their overlapping periods, testing various offsets for 'input_prev' within the range [0..SEEK_WINDOW]. A precalculation step that multiplies the 'input_prev' overlapping region by the overlapping sliding coefficients is done prior to the actual cross-correlation to remarkably speed up the cross-correlation calculation. Notice that scaling of the cross-correlation result by the vector lengths is omitted, because in this case it's sufficient to find just the largest correlation value without caring about the absolute result magnitude. The cross-correlation calculation uses floating-point arithmetic for the sake of simplicity; while it could also be done with pure integer or fixed-point arithmetic, additional scaling of the sub-results would then be necessary to ensure that integer overflows can't occur.
Refer to the figure at the side for an overview of the Phase Vocoder algorithm. Time/pitch scaling using the Phase Vocoder consists of three primary processing stages:
Analysis - converts the sound into a frequency/amplitude presentation.
Effect - applies effects to the frequency/amplitude data.
Synthesis - converts the sound information back from the frequency/amplitude format to the sampled format.
In practice, the Phase Vocoder analysis and synthesis stages are implemented using the Fast Fourier Transform (FFT), a well-established and efficient algorithm for conversion between the time and frequency domains. Sound data is processed in sequences, by taking a suitable number of samples from the original sound stream, applying a window function and then applying the FFT to this data. See stages 1 and 2 in Figure 2.
However, the FFT transforms time-domain sample data into complex-valued phase+amplitude components in equally spaced frequency bins. Alas, these equally spaced frequency bins with complex-valued contents aren't a suitable format for modifying the time/pitch spectrum characteristics; an additional analysis step called phase unwrapping is required to further convert the complex-valued phase+amplitude information at pre-defined frequencies into exact frequency+amplitude components.
For example, when applying an FFT of size 1024 to sound with a sample rate of 44100 Hz, the FFT derives complex-valued phase/amplitude components for evenly distributed frequency bins spaced 44100/1024 ≈ 43.1 Hz apart. This is the primary frequency resolution of the FFT'd data, meaning that if there were a frequency component of, say, 30 Hz present in the audio data, it wouldn't align exactly with either the 0 Hz bin or the neighbouring 43.1 Hz bin, but the 30 Hz signal power would smear between these two bins.
The phase unwrapping process derives the exact frequency components by matching the phase angles of each frequency bin against the corresponding bin's phase angles in the previously processed FFT window, and then using these phase angle differences to calculate exact-frequency adjustments to the FFT's fixed bin frequencies. See stage 3 in Figure 2.
The synthesis stage is basically the inverse of the analysis stage. During synthesis, a similar but reverse operation to phase unwrapping is carried out to first convert the exact frequency/amplitude information back into the bin-based phase/amplitude format, again matching each frequency bin's phase angles so that they align well with the phase angles of the earlier processed sound batch; the result is then inverse-FFT'd and combined with the earlier processed sound stream.
The Fourier Transform is an algorithm that converts an array of N time-domain samples into the frequency domain, resulting in signal amplitude and phase information for N/2 equally spaced frequency bins of width [sampling frequency] / N Hz each. The Fourier Transform is commonly used in applications relying on frequency-domain processing or operating on the signal power spectrum. While the computational complexity of the plain Fourier Transform is O(N^2) (that is, computation time grows with the square of N), meaning it quickly gets heavy with bigger array sizes N, a faster O(N·log2 N) variation of the transform exists, the Fast Fourier Transform (FFT), which radically reduces the amount of calculation in the special case that the data size N is an integer power of 2, i.e. N = 2, 4, 8, 16... etc. Practical applications almost always use the FFT variation thanks to its computational advantage. The FFT also has an inverse operation, the inverse FFT or IFFT, that converts the frequency-domain information back to the time-domain sample format.
Pitch shifting can be achieved simply by stretching the exact frequency/amplitude information by a suitable ratio between the analysis and synthesis stages. Interpolation can be used for the stretching, but even stretching by taking the strongest of the nearest signal values as such produces a passable result. See stage 4 in Figure 2.
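The simpler "strongest of the nearest values" stretching could be sketched as below for the amplitude data. This is an illustrative reading of the approach, with a made-up function name; a real vocoder scales the exact frequency values by the same ratio as well.

```c
#include <stddef.h>

/* Stretch an amplitude spectrum of 'n' bins by 'ratio' (2.0 moves spectral
 * content up an octave, 0.5 moves it down), taking the stronger of the two
 * nearest source bins instead of interpolating. Output bins that map outside
 * the source range are zeroed. */
void stretch_spectrum(const double *in, double *out, size_t n, double ratio)
{
    for (size_t k = 0; k < n; k++) {
        double src = (double)k / ratio;   /* source position for output bin k */
        size_t lo = (size_t)src;
        size_t hi = lo + 1;
        if (hi >= n) {
            out[k] = (lo < n) ? in[lo] : 0.0;
            continue;
        }
        out[k] = (in[lo] > in[hi]) ? in[lo] : in[hi];
    }
}
```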
Time scaling can be achieved by modifying the synthesis stage so that the synthesized processing batches are applied back to the resulting sound with a different overlap stepping than was used during the analysis stage. This allows synthesising a longer or shorter sound duration with similar frequency characteristics as originally.
To achieve high sound quality, a large FFT size such as 4096 or 8192 is required. For the phase unwrapping to work well, the FFT windows also need to overlap each other by a suitably high ratio, in practice meaning at least a four-fold overlap. If time scaling to expand the time duration during the synthesis stage is desired, an even higher overlapping ratio may be required, because time scaling during synthesis expands the overlap stepping and thus effectively reduces the output overlapping ratio.
It's obvious that the Phase Vocoder is computationally heavy. Besides the large FFT size and the several-fold overlap, the phase unwrapping done during the analysis stage involves calculating the phase angle and amplitude as the arctangent and absolute value of complex-valued numbers, as well as wrapping the accumulated phase angles into the range [-pi..pi]. The synthesis stage similarly involves calculating trigonometric sin/cos functions for converting phase angle/amplitude values back into complex notation. These operations practically enforce using floating-point arithmetic, making an integer-only implementation infeasible.
If time scaling alone, without simultaneous pitch shifting, is desired, then it's possible to partially combine the analysis and synthesis stages, omit some steps of the phase unwrapping process, and do the analysis-synthesis phase angle matching directly on the complex-valued phase/amplitude/frequency values produced by the FFT, by rotating the complex-valued phase angles with suitable complex-valued multiplications. This eliminates the need for the trigonometric functions otherwise required in the analysis and synthesis stages, and thus provides a great improvement in calculation efficiency while also allowing further routine optimizations.
If processing stereo or other multi-channel sound, pay attention to maintaining the phase synchronisation between the channels to avoid ruining the stereo or surround effect. In practice, all the channels need to be processed together so that the phase components of the same frequencies get advanced by the same amount in all the channels. For example, during the phase modification steps of the analysis and synthesis stages, the channel having the strongest amplitude can be chosen for each frequency component, and the phase change of this single channel then applied to all the other sound channels simultaneously.
The processing latency of the Phase Vocoder algorithm - again an important factor in real-time processing and in applications with strict sound synchronisation requirements - depends on the FFT size and the overlapping factor. Using an FFT size of 4096 and four-fold overlapping with sound data sampled at 44100 Hz, the processing latency is around 116 milliseconds.
An advantage of time/pitch scaling based on the Phase Vocoder is that it doesn't produce the same reverberating artefact as SOLA. However, the Phase Vocoder isn't free of artefacts either: it can result in major phase distortions causing a dull sound, which is especially notable in sharp sounds produced by instruments with a quick attack time, such as percussion instruments, piano etc. This is a major problem when clear and crisp, high-quality sound is desired for music involving such instruments.
These phase distortions are due to the Phase Vocoder aligning the phases of each frequency component so that they match well with the phases at the same frequency in the previously processed slot. Because the signal phase advance over time is directly proportional to the frequency, altering the pitch or time scale results in different phase modification ratios at different frequencies. In practice, running time or pitch scaling with the Phase Vocoder for longer than a couple of seconds results in practically randomized signal phases.
The phase coherence problem is a difficult one to solve. One approach is to divide the frequency scale into groups of several neighbouring frequencies, and lock the phases within each group so that phase coherence is maintained at least locally. Although this helps, even this approach requires occasionally resetting all the group phases to prevent the frequency groups from drifting too far from the original situation, and such phase resets again produce audible artefacts.
Another distortion caused by the Phase Vocoder is an uneven frequency response, in which narrow frequency bands over the spectrum range become highly attenuated, and the higher end of the frequency range gets attenuated as well. These frequency response distortions may or may not be a problem, depending on what kind of sound is processed.
The SOLA approach allows a relatively simple and efficient time/pitch scaling implementation that works well for small-scale time/pitch changes and can be applied to all kinds of sound, from human speech to fully produced music. The sound quality of SOLA becomes disadvantageous with larger-scale tempo/pitch modifications, however, with scaling ratios exceeding a couple of tens of percent introducing a reverberating artefact into the sound.
Processing CD-quality stereo sound in real time with SOLA requires around a 100 MHz processor with a standard C language implementation, meaning that real-time SOLA processing is well feasible on modern PCs and even on portable devices such as PDAs and smart phones. The cross-correlation algorithm used in SOLA is also well suited for CPU-specific SIMD optimizations, allowing a reduction of CPU usage by a factor of two to three.
SOLA can be implemented using integer or fixed-point arithmetic if necessary, making it feasible also for processors without built-in floating-point support.
Compared to the SOLA algorithm, the greatest advantage of using the Phase Vocoder for time/pitch scaling is that it doesn't produce a similar reverberating artefact as SOLA, even with large-scale time/pitch modifications. However, the Phase Vocoder produces other artefacts, most notably the dull sound caused by loss of phase synchronisation, and an uneven frequency response.
The Phase Vocoder is best suited for processing independent sound channel information with a single instrument or a few instruments, human speech or singing, but may not be at its best for post-processing high-quality music that's already been produced and mixed into its final recording format.
The Phase Vocoder is computationally heavy, involving complex-number calculations and thus practically enforcing the use of floating-point arithmetic. A high-quality Phase Vocoder implementation requires approximately an order of magnitude more computational power than the SOLA algorithm. When applying only the time scaling effect, which allows omitting some of the most time-consuming stages, and using hand-tuned processor-level SIMD optimizations, processing CD-quality stereo sound in real time requires around a 500 MHz Pentium III level processor.
SIMD, short for Single Instruction, Multiple Data, means processor-level instructions that perform the same operation on multiple data items with a single instruction. Modern processors support SIMD instruction extensions as a means of optimizing the performance of computationally intensive routines. Examples of SIMD instruction sets are the MMX and SSE extensions of the Intel x86 processors used in PCs and new Mac computers, and AltiVec of the PowerPC used in older Mac computers. An example of a SIMD instruction is multiplying four parallel numbers by four other numbers, producing four parallel multiplication results, thus offering theoretically four times higher performance than using four consecutive multiplication instructions. However, due to the overhead typically required for arranging data into suitable batches for SIMD, quite that high a speed-up isn't practically achievable, yet a speed-up by a factor of two to three is often possible. SIMD optimizations are suitable for routines doing multiple calculation operations at the same time, such as vector and matrix calculations. However, finding parallel operations in routines written in a usual programming language isn't trivial, and still today compilers can't produce good automatic SIMD code; the programmer has to help the compiler by hand-writing the routines using SIMD-compatible syntax extensions.
For an efficient, open-source implementation of SOLA-based time/pitch scaling routines, see SoundTouch library at http://www.surina.net/soundtouch.
For further information on time/pitch scaling based on the Phase Vocoder, the paper "Improved Phase Vocoder Time-Scale Modification of Audio" (Laroche, Dolson, IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 3, May 1999) presents the Phase Vocoder concept with further references and a good discussion of the phase coherence issue.
Copyright © Olli Parviainen 2006