PVOC-EX: File format for Phase Vocoder data, based on WAVE_FORMAT_EXTENSIBLE
Preliminary specification.
Rationale.
The PVOC-EX file format seeks to provide a cross-platform and robust
format for standard
fixed-overlap phase vocoder analysis files. Many implementations of
the phase vocoder exist, notably in Csound,
the CDP system
(based on the CARL implementation (Moore/Dolson)), Soundhack (Tom
Erbe), and the Princeton-hosted
PVC package (Paul Koonce). The differences between these formats are
minor: they consist of different headers and varying scale factors for
amplitude. More importantly, the Csound format is not fully defined,
and uses the byte order of the
host platform. The Soundhack format is based closely on the Csound
format, and similarly does not define a byte order, though because it is hosted
on the Macintosh platform, it will invariably be written in big-endian
format.
Uniquely, the PVC implementation supports multi-channel sources (stereo and beyond).
While it would be a simple matter to converge the header elements of
these existing formats, and define a byte order,
I felt that Microsoft's introduction of WAVE_FORMAT_EXTENSIBLE
(WAVE_EX), which by definition supports custom extensions, offered a good
opportunity to define a format based on an existing standard. One reason
for choosing this route is that it enables rendering information to be
fully incorporated into the format, by inheriting the WAVEFORMATEX component
of WAVE_EX. With the power of modern PCs, it is not only possible,
but easy, to stream more than one channel of analysis data in real time.
The proposed format is intended to support use of analysis data in a real-time
streaming environment.
For big-endian platforms, the SDIF
initiative based at CNMAT is likely to prove of lasting significance, though
a format for phase vocoder data has not yet been defined. SDIF would appear
to offer the natural format for big-endian platforms. The SDIF format is
extremely flexible, supporting frame-based data with arbitrary time-stamps,
and multiple types of data within a single file. A likely problem with
this flexibility is that in many cases file conversion can be only one-way: while
a simple phase vocoder format can be converted into an SDIF file,
conversion cannot be guaranteed in the other direction. Because, as outlined above,
many programs already share all the important aspects of a single
format, PVOC-EX is designed to ensure two-way conversion for at least CDP,
Csound, Soundhack and PVC. The one immediate exception to this is that
PVOC-EX supports multi-channel data, which as noted above is currently
only true of PVC, among the applications identified. However, it is my
hope that the format, once consolidated, will be able to be incorporated
into all these applications. Since SDIF is designed to support advanced
and research-oriented applications, I feel that PVOC-EX is best kept to
a minimum specification compatible with effective use.
The Format.
This document presumes a basic knowledge of WAVE_EX.
To extend WAVE_EX, a unique identifier, or GUID, is required. Applications which do not recognise, or cannot handle, files with this GUID will reject the file. The GUID defined for PVOC-EX is:
{8312B9C2-2E6E-11d4-A824-DE5B96C3AB21}
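As an illustration only, the GUID above can be checked against the 16 bytes stored in a file. The sketch assumes the usual Windows GUID layout on disk (first three fields little-endian, final eight bytes in order); the type and function names are hypothetical and are not taken from pvfileio.h.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the Windows GUID layout (not from pvfileio.h). */
typedef struct {
    uint32_t data1;
    uint16_t data2;
    uint16_t data3;
    uint8_t  data4[8];
} pvx_guid;

/* {8312B9C2-2E6E-11d4-A824-DE5B96C3AB21}, as given in this specification. */
static const pvx_guid PVX_GUID = {
    0x8312B9C2u, 0x2E6Eu, 0x11D4u,
    { 0xA8, 0x24, 0xDE, 0x5B, 0x96, 0xC3, 0xAB, 0x21 }
};

/* Serialise a GUID into the 16 bytes found in a little-endian file,
   independent of the byte order of the host. */
static void pvx_guid_to_bytes(const pvx_guid *g, uint8_t out[16])
{
    assert(g != NULL && out != NULL);
    out[0] = (uint8_t)(g->data1 & 0xFFu);
    out[1] = (uint8_t)((g->data1 >> 8) & 0xFFu);
    out[2] = (uint8_t)((g->data1 >> 16) & 0xFFu);
    out[3] = (uint8_t)((g->data1 >> 24) & 0xFFu);
    out[4] = (uint8_t)(g->data2 & 0xFFu);
    out[5] = (uint8_t)((g->data2 >> 8) & 0xFFu);
    out[6] = (uint8_t)(g->data3 & 0xFFu);
    out[7] = (uint8_t)((g->data3 >> 8) & 0xFFu);
    memcpy(out + 8, g->data4, 8);
}

/* Return 1 if the 16 bytes read from a file match the PVOC-EX GUID. */
static int pvx_is_pvocex_guid(const uint8_t bytes[16])
{
    uint8_t expect[16];
    pvx_guid_to_bytes(&PVX_GUID, expect);
    return memcmp(bytes, expect, 16) == 0;
}
```

A reader would pass the SubFormat bytes of the format chunk to pvx_is_pvocex_guid() and reject the file if the result is 0.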
The complete format chunk for PVOC-EX is:
typedef struct {
    WAVEFORMATEXTENSIBLE wxFormat;
    DWORD dwVersion;     /* initial version is 1 */
    DWORD dwDataSize;    /* sizeof PVOCDATA data block */
    PVOCDATA data;       /* 32 byte block */
} WAVEFORMATPVOCEX;
The total size of WAVEFORMATPVOCEX is 80 bytes, thus respecting the requirements of WAVE_EX that the format chunk support alignment to 8-byte boundaries.
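The 80-byte figure can be checked with a short sketch. The structures below are stand-ins declared with fixed-width types so that the example compiles without the Windows headers; the pvx_ names are hypothetical, the Samples union of WAVEFORMATEXTENSIBLE is reduced to a single 16-bit field, and 1-byte packing and a 32-bit float are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint16_t wFormatTag;
    uint16_t nChannels;
    uint32_t nSamplesPerSec;
    uint32_t nAvgBytesPerSec;
    uint16_t nBlockAlign;
    uint16_t wBitsPerSample;
    uint16_t cbSize;
} pvx_waveformatex;                 /* 18 bytes */

typedef struct {
    pvx_waveformatex Format;
    uint16_t wValidBitsPerSample;   /* stands in for the Samples union */
    uint32_t dwChannelMask;
    uint8_t  SubFormat[16];         /* GUID */
} pvx_waveformatextensible;         /* 40 bytes */

typedef struct {
    uint16_t wWordFormat;
    uint16_t wAnalFormat;
    uint16_t wSourceFormat;
    uint16_t wWindowType;
    uint32_t nAnalysisBins;
    uint32_t dwWinlen;
    uint32_t dwOverlap;
    uint32_t dwFrameAlign;
    float    fAnalysisRate;
    float    fWindowParam;
} pvx_pvocdata;                     /* 32 bytes */

typedef struct {
    pvx_waveformatextensible wxFormat;
    uint32_t dwVersion;
    uint32_t dwDataSize;
    pvx_pvocdata data;
} pvx_waveformatpvocex;
#pragma pack(pop)

/* Return the size of the whole format chunk after checking its parts. */
static size_t pvx_format_chunk_size(void)
{
    assert(sizeof(pvx_waveformatextensible) == 40);
    assert(sizeof(pvx_pvocdata) == 32);
    return sizeof(pvx_waveformatpvocex);
}
```

The total is 40 + 4 + 4 + 32 = 80 bytes, which is a multiple of 8.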
wxFormat:
contains the information required to synthesize
the file as originally analysed. The full scope of WAVE_EX is available,
including the definition of speaker positions. This information can be ignored
by a renderer, though certain fields are important:
wxFormat.Format.nChannels
Number of channels in the file (mono,
stereo, etc)
wxFormat.Format.nSamplesPerSec
Sample Rate of the source. Informs applications
of the Nyquist frequency for the analysis data.
In circumstances where the analysis data has been synthesized directly,
these are the two essential pieces of information. It is then a matter
of choice what output sample format is specified, though for synthetic
data, use of the floating-point format is recommended. The data for
the full WAVEFORMATEX block should be set correctly as for any WAVE file.
All information specific to the phase vocoder is contained within the PVOCDATA block. This is defined by the structure:
typedef struct pvoc_data {
    WORD  wWordFormat;    /* IEEE_FLOAT or IEEE_DOUBLE */
    WORD  wAnalFormat;    /* PVOC_AMP_FREQ or PVOC_AMP_PHASE */
    WORD  wSourceFormat;  /* WAVE_FORMAT_PCM or WAVE_FORMAT_IEEE_FLOAT */
    WORD  wWindowType;    /* defines the standard analysis window used, or a custom window */
    DWORD nAnalysisBins;  /* number of analysis channels */
    DWORD dwWinlen;       /* analysis window length, in samples */
    DWORD dwOverlap;      /* window overlap length in samples (decimation) */
    DWORD dwFrameAlign;   /* usually nAnalysisBins * 2 * sizeof(float) */
    float fAnalysisRate;  /* sample rate / Overlap */
    float fWindowParam;   /* parameter associated with some window types: default 0.0f unless needed */
} PVOCDATA;
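Two of these fields are derived from the others, as the comments indicate. The following sketch shows the arithmetic for the usual case of 32-bit float data; the helper names are hypothetical and the values in the usage note are examples only.

```c
#include <assert.h>
#include <stdint.h>

/* dwFrameAlign for the usual case: one pair of 32-bit floats
   (amplitude plus frequency or phase) per analysis bin. */
static uint32_t pvx_frame_align(uint32_t nAnalysisBins)
{
    return nAnalysisBins * 2u * (uint32_t)sizeof(float);
}

/* fAnalysisRate: analysis frames per second for one channel,
   i.e. the source sample rate divided by dwOverlap. */
static float pvx_analysis_rate(uint32_t nSamplesPerSec, uint32_t dwOverlap)
{
    assert(dwOverlap > 0);
    return (float)nSamplesPerSec / (float)dwOverlap;
}
```

For example, with 513 analysis bins the frame size is 513 * 2 * 4 = 4104 bytes, and a 44100 Hz source analysed with an overlap of 128 samples gives an analysis rate of 344.53125 frames per second.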
Notes on some PVOCDATA fields.
wWordFormat:
I expect that IEEE_FLOAT will be used almost always.
I recognize that some advanced applications may wish to be able to use
doubles; the issue is that more than one floating-point format exists for doubles,
and it will be important to eliminate all possibility of ambiguity here.
wAnalFormat:
Csound, CDP/CARL, and PVC all write analysis channels
as amplitude and frequency. Soundhack writes amplitude and
phase (a format listed within Csound but not implemented). Other representations
are possible, but I feel that specifying too many alternative formats adds
complexity to a receiving application. Conversion between the two
formats is easy, though of course at some computational cost.
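The conversion from amplitude and phase to amplitude and frequency can be sketched with the textbook phase-difference estimate: the frequency of a bin is its centre frequency plus the deviation implied by the change in phase between successive frames. This is not taken from any of the packages named above; the function name is hypothetical, and phases are assumed to be in radians. Amplitudes are unchanged by the conversion.

```c
#include <assert.h>

#define PVX_PI 3.14159265358979323846

/* Convert one frame of per-bin phases to per-bin frequencies in Hz.
   phase[k]      : phase of bin k in the current frame, in radians
   last_phase[k] : phase of bin k in the previous frame; updated in place
   freq[k]       : output frequency of bin k
   fft_size, hop : FFT size and decimation (dwOverlap), in samples
   srate         : source sample rate in Hz */
static void pvx_phase_to_freq(const float *phase, float *last_phase,
                              float *freq, int bins, int fft_size,
                              int hop, double srate)
{
    int k;
    assert(bins > 0 && fft_size > 0 && hop > 0);
    for (k = 0; k < bins; k++) {
        /* Phase advance expected over one hop for a partial exactly at
           the bin centre, reduced modulo 2*pi using integer arithmetic. */
        long turns = (long)k * hop % fft_size;
        double expected = 2.0 * PVX_PI * (double)turns / (double)fft_size;
        double delta = (double)phase[k] - (double)last_phase[k] - expected;
        /* Wrap the deviation into the range [-pi, pi]. */
        while (delta > PVX_PI)  delta -= 2.0 * PVX_PI;
        while (delta < -PVX_PI) delta += 2.0 * PVX_PI;
        /* Bin centre frequency plus the deviation expressed in Hz. */
        freq[k] = (float)((double)k * srate / (double)fft_size
                          + delta * srate / (2.0 * PVX_PI * (double)hop));
        last_phase[k] = phase[k];
    }
}
```

The caller keeps last_phase between frames, one array per channel, initialised to zero before the first frame. The reverse conversion accumulates the same quantities in the opposite direction.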
wSourceFormat:
This is required to disambiguate a 32-bit
source sample size as defined in WAVEFORMATEX. Since wFormatTag is WAVE_FORMAT_EXTENSIBLE,
and a custom GUID is used, the distinction between integer and floating-point
samples would otherwise be lost.
wWindowType:
One of the more debatable aspects of the specification.
It is possible to identify a large number of analysis windows. However,
in current phase vocoder implementations, one of a small set of standard
windows is used. The following window types have been defined for PVOC-EX
so far:
PVOC_HAMMING
PVOC_HANNING
PVOC_KAISER
PVOC_RECT
PVOC_CUSTOM
The Kaiser window has an associated parameter, 'beta', which can be given in the fWindowParam field. If this is zero, the default value of 6.8 will be assumed.
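A reader might apply the default for the Kaiser parameter as follows. This is a minimal sketch with hypothetical names; only the default value of 6.8 comes from this specification.

```c
#include <assert.h>

#define PVX_KAISER_DEFAULT_BETA 6.8f

/* Return the Kaiser beta to use for a given fWindowParam:
   zero means "use the default", any other value is used as given. */
static float pvx_kaiser_beta(float fWindowParam)
{
    assert(fWindowParam == fWindowParam);   /* reject NaN */
    return (fWindowParam == 0.0f) ? PVX_KAISER_DEFAULT_BETA : fWindowParam;
}
```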
The provision of PVOC_CUSTOM is possibly contentious. If this is specified, the format chunk must be followed, before the 'data' chunk, by a special chunk containing the window data, of length dwWinlen. The samples must be of the same type as the analysis data itself, as given by wWordFormat. The data must be normalised so that the peak sample (centre of the window) is 1.0.
nAnalysisBins:
Number of analysis channels. This is derived directly
from the fft size used in the analysis:
nAnalysisBins = (fft_size / 2) + 1.
Note that the format supports the use of window sizes, given by dwWinlen,
greater than the fft size.
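The relation between the FFT size and nAnalysisBins, and the resulting bin centre frequencies, can be sketched as below. The helper names are hypothetical, and an even FFT size is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* nAnalysisBins for a given (even) FFT size. */
static uint32_t pvx_bins_from_fft(uint32_t fft_size)
{
    assert(fft_size >= 2 && fft_size % 2 == 0);
    return fft_size / 2u + 1u;
}

/* Recover the FFT size from nAnalysisBins. */
static uint32_t pvx_fft_from_bins(uint32_t nAnalysisBins)
{
    assert(nAnalysisBins >= 2);
    return (nAnalysisBins - 1u) * 2u;
}

/* Centre frequency in Hz of bin k; the last bin lies at the Nyquist
   frequency, which is half of nSamplesPerSec. */
static double pvx_bin_centre_hz(uint32_t k, uint32_t nAnalysisBins,
                                uint32_t nSamplesPerSec)
{
    assert(k < nAnalysisBins);
    return (double)k * (double)nSamplesPerSec
           / (double)pvx_fft_from_bins(nAnalysisBins);
}
```

For an FFT size of 1024 this gives 513 bins; with a 44100 Hz source, bin 512 is centred on 22050 Hz.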
Custom Window Chunk.
This is very simple:
<PVXW>
<chunk-size in bytes, excluding tag and size field>
<window data, dwWinlen samples>
This may well not be adequate. Possible additions include a 4-byte ident, and a float field specifying amplitude. Note that the WAVE_EX spec encourages all chunks to support 8-byte alignment.
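Preparing a custom window chunk as currently defined might look like the sketch below. It assumes that wWordFormat is IEEE_FLOAT (32-bit samples), that the peak of the window is its largest absolute sample, and that the chunk size is stored little-endian as in any RIFF chunk. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scale the window in place so that its largest absolute sample is 1.0.
   Returns the peak found before scaling. */
static float pvx_normalise_window(float *win, uint32_t winlen)
{
    float peak = 0.0f;
    uint32_t i;
    assert(win != NULL && winlen > 0);
    for (i = 0; i < winlen; i++) {
        float a = (win[i] < 0.0f) ? -win[i] : win[i];
        if (a > peak)
            peak = a;
    }
    if (peak > 0.0f) {
        for (i = 0; i < winlen; i++)
            win[i] /= peak;
    }
    return peak;
}

/* Fill in the 8-byte chunk header: the 'PVXW' tag followed by the
   little-endian size of the window data, excluding tag and size field.
   Returns that size in bytes. */
static uint32_t pvx_window_chunk_header(uint8_t out[8], uint32_t dwWinlen)
{
    uint32_t size = dwWinlen * (uint32_t)sizeof(float);
    assert(out != NULL);
    memcpy(out, "PVXW", 4);
    out[4] = (uint8_t)(size & 0xFFu);
    out[5] = (uint8_t)((size >> 8) & 0xFFu);
    out[6] = (uint8_t)((size >> 16) & 0xFFu);
    out[7] = (uint8_t)((size >> 24) & 0xFFu);
    return size;
}
```

A writer would normalise the window, write the header, and then write the dwWinlen samples in the file's byte order.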
No other chunks, apart from the data chunk itself, are required for
PVOC-EX. Where the renderer specifies floating-point samples, the PEAK
chunk can be used in the usual way. This will be especially relevant where
a custom window is used, as amplitude levels cannot be presumed.
The 'data' chunk.
Analysis frames are interleaved according to nChannels, i.e.:
for a stereo file:
<frame 0 Ch 0>
<frame 0 Ch 1>
<frame 1 Ch 0>
<frame 1 Ch 1>
etc...
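The interleaving above implies a simple rule for locating a frame within the 'data' chunk. The sketch assumes that dwFrameAlign is the size in bytes of one analysis frame for a single channel; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset, from the start of the 'data' chunk contents, of the
   analysis frame with the given frame index and channel. */
static uint64_t pvx_frame_offset(uint32_t frame, uint16_t channel,
                                 uint16_t nChannels, uint32_t dwFrameAlign)
{
    assert(nChannels > 0 && channel < nChannels);
    return ((uint64_t)frame * nChannels + channel) * dwFrameAlign;
}
```

For a stereo file with a frame size of 4104 bytes, frame 1 of channel 0 starts at byte 8208 and frame 1 of channel 1 at byte 12312.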
Frame amplitudes are expected to be normalized close to 1.0. Thus, where the source is a full-amplitude sinewave, the peak amplitude in the nearest bin will be close to 1.0. Later versions of this document will develop this aspect further. Suffice it to say here that both the CARL and Soundhack formats provide analysis windows in this form, while Csound and PVC require scale factors. The example implementation accompanying this release is based on the CARL distribution, and a further program demonstrates conversion from the current Csound format to the PVOC-EX format.
Example Implementation (command-line programs):
All pvocex code is confined to two files: pvfileio.c and pvfileio.h.
Source for pvocex analysis-resynthesis (accepts WAVE, AIFF and AIFF-C files):
pvocex_src01.zip (49KB). Includes project files for VC++5.0.
Source for pvconv application (converts Csound and Soundhack analysis files to PVOC-EX):
pvconv_src01.zip (5KB). NB: requires pvocex_src01.zip.
Executables for Win32 Pentium systems (pvocex and pvconv):
pvocex_bin01.zip (82KB)
Richard Dobson 25 May 2000