PVOC-EX
File format for Phase Vocoder data, based on
WAVE_FORMAT_EXTENSIBLE.


Preliminary specification.
 

Rationale.

The PVOC-EX file format seeks to provide a cross-platform and robust format for standard
fixed-overlap phase vocoder analysis files. Many implementations of the phase vocoder exist, notably in Csound,
the CDP system (based on the CARL implementation (Moore/Dolson)), Soundhack (Tom Erbe), and the Princeton-hosted
PVC package (Paul Koonce). The differences between these formats are minor, consisting of different headers and varying scale factors for amplitude. More importantly, the Csound format is not fully defined, and uses the byte order of the
host platform. The Soundhack format is based closely on the Csound format and similarly does not define byte order, though because it is hosted on the Macintosh platform, it will invariably be written in big-endian format.

Uniquely, the PVC implementation supports multi-channel sources (stereo and beyond).

While it would be a simple matter to converge the header elements of these existing formats and define a byte order,
I felt that the introduction by Microsoft of WAVE_FORMAT_EXTENSIBLE (WAVE_EX), which by definition supports custom extensions, offered a good opportunity to define a format based on an existing standard. One reason for choosing this route is that it enables rendering information to be fully incorporated into the format, by inheriting the WAVEFORMATEX component of WAVE_EX. With the power of modern PCs, it is not only possible, but easy, to stream more than one channel of analysis data in real time. The proposed format is intended to support use of analysis data in a real-time streaming environment.

For big-endian platforms, the SDIF initiative based at CNMAT is likely to prove of lasting significance, though a format for phase vocoder data has not yet been defined. SDIF would appear to offer the natural format for those platforms. The SDIF format is extremely flexible, supporting frame-based data with arbitrary time-stamps, and multiple types of data within a single file. A likely drawback of this flexibility is that file conversion can often be only one-way: a simple phase vocoder format can be converted into an SDIF file, but conversion in the other direction cannot be guaranteed. Because, as outlined above, many programs already share all the important aspects of a single format, PVOC-EX is designed to ensure two-way conversion for at least CDP, Csound, Soundhack and PVC. The one immediate exception is multi-channel data, which PVOC-EX supports but which, as noted above, only PVC currently handles among the applications identified. However, it is my hope that the format, once consolidated, can be incorporated into all these applications. Since SDIF is designed to support advanced and research-oriented applications, I feel that PVOC-EX is best kept to a minimum specification compatible with effective use.
 
 

The Format.

This document presumes a basic knowledge of WAVE_EX.

To extend WAVE_EX, a unique identifier, or GUID, is required. Applications which do not recognise, or cannot handle, files with this GUID will reject the file. The GUID defined for PVOC-EX is:

{8312B9C2-2E6E-11d4-A824-DE5B96C3AB21}
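As an illustration of how a reader might recognise this GUID, the following is a minimal sketch that lays the value out as the 16 bytes it occupies in a little-endian RIFF file, where the first three GUID fields are stored least-significant byte first. The names pvx_guid, PVX_GUID, pvx_guid_to_bytes and pvx_guid_matches are hypothetical; only the GUID value itself comes from this document.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable stand-in for the Windows GUID structure. */
typedef struct {
    uint32_t data1;
    uint16_t data2;
    uint16_t data3;
    uint8_t  data4[8];
} pvx_guid;

/* {8312B9C2-2E6E-11d4-A824-DE5B96C3AB21} */
static const pvx_guid PVX_GUID = {
    0x8312B9C2u, 0x2E6Eu, 0x11D4u,
    { 0xA8, 0x24, 0xDE, 0x5B, 0x96, 0xC3, 0xAB, 0x21 }
};

/* Serialise a GUID to its 16-byte on-disk form: data1..data3 are written
   little-endian, data4 is copied byte for byte. */
static void pvx_guid_to_bytes(const pvx_guid *g, uint8_t out[16])
{
    assert(g != NULL && out != NULL);
    out[0] = (uint8_t)(g->data1 & 0xFFu);
    out[1] = (uint8_t)((g->data1 >> 8) & 0xFFu);
    out[2] = (uint8_t)((g->data1 >> 16) & 0xFFu);
    out[3] = (uint8_t)((g->data1 >> 24) & 0xFFu);
    out[4] = (uint8_t)(g->data2 & 0xFFu);
    out[5] = (uint8_t)((g->data2 >> 8) & 0xFFu);
    out[6] = (uint8_t)(g->data3 & 0xFFu);
    out[7] = (uint8_t)((g->data3 >> 8) & 0xFFu);
    memcpy(out + 8, g->data4, 8);
}

/* Return 1 if the 16 bytes read from a file's SubFormat field are PVOC-EX. */
static int pvx_guid_matches(const uint8_t bytes[16])
{
    uint8_t expected[16];
    assert(bytes != NULL);
    pvx_guid_to_bytes(&PVX_GUID, expected);
    return memcmp(bytes, expected, 16) == 0;
}
```

Comparing raw bytes in this way avoids any dependence on the byte order of the host reading the file.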
 

The complete format chunk for PVOC-EX is:

typedef struct {
 WAVEFORMATEXTENSIBLE wxFormat;
 DWORD dwVersion;      /* initial version is 1 */
 DWORD dwDataSize;     /* sizeof PVOCDATA data block */
 PVOCDATA data;        /* 32 byte block */
} WAVEFORMATPVOCEX;

The total size of WAVEFORMATPVOCEX is 80 bytes, thus respecting the requirements of WAVE_EX that the format chunk support alignment to 8-byte boundaries.
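The 80-byte figure can be checked with a short sketch. It assumes a 4-byte float and a compiler that honours #pragma pack (MSVC, GCC and Clang do), and it re-declares the Windows structures with fixed-width types under hypothetical pvx_ names so that it compiles without the Windows headers. The Samples union of WAVEFORMATEXTENSIBLE is reduced here to its one 16-bit member.

```c
#include <assert.h>
#include <stdint.h>

#pragma pack(push, 1)
typedef struct {                 /* WAVEFORMATEX: 18 bytes */
    uint16_t wFormatTag;
    uint16_t nChannels;
    uint32_t nSamplesPerSec;
    uint32_t nAvgBytesPerSec;
    uint16_t nBlockAlign;
    uint16_t wBitsPerSample;
    uint16_t cbSize;
} pvx_waveformatex;

typedef struct {                 /* WAVEFORMATEXTENSIBLE: 40 bytes */
    pvx_waveformatex Format;
    uint16_t wValidBitsPerSample;
    uint32_t dwChannelMask;
    uint8_t  SubFormat[16];      /* the GUID, as raw bytes */
} pvx_waveformatextensible;

typedef struct {                 /* PVOCDATA: 32 bytes */
    uint16_t wWordFormat;
    uint16_t wAnalFormat;
    uint16_t wSourceFormat;
    uint16_t wWindowType;
    uint32_t nAnalysisBins;
    uint32_t dwWinlen;
    uint32_t dwOverlap;
    uint32_t dwFrameAlign;
    float    fAnalysisRate;
    float    fWindowParam;
} pvx_pvocdata;

typedef struct {                 /* WAVEFORMATPVOCEX: 80 bytes */
    pvx_waveformatextensible wxFormat;
    uint32_t dwVersion;
    uint32_t dwDataSize;
    pvx_pvocdata data;
} pvx_waveformatpvocex;
#pragma pack(pop)

/* Abort if the compiler's layout differs from the sizes in the text. */
static void pvx_check_layout(void)
{
    assert(sizeof(pvx_waveformatex) == 18);
    assert(sizeof(pvx_waveformatextensible) == 40);
    assert(sizeof(pvx_pvocdata) == 32);
    assert(sizeof(pvx_waveformatpvocex) == 80);
    assert(sizeof(pvx_waveformatpvocex) % 8 == 0); /* 8-byte alignment */
}
```

On a platform where these checks fail, the fields would need to be read and written one at a time rather than as a single block.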

wxFormat:
    contains the information required to synthesize the file as originally analysed. The full scope of WAVE_EX is available, including the definition of speaker positions. This information can be ignored by a renderer, though certain fields are important:

wxFormat.Format.nChannels
      Number of channels in the file (mono, stereo, etc)

wxFormat.Format.nSamplesPerSec
    Sample Rate of the source. Informs applications of the Nyquist frequency for the analysis data.

In circumstances where the analysis data has been synthesized directly, these are the two essential pieces of information. It is then a matter of choice what output sample format is specified, though for synthetic data, use of the floating-point format is recommended. The data for the full WAVEFORMATEX block should be set correctly as for any WAVE file.
 

All information specific to the phase vocoder is contained within the PVOCDATA block. This is defined by the structure:

typedef struct pvoc_data {
 WORD wWordFormat;     /* IEEE_FLOAT or IEEE_DOUBLE */
 WORD wAnalFormat;     /* PVOC_AMP_FREQ or PVOC_AMP_PHASE */
 WORD wSourceFormat;   /* WAVE_FORMAT_PCM or WAVE_FORMAT_IEEE_FLOAT */
 WORD wWindowType;     /* defines the standard analysis window used, or a custom window */
 DWORD nAnalysisBins;  /* number of analysis channels */
 DWORD dwWinlen;       /* analysis window length, in samples */
 DWORD dwOverlap;      /* window overlap length, in samples (decimation) */
 DWORD dwFrameAlign;   /* usually nAnalysisBins * 2 * sizeof(float) */
 float fAnalysisRate;  /* sample rate / dwOverlap */
 float fWindowParam;   /* parameter associated with some window types: default 0.0f unless needed */
} PVOCDATA;

Notes on some PVOCDATA fields.

wWordFormat:
    I expect that IEEE_FLOAT will be used almost always. I recognize that some advanced applications may wish to be able to use doubles; the issue is that more than one floating-point format exists for doubles, and it will be important to eliminate all possibility of ambiguity here.

wAnalFormat:
    Csound, CDP/CARL, and PVC all write analysis channels as amplitude and frequency. Soundhack writes amplitude and phase (a format listed within Csound but not implemented). Other representations are possible, but I feel that specifying too many alternative formats adds complexity to a receiving application. Conversion between the two formats is easy, though at some computational cost.
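To give an idea of what such a conversion involves, here is a minimal sketch of the usual phase vocoder calculation that turns one frame of amplitude and phase pairs into amplitude and frequency pairs. It is not taken from any of the implementations named above. It assumes phases in radians, frequencies in Hz, and a hop between frames equal to dwOverlap; the name pvx_phase_to_freq and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PVX_PI 3.14159265358979323846

/* Convert one frame of (amplitude, phase) pairs to (amplitude, frequency)
   pairs in place. prev_phase holds each bin's phase from the previous frame
   and is updated on return. The frequency of bin k is the bin centre plus
   the deviation implied by the phase advance over one hop. */
static void pvx_phase_to_freq(float *frame, double *prev_phase, size_t nbins,
                              size_t fft_size, size_t hop, double sample_rate)
{
    size_t k;
    assert(frame != NULL && prev_phase != NULL);
    assert(fft_size > 0 && hop > 0);
    for (k = 0; k < nbins; k++) {
        double phase = frame[2 * k + 1];
        /* phase advance expected if the partial sat exactly on bin k */
        double expected = 2.0 * PVX_PI * (double)k * (double)hop
                          / (double)fft_size;
        double dev = phase - prev_phase[k] - expected;
        /* wrap the deviation into [-pi, pi] */
        while (dev > PVX_PI)  dev -= 2.0 * PVX_PI;
        while (dev < -PVX_PI) dev += 2.0 * PVX_PI;
        frame[2 * k + 1] = (float)((double)k * sample_rate / (double)fft_size
                           + dev * sample_rate / (2.0 * PVX_PI * (double)hop));
        prev_phase[k] = phase;
    }
}
```

The amplitude values are left untouched. The reverse conversion accumulates phase from frequency, and for that reason needs a starting phase for every bin.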

wSourceFormat:
    This is required to disambiguate a 32-bit source sample size as defined in WAVEFORMATEX. Since wFormatTag is WAVE_FORMAT_EXTENSIBLE, and a custom GUID is used, the distinction between integer and floating-point samples is lost.

wWindowType:
    One of the arguable aspects of the specification. It is possible to identify a large number of analysis windows. However, in current phase vocoder implementations, one of a small set of standard windows is used. The following window types have been defined for PVOC-EX so far:

PVOC_HAMMING
PVOC_HANNING
PVOC_KAISER
PVOC_RECT
PVOC_CUSTOM

The Kaiser window has an associated parameter, 'beta', which can be given in the fWindowParam field. If this is zero, the default value of 6.8 will be assumed.
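A reader might resolve the Kaiser parameter as in the following small sketch. The function name and the macro are hypothetical; the rule itself (zero means 6.8) is the one stated above.

```c
#include <assert.h>

#define PVX_KAISER_DEFAULT_BETA 6.8f

/* Return the beta to use for a PVOC_KAISER window, given the value stored
   in fWindowParam. A stored value of zero selects the default. */
static float pvx_kaiser_beta(float fWindowParam)
{
    assert(fWindowParam == fWindowParam);   /* reject NaN */
    return (fWindowParam == 0.0f) ? PVX_KAISER_DEFAULT_BETA : fWindowParam;
}
```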

The provision of PVOC_CUSTOM is possibly contentious. If this is specified, the format chunk must be followed, before the 'data' chunk, by a special chunk containing the window data, of length dwWinlen. The samples must be of the same type as the analysis data itself, as given by wWordFormat. The data must be normalised so that the peak sample (centre of the window) is 1.0.

nAnalysisBins:
    Number of analysis channels. This is derived directly from the fft size used in the analysis:
nAnalysisBins  = (fft_size / 2) + 1.

Note that the format supports the use of window sizes, given by dwWinlen, greater than the fft size.
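The relations given so far between the analysis parameters and the PVOCDATA fields can be collected into one sketch. It assumes an even FFT size and the IEEE_FLOAT word format (hence sizeof(float) in the frame size); the struct pvx_derived and the function pvx_derive are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* The PVOCDATA fields that follow directly from the analysis parameters. */
typedef struct {
    uint32_t nAnalysisBins;
    uint32_t dwFrameAlign;
    float    fAnalysisRate;
} pvx_derived;

static pvx_derived pvx_derive(uint32_t fft_size, uint32_t dwOverlap,
                              uint32_t sample_rate)
{
    pvx_derived d;
    assert(fft_size >= 2 && fft_size % 2 == 0);  /* assumption: even size */
    assert(dwOverlap > 0);
    d.nAnalysisBins = fft_size / 2 + 1;
    /* two values (e.g. amplitude and frequency) per bin */
    d.dwFrameAlign  = d.nAnalysisBins * 2 * (uint32_t)sizeof(float);
    d.fAnalysisRate = (float)sample_rate / (float)dwOverlap;
    return d;
}
```

For example, an FFT size of 1024 with an overlap of 128 samples at 44100 Hz gives 513 bins, a frame size of 4104 bytes, and an analysis rate of 344.53125 frames per second.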
 

Custom Window Chunk.

    This is very simple:
    <PVXW>
    <chunk-size in bytes, excluding tag and size field>
    < window data, dwWinlen samples>
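As a sketch of this layout, the function below builds a PVXW chunk in memory for IEEE_FLOAT window data. It assumes a 4-byte IEEE float and writes all multi-byte values little-endian, as elsewhere in a RIFF file. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value least-significant byte first. */
static void pvx_put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xFFu);
    p[1] = (uint8_t)((v >> 8) & 0xFFu);
    p[2] = (uint8_t)((v >> 16) & 0xFFu);
    p[3] = (uint8_t)((v >> 24) & 0xFFu);
}

/* Write tag, size and dwWinlen float samples into out.
   Returns the total number of bytes written, or 0 if out is too small. */
static size_t pvx_write_window_chunk(uint8_t *out, size_t out_cap,
                                     const float *window, uint32_t dwWinlen)
{
    size_t payload = (size_t)dwWinlen * 4u;  /* size excludes tag and size */
    size_t total = 8u + payload;
    uint32_t i;
    assert(out != NULL && window != NULL);
    assert(sizeof(float) == 4);
    assert(dwWinlen <= 0x3FFFFFFFu);         /* size must fit in 32 bits */
    if (out_cap < total)
        return 0;
    memcpy(out, "PVXW", 4);
    pvx_put_le32(out + 4, (uint32_t)payload);
    for (i = 0; i < dwWinlen; i++) {
        uint32_t bits;
        memcpy(&bits, &window[i], 4);        /* reinterpret float as bits */
        pvx_put_le32(out + 8 + (size_t)4 * i, bits);
    }
    return total;
}
```

A three-sample window therefore produces a 20-byte chunk: 4 bytes of tag, 4 bytes of size (holding the value 12), and 12 bytes of samples.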

This may well not be adequate. Possible additions include a 4-byte ident, and a float field specifying amplitude. Note that the WAVE_EX spec encourages all chunks to support 8-byte alignment.

No other chunks, apart from the data chunk itself, are required for PVOC-EX. Where the renderer specifies floating-point samples, the PEAK chunk can be used in the usual way. This will be especially relevant where a custom window is used, as amplitude levels cannot be presumed.
 

The 'data' chunk.

Analysis frames are interleaved according to nChannels, i.e.:

for a stereo file:

<frame 0 Ch 0>
<frame 0 Ch 1>
<frame 1 Ch 0>
<frame 1 Ch 1>
etc...
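Under this interleaving, the position of any frame within the 'data' chunk follows from nChannels and dwFrameAlign. The sketch below assumes that every frame occupies exactly dwFrameAlign bytes; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset, from the start of the 'data' chunk payload, of the analysis
   frame with index 'frame' for channel 'channel'. */
static uint64_t pvx_frame_offset(uint32_t frame, uint16_t channel,
                                 uint16_t nChannels, uint32_t dwFrameAlign)
{
    assert(nChannels > 0 && channel < nChannels);
    return ((uint64_t)frame * nChannels + channel) * dwFrameAlign;
}
```

For a stereo file with 4104-byte frames, frame 1 of channel 0 starts at byte 8208 and frame 1 of channel 1 at byte 12312.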

Frame amplitudes are expected to be normalized close to 1.0. Thus, where the source is a full-amplitude sinewave, the peak amplitude in the nearest bin will be close to 1.0. Later versions of this document will develop this aspect further. Suffice it to say here that both the CARL and Soundhack formats provide analysis windows in this form, while Csound and PVC require scale factors. The example implementation accompanying this release is based on the CARL distribution, and a further program demonstrates conversion from the current Csound format to the PVOC-EX format.

Example Implementation (command-line programs):

All pvocex code is confined to two files: pvfileio.c, and pvfileio.h

Source for pvocex analysis-resynthesis (accepts WAVE, AIFF and AIFF-C files):
 pvocex_src01.zip (49KB). Includes project files for VC++5.0
Source for pvconv application (convert Csound and Soundhack analysis files to pvoc-ex):
pvconv_src01.zip (5KB)  NB requires pvocex_src01.zip

Executables for Win32 Pentium systems (pvocex and pvconv):
pvocex_bin01.zip (82KB)
 
 

Richard Dobson 25 May 2000