WaveNet Computation Walkthrough

This document provides a detailed step-by-step explanation of how the NAM WaveNet architecture performs its computations, including the LayerArray and Layer objects that make up the model.

“It’s not really a WaveNet”

The name “WaveNet” is a bit of a misnomer. There are similarities to the architecture from van den Oord et al. (2016): this is a convolutional neural network that repeats a “Layer” motif with skip connections, which gives the good accuracy typical of convnets along with good training stability. But there are a lot of differences.

Here’s a rundown of what’s not exactly the same at an informal level:

  • The model in NAM is feedforward and used in a “regression” setting; the model from the original paper is autoregressive and used for generative tasks.

  • The class in NAM composes several “Layer array” objects. Each one of these individually is far closer to a “WaveNet” in architecture. In other words, this is more like a “stacked WaveNet”.

  • There are additional skip connections (e.g. input mixin) that aren’t really part of the original WaveNet architecture.

  • And finally, the actual recipe within the layer has a lot of modifications. The original layer has, roughly, a “convolution-activation-convolution” sequence with a gated activation. Here, the gated activation is optional (and is frequently not used, like in the popular A1 standard/lite/feather/nano configurations).

  • In v0.4.0, even more modifications have been added: FiLMs, a bottleneck, and an arbitrary “conditioning DSP” module that can be used to embed the input signal in a more effective way to modulate the layers in the main model. The conditioning module doesn’t need to be a WaveNet, but if it is, the result feels more like a “cascading (stacked) WaveNet”.

WaveNet Overview

WaveNet is a dilated convolutional neural network architecture designed for audio processing. The model consists of:

  • Multiple LayerArrays: Each LayerArray contains multiple layers with the same channel configuration

  • Conditioning: Optional DSP processing of the input generates a conditioning signal, which is “skipped in” to the layers.

  • Residual and Skip Connections: Information flows through both residual (layer-to-layer) and skip (to head) paths

Computation graphs of the layer, layer array, and full model are below on this page.

Layer Computation

A single Layer performs the core computation of a WaveNet block. The computation proceeds through several stages:

Step 1: Input Convolution

The input first goes through a dilated 1D convolution:

  1. Optional Pre-FiLM: If conv_pre_film is active, the input is modulated by the condition signal before convolution

  2. Dilated Convolution: The input is convolved with a dilated kernel

  3. Optional Post-FiLM: If conv_post_film is active, the convolution output is modulated by the condition signal

Note

Having two FiLM layers bookending the convolution layer is mathematically equivalent to a sort of “rank 1 adaptive LoRA” on the convolution weights.
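One way to read this note is to look only at the scale part of each FiLM (ignoring any shift) and at a kernel-size-1 convolution, which is a plain matrix multiply per frame. The sketch below is illustrative and not taken from the NAM code: `Vec`, `Mat`, `matvec`, `scale`, `film_conv_film`, and `modulated_weights` are hypothetical helpers. Under those assumptions, scaling the input by `s_in`, applying `W`, and scaling the result by `s_out` is the same as applying the weights `W[i][j] * s_out[i] * s_in[j]`, i.e. `W` multiplied element-wise by a rank-1 outer product.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>; // row-major: Mat[row][col]

// y = W x for one frame (a kernel-size-1 convolution applied to one column).
Vec matvec(const Mat& W, const Vec& x) {
    Vec y(W.size(), 0.0f);
    for (std::size_t i = 0; i < W.size(); i++) {
        assert(W[i].size() == x.size());
        for (std::size_t j = 0; j < x.size(); j++)
            y[i] += W[i][j] * x[j];
    }
    return y;
}

// Element-wise product of two vectors (the "scale" half of a FiLM).
Vec scale(const Vec& s, const Vec& x) {
    assert(s.size() == x.size());
    Vec y(x.size());
    for (std::size_t i = 0; i < x.size(); i++)
        y[i] = s[i] * x[i];
    return y;
}

// Path 1: pre-scale the input, convolve, post-scale the output.
Vec film_conv_film(const Mat& W, const Vec& s_in, const Vec& s_out, const Vec& x) {
    return scale(s_out, matvec(W, scale(s_in, x)));
}

// Path 2: fold both scales into the weights,
// W'[i][j] = W[i][j] * s_out[i] * s_in[j].
Mat modulated_weights(const Mat& W, const Vec& s_in, const Vec& s_out) {
    assert(W.size() == s_out.size());
    Mat Wm = W;
    for (std::size_t i = 0; i < W.size(); i++) {
        assert(W[i].size() == s_in.size());
        for (std::size_t j = 0; j < W[i].size(); j++)
            Wm[i][j] = W[i][j] * s_out[i] * s_in[j];
    }
    return Wm;
}
```

For any input `x`, `film_conv_film(W, s_in, s_out, x)` and `matvec(modulated_weights(W, s_in, s_out), x)` should agree up to floating-point rounding. Since the scales depend on the condition, the effective weights change from frame to frame, which is the "adaptive" part of the analogy.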

Input convolution processing
if (this->_conv_pre_film) {
    this->_conv_pre_film->Process(input, condition, num_frames);
    this->_conv.Process(this->_conv_pre_film->GetOutput(), num_frames);
} else {
    this->_conv.Process(input, num_frames);
}
if (this->_conv_post_film) {
    Eigen::MatrixXf& conv_output = this->_conv.GetOutput();
    this->_conv_post_film->Process_(conv_output, condition, num_frames);
}
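For readers unfamiliar with dilation, here is a minimal single-channel sketch of a causal dilated convolution. It is not NAM's `Conv1D`: the function name `dilated_conv1d`, the tap ordering (last tap aligned with the current sample), and the zero history before the first frame are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Causal dilated 1D convolution on a single channel (illustrative only).
// y[t] = bias + sum_k w[k] * x[t - (K - 1 - k) * dilation],
// where samples before t = 0 are treated as zero.
std::vector<float> dilated_conv1d(const std::vector<float>& x, const std::vector<float>& w,
                                  int dilation, float bias = 0.0f) {
    assert(dilation >= 1);
    assert(!w.empty());
    const int K = static_cast<int>(w.size());
    const int n = static_cast<int>(x.size());
    std::vector<float> y(x.size(), bias);
    for (int t = 0; t < n; t++) {
        for (int k = 0; k < K; k++) {
            // Taps are spaced `dilation` samples apart, looking backwards in time.
            const int src = t - (K - 1 - k) * dilation;
            if (src >= 0)
                y[static_cast<std::size_t>(t)] += w[static_cast<std::size_t>(k)] * x[static_cast<std::size_t>(src)];
        }
    }
    return y;
}
```

With `w = {1, 10}` and `dilation = 2`, each output is `x[t - 2] + 10 * x[t]`, so the kernel skips over the sample in between. A streaming implementation would keep past samples in a history buffer instead of assuming zeros.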

Step 2: Input Mixin

The conditioning input is processed separately and added to the convolution output:

  1. Optional Pre-FiLM: If input_mixin_pre_film is active, the condition is modulated before the mixin convolution

  2. Input Mixin Convolution: A 1x1 convolution processes the condition signal

  3. Optional Post-FiLM: If input_mixin_post_film is active, the mixin output is modulated

Input mixin processing
if (this->_input_mixin_pre_film) {
    this->_input_mixin_pre_film->Process(condition, condition, num_frames);
    this->_input_mixin.process_(this->_input_mixin_pre_film->GetOutput(), num_frames);
} else {
    this->_input_mixin.process_(condition, num_frames);
}
if (this->_input_mixin_post_film) {
    Eigen::MatrixXf& input_mixin_output = this->_input_mixin.GetOutput();
    this->_input_mixin_post_film->Process_(input_mixin_output, condition, num_frames);
}

Step 3: Sum and Pre-Activation FiLM

The convolution output and input mixin output are summed, and optionally modulated:

Sum and pre-activation FiLM
this->_z.leftCols(num_frames).noalias() =
    _conv.GetOutput().leftCols(num_frames) + _input_mixin.GetOutput().leftCols(num_frames);
if (this->_activation_pre_film) {
    this->_activation_pre_film->Process_(this->_z, condition, num_frames);
}
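FiLM (feature-wise linear modulation) in general scales and shifts each channel using values derived from a condition. The sketch below shows that idea for a single frame. It assumes both a scale and a shift, each computed as a linear map of the condition frame. `ToyFiLM`, its fields, and `film_frame` are hypothetical names and do not reflect the parameter layout of NAM's FiLM class.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>; // row-major: Mat[row][col]

// Hypothetical parameter layout: one linear map from the condition to the
// per-channel scale and another to the per-channel shift.
struct ToyFiLM {
    Mat scale_weights; // (channels, condition_dim)
    Mat shift_weights; // (channels, condition_dim)
};

// Modulates one frame: out[c] = scale[c] * x[c] + shift[c].
Vec film_frame(const ToyFiLM& film, const Vec& x, const Vec& condition) {
    assert(film.scale_weights.size() == x.size());
    assert(film.shift_weights.size() == x.size());
    Vec out(x.size());
    for (std::size_t c = 0; c < x.size(); c++) {
        assert(film.scale_weights[c].size() == condition.size());
        assert(film.shift_weights[c].size() == condition.size());
        float scale = 0.0f;
        float shift = 0.0f;
        for (std::size_t k = 0; k < condition.size(); k++) {
            scale += film.scale_weights[c][k] * condition[k];
            shift += film.shift_weights[c][k] * condition[k];
        }
        out[c] = scale * x[c] + shift;
    }
    return out;
}
```

Every optional FiLM in the layer (pre/post convolution, pre/post activation, and so on) can be read as this kind of per-channel modulation applied at a different point in the data flow.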

Step 4: Activation

The activation stage depends on the gating mode:

No Gating (GatingMode::NONE)

Simple activation function applied to the summed output.

Gated (GatingMode::GATED)

The summed convolution and input mixin output has twice the bottleneck channel count (2 * bottleneck). The top half goes through the primary activation, the bottom half through a secondary activation (typically sigmoid). The results are multiplied element-wise.

Blended (GatingMode::BLENDED)

Similar to gated, but instead of multiplication, a weighted blend is performed: output = alpha * activated_input + (1 - alpha) * pre_activation_input where alpha comes from the secondary activation.

After activation, an optional post-activation FiLM may be applied.

Note

Even though the secondary activation is classically chosen to be a sigmoid, it doesn’t need to be. It doesn’t even need to output a value between 0 and 1. The operation is still well-defined.

Activation processing (gated mode example)
if (this->_gating_mode == GatingMode::GATED) {
    auto input_block = this->_z.leftCols(num_frames);
    auto output_block = this->_z.topRows(bottleneck).leftCols(num_frames);
    this->_gating_activation->apply(input_block, output_block);
    if (this->_activation_post_film) {
        this->_activation_post_film->Process(this->_z.topRows(bottleneck), condition, num_frames);
        this->_z.topRows(bottleneck).leftCols(num_frames).noalias() =
            this->_activation_post_film->GetOutput().leftCols(num_frames);
    }
}
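The three gating modes can be compared side by side on a single frame. This is a sketch under stated assumptions, not the NAM implementation: it hard-codes tanh as the primary activation and sigmoid as the secondary one, and `activate_frame` and `sigmoid` are hypothetical helpers. The `GatingMode` values mirror the names used above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

enum class GatingMode { NONE, GATED, BLENDED };

inline float sigmoid(float v) { return 1.0f / (1.0f + std::exp(-v)); }

// Applies the activation stage to one frame (one column of z).
// For GATED and BLENDED, z holds 2 * bottleneck values: the first (top) half
// is the primary input and the second (bottom) half drives the secondary
// activation. The result always has `bottleneck` values.
std::vector<float> activate_frame(const std::vector<float>& z, GatingMode mode,
                                  std::size_t bottleneck) {
    const std::size_t expected = (mode == GatingMode::NONE) ? bottleneck : 2 * bottleneck;
    assert(z.size() == expected);
    std::vector<float> out(bottleneck);
    for (std::size_t i = 0; i < bottleneck; i++) {
        const float primary = std::tanh(z[i]);
        if (mode == GatingMode::NONE) {
            out[i] = primary;
        } else {
            const float alpha = sigmoid(z[i + bottleneck]);
            if (mode == GatingMode::GATED)
                out[i] = primary * alpha; // element-wise gate
            else
                out[i] = alpha * primary + (1.0f - alpha) * z[i]; // blend with the pre-activation input
        }
    }
    return out;
}
```

In blended mode, `alpha` near 1 yields the activated value and `alpha` near 0 passes the pre-activation input through almost unchanged.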

Step 5: 1x1 Convolution

A 1x1 convolution reduces the bottleneck channels back to the layer channel count:

1x1 convolution
_1x1.process_(this->_z.topRows(bottleneck), num_frames);
if (this->_1x1_post_film) {
    Eigen::MatrixXf& _1x1_output = this->_1x1.GetOutput();
    this->_1x1_post_film->Process_(_1x1_output, condition, num_frames);
}

Step 6: Head 1x1 (Optional)

If a head1x1 convolution is configured, it processes the activated output for the skip connection:

Head 1x1 processing
if (this->_head1x1) {
    this->_head1x1->process_(this->_z.topRows(bottleneck).leftCols(num_frames), num_frames);
    if (this->_head1x1_post_film) {
        Eigen::MatrixXf& head1x1_output = this->_head1x1->GetOutput();
        this->_head1x1_post_film->Process_(head1x1_output, condition, num_frames);
    }
    this->_output_head.leftCols(num_frames).noalias() =
        this->_head1x1->GetOutput().leftCols(num_frames);
}

Note

If there is no head 1x1, then the output dimension is the same as the activation output dimension (the “bottleneck” dimension). If there is, then the head can project to an arbitrary dimension.

Step 7: Residual and Skip Connections

Finally, the outputs are computed:

  • Residual Connection: output_next_layer = input + 1x1_output

  • Skip Connection: output_head = activated_output (or head1x1 output if present)

Residual and skip connections
// Store output to next layer (residual connection)
this->_output_next_layer.leftCols(num_frames).noalias() =
    input.leftCols(num_frames) + _1x1.GetOutput().leftCols(num_frames);

// Store output to head (skip connection)
if (this->_head1x1) {
    this->_output_head.leftCols(num_frames).noalias() =
        this->_head1x1->GetOutput().leftCols(num_frames);
} else {
    this->_output_head.leftCols(num_frames).noalias() =
        this->_z.topRows(bottleneck).leftCols(num_frames);
}

Data Flow Diagram

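Data arrays are marked with their dimensions as (channels, frames), where dx is the layer channel count, dc is the condition channel count, b is the bottleneck size, and n is the number of frames. Notes: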

  • g=2 if a gating or blending activation is used, and 1 otherwise.

  • The head output dimension dh is the bottleneck dimension b when no head 1x1 is used; otherwise, it is determined by the head 1x1’s number of output channels.

        graph TD
    Input["Input (dx,n)"] --> PreFiLM1{Pre-FiLM?}
    PreFiLM1 -->|Yes| ConvPre[Conv Pre-FiLM]
    PreFiLM1 -->|No| Conv["Dilated Conv (g*b,n)"]
    ConvPre --> Conv
    Conv --> PostFiLM1{Post-FiLM?}
    PostFiLM1 -->|Yes| ConvPost[Conv Post-FiLM]
    PostFiLM1 -->|No| Sum["Sum (g*b,n)"]
    ConvPost --> Sum

    Condition["Condition (dc,n)"] --> PreFiLM2{Pre-FiLM?}
    PreFiLM2 -->|Yes| MixinPre[Input Mixin Pre-FiLM]
    PreFiLM2 -->|No| Mixin["Input Mixin (g*b,n)"]
    MixinPre --> Mixin
    Mixin --> PostFiLM2{Post-FiLM?}
    PostFiLM2 -->|Yes| MixinPost[Input Mixin Post-FiLM]
    PostFiLM2 -->|No| Sum
    MixinPost --> Sum

    Sum --> PreActFiLM{Pre-Act FiLM?}
    PreActFiLM -->|Yes| PreAct[Pre-Activation FiLM]
    PreActFiLM -->|No| Act["Activation (b,n)"]
    PreAct --> Act

    Act --> PostActFiLM{Post-Act FiLM?}
    PostActFiLM -->|Yes| PostActFilm[Post-Activation FiLM]
    PostActFiLM -->|No| PostAct["Post-Activation Output (b,n)"]
    PostActFilm --> PostAct

    PostAct --> Conv1x1["1x1 Conv (dx,n)"]
    Conv1x1 --> Post1x1FiLM{Post-1x1 FiLM?}
    Post1x1FiLM -->|Yes| Post1x1[Post-1x1 FiLM]
    Post1x1FiLM -->|No| Residual["Residual (dx,n)"]
    Post1x1 --> Residual

    Input --> ResidualSum["Residual Sum (dx,n)"]
    Residual --> ResidualSum
    ResidualSum --> LayerOutput["Layer Output (dx,n)"]

    PostAct --> Head1x1{Head 1x1?}
    Head1x1 -->|Yes| HeadConv["Head 1x1 Conv (dh,n)"]
    Head1x1 -->|No| HeadOutput["Head Output (dh,n)"]
    HeadConv --> HeadFiLM{Head FiLM?}
    HeadFiLM -->|Yes| HeadPost[Head Post-FiLM]
    HeadFiLM -->|No| HeadOutput
    HeadPost --> HeadOutput
    

Layer Computation Flow
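The dimension notes for the diagram above can be restated as a small bookkeeping helper. `LayerShapes` and `layer_shapes` are hypothetical names used only for this sketch, and passing `0` to mean "no head 1x1" is a convention invented here.

```cpp
#include <cassert>

enum class GatingMode { NONE, GATED, BLENDED };

struct LayerShapes {
    int conv_out_channels;   // g * b: rows of the dilated conv and input mixin outputs
    int activation_channels; // b: rows after the activation
    int head_channels;       // dh: rows of the head (skip) output
};

// head1x1_out_channels == 0 means "no head 1x1 configured" (sketch convention).
LayerShapes layer_shapes(GatingMode mode, int bottleneck, int head1x1_out_channels = 0) {
    assert(bottleneck > 0);
    assert(head1x1_out_channels >= 0);
    const int g = (mode == GatingMode::NONE) ? 1 : 2;
    const int dh = (head1x1_out_channels > 0) ? head1x1_out_channels : bottleneck;
    return LayerShapes{g * bottleneck, bottleneck, dh};
}
```

For example, a gated layer with a bottleneck of 8 and a head 1x1 projecting to 4 channels has a 16-row convolution output, an 8-row activation output, and a 4-row head output.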

LayerArray Computation

A LayerArray chains multiple Layer objects together, processing them sequentially while accumulating their “head outputs” via skip-out connections.

Step 1: Rechanneling

The input is first projected (rechanneled) to match the layer channel count:

Input rechanneling
this->_rechannel.process_(layer_inputs, num_frames);
Eigen::MatrixXf& rechannel_output = _rechannel.GetOutput();

Step 2: Layer Processing

Each layer processes the output of the previous layer:

  1. First Layer: Processes the rechanneled input

  2. Subsequent Layers: Process the residual output from the previous layer

  3. Head Accumulation: Each “head output” is accumulated into the head buffer

Layer processing loop
for (size_t i = 0; i < this->_layers.size(); i++) {
    if (i == 0) {
        // First layer consumes the rechannel output buffer
        this->_layers[i].Process(rechannel_output, condition, num_frames);
    } else {
        // Subsequent layers consume the previous layer's output
        Eigen::MatrixXf& prev_output = this->_layers[i - 1].GetOutputNextLayer();
        this->_layers[i].Process(prev_output, condition, num_frames);
    }

    // Accumulate head output from this layer
    this->_head_inputs.leftCols(num_frames).noalias() +=
        this->_layers[i].GetOutputHead().leftCols(num_frames);
}

Step 3: Head Rechanneling

The accumulated head outputs are projected (rechanneled) to the final output dimension for the layer array:

Head rechanneling
_head_rechannel.process_(this->_head_inputs, num_frames);

LayerArray Structure

        graph TD
    Input[Layer Input] --> Rechannel[Rechannel]
    Rechannel --> Layer1[Layer 1]
    Layer1 --> Layer2[Layer 2]
    Layer2 --> Layer3[Layer 3]
    Layer3 --> LayerN[Layer N]
    Layer1 -->|Skip| HeadAccum[Head Accumulator]
    Layer2 -->|Skip| HeadAccum
    Layer3 -->|Skip| HeadAccum
    LayerN -->|Skip| HeadAccum
    HeadAccum --> HeadRechannel[Head Rechannel]
    HeadRechannel --> HeadOut[Head Output]
    LayerN --> LayerOut[Layer Output]
    

LayerArray Structure

WaveNet Processing

The complete WaveNet processing pipeline:

Step 1: Condition Processing

If a condition DSP is provided, the input is processed through it to generate the conditioning signal:

Condition processing
void WaveNet::_process_condition(const int num_frames) {
    if (this->_condition_dsp != nullptr) {
        // Process input through condition DSP
        this->_condition_dsp->process(/* input */, /* output */, num_frames);
        // Copy output to condition buffer
    } else {
        // Use input directly as condition
        this->_condition_output = this->_condition_input;
    }
}

The condition module can be a WaveNet, but it can also be something else, such as a convolution or an RNN.

Step 2: LayerArray Processing

Each LayerArray processes the output of the previous array:

  1. First LayerArray: Processes the input with zeroed head inputs

  2. Subsequent LayerArrays: Process the previous array’s output and accumulate head inputs

LayerArray processing
// First layer array
this->_layer_arrays[0].Process(input, condition, num_frames);

// Subsequent layer arrays
for (size_t i = 1; i < this->_layer_arrays.size(); i++) {
    Eigen::MatrixXf& prev_output = this->_layer_arrays[i-1].GetLayerOutputs();
    Eigen::MatrixXf& prev_head = this->_layer_arrays[i-1].GetHeadOutputs();
    this->_layer_arrays[i].Process(prev_output, condition, prev_head, num_frames);
}
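The way residual and head paths thread through stacked arrays can be hard to see from the loop alone. The toy model below reduces everything to one channel so the data flow is visible. It is a deliberately simplified stand-in: `ToyLayerArray`, `run_arrays`, the scalar gains, and the tanh activation are all hypothetical, and dilation, conditioning, and rechanneling are omitted.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Signal = std::vector<float>;

// Hypothetical single-channel stand-in for a LayerArray.
struct ToyLayerArray {
    std::vector<float> layer_gains; // one toy "layer" per gain
    float head_gain;                // stands in for the head rechannel

    // Returns the head output. layer_io is updated in place and ends up
    // holding the last layer's residual output.
    Signal process(Signal& layer_io, const Signal& head_in) const {
        assert(layer_io.size() == head_in.size());
        Signal head = head_in; // start from the previous array's head output
        for (float gain : layer_gains) {
            for (std::size_t t = 0; t < layer_io.size(); t++) {
                const float activated = std::tanh(gain * layer_io[t]);
                head[t] += activated;     // skip connection to the head
                layer_io[t] += activated; // residual connection to the next layer
            }
        }
        for (float& h : head)
            h *= head_gain;
        return head;
    }
};

// Chains arrays: the first one receives zeroed head inputs, and each later
// one receives the previous array's layer output and head output.
Signal run_arrays(const std::vector<ToyLayerArray>& arrays, Signal input) {
    Signal head(input.size(), 0.0f);
    for (const ToyLayerArray& array : arrays)
        head = array.process(input, head);
    return head;
}
```

Only the head output of the last array is returned. The last array's layer output is computed but not used, matching the "(Unused)" node in the full-model diagram.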

Step 3: Head Scaling and Output

The final head output from the last LayerArray is scaled and written to output:

Head scaling and output
Eigen::MatrixXf& final_head = this->_layer_arrays.back().GetHeadOutputs();
// Apply head scale and write to output buffers
// (implementation details in wavenet.cpp)
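As a rough illustration of this last step, the sketch below scales one row of head output and writes it to a mono buffer. It assumes the final head output has a single channel and that the scale is a single scalar. `write_scaled_output` is a hypothetical helper, and the actual buffer handling lives in wavenet.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Writes head_scale * head_row[t] to a mono output buffer (illustrative only).
// Assumes the final head output has one channel, passed here as head_row.
void write_scaled_output(const std::vector<float>& head_row, float head_scale,
                         float* output, std::size_t num_frames) {
    assert(output != nullptr);
    assert(num_frames <= head_row.size());
    for (std::size_t t = 0; t < num_frames; t++)
        output[t] = head_scale * head_row[t];
}
```

Only the first `num_frames` samples are written, mirroring the `leftCols(num_frames)` pattern used in the snippets above.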

Complete WaveNet Flow

        graph TD
    AudioIn[Audio Input] --> ConditionProc{Condition DSP?}
    ConditionProc -->|Yes| CondDSP[Condition DSP]
    ConditionProc -->|No| Condition[Condition Signal]
    CondDSP --> Condition
    AudioIn --> LayerArray1[LayerArray 1]
    Condition --> LayerArray1
    LayerArray1 -->|LayerN Output| LayerArray2[LayerArray 2]
    LayerArray1 -->|Head Output| LayerArray2
    Condition --> LayerArray2
    LayerArray2 -->|LayerN Output| LayerArrayN[LayerArray N]
    LayerArray2 -->|Head Output| LayerArrayN
    Condition --> LayerArrayN
    LayerArrayN -->|LayerN Output| Unused("(Unused)")
    LayerArrayN -->|Head Output| HeadAccum[Head Accumulator]
    HeadAccum --> HeadScale[Head Scale]
    HeadScale --> AudioOut[Audio Output]
    

Complete WaveNet Processing Flow

See Also

  • WaveNet API - Complete API reference for WaveNet classes

  • DSP API - Base DSP interface documentation

  • Conv1D API - Convolution implementation details