WaveNet Computation Walkthrough

This document provides a detailed step-by-step explanation of how the NAM WaveNet architecture performs its computations, including the LayerArray and Layer objects that make up the model.

“It’s not really a Wavenet”

The name “WaveNet” is a bit of a misnomer. There are similarities to the architecture from van den Oord et al. (2016): this is a convolutional neural network that repeats a “Layer” motif with skip connections, which gives the good accuracy typical of convnets along with good training stability. But there are a lot of differences.

Here’s a rundown of what’s not exactly the same at an informal level:

The model in NAM is feedforward and used in a “regression” setting; the model from the original paper is autoregressive and used for generative tasks.
The class in NAM actually composes several “Layer array” objects. Each one of these individually is actually far closer to a “WaveNet” in architecture. In other words, this is more like a “stacked WaveNet”.
There are additional skip connections (e.g. input mixin) that aren’t really part of the original WaveNet architecture.
And finally, the actual recipe within the layer has a lot of modifications. The original layer has, roughly, a “convolution-activation-convolution” sequence with a gated activation. Here, the gated activation is optional (and is frequently not used, like in the popular A1 standard/lite/feather/nano configurations).
In v0.4.0, even more modifications have been added: FiLMs, a bottleneck, and an arbitrary “conditioning DSP” module that can be used to embed the input signal in a more effective way to modulate the layers in the main model. It doesn’t need to be a WaveNet, but if it were then this feels more like a “cascading (stacked) WaveNet”.

WaveNet Overview

WaveNet is a dilated convolutional neural network architecture designed for audio processing. The model consists of:

Multiple LayerArrays: Each LayerArray contains multiple layers with the same channel configuration
Conditioning: Optional DSP processing of the input generates a conditioning signal, which is “skipped in” to the layers.
Residual and Skip Connections: Information flows through both residual (layer-to-layer) and skip (to head) paths

Computation graphs of the layer, layer array, and full model are below on this page.

Layer Computation

A single Layer performs the core computation of a WaveNet block. The computation proceeds through several stages:

Step 1: Input Convolution

The input first goes through a dilated 1D convolution:

Optional Pre-FiLM: If conv_pre_film is active, the input is modulated by the condition signal before convolution
Dilated Convolution: The input is convolved with a dilated kernel
Optional Post-FiLM: If conv_post_film is active, the convolution output is modulated by the condition signal
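To make the “dilated kernel” step concrete, here is a minimal single-channel sketch. It is not the NAM implementation: dilated_conv1d and receptive_field are hypothetical helpers, the convolution is assumed to be causal, and samples before the start of the buffer are treated as zero (a streaming implementation would instead carry history between calls).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of NAM): single-channel causal dilated convolution.
// y[t] = bias + sum_k w[k] * x[t - (K - 1 - k) * dilation],
// where x is assumed to be zero before the start of the buffer.
std::vector<float> dilated_conv1d(const std::vector<float>& x, const std::vector<float>& w,
                                  std::size_t dilation, float bias = 0.0f)
{
  assert(!w.empty() && dilation >= 1);
  const std::size_t K = w.size();
  std::vector<float> y(x.size(), bias);
  for (std::size_t t = 0; t < x.size(); ++t)
  {
    for (std::size_t k = 0; k < K; ++k)
    {
      // The last tap looks at the current sample; earlier taps look further back.
      const std::size_t lookback = (K - 1 - k) * dilation;
      if (lookback <= t)
        y[t] += w[k] * x[t - lookback];
    }
  }
  return y;
}

// Receptive field (in samples) of a stack of layers that share one kernel size.
std::size_t receptive_field(std::size_t kernel_size, const std::vector<std::size_t>& dilations)
{
  assert(kernel_size >= 1);
  std::size_t rf = 1;
  for (std::size_t d : dilations)
    rf += (kernel_size - 1) * d;
  return rf;
}
```

With a two-tap kernel and a dilation of 2, each output sample combines the current input with the input two samples earlier. Stacking layers with growing dilations is what lets the receptive field grow quickly while the kernel stays small.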

Note

Having two FiLM layers bookending the convolution layer is mathematically equivalent to a sort of “rank 1 adaptive LoRA” on the convolution weights.
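One way to read this note is sketched below, under simplifying assumptions that are not taken from the NAM source: each FiLM is reduced to a per-channel scale (any shift term is ignored), and the convolution is reduced to a single tap on a single frame, so it acts as a matrix W. Under those assumptions, scaling before and after W gives the same result as multiplying W element-wise by the outer product of the two scale vectors, which is a rank-1 matrix. All names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>; // row-major: Mat[row][col]

// y = W x for a single frame.
Vec matvec(const Mat& W, const Vec& x)
{
  Vec y(W.size(), 0.0f);
  for (std::size_t i = 0; i < W.size(); ++i)
  {
    assert(W[i].size() == x.size());
    for (std::size_t j = 0; j < x.size(); ++j)
      y[i] += W[i][j] * x[j];
  }
  return y;
}

// Path A: scale the input, apply W, then scale the output (scale-only FiLMs).
Vec film_conv_film(const Mat& W, const Vec& x, const Vec& g_in, const Vec& g_out)
{
  assert(g_in.size() == x.size() && g_out.size() == W.size());
  Vec xs(x.size());
  for (std::size_t j = 0; j < x.size(); ++j)
    xs[j] = g_in[j] * x[j];
  Vec y = matvec(W, xs);
  for (std::size_t i = 0; i < y.size(); ++i)
    y[i] *= g_out[i];
  return y;
}

// Path B: fold both scales into the weights, W'[i][j] = g_out[i] * W[i][j] * g_in[j],
// i.e. W multiplied element-wise by the rank-1 matrix g_out * g_in^T.
Vec modulated_weights(const Mat& W, const Vec& x, const Vec& g_in, const Vec& g_out)
{
  assert(g_in.size() == x.size() && g_out.size() == W.size());
  Mat Wm = W;
  for (std::size_t i = 0; i < Wm.size(); ++i)
    for (std::size_t j = 0; j < Wm[i].size(); ++j)
      Wm[i][j] *= g_out[i] * g_in[j];
  return matvec(Wm, x);
}
```

Both paths return the same vector for any W, x, g_in, and g_out. Because the scales come from the condition signal, the effective weights change from frame to frame, which is the “adaptive” part of the analogy.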

Input convolution processing

if (this->_conv_pre_film) {
    this->_conv_pre_film->Process(input, condition, num_frames);
    this->_conv.Process(this->_conv_pre_film->GetOutput(), num_frames);
} else {
    this->_conv.Process(input, num_frames);
}
if (this->_conv_post_film) {
    Eigen::MatrixXf& conv_output = this->_conv.GetOutput();
    this->_conv_post_film->Process_(conv_output, condition, num_frames);
}

Step 2: Input Mixin

The conditioning input is processed separately and added to the convolution output:

Optional Pre-FiLM: If input_mixin_pre_film is active, the condition is modulated before the mixin convolution
Input Mixin Convolution: A 1x1 convolution processes the condition signal
Optional Post-FiLM: If input_mixin_post_film is active, the mixin output is modulated

Input mixin processing

if (this->_input_mixin_pre_film) {
    this->_input_mixin_pre_film->Process(condition, condition, num_frames);
    this->_input_mixin.process_(this->_input_mixin_pre_film->GetOutput(), num_frames);
} else {
    this->_input_mixin.process_(condition, num_frames);
}
if (this->_input_mixin_post_film) {
    Eigen::MatrixXf& input_mixin_output = this->_input_mixin.GetOutput();
    this->_input_mixin_post_film->Process_(input_mixin_output, condition, num_frames);
}

Step 3: Sum and Pre-Activation FiLM

The convolution output and input mixin output are summed, and optionally modulated:

Sum and pre-activation FiLM

this->_z.leftCols(num_frames).noalias() =
    _conv.GetOutput().leftCols(num_frames) + _input_mixin.GetOutput().leftCols(num_frames);
if (this->_activation_pre_film) {
    this->_activation_pre_film->Process_(this->_z, condition, num_frames);
}

Step 4: Activation

The activation stage depends on the gating mode:

No Gating (GatingMode::NONE): Simple activation function applied to the summed output.
Gated (GatingMode::GATED): The convolution and input mixin produce twice as many channels (2 * bottleneck). The top half goes through the primary activation, the bottom half through a secondary activation (typically sigmoid). The results are multiplied element-wise.
Blended (GatingMode::BLENDED): Similar to gated, but instead of multiplication, a weighted blend is performed: output = alpha * activated_input + (1 - alpha) * pre_activation_input where alpha comes from the secondary activation.

After activation, an optional post-activation FiLM may be applied.

Note

Even though the secondary activation is classically chosen to be a sigmoid, it doesn’t need to be. It doesn’t even need to output a value between 0 and 1. The operation is still well-defined.
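The three gating modes can be compared side by side in a small per-frame sketch. This is an illustration rather than the NAM code: Mode and activate are hypothetical names, the activations are passed in as plain functions, and “pre-activation input” in the blended formula is assumed to mean the top half of the summed signal before the primary activation is applied.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

enum class Mode { None, Gated, Blended }; // hypothetical stand-in for GatingMode

using Activation = std::function<float(float)>;

// z holds one frame: b channels for Mode::None, 2*b channels otherwise
// (top half = signal path, bottom half = gate path).
std::vector<float> activate(const std::vector<float>& z, std::size_t b, Mode mode,
                            const Activation& primary, const Activation& secondary)
{
  assert(z.size() == (mode == Mode::None ? b : 2 * b));
  std::vector<float> out(b);
  for (std::size_t c = 0; c < b; ++c)
  {
    const float a = primary(z[c]);
    if (mode == Mode::None)
    {
      out[c] = a;
      continue;
    }
    const float alpha = secondary(z[b + c]);
    out[c] = (mode == Mode::Gated) ? a * alpha                        // element-wise product
                                   : alpha * a + (1.0f - alpha) * z[c]; // weighted blend
  }
  return out;
}
```

Nothing in activate requires the secondary activation to stay within 0 and 1. An identity function works as well as a sigmoid, which matches the note above: the blend simply extrapolates when alpha falls outside that range.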

Activation processing (gated mode example)

if (this->_gating_mode == GatingMode::GATED) {
    auto input_block = this->_z.leftCols(num_frames);
    auto output_block = this->_z.topRows(bottleneck).leftCols(num_frames);
    this->_gating_activation->apply(input_block, output_block);
    if (this->_activation_post_film) {
        this->_activation_post_film->Process(this->_z.topRows(bottleneck), condition, num_frames);
        this->_z.topRows(bottleneck).leftCols(num_frames).noalias() =
            this->_activation_post_film->GetOutput().leftCols(num_frames);
    }
}

Step 5: 1x1 Convolution

A 1x1 convolution reduces the bottleneck channels back to the layer channel count:

1x1 convolution

_1x1.process_(this->_z.topRows(bottleneck), num_frames);
if (this->_1x1_post_film) {
    Eigen::MatrixXf& _1x1_output = this->_1x1.GetOutput();
    this->_1x1_post_film->Process_(_1x1_output, condition, num_frames);
}

Step 6: Head 1x1 (Optional)

If a head1x1 convolution is configured, it processes the activated output for the skip connection:

Head 1x1 processing

if (this->_head1x1) {
    this->_head1x1->process_(this->_z.topRows(bottleneck).leftCols(num_frames), num_frames);
    if (this->_head1x1_post_film) {
        Eigen::MatrixXf& head1x1_output = this->_head1x1->GetOutput();
        this->_head1x1_post_film->Process_(head1x1_output, condition, num_frames);
    }
    this->_output_head.leftCols(num_frames).noalias() =
        this->_head1x1->GetOutput().leftCols(num_frames);
}

Note

If there is no head 1x1, then the output dimension is the same as the activation output dimension (the “bottleneck” dimension). If there is, then the head can project to an arbitrary dimension.

Step 7: Residual and Skip Connections

Finally, the outputs are computed:

Residual Connection: output_next_layer = input + 1x1_output
Skip Connection: output_head = activated_output (or head1x1 output if present)

Residual and skip connections

// Store output to next layer (residual connection)
this->_output_next_layer.leftCols(num_frames).noalias() =
    input.leftCols(num_frames) + _1x1.GetOutput().leftCols(num_frames);

// Store output to head (skip connection)
if (this->_head1x1) {
    this->_output_head.leftCols(num_frames).noalias() =
        this->_head1x1->GetOutput().leftCols(num_frames);
} else {
    this->_output_head.leftCols(num_frames).noalias() =
        this->_z.topRows(bottleneck).leftCols(num_frames);
}

Data Flow Diagram

Data arrays are marked with their dimensions as (channels, frames). Notes:

g=2 if a gating or blending activation is used, and 1 otherwise.
The head output dimension dh is the bottleneck dimension b when no head 1x1 is used; otherwise, it is determined by the head 1x1’s number of output channels.
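The channel bookkeeping in these notes can be written down as a few small functions. LayerDims and the helpers below are hypothetical and exist only to restate the rules for g, b, dx, and dh; using 0 to mean “no head 1x1” is a convention of this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical summary of one layer's channel counts (names follow the notes above).
struct LayerDims
{
  std::size_t dx;          // layer (residual) channels
  std::size_t b;           // bottleneck channels
  bool gated_or_blended;   // true for GATED or BLENDED, false for NONE
  std::size_t head1x1_out; // output channels of the head 1x1; 0 means "no head 1x1"
};

// Channels produced by the dilated conv and the input mixin: g * b.
std::size_t conv_out_channels(const LayerDims& d)
{
  assert(d.b > 0);
  return (d.gated_or_blended ? 2 : 1) * d.b;
}

// Channels after the activation: always b.
std::size_t activation_out_channels(const LayerDims& d) { return d.b; }

// Channels of the 1x1 output and of the residual sum: dx.
std::size_t residual_channels(const LayerDims& d) { return d.dx; }

// Head output channels dh: b without a head 1x1, otherwise the head 1x1's output size.
std::size_t head_channels(const LayerDims& d)
{
  return d.head1x1_out == 0 ? d.b : d.head1x1_out;
}
```

For example, a gated layer with dx = 16 and b = 8 and no head 1x1 has a 16-channel convolution output, an 8-channel activation output, and an 8-channel head output.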

        graph TD
    Input["Input (dx,n)"] --> PreFiLM1{Pre-FiLM?}
    PreFiLM1 -->|Yes| ConvPre[Conv Pre-FiLM]
    PreFiLM1 -->|No| Conv["Dilated Conv (g*b,n)"]
    ConvPre --> Conv
    Conv --> PostFiLM1{Post-FiLM?}
    PostFiLM1 -->|Yes| ConvPost[Conv Post-FiLM]
    PostFiLM1 -->|No| Sum["Sum (g*b,n)"]
    ConvPost --> Sum

    Condition["Condition (dc,n)"] --> PreFiLM2{Pre-FiLM?}
    PreFiLM2 -->|Yes| MixinPre[Input Mixin Pre-FiLM]
    PreFiLM2 -->|No| Mixin["Input Mixin (g*b,n)"]
    MixinPre --> Mixin
    Mixin --> PostFiLM2{Post-FiLM?}
    PostFiLM2 -->|Yes| MixinPost[Input Mixin Post-FiLM]
    PostFiLM2 -->|No| Sum
    MixinPost --> Sum

    Sum --> PreActFiLM{Pre-Act FiLM?}
    PreActFiLM -->|Yes| PreAct[Pre-Activation FiLM]
    PreActFiLM -->|No| Act["Activation (b,n)"]
    PreAct --> Act

    Act --> PostActFiLM{Post-Act FiLM?}
    PostActFiLM -->|Yes| PostActFilm[Post-Activation FiLM]
    PostActFiLM -->|No| PostAct["Post-Activation Output (b,n)"]
    PostActFilm --> PostAct

    PostAct --> Conv1x1["1x1 Conv (dx,n)"]
    Conv1x1 --> Post1x1FiLM{Post-1x1 FiLM?}
    Post1x1FiLM -->|Yes| Post1x1[Post-1x1 FiLM]
    Post1x1FiLM -->|No| Residual["Residual (dx,n)"]
    Post1x1 --> Residual

    Input --> ResidualSum["Residual Sum (dx,n)"]
    Residual --> ResidualSum
    ResidualSum --> LayerOutput["Layer Output (dx,n)"]

    PostAct --> Head1x1{Head 1x1?}
    Head1x1 -->|Yes| HeadConv["Head 1x1 Conv (dh,n)"]
    Head1x1 -->|No| HeadOutput["Head Output (dh,n)"]
    HeadConv --> HeadFiLM{Head FiLM?}
    HeadFiLM -->|Yes| HeadPost[Head Post-FiLM]
    HeadFiLM -->|No| HeadOutput
    HeadPost --> HeadOutput

Layer Computation Flow

LayerArray Computation

A LayerArray chains multiple Layer objects together, processing them sequentially while accumulating their “head outputs” via skip-out connections.

Step 1: Rechanneling

The input is first projected (rechanneled) to match the layer channel count:

Input rechanneling

this->_rechannel.process_(layer_inputs, num_frames);
Eigen::MatrixXf& rechannel_output = _rechannel.GetOutput();

Step 2: Layer Processing

Each layer processes the output of the previous layer:

First Layer: Processes the rechanneled input
Subsequent Layers: Process the residual output from the previous layer
Head Accumulation: Each “head output” is accumulated into the head buffer

Layer processing loop

for (size_t i = 0; i < this->_layers.size(); i++) {
    if (i == 0) {
        // First layer consumes the rechannel output buffer
        this->_layers[i].Process(rechannel_output, condition, num_frames);
    } else {
        // Subsequent layers consume the previous layer's output
        Eigen::MatrixXf& prev_output = this->_layers[i - 1].GetOutputNextLayer();
        this->_layers[i].Process(prev_output, condition, num_frames);
    }

    // Accumulate head output from this layer
    this->_head_inputs.leftCols(num_frames).noalias() +=
        this->_layers[i].GetOutputHead().leftCols(num_frames);
}

Step 3: Head Rechanneling

The accumulated head outputs are projected (rechanneled) to the final output dimension for the layer array:

Head rechanneling

_head_rechannel.process_(this->_head_inputs, num_frames);

LayerArray Structure

        graph TD
    Input[Layer Input] --> Rechannel[Rechannel]
    Rechannel --> Layer1[Layer 1]
    Layer1 --> Layer2[Layer 2]
    Layer2 --> Layer3[Layer 3]
    Layer3 --> LayerN[Layer N]
    Layer1 -->|Skip| HeadAccum[Head Accumulator]
    Layer2 -->|Skip| HeadAccum
    Layer3 -->|Skip| HeadAccum
    LayerN -->|Skip| HeadAccum
    HeadAccum --> HeadRechannel[Head Rechannel]
    HeadRechannel --> HeadOut[Head Output]
    LayerN --> LayerOut[Layer Output]

LayerArray Structure

WaveNet Processing

The complete WaveNet processing pipeline:

Step 1: Condition Processing

If a condition DSP is provided, the input is processed through it to generate the conditioning signal:

Condition processing

void WaveNet::_process_condition(const int num_frames) {
    if (this->_condition_dsp != nullptr) {
        // Process input through condition DSP
        this->_condition_dsp->process(/* input */, /* output */, num_frames);
        // Copy output to condition buffer
    } else {
        // Use input directly as condition
        this->_condition_output = this->_condition_input;
    }
}

The condition module can be a WaveNet, but it can also be something else: a convolution, an RNN, etc.

Step 2: LayerArray Processing

Each LayerArray processes the output of the previous array:

First LayerArray: Processes the input with zeroed head inputs
Subsequent LayerArrays: Process the previous array’s output and accumulate head inputs

LayerArray processing

// First layer array
this->_layer_arrays[0].Process(input, condition, num_frames);

// Subsequent layer arrays
for (size_t i = 1; i < this->_layer_arrays.size(); i++) {
    Eigen::MatrixXf& prev_output = this->_layer_arrays[i-1].GetLayerOutputs();
    Eigen::MatrixXf& prev_head = this->_layer_arrays[i-1].GetHeadOutputs();
    this->_layer_arrays[i].Process(prev_output, condition, prev_head, num_frames);
}
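The way the two signals travel between arrays can be seen in a toy scalar model. Everything here is hypothetical: ToyLayer collapses a whole layer into a single gain, and the rechannel and head rechannel steps are left out. Only the wiring is meant to match the description above: the residual output feeds the next layer, and the head accumulator starts at zero for the first array and is carried forward afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy scalar "layer": one gain stands in for the whole conv/activation stack.
struct ToyLayer
{
  float gain;
  float head(float x) const { return gain * x; }     // skip (head) output
  float next(float x) const { return x + gain * x; } // residual output
};

struct ToyArrayResult
{
  float layer_out; // output of the last layer in the array
  float head_out;  // accumulated head output
};

// Runs the layers in sequence, starting the head accumulator from head_in.
ToyArrayResult run_array(const std::vector<ToyLayer>& layers, float x, float head_in)
{
  float head = head_in;
  for (const ToyLayer& layer : layers)
  {
    head += layer.head(x); // accumulate this layer's head output
    x = layer.next(x);     // residual output feeds the next layer
  }
  return {x, head};
}

// Chains the arrays: the first one sees a zeroed head input, later ones
// receive the previous array's layer output and head output.
ToyArrayResult run_model(const std::vector<std::vector<ToyLayer>>& arrays, float x)
{
  assert(!arrays.empty());
  ToyArrayResult r{x, 0.0f};
  for (const std::vector<ToyLayer>& a : arrays)
    r = run_array(a, r.layer_out, r.head_out);
  return r;
}
```

Only head_out of the final result would go on to head scaling. The final layer_out corresponds to the “(Unused)” node in the flow diagram.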

Step 3: Head Scaling and Output

The final head output from the last LayerArray is scaled and written to output:

Head scaling and output

Eigen::MatrixXf& final_head = this->_layer_arrays.back().GetHeadOutputs();
// Apply head scale and write to output buffers
// (implementation details in wavenet.cpp)

Complete WaveNet Flow

        graph TD
    AudioIn[Audio Input] --> ConditionProc{Condition DSP?}
    ConditionProc -->|Yes| CondDSP[Condition DSP]
    ConditionProc -->|No| Condition[Condition Signal]
    CondDSP --> Condition
    AudioIn --> LayerArray1[LayerArray 1]
    Condition --> LayerArray1
    LayerArray1 -->|LayerN Output| LayerArray2[LayerArray 2]
    LayerArray1 -->|Head Output| LayerArray2
    Condition --> LayerArray2
    LayerArray2 -->|LayerN Output| LayerArrayN[LayerArray N]
    LayerArray2 -->|Head Output| LayerArrayN
    Condition --> LayerArrayN
    LayerArrayN -->|LayerN Output| Unused("(Unused)")
    LayerArrayN -->|Head Output| HeadAccum[Head Accumulator]
    HeadAccum --> HeadScale[Head Scale]
    HeadScale --> AudioOut[Audio Output]

Complete WaveNet Processing Flow
