WaveNet Computation Walkthrough
This document provides a detailed step-by-step explanation of how the NAM WaveNet architecture performs its computations, including the LayerArray and Layer objects that make up the model.
“It’s not really a Wavenet”
The name “WaveNet” is a bit of a misnomer. There are similarities to the architecture from van den Oord et al. (2016): this is a convolutional neural network that repeats a “Layer” motif with skip connections, which gives the good accuracy typical of convnets along with good training stability. But there are a lot of differences.
Here’s a rundown of what’s not exactly the same at an informal level:
The model in NAM is feedforward and used in a “regression” setting; the model from the original paper is autoregressive and used for generative tasks.
The class in NAM actually composes several “Layer array” objects. Each one of these individually is actually far closer to a “WaveNet” in architecture. In other words, this is more like a “stacked WaveNet”.
There are additional skip connections (e.g. input mixin) that aren’t really part of the original WaveNet architecture.
And finally, the actual recipe within the layer has a lot of modifications. The original layer has, roughly, a “convolution-activation-convolution” sequence with a gated activation. Here, the gated activation is optional (and is frequently not used, like in the popular A1 standard/lite/feather/nano configurations).
In v0.4.0, even more modifications have been added: FiLMs, a bottleneck, and an arbitrary “conditioning DSP” module that can be used to embed the input signal in a way that modulates the layers in the main model more effectively. The conditioning module doesn’t need to be a WaveNet, but if it were, the result would feel more like a “cascading (stacked) WaveNet”.
WaveNet Overview
WaveNet is a dilated convolutional neural network architecture designed for audio processing. The model consists of:
Multiple LayerArrays: Each LayerArray contains multiple layers with the same channel configuration
Conditioning: Optional DSP processing of the input generates a conditioning signal, which is “skipped in” to the layers.
Residual and Skip Connections: Information flows through both residual (layer-to-layer) and skip (to head) paths
Computation graphs of the layer, layer array, and full model are below on this page.
Layer Computation
A single Layer performs the core computation of a WaveNet block. The computation proceeds through several stages:
Step 1: Input Convolution
The input first goes through a dilated 1D convolution:
Optional Pre-FiLM: If conv_pre_film is active, the input is modulated by the condition signal before convolution
Dilated Convolution: The input is convolved with a dilated kernel
Optional Post-FiLM: If conv_post_film is active, the convolution output is modulated by the condition signal
Note
Having two FiLM layers bookending the convolution layer is mathematically equivalent to a sort of “rank 1 adaptive LoRA” on the convolution weights.
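The claim in this note can be made concrete with a short derivation. It is a sketch under simplifying assumptions that are not stated in the text above: each FiLM applies a per-channel scale and shift computed from the condition (written here as the illustrative symbols γ and β), and the convolution is reduced to a single tap with weight matrix W and bias c.

```latex
\begin{aligned}
x' &= \gamma_{\text{pre}} \odot x + \beta_{\text{pre}} \\
y  &= W x' + c \\
y' &= \gamma_{\text{post}} \odot y + \beta_{\text{post}} \\
   &= \bigl(W \odot \gamma_{\text{post}} \gamma_{\text{pre}}^{\top}\bigr)\, x
      + \gamma_{\text{post}} \odot \bigl(W \beta_{\text{pre}} + c\bigr) + \beta_{\text{post}}
\end{aligned}
```

Under these assumptions the effective weight is W multiplied elementwise by the rank-1 matrix formed from the two scale vectors, and that matrix changes with the condition on every frame. The modulation is multiplicative rather than the additive update used in LoRA, which is presumably why the note says “a sort of”.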
if (this->_conv_pre_film) {
    this->_conv_pre_film->Process(input, condition, num_frames);
    this->_conv.Process(this->_conv_pre_film->GetOutput(), num_frames);
} else {
    this->_conv.Process(input, num_frames);
}
if (this->_conv_post_film) {
    Eigen::MatrixXf& conv_output = this->_conv.GetOutput();
    this->_conv_post_film->Process_(conv_output, condition, num_frames);
}
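To make the dilated convolution in this step concrete, here is a minimal single-channel sketch. The function `dilated_conv1d` is a hypothetical helper written for this page, not part of NAM. It zero-pads the past, whereas a real-time implementation would presumably carry history across buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of NAM: causal dilated 1D convolution on a
// single channel. Samples before the start of the buffer are treated as zero.
//   y[t] = bias + sum_k w[k] * x[t - (K - 1 - k) * dilation]
std::vector<float> dilated_conv1d(const std::vector<float>& x, const std::vector<float>& w, int dilation,
                                  float bias = 0.0f)
{
  assert(dilation >= 1 && !w.empty());
  const std::ptrdiff_t num_frames = static_cast<std::ptrdiff_t>(x.size());
  const std::ptrdiff_t kernel_size = static_cast<std::ptrdiff_t>(w.size());
  std::vector<float> y(x.size(), bias);
  for (std::ptrdiff_t t = 0; t < num_frames; ++t)
  {
    for (std::ptrdiff_t k = 0; k < kernel_size; ++k)
    {
      // The last tap (k = K - 1) reads the current sample; earlier taps reach
      // back in steps of `dilation`.
      const std::ptrdiff_t src = t - (kernel_size - 1 - k) * dilation;
      if (src >= 0)
        y[t] += w[k] * x[src];
    }
  }
  return y;
}
```

With a kernel of size K, each output sample depends on a window of (K - 1) * dilation + 1 input samples, so increasing the dilation widens the receptive field without adding weights.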
Step 2: Input Mixin
The conditioning input is processed separately and added to the convolution output:
Optional Pre-FiLM: If input_mixin_pre_film is active, the condition is modulated before the mixin convolution
Input Mixin Convolution: A 1x1 convolution processes the condition signal
Optional Post-FiLM: If input_mixin_post_film is active, the mixin output is modulated
if (this->_input_mixin_pre_film) {
    this->_input_mixin_pre_film->Process(condition, condition, num_frames);
    this->_input_mixin.process_(this->_input_mixin_pre_film->GetOutput(), num_frames);
} else {
    this->_input_mixin.process_(condition, num_frames);
}
if (this->_input_mixin_post_film) {
    Eigen::MatrixXf& input_mixin_output = this->_input_mixin.GetOutput();
    this->_input_mixin_post_film->Process_(input_mixin_output, condition, num_frames);
}
Step 3: Sum and Pre-Activation FiLM
The convolution output and input mixin output are summed, and optionally modulated:
this->_z.leftCols(num_frames).noalias() =
    _conv.GetOutput().leftCols(num_frames) + _input_mixin.GetOutput().leftCols(num_frames);
if (this->_activation_pre_film) {
    this->_activation_pre_film->Process_(this->_z, condition, num_frames);
}
Step 4: Activation
The activation stage depends on the gating mode:
- No Gating (GatingMode::NONE)
Simple activation function applied to the summed output.
- Gated (GatingMode::GATED)
The output channels are doubled (2 * bottleneck). The top half goes through the primary activation, the bottom half through a secondary activation (typically sigmoid). The results are multiplied element-wise.
- Blended (GatingMode::BLENDED)
Similar to gated, but instead of multiplication, a weighted blend is performed: output = alpha * activated_input + (1 - alpha) * pre_activation_input where alpha comes from the secondary activation.
After activation, an optional post-activation FiLM may be applied.
Note
Even though the secondary activation is classically chosen to be a sigmoid, it doesn’t need to be. It doesn’t even need to output a value between 0 and 1. The operation is still well-defined.
if (this->_gating_mode == GatingMode::GATED) {
    auto input_block = this->_z.leftCols(num_frames);
    auto output_block = this->_z.topRows(bottleneck).leftCols(num_frames);
    this->_gating_activation->apply(input_block, output_block);
    if (this->_activation_post_film) {
        this->_activation_post_film->Process(this->_z.topRows(bottleneck), condition, num_frames);
        this->_z.topRows(bottleneck).leftCols(num_frames).noalias() =
            this->_activation_post_film->GetOutput().leftCols(num_frames);
    }
}
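The three gating modes described in this step can be compared side by side on a single frame. The sketch below is illustrative only: `activate_frame` is a hypothetical function, the enum is redeclared locally so the block compiles on its own, and the choice of tanh as the primary activation and sigmoid as the secondary is an assumption rather than a requirement of the model.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

enum class GatingMode { NONE, GATED, BLENDED };

// Hypothetical single-frame illustration, not the NAM implementation.
// `z` holds the pre-activation values for one frame: `bottleneck` rows when
// there is no gating, or 2 * bottleneck rows (top half, then bottom half).
std::vector<float> activate_frame(const std::vector<float>& z, std::size_t bottleneck, GatingMode mode)
{
  const std::size_t expected_rows = (mode == GatingMode::NONE) ? bottleneck : 2 * bottleneck;
  assert(z.size() == expected_rows);
  std::vector<float> out(bottleneck);
  for (std::size_t i = 0; i < bottleneck; ++i)
  {
    const float primary = std::tanh(z[i]); // assumed primary activation
    if (mode == GatingMode::NONE)
    {
      out[i] = primary;
      continue;
    }
    // Assumed secondary activation: sigmoid of the matching bottom-half row.
    const float secondary = 1.0f / (1.0f + std::exp(-z[bottleneck + i]));
    if (mode == GatingMode::GATED)
      out[i] = primary * secondary; // element-wise product
    else
      out[i] = secondary * primary + (1.0f - secondary) * z[i]; // weighted blend
  }
  return out;
}
```

In every mode the output has `bottleneck` rows, which is why the later stages only read the top rows of the buffer.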
Step 5: 1x1 Convolution
A 1x1 convolution projects the bottleneck channels back to the layer channel count:
_1x1.process_(this->_z.topRows(bottleneck), num_frames);
if (this->_1x1_post_film) {
    Eigen::MatrixXf& _1x1_output = this->_1x1.GetOutput();
    this->_1x1_post_film->Process_(_1x1_output, condition, num_frames);
}
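A 1x1 convolution has no temporal extent, so it amounts to multiplying every frame by the same weight matrix. The following sketch shows that idea with plain nested vectors. `conv1x1` and the `Matrix` alias are hypothetical names for illustration and do not reflect how NAM stores its weights.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rows are channels and columns are frames for data; for weights, rows are
// output channels and columns are input channels.
using Matrix = std::vector<std::vector<float>>;

// Hypothetical helper, not part of NAM: y[:, n] = weight * x[:, n] + bias.
Matrix conv1x1(const Matrix& weight, const std::vector<float>& bias, const Matrix& x)
{
  assert(!weight.empty() && !x.empty());
  assert(weight.size() == bias.size());
  assert(weight[0].size() == x.size());
  const std::size_t num_frames = x[0].size();
  Matrix y(weight.size(), std::vector<float>(num_frames, 0.0f));
  for (std::size_t out_ch = 0; out_ch < weight.size(); ++out_ch)
  {
    for (std::size_t n = 0; n < num_frames; ++n)
    {
      float acc = bias[out_ch];
      for (std::size_t in_ch = 0; in_ch < x.size(); ++in_ch)
        acc += weight[out_ch][in_ch] * x[in_ch][n];
      y[out_ch][n] = acc;
    }
  }
  return y;
}
```

In the layer, the input would have the bottleneck number of rows and the weight matrix would have as many rows as the layer has channels, so the result can be added to the layer input in the residual connection.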
Step 6: Head 1x1 (Optional)
If a head1x1 convolution is configured, it processes the activated output for the skip connection:
if (this->_head1x1) {
    this->_head1x1->process_(this->_z.topRows(bottleneck).leftCols(num_frames), num_frames);
    if (this->_head1x1_post_film) {
        Eigen::MatrixXf& head1x1_output = this->_head1x1->GetOutput();
        this->_head1x1_post_film->Process_(head1x1_output, condition, num_frames);
    }
    this->_output_head.leftCols(num_frames).noalias() =
        this->_head1x1->GetOutput().leftCols(num_frames);
}
Note
If there is no head 1x1, then the output dimension is the same as the activation output dimension (the “bottleneck” dimension). If there is, then the head can project to an arbitrary dimension.
Step 7: Residual and Skip Connections
Finally, the outputs are computed:
Residual Connection: output_next_layer = input + 1x1_output
Skip Connection: output_head = activated_output (or head1x1 output if present)
// Store output to next layer (residual connection)
this->_output_next_layer.leftCols(num_frames).noalias() =
    input.leftCols(num_frames) + _1x1.GetOutput().leftCols(num_frames);
// Store output to head (skip connection)
if (this->_head1x1) {
    this->_output_head.leftCols(num_frames).noalias() =
        this->_head1x1->GetOutput().leftCols(num_frames);
} else {
    this->_output_head.leftCols(num_frames).noalias() =
        this->_z.topRows(bottleneck).leftCols(num_frames);
}
Data Flow Diagram
Data arrays are marked with their dimensions as (channels, frames). Notes:
g = 2 if a gating or blending activation is used, and 1 otherwise.
The head output dimension dh is the bottleneck dimension b when no head 1x1 is used; otherwise, it is determined by the head 1x1’s number of output channels.
graph TD
Input["Input (dx,n)"] --> PreFiLM1{Pre-FiLM?}
PreFiLM1 -->|Yes| ConvPre[Conv Pre-FiLM]
PreFiLM1 -->|No| Conv["Dilated Conv (g*b,n)"]
ConvPre --> Conv
Conv --> PostFiLM1{Post-FiLM?}
PostFiLM1 -->|Yes| ConvPost[Conv Post-FiLM]
PostFiLM1 -->|No| Sum["Sum (g*b,n)"]
ConvPost --> Sum
Condition["Condition (dc,n)"] --> PreFiLM2{Pre-FiLM?}
PreFiLM2 -->|Yes| MixinPre[Input Mixin Pre-FiLM]
PreFiLM2 -->|No| Mixin["Input Mixin (g*b,n)"]
MixinPre --> Mixin
Mixin --> PostFiLM2{Post-FiLM?}
PostFiLM2 -->|Yes| MixinPost[Input Mixin Post-FiLM]
PostFiLM2 -->|No| Sum
MixinPost --> Sum
Sum --> PreActFiLM{Pre-Act FiLM?}
PreActFiLM -->|Yes| PreAct[Pre-Activation FiLM]
PreActFiLM -->|No| Act["Activation (b,n)"]
PreAct --> Act
Act --> PostActFiLM{Post-Act FiLM?}
PostActFiLM -->|Yes| PostActFilm[Post-Activation FiLM]
PostActFiLM -->|No| PostAct["Post-Activation Output (b,n)"]
PostActFilm --> PostAct
PostAct --> Conv1x1["1x1 Conv (dx,n)"]
Conv1x1 --> Post1x1FiLM{Post-1x1 FiLM?}
Post1x1FiLM -->|Yes| Post1x1[Post-1x1 FiLM]
Post1x1FiLM -->|No| Residual["Residual (dx,n)"]
Post1x1 --> Residual
Input --> ResidualSum["Residual Sum (dx,n)"]
Residual --> ResidualSum
ResidualSum --> LayerOutput["Layer Output (dx,n)"]
PostAct --> Head1x1{Head 1x1?}
Head1x1 -->|Yes| HeadConv["Head 1x1 Conv (dh,n)"]
Head1x1 -->|No| HeadOutput["Head Output (dh,n)"]
HeadConv --> HeadFiLM{Head FiLM?}
HeadFiLM -->|Yes| HeadPost[Head Post-FiLM]
HeadFiLM -->|No| HeadOutput
HeadPost --> HeadOutput
Layer Computation Flow
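The channel counts annotated in the diagram follow a few simple rules, which the sketch below collects in one place. `LayerShapes` and `layer_shapes` are hypothetical names, and using 0 to mean “no head 1x1” is a convention of this sketch only.

```cpp
#include <cassert>

struct LayerShapes
{
  int conv_out;       // rows of the dilated conv and input mixin outputs: g * b
  int activation_out; // rows after the activation: b
  int layer_out;      // rows of the residual output passed to the next layer: dx
  int head_out;       // rows of the head (skip) output: dh
};

// Hypothetical helper, not part of NAM. Pass head1x1_out_channels = 0 when the
// layer has no head 1x1.
LayerShapes layer_shapes(int dx, int b, bool gated_or_blended, int head1x1_out_channels = 0)
{
  assert(dx > 0 && b > 0 && head1x1_out_channels >= 0);
  const int g = gated_or_blended ? 2 : 1;
  const int dh = head1x1_out_channels > 0 ? head1x1_out_channels : b;
  return LayerShapes{g * b, b, dx, dh};
}
```

For example, a layer with dx = 4 and b = 8 that uses gating would have a 16-row convolution output, an 8-row activation output, and an 8-row head output unless a head 1x1 projects it elsewhere.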
LayerArray Computation
A LayerArray chains multiple Layer objects together, processing them sequentially while accumulating their “head outputs” via skip-out connections.
Step 1: Rechanneling
The input is first projected (rechanneled) to match the layer channel count:
this->_rechannel.process_(layer_inputs, num_frames);
Eigen::MatrixXf& rechannel_output = _rechannel.GetOutput();
Step 2: Layer Processing
Each layer processes the output of the previous layer:
First Layer: Processes the rechanneled input
Subsequent Layers: Process the residual output from the previous layer
Head Accumulation: Each “head output” is accumulated into the head buffer
for (size_t i = 0; i < this->_layers.size(); i++) {
    if (i == 0) {
        // First layer consumes the rechannel output buffer
        this->_layers[i].Process(rechannel_output, condition, num_frames);
    } else {
        // Subsequent layers consume the previous layer's output
        Eigen::MatrixXf& prev_output = this->_layers[i - 1].GetOutputNextLayer();
        this->_layers[i].Process(prev_output, condition, num_frames);
    }
    // Accumulate head output from this layer
    this->_head_inputs.leftCols(num_frames).noalias() +=
        this->_layers[i].GetOutputHead().leftCols(num_frames);
}
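The pattern in this loop, where each layer feeds the next through the residual path while its skip output is summed into a shared buffer, can be reproduced with a toy model. Everything in the sketch below is hypothetical: `ToyLayer` replaces the convolution, conditioning, and activation with a single gain, and the signal has one channel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a Layer: one channel, no conditioning.
struct ToyLayer
{
  float gain = 1.0f;
  std::vector<float> next; // residual output, consumed by the next layer
  std::vector<float> head; // skip output, summed into the head accumulator

  void process(const std::vector<float>& input)
  {
    next.resize(input.size());
    head.resize(input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
    {
      const float activated = gain * input[i]; // placeholder for conv + activation
      head[i] = activated; // skip connection: the activated output
      next[i] = input[i] + activated; // residual connection, with an identity "1x1"
    }
  }
};

// Chains the layers and returns the accumulated head buffer. `head_accum` is
// the starting content of the accumulator.
std::vector<float> run_layer_array(std::vector<ToyLayer>& layers, const std::vector<float>& rechanneled,
                                   std::vector<float> head_accum)
{
  assert(head_accum.size() == rechanneled.size());
  const std::vector<float>* current = &rechanneled;
  for (ToyLayer& layer : layers)
  {
    layer.process(*current);
    for (std::size_t i = 0; i < head_accum.size(); ++i)
      head_accum[i] += layer.head[i];
    current = &layer.next; // the next layer reads this layer's residual output
  }
  return head_accum;
}
```

After the loop, the last layer's `next` buffer plays the role of the layer array's output, and the returned accumulator is what would go through the head rechannel.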
Step 3: Head Rechanneling
The accumulated head outputs are projected (rechanneled) to the final output dimension for the layer array:
_head_rechannel.process_(this->_head_inputs, num_frames);
LayerArray Structure
graph TD
Input[Layer Input] --> Rechannel[Rechannel]
Rechannel --> Layer1[Layer 1]
Layer1 --> Layer2[Layer 2]
Layer2 --> Layer3[Layer 3]
Layer3 --> LayerN[Layer N]
Layer1 -->|Skip| HeadAccum[Head Accumulator]
Layer2 -->|Skip| HeadAccum
Layer3 -->|Skip| HeadAccum
LayerN -->|Skip| HeadAccum
HeadAccum --> HeadRechannel[Head Rechannel]
HeadRechannel --> HeadOut[Head Output]
LayerN --> LayerOut[Layer Output]
LayerArray Structure
WaveNet Processing
The complete WaveNet processing pipeline:
Step 1: Condition Processing
If a condition DSP is provided, the input is processed through it to generate the conditioning signal:
void WaveNet::_process_condition(const int num_frames) {
    if (this->_condition_dsp != nullptr) {
        // Process input through condition DSP
        this->_condition_dsp->process(/* input */, /* output */, num_frames);
        // Copy output to condition buffer
    } else {
        // Use input directly as condition
        this->_condition_output = this->_condition_input;
    }
}
The condition module can be a WaveNet, but it can also be something else, such as a convolution or an RNN.
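Because the condition module is interchangeable, one way to picture this step is as an optional callable that falls back to the raw input. The sketch below uses `std::function` for that purpose. `ConditionDsp` and `process_condition` are hypothetical names, and the single-channel signature is a simplification that does not mirror the actual DSP interface.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical type: any callable that maps an input buffer to a condition
// buffer with one value per frame.
using ConditionDsp = std::function<std::vector<float>(const std::vector<float>&)>;

// Hypothetical helper, not part of NAM.
std::vector<float> process_condition(const std::vector<float>& input, const ConditionDsp& condition_dsp)
{
  if (!condition_dsp)
    return input; // no condition DSP: the input itself is the condition
  std::vector<float> condition = condition_dsp(input);
  // This sketch assumes the condition has one value per input frame.
  assert(condition.size() == input.size());
  return condition;
}
```

Any model that produces one output per frame could be wrapped in the callable, which matches the idea that the conditioning module need not be a WaveNet.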
Step 2: LayerArray Processing
Each LayerArray processes the output of the previous array:
First LayerArray: Processes the input with zeroed head inputs
Subsequent LayerArrays: Process the previous array’s output and accumulate head inputs
// First layer array
this->_layer_arrays[0].Process(input, condition, num_frames);
// Subsequent layer arrays
for (size_t i = 1; i < this->_layer_arrays.size(); i++) {
    Eigen::MatrixXf& prev_output = this->_layer_arrays[i-1].GetLayerOutputs();
    Eigen::MatrixXf& prev_head = this->_layer_arrays[i-1].GetHeadOutputs();
    this->_layer_arrays[i].Process(prev_output, condition, prev_head, num_frames);
}
Step 3: Head Scaling and Output
The final head output from the last LayerArray is scaled and written to output:
Eigen::MatrixXf& final_head = this->_layer_arrays.back().GetHeadOutputs();
// Apply head scale and write to output buffers
// (implementation details in wavenet.cpp)
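Steps 2 and 3 together can be sketched with a toy stack: the first array starts from a zeroed head buffer, each later array continues accumulating into the head it receives, and the final head is multiplied by a scale. All names below are hypothetical, and each “array” is reduced to adding an offset, so this only shows how the data is threaded through the stack and nothing about the real computation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ToyArrayResult
{
  std::vector<float> layer_out; // passed to the next array as its input
  std::vector<float> head_out;  // passed to the next array as its head input
};

// Hypothetical stand-in for a LayerArray: shifts the signal by `offset` and
// adds the shifted signal to the incoming head buffer.
ToyArrayResult toy_layer_array(const std::vector<float>& input, const std::vector<float>& head_in, float offset)
{
  assert(input.size() == head_in.size());
  ToyArrayResult result{input, head_in};
  for (std::size_t i = 0; i < input.size(); ++i)
  {
    result.layer_out[i] += offset;
    result.head_out[i] += result.layer_out[i];
  }
  return result;
}

// Runs the arrays in order and returns the scaled head of the last one.
std::vector<float> run_stack(const std::vector<float>& input, const std::vector<float>& offsets, float head_scale)
{
  std::vector<float> signal = input;
  std::vector<float> head(input.size(), 0.0f); // the first array sees zeroed head inputs
  for (float offset : offsets)
  {
    ToyArrayResult result = toy_layer_array(signal, head, offset);
    signal = result.layer_out;
    head = result.head_out;
  }
  // The last array's layer output (`signal`) is not used; only the head is.
  for (float& sample : head)
    sample *= head_scale;
  return head;
}
```

The sketch keeps the head the same size throughout, whereas in the model each array's head rechannel may change the channel count between arrays.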
Complete WaveNet Flow
graph TD
AudioIn[Audio Input] --> ConditionProc{Condition DSP?}
ConditionProc -->|Yes| CondDSP[Condition DSP]
ConditionProc -->|No| Condition[Condition Signal]
CondDSP --> Condition
AudioIn --> LayerArray1[LayerArray 1]
Condition --> LayerArray1
LayerArray1 -->|LayerN Output| LayerArray2[LayerArray 2]
LayerArray1 -->|Head Output| LayerArray2
Condition --> LayerArray2
LayerArray2 -->|LayerN Output| LayerArrayN[LayerArray N]
LayerArray2 -->|Head Output| LayerArrayN
Condition --> LayerArrayN
LayerArrayN -->|LayerN Output| Unused("(Unused)")
LayerArrayN -->|Head Output| HeadAccum[Head Accumulator]
HeadAccum --> HeadScale[Head Scale]
HeadScale --> AudioOut[Audio Output]
Complete WaveNet Processing Flow
See Also
WaveNet API - Complete API reference for WaveNet classes
DSP API - Base DSP interface documentation
Conv1D API - Convolution implementation details