# Custom differentiation

This tutorial will show you how to define your own custom derivatives, perform derivative surgery, and implement your own gradient checkpointing API in just 5 lines of Swift.

## Declaring custom derivatives

You can define custom derivatives for any Swift function that has differentiable parameters and results. By doing that, you can even import a C function and make it differentiable.

import Glibc

func sillyExp(_ x: Float) -> Float {
    let 𝑒 = Float(M_E)
    print("Taking 𝑒(\(𝑒)) to the power of \(x)!")
    return pow(𝑒, x)
}

@derivative(of: sillyExp)
func sillyDerivative(_ x: Float) -> (value: Float, pullback: (Float) -> Float) {
    let y = sillyExp(x)
    return (value: y, pullback: { v in v * y })
}

print("exp(3) =", sillyExp(3))
print("𝛁exp(3) =", gradient(of: sillyExp)(3))

Taking 𝑒(2.7182817) to the power of 3.0!
exp(3) = 20.085535
Taking 𝑒(2.7182817) to the power of 3.0!
𝛁exp(3) = 20.085535
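A hand-written pullback is only as good as the math behind it, so it can help to compare it with a numerical estimate. The following is a minimal sketch in plain Swift that deliberately avoids the differentiation APIs. The names expWithPullback and finiteDifference, and the step size, are illustrative assumptions and not part of any library.

```swift
import Foundation

/// Plain-Swift stand-in for a derivative function: returns the value and a pullback.
/// (Hypothetical helper, mirroring the shape of sillyDerivative above.)
func expWithPullback(_ x: Double) -> (value: Double, pullback: (Double) -> Double) {
    let y = exp(x)
    // The derivative of exp(x) is exp(x), so the pullback scales v by y.
    return (value: y, pullback: { v in v * y })
}

/// Central finite-difference estimate of f'(x). The default step is an illustrative choice.
func finiteDifference(_ f: (Double) -> Double, at x: Double, step: Double = 1e-5) -> Double {
    return (f(x + step) - f(x - step)) / (2 * step)
}

// Seed the pullback with 1 to obtain the derivative at x = 3.
let analytic = expWithPullback(3).pullback(1)
let numeric = finiteDifference({ exp($0) }, at: 3)
print("analytic:", analytic, "numeric:", numeric)
```

If the two numbers disagree by much more than the finite-difference error, the pullback is probably wrong.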



## Stop derivatives from propagating

Commonly known as "stop gradient" in machine learning use cases, the function withoutDerivative(at:) stops derivatives from propagating.

In addition, withoutDerivative(at:) can sometimes help the Swift compiler identify what not to differentiate and produce more efficient derivatives. When the compiler can detect that the derivative of a function will always be zero, it produces a warning. Explicitly using withoutDerivative(at:) silences that warning.

let x: Float = 2.0
let y: Float = 3.0
gradient(at: x, y) { x, y in
    sin(sin(sin(x))) + withoutDerivative(at: cos(cos(cos(y))))
}

▿ 2 elements

- .0 : -0.18009877
- .1 : 0.0
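The two gradient components above can be reproduced by hand with the chain rule. This is a sketch in plain Swift using Double instead of Float, so the last digits may differ slightly from the printed output. The variable names are illustrative only.

```swift
import Foundation

let x = 2.0
let y = 3.0

// Chain rule for sin(sin(sin(x))): multiply the cosines of each nested argument.
let dx = cos(sin(sin(x))) * cos(sin(x)) * cos(x)

// The cos(cos(cos(y))) term is wrapped in withoutDerivative(at:), so it is treated
// as a constant and contributes nothing to the gradient with respect to y.
let dy = 0.0

// For comparison: the derivative of cos(cos(cos(y))) if it had NOT been stopped.
let dyUnstopped = -sin(cos(cos(y))) * sin(cos(y)) * sin(y)

print("dx =", dx, "dy =", dy, "dy without stop =", dyUnstopped)
```

The value of dx agrees with the first printed component, and the nonzero dyUnstopped shows what withoutDerivative(at:) removed.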



## Derivative surgery

The method withDerivative(_:) runs arbitrary operations (including mutation) on the gradient at a value during the enclosing function's backpropagation.

Use this to debug or make experimental tweaks to backpropagation.
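Conceptually, withDerivative(_:) behaves like an identity function whose pullback hands the incoming derivative to your closure before passing it on. The sketch below models that idea in plain Swift without the differentiation APIs. The names identityWithHook and gradientWithSurgery are hypothetical, and the reverse pass for sin(x) + log(x) is written out by hand.

```swift
import Foundation

/// Hypothetical stand-in for withDerivative(_:): an identity function whose pullback
/// lets `body` inspect or mutate the incoming derivative before returning it.
func identityWithHook(
    _ x: Double,
    _ body: @escaping (inout Double) -> Void
) -> (value: Double, pullback: (Double) -> Double) {
    return (value: x, pullback: { v in
        var v = v
        body(&v)
        return v
    })
}

/// Hand-written reverse pass for f(x) = sin(x) + log(x), where the derivative
/// that reaches x through log is overwritten with 0.5.
func gradientWithSurgery(at x: Double) -> Double {
    let hookedX = identityWithHook(x) { dx in dx = 0.5 }
    // Backward pass, seeded with 1 for each term of the sum.
    let throughSin = cos(x) * 1
    // log's pullback yields 1 / x, which the hook then rewrites.
    let throughLog = hookedX.pullback(1 / hookedX.value)
    return throughSin + throughLog
}

let result = gradientWithSurgery(at: 30)
print(result)
```

With x = 30 this gives cos(30) + 0.5, which is the same quantity the Float example in this section reports after rewriting the derivative to 0.5.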

### It works anywhere

All differentiation APIs provided by the standard library are defined generically over all types that conform to the Differentiable protocol: Float, Double, Float80, SIMD vectors, and even your own types!

Read the technical document Differentiable Types for more insights on the Differentiable protocol.

var x: Float = 30
gradient(at: x) { x -> Float in
    // Print the partial derivative with respect to the result of sin(x).
    let a = sin(x).withDerivative { print("∂+/∂sin = \($0)") }
    // Force the partial derivative with respect to x to be 0.5.
    let b = log(x.withDerivative { (dx: inout Float) in
        print("∂log/∂x = \(dx), but rewritten to 0.5")
        dx = 0.5
    })
    return a + b
}

∂log/∂x = 0.033333335, but rewritten to 0.5
∂+/∂sin = 1.0

0.65425146

### Use it in a neural network module

Just like how we used it in a simple Float function, we can use it in any numerical application, like the following neural network built using the Swift for TensorFlow Deep Learning Library.

import TensorFlow

struct MLP: Layer {
    var layer1 = Dense<Float>(inputSize: 2, outputSize: 10, activation: relu)
    var layer2 = Dense<Float>(inputSize: 10, outputSize: 1, activation: relu)

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        let h0 = layer1(input).withDerivative { print("∂L/∂layer1 =", $0) }
        return layer2(h0)
    }
}

var classifier = MLP()
let optimizer = SGD(for: classifier, learningRate: 0.02)

let x: Tensor<Float> = [[0, 0], [0, 1], [1, 0], [1, 1]]
let y: Tensor<Float> = [0, 1, 1, 0]

for _ in 0..<10 {
    let 𝛁model = gradient(at: classifier) { classifier -> Tensor<Float> in
        let ŷ = classifier(x).withDerivative { print("∂L/∂ŷ =", $0) }
        let loss = (ŷ - y).squared().mean()
        print("Loss: \(loss)")
        return loss
    }
    optimizer.update(&classifier, along: 𝛁model)
}

Loss: 0.5
∂L/∂ŷ = [[-0.25],
[-0.25],
[-0.25],
[-0.25]]
∂L/∂layer1 = [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
Loss: 0.5
∂L/∂ŷ = [[-0.25],
[-0.25],
[-0.25],
[-0.25]]
∂L/∂layer1 = [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
Loss: 0.5
∂L/∂ŷ = [[-0.25],
[-0.25],
[-0.25],
[-0.25]]
∂L/∂layer1 = [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
Loss: 0.5
∂L/∂ŷ = [[-0.25],
[-0.25],
[-0.25],
[-0.25]]
∂L/∂layer1 = [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
Loss: 0.5
∂L/∂ŷ = [[-0.25],
[-0.25],
[-0.25],
[-0.25]]
∂L/∂layer1 = [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
Loss: 0.5
∂L/∂ŷ = [[-0.25],
[-0.25],
[-0.25],
[-0.25]]
∂L/∂layer1 = [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
Loss: 0.5
∂L/∂ŷ = [[-0.25],
[-0.25],
[-0.25],
[-0.25]]
∂L/∂layer1 = [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
Loss: 0.5
∂L/∂ŷ = [[-0.25],
[-0.25],
[-0.25],
[-0.25]]
∂L/∂layer1 = [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
Loss: 0.5
∂L/∂ŷ = [[-0.25],
[-0.25],
[-0.25],
[-0.25]]
∂L/∂layer1 = [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
Loss: 0.5
∂L/∂ŷ = [[-0.25],
[-0.25],
[-0.25],
[-0.25]]
∂L/∂layer1 = [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]



## Recomputing activations during backpropagation to save memory (checkpointing)

Checkpointing is a traditional technique in reverse-mode automatic differentiation for saving memory. Rather than saving large intermediate values in the original computation for computing derivatives, the intermediate values are instead recomputed as needed during backpropagation.

This technique has been realized in modern deep learning libraries as well. In Swift, the withRecomputationInPullbacks(_:) API lets you control what to recompute during backpropagation, and it is available on all Differentiable types.

But today, let us learn how to define our own gradient checkpointing APIs from scratch, in just a few lines of code.

We can define our own gradient checkpointing API, makeRecomputedInGradient(_:), in terms of the standard library function differentiableFunction(from:), which is a shorthand for creating a differentiable function directly from a derivative function (also called a "vector-Jacobian product (VJP) function").

As we have seen before, the derivative function returns a tuple of the original function's result and a pullback closure. We return original(x) in value:, and call pullback(at:in:) on original to evaluate the original function again and get a pullback.

/// Given a differentiable function, returns the same differentiable function except when
/// derivatives of this function are being computed. In that case, values in the original function needed
/// for computing the derivatives will be recomputed, instead of being captured by the differential or pullback.
///
/// - Parameter original: The differentiable function to wrap.
/// - Returns: The same differentiable function whose derivatives, when computed, will recompute
///   some values from the original function.
func makeRecomputedInGradient<T: Differentiable, U: Differentiable>(
    _ original: @escaping @differentiable (T) -> U
) -> @differentiable (T) -> U {
    return differentiableFunction { x in
        (value: original(x), pullback: { v in pullback(at: x, in: original)(v) })
    }
}
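To see the recomputation idea in isolation, here is a minimal sketch in plain Swift that models derivative functions as ordinary closures returning a value and a pullback. It does not use the differentiation APIs. DerivativeFunction, recomputedInGradient and runDemo are hypothetical names chosen for illustration.

```swift
/// A derivative function in the tutorial's sense: the result together with a pullback.
typealias DerivativeFunction = (Double) -> (value: Double, pullback: (Double) -> Double)

/// Hypothetical plain-Swift analogue of makeRecomputedInGradient(_:).
/// The returned pullback captures only `x` and re-runs `original` when it is called,
/// instead of keeping the pullback produced during the forward pass.
func recomputedInGradient(_ original: @escaping DerivativeFunction) -> DerivativeFunction {
    return { x in
        return (value: original(x).value, pullback: { v in original(x).pullback(v) })
    }
}

/// Runs one forward and one backward pass and reports how often the body ran.
func runDemo() -> (afterForward: Int, afterBackward: Int, gradient: Double) {
    var evaluations = 0
    let square: DerivativeFunction = { x in
        evaluations += 1
        return (value: x * x, pullback: { v in 2 * x * v })
    }
    let checkpointed = recomputedInGradient(square)

    let (_, pullback) = checkpointed(3)   // forward pass: the body runs once
    let afterForward = evaluations
    let gradient = pullback(1)            // backward pass: the body runs again
    return (afterForward, evaluations, gradient)
}

let demo = runDemo()
print(demo)
```

The body runs once in the forward pass and once more when the pullback is called, which is the trade of extra compute for lower memory that checkpointing makes.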


### Verify it works

let input: Float = 10.0
print("Running original computation...")

// Differentiable multiplication with checkpointing.
let square = makeRecomputedInGradient { (x: Float) -> Float in
    print("  Computing square...")
    return x * x
}

// Differentiate f(x) = (cos(x))^2.
let (output, backprop) = valueWithPullback(at: input) { input -> Float in
    return square(cos(input))
}
print("Running backpropagation...")
// Calling the pullback triggers the recomputation of square.
let grad = backprop(1)

Running original computation...
  Computing square...
Running backpropagation...
  Computing square...



### Extend it to neural network modules

In this example, we define a simple convolutional neural network.

struct Model: Layer {
    var conv = Conv2D<Float>(filterShape: (5, 5, 3, 6))
    var maxPool = MaxPool2D<Float>(poolSize: (2, 2), strides: (2, 2))
    var flatten = Flatten<Float>()
    var dense = Dense<Float>(inputSize: 36 * 6, outputSize: 10)

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        return input.sequenced(through: conv, maxPool, flatten, dense)
    }
}


We want the activations of the convolution layer (conv) to be recomputed during backpropagation. However, using makeRecomputedInGradient(_:) could make the resulting code look cumbersome, especially when we want to apply layers sequentially using sequenced(through:_:_:_:).

input.sequenced(through: conv, maxPool, flatten, dense)


So, why don't we define a special layer type that wraps a layer and makes its activations be recomputed during backpropagation? Let's do it.

First, we define a makeRecomputedInGradient(_:) function that takes a binary function.

// Same as the previous makeRecomputedInGradient(_:), except it's for binary functions.
func makeRecomputedInGradient<T: Differentiable, U: Differentiable, V: Differentiable>(
    _ original: @escaping @differentiable (T, U) -> V
) -> @differentiable (T, U) -> V {
    return differentiableFunction { x, y in
        (value: original(x, y), pullback: { v in pullback(at: x, y, in: original)(v) })
    }
}


Then, we define a generic layer ActivationDiscarding<Wrapped>.

import TensorFlow

/// A layer wrapper that makes the underlying layer's activations be discarded during application
/// and recomputed during backpropagation.
struct ActivationDiscarding<Wrapped: Layer>: Layer {
    /// The wrapped layer.
    var wrapped: Wrapped

    @differentiable
    func callAsFunction(_ input: Wrapped.Input) -> Wrapped.Output {
        let apply = makeRecomputedInGradient { (layer: Wrapped, input: Input) -> Wrapped.Output in
            print("    Applying \(Wrapped.self) layer...")
            return layer(input)
        }
        return apply(wrapped, input)
    }
}


Finally, we can add a method on all layers that returns the same layer except its activations are discarded during application and recomputed during backpropagation.

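extension Layer {
    func discardingActivations() -> ActivationDiscarding<Self> {
        return ActivationDiscarding(wrapped: self)
    }
}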


Back in the model, all we have to change is to wrap the convolution layer into the activation-recomputing layer.

var conv = Conv2D<Float>(filterShape: (5, 5, 3, 6)).discardingActivations()


Now, simply use it in the model!

struct Model: Layer {
    var conv = Conv2D<Float>(filterShape: (5, 5, 3, 6)).discardingActivations()
    var maxPool = MaxPool2D<Float>(poolSize: (2, 2), strides: (2, 2))
    var flatten = Flatten<Float>()
    var dense = Dense<Float>(inputSize: 36 * 6, outputSize: 10)

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        return input.sequenced(through: conv, maxPool, flatten, dense)
    }
}


When we run a training loop, we can see that the convolution layer's activations are computed twice: once during layer application, and once during backpropagation.

// Use random training data.
let x = Tensor<Float>(randomNormal: [10, 16, 16, 3])
let y = Tensor<Int32>(rangeFrom: 0, to: 10, stride: 1)

var model = Model()
let opt = SGD(for: model)

for i in 1...5 {
    print("Starting training step \(i)")
    print("  Running original computation...")
    let (logits, backprop) = model.appliedForBackpropagation(to: x)
    let (loss, dL_dŷ) = valueWithGradient(at: logits) { logits in
        softmaxCrossEntropy(logits: logits, labels: y)
    }
    print("  Loss: \(loss)")
    print("  Running backpropagation...")
    let (dL_dθ, _) = backprop(dL_dŷ)

    opt.update(&model, along: dL_dθ)
}

Starting training step 1
  Running original computation...
    Applying Conv2D<Float> layer...
  Loss: 3.2293372
  Running backpropagation...
    Applying Conv2D<Float> layer...
Starting training step 2
  Running original computation...
    Applying Conv2D<Float> layer...
  Loss: 2.8945909
  Running backpropagation...
    Applying Conv2D<Float> layer...
Starting training step 3
  Running original computation...
    Applying Conv2D<Float> layer...
  Loss: 2.6050858
  Running backpropagation...
    Applying Conv2D<Float> layer...
Starting training step 4
  Running original computation...
    Applying Conv2D<Float> layer...
  Loss: 2.3532183
  Running backpropagation...
    Applying Conv2D<Float> layer...
Starting training step 5
  Running original computation...
    Applying Conv2D<Float> layer...
  Loss: 2.131448
  Running backpropagation...
    Applying Conv2D<Float> layer...



Just like that, it is super easy to define generic differentiable programming libraries for different domains.