
Commit 6707c6f

update to say Mooncake not Tapir nor Taped
1 parent 871d940 commit 6707c6f

1 file changed: docs/src/tutorials/gradient_zoo.md (+61 −23 lines)
@@ -5,6 +5,9 @@ also known as reverse-mode automatic differentiation.
 Given a model, some data, and a loss function, this answers the question
 "what direction, in the space of the model's parameters, reduces the loss fastest?"
 
+This page is a brief overview of ways to perform automatic differentiation in Julia,
+and how they relate to Flux.
+
 ### `gradient(f, x)` interface
 
 Julia's ecosystem has many versions of `gradient(f, x)`, which evaluates `y = f(x)` then returns `∂y_∂x`. The details of how they do this vary, but the interface is similar. An incomplete list is (alphabetically):
@@ -21,11 +24,11 @@ julia> ForwardDiff.gradient(x -> sum(sqrt, x), [1 4 16.])
 1×3 Matrix{Float64}:
  0.5  0.25  0.125
 
-julia> ReverseDiff.gradient(x -> sum(sqrt, x), [1 4 16.])
+julia> DifferentiationInterface.gradient(x -> sum(sqrt, x), AutoMooncake(; config=nothing), [1 4 16.])
 1×3 Matrix{Float64}:
  0.5  0.25  0.125
 
-julia> DifferentiationInterface.gradient(x -> sum(sqrt, x), AutoTapir(), [1 4 16.])
+julia> ReverseDiff.gradient(x -> sum(sqrt, x), [1 4 16.])
 1×3 Matrix{Float64}:
  0.5  0.25  0.125
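
For comparison, Zygote (the AD package which Flux loads by default, as noted further down this page) follows the same interface. A small hedged sketch, not part of the diff excerpt above; note that Zygote returns a tuple with one entry per argument of `f`:

```julia
using Zygote

Zygote.gradient(x -> sum(sqrt, x), [1 4 16.])   # returns ([0.5 0.25 0.125],), a 1-tuple
```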

@@ -64,7 +67,7 @@ julia> model = Chain(Embedding(reshape(1:6, 2,3) .+ 0.0), softmax)
 Chain(
   Embedding(3 => 2),                  # 6 parameters
   NNlib.softmax,
-)
+)
 
 julia> model.layers[1].weight  # this is the wrapped parameter array
 2×3 Matrix{Float64}:
@@ -90,11 +93,11 @@ julia> _, grads_t = Tracker.withgradient(loss, model)
 julia> grads_d = Diffractor.gradient(loss, model)
 (Tangent{Chain{Tuple{Embedding{Matrix{Float64}}, typeof(softmax)}}}(layers = (Tangent{Embedding{Matrix{Float64}}}(weight = [-0.18171549534589682 0.0 0.0; 0.18171549534589682 0.0 0.0],), ChainRulesCore.NoTangent()),),)
 
-julia> grad_e = Enzyme.gradient(Reverse, loss, model)
-Chain(
-  Embedding(3 => 2),                  # 6 parameters
-  NNlib.softmax,
-)
+julia> grads_e = Enzyme.gradient(Reverse, loss, model)
+(Chain(Embedding(3 => 2), softmax),)
+
+julia> grad_m = DifferentiationInterface.gradient(loss, AutoMooncake(; config=nothing), model)
+Mooncake.Tangent{@NamedTuple{layers::Tuple{Mooncake.Tangent{@NamedTuple{weight::Matrix{Float64}}}, Mooncake.NoTangent}}}((layers = (Mooncake.Tangent{@NamedTuple{weight::Matrix{Float64}}}((weight = [-0.18171549534589682 0.0 0.0; 0.18171549534589682 0.0 0.0],)), Mooncake.NoTangent()),))
 ```
 
 While the type returned for `∂loss_∂model` varies, they all have the same nested structure, matching that of the model. This is all that Flux needs.
@@ -105,10 +108,12 @@ julia> grads_z[1].layers[1].weight # Zygote's gradient for model.layers[1].weig
  -0.181715   0.0  0.0
   0.181715   0.0  0.0
 
-julia> grad_e.layers[1].weight  # Enzyme's gradient for the same weight matrix
+julia> grads_e[1].layers[1].weight  # Enzyme's gradient for the same weight matrix
 2×3 Matrix{Float64}:
  -0.181715   0.0  0.0
   0.181715   0.0  0.0
+
+julia> ans ≈ grad_m.fields.layers[1].fields.weight  # Mooncake seems to differ?
 ```
 
 Here's Flux updating the model using each gradient:
@@ -128,7 +133,9 @@ julia> model_z.layers[1].weight # updated weight matrix
 
 julia> model_e = deepcopy(model);
 
-julia> Flux.update!(opt_state, model_e, grad_e)[2][1].weight  # same update
+julia> Flux.update!(opt_state, model_e, grads_e[1]);
+
+julia> model_e.layers[1].weight  # same update
 2×3 Matrix{Float64}:
  1.06057  3.0  5.0
  1.93943  4.0  6.0
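
The `opt_state` passed to `Flux.update!` above is the optimiser state tree built by `Flux.setup` earlier in the tutorial. A minimal sketch of that setup, with a hypothetical optimisation rule and learning rate (the tutorial's actual choice is not shown in this excerpt):

```julia
using Flux

# Hypothetical choice of rule and learning rate; any rule re-exported by Flux works the same way.
opt_state = Flux.setup(Descent(0.1), model)   # one state leaf per trainable array in `model`

# Afterwards, any of the gradients above is applied with Flux.update!(opt_state, model, grad).
```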
@@ -142,22 +149,40 @@ In this case they are all identical, but there are some caveats, explored below.
 
 Both Zygote and Tracker were written for Flux, and at present, Flux loads Zygote and exports `Zygote.gradient`, and calls this within `Flux.train!`. But apart from that, there is very little coupling between Flux and the automatic differentiation package.
 
-This page has very brief notes on how all these packages compare, as a guide for anyone wanting to experiment with them. We stress "experiment" since Zygote is (at present) by far the best-tested. All notes are from February 2024,
+This page has very brief notes on how all these packages compare, as a guide for anyone wanting to experiment with them. We stress "experiment" since Zygote is (at present) by far the best-tested. All notes are from February 2024,
 
 ### [Zygote.jl](https://github.com/FluxML/Zygote.jl/issues)
 
-Reverse-mode source-to-source automatic differentiation, written by hooking into Julia's compiler.
+Reverse-mode source-to-source automatic differentiation, written by hooking into Julia's compiler.
 
 * By far the best-tested option for Flux models.
 
 * Long compilation times, on the first call.
 
 * Allows mutation of structs, but not of arrays. This leads to the most common error... sometimes this happens because you mutate an array, often because you call some function which, internally, creates the array it wants to return & then fills it in.
 
-* Custom rules via `ZygoteRules.@adjoint` or better, `ChainRulesCore.rrule`.
+```julia
+function mysum2(x::AbstractMatrix)  # implements y = vec(sum(x; dims=2))
+    y = zeros(eltype(x), size(x,1))  # start from zero
+    for col in eachcol(x)
+        y .+= col  # mutates y, Zygote will not allow this
+    end
+    return y
+end
+
+Zygote.jacobian(x -> sum(x; dims=2).^2, Float32[1 2 3; 4 5 6])[1]  # returns a 2×6 Matrix
+Zygote.jacobian(x -> mysum2(x).^2, Float32[1 2 3; 4 5 6])[1]  # ERROR: Mutating arrays is not supported
+```
 
-* Returns nested NamedTuples and Tuples, and uses `nothing` to mean zero. Does not track shared arrays, hence may return different contributions
+* Custom rules via `ZygoteRules.@adjoint` or (equivalently) `ChainRulesCore.rrule` (see the sketch after this list).
 
+* Returns nested NamedTuples and Tuples, and uses `nothing` to mean zero.
+
+* Does not track shared arrays, hence may return different contributions.
+
+```julia
+# A sketch of the shared-array point: `x.a` and `x.b` below are the *same* array,
+# but Zygote reports two separate contributions rather than one accumulated gradient.
+v = [1.0, 2.0]
+Zygote.gradient(x -> sum(x.a) + 2 * sum(x.b), (a = v, b = v))[1]
+# (a = [1.0, 1.0], b = [2.0, 2.0])
+```
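
The custom-rules bullet above mentions `ChainRulesCore.rrule`. Here is a minimal, hedged sketch of what such a rule looks like, for a hypothetical function `twice` (not something defined in this tutorial):

```julia
using ChainRulesCore

twice(x::Real) = 2x   # hypothetical function we want to give a hand-written reverse rule

function ChainRulesCore.rrule(::typeof(twice), x::Real)
    y = twice(x)
    twice_pullback(ȳ) = (NoTangent(), 2 * ȳ)   # no tangent for `twice` itself; d(2x)/dx = 2
    return y, twice_pullback
end

# Zygote picks up ChainRules rules automatically:
# Zygote.gradient(twice, 3.0)  # (2.0,)
```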
 
 !!! compat "Deprecated: Zygote's implicit mode"
     Flux's default used to work like this, instead of using deeply nested trees for gradients as above:
@@ -194,7 +219,7 @@ julia> model_tracked = Flux.fmap(x -> x isa Array ? Tracker.param(x) : x, model)
 Chain(
   Embedding(3 => 2),                  # 6 parameters
   NNlib.softmax,
-)
+)
 
 julia> val_tracked = loss(model_tracked)
 0.6067761f0 (tracked)
@@ -230,7 +255,23 @@ New package which works on the LLVM code which Julia compiles down to.
 
 * Returns another struct of the same type as the model, such as `Chain` above. Non-differentiable objects are left alone, not replaced by a zero.
 
-### [Tapir.jl](https://github.com/withbayes/Tapir.jl)
+Enzyme likes to work in-place, with objects and their gradients stored together in a `Duplicated(x, dx)`.
+Flux has an interface which uses this:
+```julia
+julia> Flux.train!((m,x) -> sum(abs2, m(1)), model, 1:1, opt_state)  # train! with Zygote
+
+julia> Flux.train!((m,x) -> sum(abs2, m(1)), Duplicated(model), 1:1, opt_state)  # train! with Enzyme
+```
+and
+```julia
+julia> grads_e2 = Flux.gradient(loss, Duplicated(model))
+((layers = ((weight = [-0.18171549534589682 0.0 0.0; 0.18171549534589682 0.0 0.0],), nothing),),)
+
+julia> Flux.withgradient(loss, Duplicated(model))
+(val = 0.5665111155481435, grad = ((layers = ((weight = [-0.15810298866515066 0.0 0.0; 0.1581029886651505 0.0 0.0],), nothing),),))
+```
273+
274+
### [Mooncake.jl](https://github.com/compintell/Mooncake.jl)
234275

235276
Another new AD to watch. Many similariries in its approach to Enzyme.jl, but operates all in Julia.
236277

@@ -262,11 +303,11 @@ Another Julia source-to-source reverse-mode AD.
 
 ### [ForwardDiff.jl](https://github.com/JuliaDiff/ForwardDiff.jl)
 
-Forward mode is a different algorithm...
+Forward mode is a different algorithm...
 
-* Needs a flat vector
+* Needs a simple array of parameters, i.e. supports only `gradient(f, x::AbstractArray{<:Real})` (see the sketch below).
 
-* Forward mode is generally not what you want!
+* Forward mode is generally not what you want for neural networks! It's ideal for ``ℝ → ℝᴺ`` functions, but the wrong algorithm for ``ℝᴺ → ℝ``.
 
 * `gradient(f, x)` will call `f(x)` multiple times. Layers like `BatchNorm` with state may get confused.
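
On the "simple array of parameters" point above: with Flux, the usual way to obtain such a flat vector is `Flux.destructure`. A hedged sketch, reusing the `model` and `loss` defined earlier in the tutorial:

```julia
using Flux, ForwardDiff

flat, rebuild = Flux.destructure(model)   # flat::Vector of all parameters, plus a function to rebuild the model

grad_fd = ForwardDiff.gradient(p -> loss(rebuild(p)), flat)   # a plain Vector, same length as flat
```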

@@ -316,7 +357,4 @@ This year's new attempt to build a simpler one?
 
 Really `rrule_via_ad` is another mechanism, but only for 3 systems.
 
-Sold as an attempt at unification, but its design of extensible `rrule`s turned out to be too closely tied to Zygote/Diffractor style AD, and not a good fit for Enzyme/Tapir which therefore use their own rule systems. Also not a natural fit for Tracker/ReverseDiff/ForwardDiff style of operator overloading AD.
-
-
-
+Sold as an attempt at unification, but its design of extensible `rrule`s turned out to be too closely tied to Zygote/Diffractor style AD, and not a good fit for Enzyme/Mooncake which therefore use their own rule systems. Also not a natural fit for Tracker/ReverseDiff/ForwardDiff style of operator overloading AD.
