Differentiating through sort


Differentiable programming is useful in machine learning research because it allows efficient optimization of any desired parameters. Forward- and reverse-mode automatic differentiation are the two major paradigms for computing these derivatives, and differentiable programming is often framed in terms of functional programming. In this post, I will focus on reverse-mode automatic differentiation, because forward mode is trivial (dual numbers) and inefficient for functions from many variables to few, e.g. f:\mathbb{R}^n\rightarrow \mathbb R.

I was curious how derivatives for the more obscure/less functional-appearing operations are determined; one such operation is sort. Assume we have an array A with elements A_i, and we perform the operation B = \text{sort}(A). The question is how to (efficiently) find the Jacobian

$$J_{ij}=\frac{\partial B_i}{\partial A_j}$$

Without trying to dig through the source code of JAX or another automatic differentiation framework, here are the observations I made. First, sorting an array is equivalent to identifying a particular permutation of its elements. In addition to the traditional jax.numpy.sort operation, there is jax.numpy.argsort, which finds precisely the list of indices specifying that permutation. When we perform reverse-mode automatic differentiation, we are given a computational graph in which each node represents g(f(x)); given g' and f(x), we use the chain rule to compute (g\circ f)'(x)=g'(f(x))f'(x) (technically, we keep track of the adjoint rather than the full Jacobian). In multiple dimensions we must use the multivariate chain rule, so there is an implicit sum/matrix multiplication.
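In matrix form, each backward step applies the transposed Jacobian of a node to the incoming adjoint \bar B to produce \bar A:

$$\bar A_j=\sum_i \frac{\partial B_i}{\partial A_j}\bar B_i,\qquad \bar A=J^\top \bar B$$

For a permutation matrix, J^\top=J^{-1}, which is why inverting the sorting permutation is all we will need below.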

In the case of f(A)=\text{sort}(A)=B, each derivative \frac{\partial B_i}{\partial A_j} is 1 or 0: it equals 1 when A_j is the i^{th} component of the sorted array and 0 otherwise.
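As a quick sanity check (my own example, with arbitrary values), the full Jacobian of jnp.sort is indeed a permutation matrix:

```python
import jax
import jax.numpy as jnp

A = jnp.array([3.0, 1.0, 2.0])
J = jax.jacobian(jnp.sort)(A)  # row i, column j: dB_i/dA_j
print(J)
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]
```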

Since the derivative is so simple, we just need to propagate the gradient from each sorted element B_j back to the corresponding element of A, taking care of the chain rule. In other words, if A_i is sorted to position j, then the gradient at B_j is copied back to A_i. We thus need to invert the permutation provided by jax.numpy.argsort. How do we invert a sort? The trick is to argsort the argsort (https://stackoverflow.com/questions/9185768/inverting-permutations-in-python)!

Assume we have already argsorted A, which returns indices I. Argsorting I gives the permutation that sorts I back into [0,1,2,\cdots], the original order of the indices of A; therefore \text{argsort}(I) must be the inverse of the permutation represented by I.
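For instance (a small example of my own):

```python
import jax.numpy as jnp

A = jnp.array([3.0, 1.0, 2.0])
I = jnp.argsort(A)    # [1 2 0]: index into A of each sorted element
inv = jnp.argsort(I)  # [2 0 1]: where each element of A lands after sorting
print(I[inv])         # [0 1 2], the identity: inv undoes I
```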

I implemented the forward and backward passes of jax.numpy.sort manually.

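A minimal sketch of what this looks like with jax.custom_vjp (my reconstruction, so details may differ from the original code):

```python
import jax
import jax.numpy as jnp

@jax.custom_vjp
def my_sort(a):
    return jnp.sort(a)

def my_sort_fwd(a):
    perm = jnp.argsort(a)  # the permutation that sorts a
    return a[perm], perm   # primal output plus residual for the backward pass

def my_sort_bwd(perm, g):
    inv = jnp.argsort(perm)  # argsort the argsort: the inverse permutation
    return (g[inv],)         # route each cotangent back to its source element

my_sort.defvjp(my_sort_fwd, my_sort_bwd)
```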

 

After JITting the appropriate functions, the timing precisely matches the built-in jax.numpy.sort. The gradients are also correct (up to some transpositional/dimensional bookkeeping). This strongly suggests that we have matched the reference implementation for differentiating sort.
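For example (my own check, continuing from the sketch above), the gradients agree with the built-in:

```python
# Weight the sorted outputs so the gradient exposes where each input lands.
A = jnp.array([3.0, 1.0, 2.0])
loss = lambda s: lambda x: jnp.sum(s(x) * jnp.arange(3.0))
print(jax.grad(loss(my_sort))(A))   # [2. 0. 1.]
print(jax.grad(loss(jnp.sort))(A))  # [2. 0. 1.]
```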


I thought this was a neat trick for coercing what is typically highly branched and seemingly imperative code (sorting being a prototypical example) into functional programming form. Essentially, argsort acts as a kind of involution: argsorting a permutation twice recovers the original permutation, though only the ordering of the collection is preserved unless you explicitly keep track of the elements as well.

Code: https://github.com/mikesha2/diffsort/tree/main

Lancaster and Blundell Chapter 26

(26.1) Consider \hat\phi_A^\dagger,\hat\phi_B^\dagger such that $$[\hat Q_N,\hat \phi_A^\dagger]=\hat \phi_B^\dagger$$ where \hat Q_N is a generator of a symmetry group with [\hat Q_N,\hat H]=0. Show that e^{i\alpha \hat Q_N}|0\rangle=|0\rangle.

$$\hat H e^{i\alpha\hat Q_N}|0\rangle=e^{i\alpha\hat Q_N}\hat H|0\rangle=0$$

Thus e^{i\alpha\hat Q_N}|0\rangle is a state of zero energy. Assuming the vacuum is the unique zero-energy eigenstate in the Fock space, it must be that $$e^{i\alpha\hat Q_N}|0\rangle=|0\rangle$$

Show also that E_A=E_B, where \hat H\hat \phi_A^\dagger|0\rangle=E_A\hat \phi_A^\dagger|0\rangle and \hat H\hat \phi_B^\dagger|0\rangle=E_B\hat \phi_B^\dagger|0\rangle.

$$E_B\hat \phi_B^\dagger|0\rangle=\hat H\hat \phi_B^\dagger|0\rangle\\=\hat H\left[\hat Q_N,\hat\phi_A^\dagger\right]|0\rangle\\=\hat H\hat Q_N\hat \phi_A^\dagger|0\rangle-\hat H\hat\phi_A^\dagger\hat Q_N|0\rangle\\=\hat Q_N\hat H\hat \phi_A^\dagger|0\rangle\\=\hat Q_NE_A\hat\phi_A^\dagger|0\rangle\\=E_A(\hat\phi_B^\dagger+\hat\phi_A^\dagger \hat Q_N)|0\rangle=E_A\hat\phi^\dagger_B|0\rangle$$

where we used \hat Q_N|0\rangle=0 (which follows from the first part) to drop the terms with \hat Q_N acting on the vacuum.

Thus E_A=E_B.


(26.2) Prove the Fabri-Picasso theorem.

$$\langle 0|\hat J(x)\hat Q|0\rangle=\langle 0|e^{i\hat p\cdot x}\hat J(0)e^{-i\hat p\cdot x}\hat Q|0\rangle$$

We know that [\hat p^\mu,\hat Q]=0, because \hat Q is an internal symmetry. The vacuum has momentum 0, so

$$\langle 0|\hat J(x)\hat Q|0\rangle=\langle 0|\hat J(0)\hat Q|0\rangle$$

Now consider the norm of \hat Q|0\rangle:

$$\langle 0|\hat Q\hat Q|0\rangle=\int d^3x\,\langle 0|\hat J(x)\hat Q|0\rangle=\int d^3x\, \langle 0|\hat J(0)\hat Q|0\rangle=\langle 0|\hat J(0)\hat Q|0\rangle\int d^3x$$

The integral \int d^3x diverges, so this quantity is infinite unless \hat Q|0\rangle=0; thus \hat Q|0\rangle has either zero or infinite norm.


(26.3) Prove Goldstone’s theorem.

Assume there is a continuous symmetry with charge \hat Q where \hat Q|0\rangle\neq 0. Consider a field \hat \phi(y) such that

$$[\hat Q, \hat \phi(y)]=\hat\psi(y)$$

and \langle 0|\hat\psi(y)|0\rangle\neq 0 due to the spontaneously broken symmetry.

Show that 

$$\frac{\partial}{\partial x^0} \langle 0|\hat\psi(0)|0\rangle=-\int d\mathbf S\cdot \langle 0|[\hat {\mathbf J}(x),\hat\phi(0)]|0\rangle$$

This is obvious from the continuity equation, since

$$\frac{\partial \hat J^0}{\partial x^0}=-\nabla\cdot \hat{\mathbf J}$$

The divergence theorem then converts the volume integral of the divergence into the surface integral above. For a local current the surface term vanishes as the surface is taken to infinity, so the symmetry-breaking expectation value \langle 0|\hat\psi(0)|0\rangle is time independent.

 

Lancaster and Blundell Chapter 21

(21.1) Find the thermal average number of excitations in the quantum harmonic oscillator.

$$\langle \hat n\rangle_t=\text{Tr} \hat \rho\hat n$$

$$\hat\rho=\frac{e^{-\beta\hat H}}{Z}$$

$$Z=\sum_n e^{-\beta E_n}=\sum_n e^{-\beta\hbar\omega n}=\sum_n \left(e^{-\beta\hbar\omega}\right)^n\\=\frac{1}{1-e^{-\beta\hbar\omega}}$$

We use units where \hbar =1 and ignore the zero-point energy, so that E_n=n\omega.

$$\text{Tr} \hat \rho\hat n=\sum_n \langle n|e^{-\beta\hat H}(1-e^{-\beta\omega})\hat n|n\rangle$$

$$=\sum_n n e^{-\beta \omega n}(1-e^{-\beta\omega})\\=-\frac{1}{\beta}(1-e^{-\beta\omega})\frac{\partial}{\partial\omega}\sum_n e^{-\beta\omega n}$$

$$=-\frac1\beta(1-e^{-\beta\omega})\frac{\partial}{\partial\omega}\frac{1}{1-e^{-\beta\omega}}\\=-\frac1\beta(1-e^{-\beta\omega})\frac{-\beta e^{-\beta\omega}}{(1-e^{-\beta\omega})^2}$$

$$=\frac{(1-e^{-\beta\omega})e^{-\beta\omega}}{(1-e^{-\beta\omega})^2}=\frac{1}{e^{\beta\omega}-1}$$
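A quick numerical sanity check of this result (my own script, with arbitrary test values):

```python
import numpy as np

beta, omega = 1.3, 0.8                # arbitrary test values, hbar = 1
n = np.arange(2000)                   # truncated sum; terms decay geometrically
p = np.exp(-beta * omega * n)
p /= p.sum()                          # normalized Boltzmann weights
print(np.sum(n * p))                  # thermal average of n
print(1 / np.expm1(beta * omega))     # closed form 1/(e^{beta*omega} - 1)
# both print approximately 0.5467
```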


(21.2) Consider the Lagrangian $$L=\frac12m\dot x(t)^2-\frac12m\omega^2 x(t)^2+f(t)x(t)$$

and show that the response to the driving force f takes the form $$\langle \psi(t)|\hat x(t)|\psi(t)\rangle=\int_{-\infty}^\infty dt'\,\chi(t-t')f(t')$$

Using H'=-f(t)\hat x(t) as the interaction part of the Hamiltonian, find the interaction-picture state |\psi_I(t)\rangle to first order in f(t).

$$H_I'(t)=e^{i\hat H_0t}H'e^{-i\hat H_0t}=-f(t)\hat x_I(t)$$

$$|\psi_I(t)\rangle=T[e^{-i \int_{-\infty}^t H_I' dt'}]|0\rangle\approx \left(1-i\int_{-\infty}^t H_I'\, dt'\right)|0\rangle\\=|0\rangle+i\int_{-\infty}^t dt'\, f(t')\hat x_I(t')|0\rangle$$

If we compute the expectation of \hat x_I(t) in this state, we indeed find that

$$\langle\psi_I(t)|\hat x_I(t)|\psi_I(t)\rangle=\langle 0|\hat x_I(t)|0\rangle+i\int_{-\infty}^t dt'\,f(t')\langle 0|[\hat x_I(t),\hat x_I(t')]|0\rangle=i\int_{-\infty}^\infty dt'\,\theta(t-t') f(t')\langle 0|[\hat x_I(t), \hat x_I(t')]|0\rangle$$

where \langle 0|\hat x_I(t)|0\rangle=0, so only the commutator term survives. Comparing with the response formula above,

$$\chi(t-t')=i\theta(t-t')\langle 0|[\hat x_I(t),\hat x_I(t')]|0\rangle$$

The commutator evaluates to $$\frac{-i}{m\omega}\sin(\omega(t-t'))$$ so that $$\chi(t-t')=\frac{\theta(t-t')}{m\omega}\sin(\omega(t-t'))$$


(21.3) Find the Green’s function for the diffusion equation.

Perform a Laplace transform in time and a Fourier transform in space. The Laplace transform is equivalent to a Fourier transform with s=i\omega, since then e^{-st}=e^{-i\omega t}.

$$-s\tilde G-D(i\mathbf q)^2\tilde G=1$$

$$\tilde G = \frac{1}{-i\omega + D|\mathbf q|^2}$$
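Inverting both transforms recovers the familiar heat kernel; in three spatial dimensions,

$$G(\mathbf x,t)=\frac{\theta(t)}{(4\pi Dt)^{3/2}}e^{-|\mathbf x|^2/4Dt}$$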

Lancaster and Blundell Chapter 15

(15.1) Why is \pi^0\rightarrow\gamma+\gamma+\gamma disallowed?

Charge conjugation gives each \gamma a factor of -1, so a three-photon state has a C-eigenvalue of (-1)^3=-1. Electromagnetic interactions conserve C, and the \pi^0 has a C-eigenvalue of +1, so the decay is forbidden.


(15.2)

a) Magnetic flux is a pseudoscalar.

b) Angular momentum is a pseudovector.

c) Charge is a scalar.

d) The scalar product of a vector with a pseudovector is a pseudoscalar.

e) The scalar product of two vectors is a scalar.

f) The scalar product of two pseudovectors is a scalar.
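As a quick check of (b): under parity \mathbf r\rightarrow-\mathbf r and \mathbf p\rightarrow-\mathbf p, so

$$\mathbf L=\mathbf r\times\mathbf p\rightarrow(-\mathbf r)\times(-\mathbf p)=\mathbf L$$

Angular momentum does not flip sign under parity, which is exactly what makes it a pseudovector.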


(15.3) Find the representations for the spinor rotation matrices.

$$\mathbf R(\hat{\mathbf x},\theta)=\begin{bmatrix}\cos\frac\theta2&-i\sin\frac\theta2\\-i\sin\frac\theta2&\cos\frac\theta2\end{bmatrix}$$

$$\mathbf R(\hat{\mathbf y},\theta)=\begin{bmatrix}\cos\frac\theta2&-\sin\frac\theta2\\\sin\frac\theta2&\cos\frac\theta2\end{bmatrix}$$

$$\mathbf R(\hat{\mathbf z},\theta)=\begin{bmatrix}\cos\frac\theta2-i\sin\frac\theta2&0\\0&\cos\frac\theta2+i\sin\frac\theta2\end{bmatrix}=\begin{bmatrix}e^{-i\theta/2}&0\\0&e^{i\theta/2}\end{bmatrix}$$
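These can be checked against the matrix exponential \mathbf R(\hat{\mathbf n},\theta)=e^{-i\theta\,\hat{\mathbf n}\cdot\boldsymbol\sigma/2}; a small verification script of my own, using numpy and scipy:

```python
import numpy as np
from scipy.linalg import expm

# Pauli matrices
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

theta = 0.7  # arbitrary angle
for axis, sigma in [("x", sx), ("y", sy), ("z", sz)]:
    print(axis, np.round(expm(-1j * theta / 2 * sigma), 3))
# x and y reproduce the sine/cosine blocks above; z is diagonal
# with entries exp(-i*theta/2) and exp(i*theta/2).
```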

what did you major in?

Sometimes, I’m asked about my college major, and I usually say physics and computer science. The truth is a little more complicated. I officially had majors in Biochemistry, Biophysics, Mathematics, Physics, and Computer and Information Science, and also a Master of Science in Physics. My planning worksheets were a mess:

Bachelor of Arts, majors in Biochemistry, Biophysics, and Physics (apparently I got honors for Biochemistry and Physics? neat)

Bachelor of Applied Science, majors in Computer and Information Science and Mathematics

There wasn’t even enough room for the mathematics major on an official worksheet. The maximum on a single sheet is three, and the BAS option usually uses a BA sheet for the second major. Only a single BA, BAS, MS, etc. sheet can be made official. It wasn’t easy, but I figured it all out. Looking back on it now, I’m impressed I got the paperwork in order. The only classes not listed are a few that appear solely on my MS transcript (each class can be double counted, but not triple counted, including between majors and among degrees).

Was it worth it? Absolutely. Maybe?

I’m a better physicist due to the practice I got coding and understanding the rigorous mathematics behind the physics. All the programming languages blend together, and it takes a single day to refresh myself on C++, MATLAB, Julia, Python, R, and all the packages out there. I’m a better biologist because of my broad knowledge of chemistry and physics. I’m a better scientist for understanding all the different approaches to a single problem, informed by intuition for each of these subjects.

The downside is I didn’t have as much research coming into the MD-PhD program. I didn’t have the laser focus on a single subject for which I was an expert. I’m getting there now in my graduate research, especially with machine learning and molecular simulation. But I feel ready to write any kind of paper: theory, computational, experimental, you name it. Just give me a week and I’ll become well-versed in the subject.