June 25, 2019
In my latest paper, my colleague and I prove a theorem about the approximation of multivariate functions by deep ReLU networks, for which the curse of dimensionality is lessened. Our theorem is based on a constructive proof of the Kolmogorov–Arnold superposition theorem, and on a subset of multivariate continuous functions whose outer superposition functions can be efficiently approximated by deep ReLU networks.
At the second International Congress of Mathematicians, held in Paris in 1900, Hilbert presented ten of his 23 problems, including the 13th problem, which concerns equations of degree seven. He considered the following equation, $$ x^7 + ax^3 + bx^2 + cx + 1 = 0, $$ and asked whether its solution \(x(a,b,c)\), seen as a function of the three parameters \(a\), \(b\), and \(c\), can be written as the composition of functions of only two variables.
Hilbert's 13th problem was solved by Kolmogorov and his 19-year-old student Arnold in a series of papers in the 1950s. Kolmogorov first proved in 1956 that any continuous function of several variables can be expressed as the composition of functions of three variables. His student Arnold extended this theorem in 1957, reducing three variables to two. Kolmogorov finally showed later that year that only functions of one variable were needed. The latter result is known as the Kolmogorov–Arnold superposition theorem, and states that any continuous function \(f:[0,1]^n\rightarrow\mathbb{R}\) can be decomposed as $$ f(x_1,\ldots,x_n) = \sum_{j=0}^{2n}\phi_j\left(\sum_{i=1}^n\psi_{i,j}(x_i)\right), $$ with \(2n+1\) continuous outer functions \(\phi_j:\mathbb{R}\rightarrow\mathbb{R}\) (dependent on \(f\)) and \(2n^2+n\) continuous inner functions \(\psi_{i,j}:[0,1]\rightarrow\mathbb{R}\) (independent of \(f\)).
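For instance, for \(n=2\), the theorem gives a decomposition with \(2n+1=5\) outer functions and \(2n^2+n=10\) inner functions, $$ f(x_1,x_2) = \sum_{j=0}^{4}\phi_j\left(\psi_{1,j}(x_1)+\psi_{2,j}(x_2)\right): $$ every continuous function of two variables can be written exactly using only univariate continuous functions and addition.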
The Kolmogorov–Arnold superposition theorem was further improved in the 1960s and 1970s. Lorentz showed in 1962 that the outer functions \(\phi_j\) can be chosen to be the same function \(\phi\), and replaced the inner functions \(\psi_{i,j}\) by \(\lambda_i\psi_j\), for some positive rationally independent constants \(\lambda_i\leq 1\), while Sprecher replaced the inner functions \(\psi_{i,j}\) by Hölder continuous functions \(x_i\mapsto\lambda^{ij}\psi(x_i+j\epsilon)\) in 1965. Two years later, Fridman demonstrated that the inner functions could be chosen to be Lipschitz continuous, but his decomposition used \(2n+1\) outer functions and \(2n^2+n\) inner functions. Finally, Sprecher provided in 1972 a decomposition with Lipschitz continuous functions \(x_i\mapsto\lambda^{i-1}\psi(x_i+j\epsilon)\). (See our paper for the references.)
One of the most important theoretical problems in deep network approximation theory is to determine why and when deep networks lessen or break the curse of dimensionality for multivariate continuous functions, that is, the \(\mathcal{O}(\epsilon^{-n})\) growth of the network size as the approximation error \(\epsilon\rightarrow0\).
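To give a sense of scale: reaching an error of \(\epsilon=10^{-2}\) for a function of \(n=10\) variables would, in the worst case, require a network of size on the order of \(\epsilon^{-n}=10^{20}\), far beyond anything usable in practice.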
In our paper, we use a variant of Sprecher's 1965 version of the theorem, which reads $$ f(x_1,\ldots,x_n) = \sum_{j=0}^{2n}\phi_j\left(\sum_{i=1}^n\lambda_i\psi(x_i+ja)\right), $$ for some constants \(\lambda_1=1>\lambda_2>\ldots>\lambda_n\) and \(a=[(2n+1)(2n+2)]^{-1}\), and with a Hölder continuous inner function \(\psi\).
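To make the structure of this decomposition concrete, here is a minimal Python sketch (not from the paper) that evaluates a sum of this form. The functions \(\psi\) and \(\phi_j\) below are arbitrary placeholders, not the Hölder continuous functions of the actual construction; the sketch only illustrates the computational graph, namely \(n\) univariate inner evaluations per term, \(2n+1\) terms, and a final sum.

```python
import numpy as np

# Illustrative sketch of f(x) = sum_{j=0}^{2n} phi_j( sum_i lambda_i * psi(x_i + j*a) ).
# The phi_j and psi used here are arbitrary placeholders, not the functions
# constructed in the paper; only the structure of the evaluation is meaningful.

def ka_evaluate(x, phis, psi, lambdas, a):
    """Evaluate sum_j phi_j( sum_i lambdas[i] * psi(x[i] + j*a) ) at a point x in [0, 1]^n."""
    total = 0.0
    for j, phi_j in enumerate(phis):                # 2n + 1 outer functions
        inner = sum(lambdas[i] * psi(x[i] + j * a) for i in range(len(x)))
        total += phi_j(inner)
    return total

n = 3
a = 1.0 / ((2 * n + 1) * (2 * n + 2))               # a = [(2n+1)(2n+2)]^{-1}
lambdas = [1.0, 0.5, 0.25]                          # placeholders with lambda_1 = 1 > ... > lambda_n
psi = np.tanh                                       # placeholder inner function
phis = [np.sin] * (2 * n + 1)                       # placeholder outer functions

print(ka_evaluate([0.2, 0.4, 0.6], phis, psi, lambdas, a))
```

Seen this way, the decomposition has the shape of a two-hidden-layer network whose "activations" are the univariate functions \(\psi\) and \(\phi_j\), which is what makes it natural to study through the lens of deep network approximation.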
We first show that the outer functions may be approximated by Lipschitz continuous functions. Then, we prove that, for a particular subset of the continuous functions, the inner and outer functions can be approximated with error \(\epsilon\) by deep ReLU networks of size \(\mathcal{O}(\epsilon^{-\log n})\) and \(\mathcal{O}(\epsilon^{-1/2})\). The resulting network that approximates \(f\) has depth and size \(\mathcal{O}(\epsilon^{-\log n})\); the curse of dimensionality is lessened. Check it out if you are interested!
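(To put the gain in perspective, a back-of-the-envelope comparison, reading \(\log\) as the natural logarithm for concreteness: for \(n=100\) variables the exponent drops from \(n=100\) to \(\log n\approx 4.6\), so the network size grows like \(\epsilon^{-4.6}\) rather than \(\epsilon^{-100}\).)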