Deep networks and the curse of dimensionality

December 22, 2017

I started working on deep learning models when I began my research at Columbia University in September 2017. Coming from the approximation theory world, I wanted to learn about their approximation properties. A few months down the road, Qiang Du and I proved a new theorem concerning the approximation of multivariate functions by deep ReLU networks.

Deep learning has been successfully applied to many fields, including computer vision, speech recognition, and natural language processing. It is based on approximations by deep networks, as opposed to shallow networks. The latter are neural networks with a single hidden layer and correspond to approximations \(f_N\) of multivariate functions \(f:\mathbb{R}^d\rightarrow\mathbb{R}\) of the form $$ f_N(\boldsymbol{x}) = \sum_{i=1}^N \alpha_i \sigma(\boldsymbol{w}_i^T\boldsymbol{x} + \theta_i), $$ with \(\alpha_i,\theta_i\in\mathbb{R},\, \boldsymbol{x},\boldsymbol{w}_i\in\mathbb{R}^d\), and for some activation function \(\sigma:\mathbb{R}\rightarrow\mathbb{R}\). The former are neural networks with several hidden layers, where each unit of each layer performs an operation of the form \(\sigma(\boldsymbol{w}^T\boldsymbol{x} + \theta)\). The depth of a network is its number of hidden layers, and its size is the total number of units. Shallow networks have depth \(1\) and their size is the number \(N\) in the expansion above, while deep networks usually have depth \(\gg 1\). Deep ReLU networks use the activation function \(\sigma(x) = \max(0,x)\).
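To make these two architectures concrete, here is a minimal NumPy sketch; it is only an illustration of the formulas above, and the function names (`shallow_net`, `deep_relu_net`) and the way the weights are packed are ad hoc choices of mine.

```python
import numpy as np

def relu(x):
    """ReLU activation sigma(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def shallow_net(x, W, alpha, theta):
    """Shallow network: f_N(x) = sum_i alpha_i * sigma(w_i . x + theta_i).

    x     : input, shape (d,)
    W     : hidden weights, shape (N, d), row i is w_i
    alpha : output weights, shape (N,)
    theta : biases, shape (N,)
    """
    return alpha @ relu(W @ x + theta)

def deep_relu_net(x, layers, alpha):
    """Deep ReLU network: each hidden layer computes sigma(W h + theta),
    followed by a linear output layer with weights alpha.

    layers : list of (W, theta) pairs, one per hidden layer
    """
    h = x
    for W, theta in layers:
        h = relu(W @ h + theta)
    return alpha @ h
```

In this notation, the depth is `len(layers)` and the size is the total number of rows across the weight matrices.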

One of the most important theoretical problems is to determine why and when deep (but not shallow) networks can lessen or break the curse of dimensionality. A possible way of addressing this problem is to focus on a particular set of functions with a very special (compositional or polynomial) structure and to show that, for this particular set, deep networks perform extremely well. We followed a different route: we considered a generic set of functions and, by establishing a connection with sparse grids, proved new error estimates in which the curse of dimensionality is lessened.

For a real-valued function \(f\) on \(\mathbb{R}^d\) whose smoothness is characterized by some integer \(m\) (typically the order of integrable derivatives), and for some prescribed error \(\epsilon>0\), one tries to show that there exists a neural network \(f_N\) of size \(N\) that satisfies $$ \Vert f - f_N \Vert \leq \epsilon \;\; \text{with} \;\; N=\mathcal{O}(\epsilon^{-\frac{d}{m}}), $$ for some norm \(\Vert\cdot\Vert\). For deep networks, one also wants to find the asymptotic behavior of the depth as a function of \(\epsilon\). Results of this form are standard approximation results that suffer from the curse of dimensionality: for small dimensions \(d\), the size \(N\) of the network increases at a reasonable rate as \(\epsilon\) goes to zero, but \(N\) grows exponentially with \(d\).
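As a quick back-of-the-envelope illustration (with the constant hidden in the \(\mathcal{O}\) dropped and the values of \(m\) and \(\epsilon\) chosen arbitrarily), here is how the bound behaves as \(d\) grows:

```python
# Classical bound N ~ eps^(-d/m), constants dropped.
# Illustrative choices: smoothness m = 2, target error eps = 1e-8.
eps, m = 1e-8, 2
for d in (2, 4, 8):
    print(f"d = {d}:  N ~ {eps ** (-d / m):.1e}")
# d = 2:  N ~ 1.0e+08
# d = 4:  N ~ 1.0e+16
# d = 8:  N ~ 1.0e+32
```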

The picture is different for deep networks. Our theorem shows that functions in the so-called Korobov spaces \(X^{2,p}(\Omega)\), which have mixed derivatives of order two, can be approximated with error \(\epsilon\) by deep ReLU networks of depth \(\mathcal{O}(\vert\log_2\epsilon\vert\log_2 d)\) and size $$ N=\mathcal{O}(\epsilon^{-\frac{1}{2}}\vert\log_2\epsilon\vert^{\frac{3}{2}(d-1)+1}(d-1)). $$ The curse of dimensionality is not completely overcome, but it is lessened: the dominant factor \(\epsilon^{-\frac{1}{2}}\) is independent of \(d\), which enters only through the logarithmic factor \(\vert\log_2\epsilon\vert^{\frac{3}{2}(d-1)+1}\) and the linear factor \(d-1\).
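Repeating the back-of-the-envelope computation above with the new bound (again dropping constants, same illustrative \(\epsilon\) and dimensions) makes the comparison concrete:

```python
import math

# Deep ReLU (Korobov) bound N ~ eps^(-1/2) * |log2 eps|^{(3/2)(d-1)+1} * (d-1),
# constants dropped, evaluated at the same eps and dimensions as above.
eps = 1e-8
for d in (2, 4, 8):
    N = eps ** -0.5 * abs(math.log2(eps)) ** (1.5 * (d - 1) + 1) * (d - 1)
    print(f"d = {d}:  N ~ {N:.1e}")
```

With these (arbitrary) choices, the deep-network bound comes out around \(3.6\times10^{7}\), \(2\times10^{12}\), and \(1.7\times10^{21}\) for \(d=2,4,8\), against \(10^{8}\), \(10^{16}\), and \(10^{32}\) from the classical bound above; the gap widens rapidly because \(d\) no longer appears in the exponent of \(\epsilon\).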


Blog posts about neural network approximation theory

2019   Deep networks and the Kolmogorov–Arnold theorem

2019   Deep networks and bandlimited functions

2017   Deep networks and the curse of dimensionality