\section{Study of the ternary representation}
\label{sec:ternary}
In this section, we study the ternary representation, which is the general representation with weight sharing we obtained above. We already saw that it is linear, associative and commutative.
\subsection{Genericity}
\label{sec:gen}
%\todo{split between supervised and semi supervised}
The ternary representation can express any kind of layer. We explain below how to obtain standard ones. See \tabref{tab:ind} and \tabref{tab:mid} for a description of the indices, notations and shapes.
%\renewcommand{\labelitemi}{$\blacktriangleright$}
\begin{itemize}%[noitemsep,nolistsep]
\item[\ding{227}] To obtain a fully connected layer, one can choose $\omega = nm$ such that the slices $S[:,i,j]$ constitute every possible one-hot vector (a minimal sketch of such constructions follows this list).
\item[\ding{227}] To obtain a convolutional layer, one can choose $\omega$ to be the size of the kernel. $S$ would contain one-hot vectors. A stride $> 1$ can be obtained by removing the corresponding dimensions. If the convolution is a classical convolution, or more generally is characterized by a Cayley subgraph (see \chapref{chap:2}), then $S$ would be circulant along the input-neuron rank in the canonical basis.
\item[\ding{227}] To obtain a graph convolutional layer (\cite{kipf2016semi}, see description in \secref{sec:vert}), one can choose $B = 1$, $\omega = 1$, and $S$ (viewed as a matrix) to be the normalized adjacency matrix~$\widetilde{A}$.
\item[\ding{227}] Other GCN variants can also be obtained; for instance, for Chebyshev filters \citep{defferrard2016convolutional}, $S$ would be $\sum_{i=0}^k T_i(\widetilde{L})$.
\item[\ding{227}] To obtain a graph attention layer (\cite{velickovic2017graph}, see description in \secref{sec:vert}), one can concatenate $K$ graph convolutional layers, where $K$ is the number of attention heads, and with $\widetilde{A}$ filled with the learned attention coefficients. Instead of concatenation, one could also choose $\omega=K$, and given an attention head~$k$, the slices $S[k,:,:]$ would be matrices containing the coefficients.
\item[\ding{227}] A topology-adaptive graph convolution layer (\cite{du2017topology}, see description in \secref{sec:vert}) is a neural contraction layer for which $S$ contains the powers of $\widetilde{A}$ along the first rank.
\item[\ding{227}] A mixture model convolutional layer (\cite{monti2016geometric}, equations (9) and (10)) is a neural contraction layer for which $S$ contains the values of the weighting functions \ie $S[k,i,j] = w_k(u[i,j])$.
\item[\ding{227}] A generalized convolution (\cite{vialatte2016generalizing}, equations (1) and (10)) is a neural contraction layer for which $S$ is the allocation tensor $A^e$ which has the sparse priors mentioned in the next subsection.
\item[\ding{227}] Any partially connected layer with (or without) weight sharing can be obtained with appropriate construction of~$S$.
\end{itemize}
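To make the first two constructions concrete, here is a minimal NumPy sketch (the function names and the stride-$1$, circular-padding choices are ours and only serve as an illustration) that builds $S$ for a fully connected layer and for a one-dimensional convolution:
\begin{verbatim}
import numpy as np

def scheme_fully_connected(n, m):
    # omega = n * m: each (input, output) pair gets its own weight,
    # so the slices S[:, i, j] enumerate every possible one-hot vector.
    omega = n * m
    S = np.zeros((omega, n, m))
    for i in range(n):
        for j in range(m):
            S[i * m + j, i, j] = 1.0
    return S

def scheme_conv1d(n, omega):
    # omega = kernel size; S is circulant along the input-neuron rank:
    # weight k connects input neuron (j + k) mod n to output neuron j
    # (stride 1, circular padding).
    S = np.zeros((omega, n, n))
    for j in range(n):
        for k in range(omega):
            S[k, (j + k) % n, j] = 1.0
    return S

# In both cases the layer output is the neural contraction
# y[j] = sum_k sum_i theta[k] * S[k, i, j] * x[i], e.g.
# y = np.einsum('k,kij,i->j', theta, S, x)
\end{verbatim}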
\subsection{Sparse priors for the classification of signals}
\label{sec:sparse}
In the more general case, the scheme $S$ is a real-valued tensor, which may introduce more weights in the layer. However, for Euclidean convolutions, $S$ is already determined by the Euclidean convolution operator and does not cost additional memory. To imitate this property, we can focus on the class of layers with the following sparse priors:
\begin{enumerate}
\item Given a couple of neuron indices $(i,j)$, $S[:,i,j]$ is either a one-hot vector, or the zero vector, as in \eqref{eq:alloc}.
\item Given an output neuron index $j$, a weight $\theta[h]$ can appear only once in its LRF.
\end{enumerate}
\begin{remark}
For these sparse priors, we consider the case of the classification of signals. In the case of semi-supervised node classification, $S$ would only be sparse between the rank indexed by $i$ and the one indexed by $j$, but not along the other rank.
\end{remark}
Therefore, we can describe $S$ with a \emph{weight assignment table} $T$, which is a matrix such that
\begin{gather}
T[i,j] = \begin{cases} h & \text{if } \exists!\, h,\ \theta^T S[:,i,j] = \theta[h]\\ 0 & \text{otherwise}\end{cases}
\end{gather}
In turn, $T$ can be described by the set of transformations (or partial transformations), defined as
\begin{gather}
g_h(j) = i \Leftrightarrow T[i,j] = h
\end{gather}
where each (partial) transformation $g_h$ corresponds to the weight indexed by $h$.
Without loss of generality, let us reduce to the simple case where $B=P=Q=1$. The neural contraction \eqref{eq:comm} rewrites as:
\begin{align}
\wideparen{\theta Sx}[j] & = \displaystyle \sum_{k=1}^\omega \sum_{i=1}^n \theta[k] \h{2} S[k,i,j] \h{2} x[i]\\
& = \displaystyle \sum_{h=1}^\omega \theta[h] \h{2} x[g_h(j)]%\\
%& = \theta \ast_{\M} x [j]
\end{align}
which, for EC layers, can be rewritten as an EC* convolution of graph signals as studied in \chapref{chap:2}, since the (partial) transformations are invertible thanks to the sparse priors. Conversely, given a convolution of graph signals (formulated as $\ast_\varphi$ or $\ast_{\M}$), we can derive an assignment table $T$ and its corresponding scheme $S$ to obtain a neural contraction.
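Under these sparse priors, the correspondence between $S$, the assignment table $T$ and the (partial) transformations $g_h$ can be sketched as follows (our own NumPy illustration; the dense contraction used in the sanity check is only a reference implementation):
\begin{verbatim}
import numpy as np

def assignment_table(S):
    # T[i, j] = h (1-indexed) if S[:, i, j] is the one-hot vector of
    # weight h, and 0 otherwise.
    omega, n, m = S.shape
    T = np.zeros((n, m), dtype=int)
    for i in range(n):
        for j in range(m):
            nz = np.nonzero(S[:, i, j])[0]
            if len(nz) == 1:
                T[i, j] = nz[0] + 1
    return T

def contraction_via_transformations(theta, S, x):
    # y[j] = sum_h theta[h] * x[g_h(j)], where g_h(j) = i iff T[i, j] = h.
    T = assignment_table(S)
    omega, n, m = S.shape
    y = np.zeros(m)
    for j in range(m):
        for h in range(1, omega + 1):
            i = np.nonzero(T[:, j] == h)[0]
            if len(i) == 1:  # g_h may be a partial transformation
                y[j] += theta[h - 1] * x[i[0]]
    return y

# Sanity check against the dense contraction:
# np.allclose(contraction_via_transformations(theta, S, x),
#             np.einsum('k,kij,i->j', theta, S, x))
\end{verbatim}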
\subsection{Efficient implementation under sparse priors}
\begin{remark}
This section applies in supervised settings but not in semi-supervised settings, for which it is best to use the sparse functions of common deep learning libraries.
\end{remark}
\paragraph{What is the fastest way to compute $\wideparen{\Theta S X}$?}
As equation \eqref{eq:ternary} is associative and commutative, there are three ways to begin its computation: with $\Theta S$, $SX$, or $\Theta X$, which we call \emph{middle-stage tensors}. Computing a middle-stage tensor is the bottleneck of \eqref{eq:ternary}, as it requires computing significantly more entries than the second tensor contraction. In \tabref{tab:mid}, we compare their shapes. We refer the reader to \tabref{tab:ind} for the denomination of the indices.
\begin{table}[H]
\centering
\begin{tabular}{cc}
Tensor & Shape\\
\hline
$\Theta$ & $\omega \times N \times M$\\
$S$ & $\omega \times n \times m$\\
$X$ & $n \times N \times B$\\
$\Theta S$ & $n \times m \times N \times M$\\
$SX$ & $\omega \times m \times N \times B$\\
$\Theta X$ & $\omega \times n \times M \times B$\\
$\wideparen{\Theta SX}$ & $m \times M \times B$
\end{tabular}
\caption{Table of shapes}
\label{tab:mid}
\end{table}
We usually want $\omega \ll n$ and $\omega \ll m$, which means that the weight kernels are small (for example, in the case of images, convolutional kernels are significantly smaller than the images themselves). Also, the number of input channels $N$ and the number of feature maps $M$ are of roughly the same order, with $N < M$ more often than the contrary. So in practice, $\Theta S$ is significantly bigger than $SX$ and $\Theta X$, and $SX$ is usually the smallest.
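The three contraction orders can be compared directly with \texttt{np.einsum}; the following sketch (with shapes chosen arbitrarily, much smaller than the benchmark values used below) verifies that they agree and exhibits the size gap between the middle-stage tensors:
\begin{verbatim}
import numpy as np

omega, n, m, N, M, B = 4, 32, 32, 8, 16, 2
Theta = np.random.randn(omega, N, M)  # weight kernel
S = np.random.randn(omega, n, m)      # weight sharing scheme
X = np.random.randn(n, N, B)          # input signal

# The three possible middle-stage tensors:
Theta_S = np.einsum('kpq,kij->ijpq', Theta, S)  # n x m x N x M
S_X     = np.einsum('kij,ipb->kjpb', S, X)      # omega x m x N x B
Theta_X = np.einsum('kpq,ipb->kiqb', Theta, X)  # omega x n x M x B

# All three orderings yield the same contraction of shape m x M x B:
Y1 = np.einsum('ijpq,ipb->jqb', Theta_S, X)
Y2 = np.einsum('kpq,kjpb->jqb', Theta, S_X)
Y3 = np.einsum('kij,kiqb->jqb', S, Theta_X)
assert np.allclose(Y1, Y2) and np.allclose(Y1, Y3)

print(Theta_S.size, S_X.size, Theta_X.size)  # Theta_S is by far the largest
\end{verbatim}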
\paragraph{How to exploit the sparsity of $S$?}
In usual supervised settings, $S$ is sparse, since the slices $S[:,i,j]$ are one-hot vectors. So computing $SX$ should be faster than computing $\Theta X$, provided we exploit this sparsity. Although $S$ is very sparse (at most a fraction $\frac{1}{\omega}$ of its values are non-zero), it is only sparse along its first rank, which prevents implementations based on the sparse classes of common deep learning libraries from using as much parallelization as they could. However, we can benefit from the sparse priors mentioned in \secref{sec:sparse}.%: the local receptive fields (LRF) are usually of the same size (or at least of around the same order), and each weight is associated to one \emph{and only one} neuron of each~LRF.
So we proceed differently. The idea is to use a non-sparse tensor $X_{\LRF}$ with one rank indexing the LRFs and another rank indexing the elements of these LRFs, in order to lower the computation to a dense matrix multiplication (or a dense tensor contraction), which is already well optimized. This approach is similar to that proposed in \cite{chellapilla2006high}, which is also exploited in the cuDNN primitives~\citep{chetlur2014cudnn} to efficiently implement the classical convolution.
\paragraph{The LRF representation}
In our case, it turns out that $X_{\LRF}$ can be exactly $SX$, as given fixed~$b$,~$p$, and~$j$, $SX[:,j,p,b]$ corresponds to the entries of the input signal $X[:,p,b]$ restricted to an LRF $\ccr_j$ of size~$\omega$. Therefore,
\begin{gather}
\exists \LRF_j = [i_1, \ldots, i_\omega]~\st SX[:,j,p,b] = X[\LRF_j, p, b]
\label{eq:R}
\end{gather}
The elements of $\LRF_j$ can be found by doing a lookup in the one-hot vectors of~$S$, provided each kernel weight occurs exactly once in each LRF. We have:
\begin{gather}
\LRF_j[k] = i_k~\st S[k,i_k,j] = 1
\end{gather}
This lookup need not be recomputed every time and can be done once beforehand. Finally, if we define $\LRF = [\LRF_1, \ldots, \LRF_m]$, \eqref{eq:R} gives:
\begin{gather}
SX = X[\LRF, :, :]
\label{eq:LRF}
\end{gather}
Equation \eqref{eq:LRF} is computed with only $\omega \times m$ assignments and can be simply implemented with automatic differentiation in commonly used deep learning libraries.
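A minimal sketch of this gather-based computation is given below (plain NumPy indexing here; the benchmark footnote uses TensorFlow's \texttt{gather\_nd}, but any indexing primitive works, and the function names are ours):
\begin{verbatim}
import numpy as np

def build_lrf_table(S):
    # LRF[j, k] = i_k such that S[k, i_k, j] = 1, computed once beforehand
    # (assumes each kernel weight occurs exactly once in each LRF).
    LRF = np.argmax(S, axis=1).T  # shape (m, omega)
    return LRF

def contract_S_X_via_lrf(S, X, LRF=None):
    # SX[k, j, p, b] = X[LRF[j, k], p, b]:
    # omega * m assignments, no multiplication involved.
    if LRF is None:
        LRF = build_lrf_table(S)
    return X[LRF.T, :, :]  # shape (omega, m, N, B)

# Sanity check against the dense contraction:
# np.allclose(contract_S_X_via_lrf(S, X), np.einsum('kij,ipb->kjpb', S, X))
\end{verbatim}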
\paragraph{Benchmarks}
To support our theoretical analysis, we benchmark three methods for computing the tensor contraction $SX$:
\begin{itemize}[nolistsep,noitemsep]
\item naively using dense multiplication,
\item using sparse classes of deep learning libraries,
\item using the LRF-based method described above.
\end{itemize}
We run the benchmarks under the sparse priors mentioned in \secref{sec:sparse}. For each method, we perform $100$ runs of the computation of $SX$, with $S$ and $X$ randomly generated according to these assumptions. In \tabref{tab:ben}, we report the mean times. In every run, the hyperparameter values were $n = m = N = M = B = 256$ and $\omega = 4$. The computations were done on a graphics processing unit (GPU)\footnote{The GPU was an Nvidia Geforce GTX1060, the CPU was an Intel Xeon E5-1630v4 at 3.70GHz with 4 cores. We used TensorFlow and the function gather\_nd for the assignments.}.
% \begin{table}[H]
% \centering
% \begin{tabular}{lcc}
% Method & No parallelization & Wall clock\\
% \hline
% Naive & 5650~ms & 950~ms\\
% Sparse & \textbf{521}~ms & 413~ms\\
% LRF & 1090~ms & \textbf{365}~ms
% \end{tabular}
% \caption{Benchmark mean times}
% \label{tab:ben}
% \end{table}
\begin{table}[H]
\centering
\begin{tabular}{lc}
Method & Mean time\\
\hline
Naive & 950~ms\\
Sparse & 413~ms\\
LRF & \textbf{365}~ms
\end{tabular}
\caption{Benchmark mean times ($100$ runs per method)}
\label{tab:ben}
\end{table}
We observe that the LRF method is the fastest, since the sparse implementation is not optimized for this use case.
\subsection{Influence of symmetries}
\label{sec:sym}
In the case of images, or other signals over a grid, the grid structure of the domain defines the weight sharing scheme $S$ of the convolution operation. For example, for a layer $\cl : X \mapsto Y = h(\wideparen{\Theta S X})$, and given fixed $b,p,q$, the classical convolution can be rewritten as
\begin{align}
y[j] &= h\left(\displaystyle \sum_{k=1}^{\omega} \theta[k] \h{2} \sum_{i=1}^n S[k,i,j] \h{2} x[i] \right)\\
&= h\left(\displaystyle \sum_{k=1}^{\omega} \theta[k] \h{2} x_{\LRF}[k,j] \right) \label{eq:rewr}
\end{align}
where $x_{\LRF}[k,j]$ can be obtained by matching \eqref{eq:rewr} with the expression given by the definition (see \defref{def:conv}). So, $S[k,:,j]$ is a one-hot vector that specifies which input neuron in the LRF of $y$ is associated with the $k$-th kernel weight, and the index of the $1$ is determined by $x_{\LRF}[k,j]$.
That is to say, in the ternary representation, $\Theta$ captures the information the layer has learned, whereas $S$ captures how the layer exploits the symmetries of the input domain.
\paragraph{Visual construction}
Visually, constructing $S$ amounts to moving a rectangular grid over the pixel domain, as depicted in \figref{fig:rect}, where each point represents the center of a pixel, and the moving rectangle represents the LRF of its center $j$. Each of its squares represents a kernel weight $\theta[k]$, which is associated with the pixel $x_{\LRF}[k,j]$ that falls in it.
\input{chapter3/fig_rect}
The case of images is very regular, in the sense that every pixel is regularly spaced, so that the obtained $S$ is circulant along its last two ranks. This is a consequence of the translational symmetries of the input domain, which underlie the definition of the convolution, as seen in \chapref{chap:2}.
\begin{itemize}
\item What happens if we lose these symmetries?
\end{itemize}
To answer this question, we perform the following experiment~\citep{vialatte2016generalizing}:
\begin{enumerate}
\item We distort the domain by moving the pixels randomly. The displacement has a uniformly random angle, and its radius follows a Gaussian distribution~$\cn(0,\sigma)$.
\item Then we compare the performances of shallow CNNs expressed under the ternary representation, for which $S$ is constructed as in the above visual construction, for different values of~$\sigma$.
\end{enumerate}
The visual construction of $S$ on a distorted domain is depicted in \figref{fig:distrect} and sketched in code below.
\input{chapter3/fig_distrect}
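Here is a sketch of this construction (our own function names; for simplicity each kernel cell is assigned the distorted pixel nearest to its centre, a simple variant of ``the pixel that falls in it'', and tie-breaking details are ignored):
\begin{verbatim}
import numpy as np

def distort_grid(height, width, sigma, rng=None):
    # Pixel centres on a regular grid, each displaced with a uniformly
    # random angle and a Gaussian radius N(0, sigma).
    rng = np.random.default_rng(0) if rng is None else rng
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing='ij')
    pos = np.stack([ys, xs], axis=-1).reshape(-1, 2).astype(float)
    angles = rng.uniform(0.0, 2.0 * np.pi, size=len(pos))
    radii = rng.normal(0.0, sigma, size=len(pos))
    pos += radii[:, None] * np.stack([np.sin(angles), np.cos(angles)], axis=-1)
    return pos  # shape (n, 2) with n = height * width

def scheme_from_positions(pos, kernel_side=3):
    # Slide a kernel_side x kernel_side grid of cells centred on each
    # distorted pixel j; weight k is assigned to the distorted pixel
    # nearest to the centre of cell k (one one-hot slice per (k, j)).
    n = len(pos)
    r = kernel_side // 2
    offsets = np.array([(dy, dx) for dy in range(-r, r + 1)
                                 for dx in range(-r, r + 1)], dtype=float)
    omega = len(offsets)
    S = np.zeros((omega, n, n))
    for j in range(n):
        cells = pos[j] + offsets                                    # (omega, 2)
        d = np.linalg.norm(cells[:, None, :] - pos[None, :, :], axis=-1)
        S[np.arange(omega), d.argmin(axis=1), j] = 1.0
    return S
\end{verbatim}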
We run a classification task with standard hyperparameters on a toy dataset (we used MNIST, \cite{lecun1998mnist}). The results are reported in \figref{fig:expres}.
\input{chapter3/fig_res}
The larger $\sigma$ is, the less accurately the symmetries of the input domain hold, up to a point where the ternary representation becomes almost equivalent to a dense layer. The results illustrate this evolution nicely and stress the importance of leveraging symmetries when defining new convolutions.
Moreover, they indicate that the ternary representation can also improve performance compared to dense layers, provided we are able to create a relevant weight sharing scheme $S$ that exploits symmetries. For example, when the underlying graph structure of the input is embedded in a two-dimensional Euclidean space, $S$ can be created just as in this experiment.
% \subsection{Randomly assigning weights}
% As we saw, the weight sharing scheme $S$ labels the propagation. In the case of datasets with known symmetries like image datasets, we saw in the previous section that it is best to do this labeling with respect to these symmetries. In the other case, when the symmerties are unknown, we can try to assign weights randomly. The idea is that, for each couple $(p,q)$ of input channel and output feature map, we randomly assign weights from the kernel $\Theta[:,p,q]$ to the propagation. Then, after propagation, we follow with a FC layer on the feature\'s dimension, as if we were learning which features have learnt the best representation in the previous layer.
% \subsection{Semi-supervised node classification}
% When transposing the problem to the task of semi-supervised node classification, a ternary layer can be understood as a multi-head GCN \citep{kipf2016semi}, for which each head corresponds to a different graph. The slices $S[k,:,:]$ would characterize the corresponding adjacency matrices $A_k$.
% \todo{include an experiment if time ? ie with sampled graph for each head}
% \todo{transpose, w=1}
% kipf is ternary rep but w = 1
% Idea: multiple graphs, each graph is a w
% Redo kipf experiments with directed graph, one w for the digraph, and another w for its transpose
\subsection{Experiments with general graphs}
\label{sec:gsexp}
The neural contraction is not usable off-the-shelf for general graphs, since the weight sharing scheme $S$ needs to be specified. One strategy is to learn it alongside the weight kernel $\Theta$ (see \secref{sec:learningscheme}). Another consists in inferring it from the graph itself (see \secref{sec:trans}). But first, let us try some ideas based on randomization.
\paragraph{Supervised classification of graph signals}
We test the idea of filling the scheme $S$ with random one-hot vectors, while respecting the sparse priors mentioned in \secref{sec:sparse}. We apply neural contractions in a depthwise separable fashion, similarly to depthwise separable convolutions~\citep{chollet2016xception}. That is, we first propagate the neuron values on the propagation graph for each of the $N$ input feature maps independently, using a different realization of $S$ per input feature map. Then we apply a fully connected layer in the feature space (which is a convolution with a trivial sliding window). Since this amounts to computing $M$ learned weighted averages (one per output feature map) of $N$ random realizations, we call this a Monte-Carlo (MC) module. We call the corresponding neural networks \emph{Monte-Carlo Networks} (MCNet).
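The forward pass of one MC module can be sketched as follows (our own NumPy illustration with hypothetical names; for brevity the LRFs are drawn uniformly at random here, whereas in the experiment below they come from the propagation graph):
\begin{verbatim}
import numpy as np

def random_scheme(omega, n, m, rng):
    # Random one-hot slices respecting the sparse priors: each output
    # neuron j gets omega distinct input neurons, one per kernel weight.
    S = np.zeros((omega, n, m))
    for j in range(m):
        inputs = rng.choice(n, size=omega, replace=False)
        S[np.arange(omega), inputs, j] = 1.0
    return S

def mc_module_forward(X, Theta_depth, W_point, rng):
    # X: (n, N, B) input signal; Theta_depth: (omega, N) depthwise kernels;
    # W_point: (N, M) trivial convolution mixing the feature maps.
    omega, N = Theta_depth.shape
    n, _, B = X.shape
    # Depthwise step: one independent random realization of S per input map.
    H = np.zeros((n, N, B))
    for p in range(N):
        S_p = random_scheme(omega, n, n, rng)
        H[:, p, :] = np.einsum('k,kij,ib->jb', Theta_depth[:, p], S_p, X[:, p, :])
    # Pointwise step: M learned weighted averages of the N random realizations.
    return np.einsum('jnb,nm->jmb', H, W_point)  # shape (n, M, B)
\end{verbatim}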
We replicated the experimental setup of \cite{defferrard2016convolutional} on the $20$NEWS dataset. We refer the reader to \secref{sec:spec} for the preliminary discussion on this experiment. We used the architecture depicted in \figref{fig:archiMC}. It starts with $2$ MC modules with $64$ output feature maps, followed by a trivial convolution with $1$ output feature map, then an FC layer with $500$ hidden units and a final FC layer for classification. A residual connection is added between the input and the first FC layer. For the MC modules, we used the graph connecting $2$ nearest neighbors. Therefore, this architecture can be seen as a small random perturbation of the MLP from \tabref{tab:20}. Each layer except the last is followed by $20\%$ dropout. As usual, we use ReLU activations, except for the last layer, which has a softmax activation.
\begin{figure}[h!tbp]
\centering
\begin{tikzpicture}[shorten >=1pt,node distance=1.5cm, thick]
\tikzstyle{every node} = [inner sep=10pt, transform shape];
\begin{scope}[rotate=90]
\node[] (a) {input};
\node[] (b) [below of = a, inner sep=0pt] {};
\node[rectangle, draw] (c) [below of = b] {MC $64$};
\node[rectangle, draw] (d) [below of = c] {MC $64$};
\node[rectangle, draw] (e) [below of = d] {Trivial Conv $1$};
\node[circle, draw, inner sep=0.03cm] (f) [below of = e] {\Large{+}};
\node[rectangle, draw] (g) [below of = f] {FC $500$};
\node[rectangle, draw] (h) [below of = g] {FC $20$};
\node[] (i) [below of = h] {output};
\path[->] (a) edge (c);
\path[->] (c) edge (d);
\path[->] (d) edge (e);
\path[->] (e) edge (f);
\path[->] (f) edge (g);
\path[->] (g) edge (h);
\path[->] (h) edge (i);
\draw[->] (b) to[out=180, in=90] ($(d) - (3,0)$) to[in=180,out=270] (f);
\end{scope}
\end{tikzpicture}
\caption{Diagram of the MCNet architecture used}
\label{fig:archiMC}
\end{figure}
After $100$ runs of $10$ epochs, MCNet obtained a mean accuracy of $70.74~\pm~2.19\%$, a statistically significant gain over ChebNet ($65.76\%$, see \tabref{tab:20}), but not over the aforementioned MLP ($71.46~\pm~0.08\%$). Note that the variance of MCNet is quite high: cherry-picking the maximum instead of considering the mean improves the result to $72.62\%$, which is better than the MLP's maximum of $72.25\%$. Therefore, in some random realizations, MCNet is capable of leveraging the underlying graph structure (and reaching state-of-the-art performance).
This is the least to be expected, but when the quality of the graph is low, there may not be anything better to demonstrate. A dataset whose graph fits the underpinning structure of its domain well often resembles a grid graph to some extent. Therefore, we expect the model presented in \secref{sec:trans} to be more relevant in most cases.
\paragraph{Semi-supervised classification of nodes}
A similar randomization-based idea that comes to mind when transposing the question to the semi-supervised task is to apply dropout to the edges of the graph. We call this \emph{graph dropout}. We test this approach on citation networks, for which we set $\omega = 1$ (\ie one graph). The resulting architecture amounts to a GCN \citep{kipf2016semi} to which graph dropout is applied. We report and discuss our results in \secref{sec:lss}, \tabref{tab:lss}, where we compare them with other models.
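A minimal sketch of graph dropout on an adjacency matrix is given below (our own naming; symmetrization and renormalization conventions may differ from the actual experiment):
\begin{verbatim}
import numpy as np

def graph_dropout(A, drop_rate, rng=None):
    # Keep each edge of the (possibly weighted) adjacency matrix A
    # independently with probability 1 - drop_rate.
    rng = np.random.default_rng(0) if rng is None else rng
    mask = rng.random(A.shape) >= drop_rate
    return A * mask

# During training, the propagation scheme (the normalized adjacency matrix)
# would be recomputed from graph_dropout(A, drop_rate) at every forward pass;
# at test time the full adjacency matrix is used.
\end{verbatim}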