\documentclass[runningheads]{llncs}
%
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
%\usepackage{amsthm}
\usepackage{graphicx}
\usepackage{xcolor}
\usepackage{soul}
\usepackage{hyperref}
\usepackage{algorithm}
\usepackage[noend]{algpseudocode}
\usepackage{dsfont}
\usepackage{caption}
\usepackage{subcaption}

% Used for displaying a sample figure. If possible, figure files should
% be included in EPS format.
%
% If you use the hyperref package, please uncomment the following line
% to display URLs in blue roman font according to Springer's eBook style:
% \renewcommand\UrlFont{\color{blue}\rmfamily}

\begin{document}
%
%\title{D-Cliques: Topology can compensate NonIIDness in Decentralized Federated Learning}
\title{D-Cliques: Compensating NonIIDness in Decentralized Federated Learning
with Topology}
%
\titlerunning{D-Cliques}
% If the paper title is too long for the running head, you can set
% an abbreviated paper title here
%
%\author{Aur\'elien Bellet\inst{1}\thanks{Authors in alphabetical order of last names, see Section 'Credits' for respective contributions.} \and
%Anne-Marie Kermarrec\inst{2} \and
%Erick Lavoie\inst{2}}
%%
%\authorrunning{A. Bellet, A-M. Kermarrec, E. Lavoie}
%% First names are abbreviated in the running head.
%% If there are more than two authors, 'et al.' is used.
%%
%\institute{Inria, Lille, France\\
%\email{aurelien.bellet@inria.fr} \and
%EPFL, Lausanne, Switzerland \\
%\email{\{anne-marie.kermarrec,erick.lavoie\}@epfl.ch}\\
%}
%
\maketitle              % typeset the header of the contribution
%
\begin{abstract}
The convergence speed of machine learning models trained with Federated
Learning is significantly affected by non-independent and identically
distributed (non-IID) data partitions, even more so in a fully decentralized
setting without a central server. In this paper, we show that the impact of
\textit{local class bias} can be significantly reduced by carefully designing
the underlying communication topology. We present D-Cliques, a novel topology
that reduces gradient bias by grouping nodes in interconnected cliques such
that the local joint distribution in a clique is representative of the global
class distribution. We also show how to adapt the updates of decentralized SGD
to obtain unbiased gradients and effective momentum with D-Cliques. Our
empirical evaluation on MNIST and CIFAR10 demonstrates that our approach
provides similar convergence speed as a fully-connected topology with a
significant reduction in the number of edges and messages. In a 1000-node
topology, D-Cliques requires 98\% fewer edges and 96\% fewer total messages,
with further possible gains using a small-world topology across cliques.
% Our study paves the way for tackling more general types of data non-IIDness
% through the design of appropriate topologies.

\keywords{Decentralized Learning \and Federated Learning \and Topology \and
Non-IID Data \and Stochastic Gradient Descent}
\end{abstract}
%
%
%
\section{Introduction}

% 1/ Decentralized FL approaches can be more scalable than Centralized FL approach when the number of nodes is large
% 2/ It is well known the topology can affect convergence of decentralized algorithms, as shown by classic convergence analysis. However the effect of topology has been observed to be often quite small in practice. This is because most of these results were obtained for iid data.
% 3/ In this paper, we show that the effect of topology is very significant for non-iid data. Unlike for centralized FL approaches, this happens even when nodes perform a single local update before averaging. We propose an approach to design a sparse data-aware topology which recovers the convergence speed of a centralized approach.
% 4/ An originality of our approach is to work at the topology level without changing the original efficient and simple D-SGD algorithm. Other work to mitigate the effect of non-iid on decentralized algorithms are based on performing modified updates (eg with variance reduction) or multiple averaging steps.

Machine learning is currently shifting from a \emph{centralized}
paradigm, in which models are trained on data located on a single machine or
in a data center, to a \emph{decentralized} one.
Indeed, the latter paradigm closely matches the natural data distribution
in the numerous use-cases where data is collected and processed by several
independent
parties (hospitals, companies, personal devices...).
Federated Learning (FL) allows a set
of participants to collaboratively train machine learning models on their
data while keeping it where it has been produced. Not only does this avoid
the costs of moving data, but it also mitigates privacy and confidentiality concerns~\cite{kairouz2019advances}.
Yet, such data distribution introduces new challenges for learning systems, as
local datasets
reflect the usage and production patterns specific to each participant: they are
\emph{not} independent and identically distributed
(non-IID). More specifically, the relative frequency of different classes of examples may significantly vary
across local datasets \cite{quagmire}.
Therefore, one of the key challenges in FL is to design algorithms that
can efficiently deal with such non-IID data 
\cite{kairouz2019advances,fedprox,scaffold,quagmire}.

Federated learning algorithms can be classified into two categories depending
on the underlying network topology they run on. In server-based FL, the
network is organized according to a star topology: a central server orchestrates the training process by
iteratively aggregating model updates received from the participants
(\emph{clients}) and sending
the aggregated model back to them \cite{mcmahan2016communication}. In contrast,
fully decentralized FL algorithms operate over an arbitrary graph topology
where participants communicate only with their direct neighbors
in the graph. A classic example of such algorithms is Decentralized
SGD (D-SGD) \cite{lian2017d-psgd}, in which participants alternate between
local SGD updates and model averaging with neighboring nodes.

In this paper, we focus on fully decentralized algorithms as they can
generally scale better to the large number of participants seen in ``cross-device''
applications \cite{kairouz2019advances}. Indeed, while a central
server may quickly become a bottleneck as the number of participants increases, the topology used in fully decentralized algorithms can remain sparse
enough such that all participants need only communicate with a small number of other participants, i.e. nodes have a small (constant or logarithmic) degree 
\cite{lian2017d-psgd}. For IID data, recent work has shown both empirically 
\cite{lian2017d-psgd,Lian2018} and theoretically \cite{neglia2020} that sparse
topologies like rings or grids do not significantly affect the convergence
speed compared to using denser topologies.
% We also note that full decentralization can also provide benefits in terms of
% privacy protection \cite{amp_dec}.

\begin{figure}[t]
     \centering
     
     % From directory results/mnist
     % python ../../../learn-topology/tools/plot_convergence.py ring/iid/all/2021-03-30-16:07:06-CEST ring/non-iid/all/2021-03-30-16:07:03-CEST --add-min-max --legend 'lower right' --yaxis test-accuracy --labels '100 nodes IID' '100 nodes non-IID' --save-figure ../../figures/ring-IID-vs-non-IID.png --font-size 20
     \begin{subfigure}[b]{0.31\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/ring-IID-vs-non-IID}
\caption{\label{fig:ring-IID-vs-non-IID} Ring}
     \end{subfigure}
     \quad
    % From directory results/mnist
     % python ../../../learn-topology/tools/plot_convergence.py grid/iid/all/2021-03-30-16:07:01-CEST grid/non-iid/all/2021-03-30-16:06:59-CEST --add-min-max --legend 'lower right' --yaxis test-accuracy --labels '100 nodes IID' '100 nodes non-IID' --save-figure ../../figures/grid-IID-vs-non-IID.png --font-size 20
     \begin{subfigure}[b]{0.31\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/grid-IID-vs-non-IID}
\caption{\label{fig:grid-IID-vs-non-IID} Grid}
     \end{subfigure}
     \quad
         % From directory results/mnist
     % python ../../../learn-topology/tools/plot_convergence.py fully-connected/iid/all/2021-03-30-16:07:20-CEST fully-connected/all/2021-03-10-09:25:19-CET  --add-min-max --legend 'lower right' --yaxis test-accuracy --labels '100 nodes IID' '100 nodes non-IID' --save-figure ../../figures/fully-connected-IID-vs-non-IID.png --font-size 20
     \begin{subfigure}[b]{0.31\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/fully-connected-IID-vs-non-IID}
\caption{\label{fig:fully-connected-IID-vs-non-IID} Fully-connected}
     \end{subfigure}
        \caption{IID vs non-IID convergence speed of decentralized SGD for
        logistic regression on
        MNIST for different topologies. Bold lines show the average test
        accuracy across nodes
        while thin lines show the minimum
        and maximum accuracy of individual nodes. While the effect of topology
        is negligible for IID data, it is very significant in the
        non-IID case. When fully-connected, both cases converge similarly. See
        Section~\ref{section:experimental-settings} for details on
        the experimental setup.}
        \label{fig:iid-vs-non-iid-problem}
\end{figure}

In contrast to the IID case, however, our experiments demonstrate that \emph{the impact of topology is extremely significant for non-IID data}. This phenomenon is illustrated
in Figure~\ref{fig:iid-vs-non-iid-problem}: we observe that a ring or
a grid topology clearly jeopardizes the convergence speed, as the relative
class frequencies of local distributions differ from those of the global
distribution, i.e. nodes have \textit{local class bias}. We stress the fact that, unlike in centralized FL
\cite{kairouz2019advances,scaffold,quagmire}, this
happens even when nodes perform a single local update before averaging the
model with their neighbors. In this paper, we address the following question:

% \textit{Are there regular topologies, i.e. where all nodes have similar or the same number of neighbours, with less connections than a fully-connected graph that retain a similar convergence speed and non-IID behaviour?}

%\textit{Are there sparse topologies with similar convergence speed as the fully connected graph under a large number of participants with local class bias?}

\textit{Can we design sparse topologies whose convergence
  speed is similar to that of a fully connected network graph for
  a large number of participants with local class bias?}
%AMK: do we talk about local class bias or noniidness?


%Indeed, as we show with the following contributions:
In this paper, we make the following contributions:
(1) We propose D-Cliques, a sparse topology in which nodes are organized in
interconnected cliques, i.e. locally fully-connected sets of nodes, such that
the joint data distribution of each clique is representative of the global 
(IID) distribution; (2) We propose Clique Averaging, a  modified version of 
the standard D-SGD algorithm which decouples gradient averaging, used for
optimizing local models, from distributed averaging, used to ensure all models
converge, therefore reducing the bias introduced by inter-clique connections; 
(3) We show how Clique Averaging can be used to implement unbiased momentum
that would otherwise be detrimental in a non-IID setting; (4) We  demonstrate
through an extensive experimental study that our approach  removes the effect
of the local class bias both on the MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning} datasets, for training a linear model and a deep
convolutional network; (5) Finally, we demonstrate the scalability of our
approach by considering up to 1000-node networks, in contrast to most
previous work on fully decentralized learning that considers only a few tens
of participants~\cite{tang18a,neglia2020,momentum_noniid,cross_gradient,consensus_distance}.
%we show that these results hold up to 1000 participants, in contrast to most previous work on fully decentralized algorithms that considers only a few tens of participants \cite{tang18a,more_refs}. 

For instance, our results show that using D-Cliques in a 1000-node network
requires 98\% fewer edges ($18.9$ vs $999$ edges per participant on average), thus yielding a 96\% reduction in the total number of required messages (37.8 messages per round per node on average instead of 999) to obtain a similar convergence speed as a fully-connected topology. Furthermore, an additional 22\% improvement 
% (14.5 edges per node on average instead of 18.9)
is possible when using a small-world inter-clique topology, with further potential gains at larger scales because of its linear-logarithmic scaling.

The rest of this paper is organized as follows. We first present the problem
statement and our methodology (Section~\ref{section:problem}). The D-Cliques
design is presented in Section~\ref{section:d-cliques} along with an
empirical illustration of its benefits. In
Section~\ref{section:clique-averaging-momentum}, we
show how to further reduce bias with Clique Averaging and how to use it to
implement momentum. We present the results of our extensive experimental
study in  Section~\ref{section:non-clustered}. We review some related work in
Section~\ref{section:related-work}, and conclude with promising directions
 for future work in Section~\ref{section:conclusion}.
 %When % then explain how to construct D-Cliques and show their benefits (Section~\ref{section:d-cliques}). We show how to further reduce bias with Clique Averaging (Section~\ref{section:clique-averaging}). We then show how to use Clique Averaging to implement momentum (Section~\ref{section:momentum}). Having shown the effectiveness of D-Cliques, we evaluate the importance of clustering (Section~\ref{section:non-clustered}), and full intra-clique connections (Section~\ref{section:intra-clique-connectivity}). Having established the design, we then study how best to scale it (Section~\ref{section:interclique-topologies}). We conclude with a survey of related work (Section~\ref{section:related-work}) and a brief summary of the paper (Section~\ref{section:conclusion}).

%\footnotetext{This is different from the accuracy of the average model across nodes that is sometimes used once training is completed.}

\section{Problem Statement}

\label{section:problem}

We consider a set of $n$ nodes $N = \{1, \dots, n \}$ seeking to
collaboratively solve a classification task with $c$ classes.
% where each node can communicate with its neighbours according to the mixing matrix $W$ in which $W_{ij}$ defines the \textit{weight} of the outgoing connection from node $i$ to $j$. $W_{ij} = 0$ means that there is no connection from node $i$ to $j$ and $W_{ij} > 0$ means there is a connection.
%AMK:explain the weight
%Training data is sampled from a global distribution $D$ unknown to the nodes.
%AMK:Removed the sentence above
Each node has access to a local dataset that
 follows its own local distribution $D_i$. The goal is to find a global model
 $x$ that performs well on the union of the local distributions by minimizing
 the average training loss:
\begin{equation}
\min_{x} \frac{1}{n}\sum_{i=1}^{n} \mathds{E}_
{s_i \sim D_i} [F_i(x;s_i)],
\label{eq:dist-optimization-problem}
\end{equation}
where $s_i$ is a data sample of $D_i$ and $F_i$ is the loss function
on node $i$. Therefore, $\mathds{E}_{s_i \sim D_i} F_i(x;s_i)$ denotes the
expected loss of model $x$ on a random sample $s_i$ drawn from $D_i$.

To collaboratively solve Problem \eqref{eq:dist-optimization-problem}, each
node can exchange messages with its neighbors in an undirected network graph
$G(N,E)$ where $\{i,j\}\in E$ denotes an edge (communication channel)
between nodes $i$ and $j$.

\subsection{Training Algorithm}
%AMK: if we need space this could be a paragraph

In this work, we use the popular Decentralized Stochastic
Gradient Descent algorithm, also known as D-SGD~\cite{lian2017d-psgd}. As
shown in Algorithm~\ref{Algorithm:D-PSGD},
%AMK: can we say why: most popular, most efficient ?
a single step of D-SGD at node $i$ consists of sampling a mini-batch
from its local distribution
$D_i$, taking a stochastic gradient descent (SGD) step according to this
sample, and performing a weighted average of its model with its neighbors.
This weighted average is defined by a
mixing matrix $W$, in which $W_{ij}$ corresponds to the weight of
the outgoing connection from node $i$ to $j$ and $W_{ij} = 0$ for $
\{i,j\}\notin
E$. To ensure that D-SGD converges to a (local) optimum, $W$ must be doubly
stochastic ($\sum_{j \in N} W_{ij} = 1$ and $\sum_{j \in N} W_{ji} = 1$) and symmetric, i.e. $W_{ij} = W_{ji}$.

\begin{algorithm}[t]
   \caption{D-SGD, Node $i$}
   \label{Algorithm:D-PSGD}
   \begin{algorithmic}[1]
        \State \textbf{Require:} initial model parameters $x_i^{(0)}$,
        learning rate $\gamma$, mixing weights $W$, number of steps $K$
        \For{$k = 1,\ldots, K$}
          \State $s_i^{(k)} \gets \text{(mini-batch) sample from~} D_i$
          \State $x_i^{(k-\frac{1}{2})} \gets x_i^{(k-1)} - \gamma \nabla F(x_i^{(k-1)}; s_i^{(k)})$ 
          \State $x_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} x_j^{(k-\frac{1}{2})}$
        \EndFor
   \end{algorithmic}
\end{algorithm}
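
To make the update rule concrete, the following Python sketch simulates one synchronous D-SGD step for all nodes at once; \texttt{sample\_batch} and \texttt{gradient} are hypothetical helpers standing in for the local datasets and the loss, and are not part of the algorithm itself.
\begin{verbatim}
import numpy as np

def d_sgd_step(x, W, gamma, sample_batch, gradient):
    """One synchronous D-SGD step over all n nodes (simulation view).

    x: (n, d) array where row i is the model of node i.
    W: (n, n) symmetric, doubly stochastic mixing matrix.
    """
    n = x.shape[0]
    x_half = np.empty_like(x)
    for i in range(n):           # local SGD update (line 4)
        x_half[i] = x[i] - gamma * gradient(x[i], sample_batch(i))
    return W.T @ x_half          # neighbourhood averaging (line 5)
\end{verbatim}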

\subsection{Methodology}

\subsubsection{Non-IID assumptions.}
\label{section:non-iid-assumptions}
As demonstrated in Figure~\ref{fig:iid-vs-non-iid-problem}, lifting the
assumption of IID data significantly challenges the learning algorithm. In
this paper, we focus on an \textit{extreme case of local class bias}: we
consider that each node only has samples
%examples
from a single class.
% Our results should generalize to lesser, and more
% frequent, cases.
%AMK: a bit weak can't we say our results generalize....
%: e.g., if some classes are globally less represented, the position of the nodes with the rarest classes will be significant; and if two local datasets have different number of examples, the examples in the smaller dataset may be visited more often than those in a larger dataset, skewing the optimization process. 

To isolate the effect of local class bias from other potentially compounding
factors, we make the following simplifying assumptions: (1) All classes are
equally represented in the global dataset; (2) All classes are represented on the same number of nodes; (3) All nodes have the same number of examples.

We believe that these assumptions are reasonable in the context of our study
because: (1)
Global class
imbalance equally
affects the optimization process on a single node and is therefore not
specific to the decentralized setting; (2) Our results do not exploit specific
positions in the topology;  (3) Imbalanced dataset sizes across nodes can be
addressed for instance by appropriately weighting the individual loss
functions.
% with less examples could
% simply skip some rounds until the nodes with more examples catch up.
Our results can be extended to support additional compounding factors in future work.

\subsubsection{Experimental setup.}
\label{section:experimental-settings}
%AMK: I would have put this in the evaluation section, to avoid mixing design and evaluation.
Our main goal is to provide a fair comparison of the convergence speed of
different topologies and algorithmic variations, to show that our approach
can remove much of the effect of local class bias.

We experiment with two datasets: MNIST~\cite{mnistWebsite} and
CIFAR10~\cite{krizhevsky2009learning}, which both have $c=10$ classes.
For MNIST, we use 45k and 10k samples from the original 60k
training set for training and validation respectively. The remaining 5k
training samples were randomly removed to ensure all 10 classes are balanced
while ensuring the dataset is evenly divisible across 100 and 1000 nodes.
We use all 10k examples of
the test set to measure test accuracy. For CIFAR10, classes are evenly
balanced: we use 45k/50k images of the original training set for training,
5k/50k for validation, and all 10k examples of the test set for measuring
prediction accuracy.

We
use a logistic regression classifier for MNIST, which
provides up to 92.5\% accuracy in the centralized setting.
% compared to
% $99\%$ for the state-of-the-art~\cite{mnistWebsite}.
For CIFAR10, we use a Group-Normalized variant of LeNet~\cite{quagmire}, a
deep convolutional network which achieves an accuracy of $72.3\%$ in the
centralized setting.
% compared to the 99\% achieved by start-of-the-art.
These models are thus reasonably accurate (which is sufficient to
study the effect of the topology) while being sufficiently fast to train in a
fully decentralized setting, and are simple enough to configure and analyze.
Regarding hyper-parameters, we jointly optimized the learning rate and
mini-batch size on the
validation set for 100 nodes, obtaining respectively $0.1$ and $128$ for
MNIST and $0.002$ and $20$ for CIFAR10.
For CIFAR10, we additionally use a momentum of $0.9$.

We evaluate 100- and 1000-node networks by creating multiple models in memory and simulating the exchange of messages between nodes.
To ignore the impact of distributed execution strategies and system
optimization techniques, we report the test accuracy of all nodes (min, max,
average) as a function of the number of times each example of the dataset has
been sampled by a node, i.e. an \textit{epoch}. This is equivalent to the classic case of a single node sampling the full distribution.
To further make results comparable across different numbers of nodes, we
scale the batch size inversely with the number of nodes,
e.g. on MNIST, 12800 with 1 node, 128 with 100 nodes, 13 with 1000 nodes. This
ensures the same number of model updates and averaging steps per epoch, which is
important for a fair comparison.\footnote{Updating and averaging models
after every example can eliminate the impact of local class bias. However, the
resulting communication overhead is impractical.}
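
As a minimal illustration of this scaling rule (with the MNIST values above), the per-node batch size can be derived from a reference configuration as follows:
\begin{verbatim}
def per_node_batch_size(total_samples_per_step=12800, n_nodes=100):
    """Keep the total number of samples processed per step constant."""
    return max(1, round(total_samples_per_step / n_nodes))

# 12800 with 1 node, 128 with 100 nodes, 13 with 1000 nodes (MNIST)
sizes = [per_node_batch_size(12800, n) for n in (1, 100, 1000)]
assert sizes == [12800, 128, 13]
\end{verbatim}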

Finally, we compare our results against an ideal baseline: either a
fully-connected network topology with the same number of nodes or a single IID
node. In both approaches, the topology has no effect on
the optimization. For a certain choice of number of nodes and
mini-batch size, both approaches are equivalent.  %ensure a single
% model is optimized, which therefore removes the effect of the topology. While, both approaches compute an equivalent gradient with the same expectation, we favored  using a single IID node for CIFAR10 for the sake of training speed.




\section{D-Cliques: Creating Locally Representative Cliques}
\label{section:d-cliques}
In this section, we present the design of D-Cliques. To give an intuition of our approach, let us consider the neighbourhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}, represented in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}, where each color represents a class of data.
The colors of a node, represented as a circle, correspond to the different classes it hosts locally. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has examples of all classes in equal proportions. In the non-IID setting (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a single class and nodes are distributed randomly in the grid. A single training step, from the point of view of the middle node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes. 

%For an intuition on the effect of local class bias, examine the neighbourhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}. As illustrated in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}, the color of a node, represented as a circle, corresponds to a different class. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has examples of all classes in equal proportions. In the non-IID setting (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a single class and nodes are distributed randomly in the grid. A single training step, from the point of view of the middle node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes. 
\begin{figure}[htbp]
     \centering
     \begin{subfigure}[b]{0.25\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/grid-iid-neighbourhood}
\caption{\label{fig:grid-iid-neighbourhood} IID}
     \end{subfigure}
     \begin{subfigure}[b]{0.25\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/grid-non-iid-neighbourhood}
\caption{\label{fig:grid-non-iid-neighbourhood}  Non-IID}
     \end{subfigure}
        \caption{Neighbourhood in an IID and non-IID Grid.}
        \label{fig:grid-iid-vs-non-iid-neighbourhood}
\end{figure}

In the IID case, since gradients are computed from examples of all classes, the resulting average gradient points in a direction that lowers the loss for all classes. However, in the non-IID case, not all classes are represented in the immediate neighbourhood. Therefore, nodes diverge from one another according to the classes represented.% more than in the IID case.
Moreover, as the distributed averaging algorithm takes several steps to converge, this variance persists between steps, as the computed gradients are far from the global average.\footnote{It is possible, but impractical, to compensate with enough additional averaging steps.} This can significantly slow down convergence to the point of making parallel optimization impractical.
In D-Cliques, we address the issue of non-IIDness by carefully designing the underlying network topology, which is composed of \textit{cliques} and \textit{inter-clique connections}:
\begin{itemize}
 \item  D-Cliques recovers a balanced representation of classes, similar to that of the IID case, by modifying the topology such that each node is part of a \textit{clique} with neighbours representing all classes.
 \item To ensure all cliques converge, \textit{inter-clique connections} are introduced, established directly between nodes belonging to different cliques.
\end{itemize}
Because a joint local distribution $D_{\textit{clique}} = \sum_{i \in \textit{clique}} D_i$ is representative of the global distribution, a sparse topology can be used between cliques, significantly reducing the total number of edges required to obtain quick convergence. Because the number of connections required per node is low and evenly distributed across nodes, this approach is well suited to decentralized federated learning.\footnote{See Algorithm~\ref{Algorithm:D-Clique-Construction} in the Appendix for a set-based formulation of the D-Cliques construction.}
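
For illustration, a minimal Python sketch of the clique construction under the assumptions of Section~\ref{section:non-iid-assumptions} (a single class per node and as many nodes per class) could look as follows; the set-based formulation in the Appendix remains the reference.
\begin{verbatim}
from itertools import combinations

def build_cliques(node_class, n_classes):
    """Group nodes into cliques containing one node of each class.

    node_class: dict mapping node id -> its single local class.
    Returns the list of cliques (as node sets) and the set of
    undirected edges that fully connect each clique.
    """
    by_class = {k: sorted(v for v, c in node_class.items() if c == k)
                for k in range(n_classes)}
    n_cliques = len(node_class) // n_classes
    cliques, intra_edges = [], set()
    for q in range(n_cliques):
        clique = {by_class[k][q] for k in range(n_classes)}
        cliques.append(clique)
        intra_edges |= {frozenset(e) for e in combinations(clique, 2)}
    return cliques, intra_edges
\end{verbatim}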

Finally, weights
%AMK: explain weights
are assigned to edges to ensure quick convergence. For this study we use Metropolis-Hastings weights~\cite{xiao2004fast}, which, while not necessarily optimal, are quick to compute and still provide good convergence speed: 
\begin{equation}
  W_{ij} = \begin{cases}
    \frac{1}{\max(\text{degree}(i), \text{degree}(j)) + 1} & \text{if}~i \neq j \text{ and } \{i,j\} \in E,\\
   1 - \sum_{j' \neq i} W_{ij'} & \text{if}~i = j, \\
   0 & \text{otherwise.}
  \end{cases}
\end{equation}
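
For illustration, a direct implementation of these weights from an undirected adjacency list might look as follows (a sketch; any construction yielding a symmetric, doubly stochastic $W$ would do):
\begin{verbatim}
import numpy as np

def metropolis_hastings_weights(adjacency):
    """Mixing matrix W from an adjacency list {node: set of neighbours}."""
    nodes = sorted(adjacency)
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    W = np.zeros((n, n))
    for i in nodes:
        for j in adjacency[i]:
            W[index[i], index[j]] = 1.0 / (max(len(adjacency[i]),
                                               len(adjacency[j])) + 1)
    for i in range(n):
        W[i, i] = 1.0 - W[i].sum()   # self-weight absorbs the rest
    return W
\end{verbatim}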
Note that, for the sake of simplicity, we assume that the topology is generated with global knowledge of the class distribution. Relaxing this assumption is left for future work.

%We centrally generate the topology, which is then tested in a custom simulator. We expect our approach should be straightforward to adapt for a decentralized execution: the presence and relative frequency of global classes could be computed using PushSum~\cite{kempe2003gossip}, and neighbours could be selected with PeerSampling~\cite{jelasity2007gossip}.

\begin{figure}[htbp]
    \centering 
             
    \begin{subfigure}[b]{0.4\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figures/fully-connected-cliques}
    \caption{\label{fig:d-cliques-figure} D-Cliques Connected Pairwise}
    \end{subfigure}
    \hfill
    % To regenerate figure, from results/mnist
    % python ../../../learn-topology/tools/plot_convergence.py fully-connected/all/2021-03-10-09:25:19-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET --add-min-max --yaxis test-accuracy --ymin 80 --ymax 92.5 --labels '100 nodes non-IID fully-connected' '100 nodes non-IID d-cliques' --save-figure ../../figures/d-cliques-mnist-vs-fully-connected.png --legend 'lower right' --font-size 16
    \begin{subfigure}[b]{0.55\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figures/d-cliques-mnist-vs-fully-connected.png}
    \caption{\label{fig:d-cliques-example-convergence-speed} Convergence Speed on MNIST. Y-axis starts at 80.}
    \end{subfigure}
    
\caption{\label{fig:d-cliques-example} D-Cliques}
\end{figure}

A network of 100 non-IID nodes with D-Cliques is illustrated in Figure~\ref{fig:d-cliques-figure}, with its convergence speed shown in Figure~\ref{fig:d-cliques-example-convergence-speed}. The convergence speed is quite close to that of a fully-connected topology, and significantly better than that of the ring and grid of Figure~\ref{fig:iid-vs-non-iid-problem}. At a scale of 100 nodes, it uses only $\approx10\%$ of the number of edges of a fully-connected topology, offering a reduction of $\approx90\%$. Nonetheless, there is still significant variance in accuracy between nodes, which we address in the next section by removing the bias introduced by inter-clique edges.

%The degree of \textit{skew} of local distributions $D_i$, i.e. how much the local distribution deviates from the global distribution on each node, influences the minimal size of cliques. 
%
%The global distribution of classes, for classification tasks, can be computed from the distribution of class examples on the nodes, with Distributed Averaging (CITE). Given the global distribution of classes, neighbours within cliques can be chosen based on a PeerSampling (CITE) service. Both services can be implemented such that they converge in a logarithmic number of steps compared to the number of nodes. It is therefore possible to obtain this information in a scalable way.
%
% In the rest of this paper, we assume these services are available and show that the approach provides a useful convergence speed after the cliques have been formed.
\section{Optimizing with Clique Averaging}
\label{section:clique-averaging-momentum}

In this section, we present Clique Averaging, a feature that further reduces the bias introduced by non-IID data.
%AMK: check

\subsection{Removing Gradient Bias from Inter-Clique Edges}
\label{section:clique-averaging}

While limiting the number of inter-clique connections also limits the amount of data traveling on the network, it may introduce some bias, as observed in our experiments, either because of the Metropolis-Hastings edge weights or because some classes are more represented in a neighbourhood. Figure~\ref{fig:connected-cliques-bias} illustrates the problem in the simplest case of two cliques connected by one inter-clique edge, i.e. this edge connects the green node of the left clique with the purple node of the right clique. 
Using Metropolis-Hastings weights, node A's implicit self-edge has a weight of $\frac{12}{110}$, while all of A's neighbours have a weight of $\frac{11}{110}$, except the green node connected to B, which has a weight of $\frac{10}{110}$. This weight assignment therefore biases the gradient towards A's purple class and away from the green class. The same analysis holds for all other nodes without inter-clique edges and their respective classes. For node B, all edges and B's self-edge have weights of $\frac{1}{11}$. However, the green class is represented twice, once as a clique neighbour and once at the other end of the inter-clique edge, while all other classes are represented only once. This biases the gradient toward the green class. The combined effect of these two sources of bias is to increase the variance between models after a D-SGD step of training.
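
These values can be checked directly from the Metropolis-Hastings rule; the short sketch below reproduces them for two cliques of 10 nodes joined by a single inter-clique edge (node labels are hypothetical, with node 1 playing the role of A and node 0 the role of the green node).
\begin{verbatim}
# Two cliques of 10 nodes, fully connected internally; nodes 0 and 10
# carry the single inter-clique edge, so their degree is 10 instead of 9.
degree = {v: 9 for v in range(20)}
degree[0] += 1
degree[10] += 1
w = lambda i, j: 1 / (max(degree[i], degree[j]) + 1)
w_green = w(1, 0)                   # 1/11 = 10/110
w_other = w(1, 2)                   # 1/10 = 11/110 (8 such neighbours)
w_self = 1 - 8 * w_other - w_green  # 12/110
print(w_green, w_other, w_self)
\end{verbatim}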

\begin{figure}[htbp]
         \centering
         \includegraphics[width=0.5\textwidth]{figures/connected-cliques-bias}
\caption{\label{fig:connected-cliques-bias} Sources of Bias in Connected Cliques: Non-uniform weights in neighbours of A (A has a higher weight); Non-uniform class representation in neighbours of B (extra green node).}
\end{figure}

We solve this problem by adding Clique Averaging to D-SGD (Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}): gradient averaging is decoupled from model averaging by sending each in separate rounds of messages. Only the gradients of neighbours within the same clique are used to compute the average gradient, providing an equal representation to all classes. However, all neighbours' models, including those across inter-clique edges, participate in the model averaging as in the original version.

\begin{algorithm}[h]
   \caption{D-SGD with Clique Averaging, Node $i$}
   \label{Algorithm:Clique-Unbiased-D-PSGD}
   \begin{algorithmic}[1]
        \State \textbf{Require} initial model parameters $x_i^{(0)}$, learning rate $\gamma$, mixing weights $W$, number of steps $K$, loss function $F$
        \For{$k = 1,\ldots, K$}
          \State $s_i^{(k)} \gets \textit{sample from~} D_i$
          \State $g_i^{(k)} \gets \frac{1}{|\textit{Clique}(i)|}\sum_{j \in \textit{Clique(i)}}  \nabla F(x_j^{(k-1)}; s_j^{(k)})$
          \State $x_i^{(k-\frac{1}{2})} \gets x_i^{(k-1)} - \gamma g_i^{(k)}$ 
          \State $x_i^{(k)} \gets \sum_{j \in N} W_{ji}^{(k)} x_j^{(k-\frac{1}{2})}$
        \EndFor
   \end{algorithmic}
\end{algorithm}
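
As for Algorithm~\ref{Algorithm:D-PSGD}, a simulation-level Python sketch may help see how gradient averaging, restricted to the clique, is decoupled from model averaging over all neighbours; \texttt{sample\_batch} and \texttt{gradient} are again hypothetical helpers.
\begin{verbatim}
import numpy as np

def clique_averaged_step(x, W, cliques, gamma, sample_batch, gradient):
    """One step of D-SGD with Clique Averaging (simulation view).

    cliques: list of disjoint node sets covering all n nodes.
    """
    n = x.shape[0]
    grads = np.stack([gradient(x[i], sample_batch(i)) for i in range(n)])
    g = np.empty_like(grads)
    for clique in cliques:
        members = sorted(clique)
        g[members] = grads[members].mean(axis=0)  # Clique Averaging (line 4)
    return W.T @ (x - gamma * g)   # SGD step, then model averaging
\end{verbatim}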

% To regenerate figure, from results/mnist:
% python ../../../learn-topology/tools/plot_convergence.py fully-connected/all/2021-03-10-09:25:19-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET  no-init/fully-connected-cliques/all/2021-03-12-11:12:01-CET --add-min-max --yaxis test-accuracy --labels '100 nodes non-IID fully-connected' '100 nodes non-IID d-cliques w/o clique avg.' '100 nodes non-IID w/ clique avg.' --legend 'lower right' --ymin 89 --ymax 92.5 --font-size 13 --save-figure ../../figures/d-clique-mnist-clique-avg.png
\begin{figure}[htbp]
         \centering
         \includegraphics[width=0.55\textwidth]{figures/d-clique-mnist-clique-avg}
\caption{\label{fig:d-clique-mnist-clique-avg} Effect of Clique Averaging on MNIST. Y-axis starts at 89.}
\end{figure}

As illustrated in Figure~\ref{fig:d-clique-mnist-clique-avg}, this significantly reduces variance between nodes and accelerates convergence. The convergence speed is now essentially identical to that obtained when fully connecting all nodes. The tradeoff is a higher messaging cost, double that of the version without Clique Averaging, and an increased latency of a single training step, which requires two rounds of messages. Nonetheless, compared to fully connecting all nodes, the total number of messages is reduced by $\approx 80\%$. 

\subsection{Implementing Momentum with Clique Averaging}
\label{section:momentum}
Quickly training higher-capacity models, such as a deep convolutional network, on harder datasets, such as CIFAR10, usually requires additional optimization techniques. We show here how Clique Averaging (Section~\ref{section:clique-averaging}) easily enables the implementation, in the presence of local class bias, of optimization techniques that would otherwise require IID mini-batches.
In particular, we implement momentum~\cite{pmlr-v28-sutskever13}, which increases the magnitude of the components of the gradient that are shared between several consecutive steps. Momentum is critical for making deep convolutional networks, such as LeNet~\cite{lecun1998gradient,quagmire}, converge quickly. However, a naive application of momentum in a non-IID setting can actually be detrimental. As illustrated in Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-effect}, LeNet trained on CIFAR10 over 100 nodes with
D-Cliques and momentum actually fails to converge. As shown, not using momentum gives a better convergence speed, but there is still a significant gap compared to a single IID node.
\begin{figure}[htbp]
    \centering 
    % To regenerate figure, from results/cifar10
    % python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET  no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET no-init-no-clique-avg-no-momentum/fully-connected-cliques/all/2021-03-26-13:47:35-CET/ --legend 'upper right' --add-min-max --labels '1-node IID w/ momentum'  '100 nodes non-IID d-cliques w/ momentum' '100 nodes non-IID d-cliques w/o momentum'  --font-size 14 --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-momentum-non-iid-effect.png --ymax 100         
    \begin{subfigure}[b]{0.45\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-momentum-non-iid-effect}
    \caption{\label{fig:d-cliques-cifar10-momentum-non-iid-effect} Without Clique Averaging }
    \end{subfigure}
    \hfill
    % To regenerate figure, from results/cifar10
    % python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET no-init/fully-connected-cliques/all/2021-03-13-18:32:55-CET --legend 'upper right' --add-min-max --labels '1-node IID w/ momentum' '100 nodes non-IID d-clique w/ momentum' --font-size 14 --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-momentum-non-iid-clique-avg-effect.png --ymax 100
    \begin{subfigure}[b]{0.45\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-momentum-non-iid-clique-avg-effect}
    \caption{\label{fig:d-cliques-cifar10-momentum-non-iid-clique-avg-effect} With Clique Averaging}
    \end{subfigure}
\caption{\label{fig:cifar10-momentum} Non-IID Effect of Momentum on CIFAR10 with LeNet}
\end{figure}
Using Clique Averaging (Section~\ref{section:clique-averaging}), unbiased momentum can be calculated from the unbiased average gradient $g_i^{(k)}$ of Algorithm~\ref{Algorithm:Clique-Unbiased-D-PSGD}:
\begin{equation}
v_i^{(k)} \leftarrow m v_i^{(k-1)} +  g_i^{(k)} 
\end{equation}
It then suffices to modify the original gradient step to use momentum:
\begin{equation}
x_i^{(k-\frac{1}{2})} \leftarrow x_i^{(k-1)} - \gamma v_i^{(k)} 
\end{equation}
Using momentum closes the gap, with a slightly lower convergence speed in the first 20 epochs, as illustrated in Figure~\ref{fig:d-cliques-cifar10-momentum-non-iid-clique-avg-effect}.
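
In code, the change amounts to maintaining a per-node momentum buffer fed with the clique-averaged gradient; a minimal sketch (with $m=0.9$ as in our CIFAR10 experiments) is:
\begin{verbatim}
def momentum_update(x_i, v_i, g_i, gamma, m=0.9):
    """Momentum on top of the clique-averaged gradient g_i (Algorithm 2).

    v_i is the momentum buffer of node i, initialized to zero.
    Returns the updated buffer and the intermediate model x_i^(k-1/2);
    model averaging with neighbours is then performed as usual.
    """
    v_i = m * v_i + g_i
    return v_i, x_i - gamma * v_i
\end{verbatim}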
\section{Comparative Evaluation and Extensions}
\label{section:non-clustered}

%AMK: add what is in there
In this section, we first compare D-Cliques to alternative topologies and further evaluate the impact of Clique Averaging, then assess the importance of intra-clique full connectivity, and finally evaluate D-Cliques-based extensions for scaling to larger networks.

\subsection{Comparing D-Cliques to Alternatives} %Non-Clustered Topologies}

%\label{section:non-clustered}
%We now show, in this section and the next, that the particular structure of D-Cliques is necessary. \label{section:non-clustered}
We compare D-Cliques against competitors to demonstrate its advantages over alternative topologies.
First, we show that similar results may not necessarily be obtained from a similar number of edges chosen at random. We therefore compare D-Cliques, with and without Clique Averaging, to a random topology on 100 nodes chosen such that each node has exactly 10 edges, which is similar to, and even slightly higher than, the 9.9 edges on average of the previous D-Cliques example (Fig.~\ref{fig:d-cliques-figure}). To better understand the effect of clustering, we also compare to a similar random topology where edges are chosen such that each node has neighbours of all possible classes, but without these neighbours forming a clique. We finally also compare with an analogue of Clique Averaging, in which all nodes de-bias their gradient with that of their neighbours; in this case, since nodes do not form a clique, no two nodes actually compute the same resulting average gradient.
Results for MNIST and CIFAR10 are shown in Figure~\ref{fig:d-cliques-comparison-to-non-clustered-topologies}. For MNIST, a random topology has higher variance and lower convergence speed than D-Cliques, with or without Clique Averaging. However, a random topology with enforced diversity performs as well as, and even slightly better than, D-Cliques without Clique Averaging. Surprisingly, a random topology with unbiased gradients performs marginally worse than without, but the difference does not seem significant. Nonetheless, the D-Cliques topology with Clique Averaging performs better than any random topology, so it seems that clustering in this case has a small but significant effect.
\begin{figure}[htbp]
     \centering     
         \begin{subfigure}[b]{0.48\textwidth}
% To regenerate the figure, from directory results/mnist
% python ../../../learn-topology/tools/plot_convergence.py fully-connected-cliques/all/2021-03-10-10:19:44-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET  random-10/all/2021-03-17-20:28:12-CET  random-10-diverse/all/2021-03-17-20:28:35-CET --labels 'd-clique (fcc)' 'd-clique (fcc) no clique avg.' '10 random edges' '10 random edges (all classes represented)' --add-min-max --legend 'lower right' --ymin 88 --ymax 92.5 --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-linear-comparison-to-non-clustered-topologies.png --font-size 13
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-mnist-linear-comparison-to-non-clustered-topologies}
                  \caption{MNIST with Linear Model}
         \end{subfigure}
                 \hfill                      
% To regenerate the figure, from directory results/cifar10
% python ../../../learn-topology/tools/plot_convergence.py no-init/fully-connected-cliques/all/2021-03-13-18:32:55-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET random-10/all/2021-03-17-20:30:03-CET  random-10-diverse/all/2021-03-17-20:30:41-CET random-10-diverse-unbiased-gradient/all/2021-03-17-20:31:14-CET --labels 'd-clique (fcc) clique avg.' 'd-clique (fcc) no clique avg.' '10 random edges' '10 random edges (all classes repr.)' '10 random (all classes repr.) with unbiased grad.' --add-min-max --legend 'upper left' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies.png --ymax 119 --font-size 13
        \begin{subfigure}[b]{0.48\textwidth}
        \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies}
         \caption{CIFAR10 with LeNet}
     \end{subfigure} 
 \caption{\label{fig:d-cliques-comparison-to-non-clustered-topologies} Comparison to Non-Clustered Topologies} 
\end{figure}
For CIFAR10, the result is more dramatic, as Clique Averaging is critical for convergence (with momentum). All random topologies fail to converge, except when combining both node diversity and unbiased gradients, and in any case D-Cliques with Clique Averaging converges significantly faster. This suggests that clustering helps reduce variance between nodes and therefore improves convergence speed. We also tried LeNet on MNIST to see if the difference between MNIST and CIFAR10 could be attributed to the capacity difference between the linear and convolutional networks, whose optimization may benefit from clustering (see Appendix). The difference is less dramatic than for CIFAR10, so the dataset itself must also have an impact; its exact nature remains an open question.
\subsection{Importance of Intra-Clique Full Connectivity}
\label{section:intra-clique-connectivity}
Intra-clique full connectivity is also necessary.
%AMK: check sentence above: justify
Figure~\ref{fig:d-cliques-intra-connectivity} shows the convergence speed of D-Cliques with respectively 1 and 5 edges randomly removed per clique, out of 45 (2 and 10 out of 90 if counting both directions separately), as well as with and without Clique Averaging (resulting in a biased average gradient within cliques). In all cases, both for MNIST and CIFAR10, removing edges has a significant effect on the convergence speed. In the case of CIFAR10, it also negates the benefits of D-Cliques. 
\begin{figure}[htbp]
     \centering

\begin{subfigure}[htbp]{0.48\textwidth}
     \centering   
% To regenerate the figure, from directory results/mnist
% python ../../../learn-topology/tools/plot_convergence.py no-init/fully-connected-cliques/all/2021-03-12-11:12:01-CET rm-1-edge/all/2021-03-18-17:28:27-CET rm-5-edges/all/2021-03-18-17:29:10-CET rm-1-edge-unbiased-grad/all/2021-03-18-17:28:47-CET rm-5-edges-unbiased-grad/all/2021-03-18-17:29:36-CET --add-min-max --ymin 85 --ymax 92.5 --legend 'lower right' --yaxis test-accuracy --labels 'fcc with clique grad.' 'fcc -1 edge/clique, no clique avg.' 'fcc -5 edges/clique, no clique avg.'  'fcc -1 edge/clique, clique avg.' 'fcc -5 edges/clique, clique avg.' --save-figure ../../figures/d-cliques-mnist-clique-clustering-fcc.png  --font-size 13   
         \includegraphics[width=\textwidth]{figures/d-cliques-mnist-clique-clustering-fcc}     
\caption{\label{fig:d-cliques-mnist-clique-clustering} MNIST}
\end{subfigure}
\hfill
\begin{subfigure}[htbp]{0.48\textwidth}
     \centering
% To regenerate the figure, from directory results/cifar10
% python ../../../learn-topology/tools/plot_convergence.py no-init/fully-connected-cliques/all/2021-03-13-18:32:55-CET rm-1-edge/all/2021-03-18-17:29:58-CET rm-5-edges/all/2021-03-18-17:30:38-CET rm-1-edge-unbiased-grad/all/2021-03-18-17:30:17-CET rm-5-edges-unbiased-grad/all/2021-03-18-17:31:04-CET --add-min-max --ymax 80 --legend 'upper left' --yaxis test-accuracy --labels 'fcc, clique grad.' 'fcc -1 edge/clique, no clique grad.' 'fcc -5 edges/clique, no clique grad.'  'fcc -1 edge/clique, clique grad.' 'fcc -5 edges/clique, clique grad.' --save-figure ../../figures/d-cliques-cifar10-clique-clustering-fcc.png --font-size 13
         \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-clique-clustering-fcc}
\caption{\label{fig:d-cliques-cifar10-clique-clustering} CIFAR10}
\end{subfigure}

\caption{\label{fig:d-cliques-intra-connectivity} Importance of Intra-Clique Full-Connectivity}
\end{figure}
%AMK: how many nodes?
%\section{Scaling with Different Inter-Clique Topologies}
\subsection{Scaling with D-Cliques Extensions}
%with Different Inter-Clique Topologies}
\label{section:interclique-topologies}
We finally evaluate the effect of the inter-clique topology on convergence speed in a larger network of 1000 nodes. We compare the scalability and convergence speed of several variants based on D-Cliques, all of which use $O(nc)$ edges to create cliques as a foundation, where $n$ is the number of nodes and $c$ is the size of a clique.

First, the scheme that uses the smallest (almost\footnote{A path uses one edge less, at the cost of a significantly slower convergence speed, and is therefore not used in practice.}) number of extra edges is a \textit{ring}. A ring adds $\frac{n}{c}$ inter-clique edges and therefore scales linearly in $O(n)$.
We introduce a second scheme that scales linearly while keeping a logarithmic bound on the average shortest path (in number of hops) between nodes, which we call ``\textit{fractal}''. In this scheme, as nodes are added, cliques are assembled in larger groups of $c$ cliques that are connected internally with one edge per pair of cliques, but with only one edge between pairs of larger groups. The scheme is recursive such that $c$ groups will themselves form a larger group at the next level up. This scheme results in at most $nc$ edges in total if edges are evenly distributed, and therefore also scales linearly in the number of nodes.
Third, we propose to connect cliques according to a small-world-like~\cite{watts2000small} topology, applied to a ring~\cite{stoica2003chord}. In this scheme, cliques are first arranged in a ring. Then each clique adds symmetric edges, both clockwise and counter-clockwise on the ring, to the $ns$ closest cliques in sets of cliques that become exponentially bigger the further away they are on the ring.\footnote{See Algorithm~\ref{Algorithm:Smallworld} in the Appendix for a detailed listing.} This ensures good clustering with other cliques that are close on the ring, while still keeping the average shortest path small. This scheme adds $2(ns)\log(\frac{n}{c})$ inter-clique edges per clique, and therefore the total number of edges grows as $O(n \log(n))$ with the number of nodes.
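
One possible reading of this scheme, given as a hedged sketch below, connects each clique on the ring to a few cliques per exponentially growing distance range, in both directions; the exact construction is given in Algorithm~\ref{Algorithm:Smallworld} in the Appendix.
\begin{verbatim}
import random

def smallworld_interclique_edges(n_cliques, ns=2, seed=0):
    """Sketch: small-world inter-clique edges over a ring of cliques.

    For each clique, pick up to ns partner cliques in every distance
    range [2^k, 2^(k+1)) on the ring, clockwise and counter-clockwise.
    Returns clique-level undirected edges as frozensets {a, b}.
    """
    rng = random.Random(seed)
    edges = set()
    for a in range(n_cliques):
        offset = 1
        while offset < n_cliques:
            upper = min(2 * offset, n_cliques)
            candidates = list(range(offset, upper))
            for d in rng.sample(candidates, min(ns, len(candidates))):
                edges.add(frozenset((a, (a + d) % n_cliques)))
                edges.add(frozenset((a, (a - d) % n_cliques)))
            offset *= 2
    return edges
\end{verbatim}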
Finally, we also fully connect cliques together, which bounds the average shortest path to $2$ between any pair of nodes. This adds $\frac{n}{c}(\frac{n}{c} - 1)$ edges, which scales quadratically in the number of nodes, in $O(\frac{n^2}{c^2})$, which can be significant at larger scales when $n$ is large compared to $c$.
Figure~\ref{fig:d-cliques-cifar10-convolutional} shows convergence speeds for all schemes, both on MNIST and CIFAR10, compared to a single IID node performing the same number of updates per epoch (showing the fastest convergence speed achievable if the topology had no impact). A ring converges but is much slower. Our ``fractal'' scheme helps significantly. But the sweet spot really seems to be the small-world topology: its convergence speed is almost the same as with fully-connected cliques, while using 22\% fewer edges at that scale (14.5 edges per node on average instead of 18.9), and it has the potential to bring larger benefits at larger scales. Nonetheless, even the fully-connected inter-clique topology offers significant benefits with 1000 nodes, as it represents a 98\% reduction in the number of edges compared to fully connecting individual nodes (18.9 edges per node on average instead of 999) and a 96\% reduction in the number of messages (37.8 messages per round per node on average instead of 999). 
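
The $18.9$ and $37.8$ figures above follow directly from the construction, assuming one inter-clique edge per pair of cliques shared evenly among the $c$ members of each clique:
\begin{verbatim}
def fc_cliques_edges_per_node(n=1000, c=10):
    """(c - 1) intra-clique edges plus (n/c - 1) clique-level edges
    shared among the c members of a clique."""
    return (c - 1) + (n // c - 1) / c

print(fc_cliques_edges_per_node())      # 18.9 edges per node on average
print(2 * fc_cliques_edges_per_node())  # 37.8 messages per round per node
\end{verbatim}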

\begin{figure}[htbp]
     \centering
 % To regenerate the figure, from directory results/mnist
 % python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-09:20:03-CET ../scaling/1000/mnist/fully-connected-cliques/all/2021-03-14-17:56:26-CET ../scaling/1000/mnist/smallworld-logn-cliques/all/2021-03-23-21:45:39-CET ../scaling/1000/mnist/fractal-cliques/all/2021-03-14-17:41:59-CET ../scaling/1000/mnist/clique-ring/all/2021-03-13-18:22:36-CET     --add-min-max --yaxis test-accuracy --legend 'lower right' --ymin 84 --ymax 92.5 --labels '1 node IID'  'd-cliques (fully-connected cliques)' 'd-cliques (smallworld)' 'd-cliques (fractal)' 'd-cliques (ring)'  --save-figure ../../figures/d-cliques-mnist-1000-nodes-comparison.png --font-size 13
     \begin{subfigure}[b]{0.48\textwidth}
         \centering
            \includegraphics[width=\textwidth]{figures/d-cliques-mnist-1000-nodes-comparison}
             \caption{\label{fig:d-cliques-mnist-1000-nodes-comparison} MNIST with Linear}
     \end{subfigure}
     \hfill
     % To regenerate the figure, from directory results/cifar10
% python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET ../scaling/1000/cifar10/fully-connected-cliques/all/2021-03-14-17:41:20-CET ../scaling/1000/cifar10/smallworld-logn-cliques/all/2021-03-23-22:13:57-CET  ../scaling/1000/cifar10/fractal-cliques/all/2021-03-14-17:42:46-CET ../scaling/1000/cifar10/clique-ring/all/2021-03-14-09:55:24-CET  --add-min-max --yaxis test-accuracy --labels '1-node IID' 'd-cliques (fully-connected cliques)' 'd-cliques (smallworld)' 'd-cliques (fractal)' 'd-cliques (ring)' --legend 'lower right' --save-figure ../../figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy.png --font-size 13
     \begin{subfigure}[b]{0.48\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-1000-vs-1-node-test-accuracy}
\caption{\label{fig:d-cliques-cifar10-1000-vs-1-node-test-accuracy}  CIFAR10 with LeNet}
     \end{subfigure}
\caption{\label{fig:d-cliques-cifar10-convolutional} D-Cliques Convergence Speed with 1000 nodes, non-IID, Constant Updates per Epoch, with Different Inter-Clique Topologies.}
\end{figure}
\section{Related Work}
\label{section:related-work}

\paragraph{Dealing with non-IID data in server-based FL.}
Dealing with non-IID data in server-based FL has
recently attracted a lot of interest. While non-IID data is not an issue if
clients send their parameters to the server after each gradient update,
problems arise when one seeks to reduce
the number of communication rounds by allowing each participant to perform
multiple local updates, as in the popular FedAvg algorithm 
\cite{mcmahan2016communication}. This led to the design of extensions
specifically aimed at mitigating the impact of non-IID data when performing
multiple local updates, using adaptive sampling \cite{quagmire}, update
corrections \cite{scaffold} or regularization in the local objective 
\cite{fedprox}. Another direction is to embrace the non-IID scenario by
learning personalized models for each client 
\cite{smith2017federated,perso_fl_mean,maml,moreau}.

\paragraph{Dealing with non-IID data in fully decentralized FL.}
Non-IID data is known to negatively impact the convergence speed
of fully decentralized FL algorithms in practice \cite{jelasity}. This
motivated the design of algorithms with modified updates based on variance
reduction \cite{tang18a}, momentum correction \cite{momentum_noniid},
cross-gradient
aggregation \cite{cross_gradient}, or multiple averaging steps
between updates (see \cite{consensus_distance} and references therein). These
algorithms
typically require additional communication and/or computation.\footnote{We
also observed that \cite{tang18a} is subject to numerical
instabilities when run on topologies other than rings. When
the rows and columns of $W$ do not exactly
sum to $1$ (due to finite precision), these small differences get amplified by
the proposed updates and make the algorithm diverge.}

% non-IID known to be a problem for fully decentralized FL. cf Jelasity paper
% D2 and other recent papers on modifying updates: Quasi-Global Momentum,
% Cross-Gradient Aggregation
% papers using multiple averaging steps
% also our personalized papers
% D2 \cite{tang18a}: numerically unstable when $W_{ij}$ rows and columns do not exactly
% sum to $1$, as the small differences are amplified in a positive feedback loop. More work is therefore required on the algorithm to make it usable with a wider variety of topologies. In comparison, D-cliques do not modify the SGD algorithm and instead simply removes some neighbor contributions that would otherwise bias the direction of the gradient. D-Cliques with D-PSGD are therefore as tolerant to ill-conditioned $W_{ij}$ matrices as regular D-PSGD in an IID setting.
In contrast, D-Cliques focuses on the design of a sparse topology which is
able to compensate for the effect of non-IID data. We do not modify the simple
and efficient D-SGD
algorithm \cite{lian2017d-psgd} beyond removing some neighbor
contributions
that would otherwise bias the direction of the gradient.

% An originality of our approach is to focus on the effect of topology
% level without significantly changing the original simple and efficient D-SGD
% algorithm \cite{lian2017d-psgd}. Other work to mitigate the effect of non-IID
% data on decentralized algorithms are based on performing modified updates (eg
% with variance reduction) or multiple averaging steps.
\paragraph{Impact of topology in fully decentralized FL.} It is well
known
that the choice of network topology can affect the
convergence of fully decentralized algorithms: this is typically accounted
for in the theoretical convergence rate by a dependence on the spectral gap of
the network, see for instance 
\cite{Duchi2012a,Colin2016a,lian2017d-psgd,Nedic18}.
However, for IID data, practice contradicts these classic
results: fully decentralized algorithms converge essentially as fast
on sparse topologies like rings or grids as they do on a fully connected
graph \cite{lian2017d-psgd,Lian2018}. Recent work 
\cite{neglia2020,consensus_distance} sheds light on this phenomenon with refined convergence analyses based on differences between gradients or parameters across nodes, which are typically
smaller in the IID case. However, these results do not give any clear insight
regarding the role of the topology in the non-IID case. We note that some work
has gone into designing efficient topologies to optimize the use of
network resources (see e.g., \cite{marfoq}), but this is done independently
of how data is distributed across nodes. In summary, the role
of topology in the non-IID data scenario is not well understood and we are not
aware of prior work focusing on this question. Our work shows that an
appropriate choice of data-dependent topology can effectively compensate for
non-IID data.

\section{Conclusion}
\label{section:conclusion}
We have proposed D-Cliques, a sparse topology that, in the presence of local class bias, recovers the convergence speed and non-IID-compensating behaviour of a fully-connected topology. D-Cliques are based on assembling cliques of diverse nodes such that their joint local distribution is representative of the global distribution, essentially recovering IID-ness locally. Cliques are joined in a sparse inter-clique topology so that they quickly converge to the same model. Within cliques, Clique Averaging can be used to remove the non-IID bias in gradient computation by averaging gradients only with the other nodes of the clique. Clique Averaging can in turn be used to implement unbiased momentum and recover the convergence speed usually only possible with IID mini-batches. We have shown that both the clustering of D-Cliques and the full connectivity within cliques are critical in obtaining these results.

Finally, we have evaluated different inter-clique topologies with 1000 nodes. While they all provide a significant reduction in the number of edges compared to fully connecting all nodes, a small-world approach that scales in $O(n + \log(n))$ with the number of nodes seems to be the most advantageous compromise between scalability and convergence speed. The D-Cliques approach therefore seems promising to reduce bandwidth usage on FL servers and to implement fully decentralized alternatives in a wider range of applications. For instance, the presence and relative frequency of global classes could be computed using PushSum~\cite{kempe2003gossip}, and neighbours could be selected with PeerSampling~\cite{jelasity2007gossip}. This is part of our future work.
%\section{Future Work}
%\begin{itemize}
%  \item Non-uniform Class Representation
%  \item End-to-End Wall-Clock Training Time, including Clique Formation
%  \item Comparison to Shuffling Data in a Data Center
%  \item Behaviour in the Presence of Churn
%  \item Relaxing Clique Connectivity: Randomly choose a subset of clique neighbours to compute average gradient.
%\end{itemize}
%\section{Credits}

%
% ---- Bibliography ----
%
% BibTeX users should specify bibliography style 'splncs04'.
% References will then be sorted and formatted in the correct style.
%
 \bibliographystyle{splncs04}
 \bibliography{main}
 \appendix
 \section{Algorithms}
 
 \begin{algorithm}[h]
   \caption{D-Clique Construction}
   \label{Algorithm:D-Clique-Construction}
   \begin{algorithmic}[1]
        \State \textbf{Require} set of classes globally present $L$, 
        \State~~ set of all nodes $N = \{ 1, 2, \dots, n \}$,
        \State~~ fn $\textit{classes}(S)$ that returns the classes present in a subset of nodes $S$,
        \State~~ fn $\textit{intraconnect}(DC)$ that returns edges intraconnecting cliques of $DC$,
        \State~~ fn $\textit{interconnect}(DC)$ that returns edges interconnecting cliques of $DC$ (Sec.~\ref{section:interclique-topologies})
         \State~~ fn $\textit{weights}(E)$ that assigns weights to edges in $E$ 
        \State $R \leftarrow \{ n~\text{for}~n \in N \}$ \Comment{Remaining nodes}
        \State $DC \leftarrow \emptyset$ \Comment{D-Cliques}
        \State $\textit{C} \leftarrow \emptyset$ \Comment{Current Clique}
        \While{$R \neq \emptyset$}
		\State $n \leftarrow \text{pick}~1~\text{from}~\{ m \in R~|~\textit{classes}(\{m\}) \not\subseteq \textit{classes}(\textit{C}) \}$
		\State $R \leftarrow R \setminus \{ n \}; C \leftarrow C \cup \{ n \}$
		\If{$\textit{classes}(C) = L$}
		    \State $DC \leftarrow DC \cup \{ C \}; C \leftarrow \emptyset$
		\EndIf
        \EndWhile
        \State \Return $weights(\textit{intraconnect}(DC) \cup \textit{interconnect}(DC))$
   \end{algorithmic}
\end{algorithm}
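
For illustration, the following Python sketch mirrors the greedy construction of Algorithm~\ref{Algorithm:D-Clique-Construction}. The input representation (a dictionary mapping each node to its locally present classes), the random tie-breaking among candidates, and the handling of leftover nodes are assumptions of the sketch; gathering class information in a decentralized way is outside its scope.

\begin{verbatim}
import random

def build_cliques(node_classes, global_classes):
    # node_classes: dict node_id -> set of locally present classes
    # global_classes: set L of all classes present globally
    remaining = set(node_classes)                 # R: nodes not yet assigned
    cliques, current, covered = [], set(), set()  # DC, C, classes(C)
    while remaining:
        # Candidates bring at least one class still missing from the clique.
        candidates = [m for m in remaining if node_classes[m] - covered]
        if not candidates:
            break  # leftover nodes cannot complete a representative clique
        node = random.choice(candidates)
        remaining.discard(node)
        current.add(node)
        covered |= node_classes[node]
        if covered >= global_classes:             # clique covers all classes
            cliques.append(current)
            current, covered = set(), set()
    return cliques

# Example: 100 nodes, 10 classes, one class per node.
# cliques = build_cliques({i: {i % 10} for i in range(100)}, set(range(10)))
\end{verbatim}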


\begin{algorithm}[h]
   \caption{$\textit{smallworld}(DC)$:  adds $O(\# N + \log(\# N))$ edges}
   \label{Algorithm:Smallworld}
   \begin{algorithmic}[1]
        \State \textbf{Require} Set of cliques $DC$ (set of set of nodes), size of neighbourhood $ns$ (default 2), function $\textit{least\_edges}(S, E)$ that returns one of the nodes in $S$ with the least number of edges in $E$
        \State $E \leftarrow \emptyset$ \Comment{Set of Edges}
        \State $L \leftarrow [ C~\text{for}~C \in DC ]$ \Comment{Arrange cliques in a list}
        \For{$i \in \{1,\dots,\#DC\}$} \Comment{For every clique}
          \State \Comment{For sets of cliques exponentially further away from $i$}
          \For{$\textit{offset} \in \{ 2^x~\text{for}~x~\in \{ 0, \dots, \lceil \log_2(\#DC) \rceil \} \}$} 
             \State \Comment{Pick the $ns$ closest}
             \For{$k \in \{0,\dots,ns-1\}$}
                 \State \Comment{Add interclique connections in both directions}
                 \State $n \leftarrow \textit{least\_edges}(L_i, E)$
                 \State $m \leftarrow \textit{least\_edges}(L_{(i+\textit{offset}+k) \% \#DC}, E)$ \Comment{clockwise in ring}
                 \State $E \leftarrow E \cup \{ (n,m), (m,n) \}$
                 \State $n \leftarrow \textit{least\_edges}(L_i, E)$
                 \State $m \leftarrow \textit{least\_edges}(L_{(i-\textit{offset}-k)\% \#DC} , E)$ \Comment{counter-clockwise in ring}
                 \State $E \leftarrow E \cup \{ (n,m), (m,n) \}$
             \EndFor
           \EndFor
        \EndFor
        \State \Return E
   \end{algorithmic}
\end{algorithm}
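
The small-world interconnection of Algorithm~\ref{Algorithm:Smallworld} can likewise be transcribed into a short Python sketch; the list-of-lists representation of cliques (in ring order) and the set-of-frozensets representation of undirected edges are assumptions made for the example.

\begin{verbatim}
import math

def smallworld_edges(cliques, ns=2):
    # cliques: list of lists of node ids, assumed arranged in ring order
    # ns: neighbourhood size (closest cliques picked per offset)
    edges = set()  # undirected edges stored as frozensets of two node ids

    def least_edges(clique):
        # Node of the clique with the fewest edges added so far.
        return min(clique, key=lambda v: sum(1 for e in edges if v in e))

    n_cliques = len(cliques)
    exponents = range(math.ceil(math.log2(n_cliques)) + 1)
    for i in range(n_cliques):
        # Offsets grow exponentially the further we look along the ring.
        for offset in (2 ** x for x in exponents):
            for k in range(ns):
                for j in ((i + offset + k) % n_cliques,   # clockwise
                          (i - offset - k) % n_cliques):  # counter-clockwise
                    if j != i:
                        edges.add(frozenset((least_edges(cliques[i]),
                                             least_edges(cliques[j]))))
    return edges
\end{verbatim}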
 
 \section{Other Experiments}
 
 % REMOVED: Constant Batch-size
%         % To regenerate the figure, from directory results/scaling
%% python ../../../learn-topology/tools/plot_convergence.py 10/mnist/fully-connected-cliques/all/2021-03-10-14:40:35-CET ../mnist/fully-connected-cliques/all/2021-03-10-10:19:44-CET 1000/mnist/fully-connected-cliques/all/2021-03-10-16:44:35-CET --labels '10 nodes bsz=128' '100 nodes bsz=128' '1000 nodes bsz=128 (45)' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-scaling-fully-connected-cst-bsz.png --ymin 80 --add-min-max
%         \begin{subfigure}[b]{0.48\textwidth}
%         \centering
%         \includegraphics[width=\textwidth]{figures/d-cliques-mnist-scaling-fully-connected-cst-bsz}
%         \caption{FCC: Constant Batch-Size}
%     \end{subfigure}

    \begin{figure}[htbp]
     \centering
     % To regenerate the figure, from directory results/mnist
     % python ../../../learn-topology/tools/plot_convergence.py clique-ring/all/2021-03-10-18:14:35-CET no-clique-avg/clique-ring/all/2021-03-12-10:40:37-CET no-init/clique-ring/all/2021-03-12-10:40:11-CET no-init-no-clique-avg/clique-ring/all/2021-03-12-10:41:03-CET --add-min-max --yaxis test-accuracy --labels '  'with uniform init., with clique avg.'    'with uniform init., without clique avg.'  'without uniform init., with clique avg.' 'without uniform init., without clique avg.' --legend 'lower right' --ymin 85 --ymax 92.5 --save-figure ../../figures/d-cliques-mnist-init-clique-avg-effect-ring-test-accuracy.png   
      \begin{subfigure}[b]{0.48\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-mnist-init-clique-avg-effect-ring-test-accuracy}
         \caption{\label{fig:d-cliques-mnist-init-clique-avg-effect-ring-test-accuracy} Ring}
     \end{subfigure}
     % To regenerate the figure, from directory results/mnist
     %python ../../../learn-topology/tools/plot_convergence.py fully-connected-cliques/all/2021-03-10-10:19:44-CET no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:26-CET no-init/fully-connected-cliques/all/2021-03-12-11:12:01-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-12-11:12:49-CET --add-min-max --yaxis test-accuracy --labels 'with uniform init., with clique avg.'    'with uniform init., without clique avg.'  'without uniform init., with clique avg.' 'without uniform init., without clique avg.' --legend 'lower right' --ymin 85 --ymax 92.5 --save-figure ../../figures/d-cliques-mnist-init-clique-avg-effect-fcc-test-accuracy.png
       \begin{subfigure}[b]{0.48\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-mnist-init-clique-avg-effect-fcc-test-accuracy}
         \caption{\label{fig:d-cliques-mnist-init-clique-avg-effect-fcc-test-accuracy} Fully-Connected}
     \end{subfigure}
\caption{\label{fig:d-cliques-mnist-init-clique-avg-effect} MNIST: Effects of Clique Averaging and Uniform Initialization on Convergence Speed. (100 nodes, non-IID, D-Cliques, bsz=128)}
\end{figure}

    \begin{figure}[htbp]
     \centering
     % To regenerate the figure, from directory results/cifar10
     % python ../../../learn-topology/tools/plot_convergence.py clique-ring/all/2021-03-10-11:58:43-CET no-init/clique-ring/all/2021-03-13-18:28:30-CET no-clique-avg/clique-ring/all/2021-03-13-18:27:09-CET  no-init-no-clique-avg/clique-ring/all/2021-03-13-18:29:58-CET --add-min-max --yaxis test-accuracy --labels 'with clique avg., with uniform init.' 'with clique avg., without uniform init.'  'without clique avg., with uniform init.'   'without clique avg., without uniform init.' --legend 'lower right' --ymax 75  --save-figure ../../figures/d-cliques-cifar10-init-clique-avg-effect-ring-test-accuracy.png  
      \begin{subfigure}[b]{0.48\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-init-clique-avg-effect-ring-test-accuracy}
         \caption{\label{fig:d-cliques-cifar10-init-clique-avg-effect-ring-test-accuracy} Ring}
     \end{subfigure}
     % To regenerate the figure, from directory results/cifar10
     %python ../../../learn-topology/tools/plot_convergence.py fully-connected-cliques/all/2021-03-10-13:58:57-CET no-init/fully-connected-cliques/all/2021-03-13-18:32:55-CET no-clique-avg/fully-connected-cliques/all/2021-03-13-18:31:36-CET  no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET --add-min-max --yaxis test-accuracy --labels 'with clique avg., with uniform init.' 'with clique avg., without uniform init.'  'without clique avg., with uniform init.'   'without clique avg., without uniform init.' --legend 'lower right'  --ymax 75 --save-figure ../../figures/d-cliques-cifar10-init-clique-avg-effect-fcc-test-accuracy.png 
       \begin{subfigure}[b]{0.48\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-init-clique-avg-effect-fcc-test-accuracy}
         \caption{\label{fig:d-cliques-cifar10-init-clique-avg-effect-fcc-test-accuracy} Fully-Connected}
     \end{subfigure}
\caption{\label{fig:d-cliques-cifar10-init-clique-avg-effect} CIFAR10: Effects of Clique Averaging and Uniform Initialization on Convergence Speed. (100 nodes, non-IID, D-Cliques, bsz=20)}
\end{figure}

\begin{figure}[htbp]
     \centering
% To regenerate the figure, from directory results/mnist
% python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-09:20:03-CET fully-connected/all/2021-03-10-09:25:19-CET clique-ring/all/2021-03-10-18:14:35-CET fully-connected-cliques/all/2021-03-10-10:19:44-CET --add-min-max --yaxis test-accuracy --labels '1-node IID bsz=12800' '100-nodes non-IID fully-connected bsz=128' '100-nodes non-IID D-Cliques (Ring) bsz=128' '100-nodes non-IID D-Cliques (Fully-Connected) bsz=128' --legend 'lower right' --ymin 85 --ymax 92.5 --save-figure ../../figures/d-cliques-mnist-vs-1-node-test-accuracy.png
         \centering
         \includegraphics[width=0.7\textwidth]{figures/d-cliques-mnist-vs-1-node-test-accuracy}
         \caption{\label{fig:d-cliques-mnist-linear-w-clique-averaging-w-initial-averaging} MNIST: D-Cliques Convergence Speed (100 nodes, Constant Updates per Epoch)}
\end{figure}    
     
 \begin{figure}[htbp]
     \centering
          % To regenerate the figure, from directory results/cifar10
% python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET clique-ring/all/2021-03-10-11:58:43-CET fully-connected-cliques/all/2021-03-10-13:58:57-CET --add-min-max --yaxis training-loss --labels '1-node IID bsz=2000' '100-nodes non-IID D-Cliques (Ring) bsz=20' '100-nodes non-IID D-Cliques (Fully-Connected) bsz=20' --legend 'lower right' --save-figure ../../figures/d-cliques-cifar10-vs-1-node-training-loss.png
     \begin{subfigure}[b]{0.48\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-vs-1-node-training-loss}
\caption{\label{fig:d-cliques-cifar10-training-loss} Training Loss}
     \end{subfigure}
     \hfill
     % To regenerate the figure, from directory results/cifar10
% python ../../../learn-topology/tools/plot_convergence.py 1-node-iid/all/2021-03-10-13:52:58-CET clique-ring/all/2021-03-10-11:58:43-CET fully-connected-cliques/all/2021-03-10-13:58:57-CET --add-min-max --yaxis test-accuracy --labels '1-node IID bsz=2000' '100-nodes non-IID D-Cliques (Ring) bsz=20' '100-nodes non-IID D-Cliques (Fully-Connected) bsz=20' --legend 'lower right' --save-figure ../../figures/d-cliques-cifar10-vs-1-node-test-accuracy.png
     \begin{subfigure}[b]{0.48\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-vs-1-node-test-accuracy}
\caption{\label{fig:d-cliques-cifar10-test-accuracy}  Test Accuracy}
     \end{subfigure}
\caption{\label{fig:d-cliques-cifar10-convolutional-extended} D-Cliques Convergence Speed with Convolutional Network on CIFAR10 (100 nodes, Constant Updates per Epoch).}
\end{figure}

\subsection{Scaling Behaviour as the Number of Nodes Increases}
          
     \begin{figure}[htbp]
         \centering     
              % To regenerate the figure, from directory results/scaling
% python ../../../learn-topology/tools/plot_convergence.py 10/mnist/fully-connected-cliques/all/2021-03-12-09:13:27-CET ../mnist/fully-connected-cliques/all/2021-03-10-10:19:44-CET 1000/mnist/fully-connected-cliques/all/2021-03-14-17:56:26-CET --labels '10 nodes bsz=1280' '100 nodes bsz=128' '1000 nodes bsz=13' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-scaling-fully-connected-cst-updates.png --ymin 80 --add-min-max

      \begin{subfigure}[b]{0.7\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-mnist-scaling-fully-connected-cst-updates}
         \caption{Fully-Connected (Cliques), $O(\frac{n^2}{c^2} + nc)$ edges}
     \end{subfigure}
     
          % To regenerate the figure, from directory results/scaling
% python ../../../learn-topology/tools/plot_convergence.py 10/mnist/clique-ring/all/2021-03-13-18:22:01-CET ../mnist/fully-connected-cliques/all/2021-03-10-10:19:44-CET 1000/mnist/fractal-cliques/all/2021-03-14-17:41:59-CET --labels '10 nodes bsz=1280' '100 nodes bsz=128' '1000 nodes bsz=13' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-scaling-fractal-cliques-cst-updates.png --ymin 80 --add-min-max
         \begin{subfigure}[b]{0.7\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-mnist-scaling-fractal-cliques-cst-updates}
         \caption{Fractal, $O(nc)$ edges}
     \end{subfigure}  

     
     % To regenerate the figure, from directory results/scaling
% python ../../../learn-topology/tools/plot_convergence.py 10/mnist/clique-ring/all/2021-03-13-18:22:01-CET ../mnist/clique-ring/all/2021-03-10-18:14:35-CET 1000/mnist/clique-ring/all/2021-03-13-18:22:36-CET --labels '10 nodes bsz=1280' '100 nodes bsz=128' '1000 nodes bsz=13' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-mnist-scaling-clique-ring-cst-updates.png --ymin 80 --add-min-max
         \begin{subfigure}[b]{0.7\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-mnist-scaling-clique-ring-cst-updates}
         \caption{Ring, $O(n)$ edges}
     \end{subfigure}  
     
     \caption{\label{fig:d-cliques-mnist-scaling-fully-connected} MNIST: D-Clique Scaling Behaviour, where $n$ is the number of nodes, and $c$ the size of a clique (Constant Updates per Epoch).}
     \end{figure}
     
          \begin{figure}[htbp]
         \centering
     
              % To regenerate the figure, from directory results/scaling
% python ../../../learn-topology/tools/plot_convergence.py ../cifar10/1-node-iid/all/2021-03-10-13:52:58-CET 10/cifar10/fully-connected-cliques/all/2021-03-13-19:06:02-CET ../cifar10/fully-connected-cliques/all/2021-03-10-13:58:57-CET 1000/cifar10/fully-connected-cliques/all/2021-03-14-17:41:20-CET --labels '1 node IID bsz=2000' '10 nodes non-IID bsz=200' '100 nodes non-IID bsz=20' '1000 nodes non-IID bsz=2' --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-scaling-fully-connected-cst-updates.png --add-min-max

      \begin{subfigure}[b]{0.7\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-scaling-fully-connected-cst-updates}
         \caption{Fully-Connected (Cliques), $O(\frac{n^2}{c^2} + nc)$ edges}
     \end{subfigure}
     
          % To regenerate the figure, from directory results/scaling
% python ../../../learn-topology/tools/plot_convergence.py  ../cifar10/1-node-iid/all/2021-03-10-13:52:58-CET 10/cifar10/fully-connected-cliques/all/2021-03-13-19:06:02-CET ../cifar10/fully-connected-cliques/all/2021-03-10-13:58:57-CET 1000/cifar10/fractal-cliques/all/2021-03-14-17:42:46-CET  --labels '1 node IID bsz=2000' '10 nodes non-IID bsz=200' '100 nodes non-IID bsz=20' '1000 nodes non-IID bsz=2'  --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-scaling-fractal-cliques-cst-updates.png --add-min-max
         \begin{subfigure}[b]{0.7\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-scaling-fractal-cliques-cst-updates}
         \caption{Fractal, $O(nc)$ edges}
     \end{subfigure}  

     
     % To regenerate the figure, from directory results/scaling
% python ../../../learn-topology/tools/plot_convergence.py  ../cifar10/1-node-iid/all/2021-03-10-13:52:58-CET 10/cifar10/fully-connected-cliques/all/2021-03-13-19:06:02-CET ../cifar10/clique-ring/all/2021-03-10-11:58:43-CET 1000/cifar10/clique-ring/all/2021-03-14-09:55:24-CET  --labels '1 node IID bsz=2000' '10 nodes non-IID bsz=200' '100 nodes non-IID bsz=20' '1000 nodes non-IID bsz=2'   --legend 'lower right' --yaxis test-accuracy --save-figure ../../figures/d-cliques-cifar10-scaling-clique-ring-cst-updates.png --add-min-max
         \begin{subfigure}[b]{0.7\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-scaling-clique-ring-cst-updates}
         \caption{Ring, $O(n)$ edges}
     \end{subfigure}  
     
     \caption{\label{fig:d-cliques-cifar10-scaling-fully-connected} CIFAR10: D-Clique Scaling Behaviour, where $n$ is the number of nodes, and $c$ the size of a clique (Constant Updates per Epoch).}
     \end{figure}
     
     \begin{figure}
\centering
              \begin{subfigure}[htb]{0.48\textwidth}
% To regenerate the figure, from directory results/mnist/gn-lenet
% python ../../../../learn-topology/tools/plot_convergence.py no-init/all/2021-03-22-21:39:54-CET no-init-no-clique-avg/all/2021-03-22-21:40:16-CET random-10/all/2021-03-22-21:41:06-CET random-10-diverse/all/2021-03-22-21:41:46-CET random-10-diverse-unbiased-grad/all/2021-03-22-21:42:04-CET --legend 'lower right' --add-min-max --labels 'd-clique (fcc) clique avg.' 'd-clique (fcc) no clique avg.' '10 random edges' '10 random edges (all classes repr.)' '10 random edges (all classes repr.) with unbiased grad.' --ymin 80 --yaxis test-accuracy --save-figure ../../../figures/d-cliques-mnist-lenet-comparison-to-non-clustered-topologies.png
         \includegraphics[width=\textwidth]{figures/d-cliques-mnist-lenet-comparison-to-non-clustered-topologies}
         \caption{\label{fig:d-cliques-mnist-lenet-comparison-to-non-clustered-topologies} LeNet Model}
        \end{subfigure}
        \hfill
                      \begin{subfigure}[htb]{0.48\textwidth}
% To regenerate the figure, from directory results/mnist/gn-lenet
% python ../../../../learn-topology/tools/plot_convergence.py no-init/all/2021-03-22-21:39:54-CET no-init-no-clique-avg/all/2021-03-22-21:40:16-CET random-10/all/2021-03-22-21:41:06-CET random-10-diverse/all/2021-03-22-21:41:46-CET random-10-diverse-unbiased-grad/all/2021-03-22-21:42:04-CET --legend 'upper right' --add-min-max --labels 'd-clique (fcc) clique avg.' 'd-clique (fcc) no clique avg.' '10 random edges' '10 random edges (all classes repr.)' '10 random edges (all classes repr.) with unbiased grad.' --ymax 0.7 --yaxis scattering --save-figure ../../../figures/d-cliques-mnist-lenet-comparison-to-non-clustered-topologies-scattering.png
         \includegraphics[width=\textwidth]{figures/d-cliques-mnist-lenet-comparison-to-non-clustered-topologies-scattering}
         \caption{\label{fig:d-cliques-mnist-lenet-comparison-to-non-clustered-topologies-scattering} LeNet Model (Scattering)}
        \end{subfigure}
       
         \caption{\label{fig:d-cliques-mnist-comparison-to-non-clustered-topologies} MNIST: Comparison to non-Clustered Topologies}
\end{figure}

 \begin{figure}
 \centering
     % To regenerate the figure, from directory results/cifar10
% python ../../../learn-topology/tools/plot_convergence.py fully-connected-cliques/all/2021-03-10-13:58:57-CET no-init-no-clique-avg/fully-connected-cliques/all/2021-03-13-18:34:35-CET  random-10/all/2021-03-17-20:30:03-CET  random-10-diverse/all/2021-03-17-20:30:41-CET random-10-diverse-unbiased-gradient/all/2021-03-17-20:31:14-CET random-10-diverse-unbiased-gradient-uniform-init/all/2021-03-17-20:31:41-CET --labels 'd-clique (fcc) clique avg., uniform init.' 'd-clique (fcc) no clique avg. no uniform init.' '10 random edges' '10 random edges (all classes repr.)' '10 random (all classes repr.) with unbiased grad.' '10 random (all classes repr.) with unbiased grad., uniform init.' --add-min-max --legend 'upper right' --yaxis scattering --save-figure ../../figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies-scattering.png --ymax 0.7
        \begin{subfigure}[b]{0.48\textwidth}
        \centering
         \includegraphics[width=\textwidth]{figures/d-cliques-cifar10-linear-comparison-to-non-clustered-topologies-scattering}
         \caption{\label{fig:d-cliques-cifar10-linear-comparison-to-non-clustered-topologies-scattering} LeNet Model: Scattering}
     \end{subfigure}  
         
\caption{\label{fig:d-cliques-cifar10-linear-comparison-to-non-clustered-topologies} CIFAR10: Comparison to non-Clustered Topologies}
\end{figure}