diff --git a/main.bib b/main.bib
index 52b6cbd1217fbadadd30ceb4e822b83b15957fa2..aa91cd3d7774a88826382a14f0972eb1140702fa 100644
--- a/main.bib
+++ b/main.bib
@@ -688,11 +688,10 @@
 pages={211-252}
 }
 @misc{mnistWebsite,
-title={{THE MNIST DATABASE of handwritten digits}},
+title={{The MNIST database of handwritten digits}},
 author={LeCun, Yann and Cortes, Corinna and Burges, Christopher J.C.},
 year={2020},
-howpublished={\url{http://yann.lecun.com/exdb/mnist/}},
-note={[online, accessed 2020-06-03]}
+howpublished={\url{http://yann.lecun.com/exdb/mnist/}}
 }
 
 @misc{shallue2018measuring,
diff --git a/main.tex b/main.tex
index df7d5772cb8a754ca70657860f53cd76bc2503e2..d0260416b1b2ea3ed20a83867e79455194e41f7b 100644
--- a/main.tex
+++ b/main.tex
@@ -54,7 +54,7 @@ with Topology}
 The convergence speed of machine learning models trained with Federated
 Learning is significantly affected by non-independent and identically
 distributed (non-IID) data partitions, even more so in a fully decentralized
-setting without a central server. In this paper, we show that the impact
+setting without a central server. In this paper, we show that the impact
 of \textit{local class bias} can be significantly reduced by carefully
 designing the underlying communication topology. We present D-Cliques, a
 novel topology that reduces gradient bias by grouping nodes in interconnected cliques such
@@ -110,9 +110,9 @@ network is organized according to a star topology: a central
 server orchestrates iteratively aggregating model updates received from the
 participants (\emph{clients}) and sending them back the aggregated model
 \cite{mcmahan2016communication}. In contrast,
-fully decentralized FL algorithms operate over an arbitrary graph topology
+fully decentralized FL algorithms operate over an arbitrary network topology
 where participants communicate only with their direct neighbors
-in the graph. A classic example of such algorithms is Decentralized
+in the network. A classic example of such algorithms is Decentralized
 SGD (D-SGD) \cite{lian2017d-psgd}, in which participants alternate
 between local SGD updates and model averaging with neighboring nodes.
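To make the one-sentence description of D-SGD above concrete, here is a minimal sketch of a single round. The uniform averaging over each node's closed neighbourhood, the helper name `dsgd_round`, and the fixed learning rate are illustrative assumptions, not the exact update rule of \cite{lian2017d-psgd}, which uses a general doubly-stochastic mixing matrix.

```python
import numpy as np

def dsgd_round(params, neighbors, grads, lr=0.1):
    """One illustrative D-SGD round (sketch, not the paper's exact rule).

    params: dict node -> parameter vector
    neighbors: dict node -> list of neighbor nodes
    grads: dict node -> stochastic gradient at that node's current parameters
    """
    # 1) Local SGD step on each node's own mini-batch gradient.
    updated = {v: params[v] - lr * grads[v] for v in params}
    # 2) Uniform averaging with direct neighbors, including the node itself.
    return {
        v: np.mean([updated[u] for u in [v] + list(neighbors[v])], axis=0)
        for v in params
    }

# Example: 3 nodes on a path 0-1-2, one scalar parameter per node.
params = {0: np.array([0.0]), 1: np.array([1.0]), 2: np.array([2.0])}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
grads = {v: np.array([1.0]) for v in params}
params = dsgd_round(params, neighbors, grads)
```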
@@ -305,7 +305,7 @@ from a single class.
 
 To isolate the effect of local class bias from other potentially compounding
 factors, we make the following simplifying assumptions: (1) All classes are
-equally represented in the global dataset; (2) All classes are represented on the same number of nodes; (3) All nodes have the same number of examples.
+equally represented in the global dataset; (2) All classes are represented on the same number of nodes; (3) All nodes have the same number of samples.
 
 We believe that these assumptions are reasonable in the context of our study
 because: (1)
@@ -391,8 +391,8 @@ mini-batch size, both approaches are equivalent.
 %ensure a single
 
 \section{D-Cliques: Creating Locally Representative Cliques}
 \label{section:d-cliques}
 
-In this section we present the design of D-Cliques. To give an intuition of our approach, let us consider the neighbourhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}, represented on Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood} where each color represent a class of data.
-The colors of a node, represented as a circle, correspond to the different classes it hosts locally. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has examples of all classes in equal proportions. In the non-IID setting (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a single class and nodes are distributed randomly in the grid. A single training step, from the point of view of the middle node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
+In this section, we present the design of D-Cliques. To give an intuition of our approach, let us consider the neighbourhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}, represented in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}, where each color represents a class of data.
+The colors of a node, represented as a circle, correspond to the different classes it hosts locally. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has samples of all classes in equal proportions. In the non-IID setting (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has samples of only a single class and nodes are distributed randomly in the grid. A single training step, from the point of view of the middle node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
 %For an intuition on the effect of local class bias, examine the neighbourhood of a single node in a grid similar to that of Figure~\ref{fig:grid-IID-vs-non-IID}. As illustrated in Figure~\ref{fig:grid-iid-vs-non-iid-neighbourhood}, the color of a node, represented as a circle, corresponds to a different class. In the IID setting (Figure~\ref{fig:grid-iid-neighbourhood}), each node has examples of all classes in equal proportions. In the non-IID setting (Figure~\ref{fig:grid-non-iid-neighbourhood}), each node has examples of only a single class and nodes are distributed randomly in the grid. A single training step, from the point of view of the middle node, is equivalent to sampling a mini-batch five times larger from the union of the local distributions of all illustrated nodes.
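The mini-batch equivalence invoked in the added sentence can be spelled out explicitly. The identity below is a sketch assuming the five nodes share the same model $\theta$, draw disjoint local mini-batches $B_1,\dots,B_5$ of equal size $m$, and average their gradients uniformly (notation introduced here for illustration only):
\[
\frac{1}{5}\sum_{i=1}^{5}\frac{1}{m}\sum_{x \in B_i}\nabla_\theta \ell(x;\theta)
= \frac{1}{5m}\sum_{x \in B_1 \cup \dots \cup B_5}\nabla_\theta \ell(x;\theta),
\]
i.e., the uniform average of the five local gradients equals the gradient of a single mini-batch of size $5m$ drawn from the union of the local distributions.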
@@ -413,10 +413,10 @@ The colors of a node, represented as a circle, correspond to the different class
 \label{fig:grid-iid-vs-non-iid-neighbourhood}
 \end{figure}
 
-In the IID case, since gradients are computed from examples of all classes, the resulting average gradient points in a direction that lowers the loss for all. However, in the non-IID case, not all classes are in the immediate neighbourhood. Therefore nodes diverge from one another according to the classes represented,% more than in the IID case.
-Moreover, as the distributed averaging algorithm takes several steps to converge, this variance persists between steps as the computed gradients are far from the global average.\footnote{It is possible, but impractical, to compensate with enough additional averaging steps.} This can significantly slow down convergence speed to the point of making parallel optimization impractical.
+In the IID case, since gradients are computed from samples of all classes, the resulting average gradient points in a direction that lowers the loss for all classes. However, in the non-IID case, not all classes are present in the immediate neighbourhood. Nodes therefore diverge from one another according to the classes represented. % more than in the IID case.
+In addition, as the distributed averaging algorithm takes several steps to converge, this variance persists between steps because the computed gradients remain far from the global average.\footnote{It is possible, but impractical, to compensate with enough additional averaging steps.} This can significantly slow down convergence, to the point of making parallel optimization impractical.
 
-In D-Cliques, we address the issues of non-iidness by carefully design the underlying network topology composed of \textit{cliques} and \textit{inter-clique connections}.
+In D-Cliques, we address the issue of non-IID data by carefully designing the underlying network topology, composed of \textit{cliques} and \textit{inter-clique connections}.
 \begin{itemize}
 \item D-Cliques recovers a balanced representation of classes, similar to that of the IID case, by modifying the topology such that each node is part of a \textit{clique} with neighbours representing all classes.
 \item To ensure all cliques converge, \textit{inter-clique connections} are introduced, established directly between nodes that are part of cliques.
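Since the itemized description above names the two structural ingredients of the topology, a small sketch of how they compose may help. The construction below (the helper `build_d_cliques`, the one-node-per-class grouping, and the ring of inter-clique edges) is a hypothetical illustration of the idea under the paper's simplifying assumptions, not the paper's construction algorithm.

```python
import itertools

def build_d_cliques(nodes_by_class):
    """Illustrative D-Cliques-style topology (sketch only).

    nodes_by_class: list of lists; nodes_by_class[c] holds the ids of the
    nodes whose local data belongs to class c. Assumes every class appears
    on the same number of nodes (assumption (2) in the text).
    """
    edges = set()
    # Group one node of each class into a clique, fully connected internally,
    # so every clique sees a balanced representation of all classes.
    cliques = [list(group) for group in zip(*nodes_by_class)]
    for clique in cliques:
        edges.update(itertools.combinations(sorted(clique), 2))
    # Inter-clique connections: here, a simple ring over the cliques, linking
    # one representative node of each clique to one node of the next clique.
    if len(cliques) > 1:
        for c1, c2 in zip(cliques, cliques[1:] + cliques[:1]):
            edges.add(tuple(sorted((c1[0], c2[0]))))
    return cliques, edges

# Example: 4 classes, 8 nodes (two per class) -> two cliques of size 4,
# joined by a single inter-clique edge.
cliques, edges = build_d_cliques([[0, 4], [1, 5], [2, 6], [3, 7]])
```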