# Information Bottleneck Classification in Extremely Distributed Systems


SIP—Stochastic Information Processing Group, Computer Science Department CUI, University of Geneva, Route de Drize 7, 1227 Carouge, Switzerland

Author to whom correspondence should be addressed.

Received: 28 August 2020 / Revised: 24 October 2020 / Accepted: 26 October 2020 / Published: 30 October 2020

(This article belongs to the Special Issue Information-Theoretic Methods for Deep Learning Based Data Acquisition, Analysis and Security)

We present a new decentralized classification system based on a distributed architecture. This system consists of distributed nodes, each possessing its own dataset and computing module, along with a centralized server, which provides probes for classification and aggregates the responses of the nodes for a final decision. Each node, with access to its own training dataset of a given class, is trained based on an auto-encoder system consisting of a fixed data-independent encoder, a pre-trained quantizer and a class-dependent decoder. Hence, these auto-encoders are highly dependent on the class probability distribution for which the reconstruction distortion is minimized. Conversely, when an encoding–quantizing–decoding node observes data from a different distribution, unseen at training, there is a mismatch, and such a decoding is not optimal, leading to a significant increase of the reconstruction distortion. The final classification is performed at the centralized classifier, which votes for the class with the minimum reconstruction distortion. In addition to the system's applicability to applications facing big-data communication problems and/or requiring private classification, the above distributed scheme creates a theoretical bridge to the information bottleneck principle. The proposed system demonstrates very promising performance on basic datasets such as MNIST and FashionMNIST.

Most classical machine learning architectures are based on a common classifier that typically requires centralizing all the training data in a common data center for training, as schematically shown in Figure 1a. However, such a centralized system faces several critical requirements related to data privacy and the need for big-data communications to collect the data of all classes at the central location. In practice, sensitive data such as medical and financial records or any personal data are usually kept private in multiple independent data centers and cannot be shared between third parties for various reasons. At the same time, a huge amount of newly acquired private data, which requires special care when fed to machine learning tools, is captured daily. From both privacy and practical points of view, it is not feasible to transfer all collected data to a centralized data center and to re-train the system on the new data. To face these challenges, the concept of "decentralized machine learning" has been proposed and developed in several works, where the data are stored locally on the devices and a common centralized model is trained. Without pretending to be exhaustive in our overview, we mention some of the most important contributions in the literature. "Parallelized SGD" was introduced in 2007 [1] and further extended in [2] to reduce the communication costs using compression or pruning algorithms. An alternative solution known as "Federated Averaging" was proposed in [3], with many subsequent attempts to improve its performance and communication cost, as in [4].

The term Federated Learning (FL) is used for a type of decentralized learning, where a global model is kept in a central node/device and many local nodes/devices have different amounts of samples from different classes. In FL, the local and/or global nodes share the gradients or model parameters during training by efficient techniques such as RingAllReduce [8] for gradient sharing, Federated Averaging [4] for averaging local model parameters on the central node, and Ensemble Learning [9] for averaging local predictions. When all devices have samples from all classes in equal amounts, the setup is commonly referred to as Independent Identically Distributed Federated Learning (IID-FL). However, in practice, it is often the case that different nodes/edges/devices have samples only from some classes, in different proportions. Such an unconstrained environment would almost always mean that not all edge devices have data from all the classes. This is commonly referred to as Non-Independent Identically Distributed Federated Learning (Non-IID-FL). It represents a real challenge for FL and leads to significant drops in classification accuracy. Recently, many works have proposed solutions to cope with this problem, such as mixing Federated Averaging with Ensemble Learning [10], incorporating recent communication and data-privacy amplification techniques [11], sharing small subsets of IID training data among the local nodes [12], adapting the communication frequencies of local nodes to the skewness [13], and efficiently defending communications between nodes [14]. In [13], the authors compare the performance of different classification architectures on IID and Non-IID data with different training tricks and demonstrate a significant drop in performance for Non-IID data. Therefore, to the best of our knowledge, a significant gap remains between the performance of centralized systems and Non-IID-FL systems.

In contrast to the centralized classification presented in Figure 1a, each class is assigned to one decentralized training node, which learns to optimally compress and decompress in-class data, as shown in Figure 1b. The setup analyzed in this paper is shown in Figure 2. We assume that the system consists of ${N}_{\mathbf{m}}$ local nodes and one centralized node. Each node has access to its own privacy-sensitive dataset. The entries of each local dataset are generated from a one-class distribution. The centralized node does not have any access to the local node datasets and cannot receive any information about the gradient updates typically considered in FL settings. The only information that can be exchanged between the local nodes and the centralized node is the probe, which is considered to be public, and the feedback of the local nodes in the form of scalar variables. Therefore, the communication between the local nodes and the centralized node is reduced to a minimum at the testing stage. At the training stage, we assume no communication between the centralized node and the local nodes. Additionally, the local nodes do not share any information between them. To the best of our knowledge, such a scenario has not been addressed in known FL systems.

The privacy protection model considered in this work has an asymmetric character. We only address the privacy protection of the owners' datasets, i.e., the training data. At the same time, the probe to be classified at the testing stage is not considered to be privacy sensitive. Therefore, we assume that it can be shared in plain form among different nodes. Although our model can also accommodate a privacy protection mechanism for the probe, this problem is out of the scope of this paper. We also assume that the nodes play a fair game and do not modify or tamper with their feedback to the centralized node. Therefore, the model under investigation assumes the following setting.

At the training stage, we assume an extreme case of a Non-IID system setup, where each node/device has access only to the samples of a single class/distribution. With this assumption, we aim to address many practical scenarios, where the nodes, representing institutions such as labs and research centers, companies, service providers, individuals or even countries, do not want or cannot share their data with each other for various reasons, including, for example, privacy concerns, competition issues, national security interests, etc., as well as technical constraints related to the transfer of big volumes of data via band-limited channels in a restricted amount of time. At the same time, the data owners represented by such nodes are interested in providing classification services to third parties based on the possessed data without revealing it explicitly to any third party. There are numerous examples of such scenarios, ranging from privacy-sensitive medical records to biological research, where particular institutions, specialized in the study of some disease or phenomenon, invest considerable amounts of time, effort and money to collect such data. In addition, one institution might possess data from a healthy population, while others hold data on specific diseases. Obviously, these institutions would be interested in sharing their data to reveal new discoveries but cannot proceed due to the above economic, privacy or competition reasons. One can also envision a scenario of personalized marketing, where each node represents a client or a company that has some unique experience or interest expressed by the data collected from its activity in a certain domain. The advertising party suggests a service or product to all clients by sending a probe, and if there is a match between the interest and the proposal, a deal is concluded. At the same time, it is obvious that the interests of each client are private.
The scenarios of astronomical or genetic research might also face big-data communication concerns, where a lot of data are collected and labeled at distributed locations, and transferring all these data to a central node might represent a great technical or economic challenge. Additionally, the situation might be complicated by the need for regular data updates. All these scenarios are exemplified by systems like the Square Kilometre Array (SKA) [22], where the data are planned to be collected on two continents at a rate of 1 Petabyte per day, with the envisioned daily transfer to a centralized location by airplane.

Therefore, in the considered setup we assume very restricted communications between the local and central nodes. Furthermore, we assume that no global model is stored in the central node and that the nodes have no communication with the central node in terms of sharing either samples (local class in- or outliers) or gradients and parameters, in open or obfuscated form.

At the testing stage we address the classification problem. We assume that the central node has a probe that is not private, and it can be openly communicated between the nodes. In this way, the privacy of the probe is not considered in our work.

During the classification, the local nodes only communicate the reconstruction error in the form of a scalar to the central node, thus allowing for efficient and fast training and classification even when the local nodes are distributed around the world. For instance, this can be the case for astronomy observation centers, which hold large quantities of data and face severe transmission restrictions during training and classification. Such a problem formulation is not directly addressed in the FL formulation and, to the best of our knowledge, there are no results reporting the performance of FL in this extreme Non-IID setup. We refer to this particular case of Non-IID data as the One Node–One Class (ON-OC) setup.

The considered classification setup differs conceptually from centralized classification systems in a significant way. Centralized classification is based on the notion of a decision boundary between the classes that should be learned by observing multiple training samples from all classes simultaneously, as shown in Figure 3a. The classification is based on a decision about which region of the space, partitioned by the decision boundaries, a probe $\mathbf{x}$ belongs to. In the fully distributed case, referred to as the ON-OC setup, where no gradient is shared between nodes, the proposed system learns the manifold of each data class independently, represented by colors in Figure 3b. This is achieved by class-dependent encoder–decoder systems that produce the minimum reconstruction error of $\mathbf{x}$ in the matched probe case.

We propose a theoretical justification and proof of concept for a fully decentralized classification model, where the classifier training procedure is not required to see all class data at the same time to achieve high-accuracy classification. More precisely, we assume that each class is assigned to one decentralized training node, which learns to optimally compress and decompress in-class data, such that the reconstruction error is minimized for in-class data, and the latent compressed representation learns the in-class data manifold instead of inter-class boundaries (Figure 3). At the same time, the presented framework can be extended to the more general case of multiple classes per node. In this case, the nodes can benefit from an a priori simpler semi-supervised training, or at least they can train as many models as the number of classes per node, given that they have enough data. Once the training is completed, the classification step presented in Figure 4 is as follows: the central node sends a sample $\mathbf{x}$, a probe, from the data distribution to be classified, to each of the local nodes. These local nodes are optimized to compress and decompress a single class, and only the reconstruction errors of the probe are transmitted from each node to the central node, which votes in favor of the class with the lowest error.
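The probe/response protocol described above can be sketched in a few lines of numpy. This is an illustrative skeleton, not the paper's implementation: `LocalNode` and `classify_probe` are hypothetical names, the decoder is passed in as a plain function, and the fixed feature extractor (ScatNet in the paper) is omitted.

```python
import numpy as np

class LocalNode:
    """One-class node: per-class codebook (quantizer) and per-class decoder."""
    def __init__(self, codebook, decoder):
        self.codebook = codebook  # K x d array of centroids trained on one class
        self.decoder = decoder    # maps a centroid back to input space

    def reconstruction_error(self, x):
        # quantize the probe to the closest centroid of this node's codebook
        z = self.codebook[np.argmin(np.linalg.norm(self.codebook - x, axis=1))]
        x_hat = self.decoder(z)
        return float(np.sum((x - x_hat) ** 2))  # only this scalar leaves the node

def classify_probe(x, nodes):
    """Central node: broadcast the probe, collect scalar errors, vote for the minimum."""
    errors = [node.reconstruction_error(x) for node in nodes]
    return int(np.argmin(errors))
```

Note that the private training data and all model parameters stay inside each `LocalNode`; the central node only ever sees the probe it sent and one scalar per node.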

In contrast to the Federated Learning-based classification considered in Section 1, in this section we will consider concepts related to the proposed framework.

The potential benefits of the considered architecture are as follows: (a) there is no need to transfer all of the data or gradients to a centralized location (large-scale applications); (b) data privacy is ensured by keeping data and model parameters local; (c) the reconstruction score is produced locally; and (d) it might eliminate a potential vulnerability to adversarial attacks by preventing the ability to learn a sensitive classification boundary. To fully benefit from these attractive features, we have to validate the performance of the proposed distributed classification architecture against the classical fully supervised architecture that has access to all training samples simultaneously for the optimal decision rule. This is the goal addressed in the current paper.

- We propose a fully distributed learning framework without any gradient communication to the centralized node as it is done in the distributed systems based on FL. As pointed out in [11,26] this resolves many common issues of FL related to the communication burden at the training stage and the need for gradient obfuscation for privacy reasons.
- We consider a new problem formulation of decentralized learning, where each node has access only to the samples of a single class. No communication between the nodes is assumed. We refer to this extreme case of Non-IID Federated Learning as the ON-OC setup.
- We propose a theoretical model behind the proposed decentralized system based on the information bottleneck principle and justify the role of lossy feature compression as an important part of the information bottleneck implementation for the considered ON-OC classification.
- In contrast to the centralized classification systems and distributed Federated Learning, which both mimic the learning of decision boundaries between classes based on the simultaneously available training samples from all classes, we propose a novel approach, which tries to learn the data manifolds of each individual class at the local nodes and make the decision based on the proximity of a probe to each data manifold at the centralized node.
- The manifold learning is also accomplished in a new way, using a system similar to an auto-encoder architecture [27] but keeping the encoder fixed for all classes. Thus, the only learnable parts of each node are the compressor and the decoder. This leads to reduced training complexity and flexibility in the design of compression strategies. Additionally, by choosing the encoder based on a geometrically invariant network, a.k.a. ScatNet [28], one can hope that the amount of training data needed to cope with the geometrical variability in the training data might be reduced, as suggested by the authors of [28].
- Finally, the proposed approach also differs from our previous framework [29] in the following ways:
- The framework in [29] was not based on the IB principle, while the current work explicitly extends the IB framework.
- The previous work [29] did not use compression in the latent space, while the current work uses an explicit compression in the form of vector quantization. The use of quantization is an important element of the IB framework in the considered ON-OC setup. In this work, the results of classification with properly selected compression are considerably improved with respect to the unquantized latent space case considered in our prior work [29].
- The framework in [29] was based on the concept of the Variational Auto-Encoder (VAE), which includes the training of both the encoder and decoder parts. This requires a sufficient amount of data to make the encoder invariant to different types of geometrical deviations. In contrast, the current work is based on the use of a geometrically invariant transform, in particular ScatNet, which is designed to be invariant to geometrical deviations. This allows us, first of all, to avoid training the encoder and, secondly, to train the system without a large amount of labeled data or the necessity to observe data from all classes.
- In the case of a VAE-based system, the latent space is difficult to interpret in terms of the selection of dimensions for quantization. When ScatNet is used as the encoder, the latent space is well interpretable, and its different sub-bands correspond to different frequencies. In this respect, it becomes evident which sub-bands should be preserved and which ones can be suppressed (depending on the problem to be solved).
- Finally, this new setup shows higher classification accuracy for the ON-OC setup.

The theory of the centralized classification model is based on the Information Bottleneck (IB) principle [6]. In a centralized classification model, the training samples are taken from the available labeled data of all ${N}_{\mathbf{m}}$ classes: ${\left\{{\mathbf{x}}_{i},{\mathbf{m}}_{i}\right\}}_{i=1}^{{N}_{\mathcal{D}}}\sim {p}_{\mathcal{D}}(\mathbf{m},\mathbf{x})$, where ${N}_{\mathcal{D}}$ corresponds to the number of training samples. This corresponds to the supervised version [7] of the IB with a variational approximation, where the model learns to minimize the mutual information ${I}_{\mathit{\varphi}}(\mathbf{X};\mathbf{Z})$ between the labeled data $\mathbf{X}$ and the latent representation $\mathbf{Z}$, while keeping the mutual information ${I}_{\mathit{\varphi},\mathit{\theta}}(\mathbf{Z};\mathbf{M})$ between the latent representation $\mathbf{Z}$ and the class label $\mathbf{M}$ larger than some value ${I}_{m}$. This amounts to a compression of $\mathbf{X}$ by means of a parametrized encoding ${q}_{\mathit{\varphi}}\left(\mathbf{z}\right|\mathbf{x})$ such that $\mathbf{Z}$ is a sufficient statistic for $\mathbf{M}$, allowing the training of a mapper ${p}_{\mathit{\theta}}\left(\mathbf{m}\right|\mathbf{z})$ to classify from $\mathbf{Z}$ to $\mathbf{M}$. Figure 1a describes the information transmission chain $\mathbf{M}\stackrel{p\left(\mathbf{x}\right|\mathbf{m})}{\to}\mathbf{X}\stackrel{{q}_{\mathit{\varphi}}\left(\mathbf{z}\right|\mathbf{x})}{\to}\mathbf{Z}\stackrel{{p}_{\mathit{\theta}}\left(\mathbf{m}\right|\mathbf{z})}{\to}\mathbf{M}$. The parameters $\mathit{\varphi}$ for compressing $\mathbf{X}$ into the latent representation $\mathbf{Z}$, and $\mathit{\theta}$ for classifying $\mathbf{Z}$ into $\mathbf{M}$, are jointly trained to optimize the Lagrangian of the supervised IB developed in [7] as:

$$\left(\widehat{\mathit{\varphi}},\widehat{\mathit{\theta}}\right)=\underset{\left(\mathit{\varphi},\mathit{\theta}\right)}{arg\; min}{\mathcal{L}}^{\mathrm{S}}(\mathit{\varphi},\mathit{\theta}),\phantom{\rule{1.em}{0ex}}\mathrm{with}\phantom{\rule{1.em}{0ex}}{\mathcal{L}}^{\mathrm{S}}(\mathit{\varphi},\mathit{\theta})={I}_{\mathit{\varphi}}(\mathbf{X};\mathbf{Z})-\beta {I}_{\mathit{\varphi},\mathit{\theta}}(\mathbf{Z};\mathbf{M}),$$

where $\mathrm{S}$ stands for the supervised setup and $\beta$ is a regularization parameter corresponding to ${I}_{m}$. Moreover, the mutual information between the input $\mathbf{Z}$ and the output $\mathbf{M}$ of the classification can be decomposed as:

$${I}_{\mathit{\varphi},\mathit{\theta}}(\mathbf{Z};\mathbf{M})=H\left(\mathbf{M}\right)-{H}_{\mathit{\varphi},\mathit{\theta}}\left(\mathbf{M}\right|\mathbf{Z}),$$

where $\mathbf{M}$ is a categorical variable whose realizations are one-hot-class encoded vectors $\mathbf{m}$ of dimension ${N}_{\mathbf{m}}$ corresponding to the number of classes. Assuming that all classes are equiprobable, the value of $H\left(\mathbf{M}\right)$ is determined as $H\left(\mathbf{M}\right)={log}_{2}\left({N}_{\mathbf{m}}\right)$ and is therefore not parametrized, which leads to:

$$\left(\widehat{\mathit{\varphi}},\widehat{\mathit{\theta}}\right)=\underset{\left(\mathit{\varphi},\mathit{\theta}\right)}{arg\; min}{H}_{\mathit{\varphi}}\left(\mathbf{Z}\right)-{H}_{\mathit{\varphi}}\left(\mathbf{Z}\right|\mathbf{X})+\beta {H}_{\mathit{\varphi},\mathit{\theta}}\left(\mathbf{M}\right|\mathbf{Z}),$$

where ${I}_{\mathit{\varphi}}(\mathbf{X};\mathbf{Z})={H}_{\mathit{\varphi}}\left(\mathbf{Z}\right)-{H}_{\mathit{\varphi}}\left(\mathbf{Z}\right|\mathbf{X})$. The common classification models therefore optimize these three terms simultaneously, and we have the following interpretations for Equation (3):

- A minimization of ${H}_{\mathit{\varphi}}\left(\mathbf{Z}\right)$ such that $\mathbf{Z}$ should contain as little information as possible about $\mathbf{X}$ for compression purposes; therefore one has to compress at the encoding $\mathbf{X}\stackrel{{\mathbf{q}}_{\mathit{\varphi}}\left(\mathbf{z}\right|\mathbf{x})}{\to}\mathbf{Z}$. In general, this compressing encoding is learned by optimizing $\mathit{\varphi}$. We simplified the learning process by using a deterministic compression map $\mathbf{Z}={Q}_{\mathit{\varphi}}\left({f}_{\mathit{\varphi}}\left(\mathbf{X}\right)\right)$, where ${f}_{\mathit{\varphi}}(\xb7)$ is a feature extractor and ${Q}_{\mathit{\varphi}}(\xb7)$ is a vector quantizer. Accordingly, the rate ${R}_{Q}={H}_{\mathit{\varphi}}\left(\mathbf{Z}\right)\le {log}_{2}K$ is determined by the number of centroids K in the considered vector quantizer, with equality, if and only if all centroids are equiprobable.
- A maximization of ${H}_{\mathit{\varphi}}\left(\mathbf{Z}\right|\mathbf{X})$, which under the deterministic encoding $\mathbf{Z}={Q}_{\mathit{\varphi}}\left({f}_{\mathit{\varphi}}\left(\mathbf{X}\right)\right)$ reduces to zero, and thus ${H}_{\mathit{\varphi}}\left(\mathbf{Z}\right|\mathbf{X})=0$ in Equation (3).
- A minimization of ${H}_{\mathit{\varphi},\mathit{\theta}}\left(\mathbf{M}\right|\mathbf{Z})$, which represents the cross-entropy between the distribution of the true labels $p\left(\mathbf{m}\right)$ and the estimated ones ${p}_{\mathit{\theta}}\left(\mathbf{m}\right|\mathbf{z})$:$${H}_{\mathit{\varphi},\mathit{\theta}}\left(\mathbf{M}\right|\mathbf{Z})=-{\mathbb{E}}_{p(\mathbf{x},\mathbf{m})}\left[{\mathbb{E}}_{{q}_{\mathit{\varphi}}\left(\mathbf{z}\right|\mathbf{x})}\left[{log}_{2}{p}_{\mathit{\theta}}\left(\mathbf{m}\right|\mathbf{z})\right]\right],$$

with $p(\mathbf{x},\mathbf{m})=p\left(\mathbf{m}\right)p\left(\mathbf{x}\right|\mathbf{m})$.

Finally, under the deterministic compressing encoding $\mathbf{Z}={Q}_{\mathit{\varphi}}\left({f}_{\mathit{\varphi}}\left(\mathbf{X}\right)\right)$, we can conclude that a low rate ${R}_{Q}$, achievable with a smaller K, corresponds to higher compression and increased distortion, and leads to the minimization of ${I}_{\mathit{\varphi}}(\mathbf{X};\mathbf{Z})={H}_{\mathit{\varphi}}\left(\mathbf{Z}\right)-{H}_{\mathit{\varphi}}\left(\mathbf{Z}\right|\mathbf{X})$ in Equation (1). At the same time, $\mathbf{Z}$ should contain enough information about $\mathbf{M}$, which is controlled by the term $\beta {H}_{\mathit{\varphi},\mathit{\theta}}\left(\mathbf{M}\right|\mathbf{Z})$ in Equation (3) and by the term $\beta {I}_{\mathit{\varphi},\mathit{\theta}}(\mathbf{Z};\mathbf{M})$ in Equation (1). Under a fixed rate ${R}_{Q}$, one trains the decoder ${p}_{\mathit{\theta}}\left(\mathbf{m}\right|\mathbf{z})$ that simultaneously represents a classifier:

$$\begin{array}{c}\hfill \widehat{\mathit{\theta}}=\underset{\mathit{\theta}}{arg\; min}{H}_{\mathit{\varphi},\mathit{\theta}}\left(\mathbf{M}\right|\mathbf{Z}),\phantom{\rule{1.em}{0ex}}\mathrm{where}\phantom{\rule{1.em}{0ex}}\mathbf{Z}={Q}_{\mathit{\varphi}}\left({f}_{\mathit{\varphi}}\left(\mathbf{X}\right)\right),\\ \hfill \mathrm{then}\phantom{\rule{1.em}{0ex}}\widehat{\mathit{\theta}}=\underset{\mathit{\theta}}{arg\; max}{\mathbb{E}}_{p(\mathbf{x},\mathbf{m})}\left[{log}_{2}{p}_{\mathit{\theta}}\left(\mathbf{m}\right|{Q}_{\mathit{\varphi}}\left({f}_{\mathit{\varphi}}\left(\mathbf{x}\right)\right))\right].\end{array}$$

This setup represents many classical state-of-the-art centralized fully supervised classifiers trained based on the maximum likelihood in Equation (5).

In the general case, in contrast to the centralized systems considered above, the proposed decentralized classification is based on the ${N}_{\mathbf{m}}$ nodes, each representing an unsupervised system, and the centralized node that distributes the probes for classification, and collects ${N}_{\mathbf{m}}$ scores for the final decision. Therefore, given a training set ${\left\{{\mathbf{x}}_{i}\right\}}_{i=1}^{{N}_{{\mathcal{D}}_{m}}}$ for each class $m\in \left\{1,\cdots ,{N}_{\mathbf{m}}\right\}$ generated from $\mathbf{x}\sim {p}_{{\mathcal{D}}_{m}}\left(\mathbf{x}\right)$ as shown in Figure 1b, each decentralized unsupervised system includes an encoder ${E}_{{\mathit{\varphi}}_{m}}(\xb7)={Q}_{{\mathit{\varphi}}_{m}}\left(f(\xb7)\right)$, decomposed in a deterministic data-independent feature extraction $f(\xb7)$ followed by a trainable compression ${Q}_{{\mathit{\varphi}}_{m}}(\xb7)$ and a parametrized decoder ${D}_{{\mathit{\theta}}_{m}}$.

The training of the unsupervised nodes is based on the unsupervised IB considered in [7] (see Figure 1b):

$$\left({\widehat{\mathit{\varphi}}}_{m},{\widehat{\mathit{\theta}}}_{m}\right)=\underset{\left({\mathit{\varphi}}_{m},{\mathit{\theta}}_{m}\right)}{arg\; min}{\mathcal{L}}^{\mathrm{U}}({\mathit{\varphi}}_{m},{\mathit{\theta}}_{m}),\phantom{\rule{1.em}{0ex}}\mathrm{with}\phantom{\rule{1.em}{0ex}}{\mathcal{L}}^{\mathrm{U}}({\mathit{\varphi}}_{m},{\mathit{\theta}}_{m})={I}_{{\mathit{\varphi}}_{m}}(\mathbf{X};\mathbf{Z})-\beta {I}_{{\mathit{\varphi}}_{m},{\mathit{\theta}}_{m}}(\mathbf{Z};\mathbf{X}),$$

where $\mathrm{U}$ stands for the unsupervised setup and, similarly to the supervised counterpart:

$$\begin{array}{c}\hfill {I}_{{\mathit{\varphi}}_{m}}(\mathbf{X};\mathbf{Z})={H}_{{\mathit{\varphi}}_{m}}\left(\mathbf{Z}\right)-{H}_{{\mathit{\varphi}}_{m}}\left(\mathbf{Z}\right|\mathbf{X})={H}_{{\mathit{\varphi}}_{m}}\left(\mathbf{Z}\right)={log}_{2}\left(K\right),\\ \hfill \mathrm{and}\phantom{\rule{1.em}{0ex}}{I}_{{\mathit{\varphi}}_{m},{\mathit{\theta}}_{m}}(\mathbf{Z};\mathbf{X})={H}_{{\mathit{\varphi}}_{m},{\mathit{\theta}}_{m}}\left(\mathbf{X}\right)-{H}_{{\mathit{\varphi}}_{m},{\mathit{\theta}}_{m}}\left(\mathbf{X}\right|\mathbf{Z}).\end{array}$$

In this work, we will assume that ${H}_{{\mathit{\varphi}}_{m},{\mathit{\theta}}_{m}}\left(\mathbf{X}\right)={H}_{\mathcal{D}}\left(\mathbf{X}\right)$ is independent of the encoding-decoding parameters and represents the entropy of the training dataset, and:

$${H}_{{\mathit{\varphi}}_{m},{\mathit{\theta}}_{m}}\left(\mathbf{X}\right|\mathbf{Z})=-{\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{m}}\left(\mathbf{z}\right|\mathbf{x})}\left[{log}_{2}{p}_{{\mathit{\theta}}_{m}}\left(\mathbf{x}\right|\mathbf{z})\right]\right],$$

represents the conditional entropy that is determined by the decoder ${p}_{{\mathit{\theta}}_{m}}\left(\mathbf{x}\right|\mathbf{z})$. Assuming that ${p}_{{\mathit{\theta}}_{m}}\left(\mathbf{x}\right|\mathbf{z})\propto {e}^{-d(\mathbf{x},{D}_{{\mathit{\theta}}_{m}}\left(\mathbf{z}\right))}$, one can interpret ${log}_{2}{p}_{{\mathit{\theta}}_{m}}\left(\mathbf{x}\right|\mathbf{z})\propto -d(\mathbf{x},{D}_{{\mathit{\theta}}_{m}}\left(\mathbf{z}\right))$, where $d(\mathbf{x},{D}_{{\mathit{\theta}}_{m}}\left(\mathbf{z}\right))$ denotes the distortion function between $\mathbf{x}$ and its reconstructed counterpart $\widehat{\mathbf{x}}={D}_{{\mathit{\theta}}_{m}}\left(\mathbf{z}\right)$. Accordingly, for the considered non-stochastic encoding $\mathbf{Z}={Q}_{{\mathit{\varphi}}_{m}}\left(f\left(\mathbf{X}\right)\right)$, Equation (8) reduces, up to an additive constant, to ${H}_{{\mathit{\varphi}}_{m},{\mathit{\theta}}_{m}}\left(\mathbf{X}\right|\mathbf{Z})={\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\left[d(\mathbf{x},{D}_{{\mathit{\theta}}_{m}}\left({Q}_{{\mathit{\varphi}}_{m}}\left(f\left(\mathbf{x}\right)\right)\right))\right]$ and:
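The rate term ${H}_{{\mathit{\varphi}}_{m}}\left(\mathbf{Z}\right)={log}_{2}\left(K\right)$ above holds with equality only when all centroids are used equiprobably. As a small illustrative check (the helper name `index_entropy_bits` is ours, not the paper's), the empirical entropy of the quantizer index usage can be computed and compared with ${log}_{2}K$:

```python
import numpy as np

def index_entropy_bits(assignments, K):
    """Empirical entropy (in bits) of quantizer index usage; at most log2(K)."""
    p = np.bincount(assignments, minlength=K) / len(assignments)
    p = p[p > 0]                      # 0 * log 0 = 0 by convention
    return float(-(p * np.log2(p)).sum())
```

Uniform usage of $K=4$ centroids yields 2 bits, while a degenerate quantizer that always outputs one centroid yields 0 bits, i.e., an effective rate ${R}_{Q}$ below ${log}_{2}K$.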

$${\widehat{\mathit{\theta}}}_{m}=\underset{{\mathit{\theta}}_{m}}{arg\; min}{\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\left[d(\mathbf{x},{D}_{{\mathit{\theta}}_{m}}\left({Q}_{{\mathit{\varphi}}_{m}}\left(f\left(\mathbf{x}\right)\right)\right))\right].$$

The encoder in the considered setup consists of a data-independent transform $f(\cdot)$ and a trainable quantizer ${Q}_{{\mathit{\varphi}}_{m}}(\cdot)$. There are several ways to implement such a quantizer. In this paper, we consider a vector quantizer based on a codebook ${\mathcal{Q}}_{\mathbf{m}}$. In practice, the centroids of this codebook are learned using the K-means algorithm, and the quantization procedure consists of searching for the closest centroid in the codebook for each entry, as explained in Figure 5. Each class is represented by its own ${K}_{m}$ centroids.
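This codebook-learning and quantization step can be sketched with a minimal numpy-only Lloyd's K-means (function names are illustrative; in the actual system the inputs would be ScatNet feature vectors rather than raw samples):

```python
import numpy as np

def fit_codebook(features, K, n_iter=50, seed=0):
    """Learn K centroids on one class's feature vectors (rows of `features`)."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=K, replace=False)]
    for _ in range(n_iter):
        # assign each feature vector to its closest centroid
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned vectors
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = features[labels == k].mean(axis=0)
    return centroids

def quantize(x, centroids):
    """Q(x): replace x by the closest centroid of the class codebook."""
    return centroids[np.argmin(np.linalg.norm(centroids - x, axis=1))]
```

Since each class trains its own codebook on in-class data only, this step requires no communication between nodes.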

Therefore, given a training set ${\left\{{\mathbf{x}}_{i}\right\}}_{i=1}^{{N}_{{\mathcal{D}}_{m}}}$ for each class $m\in \left\{1,\cdots ,{N}_{\mathbf{m}}\right\}$ generated from $\mathbf{x}\sim {p}_{{\mathcal{D}}_{m}}\left(\mathbf{x}\right)$ as shown in Figure 1b, each node trains its own encoder–decoder pair $\left({E}_{{\mathit{\varphi}}_{m}},{D}_{{\mathit{\theta}}_{m}}\right)$, i.e., the compressor ${Q}_{{\mathit{\varphi}}_{m}}(\cdot)$ and the parameters of the decoder ${\mathit{\theta}}_{m}$, according to:

$${\mathcal{L}}^{\mathrm{U}}({\mathit{\varphi}}_{m},{\mathit{\theta}}_{m})={\log}_{2}{K}_{m}+\beta \sum _{i=1}^{{N}_{{\mathcal{D}}_{m}}}d\left({\mathbf{x}}_{i},{D}_{{\mathit{\theta}}_{m}}\left({E}_{{\mathit{\varphi}}_{m}}\left({\mathbf{x}}_{i}\right)\right)\right),$$

where ${K}_{m}$ is the number of centroids for class m. The total number of centroids over all classes is bounded, which corresponds to a constraint on the total allowable rate. One can easily notice that the first term represents the rate of the latent space and the second one the reconstruction distortion. Therefore, in this formulation, the unsupervised IB reduces to the rate–distortion formulation [30] averaged over all classes/nodes. This also explains the role of the rate–distortion function shown in Figure 4. For our experiments, the compression ratio is not learned, and the structure of the compressing encoder ${E}_{\mathit{\varphi}}(\cdot)={Q}_{\mathit{\varphi}}\left(f(\cdot)\right)$ allows us to fix this ratio to meet the requirements considered below. In the case of a fixed number of centroids per class, as considered in this paper, the term ${\log}_{2}{K}_{m}$ in (10) can be skipped.
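As a numerical illustration of the objective in Equation (10), the sketch below (a hypothetical helper, not the paper's code) evaluates the unsupervised IB loss for one node from its codebook size, the trade-off weight $\beta$, and a list of per-sample reconstruction distortions:

```python
import math

def unsupervised_ib_loss(n_centroids, beta, distortions):
    """Equation (10): rate term log2(K_m) for the codebook size, plus a
    beta-weighted sum of per-sample reconstruction distortions."""
    return math.log2(n_centroids) + beta * sum(distortions)
```

With a fixed number of centroids per class, the rate term is a constant, which is why it can be dropped from the training loss, as noted above.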

Once trained, the ${N}_{\mathbf{m}}$ nodes return the distortions ${e}_{m}=d(\mathbf{x},{D}_{{\mathit{\theta}}_{m}}\left({Q}_{{\mathit{\varphi}}_{m}}\left(f\left(\mathbf{x}\right)\right)\right)),\phantom{\rule{4pt}{0ex}}m=1,\cdots ,{N}_{\mathbf{m}}$ for each probe $\mathbf{x}$. The centralized node receives all distortions ${\left\{{e}_{m}\right\}}_{m=1}^{{N}_{\mathbf{m}}}$ and picks the minimum as the classification result:

$$\widehat{m}=\underset{1\le m\le {N}_{\mathbf{m}}}{arg\; min}{e}_{m}.$$
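The central decision rule of Equation (11) reduces to an argmin over the reported errors. The sketch below, with hypothetical helper names, simulates nodes as reconstruction callables and lets the central node vote for the class with the minimum distortion:

```python
import numpy as np

def central_classify(x, nodes, d):
    """Each node m returns its reconstruction x_hat_m of the probe x; the
    central node votes for the class with minimum distortion d(x_hat_m, x)."""
    errors = [d(node(x), x) for node in nodes]
    return int(np.argmin(errors)), errors
```

Only the scalar errors (or reconstructions) travel to the central node, which is what makes the scheme attractive for the big-data and privacy settings mentioned earlier.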

The detailed architecture of our model for a local node is sketched in Figure 5. The chosen encoding strategies for the scattering feature extractor f and the compression ${Q}_{{\mathit{\varphi}}_{m}}$ are detailed in Section 5.1.

Our architecture resembles the principles of multi-class classification with independent encoding of each class. Our approach is also linked to the information theory of digital communications. In classical Shannon communication theory, the asymptotic equipartition property ensures that the capacity of a communication system is asymptotically achieved using ${N}_{\mathbf{m}}$ independent binary classifiers, known as jointly typical decoders, assigned to each message to be communicated [30] (Chapter 3). As in these frameworks, we want to build ${N}_{\mathbf{m}}$ distributed classifiers ensuring unified classification, i.e., decoding. However, instead of using ${N}_{\mathbf{m}}$ binary classifiers, we use a compression framework, which assumes that each class has its own compressor–decompressor pair that is optimal in terms of reconstruction error. If the probe comes from the corresponding class, its compression–decompression distortion is minimal for a chosen rate ${R}_{Q}$; otherwise, the distortion is considerably larger. The compressor–decompressor pair corresponds to an encoder–quantizer–decoder setup, where the latent space vector is quantized to a certain number of bits, or rate ${R}_{Q}={\log}_{2}\left(K\right)$. We investigate an extreme case, where the encoder has the same architecture for all classes, consisting of a data-independent feature extractor and a quantizer that is optimized for each class. The encoder is based on the recently proposed deterministic, geometrically invariant scattering transform, a.k.a. ScatNet [28]. The decoders, on the other hand, are trained independently for each class to ensure the best class reconstruction accuracy.

We use the IB formulation as the theoretical basis for the fully distributed system. At the same time, the mechanism of exact information compression in the IB is not fully understood, and there are various interpretations of how a deep network compresses information by keeping only the most important information in the latent representation for the targeted classification task. The original work [6] suggests that stochastic gradient descent places noise in the irrelevant dimensions of the latent space at the second stage of training. Other authors [31] interpret the IB compression as clustering, where several inputs are clustered together if they contain the same information according to the assigned class labels. In contrast, VAEs [32] and Adversarial Auto-Encoders (AAEs) [20] try to produce a latent space that follows some pre-defined distribution, where the IB compression can be controlled by a proper choice of the dimension of the latent space, by adding noise to some dimensions, or by shaping the latent distribution through an introduced prior.

In this work, we proceed with the hypothesis that the IB is achieved by direct compression of certain dimensions of the latent representation, even when the dimension of the latent space is larger than that of the input. At the same time, the selection of the dimensions, or groups of dimensions referred to as channels, to be compressed is based on an analysis of features common to the classes. In the considered formulation of distributed classification, the nodes cannot learn which dimensions are common between the classes; this lack of knowledge is compensated by the known properties of the scattering transform [33], obtained with ScatNet [28]. The low-frequency channels of ScatNet represent low-resolution data that are strongly correlated across all classes. Therefore, their lossy representation corresponds to the selective compression suggested by the IB principle.

In this paper, we proceed with local compressing encoders ${E}_{{\mathit{\varphi}}_{m}}(\cdot)$ consisting of a deterministic feature extractor $f(\cdot)$ followed by a learnable compressor ${Q}_{{\mathit{\varphi}}_{m}}(\cdot)$: ${E}_{{\mathit{\varphi}}_{m}}(\cdot)={Q}_{{\mathit{\varphi}}_{m}}\left(f(\cdot)\right)$. The compressing encoding minimizes ${I}_{{\mathit{\varphi}}_{m}}(\mathbf{X};\mathbf{Z})$ for classification purposes. In our setup, the feature extractor $f(\cdot)={S}^{J}(\cdot)$ is fixed to be the scattering transform of depth J for all classes, as defined in [28]. There are several reasons for this choice: (i) the scattering transform is known to preserve the energy in the Fourier domain [34], and is highly sparse and invariant to some geometrical transformations [33], i.e., it produces the same latent space representation $f\left(\mathbf{X}\right)$ for small variability in $\mathbf{X}$; (ii) in turn, it needs fewer training examples to ensure invariance to geometrical transformations, as shown in [28], where the authors show that a ScatNet of depth 2 with a simple linear SVM can achieve better classification accuracy for smaller amounts of training samples; (iii) the invariance and sparsity of the latent representations also help train the decoder better, owing to smaller variability and a simpler (sparse) manifold; and (iv) the invariant and deterministic scattering feature extraction brings interpretability to the latent representation, helping choose the compression strategy for unseen classes. The latter is very important for the considered distributed setup, where no information about the classes is shared between the nodes.

In the following, we consider the implementation details of the fixed, class-independent scattering transform $f(\cdot)$ and the learnable quantizer ${Q}_{{\mathit{\varphi}}_{m}}(\cdot)$.

The feature extractor used to encode $\mathbf{X}$ is a deep scattering convolutional network, defined in [28], of depth J equal to 2 or 3: $f(\cdot)={S}^{J}(\cdot)$. We recall from Section 4 that the role of the feature extractor $f(\cdot)$ is to provide an exhaustive and qualitative description of $\mathbf{X}$, such that the subsequent compression can select only the components strictly relevant to classification towards $\mathbf{M}$. This role fits the scattering transform ${S}^{J}(\cdot)$ perfectly, as it can produce more or fewer features of $\mathbf{X}$ on demand, according to its depth J. If some data need very fine features for the compression to separate the classes, a deeper decomposition ${S}^{J}(\cdot)$ is required. Table 1 presents the number of features extracted by ${S}^{J}(\cdot)$ according to the depth J, and the way in which these features are obtained.
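The channel counts in Table 1 follow the standard combinatorics of an order-2 scattering transform with J scales and L orientations: one order-0 low-pass channel, $J\,L$ order-1 paths, and $L^2\,J(J-1)/2$ order-2 paths with $j_1 < j_2$. A small sketch, assuming the common choice $L=8$ (which reproduces the 81 channels quoted later for $J=2$):

```python
def scatnet_channels(J, L=8):
    """Number of channels of an order-2 scattering transform with J scales
    and L orientations: 1 low-pass + J*L order-1 + L^2 * J*(J-1)/2 order-2."""
    return 1 + J * L + L * L * J * (J - 1) // 2
```

For $J=2$ this gives $1 + 16 + 64 = 81$ channels, each of spatial size $H/2^{J}\times W/2^{J}$.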

The scattering extraction defined in [28] uses a wavelet [34] basis ${\psi}_{j}^{\alpha}\left(u\right)={2}^{-2j}\psi \left({r}_{-\alpha}{2}^{-j}u\right)$, where $\psi $ is the Morlet mother wavelet, $1\le j\le J$ is the scale and ${r}_{-\alpha}$ is the rotation by $-\alpha $, with $\frac{\alpha}{2\pi}\in \mathbb{Z}/L\mathbb{Z}$ the finite group of L elements. It also uses the absolute value as the activation function applied after the convolutions with the wavelets, and a local averaging ${\Phi}_{J}$ over a spatial window of scale ${2}^{J}$. Each feature channel is of size $H/{2}^{J}\times W/{2}^{J}$, where $H\times W$ is the original image size. Table 1 shows the dimension of the scattering representation according to J and the initial size of a realization $\mathbf{x}$ of the random variable $\mathbf{X}$.

The interpretation of the scattering feature space helps us choose the compression strategy for our experiments. As described in Table 1, for $J=2$ the size of ${S}^{J}\left(\mathbf{x}\right)$ is $H/4\times W/4\times {N}_{{S}^{J}}$ (in the format $Height\times Width\times Channel$) when $\mathbf{x}$ is a grayscale input image of size $H\times W$; when $\mathbf{x}$ is a color input image of size $H\times W\times 3$, the size of ${S}^{J}\left(\mathbf{x}\right)$ is $H/4\times W/4\times 3{N}_{{S}^{J}}$. Each channel of depth $\delta \le J$ of the scattering transform ${S}^{J}\left(\mathbf{x}\right)$ corresponds to a fixed parameter path ${\alpha}_{1},\cdots ,{\alpha}_{\delta}$ and ${j}_{1}<\cdots <{j}_{\delta}$ applied to the input image. The channels are ordered by increasing depth $\delta \le J$ and by the parameters ${\left\{{\alpha}_{d},{j}_{d}\right\}}_{d=1}^{\delta}$ of their corresponding path; therefore, the first channel ${S}_{0}^{J}$ is simply a blurred version of $\mathbf{x}$. For better visualization and understanding, we give examples of the 81 channels obtained by the scattering transformation of two MNIST samples in Figure 6, and more examples are shown in Figure A1.

As shown in Figure 4, because the local encoding–decoding node was trained on the distribution of a class m, its rate-distortion curve (RDC) will be sub-optimal for the distribution of another class ${m}^{\prime}$, lying above the RDC for the distribution of class m, provided the distributions of these classes do not overlap in the considered space. Consequently, we target a rate ${R}_{Q}$ for the local node encoder–decoder at which the RDC of the dedicated class distribution is well separated from the RDCs of the other class distributions. For the sake of simplicity and interpretability, we selected the same compression strategy and the same rate for all nodes.

The compression strategy is hybrid: (i) we want to quantize the channels with the lowest entropy, i.e., the channels that produce the same output for in-class samples, and (ii) keep the channels with the lowest inter-class mutual information. The interpretation of the scattering transform channels given in Section 5.1.1 leads us to: (i) quantize the first channel ${S}_{0}^{J}$, (ii) keep as is the channels of index larger than a given ${i}^{\star}$, and (iii) suppress all channels ${S}_{2}^{J},\cdots ,{S}_{{i}^{\star}-1}^{J}$. In the local node for class m, the encoding ${\mathbf{z}}_{m}\left(\mathbf{x}\right)={E}_{{\mathit{\varphi}}_{m}}\left(\mathbf{x}\right)={Q}_{{\mathit{\varphi}}_{m}}\left({S}^{J}\left(\mathbf{x}\right)\right)$ of a given sample $\mathbf{x}$ is defined by:

$${\mathbf{z}}_{m}^{\left(1\right)}\left(\mathbf{x}\right)=\underset{\mathbf{q}\in {\mathcal{Q}}_{\mathbf{m}}}{arg\; min}{\langle {S}_{0}^{J}\left(\mathbf{x}\right),\mathbf{q}\rangle}_{CS},\phantom{\rule{2.em}{0ex}}{\mathbf{z}}_{m}^{(2,\cdots ,{N}_{\mathbf{z}})}\left(\mathbf{x}\right)={S}_{{i}^{\star},\cdots ,{N}_{{S}^{J}}}^{J}\left(\mathbf{x}\right),$$

where ${N}_{\mathbf{z}}={N}_{{S}^{J}}+2-{i}^{\star}$ is the number of channels of $\mathbf{z}$, ${\langle \cdot,\cdot\rangle}_{CS}$ is the cosine similarity and ${\mathcal{Q}}_{\mathbf{m}}$ is the codebook of centroids of the given class m used for the vector quantization of ${S}_{0}^{J}\left(\mathbf{x}\right)$. In our experiments, it is composed of the centroids $\mathbf{q}\in {\mathcal{Q}}_{\mathbf{m}}$ of a K-means pre-trained on ${\left\{{S}_{0}^{J}\left(\mathbf{x}\right)\right\}}_{\mathbf{x}\in {\mathcal{D}}_{m}}$, i.e., the first scattering channel of the training samples of the local data class m. The quantized and kept channels are highlighted with violet frames for some MNIST samples in Figure 6 and Figure A1.
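The three compression steps can be sketched as follows. This is an illustrative implementation, not the paper's code: channels are 0-indexed, the codebook holds flattened low-pass channels, and the closest centroid is chosen by minimizing the cosine distance $1-\langle\cdot,\cdot\rangle_{CS}$ (our reading of the argmin over the cosine measure in the equation above):

```python
import numpy as np

def bottleneck_compress(scat, codebook, i_star):
    """Compress a scattering tensor `scat` of shape (N_S, h, w): channel 0 is
    vector-quantized against the class codebook, channels 1..i_star-1 are
    suppressed, and channels from i_star onward are kept as is."""
    s0 = scat[0].ravel()
    # cosine distance = 1 - cosine similarity; argmin picks the closest centroid
    sims = codebook @ s0 / (np.linalg.norm(codebook, axis=1) * np.linalg.norm(s0) + 1e-12)
    q = codebook[int(np.argmin(1.0 - sims))]
    return np.concatenate([q.reshape(1, *scat.shape[1:]), scat[i_star:]], axis=0)
```

With 81 channels and ${i}^{\star}=80$, only the quantized low-pass channel and the last channel survive, matching the strongest compression setting used in the experiments.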

The architecture of the decoder ${D}_{{\mathit{\theta}}_{m}}$ ($1\le m\le {N}_{\mathbf{m}}$) of class m is presented in Table 2. As suggested by Equation (10), its parameters ${\mathit{\theta}}_{m}$ are trained and optimized only on the dedicated class dataset ${\mathcal{D}}_{m}$ by minimizing the distortion between the original and locally reconstructed samples. The distortion measure used in this experiment is the ${\ell}_{1}$ loss, which has been shown to generate finer images than the ${\ell}_{2}$ loss [35]. One could go further and train it jointly with an adversarial loss, as in [7], but the simple ${\ell}_{1}$ loss produces satisfactory results in our experiments at lower complexity. We use the Adam optimizer [36] to find:

$$\begin{array}{c}\hfill {\widehat{\mathit{\theta}}}_{m}=\underset{{\mathit{\theta}}_{m}}{arg\; min}{\mathbb{E}}_{{p}_{{\mathcal{D}}_{m}}\left(\mathbf{x}\right)}\left[\parallel {\widehat{\mathbf{x}}}_{m}{-\mathbf{x}\parallel}_{1}\right],\phantom{\rule{1.em}{0ex}}\mathrm{with}\phantom{\rule{4.pt}{0ex}}{\widehat{\mathbf{x}}}_{m}={D}_{{\mathit{\theta}}_{m}}\left({E}_{{\mathit{\varphi}}_{m}}\left(\mathbf{x}\right)\right),\end{array}$$

where ${\mathit{\varphi}}_{m}$ corresponds to the encoding parameters considered in Section 5.1, namely the codebook ${\mathcal{Q}}_{\mathbf{m}}$, the index ${i}^{\star}$ and the scattering depth J.

It is important to point out that different classes may have manifolds of different complexity. To balance the local encoder–decoder pairs, we assumed that the reconstruction error should be approximately the same for all nodes. Given the different complexity of the data manifolds of the various classes, this can be achieved either by optimizing the structure of the encoder–decoder pairs or by adapting the number of epochs per node. In this work, we proceeded with the latter: we kept the structure of the encoder–decoder fixed for all classes and simply adapted the number of epochs to ensure approximately the same reconstruction error at the training stage.

Given a probe $\mathbf{x}$ coming from the testing dataset ${\mathcal{D}}_{\mathbf{test}}$, we pass it through the ${N}_{\mathbf{m}}$ class-dependent local node encoder–decoders and communicate the ${N}_{\mathbf{m}}$ reconstruction errors $\left({e}_{1},\cdots ,{e}_{{N}_{\mathbf{m}}}\right)$ to the central node. As shown in Figure 4 and Equation (11), the probe $\mathbf{x}$ is classified according to the minimum reconstruction error: $\widehat{m}={arg\; min}_{1\le m\le {N}_{\mathbf{m}}}{e}_{m},\phantom{\rule{4.pt}{0ex}}\mathrm{where}\phantom{\rule{4.pt}{0ex}}{e}_{m}=d({\widehat{\mathbf{x}}}_{m},\mathbf{x})$. The spatial differences between the probe and its reconstructions contributing to these errors are shown in the third and sixth rows of Figure 7 to exemplify the underlying process. We also tested classification losses other than the ${\ell}_{1}$ training loss. The distortion measures considered in this paper for the reconstruction errors ${\left\{{e}_{m}\right\}}_{m=1}^{{N}_{\mathbf{m}}}$ are:

- the Manhattan distance ${d}_{{\ell}_{1}}$,
- the perceptual distance ${d}_{VGG}$ defined in [37],
- the pseudo-distance ${d}_{t}$, which counts the number of pixels with an absolute error larger than a threshold t:$$\begin{array}{c}\hfill {d}_{t}(\widehat{\mathbf{x}},\mathbf{x})=\sum _{i=1}^{{N}_{\mathbf{x}}}{\mathbb{1}}_{\left|\widehat{\mathbf{x}}\left[i\right]-\mathbf{x}\left[i\right]\right|\ge t},\phantom{\rule{4pt}{0ex}}\mathrm{where}\phantom{\rule{4.pt}{0ex}}{\mathbb{1}}_{\left|\widehat{\mathbf{x}}\left[i\right]-\mathbf{x}\left[i\right]\right|\ge t}=\left\{\begin{array}{ll}1,&\mathrm{if}\phantom{\rule{4.pt}{0ex}}\left|\widehat{\mathbf{x}}\left[i\right]-\mathbf{x}\left[i\right]\right|\ge t,\\ 0,&\mathrm{otherwise}.\end{array}\right.\end{array}$$

For too small or too large thresholds t, the pseudo-distance ${d}_{t}(\cdot,\cdot)$ fails to capture the reconstruction errors. For instance, for any images ${\mathbf{x}}_{1}$ and ${\mathbf{x}}_{2}$ of the same size ${N}_{{\mathbf{x}}_{1}}$ with pixel values in $[0,1]$, ${d}_{0}({\mathbf{x}}_{1},{\mathbf{x}}_{2})={N}_{{\mathbf{x}}_{1}}={N}_{{\mathbf{x}}_{2}}$ and ${d}_{2}({\mathbf{x}}_{1},{\mathbf{x}}_{2})=0$. For this reason, Section 6.1 presents the classification results obtained with ${d}_{t}$ for six intermediate thresholds: $0.2,\phantom{\rule{3.33333pt}{0ex}}\cdots \phantom{\rule{3.33333pt}{0ex}},0.7$.
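The pseudo-distance ${d}_{t}$ is a one-liner on arrays; a sketch (illustrative helper name) together with the two degenerate cases discussed above:

```python
import numpy as np

def d_t(x_hat, x, t):
    """Pseudo-distance: number of pixels whose absolute reconstruction
    error |x_hat[i] - x[i]| is at least the threshold t."""
    return int(np.count_nonzero(np.abs(x_hat - x) >= t))
```

For pixel values in $[0,1]$, `d_t(x1, x2, 0.0)` always returns the full pixel count and `d_t(x1, x2, 2.0)` always returns zero, which is why only intermediate thresholds are informative.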

The experimental validation is performed with untrained encoding and controlled compression. To investigate the importance of compression at the encoding step, we considered the encoder presented in Figure 1 and Figure 4, consisting of the feature extractor $f(\cdot)$ followed by a controlled compression ${Q}_{{\mathit{\varphi}}_{m}}(\cdot)$ of these features. The decoders ${D}_{{\mathit{\theta}}_{m}}(\cdot)$ are trained for each class on the corresponding nodes. Our untrained feature extraction $f(\cdot)$ is performed by the scattering transform defined in [33], which provides invariance to geometrical transformations [28] and, as a sparse representation, facilitates the learning of the subsequent neural networks [5]. This experiment validates the theoretical approach on datasets that are simple for centralized classification yet challenging for decentralized classification.

MNIST [15] and FashionMNIST [38] are fairly simple tasks for common centralized supervised deep classifiers, but this is not the case for decentralized models, where it is challenging to learn fine-tuned decision boundaries when only restricted gradients are communicated to the central node. As mentioned in the previous sections, the aim of the experiment is to present results that practically confirm the theory discussed in Section 4. The compression parameters were fixed beforehand to simplify the learning and depend on the dataset used.

Our results in terms of classification error are provided (1) for MNIST in Table 3, for which we achieve state-of-the-art results with exactly zero error on the testing dataset, and (2) for FashionMNIST in Table 4, for which we obtain results competitive with centralized classifiers. The encoding parameters J (scattering depth), K (number of centroids) and ${i}^{\star}$ (first kept channel in compression) differ between the two datasets because they have different statistics and Fourier spectra, reflected in the scattering transform. Our results considerably outperform Federated Averaging in the Non-IID-FL setup, and the perfect MNIST classification is not accidental: in Table 3, the error on the training dataset is $0\%$ for half of the classifying metrics, and the ${d}_{.4}$ classifying loss alone gives $0\%$ error on the testing dataset over 3 cross-validation sessions. To compare with FL, we report the Federated Averaging results from [13], where it is shown that moving from IID-FL to Non-IID-FL causes a drop in classification accuracy of 3% to 74%, depending on the model, the data used and the distribution of the data across the local nodes.

We use the following parameters: $J=2$, $K=5$, ${i}^{\star}=81$ (only the last subband is kept), a batch size of 128 and a learning rate of ${10}^{-5}$. This implies that the scattering features are downsampled by a factor of ${2}^{J}=4$ relative to the original MNIST image size of $28\times 28$. The compression of the scattering representation goes from an $81\times 7\times 7$ tensor (channels first) to a $2\times 7\times 7$ tensor, with the first channel quantized by a codebook of $K=5$ centroids. The compression rate of the feature vector is $80{\log}_{2}\left(5\right):1$. The training of the 10 class-dedicated encoder–decoders described in Section 5.2 is performed with the Adam optimizer [36]. For each local node, the dedicated training dataset ${\mathcal{D}}_{m}$ is sampled in its entirety at each epoch, and the 10 local training losses are shown in Figure 7c: the training is very stable and converges. The decoder is fully convolutional and described in Table 2 with $J=2$: the input of size $2\times 7\times 7$ (channels first) is followed by a sequence of 7 layers alternating 4 convolutions and 3 batch normalizations [45], with $ReLU$ activations [46] and a $tanh$ activation for the output layer.

We used $J=2$, $K=5$ and ${i}^{\star}=18$: only the paths of depth 2 are kept (for more details see Table 1); the batch size is 128 and the learning rate is ${10}^{-5}$. If we keep the same ${i}^{\star}=81$ as for MNIST, the reconstructions have too large distortions: this compression does not retain enough information for an adequate reconstruction. With ${i}^{\star}=18$, the compression of the scattering representation goes from an $81\times 7\times 7$ tensor (channels first) to a $65\times 7\times 7$ tensor, with the first channel quantized by a codebook of $K=5$ centroids. The compression rate of the feature vector is $\frac{5{\log}_{2}\left(5\right)}{4}:1$. Under these settings, the training of the 10 independent class-dedicated encoder–decoders converges with the same behavior as in Figure 7c. Nevertheless, the classification accuracy shown in Table 4 is lower than for MNIST. This is due to the fact that, when the compression rate is too small, the class-dedicated encoder–decoders are less separable, as shown in Figure 4. Moreover, when playing with the rate ${R}_{Q}$ and increasing the length of the 10 local quantizing codebooks ${\mathcal{Q}}_{\mathbf{m}}$ from $K=5$ to $K=15$, the classification accuracy drops from $89.9$ to $82.81$, confirming the rate–distortion interpretation. The decoder is fully convolutional and described in Table 2 with $J=2$: the input of size $65\times 7\times 7$ (channels first) is followed by a sequence of 7 layers alternating 4 convolutions and 3 batch normalizations [45], with $ReLU$ activations [46] and a $tanh$ activation for the output layer.

To investigate and experimentally justify the assumptions behind the bottleneck compression described in Section 5.1.2, we describe the steps of compression in Figure 8 and show the corresponding representations of the data manifolds at these different steps in Figure 9: the over-complete, sparse and geometrically invariant scattering representations shown in (b) already give higher separability than the raw data in (a). The subband selection (c) and quantization (d) proposed in Section 5.1.2 further increase the separability between the classes. We highlight that the tSNE shown in (d) assumes an ideal quantization, where the depth-0 scattering channel is assigned to the image taken from the dictionary of the corresponding label; in reality, the quantization is done on each class node with its local dictionary.

An ideal case for the ON–OC–IBC would be to have ideal anomaly detectors or one-class classifiers at each node. This prompts us to investigate the one-class separating power of each local node. We show this experimentally with tSNEs in Figure 10 for node 9 and in Figure A3, Figure A4 and Figure A5 for the other nodes. After the compression in the bottleneck, the inliers and outliers tend to separate, but into different subgroups, whereas after the reconstruction, the manifold of inliers appears as a single nested set, separated from the outliers. Finally, the reconstruction error followed by the non-linearity ${d}_{.4}$ applied to each difference plays an important role in the final classification at the central node, as evidenced by the improved separability of the inlier and outlier manifolds in the tSNE representations.

Figure 11a for the node of label 7 and Figure A4 for all other nodes give experimental proof of the rate–distortion concept for classification in a fully decentralized system, presented in Figure 4. Given the features produced by the ScatNet, the compression can be controlled by two parameters to obtain the best separability between the inlier and outlier rate-distortion curves: (1) the number of channels kept from the scattering transform, chosen via the parameter ${i}^{\star}$ defined in Section 5.1.2 (when $J=2$, the scattering transform has 81 channels; consequently, when ${i}^{\star}=80$, only two channels are kept, the first and the last one); and (2) K, the number of elements in the codebook used for the quantization of the first channel.

The link between the rate-distortion separability in the local nodes and the classification accuracy at the central node is confirmed by the rate-distortion curves of Figure 11b, where the highest classification accuracy is achieved when ${i}^{\star}=80$, which corresponds to a local quantization of scattering channel 0 by the dictionaries of Figure A2, stacked with the last channel of index 80, and a suppression of all intermediate scattering channels. With less compression, i.e., when ${i}^{\star}$ is smaller, the inliers and outliers are less separable in the rate-distortion curves. It should be pointed out that we interpret the compression rate as the number of selected channels. We did not investigate which of the 80 sub-bands are the most discriminative, due to the high complexity of such a search, and simply controlled the number of sub-bands indexed in descending order. Obviously, these parameters can be optimized to further increase the classification accuracy.

Table 5 summarizes, for the MNIST dataset, how the classification error at the central classifying node is impacted by the size K of the codebook used for the quantization of the first scattering channel. We fixed ${i}^{\star}=80$ for this experiment and used the classifying metric ${d}_{.4}$. This experiment shows that:

- $K=5$ achieves the smallest classification error at the central node,
- near $K=5$ the behavior is smooth, and $K=5$ remains optimal in terms of classification,
- $K=1$ leads to overfitting, as the table shows a drop in performance between the training and testing datasets,
- for $K>5$, the table shows a drop in performance due to the non-separability of the distortion rates between nodes.

It is important to note that, apart from $K\in \{4,5,6\}$, no search for the best hyperparameters was performed. This is an important factor, as for $K=5$ the central classification starts to perform very well, given that the same reconstruction accuracy is achieved for all nodes with enough epochs. However, if the nodes are trained to different reconstruction errors, the whole system becomes imbalanced, leading to erroneous classification at the central node. We assumed that a good way to measure the quality of learning of one node is its training loss curve over time, shown in Figure 7c: we fixed each node to stop learning after its training loss reaches $0.065$, with a maximum variation of $10\%$ during 10 epochs. The first criterion (thresholding) ensures that all nodes have similar distortion measures, and the second ensures that each node has learned its own class manifold reasonably well. We also imposed a maximum number of epochs for practical reasons; for the results given in Table 5, apart from $K\in \{4,5,6\}$, it is this last criterion that stopped the trainings.
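The per-node stopping rule described above can be sketched as a function of the loss history. This is an illustrative sketch: the threshold $0.065$, the $10\%$ variation and the 10-epoch window come from the text, while the epoch budget (`max_epochs=300`) is a hypothetical placeholder, since the paper does not state its value:

```python
def should_stop(losses, threshold=0.065, window=10, max_var=0.10, max_epochs=300):
    """Stop once the loss has dropped below `threshold` and has varied by at
    most `max_var` (relative) over the last `window` epochs, or when the
    `max_epochs` budget is exhausted (placeholder value)."""
    if len(losses) >= max_epochs:
        return True
    if len(losses) < window or losses[-1] > threshold:
        return False
    recent = losses[-window:]
    return (max(recent) - min(recent)) / max(recent) <= max_var
```

Running the same rule on every node is what keeps the reported distortions comparable across nodes at test time.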

It is also important to note that we use the ${\ell}_{1}$ norm to estimate the reconstruction error at training, whereas for recognition/testing we use the ${d}_{.4}$ metric. This mismatch is a potential source of the observed performance, but due to the non-differentiability of these metrics, we do not include them in the training loss.

The hyperparameter search, including K, ${i}^{\star}$ and the stopping criteria, remains an open question that we would like to address in future studies. We also plan to make the quantization rate learnable, but this is beyond the scope of this paper.

The competitive results presented in Table 3 and Table 4 constitute a proof of concept for our fully decentralized model. We want to emphasize that it is constructed from the interplay between information bottleneck principles and recent attempts to make machine learning architectures simpler and more interpretable (see [28,33,47]).

Conceptualization, D.U. and S.V.; Formal analysis, T.H.; Investigation, D.U.; Methodology, S.R., O.T. and T.H.; Project administration, D.U.; Resources, S.V.; Software, D.U.; Supervision, S.V.; Validation, D.U., S.R. and O.T.; Visualization, D.U.; Writing—original draft, D.U. and S.V.; Writing—review and editing, D.U., S.R., O.T., T.H., B.P. and S.V. All authors have read and agreed to the published version of the manuscript.

DU was funded by the Swiss National Science Foundation SNSF, NRP75 ’Big Data’ project No. 407540_167158 and OT and SR by SNSF project No. 200021_182063.

The authors declare no conflict of interest.

AE, Auto-Encoder; AAE, Adversarial Auto-Encoder; BMCNN + HC, Branching and Merging Convolutional Network with Homogeneous Filter Capsules; CNN(s), Convolutional Neural Network(s); CNN++, CNN with Batch Normalization and Residual Skip Connections; ELBO, Evidence Lower BOund; EnsNet, Ensemble learning in CNN augmented with fully connected sub-networks; FedAvg, Federated Averaging; FL, Federated Learning; GAN, Generative Adversarial Network; IB, Information Bottleneck; (Non-)IID, (Non-)independent identically distributed; IT, Number of iterations of tSNE; LR, Learning rate of tSNE; LSTM, Long-short term memory; MNIST, Mixed National Institute of Standards and Technology; MSE, Mean Squared Error; NN(s), Neural Network(s); ON–OC–IBC, One Node–One Class–Information bottleneck classification; P, Perplexity of tSNE; RMDL, Random Multimodal Deep Learning for Classification; RDC, Rate-Distortion Curve; ScatNet, Scattering Network; SKA, Square Kilometer Array; VAE, Variational Auto-Encoder. For all our diagrams, the colored trapezes represent NNs, and CNNs. The letters E, D and C are reserved for encoders, decoders and classifiers, respectively. The parameters of the networks are labeled with Greek indices. Columns represent vectors, with circles and squares being used for numerical and binary values, respectively.

- Delalleau, O.; Bengio, Y. Parallel Stochastic Gradient Descent; CIAR Summer School: Toronto, ON, Canada, 2007. [Google Scholar]
- Tian, L.; Jayaraman, B.; Gu, Q.; Evans, D. Aggregating Private Sparse Learning Models Using Multi-Party Computation. In Proceedings of the Private Multi-Party Machine Learning (NIPS 2016 Workshop), Barcelona, Spain, 8 December 2016. [Google Scholar]
- McMahan, H.B.; Moore, E.; Ramage, D.; y Arcas, B.A. Federated Learning of Deep Networks using Model Averaging. arXiv **2016**, arXiv:1602.05629. [Google Scholar]
- McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, Fort Lauderdale, FL, USA, 20–22 April 2017. [Google Scholar]
- Oyallon, E.; Belilovsky, E.; Zagoruyko, S. Scaling the Scattering Transform: Deep Hybrid Networks. arXiv **2017**, arXiv:1703.08961. [Google Scholar]
- Tishby, N.; Zaslavsky, N. Deep Learning and the Information Bottleneck Principle. In Proceedings of the 2015 IEEE Information Theory Workshop (ITW), Jerusalem, Israel, 26 April–1 May 2015. [Google Scholar]
- Voloshynovskiy, S.; Kondah, M.; Rezaeifar, S.; Taran, O.; Holotyak, T.; Rezende, D.J. Information bottleneck through variational glasses. In Proceedings of the Bayesian Deep Learning (NeurIPS 2019 Workshop), Vancouver, BC, Canada, 13 December 2019. [Google Scholar]
- Gibiansky, A. Bringing HPC Techniques to Deep Learning. Available online: https://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/ (accessed on 28 October 2020).
- You, S.; Xu, C.; Xu, C.; Tao, D. Learning from multiple teacher networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 1285–1294. [Google Scholar]
- Lin, T.; Kong, L.; Stich, S.U.; Jaggi, M. Ensemble Distillation for Robust Model Fusion in Federated Learning. arXiv **2020**, arXiv:2006.07242. [Google Scholar]
- Asad, M.; Moustafa, A.; Ito, T.; Aslam, M. Evaluating the Communication Efficiency in Federated Learning Algorithms. arXiv **2020**, arXiv:2004.02738. [Google Scholar]
- Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; Chandra, V. Federated Learning with Non-IID Data. arXiv **2018**, arXiv:1806.00582. [Google Scholar]
- Hsieh, K.; Phanishayee, A.; Mutlu, O.; Gibbons, P.B. The Non-IID Data Quagmire of Decentralized Machine Learning. arXiv **2019**, arXiv:1910.00189. [Google Scholar]
- Fung, C.; Yoon, C.J.M.; Beschastnikh, I. Mitigating Sybils in Federated Learning Poisoning. arXiv **2018**, arXiv:1808.04866. [Google Scholar]
- Deng, L. The MNIST Database of Handwritten Digit Images for Machine Learning Research. IEEE Signal Process. Mag. **2012**, 29, 141–142. [Google Scholar] [CrossRef]
- Kingma, D.P.; Rezende, D.J.; Mohamed, S.; Welling, M. Semi-Supervised Learning with Deep Generative Models. arXiv **2014**, arXiv:1406.5298. [Google Scholar]
- Gordon, J.; Hernández-Lobato, J.M. Bayesian Semisupervised Learning with Deep Generative Models. arXiv **2017**, arXiv:1706.09751. [Google Scholar]
- Tax, D.M.; Duin, R.P. Support vector data description. Mach. Learn. **2004**, 54, 45–66. [Google Scholar] [CrossRef]
- Sabokrou, M.; Khalooei, M.; Fathy, M.; Adeli, E. Adversarially Learned One-Class Classifier for Novelty Detection. arXiv **2018**, arXiv:1802.09088. [Google Scholar]
- Pidhorskyi, S.; Almohsen, R.; Adjeroh, D.A.; Doretto, G. Generative Probabilistic Novelty Detection with Adversarial Autoencoders. arXiv **2018**, arXiv:1807.02588. [Google Scholar]
- Perera, P.; Nallapati, R.; Xiang, B. OCGAN: One-class Novelty Detection Using GANs with Constrained Latent Representations. arXiv **2019**, arXiv:1903.08550. [Google Scholar]
- Dewdney, P.; Turner, W.; Braun, R.; Santander-Vela, J.; Waterson, M.; Tan, G.H. SKA1 System Baseline v2 Description; SKA Organisation: Macclesfield, UK, 2015. [Google Scholar]
- Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep variational information bottleneck. arXiv **2016**, arXiv:1612.00410. [Google Scholar]
- Estella-Aguerri, I.; Zaidi, A. Distributed variational representation learning. IEEE Trans. Pattern Anal. Mach. Intell. **2019**, 1. [Google Scholar] [CrossRef]
- Razeghi, B.; Stanko, T.; Škorić, B.; Voloshynovskiy, S. Single-Component Privacy Guarantees in Helper Data Systems and Sparse Coding with Ambiguation. In Proceedings of the IEEE International Workshop on Information Forensics and Security (WIFS), Delft, The Netherlands, 9–12 December 2019. [Google Scholar]
- Chen, Y.; Sun, X.; Jin, Y. Communication-Efficient Federated Deep Learning with Asynchronous Model Update and Temporally Weighted Aggregation. arXiv **2019**, arXiv:1903.07424. [Google Scholar] [CrossRef]
- Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science **2006**, 313, 504–507. [Google Scholar] [CrossRef]
- Bruna, J.; Mallat, S. Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell. **2013**, 35, 1872–1886. [Google Scholar] [CrossRef]
- Rezaeifar, S.; Taran, O.; Voloshynovskiy, S. Classification by Re-generation: Towards Classification Based on Variational Inference. arXiv **2018**, arXiv:1809.03259. [Google Scholar]
- Cover, T.M.; Thomas, J.A. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing); Wiley-Interscience: Hoboken, NJ, USA, 2006. [Google Scholar]
- Zhang, Y.; Ozay, M.; Sun, Z.; Okatani, T. Information Potential Auto-Encoders. arXiv **2017**, arXiv:1706.04635. [Google Scholar]
- Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv **2013**, arXiv:1312.6114. [Google Scholar]
- Mallat, S. Group invariant scattering. Commun. Pure Appl. Math. **2012**, 65, 1331–1398. [Google Scholar] [CrossRef]
- Bernstein, S.; Bouchot, J.L.; Reinhardt, M.; Heise, B. Generalized analytic signals in image processing: Comparison, theory and applications. In Quaternion and Clifford Fourier Transforms and Wavelets; Birkhäuser: Basel, Switzerland, 2013; pp. 221–246. [Google Scholar]
- Zhao, H.; Gallo, O.; Frosio, I.; Kautz, J. Is L2 a Good Loss Function for Neural Networks for Image Processing. arXiv **2015**, arXiv:1511.08861. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv **2014**, arXiv:1412.6980. [Google Scholar]
- Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. arXiv **2016**, arXiv:1609.04802. [Google Scholar]
- Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv **2017**, arXiv:1708.07747. [Google Scholar]
- Byerly, A.; Kalganova, T.; Dear, I. A Branching and Merging Convolutional Network with Homogeneous Filter Capsules. arXiv **2020**, arXiv:2001.09136. [Google Scholar]
- Hirata, D.; Takahashi, N. Ensemble learning in CNN augmented with fully connected subnetworks. arXiv **2020**, arXiv:2003.08562. [Google Scholar]
- Kowsari, K.; Heidarysafa, M.; Brown, D.E.; Meimandi, K.J.; Barnes, L.E. RMDL: Random Multimodel Deep Learning for Classification. arXiv **2018**, arXiv:1805.01890. [Google Scholar]
- Harris, E.; Marcu, A.; Painter, M.; Niranjan, M.; Prügel-Bennett, A.; Hare, J. FMix: Enhancing Mixed Sample Data Augmentation. arXiv **2020**, arXiv:2002.12047. [Google Scholar]
- Bhatnagar, S.; Ghosal, D.; Kolekar, M.H. Classification of fashion article images using convolutional neural networks. In Proceedings of the 2017 Fourth International Conference on Image Information Processing (ICIIP), Shimla, India, 21–23 December 2017. [Google Scholar]
- Hao, W.; Mehta, N.; Liang, K.J.; Cheng, P.; El-Khamy, M.; Carin, L. WAFFLe: Weight Anonymized Factorization for Federated Learning. arXiv **2020**, arXiv:2008.05687. [Google Scholar]
- Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv **2015**, arXiv:1502.03167. [Google Scholar]
- Hahnloser, R.H.R.; Sarpeshkar, R.; Mahowald, M.A.; Douglas, R.J.; Seung, H.S. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature **2000**, 405, 947–951. [Google Scholar] [CrossRef] [PubMed]
- HasanPour, S.H.; Rouhani, M.; Fayyaz, M.; Sabokrou, M. Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures. arXiv **2016**, arXiv:1608.06037. [Google Scholar]
- Belghazi, M.I.; Baratin, A.; Rajeswar, S.; Ozair, S.; Bengio, Y.; Courville, A.; Hjelm, R.D. MINE: Mutual Information Neural Estimation. arXiv **2018**, arXiv:1801.04062. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Quattoni, A.; Torralba, A. Recognizing indoor scenes. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 413–420. [Google Scholar]
- Huang, G.B.; Ramesh, M.; Berg, T.; Learned-Miller, E. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments; Technical Report 07-49; University of Massachusetts: Amherst, MA, USA, 2007. [Google Scholar]

| Scattering Features for One Given Path by Growing Deepness | Number of Channels | $S^2(\mathbf{x})$ $(J=2)$ | $S^3(\mathbf{x})$ $(J=3)$ | Tensor Sizes | $S^2(\mathbf{x})$ $(J=2)$ | $S^3(\mathbf{x})$ $(J=3)$ |
|---|---|---|---|---|---|---|
| $\mathbf{x}\star\varphi_J(2^J u)$ | 1 | 1 | 1 | $N_{S^J}$ | 81 | 729 |
| $\lvert\mathbf{x}\star\psi_{j_1}^{\alpha_1}\rvert\star\varphi_J(2^J u)$ | $JL$ | 16 | 24 | Height | $H/4$ | $H/8$ |
| $\lvert\lvert\mathbf{x}\star\psi_{j_1}^{\alpha_1}\rvert\star\psi_{j_2}^{\alpha_2}\rvert\star\varphi_J(2^J u)$ | $\binom{J}{2}L^2$ | 64 | 192 | Width | $W/4$ | $W/8$ |
| $\lvert\lvert\lvert\mathbf{x}\star\psi_{j_1}^{\alpha_1}\rvert\star\psi_{j_2}^{\alpha_2}\rvert\star\psi_{j_3}^{\alpha_3}\rvert\star\varphi_J(2^J u)$ | $\binom{J}{3}L^3$ | 0 | 512 | | | |
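The channel counts above follow directly from the path-counting formulas $1$, $JL$, $\binom{J}{2}L^2$, $\binom{J}{3}L^3$; a quick sanity check in plain Python (assuming $L = 8$ orientations, which the totals $81 = (1+8)^2$ and $729 = (1+8)^3$ suggest):

```python
from math import comb

def scatnet_channels(J, L, max_order=3):
    """Total number of scattering channels for J scales and L orientations,
    summing paths of order m = 0..min(J, max_order): comb(J, m) * L**m."""
    return sum(comb(J, m) * L**m for m in range(min(J, max_order) + 1))

# J=2, L=8: 1 + 2*8 + 1*64 = 81;  J=3, L=8: 1 + 24 + 192 + 512 = 729
print(scatnet_channels(2, 8))  # 81
print(scatnet_channels(3, 8))  # 729
```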

| Stage | Number of Channels | Filter Size | Stride | Size Scale | Activation |
|---|---|---|---|---|---|
| Input $\mathbf{z}_m = E_{\varphi_m}(\mathbf{x})$ | $N_{\mathbf{z}}$ | | | $1/2^J$ | |
| Batch Normalization | | | | | |
| Convolution | $2^{3(J+1)}c$ | $3\times 3$ | $1\times 1$ | $1/2^J$ | ReLU |
| Deconvolution | $2^{3J}c$ | $4\times 4$ | $2\times 2$ | $1/2^{J-1}$ | |
| Batch Normalization | | | | | ReLU |
| Deconvolution | $2^{3(J-1)}c$ | $4\times 4$ | $2\times 2$ | $1/2^{J-2}$ | |
| Batch Normalization | | | | | ReLU |
| ⋮ | | | | | |
| Deconvolution, output: $\widehat{\mathbf{x}}$ | $c$ | $4\times 4$ | $2\times 2$ | $1$ | tanh |
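The size-scale progression of the decoder can be traced programmatically; a minimal sketch in plain Python (assuming each $4\times 4$, stride-2 deconvolution exactly doubles the spatial side, which holds, e.g., with padding 1):

```python
def decoder_size_scales(J):
    """Size scale relative to the input image after each stride-2
    deconvolution, starting from the 1/2^J latent grid of z_m."""
    scale = 2.0 ** -J          # z_m lives on an (H/2^J) x (W/2^J) grid
    scales = [scale]
    for _ in range(J):         # each stride-2 deconvolution doubles the side
        scale *= 2
        scales.append(scale)
    return scales

print(decoder_size_scales(2))  # [0.25, 0.5, 1.0]
```

For $J=2$ this reproduces the Size Scale column: $1/2^J$, $1/2^{J-1}$, ..., $1$.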

Centralized methods vs. FedAvg:

| Method | BMCNN + HC | EnsNet | RMDL | FedAvg, IID | FedAvg, Non-IID |
|---|---|---|---|---|---|
| Testing data error | $\mathbf{0.16}$ | $\mathbf{0.16}$ | $0.18$ | $1.43$ | $\mathbf{7.77}$ |

Proposed fully decentralized ON–OC–IBC:

| Method | $d_{\ell_1}$ | $d_{VGG}$ | $d_{.2}$ | $d_{.3}$ | $d_{.4}$ | $d_{.5}$ | $d_{.6}$ | $d_{.7}$ |
|---|---|---|---|---|---|---|---|---|
| Training data error | $1.5$ | $\mathbf{0}$ | $3.1$ | $1.5$ | $\mathbf{0}$ | $\mathbf{0}$ | $\mathbf{0}$ | $1.5$ |
| Testing data error | $4.6$ | $3.1$ | $1.5$ | $3.1$ | $\mathbf{0}$ | $4.6$ | $6.2$ | $7.8$ |

Centralized methods vs. FedAvg and WAFFLe:

| Method | RN18 + FMix | CNN | CNN++ | LSTM | FedAvg, Uni | FedAvg, Multi | WAFFLe, Uni | WAFFLe, Multi |
|---|---|---|---|---|---|---|---|---|
| Testing data error | $\mathbf{3.64}$ | $8.83$ | $7.46$ | $11.74$ | $16.04$ | $16.57$ | $\mathbf{12.88}$ | $13.91$ |

Proposed fully decentralized ON–OC–IBC:

| Method | $d_{\ell_1}$ | $d_{VGG}$ | $d_{.2}$ | $d_{.3}$ | $d_{.4}$ |
|---|---|---|---|---|---|
| Testing data error | $\mathbf{10.1}$ | $12.2$ | $12.0$ | $13.1$ | $14.4$ |

| K | 1 | 4 | 5 | 6 | 15 | 20 | 50 | 100 | ∞ |
|---|---|---|---|---|---|---|---|---|---|
| On train (%) | $80.4$ | $19.1$ | $0$ | $19.0$ | $89.6$ | $91.1$ | $91.8$ | $91.1$ | $90.2$ |
| On test (%) | $90.2$ | $23.9$ | $0$ | $24.4$ | $89.7$ | $91.2$ | $92.1$ | $91.2$ | $90.2$ |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).