CycleGAN and ECycleGAN

Generative Adversarial Networks for Image-to-Image Translation

A generative adversarial network (GAN) is a class of machine learning frameworks designed by Goodfellow et al. . Two neural networks compete with each other: in particular, a discriminator and a generator are trained at the same time, competing in a two-player adversarial game. The generator receives an input that belongs to a certain data distribution and produces a new sample that should belong to a target distribution. The discriminator, on the other hand, is fed with a sample that may or may not have been produced by the generator and outputs the probability of that sample being real (not generated) rather than fake (generated).

The goal of the generator is to fool the discriminator by producing realistic images belonging to the target domain, while the goal of the discriminator is to learn to distinguish between real and fake samples.
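To make this adversarial game concrete, the following PyTorch sketch alternates one discriminator update and one generator update. The toy fully connected networks, optimizer settings, and binary cross-entropy objective are illustrative assumptions, not a specific published architecture.

```python
import torch
import torch.nn as nn

# Toy generator (noise -> sample) and discriminator (sample -> P(real)); shapes are placeholders.
G = nn.Sequential(nn.Linear(100, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real):
    batch = real.size(0)
    noise = torch.randn(batch, 100)

    # Discriminator step: push D(real) towards 1 and D(G(noise)) towards 0.
    opt_d.zero_grad()
    fake = G(noise).detach()  # stop gradients from flowing into G
    loss_d = bce(D(real), torch.ones(batch, 1)) + bce(D(fake), torch.zeros(batch, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step: try to fool D, i.e. push D(G(noise)) towards 1.
    opt_g.zero_grad()
    loss_g = bce(D(G(noise)), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```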

Traditionally, the input of the generator is sampled from a known distribution and is referred to as noise. When the input of the generator is instead an image of a specific (source) domain that should be transformed into an image of a different (target) domain, the addressed task is image-to-image translation .

In particular, the aim is to learn the mapping between an input image and an output image by using a training set of aligned image pairs. This can be useful, for instance, to perform style transfer, season transfer, and photo enhancement.

Recently, different works have been developed to solve this task. For instance, Isola et al. proposed Pix2Pix, a conditional adversarial network that learns the mapping from input images to output images. However, training this network requires paired input-output images.

Therefore, Zhu et al. proposed CycleGAN, a new architecture that learns to translate an image from a source domain \(X\) to a target domain \(Y\) in the absence of paired samples.

CycleGAN makes use of two generator-discriminator pairs, where each pair learns to transform images from one domain to the other. The goal is to learn a mapping \(G : X \rightarrow Y\) such that the distribution of the generated images \(G(X)\) is indistinguishable from the distribution of images belonging to domain \(Y\); at the same time, the analogous property should hold for the reverse mapping \(F : Y \rightarrow X\). However, there are countless mappings between the two domains. For this reason, the authors introduced the cycle consistency loss, which enforces \(F(G(X)) \approx X\) and \(G(F(Y)) \approx Y\). This means that, given an image \(x\) belonging to domain \(X\) that is transformed by \(G\) into an image \(\tilde{x}\) belonging to domain \(Y\), transforming \(\tilde{x}\) with \(F\) should approximately recover \(x\).

An additional constraint is added, since the authors show that it can provide better-quality solutions. This constraint, called the identity constraint, enforces that \(F(X) \approx X\) and \(G(Y) \approx Y\): the architecture should not modify an image that already belongs to the target domain. To accomplish this, the loss function is composed of multiple terms. First of all, as usual, the GANs are trained using an adversarial loss . However, the original loss is replaced with a least-squares loss, since it ensures more stable training and higher-quality results. Thus, the objective function used to train the generator \(G : X \rightarrow Y\) and its corresponding discriminator \(D_{Y}\) is:

\[\displaylines{ \mathcal{L}_{\text{LSGAN}}[G,\ D_{Y}](X,\ Y)=\mathbb{E}_{y\sim p_{\text{data}}(Y)}[(D_{Y}(y)-1)^{2}] \\ +\mathbb{E}_{x\sim p_{\text{data}}(X)}[D_{Y}(G(x))^{2}] }\]
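As a reference, a possible PyTorch implementation of this least-squares objective is sketched below. Here D_Y, real_y, and fake_y = G(x) are assumed modules and tensors, and the generator-side term, which the equation above does not write explicitly, is included for completeness.

```python
import torch

def lsgan_terms(D_Y, real_y, fake_y):
    """Least-squares adversarial objective for G: X -> Y and its discriminator D_Y."""
    # Discriminator objective: (D_Y(y) - 1)^2 + D_Y(G(x))^2, with the fake sample detached.
    loss_d = ((D_Y(real_y) - 1) ** 2).mean() + (D_Y(fake_y.detach()) ** 2).mean()
    # Generator objective (the term G actually minimizes): (D_Y(G(x)) - 1)^2.
    loss_g = ((D_Y(fake_y) - 1) ** 2).mean()
    return loss_d, loss_g
```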

Another important term is the cycle consistency loss used to ensure both forward cycle-consistency, i.e., \(x \rightarrow G(x) \rightarrow F(G(x)) \approx x\), and backward cycle-consistency, i.e., \(y \rightarrow F(y) \rightarrow G(F(y)) \approx y\). This is done in a pixel-wise manner and is expressed as:

\[\label{eq:pixel-cyc} \displaylines{ \mathcal{L}_{pixel-cyc}[G, F](X,\ Y)=\mathbb{E}_{x\sim P_{\text{data}}(X)}[\lVert F(G(x)) - x \rVert_1 ] \\ + \mathbb{E}_{y\sim P_{\text{data}}(Y)}[\lVert G(F(y)) - y \rVert_1 ] }\]
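A minimal sketch of this pixel-wise cycle term, assuming G and F are PyTorch modules and x, y are image batches:

```python
import torch.nn.functional as TF  # aliased to avoid clashing with the generator named F

def pixel_cycle_loss(G, F, x, y):
    """L1 cycle consistency: F(G(x)) should reconstruct x, and G(F(y)) should reconstruct y."""
    return TF.l1_loss(F(G(x)), x) + TF.l1_loss(G(F(y)), y)
```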

Finally, the last term to be considered is the identity loss used to ensure the identity constraints for both the \(X\) and \(Y\) domains and expressed via the following pixel-wise computation:

\[\label{eq:pixel-idt} \displaylines{ \mathcal{L}_{pixel-idt}[G, F](X,\ Y) =\mathbb{E}_{x\sim P_{\text{data}}(X)}[\lVert F(x) - x \rVert_1 ] \\ +\mathbb{E}_{y\sim P_{\text{data}}(Y)}[\lVert G(y) - y \rVert_1 ] }\]
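The identity term can be sketched in the same way, under the same assumptions on G, F, x, and y:

```python
import torch.nn.functional as TF

def pixel_identity_loss(G, F, x, y):
    """L1 identity constraint: F should leave X-domain images unchanged, and G likewise for Y."""
    return TF.l1_loss(F(x), x) + TF.l1_loss(G(y), y)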

Thus, the overall loss function is given by a weighted sum of these three terms:

\[\label{eq:overall} \displaylines{ \mathcal{L}[G, F, D_{X}, D_{Y}](X,\ Y) = \mathcal{L}_{\text{LSGAN}}[G,\ D_{Y}](X,\ Y) \\ +\mathcal{L}_{\text{LSGAN}}[F,\ D_{X}](Y,\ X) \\ +\lambda_{cyc} \mathcal{L}_{pixel-cyc}[G, F](X,\ Y) \\ +\lambda_{idt} \mathcal{L}_{pixel-idt}[G, F](X,\ Y) }\]
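Putting the pieces together, a generator-side training objective could be assembled as follows. The helpers are the sketches above, and the \(\lambda\) values are assumptions whose best setting depends on the task.

```python
def generator_objective(G, F, D_X, D_Y, x, y, lambda_cyc=10.0, lambda_idt=5.0):
    """Weighted sum of adversarial, cycle-consistency, and identity terms (generator side)."""
    fake_y, fake_x = G(x), F(y)
    # Least-squares adversarial terms for both translation directions.
    adv = ((D_Y(fake_y) - 1) ** 2).mean() + ((D_X(fake_x) - 1) ** 2).mean()
    cyc = pixel_cycle_loss(G, F, x, y)
    idt = pixel_identity_loss(G, F, x, y)
    return adv + lambda_cyc * cyc + lambda_idt * idt
```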

Even though CycleGAN can produce quite impressive results for the task of two-domain translation, the details about texture and style are often accompanied by unpleasant artifacts, as reported in . Thus, to improve the effectiveness of domain translation and obtain more realistic images, Zhang et al. proposed to modify the architecture, leading to the definition of the Enhanced CycleGAN (ECycleGAN).

One of the improvements suggested in the paper is to avoid relying only on pixel-wise losses to ensure cycle consistency, since this can result in perceptually unsatisfying solutions with overly smooth textures. Thus, a loss function that takes perceptual similarity into consideration is employed.

This perceptual loss function includes a term, named feature loss, that is the Euclidean distance between the high-level abstract feature representation of a cycle-reconstructed image \(F(G(x))\) and that of the original image \(x\). The feature representations are extracted using a pre-trained 19-layer VGG network . Here, \(\phi_{i,j}\) denotes the feature map obtained after the \(j\)-th convolution (after activation) and before the \(i\)-th max-pooling layer. This term is defined as:

\[\label{eq:ecycle_feature} \displaylines{ \mathcal{L}_{feature-cyc}[G, F](X,\ Y) \\ = \mathbb{E}_{i,j,x\sim P_{\text{data}}(X)}[\lVert \phi_{i,j}(F(G(x))) - \phi_{i,j}(x) \rVert_2 ] \\ + \mathbb{E}_{i,j,y\sim P_{\text{data}}(Y)}[\lVert \phi_{i,j}(G(F(y))) - \phi_{i,j}(y) \rVert_2 ] }\]
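One possible way to compute this feature term with a pre-trained VGG-19 is sketched below, fixing a single feature map instead of taking the expectation over \((i, j)\); the specific layer chosen here (relu5_4) and the use of torchvision are assumptions for illustration.

```python
import torch
import torchvision.models as models

# Frozen VGG-19 feature extractor; the slice up to relu5_4 is one illustrative choice of phi_{i,j}.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features[:36].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def feature_cycle_loss(G, F, x, y):
    """Euclidean distance between VGG features of cycle-reconstructed and original images."""
    # Inputs are assumed to already be normalized as the VGG network expects.
    return (torch.dist(vgg(F(G(x))), vgg(x), p=2)
            + torch.dist(vgg(G(F(y))), vgg(y), p=2))
```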

The feature loss and the pixel-wise loss are combined into the perceptual loss, which is used as the cycle consistency loss, where \(\alpha\) and \(\beta\) are the coefficients that balance the two terms.

\[\label{eq:ecycle_perceptual} \displaylines{ \mathcal{L}_{perc-cyc}[G, F](X,\ Y) = \alpha\, \mathcal{L}_{feature-cyc}[G, F](X,\ Y) \\ + \beta\, \mathcal{L}_{pixel-cyc}[G, F](X,\ Y) }\]
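In code this simply weights the two previous sketches; the \(\alpha\) and \(\beta\) defaults here are placeholders rather than the paper's settings.

```python
def perceptual_cycle_loss(G, F, x, y, alpha=1.0, beta=1.0):
    """Perceptual cycle consistency: weighted sum of the feature and pixel cycle terms."""
    return (alpha * feature_cycle_loss(G, F, x, y)
            + beta * pixel_cycle_loss(G, F, x, y))
```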

To further improve the quality of the images produced by CycleGAN, a major adjustment is made to the structure of the generator. More specifically, the original basic residual block used inside the generator architecture is replaced with a Residual Dense Normalization Block (RDNB), which consists of a multi-level residual network, dense connections , and instance normalization layers . This modification follows the observation that more layers and connections can improve the performance of neural networks. However, to prevent instability when training a very deep network, a scaling factor named residual scaling is used to scale down the residuals. Thus, the perceptual loss encourages the translated images to be more realistic, while the introduction of RDNB blocks allows the generation of high-quality images.
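The sketch below illustrates the idea of an RDNB-style block: densely connected convolutions with instance normalization, wrapped in residual connections that are scaled down for training stability. The number of sub-blocks, convolutions per sub-block, and channel growth are assumptions and not necessarily the configuration used in ECycleGAN.

```python
import torch
import torch.nn as nn

class DenseSubBlock(nn.Module):
    """Densely connected convolutions with instance normalization and residual scaling."""
    def __init__(self, channels=64, growth=32, scale=0.2):
        super().__init__()
        self.scale = scale
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels + i * growth, growth, kernel_size=3, padding=1),
                nn.InstanceNorm2d(growth),
                nn.LeakyReLU(0.2, inplace=True),
            )
            for i in range(4)
        ])
        self.fuse = nn.Conv2d(channels + 4 * growth, channels, kernel_size=3, padding=1)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(conv(torch.cat(feats, dim=1)))  # dense connections: reuse all previous features
        out = self.fuse(torch.cat(feats, dim=1))
        return x + self.scale * out                      # scaled residual connection

class RDNB(nn.Module):
    """Multi-level residual structure built from dense, instance-normalized sub-blocks."""
    def __init__(self, channels=64, scale=0.2):
        super().__init__()
        self.blocks = nn.Sequential(*[DenseSubBlock(channels, scale=scale) for _ in range(3)])
        self.scale = scale

    def forward(self, x):
        return x + self.scale * self.blocks(x)
```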

I proposed a work on facial expression data augmentation using both CycleGAN and ECycleGAN. You can find more details about this project at this link.