Generative Adversarial Networks for Image-to-Image Translation
Generative adversarial networks (GANs) are a class of machine learning frameworks designed by Goodfellow et al., in which two networks, a generator and a discriminator, are trained in competition with each other.
The goal of the generator is to fool the discriminator, producing realistic images belonging to the target domain, while the goal of the discriminator is to find a way to distinguish between real and fake samples.
Traditionally, the input of the generator is sampled from a known distribution and is referred to as noise. When the generator instead receives an image from a specific (source) domain that should be transformed into an image of a different (target) domain, the task being addressed is Image-to-Image translation.
In particular, the aim is to learn the mapping between an input image and an output image by using a training set of aligned image pairs. This could be useful, for instance, to perform style transfer, season transfer, and photo enhancement.
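Before looking at specific architectures, the following minimal sketch of a single adversarial training step (in PyTorch) may help fix ideas; the toy networks, data, and hyperparameters are placeholders and do not correspond to any of the works discussed here.

```python
# Minimal sketch of one GAN training step (toy networks and data;
# every architecture and hyperparameter here is an illustrative placeholder).
import torch
import torch.nn as nn

noise_dim, data_dim = 16, 64

# Placeholder generator: maps noise to a "fake" sample of the target domain.
G = nn.Sequential(nn.Linear(noise_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
# Placeholder discriminator: outputs a realness score (logit) for a sample.
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, data_dim)    # stand-in for a batch of real samples
noise = torch.randn(32, noise_dim)  # noise input of the generator

# Discriminator step: learn to separate real samples from generated ones.
fake = G(noise).detach()            # detach so only D is updated here
loss_D = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_D.zero_grad(); loss_D.backward(); opt_D.step()

# Generator step: try to fool the discriminator into labelling fakes as real.
fake = G(noise)
loss_G = bce(D(fake), torch.ones(32, 1))
opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```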
Recently, different works have been developed to solve this task. For instance, Isola et al. proposed a conditional GAN that learns this mapping directly, but it requires a training set of paired images, which can be hard or expensive to collect.
Therefore, Zhu et al. introduced CycleGAN, which learns to translate between two domains without requiring paired training examples.
CycleGAN makes use of two GANs, i.e., two generator-discriminator pairs, where each generator transforms images from one domain to the other. The goal is to learn a mapping \(G : X \rightarrow Y\) such that the distribution of generated images \(G(X)\) is indistinguishable from the distribution of images belonging to domain \(Y\); at the same time, the inverse mapping \(F : Y \rightarrow X\) should satisfy the analogous property for domain \(X\). However, there are countless mappings between the two domains. For this reason, the authors introduced the cycle consistency loss to enforce that \(F(G(X)) \approx X\) and \(G(F(Y)) \approx Y\). In other words, given an image \(x\) belonging to domain \(X\) that is transformed by \(G\) into an image \(\tilde{x}\) belonging to domain \(Y\), transforming \(\tilde{x}\) with \(F\) should give back \(x\).
An additional constraint is added since the authors show that it can lead to better quality solutions. This constraint is called the identity constraint and it enforces that \(F(X) \approx X\) and \(G(Y) \approx Y\): each generator should not modify an image that already belongs to its target domain. To accomplish all of this, the loss function is composed of multiple terms. First of all, as usual, the GANs are trained using an adversarial loss, which here takes the least-squares (LSGAN) form.
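For completeness, and following the notation used below for the other terms, the least-squares adversarial term for the mapping \(G : X \rightarrow Y\) and its discriminator \(D_Y\) can be written as follows (this is the standard LSGAN formulation, reconstructed here rather than quoted from the paper):
\[\label{eq:lsgan} \displaylines{ \mathcal{L}_{\text{LSGAN}}[G,\ D_{Y}](X,\ Y) = \mathbb{E}_{y\sim P_{\text{data}}(Y)}[(D_{Y}(y) - 1)^2 ] \\ + \mathbb{E}_{x\sim P_{\text{data}}(X)}[D_{Y}(G(x))^2 ] }\]The analogous term \(\mathcal{L}_{\text{LSGAN}}[F,\ D_{X}](Y,\ X)\) is defined for the inverse mapping \(F\) and its discriminator \(D_{X}\).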
Another important term is the cycle consistency loss used to ensure both forward cycle-consistency, i.e., \(x \rightarrow G(x) \rightarrow F(G(x)) \approx x\), and backward cycle-consistency, i.e., \(y \rightarrow F(y) \rightarrow G(F(y)) \approx y\). This is done in a pixel-wise manner and is expressed as:
\[\label{eq:pixel-cyc} \displaylines{ \mathcal{L}_{pixel-cyc}[G, F](X,\ Y)=\mathbb{E}_{x\sim P_{\text{data}}(X)}[\lVert F(G(x)) - x \rVert_1 ] \\ + \mathbb{E}_{y\sim P_{\text{data}}(Y)}[\lVert G(F(y)) - y \rVert_1 ] }\]Finally, the last term to be considered is the identity loss, used to ensure the identity constraints for both the \(X\) and \(Y\) domains and expressed via the following pixel-wise computation:
\[\label{eq:pixel-idt} \displaylines{ \mathcal{L}_{pixel-idt}[G, F](X,\ Y) =\mathbb{E}_{x\sim P_{\text{data}}(X)}[\lVert F(x) - x \rVert_1 ] \\ +\mathbb{E}_{y\sim P_{\text{data}}(Y)}[\lVert G(y) - y \rVert_1 ] }\]Thus, the overall loss function is given by a weighted sum of these three terms:
\[\label{eq:overall} \displaylines{ \mathcal{L}[G, F, D_{X}, D_{Y}](X,\ Y) = \mathcal{L}_{\text{LSGAN}}[G,\ D_{Y}](X,\ Y) \\ +\mathcal{L}_{\text{LSGAN}}[F,\ D_{X}](Y,\ X) \\ +\lambda_{cyc} \mathcal{L}_{pixel-cyc}[G, F](X,\ Y) \\ +\lambda_{idt} \mathcal{L}_{pixel-idt}[G, F](X,\ Y) }\]
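As a concrete illustration, the following is a minimal PyTorch sketch of how the generator-side part of this overall objective could be assembled; the placeholder networks, the default weights, and the decision to update the discriminators in a separate step are choices of this sketch, not prescriptions from the original work.

```python
# Minimal sketch of the CycleGAN generator objective (placeholder networks;
# the adversarial part uses the least-squares form shown above).
import torch
import torch.nn as nn

def lsgan_generator_term(d_fake):
    # Generator side of the least-squares adversarial loss.
    return torch.mean((d_fake - 1) ** 2)

def cyclegan_generator_loss(G, F, D_X, D_Y, x, y, lambda_cyc=10.0, lambda_idt=5.0):
    """Weighted sum of adversarial, cycle-consistency and identity terms for the
    two generators (the discriminators would be updated in a separate step)."""
    l1 = nn.L1Loss()

    fake_y, fake_x = G(x), F(y)          # translations X -> Y and Y -> X
    rec_x, rec_y = F(fake_y), G(fake_x)  # cycle reconstructions

    loss_adv = lsgan_generator_term(D_Y(fake_y)) + lsgan_generator_term(D_X(fake_x))
    loss_cyc = l1(rec_x, x) + l1(rec_y, y)  # forward + backward cycle consistency
    loss_idt = l1(F(x), x) + l1(G(y), y)    # identity constraint
    return loss_adv + lambda_cyc * loss_cyc + lambda_idt * loss_idt

# Toy usage, just to exercise the function (identity generators, linear critics):
# G = F = nn.Identity(); D_X = D_Y = nn.Linear(64, 1)
# x, y = torch.randn(8, 64), torch.randn(8, 64)
# loss = cyclegan_generator_loss(G, F, D_X, D_Y, x, y)
```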
Even though CycleGAN can produce quite impressive results on the task of two-domain translation, the details about texture and style are often accompanied by unpleasant artifacts, as reported in the paper proposing ECycleGAN. One of the improvements suggested in that paper is to avoid relying only on pixel-wise losses to ensure cycle consistency, since this can result in perceptually unsatisfying solutions with overly smooth textures. Thus, a loss function that takes perceptual similarity into account is employed.
This perceptual loss function includes a term, named feature loss, defined as the Euclidean distance between the high-level feature representations of a cycle-reconstructed image \(F(G(x))\) and of the original image \(x\). The feature representations are extracted using a pre-trained 19-layer VGG network
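Denoting by \(\phi(\cdot)\) the feature extractor given by the chosen VGG-19 layer (the symbol \(\phi\) is introduced here for clarity), this feature loss can be written, by analogy with the pixel-wise cycle loss above, as:
\[\label{eq:feature-cyc} \displaylines{ \mathcal{L}_{feature-cyc}[G, F](X,\ Y) = \mathbb{E}_{x\sim P_{\text{data}}(X)}[\lVert \phi(F(G(x))) - \phi(x) \rVert_2 ] \\ + \mathbb{E}_{y\sim P_{\text{data}}(Y)}[\lVert \phi(G(F(y))) - \phi(y) \rVert_2 ] }\]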
The feature loss and the pixel-wise loss are combined into the perceptual loss, which is used as the cycle consistency loss; \(\alpha\) and \(\beta\) are the coefficients that balance these two terms:
\[\label{eq:ecycle_perceptual} \displaylines{ \mathcal{L}_{perc-cyc}[G, F](X,\ Y) = \alpha\, \mathcal{L}_{feature-cyc}[G, F](X,\ Y) \\ + \beta\, \mathcal{L}_{pixel-cyc}[G, F](X,\ Y) }\]To further improve the quality of the images produced by CycleGAN, a major adjustment to the structure of the generator is also made. More specifically, the original basic residual block used inside the generator architecture is replaced with a Residual Dense Normalization Block (RDNB), which combines a multi-level residual network with dense connections
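Returning to the perceptual term, the following is a minimal PyTorch sketch of how the VGG-based feature distance could be computed; the chosen layer, the use of torchvision's pretrained VGG-19, and the commented usage are assumptions of this sketch rather than details taken from the referenced work.

```python
# Minimal sketch of a VGG-19 feature (perceptual) distance in PyTorch.
# The layer index and the torchvision weights are assumptions of this sketch;
# inputs are expected to be 3-channel images normalized with ImageNet statistics.
import torch
import torch.nn as nn
from torchvision import models

class VGGFeatureLoss(nn.Module):
    def __init__(self, layer_index=35):  # index into vgg19.features (assumption)
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        self.extractor = vgg.features[:layer_index].eval()
        for p in self.extractor.parameters():
            p.requires_grad_(False)       # VGG stays fixed during training

    def forward(self, reconstructed, original):
        # Euclidean distance between the high-level feature representations.
        return torch.norm(self.extractor(reconstructed) - self.extractor(original), p=2)

# Perceptual cycle term, with alpha and beta as in the equation above (sketch):
# feat = VGGFeatureLoss()
# loss_perc = alpha * (feat(F(G(x)), x) + feat(G(F(y)), y)) + beta * loss_pixel_cyc
```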
I proposed a related work on Face Expression Data Augmentation that considers both CycleGAN and ECycleGAN. You can find more details about this project at this link.