A paper review of Wasserstein GAN
Generative Adversarial Networks (GANs)
In this blog post, we explore how Wasserstein GANs (WGANs) address the instability of standard GAN training by replacing the Jensen-Shannon divergence with the Wasserstein distance, and how the resulting Lipschitz constraint on the critic can be enforced in practice.
Standard GANs consist of two neural networks, the generator (G) and the discriminator (D), engaged in a minimax game. The generator aims to create samples indistinguishable from real data, while the discriminator strives to differentiate between real and generated samples. This competition can be expressed as the minimax objective:
\[\label{eq:gan_loss} \begin{aligned} \min_G \max_D & \, \mathbb{E}_{x \sim P_r} [\log D(x)] + \mathbb{E}_{\hat{x} \sim P_g} [\log (1 - D(\hat{x}) )], \end{aligned}\]where \(P_r\) denotes the real data distribution, and \(P_g\) is the distribution induced by the generator over the data space.
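Before moving on, here is a minimal PyTorch sketch of how this objective is typically set up in practice. The tiny `G` and `D` networks and the random batch are placeholders for illustration only, not from the paper; note also that the generator is usually trained with the non-saturating heuristic of maximizing \(\log D(\hat{x})\) rather than minimizing \(\log(1 - D(\hat{x}))\).

```python
import torch
import torch.nn as nn

# Placeholder networks for a 2-D toy problem (not from the paper).
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))
D = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, 2)       # stand-in for a batch drawn from P_r
fake = G(torch.randn(32, 64))   # samples from P_g

# Discriminator ascends log D(x) + log(1 - D(x_hat)), i.e. standard binary
# cross-entropy with label 1 for real and 0 for generated samples.
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))

# Generator (non-saturating heuristic): maximize log D(x_hat),
# i.e. try to make the discriminator label fakes as real.
g_loss = bce(D(fake), torch.ones(32, 1))
```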
In practice, training with this objective is notoriously unstable: the generator's gradients can vanish and optimization frequently collapses onto a few modes of the data. These problems largely arise from the nature of the Jensen-Shannon (JS) divergence that the standard GAN objective implicitly minimizes. When the supports of \(P_r\) and \(P_g\) do not overlap, the JS divergence is constant, leaving the generator with vanishing gradients.
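To see why the JS divergence degenerates, write \(m = \tfrac{1}{2}(P_r + P_g)\) and recall that
\[JS(P_r, P_g) = \tfrac{1}{2} KL(P_r \,\|\, m) + \tfrac{1}{2} KL(P_g \,\|\, m).\]
If the supports of \(P_r\) and \(P_g\) are disjoint, then on the support of \(P_r\) the density ratio \(dP_r/dm\) equals 2 (and likewise for \(P_g\)), so each KL term equals \(\log 2\) and
\[JS(P_r, P_g) = \log 2,\]
no matter how far apart the two distributions are. A constant objective provides no gradient signal for the generator to follow.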
Wasserstein GANs propose an elegant solution by replacing the JS divergence with the Wasserstein distance (also known as the Earth Mover’s distance), which measures the cost of transforming one distribution into another. This metric remains meaningful even when the distributions have non-overlapping supports.
The Wasserstein distance between two probability distributions \(P_r\) and \(P_g\) is defined as:
\[\label{eq:wasserstein} W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|],\]where \(\Pi(P_r, P_g)\) is the set of all joint distributions \(\gamma(x, y)\) whose marginals are \(P_r\) and \(P_g\), respectively. Intuitively, it represents the minimum amount of “work” needed to transform \(P_g\) into \(P_r\), considering the cost of moving mass from point \(x\) to \(y\).
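As a quick illustration (not from the paper), the one-dimensional Wasserstein distance can be estimated directly from samples with SciPy, and it grows smoothly as one distribution is shifted away from the other:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=10_000)        # samples from P_r

# Shift the "generated" distribution progressively further from P_r.
for shift in [0.0, 1.0, 5.0, 20.0]:
    fake = rng.normal(loc=shift, scale=1.0, size=10_000)  # samples from P_g
    print(f"shift={shift:5.1f}  W ~ {wasserstein_distance(real, fake):.2f}")
```

The estimated distance tracks the shift (roughly 0, 1, 5, 20), whereas the JS divergence would saturate at \(\log 2\) as soon as the overlap becomes negligible. This direct computation is convenient only in one dimension, however; for high-dimensional data and an implicit generator distribution it does not scale, which is why the reformulation below matters.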
In practice, directly computing the Wasserstein distance is intractable. Instead, the problem can be reformulated using Kantorovich-Rubinstein duality:
\[\label{eq:kr_duality} W(P_r, P_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{\hat{x} \sim P_g}[f(\hat{x})],\]
where the supremum is taken over all 1-Lipschitz functions \(f\), that is, functions satisfying \(|f(x_1) - f(x_2)| \leq \|x_1 - x_2\|\) for all \(x_1, x_2\).
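A quick sanity check of the duality (not from the paper): let \(P_r = \delta_0\) and \(P_g = \delta_\theta\) be point masses on the real line with \(\theta > 0\). The only coupling transports all mass from \(\theta\) to \(0\), so the primal definition gives \(W(P_r, P_g) = \theta\). On the dual side, any 1-Lipschitz \(f\) satisfies \(f(0) - f(\theta) \leq \theta\), and the choice \(f(x) = -x\) attains this bound, so the supremum is also \(\theta\).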
In the context of WGANs, the discriminator \(D\), usually called the critic because it outputs an unbounded score rather than a probability, is used to approximate the optimal 1-Lipschitz function \(f\). The resulting losses for the critic and the generator can be written as:
\[\label{eq:wgan_loss} \begin{aligned} \max_D & \, \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{\hat{x} \sim P_g}[D(\hat{x})], \\ \min_G & \, -\mathbb{E}_{\hat{x} \sim P_g}[D(\hat{x})]. \end{aligned}\]A critical requirement for the discriminator in WGANs is the 1-Lipschitz constraint. In the original WGAN formulation, this constraint was enforced by clipping the weights of the discriminator to a small fixed range (e.g., \([-0.01, 0.01]\)) after every update. However, weight clipping is a crude way to impose the constraint: it can underuse the critic's capacity, biasing it toward overly simple functions, and can cause vanishing or exploding gradients depending on the clipping threshold.
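The sketch below shows how these losses and weight clipping fit together in a PyTorch training loop. It is a minimal toy example rather than the authors' code: the networks and the data sampler are placeholders, while the clip value of 0.01, five critic updates per generator update, and RMSProp with a learning rate of \(5 \times 10^{-5}\) follow the defaults reported in the original WGAN paper.

```python
import torch
import torch.nn as nn

latent_dim, n_critic, clip_value = 64, 5, 0.01

# Placeholder networks; the critic has no sigmoid because it outputs a score, not a probability.
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, 2))
D = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 1))

opt_d = torch.optim.RMSprop(D.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)

def sample_real(batch_size=32):
    """Stand-in for a real data loader: a shifted 2-D Gaussian blob."""
    return torch.randn(batch_size, 2) + torch.tensor([3.0, 3.0])

for step in range(1000):
    # Train the critic n_critic times per generator step.
    for _ in range(n_critic):
        real = sample_real()
        fake = G(torch.randn(real.size(0), latent_dim)).detach()
        # Critic maximizes E[D(real)] - E[D(fake)], so we minimize the negative.
        d_loss = D(fake).mean() - D(real).mean()
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()
        # Enforce the Lipschitz constraint by clipping weights to [-c, c].
        for p in D.parameters():
            p.data.clamp_(-clip_value, clip_value)

    # Generator minimizes -E[D(fake)], equivalently maximizes E[D(fake)].
    fake = G(torch.randn(32, latent_dim))
    g_loss = -D(fake).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```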
An improvement to this approach is the gradient penalty method, which drops weight clipping and instead adds a soft penalty on the norm of the critic's gradient to the critic loss:
\[\label{eq:gradient_penalty} \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}} \left[ \left( \|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1 \right)^2 \right],\]
where \(P_{\hat{x}}\) is the distribution of points sampled uniformly along straight lines between pairs of points drawn from \(P_r\) and \(P_g\), and \(\lambda\) is the penalty coefficient (set to 10 in the gradient penalty paper). The motivation is that a differentiable function is 1-Lipschitz exactly when its gradient norm is at most 1 everywhere, so deviations of the gradient norm from 1 are penalized directly.
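A gradient penalty of this form can be implemented with a few lines of autograd. The sketch below assumes `D` maps a batch of samples to one score per sample and that `real` and `fake` have the same shape; `lambda_gp = 10` matches the value above.

```python
import torch

def gradient_penalty(D, real, fake, lambda_gp=10.0):
    """Penalty that pushes the critic's gradient norm toward 1 on interpolated samples."""
    # One interpolation coefficient per sample, broadcast over the remaining dimensions.
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = eps * real + (1 - eps) * fake
    interp = interp.detach().requires_grad_(True)  # differentiate w.r.t. the interpolates only

    scores = D(interp)
    grads, = torch.autograd.grad(
        outputs=scores,
        inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,  # keep the graph so the penalty itself can be backpropagated
    )
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```

In the critic update, this term is simply added to the loss from before, e.g. `d_loss = D(fake).mean() - D(real).mean() + gradient_penalty(D, real, fake)`, and the weight-clipping step is removed.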
Wasserstein GANs mark a significant advancement in generative modeling, especially for image synthesis. By replacing the JS divergence with the Wasserstein distance, WGANs achieve more stable training dynamics, provide a loss that correlates with sample quality, and generate higher-quality samples.
Their theoretical grounding in optimal transport and practical improvements such as the gradient penalty make WGANs a valuable tool in the arsenal of machine learning practitioners. As research in this area progresses, we can expect further refinements and hybrid models that continue to push the boundaries of what generative models can achieve.
In subsequent posts, we will explore other transformative advancements in AI and machine learning, providing a comprehensive view of the state-of-the-art technologies shaping our world.