<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Michal Pandy</title>
    <description>Personal blog of Michal Pándy. Come for the AI, stay for the jokes.</description>
    <link>https://mpmisko.github.io/</link>
    <atom:link href="https://mpmisko.github.io/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Tue, 30 Jul 2024 19:48:55 +0000</pubDate>
    <lastBuildDate>Tue, 30 Jul 2024 19:48:55 +0000</lastBuildDate>
    <generator>Jekyll v3.9.5</generator>
    
      <item>
        <title>AI Fundamentals: Energy-Based Models</title>
        <description>&lt;p&gt;This blog post was inspired by &lt;a href=&quot;https://x.com/bryan_johnson/status/1767603894094631200&quot;&gt;Bryan Johnson’s tweet&lt;/a&gt; from last March:&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
  &lt;img src=&quot;/assets/images/intro.png&quot; width=&quot;460&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;This tweet caught my eye: “solve death” and “EBMs” in one sentence. I chuckled, remembering my last encounter with Energy-Based Models (EBMs). Many hours of training, diverging loss curves, and an existential crisis later, I felt much closer to the pits of hell than to the fountain of youth. All jokes aside, let’s dive into the beautiful beasts that are EBMs.&lt;/p&gt;

&lt;h2 id=&quot;generative-models&quot;&gt;Generative models&lt;/h2&gt;

&lt;p&gt;Generative models are a class of algorithms designed to learn the underlying distribution of data and generate new samples that resemble the training data. When you are chatting with ChatGPT, there is a generative model generating text back to you. Some popular generative models you may have heard of are &lt;a href=&quot;https://lilianweng.github.io/posts/2021-07-11-diffusion-models/&quot;&gt;Diffusion Models&lt;/a&gt; (the stuff that powers &lt;a href=&quot;https://www.midjourney.com/&quot;&gt;Midjourney&lt;/a&gt;), &lt;a href=&quot;https://www.jeremyjordan.me/variational-autoencoders/&quot;&gt;VAEs&lt;/a&gt;, or &lt;a href=&quot;https://ym2132.github.io/GenerativeAdversarialNetworks_Goodfellow&quot;&gt;GANs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Each model has advantages and disadvantages, so model selection depends on factors such as data type, computational resources, and training stability. Energy-Based Models (EBMs) represent another class of generative models with their own unique strengths and limitations.&lt;/p&gt;

&lt;p&gt;We assume real data follows a complex distribution $p$. We seek model parameters $\theta$ to sample from a model distribution $p_\theta$ that mimics $p$.  The aim is to minimize the discrepancy between $p_\theta$ and $p$, allowing us to generate new samples that are indistinguishable from real data.&lt;/p&gt;

&lt;h2 id=&quot;introduction-to-ebms&quot;&gt;Introduction to EBMs&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://yann.lecun.com/exdb/publis/pdf/lecun-06.pdf&quot;&gt;EBMs&lt;/a&gt; offer a unique approach to generative modeling by framing the problem in terms of an energy function. Unlike other generative models that directly learn to produce data, EBMs learn to assign low energy to likely data configurations and high energy to unlikely ones.&lt;/p&gt;

&lt;p&gt;EBMs define an energy function $E_\theta(x)$ parameterized by $\theta$ (for example, a neural network), which maps each possible data configuration $x$ to a scalar energy value. The probability of a data point is then defined as:&lt;/p&gt;

\[p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)}\]

&lt;p&gt;where $Z(\theta) = \int \exp(-E_\theta(x)) dx$ is the normalizing constant.&lt;/p&gt;

&lt;p&gt;While this may sound overwhelming, all we have introduced so far is a way to assign non-negative values to data points $\exp(-E_\theta(x))$ and then normalize over the data domain to get a probability distribution. A bit like how you’d normalize a histogram of discrete data points to get a probability mass function.&lt;/p&gt;
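
&lt;p&gt;To make the normalization concrete, here is a minimal sketch in plain Python that evaluates a toy energy on a small discrete grid and normalizes $\exp(-E(x))$ into a probability mass function. The quadratic energy and the five-point grid are illustrative assumptions, not part of any model discussed in this post.&lt;/p&gt;

```python
import math

def energy(x):
    # Hypothetical toy energy: lowest at x = 0, growing with distance from it.
    return 0.5 * x * x

def normalize(energy_fn, grid):
    # Unnormalized weights exp(-E(x)) are always positive.
    weights = [math.exp(-energy_fn(x)) for x in grid]
    # On a finite grid, Z is a sum rather than an integral.
    z = sum(weights)
    return [w / z for w in weights]

grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
probs = normalize(energy, grid)
```

&lt;p&gt;The grid point with the lowest energy ($x = 0$) receives the largest probability, and the probabilities sum to one. In continuous, high-dimensional spaces the sum becomes an integral that we usually cannot compute, which is what makes $Z(\theta)$ troublesome.&lt;/p&gt;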

&lt;h2 id=&quot;sampling-ebms&quot;&gt;Sampling EBMs&lt;/h2&gt;

&lt;p&gt;It’s good to understand how to sample data from EBMs as it is a prerequisite for some of the training methods. One nice property of EBMs is that the gradient:&lt;/p&gt;

\[-\nabla_x E_\theta(x)\]

&lt;p&gt;equals $\nabla_x \log p_\theta(x)$, so it points towards regions of higher probability. To demonstrate this, let’s recall the definition of EBMs from &lt;a href=&quot;#introduction-to-ebms&quot;&gt;the introduction&lt;/a&gt;:&lt;/p&gt;

\[p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)}\]

&lt;p&gt;Then, taking the logarithm of both sides:&lt;/p&gt;

\[\log p_\theta(x) = -E_\theta(x) - \log Z(\theta)\]

&lt;p&gt;Now, if we take the gradient with respect to $x$, the $\log Z(\theta)$ term vanishes because it does not depend on $x$:&lt;/p&gt;

\[\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)\]

&lt;p&gt;The negative gradient of the energy function $-\nabla_x E_\theta(x)$ indicates the direction in which the energy decreases most rapidly. Since lower energy corresponds to higher probability in an EBM, following this negative gradient leads us towards regions of higher probability. If $p_\theta(x)$ approximates $p(x)$ well, we can generate samples that faithfully represent the true data distribution by following this gradient.&lt;/p&gt;

&lt;p&gt;This insight leads us to &lt;a href=&quot;https://www.stats.ox.ac.uk/~teh/research/compstats/WelTeh2011a.pdf&quot;&gt;Langevin dynamics&lt;/a&gt;, a powerful method for sampling from EBMs. Langevin dynamics combines gradient information with random noise to explore the probability landscape defined by the EBM. The basic Langevin dynamics algorithm is as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Step 1&lt;/strong&gt;: Start with an initial point $x_0$, often chosen randomly.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Step 2&lt;/strong&gt;: For $t = 1$ to $T$, iterate:&lt;/li&gt;
&lt;/ul&gt;

\[x_t = x_{t-1} - \frac{\epsilon}{2} \nabla_x E_\theta(x_{t-1}) + \sqrt{\epsilon}  z_t\]

&lt;p&gt;where $z_t \sim \mathcal{N}(0, I)$ is standard Gaussian noise, and $\epsilon$ is a small step size.&lt;/p&gt;

&lt;p&gt;This update rule has two key components:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The gradient term $-\frac{\epsilon}{2} \nabla_x E_\theta(x_{t-1})$ guides the sample towards regions of lower energy (higher probability).&lt;/li&gt;
  &lt;li&gt;The noise term $\sqrt{\epsilon} z_t$ allows for exploration of the probability space, helping to escape local minima and sample from the full distribution.&lt;/li&gt;
&lt;/ol&gt;
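
&lt;p&gt;The update rule above can be sketched in a few lines of plain Python. This is a minimal 1D illustration, assuming a toy energy $E(x) = x^2 / 2$ (so the target density is a standard Gaussian); the function names, step size, and number of steps are hypothetical choices, not taken from the linked notebook.&lt;/p&gt;

```python
import math
import random

def grad_energy(x):
    # Hypothetical toy energy E(x) = x^2 / 2, whose gradient is x.
    # The corresponding density is a standard Gaussian.
    return x

def langevin_sample(grad_fn, x0, step_size, num_steps, rng):
    x = x0
    for _ in range(num_steps):
        noise = rng.gauss(0.0, 1.0)
        # Gradient step towards lower energy plus Gaussian exploration noise.
        x = x - 0.5 * step_size * grad_fn(x) + math.sqrt(step_size) * noise
    return x

rng = random.Random(0)
# Run many independent chains, all started far from the mode at x = 3.
samples = [langevin_sample(grad_energy, 3.0, 0.1, 200, rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

&lt;p&gt;Even though every chain starts at $x_0 = 3$, the sample mean and variance should end up close to 0 and 1. The finite step size introduces a small bias, so the match is approximate rather than exact.&lt;/p&gt;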

&lt;p&gt;Here is a visualization of how Langevin sampling looks in a simple 2D case (&lt;a href=&quot;https://colab.research.google.com/drive/1QAhFDVGXLjErXNrJwWNL8fkif7d771Dr#scrollTo=ocjk5Cm1I9EC&amp;amp;line=22&amp;amp;uniqifier=1&quot;&gt;relevant code&lt;/a&gt;):&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
  &lt;img src=&quot;/assets/images/langevin.gif&quot; width=&quot;450&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;In the visualization above, we start from four different initialization points, and run Langevin sampling in a simple energy landscape with four minima. For more in-depth information about sampling, check out &lt;a href=&quot;https://friedmanroy.github.io/blog/2022/Langevin/&quot;&gt;this great blog post&lt;/a&gt; by Roy Friedman.&lt;/p&gt;

&lt;h2 id=&quot;training-ebms&quot;&gt;Training EBMs&lt;/h2&gt;

&lt;p&gt;Unlike classification or regression where we directly get the target values, generative modeling requires us to capture the essence of a data distribution we can’t directly observe. We never actually get $p(x)$; we only have access to samples from this distribution, i.e., our training data. So how do we train an EBM to learn an energy function that effectively models this unseen distribution? In this section, we’ll explore three approaches to tackle this problem.&lt;/p&gt;

&lt;h3 id=&quot;contrastive-divergence&quot;&gt;Contrastive divergence&lt;/h3&gt;

&lt;p&gt;The first approach for training EBMs we explore is rooted in &lt;a href=&quot;https://richardstartin.github.io/posts/maximum-likelihood-estimation&quot;&gt;maximum likelihood estimation&lt;/a&gt;. In general, MLE aims to maximize the likelihood of the observed data under our model, which is equivalent to minimizing the negative log-likelihood:&lt;/p&gt;

\[J(\theta) = -\mathbb{E}_{x \sim p}[\: \log p_\theta(x) \:]\]

&lt;p&gt;Expanding this using the definition of $p_\theta(x)$ for EBMs:&lt;/p&gt;

\[J(\theta) = \mathbb{E}_{x \sim p}[E_\theta(x)] + \log Z(\theta)\]

&lt;p&gt;Then, the gradient of this loss with respect to $\theta$ is:&lt;/p&gt;

\[\nabla_\theta J(\theta) = \mathbb{E}_{x \sim p}[\nabla_\theta E_\theta(x)] - \mathbb{E}_{x \sim p_\theta}[\nabla_\theta E_\theta(x)]\]

&lt;p&gt;This gradient has an intuitive interpretation:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The first term, $\mathbb{E}_{x \sim p}[\nabla_\theta E_\theta(x)]$, pushes the energy of real data points down.&lt;/li&gt;
  &lt;li&gt;The second term, $-\mathbb{E}_{x \sim p_\theta}[\nabla_\theta E_\theta(x)]$, pulls the energy of model samples up.&lt;/li&gt;
&lt;/ul&gt;
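
&lt;p&gt;For completeness, the model expectation in the second term comes from differentiating $\log Z(\theta)$. A short derivation, assuming the gradient and the integral can be exchanged:&lt;/p&gt;

\[\nabla_\theta \log Z(\theta) = \frac{1}{Z(\theta)} \int \nabla_\theta \exp(-E_\theta(x)) dx = -\int \frac{\exp(-E_\theta(x))}{Z(\theta)} \nabla_\theta E_\theta(x) dx = -\mathbb{E}_{x \sim p_\theta}[\nabla_\theta E_\theta(x)]\]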

&lt;p&gt;To get an intuition for what happens, you may find this visualization of a single MLE step helpful:&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
  &lt;img src=&quot;/assets/images/mle_viz.gif&quot; width=&quot;500&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;MLE training ideally places low energy in regions with data and high energy elsewhere. Unfortunately, MLE faces a significant practical challenge: computing the expectation over the model requires sampling from $p_\theta$ (e.g., using Langevin dynamics from above), which is computationally expensive.&lt;/p&gt;

&lt;p&gt;This is where &lt;a href=&quot;https://www.cs.toronto.edu/~hinton/absps/tr00-004.pdf&quot;&gt;Contrastive Divergence (CD)&lt;/a&gt;, introduced by Geoffrey Hinton, comes into play. CD can be viewed as a practical approximation of MLE, designed to make the training process computationally feasible. CD approximates the MLE gradient by replacing exact samples from the model ($x \sim p_\theta$) with samples obtained by running a few steps of Langevin dynamics starting from data points. This amounts to two simple changes to the Langevin dynamics algorithm we introduced above:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Step 1&lt;/strong&gt;: Use data samples to initialize $x_0$. For example, we can sample random data from the batch.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Step 2&lt;/strong&gt;: Same as the original, except $T$ is usually on the order of tens of steps.&lt;/li&gt;
&lt;/ul&gt;
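
&lt;p&gt;The two steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the code from the linked notebook: the energy is $E_\theta(x) = (x - \theta)^2 / 2$ with a single scalar parameter, and the learning rate, step size, and number of Langevin steps are hypothetical values picked for the example.&lt;/p&gt;

```python
import math
import random

def grad_x_energy(x, theta):
    # Toy energy E_theta(x) = (x - theta)^2 / 2 (an illustrative assumption).
    return x - theta

def grad_theta_energy(x, theta):
    return theta - x

def short_run_langevin(x, theta, step_size, num_steps, rng):
    for _ in range(num_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - 0.5 * step_size * grad_x_energy(x, theta) + math.sqrt(step_size) * noise
    return x

def cd_gradient(data, theta, step_size, num_steps, rng):
    # Negatives start at the data points and take only a few Langevin steps.
    negatives = [short_run_langevin(x, theta, step_size, num_steps, rng) for x in data]
    # Data expectation minus (approximate) model expectation of grad_theta E.
    data_term = sum(grad_theta_energy(x, theta) for x in data) / len(data)
    model_term = sum(grad_theta_energy(x, theta) for x in negatives) / len(negatives)
    return data_term - model_term

rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(200)]
data_mean = sum(data) / len(data)
theta = -1.0
for _ in range(100):
    # Plain gradient descent on the approximate negative log-likelihood.
    theta -= 0.5 * cd_gradient(data, theta, 0.1, 10, rng)
```

&lt;p&gt;In this toy setting, $\theta$ should drift from its initial value of $-1$ towards the mean of the data, since that is where the energies of data points and short-run negatives balance out.&lt;/p&gt;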

&lt;p&gt;Putting all of this together, we can now fit the simple ‘&lt;a href=&quot;https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html&quot;&gt;two moons&lt;/a&gt;’ dataset. Below, you can find a visualization of how the EBM samples evolve during contrastive divergence training (left) and the evolution of the full energy landscape (right). Initially, the samples and energy landscape are random, but as training progresses, the model learns to concentrate the low-energy regions around the two crescents of the dataset. You can play around with the code that produced this visualization &lt;a href=&quot;https://colab.research.google.com/drive/1QAhFDVGXLjErXNrJwWNL8fkif7d771Dr#scrollTo=0vbQt-LzHJD3&amp;amp;line=2&amp;amp;uniqifier=1&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
  &lt;img src=&quot;/assets/images/cd_viz.gif&quot; width=&quot;900&quot; /&gt;
&lt;/p&gt;

&lt;h3 id=&quot;score-matching&quot;&gt;Score matching&lt;/h3&gt;

&lt;p&gt;Even though contrastive divergence alleviates some of the difficulties of sampling from the model during training, it still requires running Langevin dynamics for a number of steps. &lt;a href=&quot;https://jmlr.csail.mit.edu/papers/v6/hyvarinen05a.html&quot;&gt;Score matching&lt;/a&gt; by Aapo Hyvärinen is an alternative method for training EBMs that completely avoids the need for explicit sampling from the model distribution.&lt;/p&gt;

&lt;p&gt;The core idea of score matching is to train the model to match the score function of the data distribution. The score function is defined as the gradient of the log-probability with respect to the input:&lt;/p&gt;

\[\psi(x) = \nabla_x \log p(x)\]

&lt;p&gt;For an EBM with energy function $E_\theta(x)$, the score function is (see &lt;a href=&quot;#sampling-ebms&quot;&gt;Sampling EBMs&lt;/a&gt; for derivation):&lt;/p&gt;

\[\psi_\theta(x) = -\nabla_x E_\theta(x) \]

&lt;p&gt;The score matching objective is to minimize the expected squared difference between the model’s score and the data score:&lt;/p&gt;

\[J(\theta) = \frac{\mathbb{E}_{x \sim p} [\lVert \psi_\theta(x) - \psi(x)\rVert^2]}{2}\]

&lt;p&gt;However, we don’t have access to the true data score $\psi(x)$. Fortunately, Hyvärinen showed that this objective can be reformulated into an equivalent form that only requires samples from the data distribution (proof in the &lt;a href=&quot;#appendix&quot;&gt;Appendix&lt;/a&gt;):&lt;/p&gt;

\[J(\theta) = \mathbb{E}_{x \sim p} [\frac{1}{2} \lVert \psi_\theta(x) \rVert^2 + \text{tr}(\nabla_x \psi_\theta(x))] + \text{const}\]
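
&lt;p&gt;As a minimal sketch of this objective, consider a 1D toy model with energy $E_\theta(x) = (x - \mu)^2 / (2 v)$, where the mean $\mu$ and variance $v$ are the parameters. The model, function names, and data below are assumptions made for illustration. In 1D the trace term reduces to the derivative of the score, so the loss can be evaluated from data samples alone.&lt;/p&gt;

```python
import random

def score(x, mu, var):
    # Model score for the toy energy E(x) = (x - mu)^2 / (2 * var).
    return -(x - mu) / var

def score_derivative(x, mu, var):
    # In 1D the trace of the Jacobian is just the derivative of the score.
    return -1.0 / var

def score_matching_loss(data, mu, var):
    total = 0.0
    for x in data:
        total += 0.5 * score(x, mu, var) ** 2 + score_derivative(x, mu, var)
    return total / len(data)

rng = random.Random(0)
data = [rng.gauss(1.0, 2.0) for _ in range(1000)]
sample_mean = sum(data) / len(data)
sample_var = sum((x - sample_mean) ** 2 for x in data) / len(data)
best = score_matching_loss(data, sample_mean, sample_var)
```

&lt;p&gt;For this particular toy model, the loss is minimized at the sample mean and the (biased) sample variance, so perturbing either parameter increases it. No samples from the model and no value of $Z(\theta)$ are needed at any point.&lt;/p&gt;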

&lt;p&gt;Returning to the ‘two moons’ example, we can now visualize the evolution of the score function during training. The score function represents the direction and magnitude of the steepest increase in log-probability at each point in the input space. As training progresses, we expect the score function to align with the true data distribution, pointing towards areas of high density.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
  &lt;img src=&quot;/assets/images/score_viz.gif&quot; width=&quot;700&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;As we can observe in the visualization, the samples (left) evolve from scattered points to the characteristic two-moon shape, while the score field (right) progressively aligns towards areas of high data density. This illustrates how score matching effectively captures the data distribution without explicit sampling. You can experiment with the code &lt;a href=&quot;https://colab.research.google.com/drive/1QAhFDVGXLjErXNrJwWNL8fkif7d771Dr#scrollTo=QOLWh7ho39tT&amp;amp;line=5&amp;amp;uniqifier=1&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;noise-contrastive-estimation-nce&quot;&gt;Noise contrastive estimation (NCE)&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://proceedings.mlr.press/v9/gutmann10a/gutmann10a.pdf&quot;&gt;Noise Contrastive Estimation&lt;/a&gt; (NCE), introduced by Gutmann and Hyvärinen in 2010, offers another approach to training EBMs. The fundamental concept behind NCE is to train the model to distinguish between samples from the data distribution and samples from a known noise distribution.&lt;/p&gt;

&lt;p&gt;In NCE, we introduce a noise distribution $p_n$ (often Gaussian or uniform) alongside our data distribution. We then create a binary classification problem where the model learns to distinguish between data samples (label 1) and noise samples (label 0). The NCE objective is to maximize:&lt;/p&gt;

\[J(\theta) = \mathbb{E}_{x \sim p}[\log h(x; \theta)] + \mathbb{E}_{x \sim p_n}[\log (1 - h(x; \theta))]\]

&lt;p&gt;where $h(x; \theta)$ is the probability that $x$ comes from the data distribution rather than the noise distribution:&lt;/p&gt;

\[h(x; \theta) = \frac{p_\theta(x)}{p_\theta(x) + p_n(x)}\]

&lt;p&gt;The NCE loss function has a simple intuitive interpretation:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Data samples&lt;/strong&gt;: The first term, $\mathbb{E}_{x \sim p}[\log h(x; \theta)]$, encourages the model to assign high probabilities to real data points.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Noise samples&lt;/strong&gt;: The second term, $\mathbb{E}_{x \sim p_n}[\log (1 - h(x; \theta))]$, encourages the model to assign low probabilities to noise samples.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As training progresses, $p_\theta(x)$ should become large for real data points and small for noise samples, shaping the energy landscape to match the true data distribution.&lt;/p&gt;

&lt;p&gt;For an EBM, we face a challenge: we want to compute $p_\theta(x)$, but we don’t know the normalization constant $Z(\theta)$. A key insight is that we can treat $\log Z(\theta)$ as a parameter to be learned, avoiding the need to compute it explicitly. We can re-parameterize our model by introducing a new parameter $c = -\log Z(\theta)$. This allows us to write:  $p_\theta(x) = \exp(-E_\theta(x) + c)$.  Now, instead of learning $\theta$ alone, we learn both $\theta$ and $c$. This approach is viable in NCE because the objective function’s structure prevents trivial solutions where $c$ could grow arbitrarily large, unlike in maximum likelihood estimation.&lt;/p&gt;
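
&lt;p&gt;Here is a minimal sketch of how $h(x; \theta)$ and the NCE objective might be computed with a learnable $c$. It assumes a toy energy $E_\theta(x) = (x - \theta)^2 / 2$ and a zero-mean Gaussian noise distribution; all function names are hypothetical. Writing $h$ as a sigmoid of the log-ratio $\log p_\theta(x) - \log p_n(x)$ avoids forming the two densities explicitly.&lt;/p&gt;

```python
import math

def log_model(x, theta, c):
    # Unnormalized log-density -E_theta(x) plus the learned offset c.
    # The toy energy E_theta(x) = (x - theta)^2 / 2 is an illustrative assumption.
    return -0.5 * (x - theta) ** 2 + c

def log_noise(x, noise_std):
    # Log-density of a zero-mean Gaussian noise distribution.
    return -0.5 * (x / noise_std) ** 2 - math.log(noise_std * math.sqrt(2.0 * math.pi))

def posterior_data(x, theta, c, noise_std):
    # h(x) = p_theta / (p_theta + p_n), written as a sigmoid of the log-ratio.
    logit = log_model(x, theta, c) - log_noise(x, noise_std)
    return 1.0 / (1.0 + math.exp(-logit))

def nce_objective(data, noise, theta, c, noise_std):
    # Average log-probability of classifying data as data and noise as noise.
    data_term = sum(math.log(posterior_data(x, theta, c, noise_std)) for x in data) / len(data)
    noise_term = sum(math.log(1.0 - posterior_data(x, theta, c, noise_std)) for x in noise) / len(noise)
    return data_term + noise_term

# Sanity check: if the model equals the noise distribution, h is 1/2 everywhere.
true_c = -math.log(math.sqrt(2.0 * math.pi))
h = posterior_data(0.3, 0.0, true_c, 1.0)
objective = nce_objective([0.1, -0.4], [0.7, 1.2], 0.0, true_c, 1.0)
```

&lt;p&gt;In practice, both $\theta$ and $c$ would be updated by gradient ascent on this objective; a numerically careful implementation would also guard the logarithms and the exponential against overflow.&lt;/p&gt;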

&lt;p&gt;To illustrate NCE in action, let’s revisit how the energy landscape evolves during training in the ‘two moons’ dataset:&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
  &lt;img src=&quot;/assets/images/nce_viz.gif&quot; width=&quot;450&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Once again, as training progresses, the model learns to concentrate the low-energy regions around the two crescents of the dataset. You can play with the code &lt;a href=&quot;https://colab.research.google.com/drive/1QAhFDVGXLjErXNrJwWNL8fkif7d771Dr#scrollTo=eTAZant-h6U4&amp;amp;line=126&amp;amp;uniqifier=1&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;ebm-problems-and-practical-tips&quot;&gt;EBM problems and practical tips&lt;/h2&gt;

&lt;p&gt;While the examples above make it look like training EBMs is a piece of cake, the reality is far more challenging. Scaling EBMs to complex, high-dimensional problems often requires a toolbox of sophisticated tricks. Some common issues include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Poor negatives&lt;/strong&gt;: You should regularly check the negatives used during training. For NCE, the noise should be “close” to data. For CD, you should make sure sampling produces reasonable examples as training evolves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unregularized energy functions&lt;/strong&gt;: It can be useful to incorporate some &lt;a href=&quot;https://arxiv.org/abs/1903.08689&quot;&gt;L1 / L2 regularization&lt;/a&gt; to make the training stable. At the very least, it’s nice if the energy function is bounded so that it can converge. Otherwise, the optimization can just keep increasing the difference between the data and sample energy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-smooth energy functions&lt;/strong&gt;: When using gradient-based sampling methods like Langevin dynamics with deep neural nets, it can be helpful to implement gradient clipping or forms of &lt;a href=&quot;https://arxiv.org/abs/1705.10941&quot;&gt;spectral regularization&lt;/a&gt;. This prevents excessively large updates that can cause the sampling process to diverge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slow computation / memory problems&lt;/strong&gt;: Especially in high-dimensional spaces, sampling becomes a lot more computationally demanding. It is worth considering if the data can be projected to a lower-dimensional space. Alternatively, I’d consider using sampling-free methods (like &lt;a href=&quot;https://arxiv.org/abs/1905.07088&quot;&gt;sliced score matching&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Just learn the score function&lt;/strong&gt;: This is a fun lesson from my own experience with score matching. I found that if you do not need the energy values, it can be easier to learn a function that directly outputs the scores (the negative gradient of the energy).&lt;/p&gt;

&lt;h3 id=&quot;conclusion-and-further-reading&quot;&gt;Conclusion and further reading&lt;/h3&gt;

&lt;p&gt;Hope you enjoyed this deep dive into the world of Energy-Based Models!&lt;/p&gt;

&lt;p&gt;Remember Bryan Johnson’s tweet that kicked off this whole exploration? Now we have the tools necessary to understand what Extropic proposes. &lt;a href=&quot;https://www.extropic.ai/future&quot;&gt;Extropic’s idea&lt;/a&gt; seems to be to create physical systems where $E_\theta(x)$ is directly encoded in hardware. Instead of simulating Langevin dynamics on a digital computer, they’re proposing to build circuits where these dynamics occur naturally. I don’t know enough about hardware to know whether this is viable, but it would be extremely exciting if it works. Better sampling from complex probability distributions could revolutionize how we approach probabilistic AI algorithms, especially those involving EBMs.&lt;/p&gt;

&lt;p&gt;For those looking to dive deeper into EBMs and related topics, here are some recommended resources:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2101.03288&quot;&gt;How to Train Your Energy-Based Models&lt;/a&gt; - a very nice introduction to EBMs by Song and Kingma, with a similar structure and notation to this blog post.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://yann.lecun.com/exdb/publis/pdf/lecun-06.pdf&quot;&gt;A Tutorial on Energy-Based Learning&lt;/a&gt; - a comprehensive introduction to EBMs by LeCun et al.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1905.07088&quot;&gt;Sliced Score Matching: A Scalable Approach to Density and Score Estimation&lt;/a&gt; - more advanced score matching technique by Song et al.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1907.05600&quot;&gt;Generative Modeling by Estimating Gradients of the Data Distribution&lt;/a&gt; - another cool score matching extension by Song and Ermon.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://proceedings.neurips.cc/paper_files/paper/2019/file/378a063b8fdb1db941e34f4bde584c7d-Paper.pdf&quot;&gt;Implicit Generation and Modeling with Energy-Based Models&lt;/a&gt; - tricks to get contrastive divergence to scale by Du and Mordatch.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/pdf/2109.00137&quot;&gt;Implicit Behavioral Cloning&lt;/a&gt; - EBMs applied to learning policies in robotics by Florence et al.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2004.13167&quot;&gt;Energy-based models for atomic-resolution protein conformations&lt;/a&gt; - EBMs applied to proteins by Du et al.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/yataobian/awesome-ebm&quot;&gt;Collection of awesome EBM papers&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;appendix&quot;&gt;Appendix&lt;/h2&gt;

&lt;h3 id=&quot;proof-of-score-function-reformulation&quot;&gt;Proof of score function reformulation&lt;/h3&gt;

&lt;p&gt;Let’s start with the original score matching objective:&lt;/p&gt;

\[J(\theta) = \frac{1}{2} \mathbb{E}_{x \sim p} [\lVert \psi_\theta(x) - \psi(x) \rVert^2]\]

&lt;p&gt;Where $\psi_\theta(x) = -\nabla_x E_\theta(x)$ is the model’s score function and $\psi(x) = \nabla_x \log p(x)$ is the true data score function.&lt;/p&gt;

&lt;p&gt;We’ll prove that this is equivalent to:&lt;/p&gt;

\[J(\theta) = \mathbb{E}_{x \sim p} [\frac{1}{2} \lVert \psi_\theta(x) \rVert^2 + \text{tr}(\nabla_x \psi_\theta(x))] + \text{const}\]

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Expand the squared norm in the original objective:&lt;/p&gt;

\[J(\theta) = \frac{1}{2} \mathbb{E}_{x \sim p} [\lVert \psi_\theta(x) \rVert^2 - 2\psi_\theta(x)^T\psi(x) + \lVert \psi(x) \rVert^2]\]

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Focus on the middle term:&lt;/p&gt;

\[-\mathbb{E}_{x \sim p} [\psi_\theta(x)^T\psi(x)]\]

&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Definition of expectation&lt;/p&gt;

\[ -\int p(x) \psi_\theta(x)^T\psi(x) dx\]

&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; Substitute &lt;strong&gt;$\psi(x) = \nabla_x \log p(x)$&lt;/strong&gt;&lt;/p&gt;

\[-\int p(x) \psi_\theta(x)^T\nabla_x \log p(x) dx \]

&lt;p&gt;&lt;strong&gt;Step 5&lt;/strong&gt;: &lt;a href=&quot;https://andrewcharlesjones.github.io/journal/log-derivative.html&quot;&gt;Log derivative trick&lt;/a&gt; (just applying the chain rule to the log probability)&lt;/p&gt;

\[-\int \psi_\theta(x)^T\nabla_x p(x) dx\]

&lt;p&gt;&lt;strong&gt;Step 6:&lt;/strong&gt; Integration by parts&lt;/p&gt;

&lt;p&gt;Recall the general form of integration by parts:&lt;/p&gt;

\[\int u \frac{dv}{dx} dx = uv - \int v \frac{du}{dx} dx\]

&lt;p&gt;In our case:&lt;/p&gt;

\[\begin{align*}u &amp;amp;= \psi_\theta(x)^T \\\frac{dv}{dx} &amp;amp;= \nabla_x p(x)\end{align*}\]

&lt;p&gt;Now, we can apply integration by parts:&lt;/p&gt;

\[\begin{align}
-\int \psi_\theta(x)^T\nabla_x p(x) dx &amp;amp;= -[\psi_\theta(x)^T p(x)]_{-\infty}^{\infty} + \int p(x) \, \text{tr}(\nabla_x \psi_\theta(x)) dx \\
&amp;amp;= \int p(x) \, \text{tr}(\nabla_x \psi_\theta(x)) dx \\
&amp;amp;= \mathbb{E}_{x \sim p} [\text{tr}(\nabla_x \psi_\theta(x))]
\end{align}\]

&lt;p&gt;Note how the boundary term $[\psi_\theta(x)^T p(x)]_{-\infty}^{\infty}$ vanished. This relies on $p(x) \psi_\theta(x) \to 0$ as $\lVert x \rVert \to \infty$, a weak regularity assumption made in the &lt;a href=&quot;https://jmlr.csail.mit.edu/papers/v6/hyvarinen05a.html&quot;&gt;original paper&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7:&lt;/strong&gt; Combining all the terms&lt;/p&gt;

\[J(\theta) = \mathbb{E}_{x \sim p} [\frac{1}{2} \lVert \psi_\theta(x) \rVert^2 + \text{tr}(\nabla_x \psi_\theta(x))] + \frac{1}{2}\mathbb{E}_{x \sim p} [\lVert \psi(x) \rVert^2]\]

&lt;p&gt;The last term doesn’t depend on $\theta$, so it’s constant with respect to the optimization. This completes the proof.&lt;/p&gt;
</description>
        <pubDate>Sun, 28 Jul 2024 00:00:00 +0000</pubDate>
        <link>https://mpmisko.github.io/2024/ai-fundamentals-energy-based-models/</link>
        <guid isPermaLink="true">https://mpmisko.github.io/2024/ai-fundamentals-energy-based-models/</guid>
        
        
      </item>
    
      <item>
        <title>WTF happened to blogs</title>
        <description>&lt;p&gt;Remember when blogs were raw, unfiltered windows into someone’s mind?&lt;/p&gt;

&lt;p&gt;Yeah, those days are long gone.&lt;/p&gt;

&lt;p&gt;Now we’re drowning in an ocean of SEO-optimized garbage. Every “blog post” reads like it was written by the same soulless AI, regurgitating keywords to climb Google’s rankings.&lt;/p&gt;

&lt;p&gt;What we used to write in a few sentences now gets dragged out over many paragraphs so that you scroll through &lt;span style=&quot;color:black;font-weight:700;&quot;&gt;16 banner ads&lt;/span&gt; (I am talking about you, WebMD).&lt;/p&gt;

&lt;p&gt;It’s a tragedy, really.&lt;/p&gt;

&lt;h2 id=&quot;the-seo-game&quot;&gt;The SEO game&lt;/h2&gt;

&lt;p&gt;What happened? Simple:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Google became the gatekeeper of the internet&lt;/li&gt;
  &lt;li&gt;Marketers realized they could game the system&lt;/li&gt;
  &lt;li&gt;Content farms sprouted like weeds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, instead of passionate creators sharing their thoughts, we have armies of underpaid writers churning out “10 Ways to Boost Your Productivity (You Won’t Believe #7!)”&lt;/p&gt;

&lt;h2 id=&quot;the-death-of-authenticity&quot;&gt;The death of authenticity&lt;/h2&gt;
&lt;p&gt;High-quality blogs exist, but they are impossible to find in the sea of “personal” blogs trying to sell you something.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Affiliate links? Check.&lt;/li&gt;
  &lt;li&gt;Pop-up newsletter signup? Check.&lt;/li&gt;
  &lt;li&gt;Carefully curated personal brand? Double check.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s exhausting.&lt;/p&gt;

&lt;p&gt;Want to stand out in this wasteland of mediocrity? Here’s a radical idea: Write like a human being.&lt;/p&gt;

&lt;p&gt;Will you top Google’s search results? Probably not.&lt;/p&gt;

&lt;p&gt;But you might just create something worth reading. That is more impactful than $0.01 per impression.&lt;/p&gt;

&lt;p&gt;Some of my personal favorites are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.julian.com/&quot;&gt;Julian Shapiro’s guides&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.paulgraham.com/articles.html&quot;&gt;Paul Graham’s classics&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://lilianweng.github.io/&quot;&gt;Lilian Weng’s ML notes&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://pmarchive.com/&quot;&gt;pmarca archive&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Fri, 12 Jul 2024 00:00:00 +0000</pubDate>
        <link>https://mpmisko.github.io/2024/wtf-happened-to-blogs/</link>
        <guid isPermaLink="true">https://mpmisko.github.io/2024/wtf-happened-to-blogs/</guid>
        
        
      </item>
    
      <item>
        <title>How to write a good personal statement for UK universities</title>
        <description>&lt;p&gt;In the last five years, I have reviewed more than twenty personal statements that high school students wrote for admission to UK universities. There were many overlapping issues that I thought would be worth aggregating. A small caveat is that I reviewed engineering and natural sciences statements, so take my tips with a grain of salt if you are applying for social sciences.&lt;/p&gt;

&lt;p&gt;&lt;span style=&quot;color:black;font-weight:700;&quot;&gt; Focus on academics. &lt;/span&gt; Degrees in the UK are highly specialized, so you have to demonstrate that you can dive deep into your subject of interest. During my time at Imperial College, I barely managed to take two non-CS courses because I was overwhelmed with a high CS workload. This education model is not for everyone, but if it is for you, make it clear in your statement. I would suggest focusing 80% of your statement on why you want to pursue your subject. Off-topic extra-curriculars should be treated as a nice addition, but should not be the core of your text.&lt;/p&gt;

&lt;p&gt;&lt;span style=&quot;color:black;font-weight:700;&quot;&gt; Demonstrate passion through action. &lt;/span&gt; Do not say that you like your subject; demonstrate it through what you have done. Did you prove a nice theorem for a competition? Great, write about it. Did you write a compiler for a toy language? Awesome, include it. Did you make a robot follow a line? Sweet, put it in. You get the idea: the more you can demonstrate your passion via projects, reading, competitions, etc., the better. Mention some extensions of these projects that the university will allow you to &lt;em&gt;finally&lt;/em&gt; learn about.&lt;/p&gt;

&lt;p&gt;&lt;span style=&quot;color:black;font-weight:700;&quot;&gt; Polish your English writing. &lt;/span&gt; Chances are that your native language’s writing style leaks into your English texts. This problem goes beyond using the right grammar and vocabulary. Even if you are a native English speaker, make sure that your statement loosely follows an academic English writing style. Academics (i.e., the people that will read your statements) enjoy reading concise, clear, and structured texts. There are many resources on this topic online, but here are a few links that seem useful: &lt;a href=&quot;https://www.scribbr.com/research-paper/topic-sentences/&quot;&gt;topic sentences&lt;/a&gt;, &lt;a href=&quot;https://www.masterclass.com/articles/how-to-write-a-perfect-paragraph#5-tips-for-structuring-and-writing-better-paragraphs&quot;&gt;structuring paragraphs&lt;/a&gt;, &lt;a href=&quot;https://www.awelu.lu.se/writing/writing-stage/structuring-the-text/&quot;&gt;academic writing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span style=&quot;color:black;font-weight:700;&quot;&gt; Write a coherent story. &lt;/span&gt; People tend to write paragraphs that do not link well together. Each paragraph would be a self-contained story with an introduction and a conclusion, making the personal statement similar to an extended CV. Good statements take people on an intellectual journey which makes sense both on a paragraph-level and as a whole. Try to find some thread that you can use to link your paragraphs and make them flow.&lt;/p&gt;

&lt;p&gt;&lt;span style=&quot;color:black;font-weight:700;&quot;&gt; Avoid clichés. &lt;/span&gt; While writing your statement you should ask yourself the question: &lt;em&gt;Can anyone else write these exact sentences?&lt;/em&gt; If the answer is yes, you are probably doing it wrong. This is an exaggeration, but the more unique (while not too weird) your text is, the better you stand out among the thousands of applicants that are competing for a place.&lt;/p&gt;

&lt;p&gt;&lt;span style=&quot;color:black;font-weight:700;&quot;&gt; Strong opening and closing. &lt;/span&gt; The admissions officer who reads your statement will likely have read a few hundred others before reading yours. It is important to have memorable opening and concluding paragraphs to “wake” them up. Coming up with these paragraphs often brings on &lt;a href=&quot;https://en.wikipedia.org/wiki/Writer%27s_block&quot;&gt;writer’s block&lt;/a&gt;, so do not spend too much energy on them up front. Quickly write down a few ideas and iterate as you write the rest of the personal statement. The right ideas will come to you as days pass.&lt;/p&gt;

&lt;p&gt;&lt;span style=&quot;color:black;font-weight:700;&quot;&gt; Why did you choose the university? &lt;/span&gt; Being specific about why you like certain universities goes a long way. Everyone applies to the University of Cambridge because it is a top institution, which means that it is a &lt;em&gt;bad&lt;/em&gt; reason to write about in &lt;em&gt;your&lt;/em&gt; statement. Are there particular courses that you like? Is there research that you would like to explore? Is the department you are applying to particularly good at something? Specificity shows that you are mature enough to consider reasons beyond rankings.&lt;/p&gt;

&lt;p&gt;I hope these tips help. You should also check out the following resources for writing personal statements: &lt;a href=&quot;https://www.ucas.com/undergraduate/applying-university/writing-personal-statement/how-write-personal-statement&quot;&gt;UCAS tips&lt;/a&gt;, &lt;a href=&quot;https://bridge-u.com/blog/how-to-write-a-personal-statement/&quot;&gt;BridgeU&lt;/a&gt;, &lt;a href=&quot;https://www.britishcouncil.org/voices-magazine/how-write-personal-statement-uk-university&quot;&gt;British Council&lt;/a&gt;. Good luck with your application!&lt;/p&gt;
</description>
        <pubDate>Mon, 02 May 2022 00:00:00 +0000</pubDate>
        <link>https://mpmisko.github.io/2022/How-to-write-a-good-personal-statement-for-UK-universities/</link>
        <guid isPermaLink="true">https://mpmisko.github.io/2022/How-to-write-a-good-personal-statement-for-UK-universities/</guid>
        
        
      </item>
    
  </channel>
</rss>
