Rényi Entropy
Many people are familiar with the standard definition of entropy as a probability-weighted sum of logarithms. But there are other definitions that extend from alternative definitions of averages.
Say we have a random variable \(X\) that can take on a set of values \(\{X_i\}\) each of which has a probability \(p(X_i)\) of occurring. The amount of information [1] it takes to select \(X_i\) from the space of possible values is \(I_i = -\log p(X_i)\) and the average amount of information across all possible values of the random variable is
$$ S_{\text{Shannon}} = - k_2\sum_{i} p(X_i) \log p(X_i). \qquad (1) $$where \(k_2 = 1/\log 2\). Eq.(1) is Shannon’s definition of entropy (or information) and it is the entropy definition that people are most familiar with. But there are alternative definitions of entropy that follow from alternative definitions of averages.
In 1961, Alfred Rényi wanted to explore these possible alternatives. To do so, he first generalized the definition of the average as follows. The average of a quantity \({\mathcal O}(X)\) that is a function of a random variable is typically expressed as
$$ \overline{{\mathcal O}(X)} = \sum_{i} p(X_i) {\mathcal O}(X_i), \qquad (2) $$but this can be generalized by defining an invertible function \(f(x)\) which yields the new average
$$ \overline{{\mathcal O}(X)}_{f} = f^{-1}\left(\sum_{i} p(X_i) f\big({\mathcal O}(X_i)\big)\right). \qquad (3) $$That is, for this “generalized average” we make the desired observable an argument of a function \(f\), compute the probability-weighted sum, and then apply the inverse \(f^{-1}\) to the result. When \(f(x) = x\) we have \(f^{-1}(x) = x\) and we simply recover the regular average of the observable \({\mathcal O}\); with \({\mathcal O}(X) = - \log p(X)\), we obtain Eq. (1) (up to the constant \(k_2\)).
But what about other choices for \(f(x)\)? This is the question Rényi asked. He noted that one could take \(f(x) = \log (x)\), \(f(x) = x^2\), and \(f(x) = 1/x\), and obtain, respectively, the geometric mean, the root-mean-square, and the harmonic mean of a quantity (this is easy to check; see the sketch below), but when he tried to use these other averages to compute an average information, the results did not obey the additive property of information.
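Here is a minimal numpy sketch of the generalized average in Eq.(3) that verifies the geometric, root-mean-square, and harmonic cases; the helper name f_mean and the sample numbers are just for illustration:

import numpy as np

def f_mean(values, probs, f, f_inv):
    # Generalized average of Eq.(3): f^{-1}( sum_i p_i f(x_i) )
    return f_inv(np.dot(probs, f(values)))

x = np.array([1.0, 2.0, 4.0])
p = np.array([0.2, 0.5, 0.3])

print(f_mean(x, p, lambda v: v, lambda v: v))          # ordinary (arithmetic) average
print(f_mean(x, p, np.log, np.exp))                    # geometric mean
print(f_mean(x, p, np.square, np.sqrt))                # root-mean-square
print(f_mean(x, p, lambda v: 1.0/v, lambda v: 1.0/v))  # harmonic mean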
The additive property of information asserts that the average information of a larger system composed of two independent systems is the sum of the averages of each system alone, i.e., \(S_{A\otimes B} = S_{A} + S_B\). None of the above exotic choices for \(f(x)\) yields an entropy that satisfies this property. However, Rényi found one function that did: \(f(x) = e^{-(\alpha-1) x}\) for a general \(\alpha\). With this function, the average of \({\mathcal O}(X)\) becomes
$$ \overline{{\mathcal O}(X)}_{f} =- \frac{1}{\alpha-1} \log\left(\sum_{i} p(X_i) \exp\left(-(\alpha-1){\mathcal O}(X_i)\right)\right). \qquad (4) $$and the average information is
$$\begin{aligned} S_{f} =- \frac{1}{\alpha-1} \log\left(\sum_{i} p(X_i) \exp\left( (\alpha-1)\log p(X_i)\right)\right) = - \frac{1}{\alpha-1} \log\left(\sum_{i} p(X_i)^{\alpha}\right) \equiv H_{\alpha}[p]. \qquad (5) \end{aligned}$$which is known now as the Rényi entropy. See “Perspective on physical interpretations of Rényi entropy in statistical mechanics” for a proof of Eq.(5)’s consistency with the additive property.
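As a quick numerical sanity check of the additive property (not a proof), here is a minimal numpy sketch of Eq.(5); the function name and the example distributions are just illustrative:

import numpy as np

def renyi_entropy(p, alpha):
    # Eq.(5): H_alpha[p] = log(sum_i p_i^alpha) / (1 - alpha), in nats
    p = np.asarray(p, dtype=float)
    if np.isclose(alpha, 1.0):
        return -np.sum(p * np.log(p))   # Shannon limit
    return np.log(np.sum(p**alpha)) / (1.0 - alpha)

p_A = np.array([0.7, 0.3])
p_B = np.array([0.2, 0.5, 0.3])
p_AB = np.outer(p_A, p_B).ravel()       # joint distribution of two independent systems

for alpha in [0.5, 2.0, 5.0]:
    assert np.isclose(renyi_entropy(p_AB, alpha),
                      renyi_entropy(p_A, alpha) + renyi_entropy(p_B, alpha))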
Using L’Hopital’s rule, we find that the \(\alpha\to 1\) limit of Eq.(5) leads to the classic \(p \log p\) entropy expression. So there’s a mathematical sense that this new entropy obeys the correct properties. But to get a better qualitative sense of how this entropy differs from the standard one, we can plot it for a two state system for various values of \(\alpha\). In this two-state case the Rényi entropy is
$$ H_{\alpha}(q) = \frac{1}{1-\alpha} \log \left(q^{\alpha} + (1-q)^{\alpha}\right). \qquad (6) $$Plotting this function for \(\alpha=0.5\), \(\alpha=1.0\), \(\alpha = 2.0\), and \(\alpha =100\), we find

Figure 1: Two-state Rényi entropy as a function of probability. For all but the most extreme \(\alpha\), the slope of the entropy is zero at \(q=0.5\), but has a larger or smaller magnitude relative to the \(\alpha=1\) case depending on the value of \(\alpha\) and \(q\). For \(\alpha>1\), the "steepness" of the entropy curve is less than that for the \(\alpha=1\) case when \(q\simeq 0\) or \(q \simeq 1\), and for \(\alpha < 1\), the steepness of the entropy curve is greater than that for the \(\alpha=1\) case when \(q \simeq 0 \) and when \(q \simeq 1\).
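If you want to reproduce Figure 1, a minimal matplotlib sketch of the two-state Rényi entropy (with the \(\alpha = 1\) case handled as the Shannon limit) looks like this:

import numpy as np
import matplotlib.pyplot as plt

def H_two_state(q, alpha):
    # Two-state Rényi entropy; alpha = 1 falls back to the Shannon form
    if np.isclose(alpha, 1.0):
        return -(q*np.log(q) + (1 - q)*np.log(1 - q))
    return np.log(q**alpha + (1 - q)**alpha) / (1.0 - alpha)

q = np.linspace(1e-4, 1 - 1e-4, 500)
for alpha in [0.5, 1.0, 2.0, 100.0]:
    plt.plot(q, H_two_state(q, alpha), label=f"alpha = {alpha}")
plt.xlabel("q")
plt.ylabel("H_alpha(q)")
plt.legend()
plt.show()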
We see that the choice of \(\alpha\) results in either a decrease or an increase in the \(\partial_q H_{\alpha}(q)\) values relative to their \(\alpha=1\) values. Recognizing that such derivatives occur in deep learning model training naturally leads us to the main question of this discussion:
Question: How does using the above entropy in deep learning training problems affect the progression of training?
Given the properties of the Rényi entropy relative to regular entropy as shown in Figure 1, we can make the following hypothesis.
Hypothesis: The sharper dependence of this entropy on \(p\) should lead to different model convergence properties when training a neural network. The factor \(p^{\alpha}\) for \(\alpha>1\) suppresses low probabilities, so the loss landscape would likely push the weights toward configurations that assign higher probabilities to correct predictions. This means the model would be better at high-precision predictions.
Likelihood and Divergence
Entropy might not seem to have anything to do with machine learning, but it is actually embedded in how deep learning models are trained. To understand how, we will have to build up some formalism, moving from likelihoods to divergences and ultimately ending at what is known as the categorical cross entropy.
Consider the traditional two-label classification problem. Say we have a set of labels \(\{y^{(i)}\}\) and a corresponding set of feature variables \(\{x^{(i)}\}\). The labels can be \(1\) or \(0\), and we take \(q(x)\) to be the probability that feature variable \(x\) leads to a label of \(1\). Then the net likelihood of obtaining the data \(\{y^{(i)}\}\) given the feature values \(\{x^{(i)}\}\) and the model is
$$ \text{Likelihood} = \prod_{i=1}^N q(x^{(i)})^{y^{(i)}}\left(1-q(x^{(i)})\right)^{1-y^{(i)}}. \qquad (8) $$When training an ML model, we aim to maximize this likelihood so that, given a particular parameter space for a model, we are selecting the parameters that are most likely to lead to the data. And given that maximizing a quantity is the same as maximizing a monotonically increasing function (such as the logarithm) of that quantity, we can rephrase the machine learning optimization task as one that seeks to minimize the “negative log likelihood”
$$ -\log \text{Likelihood} = - \sum_{i=1}^N \left(y^{(i)} \log q(x^{(i)}) + (1- y^{(i)})\log(1- q(x^{(i)}))\right). \qquad (9) $$This logarithm is reminiscent of the classic entropy expression \(p\log p\); however, it seems to be defined by two distributions of values, \(\{y^{(i)}\}\) and \(q(x^{(i)})\).
This is because, formally, Eq. (9) measures the distance between the target distribution of labels \(\{y^{(i)}\}\) and the predicted distribution of labels \(q(x)\). Generalizing this expression for arbitrary target probability distribution \(p(X)\) and a prediction probability distribution \(q(X)\), we have the Kullback-Leibler (KL) divergence
$$ D_{\text{KL}}\left(p||q\right) = \sum_X p(X) \log \frac{p(X)}{q(X)} \qquad (10) $$which differs from Eq.(9) by an additive constant. Conceptually, the KL divergence represents the difference between the average amounts of information contained in the distribution \(q(X)\) and \(p(X)\) when \(X\) values are distributed according to \(p(X)\). The closer this average gets to zero, the closer the prediction distribution \(q(X)\) gets to the true distribution \(p(X)\).
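For a concrete (made-up) example of Eq.(10), a minimal numpy sketch:

import numpy as np

def kl_divergence(p, q):
    # Eq.(10): D_KL(p||q) = sum_X p(X) log(p(X)/q(X)); terms with p(X) = 0 contribute nothing
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.3, 0.2])   # "true" distribution
q = np.array([0.4, 0.4, 0.2])   # "predicted" distribution
print(kl_divergence(p, q))      # positive
print(kl_divergence(p, p))      # zero when the prediction matches the target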
The KL divergence is important in ML model training because of its connection to Eq.(8). Optimizing the likelihood is equivalent to optimizing the KL divergence and thus we can frame ML problems in terms of the KL optimization.
In Eq.(9) we used only two classes but we can write this result more generally. Define the set of all labels as \(C\) and take \(\lambda\) to stand for an arbitrary label in this set. Take \(\textbf{x}^{(i)}\) to be the feature vector for data point \(i\) and \(y^{(i)} \in C\) to be the corresponding label. Since the feature vectors are associated with particular labels, the probability that any particular vector yields a label is either \(1\) or \(0\), and can be written as
$$ p(\textbf{x}^{(i)}) = \delta(\lambda, y^{(i)}), \qquad (11) $$for Kronecker delta \(\delta(A, B)\). This “probability” replaces \(p(X)\) within Eq.(10). For the ML model we write \(q(X)\) as \(q(\lambda| \textbf{x}^{(i)})\), the probability that feature vector \(\textbf{x}^{(i)}\) is associated with label \(\lambda\) [2]. Now, we can use Eq.(10) to compute the divergence of the predicted distribution from the true “distribution” for a single data point \(i\):
$$ D_{\text{KL}}^{(i)} = - \sum_{\lambda \in C}\delta(\lambda, y^{(i)}) \log q(\lambda|\textbf{x}^{(i)}). \qquad (12) $$Averaging this single-data-point divergence over all data points gives us
$$ {\mathcal L}_\text{CE} \equiv \frac{1}{N}\sum_{i=1}^N D_{\text{KL}}^{(i)} = - \frac{1}{N} \sum_{i=1}^N\sum_{\lambda \in C}\delta(\lambda, y^{(i)}) \log q(\lambda|\textbf{x}^{(i)}). \qquad \qquad (13) $$which is known as the categorical cross-entropy loss.
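As a concrete reference point, here is a minimal numpy sketch of Eq.(13) for one-hot targets; the array names are illustrative and a small constant guards against \(\log 0\):

import numpy as np

def categorical_cross_entropy(y_onehot, q_pred, eps=1e-12):
    # Eq.(13): rows index data points i, columns index labels lambda;
    # y_onehot[i, lam] = delta(lam, y_i) and q_pred[i, lam] = q(lam | x_i)
    N = y_onehot.shape[0]
    return -np.sum(y_onehot * np.log(q_pred + eps)) / N

y = np.eye(3)[[0, 2, 1, 0]]       # one-hot labels for four data points, three classes
q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6],
              [0.2, 0.5, 0.3],
              [0.6, 0.3, 0.1]])
print(categorical_cross_entropy(y, q))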
Rényi Cross Entropy
Now, we would like to find the Rényi categorical cross entropy. To do so we will proceed in three steps. We will write the standard entropy and the KL divergence in terms of averages. We will then mimic the generalization of the standard entropy to the Rényi entropy in order to find the Rényi version of the KL divergence. Finally, we will make replacements similar to the ones above to go from the Rényi KL divergence to a Rényi categorical cross entropy.
We define the average of a function \( f_X \) of random variable \( X \) with probability distribution \( p_X \) as
$$ \underset{X \sim p_X}{\mathbb{E}} \, f_X \equiv \sum_{X} p(X) f(X). $$Thus the Rényi entropy can be written as
$$ H_{\alpha}[p] = \frac{1}{1-\alpha} \log \underset{X \sim p_X}{\mathbb{E}} p_{X}^{\alpha-1}, \qquad (14) $$and the Shannon entropy (modulo \(\log 2\)) is
$$ H_{1}[p] = -\underset{X \sim p_X}{\mathbb{E}} \log p_{X}. \qquad (15) $$The regular KL divergence is
$$ D_{\text{KL}}(p||q) = \underset{X \sim p_X}{\mathbb{E}} \log \frac{p_X}{q_X}. \qquad (16) $$Comparing (16) and (15), we see that we move from an entropy to a divergence by taking \( p_X \to q_X/p_X \) (or \( p_X \to p_X/q_X \) with a sign change in front of the logarithm). Performing a similar operation on (14), we can infer that the Rényi KL divergence is
$$ D_{\text{KL}, \alpha}(p||q) = -\frac{1}{1-\alpha} \log \underset{X \sim p_X}{\mathbb{E}} \left(\frac{p_X}{q_X}\right)^{\alpha-1} = -\frac{1}{1-\alpha} \log \sum_{X} p(X) \left(\frac{p(X)}{q(X)}\right)^{\alpha-1} \qquad (17) $$From (17), we can make substitutions analogous to those in the previous section (the ones that led to Eq.(13)) to obtain a Rényi cross entropy. In all, our transformations from entropy, to divergence, and finally to cross entropy follow the path below

Figure 2: In going from the standard entropy to a loss function, we first introduce a new probability distribution and then we set the relevant distributions to either the target distribution or the prediction distribution.
From (17) the extrapolation to a categorical cross entropy is fairly simple. Replacing \(p(X)\) with \(p(\textbf{x}^{(i)}) = \delta(\lambda, y^{(i)})\) and \(q(X)\) with \(q(\lambda|\textbf{x}^{(i)})\), and averaging over the data points (this time inside the logarithm), we find
$$ \boxed{ {\mathcal L}_{\alpha} = \frac{1}{1-\alpha} \log \left(\frac{1}{N}\sum_{i = 1}^{N} \sum_{\lambda \in C} \delta(\lambda, y^{(i)}) q(\lambda|\textbf{x}^{(i)})^{\alpha-1}\right),} \qquad (18) $$from which we can confirm (again by L'Hopital's rule) that the \(\alpha \to 1\) limit of Eq.(18) results in the standard categorical cross-entropy loss, Eq.(13).
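As a quick numerical check of this limit (a minimal sketch with made-up one-hot targets and an illustrative function name), we can verify that Eq.(18) approaches Eq.(13) as \(\alpha \to 1\):

import numpy as np

def renyi_cross_entropy_loss(y_onehot, q_pred, alpha):
    # Eq.(18): L_alpha = log( (1/N) sum_i q(y_i|x_i)^(alpha-1) ) / (1 - alpha)
    q_true = np.sum(y_onehot * q_pred, axis=1)   # probability assigned to the correct label
    if np.isclose(alpha, 1.0):
        return -np.mean(np.log(q_true))          # Eq.(13)
    return np.log(np.mean(q_true**(alpha - 1))) / (1.0 - alpha)

y = np.eye(3)[[0, 2, 1, 0]]
q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6],
              [0.2, 0.5, 0.3],
              [0.6, 0.3, 0.1]])
print(renyi_cross_entropy_loss(y, q, 1.0))       # standard categorical cross entropy
print(renyi_cross_entropy_loss(y, q, 1.001))     # nearly identical
print(renyi_cross_entropy_loss(y, q, 2.0))       # Rényi loss with alpha = 2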
Back Propagation from Rényi
In machine learning, the loss defines an “energy” landscape for which we seek a minimum. This minimum is found by computing gradients of the loss function with respect to weights. Having obtained the Rényi entropy loss in Eq.(18), we can now explore how such gradients change for this generalized loss function.
For the back propagation equations, the key thing that will change is the first derivative that goes from the final output to the final hidden layer. Every other gradient depends on that final layer derivative but does so in a way that is identical to such dependence in models defined by regular (i.e., \(\alpha=1\)) cross-entropies. Therefore, to understand how Eq.(18) modifies model training, we just need to compute the derivative of this new cross entropy with respect to the softmax function \(q(\lambda|\textbf{x})\), and then we can simply use our previous gradient formulas to implement the subsequent gradients.
Computing this derivative, we obtain
$$ \begin{aligned} \partial_{q} {\mathcal L}_{\alpha} = - \frac{\displaystyle \sum_{i = 1}^{N} \sum_{\lambda \in C} \delta(\lambda, y^{(i)}) q(\lambda|\textbf{x}^{(i)})^{\alpha-2}}{\displaystyle \sum_{i = 1}^{N} \sum_{\lambda \in C} \delta(\lambda, y^{(i)}) q(\lambda|\textbf{x}^{(i)})^{\alpha-1}}. \qquad (19) \end{aligned} $$For the purpose of illustration, we can consider the binary classification case for Eq.(19): We will take \(\lambda, y^{(i)} \in \{0, 1\}\) and \(q(1|\textbf{x}) = \sigma(\textbf{x})\), \(q(0|\textbf{x}) = 1- \sigma(\textbf{x})\) for a sigmoidal function \(\sigma(\textbf{x}) = 1/(1+ e^{-\textbf{w}\cdot \textbf{x}})\) with weight vector \(\textbf{w}\). We can then write
$$ \begin{aligned} \partial_{q} {\mathcal L}_{\alpha} = - \frac{\displaystyle \sum_{i = 1}^{N} \left[ y_i \sigma(\textbf{x}^{(i)})^{\alpha-2} + (1-y_i)(1- \sigma(\textbf{x}^{(i)}))^{\alpha-2}\right]}{\displaystyle \sum_{i = 1}^{N} \left[ y_i \sigma(\textbf{x}^{(i)})^{\alpha-1} + (1-y_i) (1- \sigma(\textbf{x}^{(i)}))^{\alpha-1}\right]}. \qquad (20) \end{aligned} $$In logistic regression problems, this derivative fully determines how the model parameters within \(q(\lambda|\textbf{x})\) evolve towards their data optimizing values. For deep learning problems \(q(\lambda|\textbf{x})\) represents the final layer softmax function and thus there are more derivatives that must multiply Eq.(19) in order to determine the evolution of all parameter weights.
Computational Demonstration
OK enough with the theory. Let’s see how this works.
We will code a simple feedforward neural network from scratch, train it on a simple classification data set, and then record performance metrics. We will do this procedure multiple times: Once for the standard loss (i.e., \(\alpha=1\) case) and a few more times for various other values of \(\alpha\). We want to see which of these choices leads to better model performance.
We will use the simple catvnoncat dataset since it is public and has the strange property of being an image dataset that seems classifiable through simple feedforward networks. For the main tests, we will compare the standard-loss trained neural network (corresponding to \(\alpha \to 1\)) to various Rényi loss neural networks (specifically, those for \(\alpha =1/2\), \(\alpha = 3/2\), and \(\alpha = 2\)).
On to the coding! In the file dl_renyi_backprop.py, there is a from-scratch 4-layer neural network with layer dimensions layers_dims = [12288, 20, 7, 5, 1] (This is the neural network you build in Coursera’s “Deep Learning Specialization”). For this network, the important thing we want to change is the gradient function. Here is the standard gradient function with the standard loss:
def L_model_backward(AL, Y, caches):
    """
    Standard gradient
    """
    grads = {}
    L = len(caches)  # the number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)  # after this line, Y is the same shape as AL

    # Initializing the backpropagation
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

    current_cache = caches[L-1]
    grads["dA" + str(L-1)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, activation = "sigmoid")

    for l in reversed(range(L-1)):
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA" + str(l + 1)], current_cache, activation = "relu")
        grads["dA" + str(l)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp

    return grads
Now, implementing the gradients in Eq.(19), the new gradient function becomes
def L_model_backward_renyi(AL, Y, alpha, caches):
    """
    New Rényi gradient
    """
    grads = {}
    L = len(caches)  # the number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)  # after this line, Y is the same shape as AL

    # Initializing backward propagation
    if alpha == 1:
        dAL = -(1/m)*(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    else:
        # nudge labels away from exactly 0 and 1 so the fractional powers below stay finite
        Y = np.maximum(Y-1e-5, 1e-5)
        # elementwise Rényi gradient term for each data point
        dAL = - (np.divide(Y, AL)**alpha - np.divide(1 - Y, 1 - AL)**alpha)
        # scalar normalization factor summed over all data points
        DEN = np.sum(Y* np.divide(Y, AL)**(alpha-1)
                     + (1-Y)* np.divide(1-Y, 1-AL)**(alpha-1))
        dAL = dAL / DEN

    current_cache = caches[L-1]
    grads["dA" + str(L-1)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, activation = "sigmoid")
    ...
We would ordinarily change the loss as well, but the loss computed during the feed-forward pass of a neural network is mostly a bookkeeping quantity used to track weight convergence and overtraining. Changing it would also make it difficult to compare the Rényi back propagation models to the standard model, since different loss functions are not directly comparable. So we will use the standard loss function for both the original neural network and the Rényi back-propagation versions, allowing apples-to-apples comparisons of how well the latter reduces the loss.
For comparing these trained models we will use precision, recall, F1 score, and accuracy at a probability threshold of \(0.5\). But we’ll also use the “Area Under the Curve of the Receiver Operating Characteristic” (AUC of the ROC) since it gives us a threshold-independent sense of how the models compare.
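For the metrics themselves we can lean on scikit-learn; a minimal sketch (the function name evaluate_model is just for this post):

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate_model(y_true, y_prob, threshold=0.5):
    # Threshold-dependent metrics plus the threshold-independent AUC of the ROC
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
        "auc":       roc_auc_score(y_true, y_prob),
    }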
To implement the experiment, we train various neural networks with the backpropagation function L_model_backward_renyi for \(\alpha=1\) (standard), \(\alpha=1/2\), \(\alpha =3/2\), and \(\alpha=2\). Here are the loss curves resulting from the training:

Figure 3: Model loss curves for various \(\alpha\). The curves all converge to the same minimum value with convergence after 20 epochs being slower for the models with \(\alpha >1\).
Finally collecting the performance results for each value of \(\alpha\), we find
Results for Test Set at Threshold = 0.5:
+---------+------------+-------------+----------+------------+-------------+
| Alpha | Accuracy | Precision | Recall | F1 Score | AUC Score |
+=========+============+=============+==========+============+=============+
| 0.5 | 0.76 | 0.818 | 0.818 | 0.818 | 0.8075 |
+---------+------------+-------------+----------+------------+-------------+
| 1 | 0.78 | 0.824 | 0.848 | 0.836 | 0.8075 |
+---------+------------+-------------+----------+------------+-------------+
| 1.5 | 0.8 | 0.848 | 0.848 | 0.848 | 0.8075 |
+---------+------------+-------------+----------+------------+-------------+
| 2 | 0.78 | 0.806 | 0.879 | 0.841 | 0.8021 |
+---------+------------+-------------+----------+------------+-------------+
We see that the \(\alpha=1.5\) model leads to a higher precision than the \(\alpha=1\) model, but all other metrics remain essentially the same. Even the AUC scores for the \(\alpha \neq 1\) models do not differ significantly from that of the \(\alpha =1\) model.
However, in the above hypothesis our focus was on precision, since the positive predictions that contribute to this metric are the ones where the model does not make a mistake when it predicts the positive class. In other words, models that are confidently correct (i.e., that assign high probabilities to their positive predictions) will tend to have higher precision. We hypothesized that the sharper dependence on prediction probabilities would lead the system to converge to regions of parameter space where it is more confident in its correct predictions, and that seems to be the case.
These results thus suggest a soft conclusion about how \(\alpha\) and the Rényi backpropagation equations can be used within hyperparameter tuning for neural networks:
Soft Conclusion: The Rényi entropy order parameter \(\alpha\) provides an additional tuning hyperparameter which allows us to bias our model to slightly higher precision or recall depending on \(\alpha\)’s value.
Final Remarks and Extensions
So it seems that choosing \(\alpha \neq 1\) for the Rényi back propagation equations leads to slightly different model performance on the test set. But the model we trained was fairly simple. How well do these results generalize to more complicated networks?
If we were publishing a paper on this work, we might want to extend the model to the current hotness in DL models like transformer models, some obscure CNN, or something even more exotic, but for my exploratory ends I am pretty satisfied.
As a follow up it is worth considering how this alternative entropy choice affects other ML problems where information or entropy plays a prominent role (such as for decision trees). But that task is for another time. For related work on Rényi entropy and divergence in machine learning, see [3].
Footnotes
[1] When describing entropy, people rarely give an explicit definition of “information,” or they equate information to something obscure like “uncertainty” or “surprise.” The notes "Entropy from Information" bring us to a concrete definition of information: Information is the average number of binary-valued questions it takes to determine the outcome of an experiment under an optimal (probability-space splitting) questioning strategy. This is a mouthful, but its exactness is clearer through an example. If you try to determine the minimum number of yes/no questions you need to ask (and get answered) to find a hidden number from 1 to 1000, then you already get the basic idea of information.
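For instance, with the halving strategy a number hidden uniformly between 1 and 1000 is pinned down in \(\lceil \log_2 1000 \rceil = 10\) yes/no questions, which lines up with the Shannon information \(\log_2 1000 \approx 9.97\) bits.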
[2] For final layer activation functions, this probability is typically written in terms of the "softmax" prediction function $$ q(\lambda| \textbf{x}^{(i)}) = \frac{\exp\left(\sum_{j=1}^M W_{\lambda, j} x^{(i)}_{j}\right)}{\sum_{\lambda' \in C}\exp\left(\sum_{j=1}^M W_{\lambda', j} x^{(i)}_{j}\right)} \qquad \text{[Softmax Function]}, $$
where \(\textbf{x}^{(i)} \equiv (x^{(i)}_1, x^{(i)}_2, \ldots, x^{(i)}_M)\) for a feature vector with \(M\) components, and we introduced “weights” \(W_{\lambda, j}\) defining how component \(j\) of the feature vector contributes to the probability of the data point being in class \(\lambda\). But we won’t use this expression explicitly.
[3] While Rényi entropy and divergence have been explored in various machine learning contexts, their application as primary loss functions for standard supervised classification is less common. The following works are similar in spirit to this post and are included here for those interested in related methods:
- Bhatia et al. (2021) "Least k-th Order and Rényi Generative Adversarial Networks" arXiv:2006.02479 [Link]
Uses Rényi cross-entropy functionals for GAN training.
- Gajowniczek et al. (2020) "Semantic and Generalized Entropy Loss Functions for Semi-Supervised Deep Learning" Entropy 22(3), 334 [Link]
Uses Rényi entropy directly (without conversion to cross-entropy) as a regularization term for model training.
- Kieferova et al. (2021) "Quantum Generative Training Using Rényi Divergences" arXiv:2106.09567 [Link]
Uses Rényi divergence as a loss function in the training of quantum neural networks.
- Gronowski et al. (2022) "Renyi Fair Information Bottleneck for Image Classification" arXiv:2203.04950 [Link]
Uses Rényi entropy in a variational approach to reduce bias in feature variables.
- Huang et al. (2023) "Rényi Divergence Deep Mutual Learning" arXiv:2209.05732 [Link]
Uses Rényi divergence as a regularization term for mutual learning.