KL divergence
Assume we need to measure how much a probability distribution varies for the same underlying set of outcomes X.
Let's say we measured the probability distribution at two timestamps for the same set X:
t1: P(1) = 40%, P(2) = 20%, P(3) = 40%
t2: Q(1) = 40%, Q(2) = 40%, Q(3) = 20%
for the same x from X: P -> distribution at t1, Q -> distribution at t2
A simple way to calculate the relative difference (P wrt Q) is:
-> compute P(x)/Q(x), and to get an overall sense, average it over all x
But this is a very naive way and has issues: even though P(2) and P(3) differ from Q by the same 20% in opposite directions, their ratios are 0.5 and 2, so the higher value skews the average above 1.
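A quick sketch in Python of this naive ratio average (the variable names are my own) makes the skew concrete:

```python
# Example distributions from above: P at t1, Q at t2
P = {1: 0.4, 2: 0.2, 3: 0.4}
Q = {1: 0.4, 2: 0.4, 3: 0.2}

ratios = [P[x] / Q[x] for x in P]      # [1.0, 0.5, 2.0]
naive_avg = sum(ratios) / len(ratios)  # 1.1666..., not 1.0
print(ratios, naive_avg)
```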
We need a function that maps a ratio and its reciprocal to the same value in opposite directions,
i.e. if f(x) = y, then f(1/x) = -y.
log matches this exactly: log(1/x) = -log(x).
So let's fix our formula by taking the log of each ratio before averaging: (1/n) * sum over x of log(P(x)/Q(x)).
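Rechecking with the same numbers (a sketch, continuing the variables from the snippet above), the two symmetric changes now cancel:

```python
import math

P = {1: 0.4, 2: 0.2, 3: 0.4}
Q = {1: 0.4, 2: 0.4, 3: 0.2}

log_ratios = [math.log(P[x] / Q[x]) for x in P]  # [0.0, -0.693..., +0.693...]
print(sum(log_ratios) / len(log_ratios))         # 0.0
```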
We have one final thing to make it better.
Right now we are weighing all the observations with the same weight, but changes in higher-probability outcomes need to be weighted more than changes in rare ones, just like in the expectation calculation of a distribution. That's why, instead of the equal weight 1/n, we use P(x).
So our final form becomes: D_KL(P || Q) = sum over x of P(x) * log(P(x)/Q(x))
And that is in fact our KL divergence formula for discrete distributions.
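A minimal sketch of the discrete formula, assuming both distributions share the same support and skipping outcomes with P(x) = 0 (the function name is my own):

```python
import math

def kl_divergence(P, Q):
    # D_KL(P || Q) = sum over x of P(x) * log(P(x) / Q(x))
    return sum(P[x] * math.log(P[x] / Q[x]) for x in P if P[x] > 0)

P = {1: 0.4, 2: 0.2, 3: 0.4}
Q = {1: 0.4, 2: 0.4, 3: 0.2}
print(kl_divergence(P, Q))  # 0.2*log(0.5) + 0.4*log(2) ~= 0.1386
```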
And for continuous distributions, it is an integral over x: D_KL(P || Q) = integral of p(x) * log(p(x)/q(x)) dx
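As a sketch of the continuous case, here is a numerical integration between two Gaussians, N(0,1) and N(1,1), chosen purely for illustration (for equal variances the closed form is (mu_p - mu_q)^2 / (2*sigma^2) = 0.5):

```python
import numpy as np

x = np.linspace(-10, 10, 100001)
p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)        # density of N(0, 1)
q = np.exp(-0.5 * (x - 1)**2) / np.sqrt(2 * np.pi)  # density of N(1, 1)

# integral of p(x) * log(p(x)/q(x)) dx, approximated on a grid
print(np.trapz(p * np.log(p / q), x))  # ~= 0.5
```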