9. PDFs and CDFs

This notebook demonstrates how to move between a probability density function (PDF) and a cumulative distribution function (CDF). Given a PDF, the CDF is obtained by integrating the PDF; given a CDF, the PDF is obtained by differentiating the CDF.
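The relationship described above can be written compactly. The symbols \(f\) for the PDF and \(F\) for the CDF are notation assumed here for illustration; they do not appear elsewhere in the notebook.

```latex
\[
F(x) = \int_{-\infty}^{x} f(t)\,dt,
\qquad
f(x) = \frac{dF(x)}{dx}
\]
```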

9.1. Standard normal distribution

Here, we visualize the PDF and CDF for the standard normal distribution. The functions scipy.stats.norm.pdf and scipy.stats.norm.cdf will be used to generate the curves and data.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import norm
import warnings

plt.style.use('ggplot')
np.random.seed(37)
warnings.filterwarnings('ignore')

x = np.arange(-6, 6.1, 0.1)
y_pdf = norm.pdf(x)
y_cdf = norm.cdf(x)

fig, ax = plt.subplots(figsize=(15, 6))
ax = [ax, ax.twinx()]

_ = ax[0].plot(x, y_pdf, label='pdf', color='r')
_ = ax[1].plot(x, y_cdf, label='cdf', color='b')
_ = ax[0].tick_params(axis='y', labelcolor='r')
_ = ax[1].tick_params(axis='y', labelcolor='b')
_ = ax[0].set_ylabel('pdf', color='r')
_ = ax[1].set_ylabel('cdf', color='b')
_ = ax[0].set_title('PDF and CDF of standard normal')

_images/pdf-cdf_3_0.png

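We will use scipy.misc.derivative to differentiate the CDF, which recovers the PDF, and scipy.integrate.quad to integrate the PDF, which recovers the CDF. The integration is done interval by interval over the grid, and the running sum of those pieces gives the CDF.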

from scipy.misc import derivative
from scipy.integrate import quad

# integrate the PDF over each grid interval; the CDF at x[0] is taken as 0,
# so the running sum lines up with the CDF at every grid point
areas = [quad(norm.pdf, a, b)[0] for a, b in zip(x[:-1], x[1:])]
y_cdf = np.array([0] + areas).cumsum()
# numerical derivative of the CDF at every grid point
y_pdf = derivative(norm.cdf, x, dx=1e-6)

fig, ax = plt.subplots(1, 2, figsize=(15, 6))

_ = ax[0].plot(x, y_pdf, color='r')
_ = ax[1].plot(x, y_cdf, color='b')
_ = ax[0].set_title('PDF from derivative over CDF')
_ = ax[1].set_title('CDF from integration over PDF')

_images/pdf-cdf_5_0.png
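scipy.misc.derivative has been deprecated and is no longer available in recent SciPy releases, so the cell above may not run on a current installation. The following is a minimal sketch of the same two steps without it, assuming a plain two-point central difference is an acceptable substitute. The helper names central_difference and cdf_by_quadrature are hypothetical and are not part of SciPy or of this notebook.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm


def central_difference(f, x, dx=1e-6):
    """Two-point central difference approximation of f'(x)."""
    return (f(x + dx) - f(x - dx)) / (2.0 * dx)


def cdf_by_quadrature(pdf, grid):
    """Integrate pdf over each grid interval and accumulate.

    The CDF at grid[0] is assumed to be 0, so the result has the
    same length as grid and lines up with it point by point.
    """
    pieces = [quad(pdf, a, b)[0] for a, b in zip(grid[:-1], grid[1:])]
    return np.concatenate(([0.0], np.cumsum(pieces)))


x = np.arange(-6, 6.1, 0.1)
pdf_est = central_difference(norm.cdf, x)  # PDF recovered from the CDF
cdf_est = cdf_by_quadrature(norm.pdf, x)   # CDF recovered from the PDF
```

Both estimates can be compared against norm.pdf(x) and norm.cdf(x) directly; on this grid the only systematic difference in the CDF is the small amount of probability mass to the left of -6 that the sketch ignores.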

9.2. Log-normal distribution

from scipy.stats import lognorm

lognorm_pdf = lambda x: lognorm.pdf(x, 1)
lognorm_cdf = lambda x: lognorm.cdf(x, 1)

x = np.arange(0, 10.1, 0.05)
y_pdf = lognorm_pdf(x)
y_cdf = lognorm_cdf(x)

fig, ax = plt.subplots(figsize=(15, 6))
ax = [ax, ax.twinx()]

_ = ax[0].plot(x, y_pdf, label='pdf', color='r')
_ = ax[1].plot(x, y_cdf, label='cdf', color='b')
_ = ax[0].tick_params(axis='y', labelcolor='r')
_ = ax[1].tick_params(axis='y', labelcolor='b')
_ = ax[0].set_ylabel('pdf', color='r')
_ = ax[1].set_ylabel('cdf', color='b')
_ = ax[0].set_title('PDF and CDF of log-normal')

_images/pdf-cdf_7_0.png

# integrate the PDF over each grid interval; the CDF at x[0] = 0 is 0,
# so the running sum lines up with the CDF at every grid point
areas = [quad(lognorm_pdf, a, b)[0] for a, b in zip(x[:-1], x[1:])]
y_cdf = np.array([0] + areas).cumsum()
# numerical derivative of the CDF at every grid point
y_pdf = derivative(lognorm_cdf, x, dx=1e-6)

fig, ax = plt.subplots(1, 2, figsize=(15, 6))

_ = ax[0].plot(x, y_pdf, color='r')
_ = ax[1].plot(x, y_cdf, color='b')
_ = ax[0].set_title('PDF from derivative over CDF')
_ = ax[1].set_title('CDF from integration over PDF')

_images/pdf-cdf_8_0.png
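When the PDF has already been evaluated on a grid, a cheaper alternative to calling quad on every interval is the cumulative trapezoid rule. This is a different technique from the one used above, sketched here for comparison only; it trades some accuracy for speed, and the variable names are illustrative.

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid
from scipy.stats import lognorm

x = np.arange(0, 10.1, 0.05)
y_pdf = lognorm.pdf(x, 1)

# running trapezoid-rule integral of the sampled PDF; initial=0 makes the
# result the same length as x, with the CDF at x[0] set to 0
y_cdf_trapz = cumulative_trapezoid(y_pdf, x, initial=0)

# largest deviation from the closed-form CDF on this grid
max_abs_error = np.max(np.abs(y_cdf_trapz - lognorm.cdf(x, 1)))
```

The error depends on the grid spacing: a finer grid near zero, where the log-normal PDF rises steeply, reduces it.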

9.3. Learn a PDF from arbitrary CDF

We will generate an arbitrary CDF using the logistic function.

logistic = lambda x, L=1, x_0=0, k=1: L / (1 + np.exp(-k * (x - x_0)))

x = np.arange(-6, 6.1, 0.1)
y = logistic(x)
# shift the x values so the curve runs from 0 to 12 with its midpoint at 6
x = x + 6.0

fig, ax = plt.subplots(figsize=(15, 6))

_ = ax.plot(x, y, color='b')
_ = ax.set_title('Basic s-curve using logistic function')

_images/pdf-cdf_10_0.png

The parameters \(L\), \(x_0\) and \(k\) of the logistic function will be learned from the curve with scipy.optimize.curve_fit.

from scipy.optimize import curve_fit

L_estimate = y.max()
x_0_estimate = np.median(x)
k_estimate = 1.0

p_0 = [L_estimate, x_0_estimate, k_estimate]
popt, pcov = curve_fit(logistic, x, y, p_0, method='dogbox')

L, x_0, k = popt[0], popt[1], popt[2]
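As a sanity check on the fit, the following self-contained sketch repeats the steps above and inspects the learned parameters. Because the curve was generated with \(L = 1\) and \(k = 1\) and then shifted by 6, the fit should recover values close to 1, 6 and 1. The names L_hat, x_0_hat and k_hat are illustrative, and logistic is written as a regular function here rather than a lambda.

```python
import numpy as np
from scipy.optimize import curve_fit


def logistic(x, L, x_0, k):
    """Logistic curve with maximum L, midpoint x_0 and growth rate k."""
    return L / (1.0 + np.exp(-k * (x - x_0)))


# same data as above: a unit logistic curve whose x values are shifted by 6
x = np.arange(-6, 6.1, 0.1)
y = logistic(x, 1.0, 0.0, 1.0)
x = x + 6.0

# same starting guesses as above
p_0 = [y.max(), np.median(x), 1.0]
popt, pcov = curve_fit(logistic, x, y, p_0, method='dogbox')

L_hat, x_0_hat, k_hat = popt
```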

Assuming the PDF is log-normal, we will take the derivative of the log-normal CDF, evaluated on the shifted x values, to estimate the PDF.

# logistic curve with the learned parameters as defaults
logistic = lambda x, L=L, x_0=x_0, k=k: L / (1 + np.exp(-k * (x - x_0)))

# numerical derivative of the log-normal CDF on the shifted grid
y_pdf = derivative(lognorm_cdf, x, dx=1e-6)

fig, ax = plt.subplots(figsize=(15, 6))

_ = ax.plot(x, y_pdf, color='r')
_ = ax.set_title('Log-normal PDF from derivative over CDF')

_images/pdf-cdf_14_0.png
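For comparison, the fitted logistic curve can itself be treated as a CDF and differentiated. The sketch below does this numerically with np.gradient and checks it against the closed-form derivative \(k f(x) (1 - f(x)/L)\), where \(f\) is the logistic curve. It assumes the parameter values \(L = 1\), \(x_0 = 6\) and \(k = 1\), which are what the fit above should approximately recover; logistic_pdf is a hypothetical helper name.

```python
import numpy as np


def logistic(x, L=1.0, x_0=6.0, k=1.0):
    """Logistic curve; the defaults are assumed values, not fitted ones."""
    return L / (1.0 + np.exp(-k * (x - x_0)))


def logistic_pdf(x, L=1.0, x_0=6.0, k=1.0):
    """Closed-form derivative of the logistic curve with respect to x."""
    f = logistic(x, L, x_0, k)
    return k * f * (1.0 - f / L)


x = np.arange(0, 12.1, 0.1)

# finite-difference derivative of the curve sampled on the grid
pdf_numeric = np.gradient(logistic(x), x)
# exact derivative for comparison
pdf_analytic = logistic_pdf(x)
```

The resulting density is symmetric and bell-shaped around \(x_0\), which differs from the skewed log-normal shape plotted above.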

9.4. Learn a CDF from arbitrary PDF

We will sample data from a Gaussian mixture (GM), estimate its PDF, and derive the CDF using integration.

N = 1000
X = np.concatenate((np.random.normal(0, 1, int(0.3 * N)), np.random.normal(5, 1, int(0.7 * N))))

s = pd.Series(X)

fig, ax = plt.subplots(figsize=(15, 6))

_ = s.plot(kind='kde', bw_method='scott', ax=ax)
_ = ax.set_title('Mixed gaussian PDF')

_images/pdf-cdf_17_0.png
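Since the sample above is drawn with weights 0.3 and 0.7 from normal distributions centered at 0 and 5, the mixture that generated it has a closed-form PDF and CDF. The sketch below writes them out as a reference for the estimates that follow; the function names mixture_pdf and mixture_cdf are illustrative.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# mixture components used to draw the sample above
weights = [0.3, 0.7]
means = [0.0, 5.0]
sds = [1.0, 1.0]


def mixture_pdf(x):
    """Weighted sum of the component normal PDFs."""
    return sum(w * norm.pdf(x, m, s) for w, m, s in zip(weights, means, sds))


def mixture_cdf(x):
    """Weighted sum of the component normal CDFs."""
    return sum(w * norm.cdf(x, m, s) for w, m, s in zip(weights, means, sds))


# the PDF should integrate to approximately 1 over a wide enough range
area, _ = quad(mixture_pdf, -10, 15)
```

Between the two modes the CDF flattens out near 0.3, the weight of the first component.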

We then use a kernel density estimator to learn the PDF.

from sklearn.neighbors import KernelDensity

kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(X[:, np.newaxis])
# score_samples returns the log-density at each point, so exponentiate it
gmm_pdf = lambda x: np.exp(kde.score_samples(np.array([x]).reshape(-1, 1))[0])

Finally, the CDF is estimated by integrating the learned PDF.

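%%time
x = np.arange(-5, 10.1, 0.1)
# integrate the PDF over each grid interval; the CDF at x[0] is taken as 0,
# so the running sum lines up with the CDF at every grid point
areas = [quad(gmm_pdf, a, b)[0] for a, b in zip(x[:-1], x[1:])]
y_cdf = np.array([0] + areas).cumsum()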

CPU times: user 522 ms, sys: 0 ns, total: 522 ms Wall time: 548 ms

fig, ax = plt.subplots(figsize=(15, 6))

_ = ax.plot(x, y_cdf, color='b')
_ = ax.set_title('CDF from integration over gaussian-mixture PDF')

_images/pdf-cdf_22_0.png
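With a Gaussian kernel, the integral of the density estimate is also available without numerical quadrature. The sketch below swaps in scipy.stats.gaussian_kde, which is not the estimator used above, and calls its integrate_box_1d method to get the CDF at each grid point. It uses the default bandwidth rule of gaussian_kde rather than the fixed bandwidth of 0.75, and it draws its own sample with a NumPy Generator, so the curve will be close to but not identical to the one plotted above.

```python
import numpy as np
from scipy.stats import gaussian_kde

# draw a sample with the same mixture weights and components as above;
# the Generator-based sampling is an assumption made to keep this cell
# self-contained
rng = np.random.default_rng(37)
N = 1000
X = np.concatenate((rng.normal(0, 1, int(0.3 * N)),
                    rng.normal(5, 1, int(0.7 * N))))

# Gaussian KDE with the default bandwidth rule
kde = gaussian_kde(X)

x = np.arange(-5, 10.1, 0.1)
# probability mass of the estimate to the left of each grid point
y_cdf = np.array([kde.integrate_box_1d(-np.inf, b) for b in x])
```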