- Finite mixture models
- Infinite mixture models
- Continuous mixture models
from pathlib import Path
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotnine as gg
import pymc3 as pm
import seaborn as sns
import theano.tensor as tt
from scipy import stats
gg.theme_set(gg.theme_minimal())
%config InlineBackend.figure_format = 'retina'
- a mixture model arises when the population is a combination of sub-populations
- do not really have to believe there are sub-populations
- can be used as a trick to add flexibility to the model
- a finite mixture model is a weighted mixture of two or more distributions
- the probability density of the observed data is a weighted sum of the probability densities of the $K$ subgroups:

$$\Pr(y \mid \theta) = \sum_{i=1}^{K} w_i \Pr_i(y \mid \theta_i)$$

  - $w_i$: weight of each subgroup (a.k.a. component); the probability of component $i$; the weights sum to 1
  - $\Pr_i(y \mid \theta_i)$: distribution of subgroup $i$
- to fit a mixture model, need to properly assign each data point to one of the components
  - introduce a random variable $z$ that specifies to which component an observation belongs; this is called a latent variable
  - a minimal numerical sketch of a mixture density is shown below
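A minimal numerical sketch (toy weights and components, not from the notebook's data; reuses the `numpy` and `scipy.stats` imports above) showing that a mixture density is just the weighted sum of its component densities:

```python
# Two-component Gaussian mixture with assumed toy weights and parameters.
x_toy = np.linspace(-5, 15, 500)
w_toy = np.array([0.3, 0.7])  # component weights; must sum to 1
components = [stats.norm(0, 1), stats.norm(8, 2)]

mixture_pdf = sum(w_i * comp.pdf(x_toy) for w_i, comp in zip(w_toy, components))
print(np.trapz(mixture_pdf, x_toy))  # integrates to ~1, so the mixture is a valid density
```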
- example using chemical shift data (from protein NMR)
cs = pd.read_csv(Path("data/chemical_shifts_theo_exp.csv"))
cs_exp = cs["exp"]
(
gg.ggplot(cs, gg.aes("exp"))
+ gg.geom_histogram(
gg.aes(y="..density.."), bins=30, alpha=0.3, fill="blue", color="blue"
)
+ gg.geom_density(color="blue", size=1.5)
+ gg.scale_x_continuous(expand=(0, 0))
+ gg.scale_y_continuous(expand=(0, 0, 0.02, 0))
+ gg.labs(x="exp", y="density", title="Chemical shift mixture")
)
- the categorical distribution is the most general discrete distribution
  - parameterized by a vector giving the probability of each possible outcome
example_cat_data = pd.DataFrame(
{
"cat": ["A", "A", "A", "B", "B", "B", "B"],
"y": [0.1, 0.6, 0.3, 0.3, 0.1, 0.1, 0.5],
"x": [0, 1, 2, 0, 1, 2, 3],
}
)
(
gg.ggplot(example_cat_data, gg.aes(x="x", y="y", color="cat"))
+ gg.geom_line(alpha=0.5, size=1)
+ gg.geom_point(size=3)
+ gg.scale_color_brewer(type="qual", palette="Set1")
+ gg.labs(x="x", y="f(x)", title="Example categorical distribution")
)
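As a quick complement to the plot above, a minimal sampling sketch (toy probabilities, not from the book; reuses the `numpy` import above): a categorical distribution is fully specified by one probability per outcome.

```python
# Assumed toy outcome probabilities.
p_outcomes = np.array([0.1, 0.6, 0.3])
cat_samples = np.random.choice(len(p_outcomes), size=10_000, p=p_outcomes)
print(np.bincount(cat_samples) / len(cat_samples))  # empirical frequencies ≈ p_outcomes
```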
- the Dirichlet distribution:
  - lives in the simplex (an $n$-dim. triangle)
  - a draw from the dist. is a $K$-length vector whose values are restricted to be nonnegative and sum to 1
  - parameterized by a $K$-length concentration vector $\alpha$
  - generalization of the beta distribution
    - i.e. extends the beta dist. beyond 2 outcomes
    - the beta dist. has two outcomes and returns a 2-element vector: $(p, 1-p)$
    - for 3 outcomes, need a 3-element vector $(p, q, r)$ where $p+q+r=1$
    - for $K$ outcomes, declare a parameter vector $\alpha$ of length $K$
  - a quick sampling sketch is shown below
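A quick sampling sketch (assumed concentration values, not from the book; reuses the `numpy` import above) showing that Dirichlet draws live on the simplex:

```python
# Each draw is a K-length vector of nonnegative weights that sums to 1.
dirichlet_draws = np.random.dirichlet(alpha=np.ones(3), size=5)
print(dirichlet_draws)
print(dirichlet_draws.sum(axis=1))  # every row sums to 1
```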
- in a model, use the Dirichlet dist. as a $K$-sided coin-flip model on top of the Gaussian estimation model
  - in the diagram, there are $K$ copies of the model in the rounded rectangle
  - $\mu_k$ depends upon the group $k$
  - $\sigma_\mu$ and $\sigma_\sigma$ are shared across all groups
# NOTE: We expect this model to fit poorly.
# See below for an example of how to reparameterize for a more
# effective fitting process.
clusters = 2
with pm.Model() as model_kg:
p = pm.Dirichlet("p", a=np.ones(clusters))
z = pm.Categorical(
"z", p=p, shape=len(cs.exp)
) # group assignment; one per data point
means = pm.Normal("means", mu=cs.exp.mean(), sd=10, shape=clusters)
sd = pm.HalfNormal("sd", sd=10)
y = pm.Normal("y", mu=means[z], sd=sd, observed=cs.exp)
# trace_kg = pm.sample()  # NOT RUN BECAUSE TOO SLOW
- the above model fits poorly because the latent variable `z` is explicitly included
- reparameterize:
  - in a MM, the observed variable $y$ is conditional on the latent variable $z$, that is $\Pr(y|z, \theta)$
  - can consider $z$ a nuisance variable and marginalize it out to get $\Pr(y | \theta)$
  - can do this in PyMC3 using the `NormalMixture()` distribution to get a Gaussian mixture model; a sketch of the marginalization arithmetic is shown below
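A minimal sketch of that marginalization (toy values, not the actual `NormalMixture()` implementation): summing over the components on the log scale removes $z$ from the likelihood.

```python
from scipy.special import logsumexp

# Assumed toy observations and mixture parameters.
y_toy = np.array([4.8, 5.1, 9.9, 10.2])
w_toy = np.array([0.5, 0.5])                 # mixture weights
mu_toy, sd_toy = np.array([5.0, 10.0]), 1.0  # component means, shared sd

# log p(y) = logsumexp_k( log w_k + log Normal(y | mu_k, sd) ): z is summed out.
log_lik = logsumexp(np.log(w_toy) + stats.norm(mu_toy, sd_toy).logpdf(y_toy[:, None]), axis=1)
print(log_lik.sum())  # marginal log-likelihood, no z variable needed
```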
clusters = 2
with pm.Model() as model_mg:
p = pm.Dirichlet("p", a=np.ones(clusters))
means = pm.Normal("means", mu=cs.exp.mean(), sd=10, shape=clusters)
sd = pm.HalfNormal("sd", sd=10)
y = pm.NormalMixture("y", w=p, mu=means, sd=sd, observed=cs.exp)
trace_mg = pm.sample(random_seed=123)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sd, means, p]
100.00% [4000/4000 00:21<00:00 Sampling 2 chains, 0 divergences]
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 29 seconds.
The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.
az_trace_mg = az.from_pymc3(trace_mg, model=model_mg)
az.plot_trace(az_trace_mg, var_names=["means", "p"])
plt.show()
az.summary(trace_mg, var_names=["means", "p"])
|  | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_mean | ess_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| means[0] | 52.154 | 5.329 | 46.242 | 57.681 | 3.747 | 3.169 | 2.0 | 2.0 | 3.0 | 63.0 | 1.83 |
| means[1] | 52.174 | 5.299 | 46.358 | 57.664 | 3.726 | 3.152 | 2.0 | 2.0 | 3.0 | 61.0 | 1.83 |
| p[0] | 0.500 | 0.410 | 0.076 | 0.924 | 0.288 | 0.244 | 2.0 | 2.0 | 3.0 | 93.0 | 1.83 |
| p[1] | 0.500 | 0.410 | 0.076 | 0.924 | 0.288 | 0.244 | 2.0 | 2.0 | 3.0 | 93.0 | 1.83 |
- looking at the trace plots and summary table above, notice that the distributions for both means and both $p$ are bimodal and that the average values for these parameters are the same
  - caused by parameter non-identifiability: there are multiple equivalent solutions where one mean takes either mode and the other mean takes the remaining mode (and the same for $p$)
  - in mixture models, also known as the label-switching problem
- two ways to prevent non-identifiability in a MM:
  - force the components to be ordered (e.g. arrange the means of the components in increasing order)
  - use informative priors
- in PyMC3, can add `pm.Potential()` to the likelihood of the model
  - a potential is an arbitrary factor added to the likelihood that does not necessarily depend on the data
  - thus, does not add any parameters to the model
- example of adding a constraint to a model:
  - if the constraint is not violated, add 0 to the likelihood
  - if it is violated, add $-\infty$ to make the outcome impossible
  - thus, the model considers parameters that violate the constraint to be impossible without affecting the rest of the values
clusters = 2
with pm.Model() as model_mgp:
p = pm.Dirichlet("p", a=np.ones(clusters))
means = pm.Normal(
"means", mu=np.array([0.9, 1]) * cs_exp.mean(), sd=10, shape=clusters
)
sd = pm.HalfNormal("sd", sd=10)
order_means = pm.Potential(
"order_means", tt.switch(means[1] - means[0] < 0, -np.inf, 0)
)
y = pm.NormalMixture("y", w=p, mu=means, sd=sd, observed=cs_exp)
trace_mgp = pm.sample(1000, random_seed=123)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sd, means, p]
100.00% [4000/4000 00:22<00:00 Sampling 2 chains, 0 divergences]
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 29 seconds.
varnames = ["means", "p"]
az.plot_trace(trace_mgp, varnames)
plt.show()
az.summary(trace_mgp)
|  | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_mean | ess_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| means[0] | 46.811 | 0.438 | 46.040 | 47.687 | 0.009 | 0.007 | 2185.0 | 2185.0 | 2188.0 | 1432.0 | 1.0 |
| means[1] | 57.461 | 0.098 | 57.290 | 57.650 | 0.002 | 0.001 | 2255.0 | 2255.0 | 2253.0 | 1686.0 | 1.0 |
| p[0] | 0.090 | 0.009 | 0.072 | 0.106 | 0.000 | 0.000 | 1818.0 | 1818.0 | 1804.0 | 1639.0 | 1.0 |
| p[1] | 0.910 | 0.009 | 0.894 | 0.928 | 0.000 | 0.000 | 1818.0 | 1817.0 | 1804.0 | 1639.0 | 1.0 |
| sd | 3.651 | 0.074 | 3.506 | 3.788 | 0.002 | 0.001 | 2236.0 | 2222.0 | 2267.0 | 1619.0 | 1.0 |
- can also add a constraint to ensure that each component has a non-null probability
- i.e. that each component gets at least one observation
min_p = 0.01
p_min = pm.Potential("p_min", tt.switch(tt.min(p) < min_p, -np.inf, 0))
- general advice: start with a small number of components and increase it to improve model fit
- example: compare the fit of models with $K \in \{3, 4, 5, 6\}$
clusters = [3, 4, 5, 6]
models = []
traces = []
for cluster in clusters:
with pm.Model() as model:
p = pm.Dirichlet("p", a=np.ones(cluster))
means = pm.Normal(
"means",
mu=np.linspace(cs_exp.min(), cs_exp.max(), cluster),
sd=10,
shape=cluster,
transform=pm.distributions.transforms.ordered,
)
sd = pm.HalfNormal("sd", sd=10)
y = pm.NormalMixture("y", w=p, mu=means, sd=sd, observed=cs_exp)
trace = pm.sample(1000, tune=2000, random_seed=123)
traces.append(trace)
models.append(model)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sd, means, p]
100.00% [6000/6000 00:51<00:00 Sampling 2 chains, 0 divergences]
Sampling 2 chains for 2_000 tune and 1_000 draw iterations (4_000 + 2_000 draws total) took 58 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sd, means, p]
100.00% [6000/6000 00:57<00:00 Sampling 2 chains, 0 divergences]
Sampling 2 chains for 2_000 tune and 1_000 draw iterations (4_000 + 2_000 draws total) took 65 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sd, means, p]
100.00% [6000/6000 01:37<00:00 Sampling 2 chains, 0 divergences]
Sampling 2 chains for 2_000 tune and 1_000 draw iterations (4_000 + 2_000 draws total) took 105 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sd, means, p]
100.00% [6000/6000 02:29<00:00 Sampling 2 chains, 23 divergences]
Sampling 2 chains for 2_000 tune and 1_000 draw iterations (4_000 + 2_000 draws total) took 158 seconds.
There were 23 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.
- to show how $K$ affects inference, compare these models' fits with a kernel density estimate of the data
trace = traces[0]
def sample_ppc(trace, x_):
idx = np.random.randint(0, len(trace))
means_y = trace["means"][idx]
p_y = trace["p"][idx]
sd = trace["sd"][idx]
dist = stats.norm(means_y, sd)
y = np.sum(dist.pdf(x_) * p_y, 1)
return y
def sample_ppcs(trace, x, n_clusters, n=50):
x_ = np.array([x] * n_clusters).T
data = {i: sample_ppc(trace, x_) for i in range(n)}
data = pd.DataFrame(data)
data["x"] = x
data = data.melt(id_vars="x", var_name="sample_idx", value_name="y")
return data
def sample_ppc_per_cluster(trace, x, n_clusters):
x_ = np.array([x] * n_clusters).T
means_y = trace["means"].mean(0)
p_y = trace["p"].mean(0)
sd = trace["sd"].mean(0)
dist = stats.norm(means_y, sd)
y = dist.pdf(x_) * p_y
data = pd.DataFrame(y)
data["x"] = x
data = data.melt(id_vars="x", var_name="cluster", value_name="y")
data["cluster"] = data["cluster"].astype(str)
return data
def plot_mixture_model_ppc(sampled_ppc, cluster_ppc, n_clusters):
summary_ppc_df = sampled_ppc.groupby("x").mean().reset_index(drop=False)
p = (
gg.ggplot(sampled_ppc, gg.aes(x="x", y="y"))
+ gg.geom_line(gg.aes(group="sample_idx"), alpha=0.2, color="#1075B7")
+ gg.geom_line(data=summary_ppc_df, color="#004B8D", size=1, linetype="--")
+ gg.geom_line(gg.aes(group="cluster"), data=cluster_ppc, linetype="--")
+ gg.theme(figure_size=(4, 4))
+ gg.labs(
x="exp",
y="expected density",
title=f"Mixture Model with {n_clusters} clusters",
)
)
return p
x = np.linspace(cs_exp.min(), cs_exp.max(), 200)
for n_clusters, trace in zip(clusters, traces):
ppc_df = sample_ppcs(trace=trace, x=x, n_clusters=n_clusters)
clusters_ppc_df = sample_ppc_per_cluster(trace=trace, x=x, n_clusters=n_clusters)
print(plot_mixture_model_ppc(ppc_df, clusters_ppc_df, n_clusters))
mm_comparison = az.compare(dict(zip(clusters, traces)), method="BB-pseudo-BMA")
mm_comparison
UserWarning: The scale is now log by default. Use 'scale' argument or 'stats.ic_scale' rcParam if you rely on a specific value. A higher log-score (or a lower deviance) indicates a model with better predictive accuracy.
|  | rank | loo | p_loo | d_loo | weight | se | dse | warning | loo_scale |
|---|---|---|---|---|---|---|---|---|---|
| 6 | 0 | -5127.827401 | 12.447644 | 0.000000 | 0.760694 | 32.171885 | 0.000000 | False | log |
| 5 | 1 | -5129.245710 | 9.899071 | 1.418309 | 0.235350 | 31.679818 | 1.213311 | False | log |
| 4 | 2 | -5139.433919 | 7.436035 | 11.606518 | 0.003956 | 31.277297 | 4.356708 | False | log |
| 3 | 3 | -5178.648194 | 6.042494 | 50.820792 | 0.000000 | 31.258120 | 8.935154 | False | log |
az.plot_compare(mm_comparison)
plt.show()
- from the above plots and table, the mixture model with $K=6$ appears to have the best fit (of the models tested)
  - there is considerable overlap between the models with $K=4, 5, 6$ in the LOO analysis
- model-based clustering: using a probabilistic model to compute the probability of each data point belonging to each cluster
- a form of soft-clustering (instead of the binary hard-clustering)
- get the number of clusters from the data using a Dirichlet process (DP)
Non-parametric models are models with a theoretically infinite number of parameters. In practice, we let the data reduce the theoretically infinite number of parameters to some finite number; in other words, the data decides the actual number of parameters, which makes non-parametric models very flexible.
- DP is the infinite-dimensional generalization of the Dirichlet distribution
- a single draw (a realization) of a DP is a distribution
- can use a DP to assign a prior distribution to a non-fixed number of clusters
- can think of a DP as a way to sample from a prior distribution of distributions
- some relevant properties of a DP:
  - a DP is a distribution whose realizations are probability distributions
  - it is specified by a base distribution $\mathcal{H}$ and a positive real number $\alpha$ called the concentration parameter
  - $\mathcal{H}$ is the expected value of the DP
  - as $\alpha$ increases, the realizations become less concentrated
  - in the limit $\alpha \to \infty$, the realizations of a DP match the base distribution; thus if the base distribution is continuous, the realizations will be continuous, but this only happens in the limit
  - in practice, the DP almost always generates a discrete distribution
- can define a distribution by providing the probability $y$ of each possible value $x$
  - the values for $x$ are determined by the type of distribution (e.g. a Gaussian covers all real numbers, a Poisson the nonnegative integers)
  - the values for $y$ (must sum to 1) are generated using a Gedankenexperiment known as the stick-breaking process
    - break a stick into 2 parts, break one of those pieces in 2, and so on for $K$ times
    - $\alpha$ determines how small we break the sticks (larger $\alpha$ means smaller pieces, and $\alpha \to \infty$ means breaking the stick into infinitely many pieces)
  - the following plot shows four results of this process from four different values of $\alpha$ (an explanation of the function `stick_breaking_truncated()` is provided later)
def stick_breaking_truncated(alpha, H, K):
betas = stats.beta.rvs(1, alpha, size=K)
w = np.empty(K)
w = betas * np.concatenate(([1.0], np.cumprod(1 - betas[:-1])))
locs = H.rvs(size=K)
return locs, w
K = 500
H = stats.norm # DP base distribution: Gaussian normal
alphas = [1, 10, 100, 1000]
stick_break_res = None
for idx, alpha in enumerate(alphas):
locs, w = stick_breaking_truncated(alpha, H, K)
data = pd.DataFrame({"locs": locs, "w": w, "alpha": f"$\\alpha$: {alpha}"})
if stick_break_res is None:
stick_break_res = data
else:
stick_break_res = pd.concat([stick_break_res, data])
(
gg.ggplot(stick_break_res)
+ gg.facet_wrap("alpha", scales="free")
+ gg.geom_linerange(gg.aes(x="locs", ymin=0, ymax="w"), color="#0C73B4")
+ gg.theme(subplots_adjust={"wspace": 0.25, "hspace": 0.5})
+ gg.labs(
x="x",
y="probability",
title="Dirichlet process distributions at different concentration parameters $\\alpha$",
)
)
- notes on the above plot:
  - the DP is a discrete distribution
  - as $\alpha$ increases, the DP pieces get smaller and the realization better resembles the base distribution
- to build a mixture, place a Gaussian (or whatever kernel distribution) at the location of each substick of the DP realization and scale/weight that Gaussian by the length of the substick
- below is an example of that process
alpha = 10
H = stats.norm
K = 5
np.random.seed(124)
x = np.linspace(-4, 4, 250)
locs, w = stick_breaking_truncated(alpha, H, K)
print(f"locations: {locs}")
print(f"weights: {w}")
locations: [ 0.44167705  0.80818623  0.23199239 -1.40035848  0.71590778]
weights: [0.00993163 0.03023728 0.15708906 0.08662139 0.03511271]
dist = stats.norm(locs, 0.5)
x_ = np.array([x] * K).T # Make K copies of `x`
prob = dist.pdf(x_) * w
prob_cum = np.sum(prob, 1)
data = pd.DataFrame(prob)
data["x"] = x
data = data.melt(id_vars="x", var_name="peak_idx", value_name="y")
data["peak_idx"] = data["peak_idx"].astype(str)
data_cum = pd.DataFrame({"x": x, "y": prob_cum})
(
gg.ggplot(data_cum, gg.aes(x="x", y="y"))
+ gg.geom_line(color="black", size=4, alpha=0.2)
+ gg.geom_line(
gg.aes(color="peak_idx"), data=data, linetype="--", size=1, alpha=0.9
)
+ gg.scale_color_brewer(type="qual", palette="Set1", guide=False)
+ gg.labs(
x="x", y="probability", title="Example DP using the stick-breaking process"
)
)
- explanation of `stick_breaking_truncated()`:
  - a realization of a DP can be written as

$$f(\theta) = \sum_{k=1}^{\infty} w_k \, \delta_{\theta_k}(\theta)$$

  (the function truncates this sum at $K$ terms) where:
  - $\delta_{\theta_k}$: indicator function that evaluates to 0 everywhere except at $\theta = \theta_k$, where $\delta_{\theta_k}(\theta_k) = 1$; the $\theta_k$ are the locations sampled from the base distribution $\mathcal{H}$
  - $w_k$: probabilities given by the following equation

$$w_k = \beta^{\prime}_k \prod_{i=1}^{k-1} \left(1 - \beta^{\prime}_i\right)$$

  where:
  - $w_k$: length of a substick
  - $\prod_{i=1}^{k-1} (1-\beta^{\prime}_{i})$: length of the remaining portion (the part of the stick to keep breaking)
  - $\beta^\prime_k \sim \text{Beta}(1, \alpha)$: indicates how to break the remaining portion
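As a quick sanity check (using the `stick_breaking_truncated()` defined earlier; the $\alpha$ and $K$ values are arbitrary choices), the truncated weights should sum to approximately 1 for a large enough $K$:

```python
_, w_check = stick_breaking_truncated(alpha=10, H=stats.norm, K=1000)
print(w_check.sum())  # ≈ 1 for a sufficiently large truncation K
```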
- can now try an example in PyMC3
  - need a version of `stick_breaking()` that works with PyMC3
  - also common to define the prior for $\alpha$ as a Gamma distribution
def stick_breaking_pymc3(alpha, K):
beta = pm.Beta("beta", 1.0, alpha, shape=K)
w = beta * pm.math.concatenate([[1.0], tt.extra_ops.cumprod(1.0 - beta)[:-1]])
return w
N = cs_exp.shape[0]
K = 20
with pm.Model() as model:
alpha = pm.Gamma("alpha", 1.0, 1.0)
w = pm.Deterministic("w", stick_breaking_pymc3(alpha, K))
means = pm.Normal("means", mu=cs_exp.mean(), sd=10, shape=K)
sd = pm.HalfNormal("sd", sd=10, shape=K)
obs = pm.NormalMixture("obs", w, means, sd=sd, observed=cs_exp.values)
trace = pm.sample(
draws=1000, init="advi", n_init=100000, tune=2000, target_accept=0.9
)
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
58.28% [58283/100000 02:44<01:57 Average Loss = 5,263.6]
Convergence achieved at 58300
Interrupted at 58,299 [58%]: Average Loss = 5,301.9
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sd, means, beta, alpha]
100.00% [6000/6000 31:49<00:00 Sampling 2 chains, 129 divergences]
Sampling 2 chains for 2_000 tune and 1_000 draw iterations (4_000 + 2_000 draws total) took 1918 seconds.
There were 61 divergences after tuning. Increase `target_accept` or reparameterize.
There were 68 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.
- the distribution of `alpha` is centered around 0.5, suggesting only a few components were needed by the model
az.plot_trace(trace["alpha"])
plt.show()
- important to check that $K$ was large enough by plotting the average weight of each component
  - there should be at least a few components with negligible weight; otherwise, increase $K$
d = pd.DataFrame({"w": np.arange(K), "avg": trace["w"].mean(0)})
(
gg.ggplot(d, gg.aes(x="w", y="avg"))
+ gg.geom_line()
+ gg.geom_point()
+ gg.labs(x="component", y="average weight")
)
- the plot below shows the posterior predictions of the model
x_plot = np.linspace(cs.exp.min() - 1, cs.exp.max() + 1, 200)
post_pdf_contribs = stats.norm.pdf(
np.atleast_3d(x_plot),
trace["means"][:, np.newaxis, :],
trace["sd"][:, np.newaxis, :],
)
post_pdfs = (trace["w"][:, np.newaxis, :] * post_pdf_contribs).sum(axis=-1)
post_pdfs_df = (
pd.DataFrame(post_pdfs.T[:, :50])
.assign(x=x_plot)
.melt(id_vars="x", var_name="sample_idx")
.assign(sample_idx=lambda d: d["sample_idx"].astype(str))
)
(
gg.ggplot(post_pdfs_df)
+ gg.geom_density(
gg.aes(x="exp"),
data=cs,
color="purple",
fill="purple",
linetype="--",
alpha=0.1,
size=0.5,
)
+ gg.geom_line(
gg.aes(x="x", y="value", group="sample_idx"), alpha=0.1, color="black"
)
+ gg.geom_line(
gg.aes(x="x", y="value"),
data=post_pdfs_df.groupby("x").mean().reset_index(drop=False),
alpha=0.8,
color="blue",
size=1,
)
+ gg.labs(
x="exp",
y="probability density",
title="Posterior predictions of Dirichlet process MM",
)
)
- examples from previous chapters:
- zero-inflated distribution
- robust logistic regression (mixture of a random guessing model and logistic regression)
- can interpret hierarchical models as continuous mixture models
- the beta-binomial is a discrete distribution used to describe the number of successes $y$ in $n$ Bernoulli trials with probability of success $p$ for each trial
  - parameterized by $\alpha$ and $\beta$
  - find the probability of observing an outcome $y$ by averaging over all possible values of $p$ (a continuous variable):

$$\text{BetaBinomial}(y \mid n, \alpha, \beta) = \int_0^1 \text{Bin}(y \mid p, n) \, \text{Beta}(p \mid \alpha, \beta) \, dp$$

  - see the numerical sketch below
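A numerical sketch of this averaging (toy parameters, not from the book; reuses the `numpy`/`scipy.stats` imports above): drawing $p$ from a beta and then $y$ from a binomial reproduces scipy's closed-form beta-binomial pmf.

```python
# Assumed toy parameters for illustration.
n_trials, a, b = 10, 2.0, 3.0
p_draws = stats.beta(a, b).rvs(size=50_000, random_state=0)
y_draws = stats.binom(n_trials, p_draws).rvs(random_state=1)

# Monte Carlo mixture vs. the closed-form beta-binomial pmf.
print(np.bincount(y_draws, minlength=n_trials + 1) / len(y_draws))
print(stats.betabinom(n_trials, a, b).pmf(np.arange(n_trials + 1)))
```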
- the negative-binomial is a Gamma-Poisson mixture
  - a mixture of Poisson distributions where the rate parameter $\lambda$ follows a Gamma distribution
  - often used to address over-dispersion: when the variance of the data is greater than the mean
  - thus, model the data as a mixture of Poisson distributions rather than a single Poisson (see the sketch below)
- both the beta-binomial and the negative-binomial, as well as their zero-inflated counterparts, are implemented in PyMC3
%load_ext watermark
%watermark -d -u -v -iv -b -h -m
Last updated: 2021-01-06
Python implementation: CPython
Python version : 3.9.1
IPython version : 7.19.0
Compiler : Clang 10.0.0
OS : Darwin
Release : 20.1.0
Machine : x86_64
Processor : i386
CPU cores : 4
Architecture: 64bit
Hostname: JHCookMac
Git branch: master
pandas : 1.2.0
arviz : 0.10.0
scipy : 1.6.0
matplotlib: 3.3.3
theano : 1.0.5
seaborn : 0.11.1
plotnine : 0.7.1
numpy : 1.19.4
pymc3 : 3.9.3