The only way to deal with large distributions is to constrain the nature of the variable interactions in some manner. The key idea is to specify which variables are independent of others, leading to a structured factorisation of the joint probability distribution.
Belief networks are a way to depict the independence assumptions made in a distribution. Whilst not strictly necessary, in expressing these independencies it can be useful (though also potentially misleading) to think of ‘what causes what’.
Reducing the burden of specification
Consider a discrete variable $y$ with many discrete parental variables $x_1, \dots, x_n$.
Formally, the structure of the graph implies nothing about the form of the parameterisation of the table $p(y \mid x_1, \dots, x_n)$. If each parent $x_i$ has $\dim(x_i)$ states, and there is no constraint on the table, then $p(y \mid x_1, \dots, x_n)$ contains $(\dim(y) - 1)\prod_i \dim(x_i)$ entries.
If stored explicitly for each state, this would require potentially huge storage. An alternative is to constrain the table to have a simpler parametric form. For example, one might write a decomposition in which only a limited number of parental interactions are required (this is called divorcing parents). For example, introducing intermediate variables $z_1$ and $z_2$, we have
$p(y \mid x_1, \dots, x_5) = \sum_{z_1, z_2} p(y \mid z_1, z_2)p(z_1\mid x_1, x_2, x_3)p(z_2 \mid x_4, x_5)$
Assuming all variables are binary, the number of states requiring specification is $2^3 + 2^2 + 2^2 = 16$, compared to the $2^5 = 32$ states in the unconstrained case.
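As a quick check on this counting, the following minimal Python sketch compares the two parameterisations (the `entries` helper is purely illustrative and not part of the text):

```python
# A minimal sketch checking the entry counts quoted above,
# assuming all variables are binary.

def entries(dim_child, parent_dims):
    """Number of free entries in p(child | parents):
    (dim(child) - 1) * prod_i dim(parent_i)."""
    n = dim_child - 1
    for d in parent_dims:
        n *= d
    return n

# Unconstrained table p(y | x1, ..., x5): (2 - 1) * 2^5 = 32 entries.
unconstrained = entries(2, [2] * 5)

# Divorced parameterisation:
#   p(y | z1, z2) + p(z1 | x1, x2, x3) + p(z2 | x4, x5)
divorced = entries(2, [2, 2]) + entries(2, [2, 2, 2]) + entries(2, [2, 2])

print(unconstrained, divorced)  # 32 16
```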
Logic gates
Another technique to constrain tables uses simple classes of conditional tables. For example,
one could use a logic OR gate on binary $z_i$, say
$p(y = 1 \mid z_1, \dots, z_5) = \begin{cases} 1 & \text{if at least one of the } z_i \text{ is in state } 1 \\ 0 & \text{otherwise} \end{cases}$
We can then make a table $p(y \mid x_1, \dots, x_5)$ by including the additional terms $p(z_i = 1 \mid x_i)$. When each $x_i$ is binary, there are in total only $2 + 2 + 2 + 2 + 2 = 10$ quantities required to specify $p(y \mid x)$. In this case, fig(3.2c) can be used to represent any noisy logic gate, such as the noisy OR or noisy AND, where the number of parameters required to specify the noisy gate is linear in the number of parents.
The noisy-OR is particularly common in disease–symptom networks, in which many diseases $x_i$ can give rise to the same symptom $y$: provided that at least one of the diseases is present, the probability that the symptom will be present is high.
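The sketch below (illustrative only; the parameter values and the `noisy_or` helper are not from the text) computes $p(y = 1 \mid x_1, \dots, x_5)$ for a noisy-OR by marginalising out the $z_i$: since $y = 0$ only if every $z_i = 0$, we have $p(y = 1 \mid x) = 1 - \prod_i p(z_i = 0 \mid x_i)$.

```python
# A minimal noisy-OR sketch. Each parent x_i can "activate" the intermediate
# z_i with probability p(z_i = 1 | x_i); the deterministic OR gate then sets
# y = 1 if any z_i = 1, so p(y = 1 | x) = 1 - prod_i p(z_i = 0 | x_i).

def noisy_or(x, p_z1_given_x):
    """x: list of binary parent states.
    p_z1_given_x: list of pairs (p(z_i=1 | x_i=0), p(z_i=1 | x_i=1))."""
    p_all_z_off = 1.0
    for xi, (p_off, p_on) in zip(x, p_z1_given_x):
        p_zi_on = p_on if xi == 1 else p_off
        p_all_z_off *= 1.0 - p_zi_on
    return 1.0 - p_all_z_off  # p(y = 1 | x)

# Example: 5 diseases, each (if present) producing the symptom with prob 0.8,
# and a small 0.01 'leak' probability when absent; 10 parameters in total.
params = [(0.01, 0.8)] * 5
print(noisy_or([1, 0, 0, 0, 0], params))  # symptom likely if one disease is present
```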
We make a distinction between evidence that is uncertain and evidence that is reliable.
Uncertain evidence
In soft or uncertain evidence, the evidence variable is in more than one state, with the strength of our belief about each state being given by probabilities. For example, if x has the states dom(x) = {red,blue,green} the vector (0.6, 0.1, 0.3) represents the belief in the respective states. In contrast, for hard evidence we are certain that a variable is in a particular state. In this case, all the probability mass is in one of the vector components, for example (0, 0, 1).
Performing inference with soft evidence is straightforward and can be achieved using Bayes’ rule. For example, for a model $p(x, y)$, consider that we have some soft evidence $\tilde{y}$ about the variable $y$, and wish to know what effect this has on the variable $x$; that is, we wish to compute $p(x \mid \tilde{y})$. From Bayes’ rule, and the assumption $p(x \mid y, \tilde{y}) = p(x \mid y)$, we have
$p(x \mid \tilde{y}) = \sum_y p(x, y \mid \tilde{y}) = \sum_y p(x \mid y, \tilde{y}) p(y \mid \tilde{y}) = \sum_y p(x \mid y)p(y \mid \tilde{y})$
where $p(y = i \mid \tilde{y})$ represents the probability that $y$ is in state $i$ under the soft evidence. This is a generalisation of hard evidence, in which the vector $p(y \mid \tilde{y})$ has all components equal to zero except for a single unit component. This procedure, in which we first define the model conditional on the evidence and then average over the distribution of the evidence, is also known as Jeffrey’s rule.
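A small sketch of this computation follows (the joint table and the soft-evidence vector are made-up numbers); hard evidence is recovered when the soft-evidence vector places all its mass on a single state.

```python
import numpy as np

# Jeffrey's rule for a toy discrete model p(x, y).
p_xy = np.array([[0.3, 0.1],     # p(x, y): rows index x, columns index y
                 [0.2, 0.4]])

p_y = p_xy.sum(axis=0)           # p(y)
p_x_given_y = p_xy / p_y         # p(x | y): each column sums to 1

p_y_soft = np.array([0.7, 0.3])  # soft evidence p(y | y~)

# Jeffrey's rule: p(x | y~) = sum_y p(x | y) p(y | y~)
p_x_given_soft = p_x_given_y @ p_y_soft
print(p_x_given_soft)            # a distribution over x; sums to 1
```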
In the BN we use a dashed circle to represent that a variable is in a soft-evidence state.
Definition 3.1 (Belief network)
A belief network is a distribution of the form
$p(x_1, \dots, x_D) = \displaystyle\prod_{i=1}^D p(x_i \mid pa(x_i))$
where $pa(x_i)$ represents the parental variables of variable $x_i$. Represented as a directed graph, with an arrow pointing from a parent variable to its child variable, a belief network corresponds to a Directed Acyclic Graph (DAG), with the $i^{th}$ node in the graph corresponding to the factor $p(x_i \mid pa(x_i))$.
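As a minimal illustration of this factorisation (the chain $a \rightarrow b \rightarrow c$ and its table values below are hypothetical, not taken from the text), the joint distribution is simply the product of the conditional factors:

```python
# p(a, b, c) = p(a) p(b | a) p(c | b) for a small chain a -> b -> c.
p_a = {0: 0.4, 1: 0.6}                                    # p(a)
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}  # p(b | a)[a][b]
p_c_given_b = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.2, 1: 0.8}}  # p(c | b)[b][c]

def joint(a, b, c):
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# The joint sums to 1 over all states, as required of a distribution.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)  # 1.0
```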
Remark 3.2 (Graphs and distributions)
There is potential ambiguity as to whether a belief network corresponds to a specific instance of a distribution (requiring also the numerical specification of the conditional probability tables), or whether it refers to any distribution which is consistent with the specified structure. In this sense one can distinguish between a belief network distribution (containing a numerical specification) and a belief network graph (which contains no numerical specification).
Remark 3.3 (Dependencies and the Markov Blanket)
Consider a distribution on a set of variables $X$. For a variable $x_i \in X$ and corresponding belief network represented by a DAG $G$, let $MB(x_i)$ be the variables in the Markov blanket of $x_i$ (its parents, its children, and the parents of its children). Then for any other variable $y$ that is not in the Markov blanket of $x_i$, we have $x_i \bot y \mid MB(x_i)$. That is, the Markov blanket of $x_i$ carries all information about $x_i$. As an example, for fig(3.2b), $MB(z_1) = \{x_1, x_2, x_3, y, z_2\}$ and $z_1 \bot x_4 \mid MB(z_1)$.
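A small sketch of how the Markov blanket can be read off a DAG given as an edge list; the edge list below is my reading of the divorcing graph of fig(3.2b), and the helper name is arbitrary:

```python
# Markov blanket = parents, children, and the children's other parents,
# computed from a DAG represented as a list of directed edges.

edges = [('x1', 'z1'), ('x2', 'z1'), ('x3', 'z1'),
         ('x4', 'z2'), ('x5', 'z2'),
         ('z1', 'y'), ('z2', 'y')]

def markov_blanket(node, edges):
    parents = {p for p, c in edges if c == node}
    children = {c for p, c in edges if p == node}
    co_parents = {p for p, c in edges if c in children and p != node}
    return parents | children | co_parents

print(markov_blanket('z1', edges))  # {'x1', 'x2', 'x3', 'y', 'z2'}
```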
The DAG corresponds to a statement of conditional independencies in the model. To complete the specification of the BN we also need to define the entries of the conditional probability tables (CPTs) $p(x_i \mid pa(x_i))$: for every possible state of the parental variables $pa(x_i)$, a value for each of the states of $x_i$ needs to be specified. For a large number of parents, writing out a table of values is intractable, and the tables are usually parameterised in a low-dimensional manner.
Conditional independence
Whilst a BN corresponds to a set of conditional independence assumptions, it is not always immediately clear from the DAG whether a set of variables is conditionally independent of a set of other variables. For example, in fig(3.5) are $x_1$ and $x_2$ independent, given the state of $x_4$ ?
$p(x_1, x_2 \mid x_4) = \frac{1}{p(x_4)} \displaystyle\sum_{x_3} p(x_1, x_2, x_3, x_4) = \frac{1}{p(x_4)}\displaystyle\sum_{x_3} p(x_1 \mid x_4)p(x_2 \mid x_3, x_4)p(x_3)p(x_4)$
$= p(x_1 \mid x_4) \displaystyle \sum_{x_3} p(x_2 \mid x_3, x_4)p(x_3)$
Now
$p(x_2 \mid x_4) = \frac{1}{p(x_4)} \displaystyle\sum_{x_1, x_3} p(x_1, x_2, x_3, x_4) = \frac{1}{p(x_4)}\displaystyle\sum_{x_1, x_3} p(x_1 \mid x_4)p(x_2 \mid x_3, x_4)p(x_3)p(x_4)$
$= \displaystyle \sum_{x_3} p(x_2 \mid x_3, x_4)p(x_3)$
where we have used $\displaystyle \sum_{x_1} p(x_1 \mid x_4) = 1$.
Combining the two results above we have
$p(x_1, x_2 \mid x_4) = p(x_1 \mid x_4)p(x_2 \mid x_4)$
So that $x_1$ and $x_2$ are indeed independent conditional on $x_4$.
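The same conclusion can be checked numerically. The sketch below (random CPT values, names arbitrary) builds the joint from the factorisation implied by fig(3.5) and verifies that $p(x_1, x_2 \mid x_4) = p(x_1 \mid x_4)p(x_2 \mid x_4)$:

```python
import numpy as np

# Verify x1 _|_ x2 | x4 for the factorisation
# p(x1, x2, x3, x4) = p(x1 | x4) p(x2 | x3, x4) p(x3) p(x4),
# using random binary tables (the CPT values are arbitrary).
rng = np.random.default_rng(0)

def random_cpt(*shape):
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)  # normalise over the child (first axis)

p_x1_g_x4 = random_cpt(2, 2)       # [x1, x4]
p_x2_g_x3x4 = random_cpt(2, 2, 2)  # [x2, x3, x4]
p_x3 = random_cpt(2)               # [x3]
p_x4 = random_cpt(2)               # [x4]

# joint[x1, x2, x3, x4]
joint = np.einsum('ad,bcd,c,d->abcd', p_x1_g_x4, p_x2_g_x3x4, p_x3, p_x4)

marg_x4 = joint.sum(axis=(0, 1, 2))                  # p(x4)
p_x1x2_g_x4 = joint.sum(axis=2) / marg_x4            # p(x1, x2 | x4)
p_x1_g_x4_check = joint.sum(axis=(1, 2)) / marg_x4   # p(x1 | x4)
p_x2_g_x4_check = joint.sum(axis=(0, 2)) / marg_x4   # p(x2 | x4)

product = np.einsum('ad,bd->abd', p_x1_g_x4_check, p_x2_g_x4_check)
print(np.allclose(p_x1x2_g_x4, product))  # True
```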
To help develop intuition, consider the three variable distribution $p(x_1, x_2, x_3)$. We may write this in any of the 6 ways:
$p(x_1, x_2, x_3) = p(x_{i_1} \mid x_{i_2}, x_{i_3})p(x_{i_2} \mid x_{i_3})p(x_{i_3})$
where $(i_1, i_2, i_3)$ is any of the 6 permutations of (1, 2, 3). Whilst each factorisation produces a different DAG, all represent the same distribution, namely one that makes no independence statements. If the DAGs are of the cascade form, no independence assumptions have been made. The minimal independence assumptions then correspond to dropping a single link in the cascade graph. This gives rise to the 4 DAGs in fig(3.6). Are any of these graphs equivalent, in the sense that they represent the same distribution?
Applying Bayes’ rule, $p(x_1 \mid x_3)p(x_2 \mid x_3)p(x_3) = p(x_1 \mid x_3)p(x_2, x_3) = p(x_1 \mid x_3)p(x_3 \mid x_2)p(x_2)$, and similarly $= p(x_2 \mid x_3)p(x_3 \mid x_1)p(x_1)$, so that DAGs (b), (c) and (d) represent the same conditional independence (CI) assumptions: given the state of variable $x_3$, variables $x_1$ and $x_2$ are independent, $x_1 \bot x_2 \mid x_3$.
However, graph (a) represents something fundamentally different, namely: $p(x_1, x_2) = p(x_1)p(x_2)$. There is no way to transform the distribution $p(x_3 \mid x_1, x_2)p(x_1)p(x_2)$ into any of the others.
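A numerical sketch of this point (arbitrary random tables, names hypothetical): a joint built in the collider form has $x_1$ and $x_2$ marginally independent, but in general they are dependent given $x_3$, so the distribution cannot be rewritten in any of the forms (b), (c) or (d), all of which imply $x_1 \bot x_2 \mid x_3$.

```python
import numpy as np

rng = np.random.default_rng(2)

def cpt(*shape):
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)  # normalise over the child (first axis)

p3_g_12 = cpt(2, 2, 2)   # p(x3 | x1, x2), indexed [x3, x1, x2]
p1, p2 = cpt(2), cpt(2)  # p(x1), p(x2)

# joint[x1, x2, x3] = p(x3 | x1, x2) p(x1) p(x2)
joint = np.einsum('cab,a,b->abc', p3_g_12, p1, p2)

# Marginally, x1 and x2 are independent by construction:
p12 = joint.sum(axis=2)
print(np.allclose(p12, np.outer(p1, p2)))            # True

# But conditioned on x3 they are, in general, dependent:
p3 = joint.sum(axis=(0, 1))                          # p(x3)
p12_g_3 = joint / p3                                 # p(x1, x2 | x3)
p1_g_3 = joint.sum(axis=1) / p3                      # p(x1 | x3)
p2_g_3 = joint.sum(axis=0) / p3                      # p(x2 | x3)
print(np.allclose(p12_g_3, np.einsum('ac,bc->abc', p1_g_3, p2_g_3)))  # False
```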
Remark 3.4 (Graphical Dependence)
Belief networks (graphs) are good for encoding conditional independence but are not well suited for encoding dependence. For example, consider the graph $a \rightarrow b$. This may appear to encode the relation that $a$ and $b$ are dependent. However, a specific numerical instance of a belief network distribution could be such that $p(b \mid a) = p(b)$, for which $a \bot b$. When the DAG appears to show ‘graphical’ dependence, there can be instances of the distribution for which dependence does not follow. The same caveat holds for Markov networks.
Conditional independence:
$p(A \mid C, B) = p(A \mid C)$
If we know the variable $C$, then $A$ and $B$ are conditionally independent.
$B \bot C \mid A$
So if we know that $B$ and $C$ are independent given $A$, does this imply $B \bot C$?
The answer is no.
Intuitively, observing $B$ gives information about $A$, which in turn changes the probability of $C$, so $B$ and $C$ are in general dependent.
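As a small worked illustration (the numbers are arbitrary), take the cascade $B \leftarrow A \rightarrow C$ with binary variables, $p(A=1) = 0.5$, $p(B=1 \mid A=0) = 0.1$, $p(B=1 \mid A=1) = 0.9$, $p(C=1 \mid A=0) = 0.2$, $p(C=1 \mid A=1) = 0.8$. By construction $B \bot C \mid A$. Marginally, however, $p(B=1) = 0.5$ and $p(C=1) = 0.5$, while
$p(B=1, C=1) = 0.5(0.1 \times 0.2) + 0.5(0.9 \times 0.8) = 0.37 \neq 0.25 = p(B=1)p(C=1)$
so $B$ and $C$ are dependent.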