-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathK2_algorithm.Rmd
290 lines (222 loc) · 9.55 KB
/
K2_algorithm.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
---
jupyter:
jupytext:
formats: ipynb,Rmd
text_representation:
extension: .Rmd
format_name: rmarkdown
format_version: '1.2'
jupytext_version: 1.4.1
kernelspec:
display_name: R
language: R
name: ir
---
# Bayesian network structure from database
Bayesian networks are networks that can provide insight into probabilistic dependencies that exist among variables in a dataset. In particular we focus on using a Bayesian belief network as a model of probabilistic dependency.
A Bayesian belief-network $B_S$ is a directed acyclic graph in which nodes represent domain variables and the links represent probabilistic dependencies. Variables can be discrete of continuous, but in this work we will focus on discrete variables.
A Bayesian belief structure , $B_S$ is augmented by conditional probabilities, $B_P$, to form a Bayesian belief network $B=(B_S, B_P)$. For each node $x_i$ there is a conditional probability function that relates the node to its immediate predecessor, called parents and denoted with $\pi_i$. If a node has no parents a prior probability $P(x_i)$ is specified.
The key feature of belief networks is their explicit representation of the conditional indipendence and dependence among events. In particular the joint probability of any instantiation of all $n$ variables in a belief network can be calculated as follows:
$$
P(X_1, \ldots,X_n)=\prod_{i=1}^n P(X_i|\pi_i)
$$
where $X_i$ represent the instantiation of the variable $x_i$ and the $\pi_i$ are the parents of $x_i$.
## The basic model
Let $D$ be a database of cases, $Z$ be the set of variables represented by $D$ and $B_{S_i}$, $B_{S_j}$ be two belief networks structures containing exactly those variables that are in Z. Then:
$$
\frac{P(B_{S_i}|D)}{P(B_{S_j}|D)}\stackrel{prod-rule}{=}\frac{P(B_{S_i},D)}{P(B_{S_j},D)}
$$
We now make some assumptions for computing $P(B_S,D)$ efficiently:
\begin{enumerate}
\item \textbf{ The database variables, which we denote with $Z$, are discrete}. We thus have:
$$
P(B_S, D)=\int_{B_P}P(D|B_s,B_P)f(B_P|B_S)P(B_S)dB_P
\label{discr}
$$
where $B_P$ is a vector which denotes the conditional probability assignments associated with belief network structure $B_S$, and $f$ is the conditional probability density function over $B_P$ given $B_S$.
\item \textbf{ Cases occur indelendently, given a belief network model}. The Eq \ref{discr} becomes:
$$
P(B_S, D)=\int_{B_P}\left[\prod_{h=1}^mP(C_h|B_S,B_P)\right]f(B_P|B_S)P(B_S)dB_P
$$
where $m$ is the number of cases in $D$ and $C_h$ is the $h$-th case in $D$.
\item \textbf{ There are no cases that have variables with missing values.}
\item \textbf{ The density function $f(B_P|B_S)$ is uniform.}
\end{enumerate}
We now denote with $w_{ij}$ the $j$-th of $q_i$ unique instantiations of the values of the variables in $\pi_i$, relative to the ordering of the cases in $D$. We then define $v_{ik}$ the $k$-th of the $r_i$ possible instantiations of the variable $x_i$ in $Z$. We lastly define $N_{ijk}$ as the number of casas in $D$ in which the variable $x_i$ has value $v_{ik}$ and $\pi_i$ is instantiated as $w_{ij}$.
Calling:
\begin{equation*}
N_{ij}=\sum_{k=1}^{r_i}N_{ijk}
\end{equation*}
we finally get:
$$
P(B_S,D)=P(B_S)\prod_{i=1}^n\prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!}\prod_{k=1}^{r_i}N_{ijk}!
$$
Once we have $P(B,D)$ it is easy to find the conditional probability. Let $Q$ be the set of all those belief-network structures that contain just the variables in set $Z$, then:
$$
P(B_{S_i}|D)=\frac{P(B_{S_i},D)}{\sum_{B_S\in Q}P(B_S,D)}
$$
We can also look at smaller part of the network. Let $G$ be a belief network structure, such that the variables in G are a subset of those in $Z$. Let $R$ be the set of those structures in $Q$ that contains $G$ as a subgraph. Then:
$$
P(G|D)=\frac{\sum_{B_S\in R}P(B_S,D)}{\sum_{B_S\in Q}P(B_S,D)}
$$
## K2 algorithm
However the size and the computational time of finding the structure of a belief network increase exponentially with $n$. So an exact method is not possible: we will apply a greedy search algorithm known as K2. This algorithm begins by making the assumption that a node has no parents. and then adds incrementally that parent whose addition most increase the probability of the resulting structure. We stop the procedure when the addition of no single parent does not increase further the probability.
We define:
$$
g(i,\pi_i)=\prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!}\prod_{k=1}^{r_i}N_{ijk}!
$$
We also define the function Pred($x_i$) which returns the set of nodes that precede $x_i$ in the node ordering. We thus have the procedure:
\begin{enumerate}
\item *Input*: a set of $n$ nodes, an ordering of the nodes, an upper bound $u$ to the number of parents a node may have and a database $D$ containing $m$ cases.
\item *Output*: For each node a printout of the parents of the node.
\end{enumerate}
```{r}
# Pseudocode, not to run
for(i in 1:n){ # Cycle over the nodes
pi.i <- c() # Empty vector
p.old <- g(i, pi.i) # Function defined above
OKtoGO <- TRUE
while(OKtoGO & length(pi.i)<u ){
# let z be the node in Pred(x_i)-pi.i that maximizes g(i, pi.i U {z} )
z <- which.max( g(i, c(pi.i, pred(x_i)) ) )
p.new <- g(i, pi.i)
if(p.new >p.old){ # The new configuration is more probable
p.old <- p.new
pi.i <- c(pi.i, z) # New parents adding z
}
else{ OKtoGO <- FALSE }
}
cat('Node:', x.i, ' Parents of this node:', pi.i)
}
```
```{r}
# data. x1, x2, x3 are binary variables and we indicate with 0 "absent"
# and with 1 "present"
x1 <- c(1, 1, 0, 1, 0, 0, 1, 0, 1, 0)
x2 <- c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0)
x3 <- c(0, 1, 1, 1, 0, 1, 1, 0, 1, 0)
df <- data.frame(x1, x2, x3)
```
$$ f(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! $$
$$ \log f(i, \pi_i) = \sum_{j=1}^{q_i} \Big[ \log (r_i - 1)! - \log (N_{ij} + r_i - 1)! + \sum_{k=1}^{r_i} \log N_{ijk}! \Big]$$
```{r}
#Precompute all needed factorials (1 to m + r -1)
n <- ncol(df)
m <- nrow(df)
r <- 2 #number of possible values
factorials <- log(c(1, 1:(m+r-1)))
for (i in 4:length(factorials)) { #first 3 are already correct
factorials[i] <- factorials[i] + factorials[i-1]
}
#factorials <- log(factorials)
```
```{r}
f <- function(index, parents, database) {
#Compute N_{ijk}
p <- length(parents)
todec <- 2 ** (0:max(0,p-1)) # (2, 4, 8)
#(1, 2, 4, 8)
qi <- 2**p #can be adapted to generic r
N <- matrix(data = 0, nrow = qi, ncol = 2)
for (i in 1:nrow(database)) {
index.k <- database[i, index] + 1 #(0, 1) -> (1, 2)
if (p > 0)
index.j <- sum(database[i, parents] * todec) + 1 # 10 -> 2
else
index.j <- 1
N[index.j, index.k] <- N[index.j, index.k] + 1
}
#Compute N_{ij}
Nj <- rowSums(N)
# print(N)
# print(Nj)
#Compute the log(factorials)
logN <- matrix(data = factorials[N+1], ncol=2) #+1 because indices start from 1
# print(logN)
factor.first <- factorials[2] #2-1+1
factor.second <- factorials[Nj+2] #+r_i -1 +1 (because of the indices) = +r_i = +2
factor.third <- rowSums(logN)
# print(factor.first)
# print(factor.second)
# print(factor.third)
result <- exp(sum(factor.first - factor.second + factor.third)) #factor.first could be brought outside to gain precision (and multiply by qi)
return(result)
}
#0 is first, 1 is second
```
```{r}
a <- f(3, c(1,2), df) # [[3, 1], [0, 1], [1, 0], [0, 4]]
```
```{r}
ordering = c(1,2,3)
ordering[1:(which(ordering == 3)-1)]
```
```{r}
K2 <- function(database, u, ordering) {
# database : dataframe m x n, containing no missing values. Entries must be binary. (could be extended to integers 1 to r_i)
# u : maximum allowed number of parents for each node
# ordering : a vector containing a permutation of the first n integers
n <- ncol(database)
structure <- matrix(data = 0, nrow = n, ncol = n)
#structure[i,j] = 1 if j is a parent of i, 0 otherwise (it is not symmetric)
for (i in 1:n) {
pi.i <- c()
p.old <- f(i, pi.i, database)
OKToProceed <- TRUE
while (OKToProceed & length(pi.i) < u) {
p.new <- 0
p.new.index <- NA
#Compute possible parents of i
i.pos <- which(ordering == i) #Position of i in the ordering
if (i.pos == 1) {
break
} else {
precedents <- ordering[1:i.pos-1]
}
for (j in precedents) {
# can it be vectorized?
fz <- f(i, c(pi.i, j), database)
if (fz > p.new) {
p.new <- fz
p.new.index <- j
}
}
if (p.new > p.old) {
p.old <- p.new
pi.i <- c(pi.i, p.new.index)
structure[i, j] = 1
} else {
OKToProceed <- FALSE
}
}
}
return(structure)
}
```
```{r}
#Print the structure
structure <- K2(df, 3, c(1,2,3))
print(structure)
#Should be:
# [0, 0, 0]
# [1, 0, 0]
# [0, 1, 0]
for (i in 1:nrow(structure)) {
cat("Node ", i, " : [")
for (j in 1:ncol(structure)) {
if (structure[i,j]) {
cat(j)
}
}
cat("]\n")
}
```
```{r}
#TODO
#---Coding---#
#Evaluate the performance (timing)
#Generalize to r cases
#Bnstruct
#---Writing---#
#Write comments for everything
```