
== Why Are [[Artificial Neural Networks | Neural Networks]] So Powerful? ==


Well, that is a difficult question that has haunted us for about 30 years. The more successes we see from the applications of [[deep-learning]], the more puzzled we are. Why is it that we can plug this “[[artificial neural networks |thing]]” into pretty much any problem, [[statistical classification | classification]] or [[statistical inference | prediction]], and with just a limited amount of tuning, almost always get good results? While some of us are amazed by this power, particularly its universal applicability, others find it hard to be convinced without a deeper understanding.

=== Now here is an answer! Ready? ===

In short, [[Artificial Neural Networks | Neural Networks]] extract from the data the most relevant part of the [[ feature selection | information]] that describes the statistical dependence between the features and the labels. In other words, the size of a [[Artificial Neural Networks | Neural Network]] specifies a [[ data structure ]] that we can compute and store, and the result of training the network is the best approximation of the statistical relationship between the features and the labels that can be represented by this [[data structure]].

I know you have two questions right away: '''REALLY? WHY?'''

The “why” part is a bit involved. We have a new [[paper]] that covers this. Briefly, we first need to define a metric that quantifies how valuable a piece of partial information is for a specific inference task; we can then show that [[Artificial Neural Networks | Neural Networks]] actually draw the most valuable part of the information from the data. As a bonus, the same argument can also be used to understand and compare other [[Outline of machine learning | learning algorithms]], [[Principal component analysis | PCA]], [[compressed sensing]], etc. So everything ends up in the same picture. Pretty cool, huh? (Be aware, it is a loooong [[paper]].)

On this page, we try to answer the “Really?” question. One way to do that would be to write out a mathematical proof, which is included in the paper. Here, let’s just do some experiments.

=== Here is the Plan: ===
We will generate some data samples, <math>(\underline{x}, y )</math> pairs, where <math>\underline{x}</math> is the real-valued feature vector and <math>y </math> is the label. We will use these data samples to train a simple [[Artificial Neural Networks | neural network]] to classify the feature vectors. After the training, we will read out some of the weights on the edges of the network and show that these weights are actually the empirical conditional expectations <math>{\mathbb E} [\underline{X}|Y=y] </math> for the different values of <math>y</math>.

So why are these conditional expectations worth computing? This goes back all the way to [[Alfréd Rényi]] and the notion of HGR (Hirschfeld–Gebelein–Rényi) maximal correlation. In our [[paper]], we show that this conditional expectation, as a function of <math>y</math>, is in fact the function that is the most relevant to the [[statistical classification | classification]] problem.
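(For reference, and this is just the standard definition rather than anything new in our [[paper]]: the HGR maximal correlation between <math>\underline{X}</math> and <math>Y</math> is

<math> \rho(\underline{X}; Y) = \sup_{f, g} {\mathbb E}[f(\underline{X})\, g(Y)], \quad \text{subject to } {\mathbb E}[f(\underline{X})] = {\mathbb E}[g(Y)] = 0, \; {\mathbb E}[f^2(\underline{X})] = {\mathbb E}[g^2(Y)] = 1, </math>

and the optimizing <math>f</math> and <math>g</math> are the functions of the features and the labels that are, in this sense, the most strongly correlated with each other.)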

Well, if the HGR thing is too abstract, you can think of it this way: there is a “correct” function that statisticians would like to compute. This is somewhat different from the conventional picture, in which learning means learning the [[sufficient statistic| statistical]] model that governs <math>(\underline{X}, Y)</math>. Since the features are often very high-dimensional vectors, or have other complex forms, learning this complete model can be difficult. The structure of a [[Artificial Neural Networks | neural network]] puts a constraint on the number of parameters that we can learn, which is often much smaller than the number of parameters needed to specify the full statistical model of <math>(\underline{X}, Y)</math>. Thus, we can only hope to learn a good approximate model that can be represented by these parameters. What the HGR business and our [[paper]] say is that there is a theoretical way to identify the best approximation to the full model with only this number of parameters; and what we demonstrate here is that, at least for this specific example, a [[Artificial Neural Networks | neural network]] computes exactly that!

Amazing, right? Imagine how extensions of this can help you use [[Artificial Neural Networks | Neural Networks]] in your own problems!

===To follow the experiments:===
You can just read the code and comments on this page and trust me with the results, or you can run it yourself. To do that, you will need a standard [https://www.python.org/ Python] environment, including [https://scipy.org/install.html/ Numpy], [https://matplotlib.org/ Matplotlib], etc. You will also need a standard [[Artificial Neural Networks | neural network]] package. For that, I use [https://keras.io/ Keras] and run it with the [https://www.tensorflow.org/install/ TensorFlow] backend. You can follow this [https://keras.io/#installation link] to get them installed. I recommend using [https://docs.continuum.io/anaconda/install Anaconda] to install them, which takes, if you do it right, less than 10 minutes. Trust me, it’s a worthwhile effort: these packages are really well made and powerful. Of course, you are also welcome to just sit back, relax, and enjoy the journey, as I will show you everything that you are supposed to see from these experiments.

== The Experiments ==

===To start===
You need to have the following lines to initialize.

<syntaxhighlight lang="python">
"""
Created on Mon May 8 11:58:01 2017

@author: lizhongzheng

"""

import numpy as np
import matplotlib.pyplot as plt

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import SGD


</syntaxhighlight>

If you receive no error messages after these, then congratulations, you have installed your packages correctly.

=== Generate Data Samples ===
Now let’s generate some data. To start with, let’s generate the feature vectors and labels <math> (\underline{x}_i, y_i), i=1, \ldots, N </math>, from a [[mixture model | mixture Gaussian model]]

<math> p_{\underline{X}|Y}(\underline{x} | j) = {\cal N}(\underline{\mu}_j, \sigma^2I) </math>

This is almost cheating, as we know right away that <math>{\mathbb E}[\underline{X}|Y=j] = \underline{\mu}_j </math>, so we only need to look for these <math>\underline{\mu}_j</math> values after the model is trained and hope that they magically show up as weights somewhere in the network. Simple!

To make the story a little bit more interesting, we will pick the <math>\underline{\mu}_j</math> ’s randomly. We will pick the probability <math>p_Y</math> randomly too. Here is the code.
<syntaxhighlight lang="python">
if __name__ == "__main__":

    # the dimensionality of x vectors
    Dx = 3

    # the cardinality of y
    Cy = 8

    # pick Py randomly, each entry from 1..5, then normalize
    Py = np.random.choice(range(1, 6), Cy)
    Py = Py / sum(Py)

    # pick the centers M_j randomly:
    # M[:, j] is Dx dimensional, the center of P_{X|Y=j}
    M = np.random.choice(range(-10, 10), [Dx, Cy])

    # number of samples
    N = 10000

    # Generate the samples

    Y = np.random.choice(Cy, N, p=Py)

    X = np.zeros([N, Dx])
    Labels = np.zeros([N, Cy])  # neural network takes the indicators instead of Y

    for i in range(N):
        X[i, :] = M[:, Y[i]]
        Labels[i, Y[i]] = 1

    X = X + np.random.normal(0, .5, X.shape)

    # empirical distribution of Y, we are not supposed to know Py anyway
    PPy = sum(Labels) / N

</syntaxhighlight>
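By the way, if you would rather compare the network weights against the empirical conditional means instead of the true centers <math>M</math> (we do exactly that in the Dirichlet example later), a minimal sketch would be the following; the name Mhat is just a hypothetical helper and is not used in the plots below:

<syntaxhighlight lang="python">
# empirical conditional means: Mhat[:, j] approximates E[X | Y = j]
# (hypothetical helper; the plots below simply reuse the true centers M)
Mhat = np.zeros([Dx, Cy])
for j in range(Cy):
    Mhat[:, j] = X[Y == j, :].sum(axis=0) / sum(Y == j)
</syntaxhighlight>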

We can plot the data as something like this:

[[File:Samplesfrommixturegaussian.png|thumb|Samples from Mixture Gaussian model]]

One can pretty much eyeball the different classes. Some classes might be easier to separate, and some might be too close to separate well.
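(If you want to reproduce a scatter plot like the one above yourself, here is one possible sketch; it only uses Matplotlib's default color map, so the colors will not match the figure exactly.)

<syntaxhighlight lang="python">
# scatter plot of the first two feature dimensions, colored by the label
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=Y, s=5)
plt.title('Samples from the mixture Gaussian model')
plt.show()
</syntaxhighlight>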

=== Use [[Artificial Neural Networks | Neural Network]] for [[statistical classification | Classification]] ===
Now let’s make a [[Artificial Neural Networks | neural network]] to nail these data. The network we would like to make has a single layer of nodes, not exactly [[deep learning]], but a good starting point. It looks like this:

[[File:SinglelayerNN.png|thumb|Single Layer Neural Network with SoftMax Outputs]]

What the network does is train some weight parameters <math>(\underline{v}_j, b_j )</math> to form linear transforms of the feature vectors, denoted as <math>Z_j = \underline{v}_j^T \cdot \underline{x} + b_j</math>, one for each node, or neuron. The [[Multinomial logistic regression | SoftMax]] output unit computes a distribution based on these values,

<math>Q^{(v,b)}_{Y|\underline{X}}(j | \underline{x}) = \frac{e^{Z_j}}{\sum_{l} e^{Z_l}}</math>

and training maximizes the likelihood over the given collection of samples

<math>\max_{v, b} \sum_{i=1}^N \log Q_{Y|\underline{X}}^{(v,b)}(y_i| \underline{x}_i) </math>

The only thing we need to specify in this process is the number of nodes, which we choose to be the number of possible values of the labels, <math>Cy = |{\cal Y}| </math> in the code. The code using [https://keras.io/ Keras] to implement this network is as simple as the following:

<syntaxhighlight lang="python">
model = Sequential()
model.add(Dense(Cy, activation='softmax', input_dim=Dx))

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])
model.fit(X, Labels, verbose=0, epochs=200, batch_size=200)
</syntaxhighlight>

The resulting weights can be accessed by calling [https://keras.io/models/about-keras-models/#about-keras-models model.get_weights()]. Here is what we get, compared with the centers of each class:


<syntaxhighlight lang="python">
>>> model.get_weights()[0]
array([[ 4.9477067 , 0.75933689, 2.35835409, -0.92028821, -1.89905143,
-0.17308646, -2.60500956, -3.40538096],
[ 0.11090775, 5.86771727, -2.96665311, 3.38425779, -2.5359087 ,
-4.26719618, 0.18537734, 1.92522383]], dtype=float32)

>>> M
array([[ 9, -6, 7, -4, -2, 0, -7, -9],
[-5, 9, -9, 4, -2, -7, 1, 2]])

</syntaxhighlight>

They do not look all that similar, right? The trick is that we need to regulate them: make each row vector above, viewed as a function of <math>y</math>, have zero mean and unit variance with respect to <math>p_Y</math>. To do that, we write the following regulating function:

<syntaxhighlight lang="python">
def regulate(x, p):  
    ''' 
    regulate x vector so it has zero mean and unit variance w.r.t. p 
    '''  
    assert(np.isclose(sum(p),1)), 'invalid distribution used in regulate()'  
    assert(x.size==p.size), 'dimension mismatch in regulate()'  
      
    r= x - sum(x*p)  
    r= r / np.sqrt(sum(p*r*r))  
    return(r)
</syntaxhighlight>
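As a quick sanity check (this is not part of the original experiment), you can verify that the output of regulate() really does have zero mean and unit variance with respect to the given distribution:

<syntaxhighlight lang="python">
# sanity check: the regulated vector should have (near) zero mean and unit variance w.r.t. PPy
r = regulate(M[0, :], PPy)
print(sum(r * PPy))      # expect something close to 0
print(sum(PPy * r * r))  # expect something close to 1
</syntaxhighlight>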

Finally, we can make the plots

<syntaxhighlight lang="python">
for i in range(Dx):

    plt.figure(i)
    plt.plot(range(Cy), regulate(M[i, :], PPy), 'r-')
    plt.plot(range(Cy), regulate(model.get_weights()[0][i], PPy), 'b-')
    plt.title(i)
    plt.show()
</syntaxhighlight>

and here are the results, comparing the empirical conditional expectation and the weights from the [[artificial neural networks | neural network ]]:
[[File:Mixgaussianfeature0.png|thumb|Feature 0 of the Gaussian Example]] [[File:Mixgaussianfeature1.png|thumb|Feature 1 of the Gaussian Example]]

=== A Non-Gaussian Example ===

Well, this is not so terribly surprising for the [[mixture model |mixture Gaussian]] case. The [[Maximum a posteriori estimation | MAP]] classifier would compute the distance from a sample to each conditional mean and weight it according to the [[prior probability |prior]]. With all the scaling and shifting, it is not unthinkable that the procedure becomes an inner product with the conditional mean. In fact, I wish someone would make an animation of this, to see how the decision regions vary with the parameters; it could be a good demo for teaching classical decision theory. (I don’t say “classical” in any condescending way.)
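To spell out the “scaling and shifting” (this is just the standard calculation, not anything from our [[paper]]): for the mixture Gaussian model above, the MAP rule is

<math> \hat{y}(\underline{x}) = \arg\max_j \left[ \log p_Y(j) - \frac{\|\underline{x} - \underline{\mu}_j\|^2}{2\sigma^2} \right] = \arg\max_j \left[ \frac{1}{\sigma^2}\, \underline{\mu}_j^T \underline{x} - \frac{\|\underline{\mu}_j\|^2}{2\sigma^2} + \log p_Y(j) \right], </math>

where the <math>\|\underline{x}\|^2 / 2\sigma^2</math> term is dropped since it is the same for every <math>j</math>. So the decision is made by comparing affine functions of the inner products <math>\underline{\mu}_j^T \underline{x}</math>, which is exactly the form <math>Z_j = \underline{v}_j^T \cdot \underline{x} + b_j</math> computed by the single-layer network.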

So how about we try a non-Gaussian case, say, samples from a [[Dirichlet distribution]]? Why [[dirichlet distribution | Dirichlet]]? Because [https://www.python.org Python] generates it, and I cannot remember the mean of this distribution.

<syntaxhighlight lang="python">
# the dimensionality of x vectors   
Dx = 2  
  
# the cardinality of y   
Cy = 8  
  
# pick Py randomly, each entry from 1..5, then normalize  
Py = np.random.choice(range(1,4), Cy)  
Py=Py/sum(Py)  
  
# pick the A parameters   
  
A= np.random.uniform(0, 10, [Dx+1, Cy])  
  
# number of samples   
N=10000  
  
# Generate the samples  
  
Y= np.random.choice(Cy, N, p=Py)  
T= np.zeros([N, Dx+1])  
  
for j in range(Cy):  
    T[Y==j, :] = np.random.dirichlet(A[:, j], sum(Y==j))  
  
X=T[:, :-1]  
  
# make the labels  
Labels=np.zeros([N, Cy]) # neural network takes the indicators instead of Y      
for i in range(N):  
    Labels[i, Y[i]]=1  
  
# centralize       
Xbar=sum(X)/N  
X=X-np.tile(Xbar, [N,1])  
  
  
# empirical distribution of Y, we are not supposed to know Py anyway  
PPy=sum(Labels)/N    
          
# compute the empirical means  
M=np.zeros([Cy, Dx])  
for j in range(Cy):  
    M[j,:] = sum(X[Y==j, :])/sum(Y==j) 
</syntaxhighlight>

Here is how the generated data samples look. I had to add in the colors, as otherwise it is really hard to see. Not a very clear clustering problem, is it? The strange triangle shape comes from the fact that the [[Dirichlet distribution]] has its support on a simplex: we generate samples on a 3-D simplex and project them down to 2-D space.

[[File:Dirichlet samples.png|thumb|Samples from Mixture Dirichlet Distribution]]
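The network is trained the same way as before, with the same Keras calls, now fit on the new X and Labels. If you reuse the earlier plotting loop, note that M in this example is stored as Cy-by-Dx rather than Dx-by-Cy, so the indexing is transposed. A minimal sketch, assuming the model has been re-fit on the Dirichlet data:

<syntaxhighlight lang="python">
# compare the regulated empirical means with the regulated network weights;
# note that M is Cy x Dx in this example, so we take its columns
for i in range(Dx):
    plt.figure(i)
    plt.plot(range(Cy), regulate(M[:, i], PPy), 'r-')
    plt.plot(range(Cy), regulate(model.get_weights()[0][i], PPy), 'b-')
    plt.title(i)
    plt.show()
</syntaxhighlight>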

and here is the comparison between the conditional expectations and the weights:

[[File:Dirichlet feature0.png|thumb| Feature 0 of the Dirichlet Example]][[File:Dirichlet feature1.png|thumb|Feature 1 of the Dirichlet Example]]
