The neural network model is a connectionist model inspired by the concept of activating neurons in the brain, where multiple “neurons” or units are activated by an activation function, which takes as input linear combination of weights and the previous layer inputs. While not being a realistic depiction of the brain, there is a loose resemblence in that the strength of the connections or weights between units resembles the strength of synapses between neurons.
As examples are fed forward into each layer of the network, the units in each layer are “activated” through an activation function and the output is then linearly combined with the weights between the current layer and the next and fed into the subsequent activation functions until the output layer. A loss function is used to calculate the distance between the output of the network and the desired target variables, and the loss function derivative along with the local activation function derivative is then used to calculate small changes in the weights in order to propogate those changes at each subsequent layer. This is performed as an application of the chain rule for differentiation in the first order optimisation method “stochastic gradient descent” (SGD). The gradient method makes small changes that follows the derivative of the error in order to search for a global minima. One issue with gradient descent is that it can easily be caught in local minima and not reach the global optimum values.
Initialisation of the network weights is generally performed using heuristic methods, however weights are required to be independently randomly distributed.
The data that is fed into the network is numeric in nature, the challenge is in finding a representation for the input data that will be suitable for the network to learn. Continuous multivariate data is normalised, and categorical data is encoded as an appropriate representation such as a simple index mapping, or potentially an embedding representation (although the networks are also used to learn embedding representations).
The linear combination at each layer is given as \(\nu = W^TX + b\) this is then passed through the activation function \(y = \phi(\nu)\) to produce the output of the layer.
\[ \nu_j^{(l)} = \sum_{i=0}^m w_{ij}^{(l)}y_i^{(l-1)} \]
The activation of the neuron \(j\), \(y_j^{(l)}\) is given as
\[ y_j^{(l)} = \begin{cases} x_j & \text{ where } l = 0, x_j \text{ is an input node}\\ \phi_j(v_j) & \text{ where } l > 0 \end{cases} \]
A multilayer network consists of an input layer, intermediate “hidden” layers and an output layer. In the case of a simple three layer network we can refer to the units in the input layer as units \(i\) the units in the hidden layer as units \(j\) and units in the output layer as units \(k\). The weights between each layer as \(w_{ji}\) for weights between input and hidden layer, and \(w_{kj}\) for weights between the hidden and output layer.
The input layer doesn’t really have an activation function as such, in this case it can be treated as the identity function. The requirement of activation and loss functions are that they are continuous and differentiable. Normalisation of network inputs and targets can be helpful in this respect.
In the SGD learning algorithm the network is fit in two stages, a forward pass and a backward pass.
The forward pass is the same as prediction, hence data is fed into each layer subsequently it is a composition of activation functions. For example in the simple 3 layer network we have 2 activation functions and one input layer with weight matrices between input \(i\) and layer 1 units \(j\) (\(W_{ji}\)), as well as weight matrices between hidden layer 1 units \(j\) and output layer 2 units \(k\) (\(W_{kj}\)). With respective activation functions \(\sigma_j\) for each unit \(j\) in hidden layer 1 and \(\phi_k\) for each unit \(k\) in the output layer 2.
\[ f_{net} = \phi_k(W_{kj}^T\phi_j(W_{ji}^T [X \text{ } b])) = \phi_k(\nu_k) \]
The input matrix consists of an additional column \(b\) which is a bias term introduced for the linear activation of the input, this is generally a column of ones. The bias is also added to the weight matrices in subsequent layers so that it is propagated through the network. As the data is transformed on the input space, via a method of normalisation, the output of the network also needs to have the inverse transform applied in order to rescale it into the original units of the data set. This is typically performed during assessment of the network performance so that metrics depending on units can be interpreted.
During training a loss function \(\mathcal{E}\) is used to calculate the loss for the batch as well as its derivative which gives the gradient of the error signal. The loss function is continuous and differentiable. Common loss functions are \(L^2 = \frac{1}{2n}\sum_{i=1}^n (y - \hat{y})^2\) and the Cross Entropy function for comparison of discrete probability distributions generated by the network and the target probability distribution for the example.
The gradient descent method uses the derivative of the loss function with respect to the weights in order to determine the changes that need to be made to the weights within the network.
\[ \frac{\partial \mathcal{E}}{\partial w_{ji}} = \frac{\partial \mathcal{E}}{\partial e_j}\frac{\partial e_j}{\partial y_j}\frac{\partial y_j}{\partial \nu_j}\frac{\partial \nu_j}{\partial w_{ji}} \] It is the application of the chain rule with respect to the weights and the composed functions of the network.
Haykin [1] indicates that the partial derivative \(\frac{\partial \mathcal{E}}{\partial w_{ji}}\) represents a sensitivity factor which determines the direction of search in the weight space for the update to weight \(w_{ji}\). In effect it determines the direction the weight will be adjusted.
The back propogation is separated into two tasks, the first task is in determining the gradient:
\[ \delta_j = \frac{\partial \mathcal{E}}{\partial \nu_j} = \frac{\partial \mathcal{E}}{\partial e_j}\frac{\partial e_j}{\partial y_j}\frac{\partial y_j}{\partial \nu_j} \] Which is calculated depending on the layer of the network as shown below. \[ \delta_j = \begin{cases} e_j^{(L)}\phi_j'(v_j^{(l)}) & \text{ neuron } j \text{ in output layer } L\\ \phi_j'(v_j^{(l)})\sum_k \delta_k^{(l+1)}w_{kj}^{(l+1)} & \text{ neuron } j \text{ in hidden layer } l \end{cases} \]
The second part is in adjusting the weights at the connection between neuron \(j\) in layer \(l\) and neuron \(i\) in the previous layer \(l-1\) as follows.
\[ w_{ji}^{(l)} = w_{ji}^{(l)} + \eta \delta_j^{(l)} y_i^{(l-1)} \]
The hyperparameter \(\eta\) is specified prior to training (and needs to be determined empirically), it is also possible to use a dampening effect for updates made in previous weight updates by including a momentum term.
A second hyper parameter \(\alpha\) is used to provide a decay for the influence of the previous iterations weights at iteration \(n-1\). The notation here introduces the term \(n\) for the current iteration of the learning epoch. \[ w_{ji}^{(l)}(n) = w_{ji}^{(l)}(n) + \alpha[w_{ij}^{(l)}(n-1)] + \eta \delta_j^{(l)} y_i^{(l-1)}(n) \] The hyperparameter momentum \(\alpha \in [0,1]\) is described by Goodfellow et al in “Deep Learning” 2016 [2] as an analogy for the velocity vector of a particle. The parameter determines how quickly contributions of previous gradients exponentially decay.
Learning continues until completing all epochs or until a stopping condition is met.
Other optimisation methods make use of second order derivatives or approximations of second order derivatives in order to perform weight updates, these are not addressed in the implementation at this stage.
The above does not describe in detail the choice of activation functions, these depend on the type of application applied in the network. Popular choices of activation function for hidden layers include the “Restricted Linear Unit (Relu)” or the Sigmoid unit.
A good list of activation functions is given on wikipedia [3].
Aside from the linear (identity function) the following activation functions are implemented by the library.
\[ \phi(x) = \frac{1}{1 + e^{-x}} \] With its derivative \[ \phi'(x) = \phi(x)(1 - \phi(x)) = \frac{e^{-x}}{(1 + e^{-x})^2} \] we can introduce slope \(\alpha\) and temperature \(\rho\) parameters which influence the steepness and width of the function producting \(\phi(x) = \frac{1}{1 + e^{-c x}}\) and \(\phi' = -c \phi(x)(1 - \phi(x)) = \frac{ce^{-cx}}{(1+e^{-cx})^2}\) where \(c = \frac{\alpha}{\rho}\).
The hyperbolic tanh function \(\phi(x) = tanh(x) = \frac{e^x - e^{-x}}{e^{-x} + e^x}\) and its derivative \(\phi'(x) = 1 - tanh^2(x)\).
The relu function is expressed as \[ \phi(x) = \begin{cases} x & x > 0\\ 0 & \text{otherwise} \end{cases} \] and its derivative \[ \phi'(x) = \begin{cases} 1 & x > 0\\ 0 & \text{otherwise} \end{cases} \]
The softmax function is an aggregating function that is applied generally at the last layer of the network when used for assigning probabilities to class indicator matrices. This function has an interesting derivative, which is a matrix form rather than a vector.
\[ \phi(\overrightarrow{x}) = \frac{e^{x_i}}{\sum_{i=0}^m e^{x_i}} \] This yields an approximate probability, and the maximum value is taken as the index of the predicted class. The appropriate loss function for training is the cross entropy loss function. The derivative of the softmax is \[ \phi'(\overrightarrow{x}) = \phi_i(\overrightarrow{x})(\delta_{ij} - \phi_j(\overrightarrow{x})) \] Where the Kronecker Delta \[ \delta_{ij} = \begin{cases} 1 & i = j\\ 0 & i <> j \end{cases} \] The derivative can be evaluated a set of matrix operations. Note however this explanation I first found in a tutorial by Sait Celebi located at http://saitcelebi.com/tut/output/part2.html, which was an excellent explanation of the derivative of the softmax function, it is quite a good resource well worth the read, which is reflected in summary form below [4].
\[ \phi'(\overrightarrow{x}) = \left(\begin{bmatrix} \phi(x_1)\\ \phi(x_2)\\ \vdots\\ \phi(x_m) \end{bmatrix}\begin{bmatrix} 1 & 1 \cdots & 1 \end{bmatrix}\right)\otimes\left( \begin{bmatrix} 1 & 0 & \cdots & 0\\ 0 & 1 & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & 1 \end{bmatrix} - \begin{bmatrix} 1\\ 1\\ \vdots\\ 1 \end{bmatrix} \begin{bmatrix} \phi(x_1) & \phi(x_2) \cdots \phi(x_m) \end{bmatrix} \right) \] \[ = \phi(\overrightarrow{x})e^T \otimes (I - e\phi(\overrightarrow{x})^T) \] where \(e\) is a vector of ones the same size as the \(x\) vector.
The kroneker delta \(\delta\) is the diagonal ones matrix in the expression above the outer product \(\otimes\) results in a matrix of \(m \times m\) in dimension. During back propogation this is converted back into a one dimensional vector due to the dot product between the error vector of the output layer and the derivative of the output layer.
In calculating the error and its gradient a differentiable loss function is required. For the purposes of demonstration two loss functions are implemented. The first being the Squared Error loss function
\[ \mathcal{E} = \frac{1}{2}\sum_{i=1}^ne_i^2 \] with its derivative \[ \mathcal{E}' = e \] where \(e_i = y_i - \hat{y}_i\), [1].
The second being the Cross Entropy measure giving a measure of entropy between two discrete probability distributions, if both distributions are the same the cross entropy should be close to 0.
\[ \mathcal{E} = -\sum_{i=1}^m p(y_i) \log q(y_i) \]
The probability function \(p(y)\) refers to the distribution of the observations and \(q(y)\) refers to the distribution of the predictions generated by the network (output by the softmax layer).
Its derivative is \(\mathcal{E}' = p(y_i) - q(y_i)\) or \(y_i - \hat{y}_i = e_j\). The derivative of the loss function is the first term used in computing the gradient described earlier.
The au.id.cxd.math library implements a simple means of training networks with stocastic gradient descent, however this is for demonstration purposes, industrial applications should make use of tools such as tensorflow or pytorch for example.
The library provides activation functions for sigmoid, tanh, Relu, Linear and Softmax, as well as loss functions for Squared Error for use with continuous data and Cross Entropy for discrete probabilities.
The following sections provide an example of two simple networks, one used for regression and the other for classification of multiple classes.
This model is a simple 3 layer network that will be used to predict the miles-per-gallon from the cars data set based on other continuous features of each model.
The first step is to define some convenience functions that will help us to load the data and partition it.
import au.id.cxd.math.data.{MatrixReader, Partition}
import au.id.cxd.math.function.transform.StandardisedNormalisation
import au.id.cxd.math.model.evaluation.{Efficiency, MAE, MSE, PeakPercentDeviation, RSquared, WillmotsIndex}
import au.id.cxd.math.model.network.activation.{Identity, LeakyRelu, Linear, Relu, Sigmoid}
import au.id.cxd.math.model.network.builder.{Builder, Sequence}
import au.id.cxd.math.model.network.initialisation.RandomWeightInitialisation
import au.id.cxd.math.model.network.layers.{DenseLayer, InputLayer}
import au.id.cxd.math.model.network.train.SGDTrainer
import au.id.cxd.math.probability.regression.{BayesLinearRegression, OrdLeastSquares}
val inputFile = s"$basePath/data/cars/autompg2.csv"
def readCars(inputFile:String): (StandardisedNormalisation, DenseMatrix[Double], DenseMatrix[Double]) = {
val reader = new MatrixReader {}
val M = reader.read(new File(inputFile))
// normalise M
val normal = new StandardisedNormalisation()
val M1 = normal.transform(M)
val continuous = Seq(2,3,4,5)
val X = M1(::, continuous)
val Y = M1(::, 0)
(normal, X.toDenseMatrix, Y.toDenseMatrix.t)
}
def translateY(data:DenseMatrix[Double], norm:StandardisedNormalisation):DenseMatrix[Double] = {
val mu = norm.meanVector
val std = norm.sigmaVector
val ycol = 0
val mu2 = DenseVector(mu(0))
val std2 = DenseVector(std(0))
val norm2 = new StandardisedNormalisation()
norm2.meanVector = mu2
norm2.sigmaVector = std2
norm2.invert(data)
}
def partition(X:DenseMatrix[Double], Y:DenseMatrix[Double]):Seq[(DenseMatrix[Double], DenseMatrix[Double], DenseMatrix[Double])] = {
val partitions = Partition(Seq(X, Y))
partitions
}
Next we load the data file.
val (norm, x, y) = readCars(inputFile)
val partitions = partition(x,y)
val trainX = partitions(0)._1
val validX = partitions(0)._2
val testX = partitions(0)._3
val trainY = partitions(1)._1
val validY = partitions(1)._2
val testY = partitions(1)._3
The partitions have been generated, the default proportions for the ‘Partition’ object are 60% training, 20% validation and 20% testing.
The next step is to define a network architecture, in this case the network has one input layer of 4 inputs, one hidden layer with 4 units and one output layer with a single unit. A helper function is written to provide the network architecture.
/**
* create a simple network
* @return
*/
def buildNetwork():Builder = {
Sequence(Seq(
InputLayer(activation=Identity(), units=4),
DenseLayer(activation=Relu(), units=4),
DenseLayer(activation=Linear(), units=1)
)).compile()
}
A training algorithm is supplied, in this case there is one to choose from, the SGD training algortihm, with a low learning rate and momentum. The low learning rate is selected because to avoid learning too quickly and resulting in overflow or underflow errors which have not been addressed in this example implementation. Similarly the momentum is provided as a low value to prevent the previous history of weight updates from causing similar underflow or overflow issues during the training algorithm. Note when using more robust tools such as tensorflow this may not be as much of a consideration to deal with.
val trainer = SGDTrainer(trainX, trainY, validX, validY, learnRate = 0.000001, momentum=0.000001)
The trainer is then used to train the network, it returns a new network containing the resulting weights and the training loss and validation loss metrics for each epoch. The network is trained for 2000 epochs.
val (loss, valloss, network2) = trainer.train(2000)(buildNetwork().asInstanceOf[Sequence])
The training and validation loss is shown below, interestingly the validation loss is lower than the training loss. Also note that the loss continues to decrease throughout the training process, although near the end the training loss begins to increase slightly.
In order to evaluate the network we can use a range of metrics, these include \(R^2 \in [0, 1]\) the percent variance explained (in the linear relationship), Willmott’s Index of Agreement \(d \in [0, 1]\), Nash-Sutcliffe Efficiency \(E \in [-\infty, 1]\), Mean Absolute Error (in the same units as the target unit), Root Mean Absolute Error (also in the same units) and Peak Percentage Deviation (also in the sames units as the target), in order to gain an insight into the performance of the model.
To calculate the metrics we first obtain an approximation from the network.
val sim = network2.predict(testX)
We can plot this prediction as well as the residuals just as we do in standard regression.
The residuals appear to be almost normally distributed with some skew towards the lower values. Visually we can say the model deviates somewhat for lower values.
The metrics for the model are generated as follows.
def printMetrics(metrics:Map[String,Double]) = {
metrics.foreach {
pair => {
println(s"${pair._1} : ${pair._2}")
}
}
}
val metrics1 = Map(
"RSquared" -> RSquared(target2.toArray.toSeq, sim2.toArray.toSeq),
"Agreement" -> WillmotsIndex(2.0, target2.toArray.toSeq, sim2.toArray.toSeq),
"Efficiency" -> Efficiency(2.0, target2.toArray.toSeq, sim2.toArray.toSeq),
"MAE" -> MAE(target2.toArray.toSeq, sim2.toArray.toSeq),
"MSE" -> MSE(target2.toArray.toSeq, sim2.toArray.toSeq),
"PDV" -> PeakPercentDeviation(target2.toArray.toSeq, sim2.toArray.toSeq)
)
println("Network model")
printMetrics(metrics1)
## Network model
## PDV : -23.107646750304045
## MAE : 5.563863118489157
## Efficiency : -0.35127610298428724
## MSE : 49.5362005603459
## RSquared : 2.1073537983243125E-4
## Agreement : 0.5072545193358489
Note that the R-squared indicates the model explains only a small percent of variation, the agreement is roughly close to the mean, the MSE about 49 miles per gallon (given the data we are predicting) and the mean absoluate error MAE is about 5.6 miles per gallon.
We can generate another model using ordinary least squares and compare using the same metrics.
// compare with OLS
val ols1 = OrdLeastSquares(trainX, trainY.toDenseVector, 1)
val (est1, error1) = ols1.train()
// prediction before update
val Y1 = ols1.predict(testX)
// rescale the simulated data
val sim4 = translateY(Y1.t, norm)
Looking at the residuals the OLS model appears to be skewed to the right with predictions in the upper range having more errors than the lower to mid range.
We can obtain metrics for the model to compare to the original neural network model.
val metrics2 = Map(
"RSquared" -> RSquared(targets.toArray.toSeq, Y1.toArray.toSeq),
"Agreement" -> WillmotsIndex(2.0, targets.toArray.toSeq, Y1.toArray.toSeq),
"Efficiency" -> Efficiency(2.0, targets.toArray.toSeq, Y1.toArray.toSeq),
"MAE" -> MAE(target2.toArray.toSeq, sim4.toArray.toSeq),
"MSE" -> MSE(target2.toArray.toSeq, sim4.toArray.toSeq),
"PDV" -> PeakPercentDeviation(target2.toArray.toSeq, sim4.toArray.toSeq)
)
println("Bayes OLS model")
printMetrics(metrics2)
## Bayes OLS model
## PDV : -39.582827691973094
## MAE : 7.602216571766249
## Efficiency : -1.1510348043281726
## MSE : 78.85441860783347
## RSquared : 0.03445182828904486
## Agreement : 0.55062510221139
Reviewing the OLS model, it has an efficiency lower than the first model, however both models have a negative efficiency which indicates that they do not predict anymore skillfully than the simply using the mean value on this data set. However the second model has a higher MAE as well as a higher MSE as well as a higher PDV which indicates the first model is slightly more skillful a prediction than the second.
The following example uses the library to build a neural network for classification. The data set used in this example is the wine data set. The idea is to use chemical measures of wine to determine which origin winery the wine was made at. There are 3 origin wineries.
In this example a helper class implements a DataSetReader in order to onehot encode discrete class labels.
import au.id.cxd.math.data.{DataSet, DataSetReader, Partition}
import au.id.cxd.math.function.transform.{ContinuousTransform, MinMaxNormalisation}
import au.id.cxd.math.model.network.initialisation.{RandomGaussianInitialisation, RandomWeightInitialisation, WeightInitialisationStrategy}
val filename = s"$basePath/data/wine_data.csv"
val wineDataReader = new DataSetReader {
/**
* the continuous columns in the wine data set
*/
override val continuousCols: List[Int] = (for (i <- 1 to 13) yield i).toList
// the discrete column is the first column.
override val discreteCols: List[Int] = List(0)
override val file: File = new File(filename)
override val continuousFn: ContinuousTransform = MinMaxNormalisation()
}
def readData(): DataSet = {
val dataset = wineDataReader.readSync()
dataset
}
We first read the data.
val data = readData()
val trainX = data.trainData(::, 0 to 12)
val trainY = data.trainData(::, 13 to 15)
val validX = data.crossValidateData(::, 0 to 12)
val validY = data.crossValidateData(::, 13 to 15)
val testX = data.testData(::, 0 to 12)
val testY = data.testData(::, 13 to 15)
A helper function defines the structure of the network, and the SGD trainer is defined.
import au.id.cxd.math.model.network.activation.{Identity, Linear, Relu, Sigmoid, Softmax}
import au.id.cxd.math.model.network.loss.{DiscreteCrossEntropy, MeanSquareErrorLoss}
def buildNetwork(initialisation: WeightInitialisationStrategy): Sequence = {
Sequence(Seq(
InputLayer(activation = Identity(), units = 13),
DenseLayer(activation = Relu(), units = 15),
DenseLayer(activation = Relu(), units = 10),
DenseLayer(activation = Relu(), units = 3),
DenseLayer(activation = Softmax(), units = 3)
), initialisation).compile()
.asInstanceOf[Sequence]
}
def trainNetwork(epochs: Int, trainer: SGDTrainer, network: Sequence): (Seq[Double], Seq[Double], Sequence) = {
val (loss, valloss, network2) = trainer.train(epochs)(network)
(loss, valloss, network2)
}
val initialisation = RandomWeightInitialisation()
val network = buildNetwork(initialisation)
val trainer = SGDTrainer(trainX, trainY, validX, validY,
learnRate = 0.00001,
momentum = 0.000001,
lossFn = DiscreteCrossEntropy())
val (loss, valloss, network2) = trainNetwork(7000, trainer, network)
The network in this case has 1 input layer, 3 hidden layers and 1 output layer with 3 output units. A softmax function is used to aggregate the output results and convert them into probabilities. The loss function is a discrete cross entropy loss function.
The network is trained for 7000 epochs and the following graph illustrates the loss.
To evaluate the network we use cross tabulation and provide some basic metrics.
import au.id.cxd.math.count.CrossTabulate
val classLabels = data.discreteMapping.head._2
val labels = network2.classify(testX, classLabels.toList)
val actualLabels = data.convertToClassLabels("origin", testY)
val xtab = CrossTabulate(actualLabels, labels)
val metrics1 = CrossTabulate.metrics(xtab)
println("Network classification performance")
println(xtab)
CrossTabulate.printMetrics(xtab)
## Network classification performance
## 13.0 0.0 2.0
## 0.0 11.0 3.0
## 0.0 4.0 2.0
## Total Records: 35.0
## Total Match: 26.0
## Total Mismatch: 9.0
## Positive Percent: 0.7428571428571429
## False percent: 0.2571428571428571
## Odds: 2.8888888888888893
The network matches a good number of records correctly.
In order to compare the performance of the network against other models we will make a comparison against canonical discriminent analysis which is a linear model that relies on between group variation.
import au.id.cxd.math.model.components.CanonicalDiscriminantAnalysis
// lets compare this with a linear model such as canonical discriminant analysis
val data1 = data.randData
val groupNames = data.discreteMapping.head._2
// we know the data has 3 sites in it 1, 2 and 3 these are mapped below
val origins = Map("1" -> "Site 1",
"2" -> "Site 2",
"3" -> "Site 3")
// we will create an ordinary partition
val (train, valid, test) = Partition(Seq(data1), 0.6, 0.2, 0.2).head
val trainX2 = train(::, 0 to 12)
val trainY2 = data.convertToClassLabels("origin", train(::, 13 to 15))
val validX2 = valid(::, 0 to 12)
val validY2 = data.convertToClassLabels("origin", valid(::, 13 to 15))
val testX2 = test(::, 0 to 12)
val testY2 = data.convertToClassLabels("origin", test(::, 13 to 15))
val (components, coeffs, intercept, percentVar, zMat, cor, groupMeans) =
CanonicalDiscriminantAnalysis(trainY2.toArray.toList.map(_.toString), trainX2)
val (testProjection, groupProjection, distances, groupAssignments) =
CanonicalDiscriminantAnalysis.classify(testX2, coeffs, intercept, groupMeans, groupNames.toList)
val predictions = groupAssignments.map(_._2)
val xtab2 = CrossTabulate(testY2, predictions)
val metrics2 = CrossTabulate.metrics(xtab2)
println("Canonical discriminant analysis performance")
println(xtab2)
CrossTabulate.printMetrics(xtab2)
## DIM (107 , 13)
## Canonical discriminant analysis performance
## 2.0 13.0 0.0
## 6.0 0.0 8.0
## 3.0 0.0 3.0
## Total Records: 35.0
## Total Match: 5.0
## Total Mismatch: 30.0
## Positive Percent: 0.14285714285714285
## False percent: 0.8571428571428571
## Odds: 0.16666666666666666
We can see that the canonical discriminant function is not very accurate. In this case the neural network classifier outperforms the canonical discriminant analysis function in terms of classification. However we may still use the canonical discriminant function as an analysis method to interpret coefficients for covariates on the training data.
As a third comparison we’ll make use of a set of linear models built from an ensemble of ordinary least squares performing logistic regression.
To do so we need to convert the target classes into three seperate groups, since each of the logistic regression classifiers will be a binary classifier, we need to generate three separate training targets, one for each class.
import au.id.cxd.math.data.filter.Which
// compare with logistic regression labelling.
// to do this we need 3 models one for each class as we are using a two class logistic regressor.
val targets1 = Which[String](trainY2, _.equalsIgnoreCase("1"))
val targets2 = Which[String](trainY2, _.equalsIgnoreCase("2"))
val targets3 = Which[String](trainY2, _.equalsIgnoreCase("3"))
def tabulateTargets(n:Int, targets:Seq[Int]) = DenseVector.tabulate(n) {
case i => if (targets.contains(i)) 1.0
else 0.0
}
val trainY2_target1 = tabulateTargets(trainY2.size, targets1)
val trainY2_target2 = tabulateTargets(trainY2.size, targets2)
val trainY2_target3 = tabulateTargets(trainY2.size, targets3)
val validY2_target1 = tabulateTargets(validY2.size, Which[String](validY2, _.equalsIgnoreCase("1")))
val validY2_target2 = tabulateTargets(validY2.size, Which[String](validY2, _.equalsIgnoreCase("2")))
val validY2_target3 = tabulateTargets(validY2.size, Which[String](validY2, _.equalsIgnoreCase("3")))
Once the data is prepared we train three separate models and combine the results again using a softmax function in order to obtain the ensemble classification.
import au.id.cxd.math.probability.regression.LogisticLeastSquares
import breeze.linalg.DenseVector
val model1 = LogisticLeastSquares(trainX2, trainY2_target1)
val model2 = LogisticLeastSquares(trainX2, trainY2_target2)
val model3 = LogisticLeastSquares(trainX2, trainY2_target3)
Seq(model1, model2, model3).foreach(_.train())
// to combine them we'll need to use the softmax for the results of each three of the probabilities
val (y1,c1) = model1.estimate(testX)
val (y2, c2) = model2.estimate(testX)
val (y3, c3) = model3.estimate(testX)
val classes3 = (for (i <- 0 until y1.length) yield i).map {
i =>
val (a,b,c) = (y1(i), y2(i), y3(i))
val total = Seq(a,b,c).map(Math.exp(_)).sum
val p = Seq(a,b,c).map(x => Math.exp(x)/total)
val maxP = p.max
val idx = Which[Double](p, x => x == maxP).head + 1
idx.toString
}
The resulting set of classes can be passed into the cross tabulation in order to evaluate the models.
// now we have classes
val xtab3 = CrossTabulate(testY2, classes3)
println("Logistic Regression performance")
println(xtab3)
CrossTabulate.printMetrics(xtab3)
## Logistic Regression performance
## 15.0 0.0 0.0
## 1.0 13.0 0.0
## 0.0 0.0 6.0
## Total Records: 35.0
## Total Match: 34.0
## Total Mismatch: 1.0
## Positive Percent: 0.9714285714285714
## False percent: 0.02857142857142857
## Odds: 34.0
The ensemble of separate logistic regressors in this case are more accurate on the dataset than the single neural network. This may be due to the small size of the dataset and the way that three separate logistic models are combined by converting the problem into an ensemble of binary classifiers. However the procedure above demonstrates how such a network can be constructed and compared against a range of other multi-class classifiers.
[1] Haykin, S (1998 ). Neural Networks: A Comprehensive Foundation, 2nd ed. Prentice Hall PTR, Upper Saddle River, NJ, USA
[2] Goodfellow, I, Bengio, Y and Courville, A (2016 ). Deep Learning. MIT Press
[3] Wikipedia contributors (2019 ). Activation function — Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Activation_function
[4] Sait Celebi (2019 ). Part 2: Softmax regression. http://saitcelebi.com/tut/output/part2.html