Experiments with Statistical Methods in Scala

This project is the main branch I’m currently working on. It is a collection of statistical methods that I’m slowly distilling from part-time studies at USQ.

Project Link: scala-au.id.cxd.math

The main gist of the project is to experiment with the implementation of methods for univariate and multivariate statistical analysis, as well as to capture summary notes on the topic. The front page README for the project contains links to the main areas explored so far.

It is by no means meant to replace existing tools that already excel in this area, so I would not recommend it for production work.

It grew out of my earlier experimentation in F# in the project au.id.cxd.Math, which contained experiments with numerical methods for linear programming, made various uses of distributions in implementing decision trees and simple classifiers, and included an experiment with a simple multi-layer perceptron. Most of those were motivated by interest fostered by earlier studies, as well as by the realisation that a language such as F# (or Scala, for that matter) could express these algorithms in a reasonably terse manner. At the time I was using MATLAB a lot and was looking for a language toolset that could provide similar terseness (tools such as Julia were not available at that time).

The project aims to work through topics and methods in statistics and multivariate statistics, with a focus on understanding the methodology as well as the computational process involved in implementing the procedures. The goal is more to understand the procedures than to produce the implementation for its own sake.

Unfortunately, implementing some of the underlying numerical methods, such as special functions and especially those of the beta and gamma family, is a necessary evil. For the small selection of functions that are needed, the implementation largely duplicates that provided by the GNU Scientific Library (GSL) and is informed by other sources such as the infamous Numerical Recipes (NR) book; the implementation is replicated from the GSL in order to avoid the implementations from NR. At this stage the numerical methods are not intended as the main focus at all. Only a few are selected, in order to support the main focus of experimenting with the methods that depend on these supporting functions. They do relate to some of the other areas of study in my course, although those areas are less focused on numerical methods.
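To give a feel for the kind of supporting function involved, here is a minimal sketch of a log-gamma function and a beta function built on top of it. It uses the Lanczos approximation as a stand-in; it is not the GSL algorithm and not the code from this project, and the names SpecialFunctions, logGamma and beta are made up for the illustration.

```scala
object SpecialFunctions {
  // Lanczos parameter and coefficients (the widely published g = 7, n = 9 set).
  // Assumption: chosen for illustration only, not taken from the GSL.
  private val G = 7.0
  private val Coefficients = Array(
    0.99999999999980993,
    676.5203681218851,
    -1259.1392167224028,
    771.32342877765313,
    -176.61502916214059,
    12.507343278686905,
    -0.13857109526572012,
    9.9843695780195716e-6,
    1.5056327351493116e-7
  )

  /** Natural log of Gamma(x), sketched for x > 0 only. */
  def logGamma(x: Double): Double = {
    require(x > 0.0, "logGamma is only sketched here for x > 0")
    if (x < 0.5) {
      // Reflection formula: Gamma(x) * Gamma(1 - x) = pi / sin(pi * x)
      math.log(math.Pi / math.sin(math.Pi * x)) - logGamma(1.0 - x)
    } else {
      // Lanczos approximates Gamma(z + 1), so shift the argument by one.
      val z = x - 1.0
      var series = Coefficients(0)
      for (i <- 1 until Coefficients.length) series += Coefficients(i) / (z + i)
      val t = z + G + 0.5
      0.5 * math.log(2.0 * math.Pi) + (z + 0.5) * math.log(t) - t + math.log(series)
    }
  }

  /** Beta(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b), computed in log space. */
  def beta(a: Double, b: Double): Double =
    math.exp(logGamma(a) + logGamma(b) - logGamma(a + b))
}
```

Working in log space keeps the beta function from overflowing for moderately large arguments, which is one reason the log-gamma function tends to be the first special function that distributions in this family need.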

For matrix support the excellent Breeze library is used. I most certainly do not intend to implement the matrix decompositions I’m using, since it is better to leverage the capability provided in Breeze and focus on the areas that I’m interested in exploring. While many of the probability distributions I’m playing with overlap with the implementations provided in Breeze, these distributions are largely the area I’m interested in, so implementing them at that layer gives me the opportunity to learn a little more about them, especially as progress starts to move towards the multivariate methods.
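As a rough illustration of that split between decomposition and distribution, the sketch below writes a multivariate normal log density by hand while leaving the Cholesky factorisation to Breeze. It assumes Breeze is on the classpath; MvnSketch and logDensity are hypothetical names and are not part of this project or of Breeze.

```scala
import breeze.linalg.{DenseMatrix, DenseVector, cholesky, diag, sum}
import breeze.numerics.log

object MvnSketch {
  /** Log density of N(mu, sigma) at x; sigma must be symmetric positive definite. */
  def logDensity(x: DenseVector[Double],
                 mu: DenseVector[Double],
                 sigma: DenseMatrix[Double]): Double = {
    val k = x.length
    // Breeze supplies the decomposition: sigma = lower * lower.t
    val lower = cholesky(sigma)
    val diff = x - mu
    // Solve lower * z = diff, so that (z dot z) is the squared Mahalanobis distance.
    // Note: this uses Breeze's general solver, not a dedicated triangular solve.
    val z = lower \ diff
    // log det(sigma) = 2 * sum of the logs of the Cholesky diagonal
    val logDet = 2.0 * sum(log(diag(lower)))
    -0.5 * (k * math.log(2.0 * math.Pi) + logDet + (z dot z))
  }
}
```

For a standard bivariate normal evaluated at its mean, the quadratic term and the log determinant both vanish, so the result reduces to -log(2π).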

The issues of processing data (IO, memory, performance, representation) are also unavoidable. The library is not designed for processing “big data” at all; it is designed for experimenting with relatively small sample sizes, so no effort has gone into addressing issues of scale at this stage. In most cases the methods for loading and processing data focus on loading data sets into matrices, and hence not on iterative methods that might lend themselves to parallel processing. Where parallel processing is required, combining the results of smaller computations (by averaging, for example) might be of interest, although this has not been explored.
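Since the library does not explore this, the following is only a sketch of what combining the results of smaller computations could look like: each chunk is reduced to a count, a mean and a sum of squared deviations, and the summaries are then merged pairwise. ChunkedMoments, Moments, fromChunk and merge are hypothetical names, not library code.

```scala
object ChunkedMoments {
  /** Summary of one chunk: count, mean, and sum of squared deviations (m2). */
  final case class Moments(count: Long, mean: Double, m2: Double) {
    // Sample variance; undefined for fewer than two observations.
    def variance: Double = if (count > 1) m2 / (count - 1) else Double.NaN
  }

  /** Summarise a single chunk that fits in memory. */
  def fromChunk(xs: Seq[Double]): Moments = {
    val n = xs.length
    if (n == 0) Moments(0L, 0.0, 0.0)
    else {
      val mean = xs.sum / n
      val m2 = xs.map(x => (x - mean) * (x - mean)).sum
      Moments(n.toLong, mean, m2)
    }
  }

  /** Merge two summaries as if their chunks had been processed together. */
  def merge(a: Moments, b: Moments): Moments =
    if (a.count == 0) b
    else if (b.count == 0) a
    else {
      val n = a.count + b.count
      val delta = b.mean - a.mean
      // Count-weighted update of the mean.
      val mean = a.mean + delta * b.count / n
      // The cross term accounts for the gap between the two chunk means.
      val m2 = a.m2 + b.m2 + delta * delta * a.count * b.count / n
      Moments(n, mean, m2)
    }
}
```

For example, Seq(Seq(1.0, 2.0), Seq(3.0, 4.0, 5.0), Seq(6.0)).map(ChunkedMoments.fromChunk).reduce(ChunkedMoments.merge) gives a mean of 3.5 and a sample variance of 3.5, the same as processing the six values in one pass. Because merge is associative, the chunks could be summarised in parallel and combined in any grouping.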

Hopefully, for the selection of methods that I’m currently investigating, which are largely linear methods (and eventually generalised linear methods), decompositions, and distance-based methods, there will eventually be a reasonable implementation and a selection of notes on those methods, along with references to source texts and articles. I’ve also recently found that it’s possible to run Scala within R Markdown notebooks via jvmr, so it will be possible to update the notes using this capability, combining notes on the methods with notation and examples of the library usage.