Once the model has been constructed, it is evaluated with several different metrics, because each metric has limitations and performance cannot be judged by any single one. All metrics are calculated on predictions made for a held-out test set.
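The coefficient of determination \(R^{2}\) describes the proportion of the variance in the observations that is explained by the model predictions [1]. For \(N\) paired observations \(O_{i}\) and predictions \(P_{i}\), with means \(\overline{O}\) and \(\overline{P}\), it is computed as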
\[R^{2} = \left\{ \frac{\sum_{i = 1}^{N}{\left( O_{i} - \overline{O} \right)\left( P_{i} - \overline{P} \right)}}{\sqrt{\sum_{i = 1}^{N}\left( O_{i} - \overline{O} \right)^{2}}\sqrt{\sum_{i = 1}^{N}\left( P_{i} - \overline{P} \right)^{2}}} \right\}^{2}\]
The Pearson correlation coefficient \(r\) is the term inside the braces, so \(|r| = \sqrt{R^{2}}\). \(R^{2}\) measures only the linear relationship between the observed \(O\) and predicted \(P\) values and is insensitive to additive and proportional differences between them [1]. Both \(R^{2}\) and \(r\) are also sensitive to outliers [1].
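As an illustration only, the following Python sketch computes \(R^{2}\) directly from the formula above. The function name `r_squared` and the sample values are hypothetical and are not taken from the thesis data. The predictions are deliberately offset and scaled (\(P = 2O + 10\)) to show that \(R^{2}\) still reports a near-perfect fit.

```python
import math


def r_squared(obs, pred):
    """Squared Pearson correlation between observations and predictions."""
    n = len(obs)
    o_mean = sum(obs) / n
    p_mean = sum(pred) / n
    # Numerator: sum of cross-products of deviations from the means.
    cross = sum((o - o_mean) * (p - p_mean) for o, p in zip(obs, pred))
    # Denominator: product of the root sums of squared deviations.
    o_ss = sum((o - o_mean) ** 2 for o in obs)
    p_ss = sum((p - p_mean) ** 2 for p in pred)
    return (cross / (math.sqrt(o_ss) * math.sqrt(p_ss))) ** 2


# Hypothetical values: predictions are biased and scaled, yet perfectly linear.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [2.0 * o + 10.0 for o in obs]
print(r_squared(obs, pred))  # approximately 1, despite the large errors
```

This is why \(R^{2}\) is reported alongside the other measures in this section rather than on its own.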
The Nash-Sutcliffe Efficiency \(E\) is one minus the ratio of the mean square error to the variance of the observed data, and has the range \(-\infty < E \leq 1\) [1].
\[E = 1.0 - \frac{\sum_{i = 1}^{N}\left( O_{i} - P_{i} \right)^{2}}{\sum_{i = 1}^{N}\left( O_{i} - \overline{O} \right)^{2}}\]
For values of \(E \leq 0\), the model performs no better than predicting the mean value of the observations [1]. The efficiency measure captures differences between the means and variability of the observations and predictions better than the correlation-based measures \(R^{2}\) and \(r\) [1].
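The efficiency formula above can be sketched in a few lines of Python. This is a minimal illustration, with a hypothetical function name and invented sample values, showing the two reference points discussed in the text: a perfect prediction and a prediction equal to the observed mean.

```python
def nash_sutcliffe(obs, pred):
    """Nash-Sutcliffe Efficiency: 1 - SSE / sum of squares about the observed mean."""
    o_mean = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_obs = sum((o - o_mean) ** 2 for o in obs)
    return 1.0 - sse / ss_obs


# Hypothetical observations.
obs = [1.0, 2.0, 3.0, 4.0]
mean_only = [sum(obs) / len(obs)] * len(obs)

print(nash_sutcliffe(obs, obs))        # perfect prediction gives 1.0
print(nash_sutcliffe(obs, mean_only))  # predicting the observed mean gives 0.0
```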
Gupta et al. [2] showed, however, that the Nash-Sutcliffe Efficiency tends to underestimate the variability between observations and simulated values. To address this issue they proposed the Kling-Gupta Efficiency (KGE), which has a similar range, \(-\infty < \text{KGE} \leq 1\). Knoben et al. [3] showed that predicting the mean of the observations gives \(\text{KGE} = 1 - \sqrt{2} \approx -0.41\). The KGE is defined as
\[\text{KGE} = 1 - ED\]
\[ED = \sqrt{\left( r - 1 \right)^{2} + \left( \alpha - 1 \right)^{2} + \left( \beta - 1 \right)^{2}}\]
Here \(r\) is the correlation between the observed (\(o\)) and simulated (\(s\)) values, \(\alpha = \sigma_{s}/\sigma_{o}\) is the ratio of their standard deviations, and \(\beta = \mu_{s}/\mu_{o}\) is the ratio of their means [2].
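A minimal Python sketch of the KGE calculation is given below, assuming population standard deviations (the choice cancels in the ratio \(\alpha\) as long as it is applied to both series). The function name `kge` and the sample values are hypothetical. The simulated series is the observed series scaled by 1.5, so \(r \approx 1\) while \(\alpha = \beta = 1.5\).

```python
import math


def kge(obs, sim):
    """Kling-Gupta Efficiency from correlation, variability ratio and mean ratio."""
    n = len(obs)
    mu_o = sum(obs) / n
    mu_s = sum(sim) / n
    # Population standard deviations (assumption; the ratio is unaffected).
    sd_o = math.sqrt(sum((o - mu_o) ** 2 for o in obs) / n)
    sd_s = math.sqrt(sum((s - mu_s) ** 2 for s in sim) / n)
    cov = sum((o - mu_o) * (s - mu_s) for o, s in zip(obs, sim)) / n
    r = cov / (sd_o * sd_s)   # correlation
    alpha = sd_s / sd_o       # variability ratio
    beta = mu_s / mu_o        # mean ratio
    ed = math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
    return 1.0 - ed


# Hypothetical values: simulation over-predicts by 50% everywhere.
obs = [1.0, 2.0, 3.0, 4.0]
sim = [1.5 * o for o in obs]
print(kge(obs, sim))  # about 0.29, i.e. 1 - sqrt(0.5)
```

Note that the sketch does not handle a constant simulated series, for which \(\sigma_{s} = 0\) and the correlation is undefined.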
Willmott’s Index of Agreement \(d\) is based on the ratio between the mean square error (MSE) and the total potential error (PE), and captures differences in means and variability between observations and model predictions [1]. It has the range \(0 \leq d \leq 1\) and is computed as
\[d = 1.0 - \frac{\sum_{i = 1}^{N}\left( O_{i} - P_{i} \right)^{2}}{\sum_{i = 1}^{N}\left( \left| P_{i} - \overline{O} \right| + \left| O_{i} - \overline{O} \right| \right)^{2}} = 1.0 - N\frac{\text{MSE}}{\text{PE}}\]
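The index can be sketched in Python as follows. The sketch assumes the commonly used form of Willmott's index, in which both absolute deviations in the potential error are taken about the observed mean \(\overline{O}\). The function name and sample values are hypothetical.

```python
def willmott_d(obs, pred):
    """Willmott's Index of Agreement, with potential error about the observed mean."""
    o_mean = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    # Potential error: largest squared distance each pair could have
    # from the observed mean (assumed standard definition).
    pe = sum((abs(p - o_mean) + abs(o - o_mean)) ** 2 for o, p in zip(obs, pred))
    return 1.0 - sse / pe


# Hypothetical values.
obs = [1.0, 2.0, 3.0, 4.0]
print(willmott_d(obs, obs))                   # perfect agreement gives 1.0
print(willmott_d(obs, [2.5, 2.5, 2.5, 2.5]))  # predicting the observed mean gives 0.0
```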
The measures above are dimensionless, which allows models to be compared even where site locations differ.
To quantify the error in the units of the observations, absolute measures of error are used. The mean absolute error (MAE) is the average absolute difference between observations and model predictions [1].
\[\text{MAE} = \frac{1}{N}\sum_{i = 1}^{N}\left| O_{i} - P_{i} \right|\]
The mean square error (MSE) is the average of the squared residuals, and the root mean square error (RMSE) returns it to the units of the observations by taking the square root, \(\text{RMSE} = \sqrt{\text{MSE}}\).
\[\text{MSE} = \frac{1}{N}\sum_{i = 1}^{N}\left( O_{i} - P_{i} \right)^{2}\]
Because the residuals are squared, both the MSE and the RMSE are also sensitive to extreme values [1].
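The effect of squaring can be seen in a small Python sketch. The function names and values are hypothetical: both sets of predictions have the same total absolute error, but in the second the error is concentrated in a single extreme value.

```python
import math


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def mse(obs, pred):
    """Mean square error."""
    return sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root mean square error, in the units of the observations."""
    return math.sqrt(mse(obs, pred))


# Hypothetical values.
obs = [10.0, 10.0, 10.0, 10.0]
spread = [11.0, 11.0, 11.0, 11.0]   # an error of 1 at every point
outlier = [10.0, 10.0, 10.0, 14.0]  # a single error of 4

print(mae(obs, spread), rmse(obs, spread))    # 1.0 1.0
print(mae(obs, outlier), rmse(obs, outlier))  # 1.0 2.0
```

The MAE is identical in both cases, while the RMSE doubles when the error is concentrated in one point.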
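It is also useful to report the bias of the model, which is simply the mean error. With the sign convention used here, a positive bias indicates that the model under-predicts on average and a negative bias that it over-predicts.
\[\text{Bias} = \frac{1}{N}\sum_{i = 1}^{N}\left( O_{i} - P_{i} \right)\]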
The RMSE is also used to form a percentage error measure that is useful for model comparison. The Relative Root Mean Square Error (RRMSE) is the ratio of the RMSE to the mean of the observations, expressed as a percentage.
\[\text{RRMSE} = \frac{\text{RMSE}}{\overline{O}} \times 100\]
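As a minimal sketch, the bias and RRMSE can be computed as below. The function names and sample values are hypothetical, and the bias follows the observation-minus-prediction convention of the formula above.

```python
import math


def bias(obs, pred):
    """Mean error, observation minus prediction."""
    return sum(o - p for o, p in zip(obs, pred)) / len(obs)


def rrmse(obs, pred):
    """RMSE as a percentage of the mean observation."""
    n = len(obs)
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / n)
    return rmse / (sum(obs) / n) * 100.0


# Hypothetical values.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [2.0, 2.0, 2.0, 6.0]

print(bias(obs, pred))   # -0.5: the model over-predicts on average
print(rrmse(obs, pred))  # about 49 (percent of the mean observation)
```

Because the RRMSE divides by \(\overline{O}\), it becomes unstable when the mean of the observations is close to zero.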
Visualisations such as histograms of errors, time series, and distribution plots of observations versus predictions are also employed in this thesis to illustrate model simulations against observations and to examine the bias of the model.
An advantage of the SILO data is that it contains observations from 2006 up to the current date. In addition to the held-out test set used for model evaluation, the downscaling model can therefore be driven with GCM projections from 2006 until the end of 2019 to assess whether the projections agree with observations. Both the test set and this subset of projections are used to evaluate the degree of uncertainty for the selected models. As there are multiple scenarios (RCP4.5 and RCP8.5), it is also possible to assess the extent to which the two differ, given the more recent observations provided by the SILO data.
Prediction intervals for the bias are estimated with the Quantile Regression (QR) method, as demonstrated by Muthusamy et al. [4]. The Mean Prediction Interval (MPI) is calculated from the upper \(PL^{u}\) and lower \(PL^{l}\) prediction limits derived from the QR estimation, and describes the mean width of the prediction interval [4]. Over \(n\) time steps \(t\), the MPI is calculated for each quantile \(q\) as follows:
\[\text{MPI}_{q} = \frac{1}{n}\sum_{t = 1}^{n}\left( PL_{t}^{u} - PL_{t}^{l} \right)\]
It is preferable for the MPI to remain low for each quantile [4], since a low value indicates a narrow prediction interval. The MPI also makes it possible to compare multiple sites and identify whether there are regional differences in the uncertainty of the model predictions. It is also useful to analyse extreme values separately, such as the values within the upper 25% of observations [4].
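The MPI calculation, and its restriction to the largest observations, can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure of [4]: the prediction limits are invented numbers rather than the output of a quantile regression, the function names are hypothetical, and the "upper 25%" is taken simply as the top quarter of observations by rank.

```python
def mean_prediction_interval(lower, upper):
    """Mean width of the prediction interval over all time steps."""
    return sum(u - l for l, u in zip(lower, upper)) / len(lower)


def mpi_upper_quarter(obs, lower, upper):
    """MPI restricted to the top quarter of observations by rank (assumption)."""
    k = max(1, len(obs) // 4)
    # Indices of the k largest observations.
    top = sorted(range(len(obs)), key=lambda i: obs[i], reverse=True)[:k]
    return mean_prediction_interval([lower[i] for i in top],
                                    [upper[i] for i in top])


# Hypothetical observations and prediction limits (not derived from QR).
obs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
lower = [0.0, 1.0, 2.0, 3.0, 3.0, 4.0, 4.0, 5.0]
upper = [2.0, 3.0, 4.0, 5.0, 7.0, 8.0, 10.0, 11.0]

print(mean_prediction_interval(lower, upper))  # 3.5
print(mpi_upper_quarter(obs, lower, upper))    # 6.0
```

In this invented example the interval is wider for the largest observations, which is the kind of difference a separate analysis of extreme values is intended to reveal.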