Chapter 4 Machine Learning

We now load the standard packages and functions of the dcr module using source("dcr.R") and specify the resolution of the figures.

options(repr.plot.width = 16, repr.plot.height = 9, repr.plot.res = 300)

4.1 Synopsis

Machine learning can be viewed as a “reduced form” of risk-based learning. In supervised learning, we also create a relation between responses/outputs and explanatory variables. One of the main differences to risk-based learning is that machine learning is not interested in testing statistical significance (and sometimes not even economic significance) for the population, and data are not interpreted as random samples. Instead, machine learning is interested in obtaining good predictions in the first place.

How can this be achieved? As we will see later in more detail, splitting the available data is crucial in machine learning. Hence, we divide the data into groups, where one group is used to fit the parameters of a model using a specific target criterion, and the second group (which was not used for fitting) is used to make predictions using the fitted model and to compare the predicted outputs with the observed outputs. We can then evaluate how well these predictions perform for data that was not included in the fitting process. Some special risk-based learning and machine learning models coincide in terms of fitting the target criterion, fitting approach, and even solution, such as standard Linear Regression or Logistic Regression. However, machine learning often extends these models by more complicated non-linear structures or so-called hyperparameters that are not fitted using the data but are rather pre-specified by the researcher. A special case of machine learning is Deep Learning which means that we have a nested structure of equations in a neural network approach with many parameters and hyperparameters.

Machine learning is often divided into supervised and unsupervised learning. Supervised learning means predicting the output of one or more response variables \(y\) for a given set of predictors \(x\), i.e., we have an asymmetric relation between the variables (inputs vs. outputs) as in regression and classification. Quality of the model fit is measured by some cost or loss function \(J()\) that measures the deviation between predictions \(\hat{y}\) and observations \(y\), that is, \(J(y,\hat{y})\). The higher the deviation, the worse the prediction and the higher the costs/losses.

Unsupervised learning treats all variables \(x\) symmetrically, i.e., there is no output. It tries to find specific patterns among the observations, for example according to similarities in the \(x\) variables. Therefore, there is no clear measure of success in terms of a loss function, and heuristic arguments are applied for judgment. A thorough and more technical reference for machine learning is (Hastie et al. 2017).
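To make the distinction concrete, the following minimal sketch applies the built-in kmeans() function to simulated data; the two-group data and the choice of two clusters are illustrative assumptions and not part of the credit risk applications that follow.

```r
# Unsupervised learning sketch: k-means clustering on simulated data.
# No output variable y is involved; the algorithm groups observations
# by similarity of the x variables only.
set.seed(1)
x1 <- c(rnorm(25, mean = 0), rnorm(25, mean = 5))
x2 <- c(rnorm(25, mean = 0), rnorm(25, mean = 5))
km <- kmeans(cbind(x1, x2), centers = 2)
table(km$cluster)
```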

4.2 Terminology

Risk-based and machine learning share many common properties. The following table summarizes some definitions which are common in both areas and actually mean the same thing but may sometimes cause confusion.

|Risk-based Learning |Machine Learning |
|:---------------------------------------|:---------------------|
|Estimation |Fitting |
|Independent/Explanatory Variable, \(X\) |Input, Feature |
|Dependent/Response Variable, \(Y\) |Output |
|Random Error |Noise |
|In-Sample |Training Set |
|Out-of-Sample |Test Set |
|Estimate a Model |Learn a Model |
|Model Parameters |Model Weights |
|Regression, Classification |Supervised Learning |
|Clustering and Dimensionality Reduction |Unsupervised Learning |
|Data Point, Observation |Instance, Sample |
|Intercept |Bias |
|Link Function |Activation Function |
|Logistic |Sigmoid |

4.3 Cost Functions

Machine Learning is about finding a model that best fits the training data and, more importantly, the test data by means of an optimization. This requires an outcome variable which we would like to predict. It is hard to predict an outcome perfectly ex-ante, and the predictions will deviate from the observed outcomes. From these deviations, we typically compute an aggregated quantity which we optimize.

The target quantity of the optimization is generally called cost or loss. For metric outputs (e.g., LGDs), we basically use the differences between the observed and model-predicted outputs. This deviation can be interpreted as a loss of precision, or as costs. However, if we only use the difference, then positive and negative differences would cancel out when we sum them up: we could obtain a model with a cost function value of zero that nevertheless has very large individual deviations. Why not then simply use absolute deviations? Optimization means computing the first derivative of the target function and setting it to zero, and this is demanding for the absolute value because it is not differentiable at zero. Hence, we mostly use the squared deviations, more precisely, the sum of squared deviations of observed and predicted values as target function. Let \(y_i\) be the observed values and \(\hat{y}_i = f(x_i, \beta)\) the predicted values according to some estimated model for given inputs \(x_i\) and parameters \(\beta\); then the average total costs for metric variables are

\[ \begin{align*} J = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \end{align*} \]

This is the Mean Squared Error (MSE), which is minimized using an optimization algorithm that we will investigate later in this book. This approach is also called Sum of Squares Minimization or least squares.
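As a quick numeric illustration with made-up observed and predicted values, the MSE can be computed directly:

```r
# Hypothetical observed and predicted values for illustration only.
y     <- c(0.30, 0.55, 0.50, 0.85)
y_hat <- c(0.35, 0.50, 0.45, 0.90)
mse   <- mean((y - y_hat)^2)   # Mean Squared Error
mse
## [1] 0.0025
```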

4.4 Information and Entropy

We will now look at information theory for categorical outputs like defaults. We can interpret new data as a signal following (Shannon 1948). Let’s assume we have two future events/signals (e.g., default or non-default), measured by random variable \(D\). Our current knowledge is the probability of default \(\pi\). For a low value of \(\pi\), the chance of non-default is high and the observation of a default would be a surprise with high information content. Hence, we can define the information of a default event using its probability

\[ \begin{align*} I(d) = \log_2\left(\frac{1}{\pi}\right) = -\log_2(\pi) \end{align*} \] where the base 2 of the logarithm is used for normalization because we have two events.

The information of a non-default is:

\[ \begin{align*} I(nd) = -\log_2(1-\pi) \end{align*} \] Logarithms are used by convention because they give the number of bits needed to describe the information, e.g., one loan with default/non-default requires \(\log_2(2)=1\) bit, and two loans require \(\log_2(4)=2\) bits as there are four potential states.

We compute the information in dependence of \(\pi\) using the following code.

I <- function(pi) { 
    ifelse(pi<=0 | pi>1, NA_real_, -log2(pi)) }   

We then plot the information of a default and of a non-default as functions of \(\pi\) using geom_function().

ggplot() + 
    xlim(0,1) + 
    geom_function(data=data.frame(z="Default"), fun=~I(.x), aes(col=z)) + 
    geom_function(data=data.frame(z="Non-default"), fun=~I(1-.x), aes(col=z)) + 
    labs(col=NULL, x=expression(pi), y="Information") + 
    facet_wrap(~z) + 
    scale_color_dcr()

The figure then shows the information of a default (left) and non-default (right) in dependence of \(\pi\). We see that a default becomes less informative the higher \(\pi\) is, i.e., the more we expect a default.

Rather than a logarithm with base 2 (\(\log_2\)), natural logarithms (\(\ln\)) are often used in practice for convenience. The entropy is defined as the average information content of a message/signal over default and non-default:

\[ \begin{align*} H = - \pi \ln(\pi)- (1-\pi) \ln (1-\pi) \end{align*} \]

which is defined as a function below using natural logarithms. Next, \(\pi\) is varied, and the entropy function is evaluated and plotted. The figure shows that entropy is highest for \(\pi=0.5\), reflecting maximum uncertainty: default and non-default are equally likely.

Entropy <- function(pi) {
    ifelse(pi<=0 | pi >1, NA_real_, -pi * log(pi) - (1-pi)*log(1-pi)) }
ggplot() + xlim(0,1) + 
    geom_function(fun=~Entropy(.x), col=dcr_palette$main["navy"]) + 
    labs(x=expression(pi), y="Entropy")

So far, information and entropy are concepts using only the parameters of the theoretical distribution of the random event, i.e., \(\pi\). We can define the crossentropy for the theoretical distribution and the distribution of observed data:

\[ \begin{align*} H^* = - d \ln(\pi) - (1-d) \ln (1-\pi) \end{align*} \] This function is defined in the following code and evaluated for varying \(\pi\), both for a default (d=1) and a non-default (d=0). The figure shows the resulting plots of crossentropy as a function of \(\pi\): for a default, it equals the information of a default (with natural logarithms), and for a non-default, that of a non-default.

CrossEntropy <- function(pi, d) { 
    case_when(pi<=0  ~ NA_real_, 
              pi>=1 ~ NA_real_, 
              d==1 ~ -log(pi), 
              d==0 ~ -log(1-pi)) } 
ggplot() + xlim(0,1) + 
    geom_function(data=data.frame(z="Default"), fun=~CrossEntropy(.x, d=1), aes(col=z)) + 
    geom_function(data=data.frame(z="Non-default"), fun=~CrossEntropy(.x, d=0), aes(col=z)) +
    facet_wrap(~z) + 
    labs(x=expression(pi), y="Cross entropy", col=NULL)

We see that the crossentropy of a default is highest for low \(\pi\), and that of a non-default is highest for high \(\pi\).

Now let’s look at information and crossentropy from another perspective. New information of a signal can be interpreted as surprise. And surprise in risk management should be avoided. The empirical question then becomes: how should we specify \(\pi\) (ex-ante) in order to keep surprise as low as possible? The answer is: set \(\pi=1\) for a defaulter and \(\pi=0\) for a non-defaulter. Then crossentropy and surprise would be zero.

Obviously this will not be possible in practice, because defaults are future events and we cannot know for sure which loans will default and which ones won’t. However, we can try to estimate this as closely as possible, i.e., try to specify \(\pi\) low for non-defaults and high for defaults. This is exactly what we will do in our optimization. We look at observed data and try to derive rules regarding how to set \(\pi\) low for non-defaults and high for defaults (ex-ante). That is, we search for values of \(\pi\) (under some restrictions) that minimize crossentropy, i.e., the surprise due to realized events. Therefore crossentropy is often called Crossentropy Loss and we try to minimize the loss.

We usually observe many loans. Hence, we transfer the concept of crossentropy loss to all observed data. Let the observed events for loan \(i\) be \(d_i\) and \(1-d_i\). Averaging the crossentropy over all \(n\) loans, we obtain:

\[ \begin{align*} H^* = -\frac{1}{n} \left( \sum_{i=1}^n d_i \ln(\pi_i) + \sum_{i=1}^n (1-d_i) \ln(1-\pi_i) \right) \end{align*} \]

The expression on the right hand side was introduced before as the negative log-likelihood divided by the constant \(n\). Therefore, minimizing cross-entropy for independent binary events (binary cross-entropy) equals maximizing the log-likelihood.

Binary cross-entropy is a popular standard target function in machine learning, as it is plausible, related to risk-based learning and analytical derivatives exist which make optimizations straightforward. However, in some cases, other target functions may be applied. We study these in later chapters.
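The equivalence of binary cross-entropy and the (scaled) negative log-likelihood can be checked numerically. The sketch below uses made-up default data; glm() is only used to obtain fitted PDs and the corresponding log-likelihood.

```r
# Hypothetical default indicators and a feature, for illustration only.
d <- c(0, 1, 0, 1, 0, 1)
x <- c(1.2, 0.5, -0.3, -1.0, 0.8, -0.6)
fit <- glm(d ~ x, family = binomial(link = "logit"))
pi_hat <- fitted(fit)   # fitted PDs
# binary cross-entropy of the fitted PDs
ce <- -mean(d*log(pi_hat) + (1-d)*log(1-pi_hat))
# equals the negative log-likelihood divided by n
all.equal(ce, -as.numeric(logLik(fit))/length(d))
## [1] TRUE
```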

4.5 Optimization: Gradient Descent

Once we have defined our cost or loss function, we want to find the parameters that minimize it. Imagine you are at the top of a mountain and let a ball roll downhill: the ball will most likely follow the route with the steepest gradient. Finding the minimum of a cost or loss function is similar: we change the parameters (the ball) in the direction that yields the largest improvement of the target function. As the function (surface) may have different gradients, the direction of the parameters (ball) might change until they end up in a minimum. The mathematical method often used in machine learning for finding the minimum is called Gradient Descent, an iterative method for the numerical minimization of a function. For example, imagine we have a real-valued, differentiable function of an \(n\)-dimensional vector \(z\), \(f(z)\), where \(f:\mathbb{R}^n \rightarrow \mathbb{R}\), which we want to minimize, i.e.,

\[ \begin{align*}\min_{z} f(z) \end{align*} \]

Given a starting point \(z_0\), we generate values \(z_k\) iteratively according to

\[ \begin{align*} z_{k+1} = z_k + \eta_k \cdot d_k, \ \ \ k =0,1,2,... \end{align*} \]

where \(\eta_k>0\) is the learning rate and \(d_k \in \mathbb{R}^n\) is the descent direction. \(\eta_k\) and \(d_k\) are determined in every step such that the sequence \(z_k\) converges to a stationary point of \(f\).

4.5.1 Simple Example

Consider a simple example with one variable \(z\) and the function \(y = f(z) = z^2\).

fun <- function(z) { return(z**2) } 

The value of \(z\) that minimizes this function is obviously zero, with a function value of zero. Let’s now apply gradient descent to find this solution. The first derivative is \(\frac{\partial f(z)}{\partial z} = 2 \cdot z\).

deriv_fun <- function(z) { return(2*z) }

Next the learning rate \(\eta\) and the starting value are set.

lr <- 0.01
z_s <- -0.1

f <- function(z_s, y, lr) { z_s - lr*deriv_fun(z_s)  }
data.demo <- data.frame(z_s=1:1000 %>% accumulate(.f=f, lr=lr, .init=z_s)) %>% 
    mutate(f_z_s=fun(z_s))

The gradient descent method is then applied iteratively using accumulate(). For the starting value of \(z_0=-0.1\), we obtain the derivative \(2\cdot z_0 = -0.2\) with function value \(f(z_0)=f(-0.1) =0.01\). The next value of \(z\) is then set according to \(z_1 = z_0 - \eta \cdot \frac{\partial f(z)}{\partial z}|_{z=-0.1} = -0.1 - 0.01 \cdot (-0.2) = -0.098\). This yields the derivative \(2\cdot z_1 = -0.196\). This process is then repeated until the number of iterations (1,000) is reached.

You can see from the final printed iteration that the value of \(z\), the function, as well as the derivative are close to the exact solution of zero.

data.demo %>% 
    ggplot(aes(x=z_s, y=f_z_s)) + 
    geom_line(col=dcr_palette$main["navy"]) + 
    scale_x_continuous(name=expression(z[s]), labels = scales::comma) + 
    scale_y_continuous(name=expression(z[s]**2)) + 
    geom_point(data=data.demo %>% slice_tail(), col=dcr_palette$main["navy"]) 

data.demo %>% slice_tail() %>% knitr::kable()
|z_s |f_z_s |
|:---|:-----|
|0   |0     |

4.5.2 Continuous Outcomes (e.g., LGDs)

Now assume we have a linear model \(f(x_i, \beta) = x_i \cdot \beta\). We can then write the cost function as

\[ \begin{align*}J(\beta) = \frac{1}{n}\sum_{i = 1}^n (y_i - f(x_i, \beta))^2 = \frac{1}{n} (y- X\beta)^T (y- X\beta) \end{align*} \] The gradient of the cost function is

\[ \begin{align*} \frac{\partial J(\beta)}{\partial \beta} = - \frac{2}{n} X^T( y -X\beta) \end{align*} \]

The parameters are then adjusted according to

\[ \begin{align*} \beta_{k+1} = \beta_k - \eta \frac{\partial J(\beta_k)}{\partial \beta_k} \end{align*} \]

where \(\eta\) is the (constant) learning rate.

Let’s have a look at a simple example with only one \(x\)-variable and only a few observations for continuous variables, say LGD is \(y\) and the loan-to-value ratio (LVR) is \(x\). First, we create the \(x\) and \(y\) variables.

x <- c(0.35, 0.8, 0.55, 1.10)
y <- c(0.3, 0.55, 0.5, 0.85)

We subsequently create a scatter plot.

data.frame(LVR=x, LGD=y) %>% 
    ggplot(aes(x=LVR, y=LGD)) + 
    geom_line(col=dcr_palette$main["navy"], linetype=2) + 
    geom_point(col=dcr_palette$main["navy"]) + 
    scale_y_continuous(labels=percent)

We define the cost function next.

J_cost_fun <- function(beta, lgd, lvr) { 
    # vectorized over beta so that it can be used with geom_function()
    map_dbl(beta, ~mean((.x*lvr - lgd)^2)) }

Then we let the beta parameter vary. This returns various values for the cost function in dependence of beta as shown in the figure below.

ggplot() + xlim(0, 1) + 
    geom_function(fun=J_cost_fun, args=list(lgd=y, lvr=x), col=dcr_palette$main["navy"]) + 
    labs(y="costs/loss", x=expression(beta)) 

We then define the derivative.

J_deriv <- function(beta, lvr, lgd) { 
    # vectorized over beta; gradient of the MSE cost function
    map_dbl(beta, ~mean(-2*lvr*(lgd - lvr*.x))) }  

Next we set the learning rate and a starting value.

lr = 0.01
beta_s = 0.1

Finally we let the parameter vary according to gradient descent to reach the minimum. The results are shown in the next figure.

f <- function(beta_s, y, lr, lvr, lgd) { 
    beta_s - lr*J_deriv(beta_s, lvr, lgd)  }
data.demo <- data.frame(beta_s=1:1000 %>% accumulate(.f=f, lr=lr, lvr=x, lgd=y, .init=beta_s)) %>% 
    mutate(f_beta_s=J_cost_fun(beta_s, lvr=x, lgd=y)) 
data.demo %>% 
    ggplot(aes(x=beta_s, y=f_beta_s)) + 
    geom_line(col=dcr_palette$main["navy"]) + 
    scale_x_continuous(name=expression(beta[s]), labels = scales::comma, expand = expansion(mult = c(0, .1))) + 
    scale_y_continuous(name=expression(J(beta[s]))) + 
    geom_point(data=data.demo %>% slice_tail(), col=dcr_palette$main["navy"]) + 
    geom_label(data=data.demo %>% slice_tail(), aes(label=percent(beta_s, accuracy =0.01)), vjust=-0.25)

We compare the result of the gradient descent to an OLS regression estimated with lm(). Note that we have assumed the constant to be zero and therefore do not need an intercept in the OLS regression. We specify this by coding lvr - 1 on the right-hand side.

mod <- data.frame(lgd=y, lvr=x)
results <- lm(lgd ~ lvr-1, data = mod)
beta <- summary(results)$coefficients[1,1]
cat("beta:", round((beta), digits = 4))
## beta: 0.7714
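For this one-parameter model without intercept, the least-squares solution is also available in closed form, \(\beta = \sum_i x_i y_i / \sum_i x_i^2\), which we can verify against the gradient descent and lm() results:

```r
# Closed-form least squares without intercept for the LGD/LVR example.
x <- c(0.35, 0.8, 0.55, 1.10)
y <- c(0.3, 0.55, 0.5, 0.85)
beta_closed <- sum(x*y) / sum(x^2)
round(beta_closed, digits = 4)
## [1] 0.7714
```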

4.5.3 Binary Outcomes (e.g., PDs)

In the case of binary classification (e.g., for defaults), we have the cost function

\[ \begin{align*} J(\beta) &= -\frac{1}{n} \sum_{i=1}^n (d_i \ln(\pi_i) + (1-d_i) \ln(1-\pi_i)) \\ &=\frac{1}{n} \left( -d^T\ln(\pi) - (1-d)^T \ln (1 - \pi) \right) \end{align*} \]

where \(d\) and \(\pi\) are the vectors of defaults and PDs, respectively. For a Logit link function between the PDs and given features we obtain

\[ \begin{align*} \pi_i = P(d_i = 1 \, | \, x_i) = \frac{\exp(x_i\beta)}{1 + \exp(x_i\beta)} = \frac{1}{1 + \exp(-x_i \beta)} \end{align*} \]

The gradient of the cost function becomes

\[ \begin{align*} \frac{\partial J(\beta)}{\partial \beta_j} = \frac{1}{n} \sum_{i=1}^n x_{ij} \left( \pi_i - d_i \right) \\ \frac{\partial J(\beta)}{\partial \beta} = \frac{1}{n} \left( X^T \left( \pi - d \right) \right) \end{align*} \]

The parameters are then adjusted according to (proof not shown here):

\[ \begin{align*} \beta_{k+1} = \beta_k - \eta \frac{\partial J(\beta_k)}{\partial \beta_k} \end{align*} \]

Let’s have a look at a simple example. First, we create \(x\) and \(d\) variables:

x <- c(400, 300, 200, 100)
d <- c(0, 0, 1, 1)

We then define the cost function.

J_cost_fun_bin <- function(beta, x, d) {
    # vectorized over beta so that it can be used with geom_function()
    f <- function(beta, x, d) { 
        pd_pred = exp(beta*x) /(1+ exp(beta*x))
        -mean(d*log(pd_pred) + (1-d) * log(1-pd_pred)) } 
    beta %>% map_dbl( ~f(.x, x, d)) } 

We vary the beta parameter. This returns various values for the cost function in dependence of beta as shown in the figure.

ggplot() + 
    xlim(-0.01, 0.01) + 
    geom_function(fun= J_cost_fun_bin, args=list(x=x, d=d), col=dcr_palette$main["navy"]) + 
    labs(x=expression(beta), y=expression(costs/loss))

Next, we define the derivative.

J_deriv_bin <- function(beta, x, d) {
    # vectorized over beta; gradient of the binary crossentropy cost function
    f <- function(beta, x, d) { 
        pd_pred = exp(beta * x) / (1+exp(beta*x))
        mean(x * (pd_pred - d)) } 
    beta %>% map_dbl( ~ f(.x, x, d)) } 

We set a learning rate and a starting value.

lr = 0.00001
beta_s = -0.02

We then vary the parameter according to gradient descent to reach the minimum.

f <- function(beta_s, y, lr, lvr, def) { 
    beta_s - lr*J_deriv_bin(beta_s, lvr, def)  }
data.demo <- data.frame(beta_s=1:1000 %>% accumulate(.f=f, lr=lr, lvr=x, def=d, .init=beta_s)) %>% 
    mutate(f_beta_s=J_cost_fun_bin(beta_s, x=x, d=d)) 
data.demo %>% 
    ggplot(aes(x=beta_s, y=f_beta_s)) + 
    geom_line(col=dcr_palette$main["navy"]) + 
    scale_x_continuous(name=expression(beta[s]), labels = scales::comma, expand = expansion(mult = c(0, .1))) + 
    scale_y_continuous(name=expression(J(beta[s]))) + 
    geom_point(data=data.demo %>% slice_tail(), col=dcr_palette$main["navy"]) + 
    geom_label(data=data.demo %>% slice_tail(), aes(label=percent(beta_s, accuracy =0.01)), vjust=-0.25)

Finally, we check the result with a Logistic Regression using the glm() function. We provide more details on Logistic Regressions in our default modeling chapter. Note that we have assumed the constant to be zero and therefore do not need an intercept in the regression, which we specify by coding x - 1 on the right-hand side.

log_frame <- data.frame(x, d)
results_logit <- glm(d ~ x - 1, family = binomial(link = "logit"), data = log_frame)
beta <- summary(results_logit)$coefficients[1,1]
cat("Solution for beta:", round(beta, 4))
## Solution for beta: -0.0029

4.6 Learning and Validation

4.6.1 Train vs Test Split

The goals of risk-based learning are parameter estimation and interpretation, hypothesis significance testing, and prediction. In contrast, the main goal in machine learning is prediction: we want to arrive at a model that predicts outcomes effectively for test data. Therefore, the models are calibrated with respect to this goal.

To check the prediction power of a fitted model, it is not sufficient to do this with the same data used for fitting the model. Validating the model with data used for fitting does not ensure a good model for prediction, which is the main reason we build models. Generally, the more complex you specify a model (e.g., using many parameters) the better the fit and the lower the costs will be. For example, if the number of data points equals the number of features, you can fit a linear model with a perfect fit and costs equal to zero.

So, the real check of prediction power of the model is to apply a fitted model to data which the fitting algorithm has not seen before. This is often done by splitting the entire data beforehand into two datasets where a training dataset is used for fitting and a mutually exclusive test dataset is used for computing predictions using the fitted model and computing performance metrics such as a cost function.

The validation check of the model should only be done once, as the model is otherwise artificially tuned with knowledge of the test data, and the prediction power for other data might decrease. This is called hacking and is discussed later in this chapter.

4.6.2 Bias-Variance Tradeoff

When we fit a model to data, we almost surely do not match the data generating model. Why? First, a model is a simplified picture of reality, and it will therefore be too simple. Second, the data might be very noisy, that is, there is almost no structure and mainly randomness in the data. Consider repeated random flips of a coin. While we might estimate the probabilities of 50% for each side of the coin from the observed data, we will not be able to predict the side that will face upwards at the next flip because the result is random. Third, we might have too little data and/or a sampling error in the data, so that the parameters of the model cannot be calibrated reliably. Hence, we will have a fitting error on the training data and a prediction error on new data. The expected error can be decomposed into three error types:

  1. Bias (under-fitting): the model does not fit the data very well. This can be due to incorrect assumptions, e.g., in the model specification. For example, we assume linearity although the data were generated non-linearly. Under-fitting implies a deviation of the average of the estimator for different training sets from the “true” model function, meaning that the model generalizes too much.

  2. Variance (over-fitting): the model fits the training data too well. As a consequence, if we fit the model to different training data, the result is highly sensitive to the data, meaning that the model generalizes too little.

  3. Irreducible random error: this is an error that is due to the random nature or noise of the data. We can fit the model structure and parameters but not the single observations without over-fitting. For example, for a coin flip, we can fit the probabilities of the two sides. However, fitting the outcome (e.g., head or tail) results in over-fitting.

Usually, there is a trade off between bias and variance. Too simple models may have high bias and low variance. Too complex models may have low bias and high variance.

Let’s have a look at an example using a nonlinear sine curve, adapted in simplified form from (Buehlmann and Yu 2003). We generate \(n=20\) random data points where the feature \(x\) is drawn from a Uniform distribution and the random error \(\varepsilon\) is drawn from a Normal distribution with \(\mu=0\) and \(\sigma=0.3\). Our outcome variable is then created as

\[ \begin{align*} Y = \sin (7\cdot x) + \varepsilon \end{align*} \]

We generate training and test data; \(\sigma\) could be varied to study different noise levels. The scatter plots of \(x\) and \(y\) for the train and test data are shown in a chart.

N_sim <- 20 
mu <- 0 
sigma <- 0.3 

x_train <- runif(n = N_sim, min = 0, max=1)
eps_train <- rnorm(n = N_sim, mean = mu, sd = sigma)
y_train <- sin(7 * x_train) + eps_train

x_test <- runif(n = N_sim, min = 0, max=1)
eps_test <- rnorm(n = N_sim, mean = mu, sd = sigma)
y_test <- sin(7 * x_test) + eps_test

x_axis <- seq(0, 1, 0.01)
y_true <- sin(7 * x_axis)

We now plot the train and test data together with the true relation between \(x\) and \(y\):

data.demo <- tibble(x=c(x_train, x_test), 
                    y=c(y_train, y_test), 
               sample=c(rep("Train", length(x_train)), rep("Test", length(x_test)))) 
data.demo %>%  
    ggplot(aes(x=x, y=y)) + 
        geom_point(aes(col="Random realizations")) + 
        facet_wrap(~sample) + 
        geom_function(fun= ~sin(7 *.x), mapping=aes(col="True relation")) 

We subsequently transform the features by polynomial splines and estimate a regression model. We vary the polynomial degree from one (i.e., a linear model) to three and to 20. We fit the models on the training sample and compute predictions for the training and test samples using the predict() function. We then compare the fitted models with the training data, the test data, and the true underlying function.

Each chart on the left shows the generated train data, the true relation and the estimated relation. The charts on the right show the test data. We also compute the mean squared error (the cost/loss function).

\[ \begin{align*} MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \end{align*} \]

The linear relation (polynomial degree of one, upper charts) poorly reflects the true properties of the data. The polynomial degree of three (middle charts) explains the observed data and the true relation well. The polynomial degree of 20 (lower charts) matches the observed train data perfectly, as we have 20 data points, but does not reflect the true properties of the data.

In the test data (right panels), the linear model clearly underfits, similar to the train data in the upper chart. For a polynomial degree of three, we have a lower (better) MSE, albeit higher than in the train data. For a degree of 20, we have a clear overfit of the train data and a very high MSE for the test data.

This shows for the polynomial degree of 20 that the fit in the training sample has considerably improved to an almost perfect fit. However, the fit for the test data has become much worse. This is a clear sign of overfitting.
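The fitting step described above can be sketched as follows. This is a simplified reconstruction, not the book’s exact code: lm() with poly() is used, and the highest degree is set to 19 because poly() requires the degree to be smaller than the number of unique data points (19 polynomial terms plus the intercept still give a perfect fit for 20 points).

```r
# Sketch of the bias-variance example: polynomial fits of varying degree.
set.seed(123)
x_train <- runif(20); y_train <- sin(7*x_train) + rnorm(20, sd = 0.3)
x_test  <- runif(20); y_test  <- sin(7*x_test)  + rnorm(20, sd = 0.3)

mse <- function(y, y_hat) { mean((y - y_hat)^2) }
degrees <- c(1, 3, 19)
mse_train <- mse_test <- numeric(length(degrees))
for (k in seq_along(degrees)) {
    fit <- lm(y_train ~ poly(x_train, degrees[k]))
    mse_train[k] <- mse(y_train, predict(fit))
    mse_test[k]  <- mse(y_test, predict(fit, newdata = data.frame(x_train = x_test)))
}
data.frame(degree = degrees, mse_train, mse_test)
```

The train MSE is non-increasing in the degree by construction (the models are nested), while the test MSE typically deteriorates sharply for the highest degree.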

4.6.3 Crossvalidation and Tuning

Many techniques in machine learning allow hyperparameters (e.g., for regularization, or for specifying the architecture of a neural network) which are not estimated from the data but rather set by the analyst. Hence, an analyst may be tempted to repeat the model estimation on the same data for varying hyperparameters and choose the model with the seemingly best performance. This might yield overfitting: the model is good for the train and test data but poor for prediction on other data. Therefore, the training data are often randomly split into training and validation data several times, and each time another set of hyperparameters is used on the “train-train” data and tested on the “train-validation” data.

This splitting of the training data is called \(K\)-fold Crossvalidation. Let \(\hat{f}(X)\) denote the fitted function.

We split the training data into \(K\) roughly equally sized parts. We then use the \(k\)-th part as validation sample, fit the model to the remaining \(K-1\) parts, and compute the prediction error when predicting the \(k\)-th part. We repeat this for all \(K\) parts and aggregate the prediction errors and scores/losses over the folds.

This strategy is typical for ‘traditional’ machine learning fields, e.g., pattern/image recognition where we want to recognize what we have seen before. Typical additional challenges for credit risk are structural breaks or crises which are not in the training data. We want to make predictions for the future and to anticipate what we haven’t seen before. We do not want to predict the past or present.

Note that this is similar to the distinction between diagnosis and forecast (e.g., of a disease). The former means that a person already has the disease and some diagnostic instrument detects this correctly. The latter means that a person does not yet have the disease and a forecasting instrument predicts whether the person will develop it in the future, or the probability of this.

4.7 Practical Implementation

Let us divide our mortgage data into training and test data. We apply the split along the timeline because we are interested in predicting the future. We first take a look at the time series of the default rates. We see that, starting from period 27, the default rates rise to a peak of about five per cent and drop afterwards.

default_rate <- data %>% group_by(time) %>% summarize(default_time = mean(default_time))
ggplot(default_rate, aes(x=time, y=default_time)) + 
    geom_line(col=dcr_palette$main["navy"]) + 
    labs(y="Default rate", x="Time")

Let’s now challenge this: how well can a model predict the data from periods 27 to 40 using only the data from periods 1 to 26? To this end, we define periods up to 26 as train data and periods 27 to 40 as test data. You may apply other train-test splits.

data_train <- data[which(data$time < 27),]
data_test <- data[which(data$time > 26 & data$time <= 40),]

We fit a model (e.g., Logistic Regression with only FICO) to the train data. First we define the features and the outputs, separately for the train and test data. We observe that the average default rates in both periods are quite different.

X_train <- data_train[["FICO_orig_time"]]
y_train <- data_train[["default_time"]]

X_test <- data_test[["FICO_orig_time"]]
y_test <- data_test[["default_time"]]

cat("\n","Default rate in training sample:", round(mean(y_train), digits = 3))
##  Default rate in training sample: 0.011
cat("\n","Default rate in test sample:", round(mean(y_test), digits = 3))
##  Default rate in test sample: 0.035

Subsequently, we fit a Logistic Regression to the training data. The glm() function fits the model by Fisher scoring (iteratively reweighted least squares), which is robust with respect to the scaling of the input features; they therefore do not need to be standardized.

model_logit <- glm(y_train ~ X_train, family = binomial(link = "logit"), data = data.frame(X_train, y_train))
summary(model_logit)

## Call:
## glm(formula = y_train ~ X_train, family = binomial(link = "logit"), 
##     data = data.frame(X_train, y_train))
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4329  -0.1614  -0.1172  -0.0840   3.3820  
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.980680   0.722564   4.125 3.71e-05 ***
## X_train     -0.011913   0.001199  -9.939  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 1)
##     Null deviance: 1571.2  on 13274  degrees of freedom
## Residual deviance: 1462.7  on 13273  degrees of freedom
## AIC: 1466.7
## Number of Fisher Scoring iterations: 8
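From the fitted coefficients, the PD for an individual loan follows from the logistic link function. The following sketch uses the coefficient estimates from the summary above together with an arbitrary example FICO score of 650:

```r
# PD from the logistic link: PD = 1 / (1 + exp(-(b0 + b1 * FICO)))
b0 <- 2.980680    # intercept estimate from the summary above
b1 <- -0.011913   # FICO coefficient estimate
fico <- 650       # arbitrary example score
pd <- 1 / (1 + exp(-(b0 + b1 * fico)))
round(pd, 4)
## [1] 0.0085
```

As expected, a higher FICO score lowers the predicted PD, because the coefficient on FICO is negative.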

Next, using the predict() function, we can compute PD predictions for the training data. We find that the average of the PD predictions is closely calibrated to the observed default rate.

predictions <- predict(model_logit, type = "response")

round(mean(predictions), digits = 3)
## [1] 0.011
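This close in-sample calibration is no coincidence: for a Logistic Regression with an intercept, the maximum likelihood first-order conditions force the average fitted probability to equal the observed default rate on the training data. A minimal sketch with synthetic data (not the mortgage data) illustrates this property:

```r
# In-sample calibration of the logistic MLE (synthetic data)
set.seed(7)
x <- rnorm(1000)
y <- rbinom(1000, 1, plogis(-2 + 0.5 * x))  # simulated defaults
fit <- glm(y ~ x, family = binomial)
mean(fitted(fit))  # equals mean(y) up to convergence tolerance
mean(y)
```

This property holds only in-sample; it gives no guarantee for out-of-sample calibration, as the test data below will show.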

We then compute predictions for the test data that were not used when fitting the model. We see that the predicted average PD is far from the observed default rate.

# The column names must match the variable names in the model formula
test_frame <- data.frame(X_train = X_test, y_train = y_test)

predictions2 <- predict(model_logit, newdata = test_frame, type = "response")

round(mean(predictions2), digits = 3)
## [1] 0.009

Although the Logistic model in the above form does not have any hyperparameters, we may perform a cross-validation at this stage. Below, we run a five-fold cross-validation and compute the score, which in our case is the cross-entropy. We will look at other scores for each of the five test datasets later.

train_frame <- data.frame(X_train, y_train)

train_frame$y_train <- ifelse(train_frame$y_train == 0, "NoDefault", "Default")

control <- trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = mnLogLoss)

fit <- caret::train(y_train ~ X_train, data = train_frame, method = "glm", metric = "logLoss", trControl = control)

print(round((fit$resample[,1]), digits = 3))
## [1] 0.055 0.056 0.055 0.054 0.055
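The cross-entropy (log loss) reported per fold can also be computed by hand. The following toy sketch, with made-up observations and PDs rather than the mortgage data, shows the formula that caret's mnLogLoss evaluates:

```r
# Cross-entropy (log loss) for binary outcomes:
# J(y, p) = -mean(y * log(p) + (1 - y) * log(1 - p))
y <- c(0, 0, 1, 0, 1)            # observed defaults (toy data)
p <- c(0.1, 0.2, 0.7, 0.3, 0.6)  # predicted PDs (toy data)
log_loss <- -mean(y * log(p) + (1 - y) * log(1 - p))
round(log_loss, 3)
## [1] 0.311
```

Smaller values indicate better probabilistic predictions; a perfect prediction (p equal to y) would yield a log loss of zero.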

4.8 P-Value and ML Hacking

In the risk-based learning of the previous chapter, hypotheses are tested using p-values. This is valid if datasets are interpreted as random samples drawn from a population. However, p-values can be subject to p-hacking, and hypothesis test outcomes can be manipulated. This may be achieved through various methods, including:

  • Stating hypotheses after analyzing the data, i.e., testing what we already know from the data;
  • Multiple testing: continuing to draw random samples until the test outcome is as desired;
  • Multiple variables: testing many variables at once, as one or more of them will be spuriously significant just by chance;
  • Selective reporting: reporting only significant results and omitting the insignificant ones.

An overview and potential ways of mitigation are given in Kellner and Rösch (2019).
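The multiple-variables issue can be illustrated with a small simulation. In the following sketch (synthetic data, unrelated to the mortgage dataset), we regress a pure-noise response on 20 pure-noise predictors, one at a time; at a five per cent level, roughly one of them is expected to appear significant purely by chance:

```r
# Multiple testing with pure-noise predictors: spurious significance
set.seed(42)
n <- 500
y <- rnorm(n)                            # response: pure noise
pvals <- replicate(20, {
  x <- rnorm(n)                          # predictor: noise, unrelated to y
  summary(lm(y ~ x))$coefficients[2, 4]  # p-value of the slope
})
sum(pvals < 0.05)                        # number of spurious 'significant' hits
```

Reporting only such significant predictors while omitting the remaining tests is exactly the selective reporting problem listed above.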

We have previously seen that machine learning (ML) approaches are often reduced in terms of statistical assumptions; p-values are mostly not considered at all. However, ML techniques are similarly susceptible to hacking and manipulation (ML hacking). The following describes one approach to ML hacking:

  1. We develop a model using train data;
  2. We check the prediction power using test data;
  3. We tune the model again using train data;
  4. We check again for test data;
  5. We continue until the prediction power for test data is satisfactory.

Note that this yields overfitting: the seemingly good performance on this particular test dataset is not guaranteed for other test data.
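The danger of this loop can be shown with a minimal simulation. In the sketch below (synthetic data), fifty candidate ‘models’ with no predictive skill are scored on the same test set; picking the best-scoring one makes the reported test accuracy look better than chance:

```r
# ML hacking sketch: repeated evaluation on the same test set
set.seed(1)
n <- 200                            # test set size
k <- 50                             # number of 'tuning' attempts
y_test <- rbinom(n, 1, 0.5)         # outcomes: pure noise
scores <- replicate(k, {
  p <- runif(n)                     # a model with no real skill
  mean((p > 0.5) == y_test)         # its test accuracy
})
round(mean(scores), 3)              # a typical attempt: about chance level
round(max(scores), 3)               # the attempt we would report
```

The maximum over many attempts systematically overstates the true (chance-level) performance, which is why the final evaluation on the test data should be performed only once.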

4.9 Sandbox Problems

  • Compute the information value, entropy and crossentropy for the mortgage dataset.
  • Subsample the mortgage dataset for periods time <= 20. Fit a Logistic Regression model to the subsample. Assume all loans are homogeneous and do not include features. Compare the estimated default probability with the observed default rate for the subsample.


Buehlmann P & Yu B (2003) ‘Boosting with the L2-loss: Regression and classification’, Journal of the American Statistical Association, 98(462):324–339.
Hastie T, Tibshirani R, & Friedman J (2017) The elements of statistical learning, 2nd ed., 12th printing, Springer.
Kellner R & Rösch D (2019) ‘A Bayesian re-interpretation of ’significant’ empirical financial research’, Finance Research Letters, 101402.
Shannon CE (1948) ‘A mathematical theory of communication’, Bell System Technical Journal, 27:379–423 and 623–656.