Chapter 2: Python Literacy

Synopsis¶

There is no need to worry if you are new to Python or even data science. Building great models is about model design and the actual implementation follows quickly. Throughout this book we rely on a small number of building blocks. First examples include:

  • Describing data;
  • Plotting data;
  • Generating new variables;
  • Transforming variables;
  • Subsetting data;
  • Combining data;
  • Regression models.

Further, we provide a summary of the functions written specifically for this book to help you to accelerate your learning. Where it often takes years for financial institutions to build complete credit risk models, with our book you'll be able to do so within a matter of hours.

Note that we have published the Deep Credit Risk series in different programming languages (e.g., R) with comparable results. Differences in syntax may lead to slightly different implementation structures, and results may differ occasionally. For example, random sampling may produce different outcomes despite identical seed values, as the random number generators may be implemented differently.

Installation¶

Anaconda and IDEs¶

Python, created by Guido van Rossum in 1991 and further developed by the Python Software Foundation (see \cite{VanRossum1995} and \cite{VanRossum2009}), is popular in the data science community, particularly for applying machine learning techniques in the banking and fin-tech industries. You may be overwhelmed by the many options available if you are new to Python. We present a number of useful components and different ways of working with the software.

Note, Python is simply a general-purpose programming language, which is installed on the computer. We recommend installing Anaconda, as this includes Integrated Development Environments (IDEs) and common packages like numpy or pandas. Popular IDEs include Jupyter Notebook and Spyder and these become the faces of Python. When you open the software, you will have different windows for programming, logging processes and observing outputs. IDEs have many useful tools for code development and finding bugs.

To install Python via Anaconda on your machine:

  1. Download Anaconda for Python 3.x from www.anaconda.com;
  2. Select your operating system;
  3. Choose either the 32-bit or 64-bit option;
  4. Go through the installation prompts and tick "Add Anaconda to my PATH environment variable [...]";
  5. After the installation, search for Anaconda Navigator in the Start menu. Here you will find several applications; we recommend Jupyter Notebook or Spyder;
  6. Open the Anaconda Prompt from the Start menu, navigate to the Anaconda directory and run "conda update conda" to update Anaconda;
  7. Open your preferred IDE (i.e., Jupyter Notebook or Spyder) and you are ready to start.

Please visit the website www.deepcreditrisk.com for further guidance including installation videos.

All examples were developed in Jupyter Notebook. Jupyter creates notebook documents in which Python can be used. This is helpful if you collaborate with others (perhaps even non-Python users) or would like to publish your findings efficiently. The notebooks can be shared as files, read without execution of code, and published online or printed. Jupyter is a very good place to get started with Python. You may use Spyder or other IDEs when you become more proficient.

Things may not go smoothly. Your computer may be set up in such a way that you do not have administrator rights, and you may need to seek IT support to limit exposure to such issues.

Packages¶

Anaconda installs the major packages, also called libraries (for a list see www.anaconda.org). We use the following basic packages for our book, sorted by importance:

  • pandas: Processing data structures: series (1D) and dataframes (2D); see \cite{Pandas2020};
  • numpy: Processing of n-dimensional array objects, see \cite{Harris2020};
  • scipy: Submodule for statistics, see \cite{Virtanen2020};
  • matplotlib: Plotting library, see \cite{Hunter2007};
  • math: Mathematical functions, see \cite{VanRossum2009};
  • random: Random number generator, see \cite{VanRossum2009};
  • tabulate: Printing tabular data;
  • joblib: Running functions as pipeline jobs, see \cite{Joblib2020};
  • pickle: Converting an object to a file for saving, see \cite{VanRossum2009}.

We mostly require pandas (the acronym is derived from "Python Data Analysis Library"), which provides a dataframe object for data processing with integrated indexing based on labels (.loc) as well as integer positions (.iloc).

pandas also allows for:

  • Data sub-setting;
  • Group processing;
  • Dataset splitting, merging and joining;
  • Time-series functionality.

We require the following libraries for model building:

  • scikit-learn: Machine learning techniques, see \cite{Pedregosa2011};
  • statsmodels: Fitting statistical models. Interacts with pandas data frames to fit statistical models, see \cite{Seabold2010};
  • IPython: Interactive computing, see \cite{Perez2007};
  • pydot and graphviz: Plotting of decision trees;
  • pymc3: Probabilistic programming, see \cite{Salvatier2016};
  • lifelines: Survival analysis, see \cite{Davidson2019};
  • lightgbm: Gradient boosting, see \cite{Ke2017}.

There are two ways to install these libraries: via conda and via pip. pip is the Python Packaging Authority's recommended tool for installing packages from the Python Package Index (PyPI). conda is a cross-platform package and environment manager that installs and manages conda packages from the Anaconda repository as well as from the Anaconda Cloud. We recommend conda in the first instance, as we have observed fewer complications when using this tool. To find a package, search for its name plus "conda install" (e.g., "pandas conda install") and locate the install command, usually on the Anaconda website (e.g., conda install -c anaconda pandas). Copy the command, open the Anaconda Prompt, paste and run it.

The following additional packages need to be installed for later chapters:

  • pymc3: Chapter 3;
  • pydot and graphviz: Chapters 13 and 16;
  • lightgbm: Chapter 15;
  • lifelines: Chapter 17.

Python has many more packages available. These packages are powerful, open-source tools that save time because someone else has done the work for you. Package versions may change over time and may cause error or warning messages. Note that warning messages are not error messages and the code may still run perfectly well. Later versions include a number of bug fixes, so different versions may produce slightly different outputs; to reproduce our results exactly, use the same package versions that we do.

Also, this text is mainly designed to teach you the principles of Deep Credit Risk rather than the detailed syntax: the latter may change from version to version, but the principles stay the same. If an error or warning message arises with an older or newer version, an internet search will usually point to a solution.

However, open-source programming languages like Python have some issues. First, Python packages are often poorly documented. Second, there is limited quality assurance. Few problems generally arise when working with common packages (e.g., OLS or logistic regression), as these have been around long enough to be vetted many times by other users. However, packages that implement cutting-edge models may have bugs, some of which may only be documented in blogs. Third, a number of packages that are common in other disciplines have not been coded in Python, and you may be unable to find a package that meets your needs. Fourth, existing packages are often extended by wrapping: a new package calls the existing packages and adds code around them. This can become a problem if a constituent package is updated on your machine, because the wrapper, and hence your code which relies on it, may no longer work. Workplaces have efficient processes to ensure that the right versions are used; this requires some additional but valuable work.

To limit our exposure to such issues, we have limited our book to common packages and self-coded some applications. At the beginning of every chapter, we import the dcr module, which loads all packages and data as well as some functions. Should you close and re-open your IDE, you will have to execute this introductory code as well as any prior code before continuing.

Coding Guidelines¶

For the purposes of this book, we follow some basic style guidelines:

  • Limit the number of dataset copies;
  • Limit the use of non-common packages;
  • Name functions and datasets with lower case letters;
  • Use label-based indexing (.loc);
  • Separate individual steps with an empty line;
  • Insert comments using hashtags;
  • Random draws are hard-coded using seeds (see the short sketch after this list);
  • You may set up a virtual environment to hard-code package versions and ensure that the same results are obtained in later executions.
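
As a minimal sketch of the seed guideline above (the seed value 12345 mirrors the one used later in this chapter), the seeds of the standard random number generators can be fixed as follows:

import random
import numpy as np

random.seed(12345)     # fix the seed of the built-in random module
np.random.seed(12345)  # fix the seed for numpy-based random draws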

We work with the following main dataframes:

  • data: the complete panel data and subsets thereof. Major samples are the subsets for outcomes (y) and independent features (x). This includes subsets for training and testing and scaled features: X_train_scaled, X_test_scaled, y_train, y_test;
  • data_default: data that conditions on default observations;
  • data_cross: data that contains only cross-section information, i.e., one observation per loan, time-varying information is reduced to the origination time;
  • data_lifetime: data that starts at the end of our observation period (i.e., period 61) and includes repetitive observations until loan maturity.

A more comprehensive documentation of coding standards should be considered in professional environments with team-based coding and staff turnover.

First Look¶

We now import standard packages and functions by calling the dcr module using from dcr import *, and we ignore warning messages.

In [1]:
import warnings; warnings.simplefilter('ignore')
from dcr import *

dcr imports a pandas dataframe called data. We introduce the magic command %matplotlib inline to display figures inline and specify the resolution, size and font size of the figures.

In [2]:
import warnings; warnings.simplefilter('ignore')
%matplotlib inline
plt.rcParams['figure.dpi'] = 300
plt.rcParams['figure.figsize'] = (16, 9)
plt.rcParams.update({'font.size': 16})

Creating Objects¶

Objects are dataframes/arrays or methods. We create a new object by naming it data2 and assigning the dataframe data to it using the = operator. Note that this binds the new name to the same underlying dataframe; to obtain an independent copy, use the .copy() method discussed below.

In [3]:
data2 = data

Subsampling Features¶

We subsample the dataframe by keeping the loan id, time stamp, GDP growth, FICO score at origination and LTV ratio at observation time using double square brackets, data[['...']]. Note that Python is case sensitive and the id variable is 'id' rather than 'ID'. We start with pandas commands and discuss numpy as an alternative last.

In [4]:
data2 = data[['id', 'time', 'gdp_time', 'FICO_orig_time', 'LTV_time']]

Printing¶

We can have a first look at the new object data2 using the print() command. You can use multiple print() commands in one go.

In [5]:
print(data2)
          id  time  gdp_time  FICO_orig_time    LTV_time
0          4    25  2.899137             587   33.911009
1          4    26  2.151365             587   34.007232
2          4    27  2.361722             587   34.335349
3          4    28  1.229172             587   34.672545
4          4    29  1.692969             587   34.951639
...      ...   ...       ...             ...         ...
62173  49972    52  1.081049             708  103.306966
62174  49972    53  0.892996             708   95.736862
62175  49972    54  1.507359             708   91.867079
62176  49972    55  2.422275             708   91.560581
62177  49972    56  1.717053             708   90.874242

[62178 rows x 5 columns]

Alternatively, you may just print the data by typing its name (run: data2). Note that only the last expression in a cell is displayed this way, i.e., you can show only one object per cell.

We see the dataframe (panel) with identifiers id and time. The dataframe has 62,178 rows and 5 columns. Further, we see the time-varying systematic feature gdp_time, the idiosyncratic variable FICO_orig_time and the idiosyncratic time-varying feature LTV_time.

Chaining¶

We can chain methods to dataframes using the dot (.) operator. In the following example, we use the round(decimals=2) method to round all numbers in the resulting dataframe to two decimals.

In [6]:
data2 = data[['id', 'time', 'gdp_time', 'FICO_orig_time', 'LTV_time']].round(decimals=2)
print(data2)
          id  time  gdp_time  FICO_orig_time  LTV_time
0          4    25      2.90             587     33.91
1          4    26      2.15             587     34.01
2          4    27      2.36             587     34.34
3          4    28      1.23             587     34.67
4          4    29      1.69             587     34.95
...      ...   ...       ...             ...       ...
62173  49972    52      1.08             708    103.31
62174  49972    53      0.89             708     95.74
62175  49972    54      1.51             708     91.87
62176  49972    55      2.42             708     91.56
62177  49972    56      1.72             708     90.87

[62178 rows x 5 columns]

Describing¶

We may use the data.info() command to obtain an overview of a dataframe in terms of total number of observations, variable names and formats as well as its size.

In [7]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62178 entries, 0 to 62177
Data columns (total 28 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       62178 non-null  int64  
 1   time                     62178 non-null  int64  
 2   orig_time                62178 non-null  int64  
 3   first_time               62178 non-null  int64  
 4   mat_time                 62178 non-null  int64  
 5   res_time                 1160 non-null   float64
 6   balance_time             62178 non-null  float64
 7   LTV_time                 62153 non-null  float64
 8   interest_rate_time       62178 non-null  float64
 9   rate_time                62178 non-null  float64
 10  hpi_time                 62178 non-null  float64
 11  gdp_time                 62178 non-null  float64
 12  uer_time                 62178 non-null  float64
 13  REtype_CO_orig_time      62178 non-null  int64  
 14  REtype_PU_orig_time      62178 non-null  int64  
 15  REtype_SF_orig_time      62178 non-null  int64  
 16  investor_orig_time       62178 non-null  int64  
 17  balance_orig_time        62178 non-null  float64
 18  FICO_orig_time           62178 non-null  int64  
 19  LTV_orig_time            62178 non-null  float64
 20  Interest_Rate_orig_time  62178 non-null  float64
 21  state_orig_time          61828 non-null  object 
 22  hpi_orig_time            62178 non-null  float64
 23  default_time             62178 non-null  int64  
 24  payoff_time              62178 non-null  int64  
 25  status_time              62178 non-null  int64  
 26  lgd_time                 1525 non-null   float64
 27  recovery_res             1525 non-null   float64
dtypes: float64(14), int64(13), object(1)
memory usage: 13.3+ MB

The variables are generally observed for all observations. Exceptions are lgd_time, recovery_res and res_time as these are only observed for default_time=1 and after the resolution process is complete. We cover more details in our chapter on outcome engineering. The columns have three formats: integer numbers (int64), continuous numbers (float64) and character variables (object).
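
As a quick check of these missing values, isnull().sum() counts the missing entries per column; the counts should equal 62,178 minus the non-null counts shown above:

# count missing values for the resolution-related variables
print(data[['res_time', 'lgd_time', 'recovery_res']].isnull().sum())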

Another way to obtain the number of rows and columns (the dimensions) of a dataframe is the dataframe.shape attribute:

In [8]:
print(data.shape)
(62178, 28)

A further way to generate a list of variable names is the dataframe.columns.values attribute:

In [9]:
print(data.columns.values)
['id' 'time' 'orig_time' 'first_time' 'mat_time' 'res_time' 'balance_time'
 'LTV_time' 'interest_rate_time' 'rate_time' 'hpi_time' 'gdp_time'
 'uer_time' 'REtype_CO_orig_time' 'REtype_PU_orig_time'
 'REtype_SF_orig_time' 'investor_orig_time' 'balance_orig_time'
 'FICO_orig_time' 'LTV_orig_time' 'Interest_Rate_orig_time'
 'state_orig_time' 'hpi_orig_time' 'default_time' 'payoff_time'
 'status_time' 'lgd_time' 'recovery_res']

We can compute descriptive statistics for key variables like FICO, LTV at loan origination and the GDP growth rate. In our dataframe, FICO is a credit score with values between 429 and 819. The average FICO score is 673.

The loan-to-value (LTV) ratio at loan origination is expressed in percentage terms and is between 50.1% and 119.8%. A value in excess of 100% implies that the loan amount at origination is greater than the collateralizing house value. Banks are prudent and it is common to lend at the median LTV of 80%. The average is 78.7%.

The GDP growth rate is expressed in percentage terms. During our observation period, the minimum is -4.15% (indicating an economic downturn) and the maximum is 5.13% (indicating an economic upturn). The average GDP growth is approximately 1.38%.

In [10]:
data2 = data[['FICO_orig_time', 'LTV_orig_time', 'gdp_time']]

print(data2.describe().round(decimals=2))
       FICO_orig_time  LTV_orig_time  gdp_time
count        62178.00       62178.00  62178.00
mean           673.36          78.70      1.38
std             72.10          10.24      1.95
min            429.00          50.10     -4.15
25%            628.00          75.00      1.10
50%            675.00          80.00      1.85
75%            730.00          80.00      2.69
max            819.00         119.80      5.13

Tabulating¶

We can also show observation counts. In the following, we analyze observation counts by origination time orig_time. For a single feature, we pass the feature and specify columns='count'.

In [11]:
table = pd.crosstab(data.orig_time, columns='count', margins= True)

print(table)
col_0      count    All
orig_time              
-40           51     51
-35            4      4
-33            3      3
-24            1      1
-23            1      1
...          ...    ...
57             4      4
58             3      3
59            14     14
60             9      9
All        62178  62178

[73 rows x 2 columns]

For a cross-tabulation we specify two variables. Please try print(pd.crosstab(data.orig_time, data.time, margins= True)) as an example.

Resetting Indexes¶

The index sequence can be renumbered using the dataframe.reset_index(drop=True) command. The argument drop=True implies that the old index is overwritten. The argument drop=False implies that, in addition, the old index is stored in a new column called index.

In [12]:
data2 = data2.reset_index(drop=True)

print(data2.index)
RangeIndex(start=0, stop=62178, step=1)

Calculating Mean Values by Time¶

We compute mean values of variables using mean() and mean values by groups using the groupby() method in pandas. For example, we can plot the average FICO score by time, which we compute using .groupby('time') and .mean().

We write the group index (here time) back into a regular column by resetting the index and retaining the old index with .reset_index(drop=False). This is useful if we want to process the index as a feature in pandas.

In [13]:
data2 = data

FICO = data2.groupby('time')['FICO_orig_time'].mean().reset_index(drop=False)

Plotting¶

We then plot the mean FICO score over time. The chart shows that the mean FICO score increases over time, perhaps reflecting tighter lending standards.

In [14]:
plt.plot('time', 'FICO_orig_time', data=FICO)
plt.xlabel('Time')
plt.ylabel('FICO')
plt.ylim([400, 850])
plt.show()

Generating New Variables¶

We can generate new variables using dataframe.loc[:, 'variable_name']. For example, we can generate a variable with values of zero:

In [15]:
data.loc[:, 'dummy'] = 0

Often, we would like to create a new variable with values that depend on an existing variable. For example, we may be interested in categorizing the observations into mortgage loans that are above or below a threshold (LTV of 70%):

In [16]:
data.loc[data['LTV_orig_time'] > 70, 'dummy'] = 1

print(data[['LTV_orig_time']].round(decimals=2))
       LTV_orig_time
0               81.8
1               81.8
2               81.8
3               81.8
4               81.8
...              ...
62173           79.8
62174           79.8
62175           79.8
62176           79.8
62177           79.8

[62178 rows x 1 columns]

Transforming Variables¶

For transforming variables, we can generate a new variable using dataframe.loc[:, 'variable_name'] and compute the transformed values from an existing variable. For example, we create a new variable FICO_orig_time2, which inflates all FICO scores by adding 10:

In [17]:
data.loc[:, 'FICO_orig_time2'] = data.loc[:, 'FICO_orig_time']+10

print(data[['FICO_orig_time', 'FICO_orig_time2']])
       FICO_orig_time  FICO_orig_time2
0                 587              597
1                 587              597
2                 587              597
3                 587              597
4                 587              597
...               ...              ...
62173             708              718
62174             708              718
62175             708              718
62176             708              718
62177             708              718

[62178 rows x 2 columns]

A common transformation that we apply throughout is capping and flooring (e.g., winsorizing). For example, we can apply a floor of 600 (i.e., assign the floor value to all lower values) and a cap of 700 (i.e., assign the cap value to all higher values) to FICO_orig_time.

In [18]:
data.loc[:, 'FICO_orig_time2'] = data.loc[:, 'FICO_orig_time']

data.loc[data['FICO_orig_time2'] <= 600, 'FICO_orig_time2'] = 600

data.loc[data['FICO_orig_time2'] >= 700, 'FICO_orig_time2'] = 700

print(data[['FICO_orig_time', 'FICO_orig_time2']])
       FICO_orig_time  FICO_orig_time2
0                 587              600
1                 587              600
2                 587              600
3                 587              600
4                 587              600
...               ...              ...
62173             708              700
62174             708              700
62175             708              700
62176             708              700
62177             708              700

[62178 rows x 2 columns]

In extensions, you may use data-implied values, e.g., lower percentile values for floor values and higher percentile values for cap values. We show an example in our feature engineering chapter.
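
A hedged sketch of such a data-implied version follows; the 1% and 99% percentiles and the variable name FICO_orig_time3 are illustrative choices rather than defaults used in this book:

# data-implied floor and cap values from the 1% and 99% percentiles
floor_value = data['FICO_orig_time'].quantile(0.01)
cap_value = data['FICO_orig_time'].quantile(0.99)

# apply flooring and capping in one step
data.loc[:, 'FICO_orig_time3'] = data.loc[:, 'FICO_orig_time'].clip(lower=floor_value, upper=cap_value)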

Subsetting Data¶

A data subset is smaller than the complete dataset in terms of number of observations (rows) or variables (columns). Subset dataframes are important for many applications. We may want to process fewer than all variables or fewer than all observations.

In a simple case, we may be interested in a single variable and we can access this by using the dataframe.variable_name or dataframe[['variable_name']] command. We run the latter and print the shape (i.e., the number of rows/observations and columns/variables).

The .copy() statement informs pandas to make a copy of the original data. An omission may result in warning messages and errors when we continue to process the subset.

In [19]:
data2 = data[['FICO_orig_time']].copy()

print(data2.shape)
(62178, 1)

There are a number of alternatives. Most importantly, pandas offers label (dataframe.loc) and position based indexing (dataframe.iloc).

Using [[]] returns a dataframe while [] returns only a slice of a dataframe. The arguments in square brackets [row,column] indicate the rows and columns separated by a comma. : indicates all rows or all columns.

In [20]:
data2 = data.loc[:, 'FICO_orig_time'].copy()

data2 = data.iloc[:, 18].copy()

print(data2.shape)
(62178,)

Note that FICO_orig_time is located at integer position 18 in dataframe data (positions start at zero; see the list of variables above) and that the result of this operation is a data slice (a series) without a column name. The resulting slice has a shape equal to the number of rows of data.

We can also create subsamples for individual rows (here the first row, as we use index 0). The resulting slice has a shape equal to the number of columns of data.

In [21]:
data2 = data.loc[0, :].copy()

data2 = data.iloc[0, :].copy()

print(data2.shape)
(30,)

In more complex situations, we may want to filter for observations where key variables fulfill certain conditions. We can use the dataframe.query() command. Note that if rows are deleted, the index is retained and gaps in the index sequence can be observed. We print the index and note that the observations before index 53 have been removed.

In [22]:
data2 = data.query('FICO_orig_time >= 800').copy()

print(data2.index)
Int64Index([   53,    54,    55,    56,    67,    71,  3244,  3245,  3246,
             3247,
            ...
            61844, 61873, 61874, 61875, 61876, 61877, 61878, 61879, 61880,
            61881],
           dtype='int64', length=838)

We can also do random sampling, drawing observations based on a rule. For instance, we can draw a certain number (or proportion) of observations with or without replacement; sampling without replacement implies that every observation is included in the resulting dataset at most once. Random sampling can also be done by groups, drawing a certain number (or proportion) of observations for each group; a sketch follows the next example.

To make random sampling reproducible, we set a seed value, which initializes the random number generator, via the random_state argument; 12345 is the seed value. You can choose other seed values if you like. Setting seeds allows us to repeat the random draws multiple times with the same outcome, while different seeds result in different samples.

In [23]:
data2 = data.sample(100, random_state=12345)

print(data2.shape)
(100, 30)
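
The sketch below illustrates sampling by groups as mentioned above: we draw one random observation per loan id (group_keys=False keeps the original index rather than adding a group level):

# draw one observation per loan id; with 5,000 loans this yields 5,000 rows
data2 = data.groupby('id', group_keys=False).apply(lambda g: g.sample(1, random_state=12345))

print(data2.shape)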

Lastly, we can also subset dataframes by dropping features (here LTV_orig_time) using the drop() command:

In [24]:
data2 = data.drop('LTV_orig_time', axis='columns').copy()

print(data2.shape)
(62178, 29)

Alternatively, we can also subset dataframes by dropping observations (here: first observation with an index of zero) using the dataframe.drop() command:

In [25]:
data2 = data.drop(0, axis='rows').copy()

print(data2.shape)
(62177, 30)

Combining Data¶

Pandas dataframes can be combined using different approaches:

  • .concat combines two dataframes based on the axis (combine rows or columns);
  • .append combines the rows of two dataframes;
  • .merge combines two dataframes based on matching of values of columns in datasets;
  • .join combines two dataframes based on matching of indexes.

Below are examples for these approaches. In a first step, we decompose the mortgage data into a number of subsamples and in a second step, we combine these subsamples using the various approaches.

Concatenating¶

Combining columns

We can concatenate rows (axis=0) or columns (axis=1). Concatenating columns requires matching indexes. If this is not the case, the index sequence can be renumbered using the dataframe.reset_index(drop=True) method.

To showcase, we generate three datasets for variables hpi_time, uer_time and gdp_time and then concatenate these dataframes horizontally.

In [26]:
hpi_time = data.loc[:,['time', 'hpi_time']].drop_duplicates().reset_index(drop=True)
uer_time = data.loc[:,['time', 'uer_time']].drop_duplicates().reset_index(drop=True)
gdp_time = data.loc[:,['time', 'gdp_time']].drop_duplicates().reset_index(drop=True)

macro_time = pd.concat([hpi_time, uer_time, gdp_time], axis=1)

print(macro_time.shape)
(60, 6)

Combining rows

We can also append a dataframe. This has the same result as the append() method below:

In [27]:
macro_time2 = pd.concat([hpi_time, hpi_time], axis=0)

print(macro_time2.shape)
(120, 2)

Appending¶

Appending is a specific case of concat and combines rows. Note that newer pandas versions deprecate the append() method in favour of concat():

In [28]:
macro_time2 = hpi_time.append(hpi_time)

print(macro_time2.shape)
(120, 2)

Match Merging¶

Common identifiers are needed to match the various data sources. Identifiers may include borrower and loan identification numbers, social security numbers, securities identification numbers, zip codes, user identification numbers, time periods and email addresses.

We decompose our mortgage dataset into constituent data files to showcase the match-merging of multiple data sources:

  • Loan origination data data_orig_time includes information available at loan origination and we keep only one observation per loan using drop_duplicates(subset='id', keep='first');
  • Loan performance data data_time contains the time-varying borrower specific information from the panel dataset and we keep all observations;
  • Macroeconomic data macro_time includes only time-specific information and we keep one observation per time period using the .drop_duplicates() method.

We print the dimensions of the original and three constituent data sets. The sum of columns of the three datasets is greater than the number of columns in the original data set as the identification variables id and time are included in each.

In [29]:
data_orig_time = data[['id', 'orig_time', 'first_time', 'mat_time', 'res_time', 'REtype_CO_orig_time', 'REtype_PU_orig_time', 'REtype_SF_orig_time',
 'investor_orig_time', 'balance_orig_time', 'FICO_orig_time', 'LTV_orig_time', 'Interest_Rate_orig_time', 'state_orig_time', 'hpi_orig_time']].drop_duplicates(subset='id', keep='first')
data_time = data[['id', 'time', 'balance_time', 'LTV_time', 'interest_rate_time', 'rate_time', 'default_time', 'payoff_time', 'status_time', 'lgd_time', 'recovery_res']]
macro_time = data[['time', 'hpi_time', 'gdp_time', 'uer_time']].drop_duplicates()

print('data:', data.shape)

print('data_orig_time:', data_orig_time.shape)

print('data_time:', data_time.shape)

print('macro_time:', macro_time.shape)
data: (62178, 30)
data_orig_time: (5000, 15)
data_time: (62178, 11)
macro_time: (60, 4)

We now have the three sub-datasets: data_orig_time, data_time and macro_time.

We can merge these to obtain the original dataset data by first merging data_orig_time and data_time by id to data2 and second by merging the resulting dataset data2 with macro_time by time to data3. We compare our input dataframe data with the output dataframe data3 in terms of observation and variable numbers.

The dataframes may not fully match, as we have reduced the origination and macro data (keeping the first observation per identifying variable id and time) and variation may be lost in the process of reduction. One example is zip code, which may change over time; this may indicate a data error, or be economically reasonable if the collateral property, and hence the location, changes over time. In our case, the two-column difference arises because the variables dummy and FICO_orig_time2, which we created earlier in this chapter, are not part of the constituent datasets.

In [30]:
data2 = pd.merge(data_orig_time, data_time, on='id')
data3 = pd.merge(data2, macro_time, on='time')

print('Original dataframe data:', data.shape)
      
print('Reconstituted dataframe data3:', data3.shape)
Original dataframe data: (62178, 30)
Reconstituted dataframe data3: (62178, 28)

The merge function comes with four options:

  • Inner join: keeps only the rows of $x$ and $y$ with matching key values. This is the default.
  • Full (outer) join: keeps all rows from both dataframes. Specify the argument how='outer'.
  • Left join: keeps all rows of dataframe $x$ and only the matching rows of $y$. Specify the argument how='left'.
  • Right join: keeps all rows of dataframe $y$ and only the matching rows of $x$. Specify the argument how='right'.
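
For illustration, here is a brief sketch of a left join that attaches the macroeconomic variables to the loan performance data and keeps every row of data_time (data4 is an illustrative name):

# left join: all rows of data_time are kept; hpi_time, gdp_time and uer_time are added
data4 = pd.merge(data_time, macro_time, on='time', how='left')

print(data4.shape)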

Joining¶

The joining command join combines two dataframes based on indexes. It is similar to merge but relies on indexes rather than specific columns.

In [31]:
hpi_time = data.loc[:,['time', 'hpi_time']].drop_duplicates().reset_index(drop=True)
uer_time = data.loc[:,['time', 'uer_time']].drop_duplicates().reset_index(drop=True)
macro_time3 = hpi_time.set_index('time').join(uer_time.set_index('time'), on='time')

print(macro_time3.shape)
(60, 2)

Regression Models¶

Regression models are based on substantial theory. Here, we provide only a brief introduction to show basic characteristics of the models that we estimate and apply in the next chapters. We fit a linear regression using statsmodels.formula.api from the statsmodels library, which we import under the acronym smf. We save the fitted model under the name data_ols and specify the model equation with LTV_time as the dependent variable (left-hand side, LHS) and LTV_orig_time and gdp_time as features (right-hand side, RHS). LHS and RHS variables are connected by the ~ sign. Further, we specify the estimation sample as our mortgage dataframe and apply the fit() method:

In [32]:
data_ols = smf.ols(formula='LTV_time ~ LTV_orig_time + gdp_time', data=data).fit()

There are many packages in Python that provide regression models and they all provide a number of outputs. To get a first impression of what is available, we use the dir(data_ols) command. We do not execute this to conserve space.

We select the summary() method from these objects:

In [33]:
print(data_ols.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               LTV_time   R-squared:                       0.220
Model:                            OLS   Adj. R-squared:                  0.220
Method:                 Least Squares   F-statistic:                     8772.
Date:                Tue, 15 Jun 2021   Prob (F-statistic):               0.00
Time:                        16:18:43   Log-Likelihood:            -2.8781e+05
No. Observations:               62153   AIC:                         5.756e+05
Df Residuals:                   62150   BIC:                         5.756e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        13.8450      0.773     17.901      0.000      12.329      15.361
LTV_orig_time     0.9632      0.010     99.086      0.000       0.944       0.982
gdp_time         -4.5720      0.051    -89.721      0.000      -4.672      -4.472
==============================================================================
Omnibus:                    88921.993   Durbin-Watson:                   0.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):        123795185.295
Skew:                           7.985   Prob(JB):                         0.00
Kurtosis:                     221.054   Cond. No.                         617.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The summary() output shows the R-squared as a performance measure of the regression model. The ratio is bounded between zero (poor model quality) and one (high model quality). We will discuss performance measures in the validation chapter.

Further, the table shows the parameter estimates. We include an intercept as well as the key features LTV ratio at origination time and GDP growth rate at observation time. The sign indicates the direction of influence. LTV has a positive sign and hence a greater LTV at origination explains a greater LTV ratio at observation time. The GDP growth rate has a negative sign and hence a greater GDP growth rate explains a lower LTV at observation time.

P-values are shown in the column with the header P>|t| and have to be interpreted with care. Throughout this text, we interpret p-values which are less than a threshold value (e.g., 1%, 5% or 10%) as a sign for statistical significance of the respective feature. For example, if the p-values are less than 1% then the variable has a significant influence at the 1% level.
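
If you prefer to work with the coefficients and p-values directly rather than reading the printed summary, the fitted statsmodels results object exposes them as attributes; a brief sketch:

# parameter estimates and p-values as pandas series
print(data_ols.params.round(4))
print(data_ols.pvalues.round(4))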

numpy vs pandas¶

Previous sections focused on pandas. pandas dataframes are enhanced arrays as they can be accessed by integer positions (which we use for rows) as well as labels (which we use for columns/features). pandas also offers a large range of econometric data operations, including sampling, aggregation, leading/lagging that are particularly helpful for credit risk data.

We use numpy in more technical sections of the book such as the part on machine learning for PD and LGD forecasting.

numpy is a fundamental package for scientific computing with Python. Among other things, it offers the creation and processing of multi-dimensional arrays, linear algebra and random number capabilities. An array is a collection of values. numpy arrays are accessed by their integer position with zero as a base.

We create a one-dimensional array (a vector) with two entries from a list. A list is created with square brackets and the entries are separated by commas. We pass the list to np.array() and see that the created array has two entries enclosed by a single pair of square brackets, which indicates a one-dimensional array. Using array1.ndim we return the dimension, and array1.shape shows that the array has two entries. numpy arrays are more efficient than Python lists.

In [34]:
array1 = np.array([1,2])
print('Array:',array1)
print('Dimension:',array1.ndim)
print('Shape:',array1.shape)
Array: [1 2]
Dimension: 1
Shape: (2,)

We now create another array from a nested list (two inner lists with two entries each), as shown below. We see that the array now has two opening and two closing brackets. The double square brackets indicate a two-dimensional array (a matrix): the inner brackets contain the rows as vectors and the outer brackets stack these row vectors into a matrix. The resulting matrix has two rows and two columns. This can be extended to more than two dimensions by nesting further square brackets.

In [35]:
array2 = np.array([[1,3],[2,4]])
print('Array:',array2)
print('Dimension:',array2.ndim)
print('Shape:',array2.shape)
Array: [[1 3]
 [2 4]]
Dimension: 2
Shape: (2, 2)

We want to access certain elements in the array. We can index and slice them in the same ways you can slice Python lists. Remember that indexing starts with zero.

In [36]:
print(array2[0,])
[1 3]
In [37]:
print(array2[:,1])
[3 4]
In [38]:
print(array2[1,0])
2

For more information about numpy we refer to the official manual and tutorials on www.numpy.org.

Converting pandas dataframes to numpy arrays¶

We can convert a pandas dataframe to numpy:

In [39]:
data2 = data[['id', 'time', 'gdp_time', 'FICO_orig_time', 'LTV_time']].round(decimals=2)
clabels = data2.columns.values

data_numpy = data2.values

print(data_numpy)
[[4.0000e+00 2.5000e+01 2.9000e+00 5.8700e+02 3.3910e+01]
 [4.0000e+00 2.6000e+01 2.1500e+00 5.8700e+02 3.4010e+01]
 [4.0000e+00 2.7000e+01 2.3600e+00 5.8700e+02 3.4340e+01]
 ...
 [4.9972e+04 5.4000e+01 1.5100e+00 7.0800e+02 9.1870e+01]
 [4.9972e+04 5.5000e+01 2.4200e+00 7.0800e+02 9.1560e+01]
 [4.9972e+04 5.6000e+01 1.7200e+00 7.0800e+02 9.0870e+01]]

We store the column labels in the array clabels and the values in the array data_numpy. The two square brackets again indicate a two-dimensional array (a matrix): the inner brackets contain the rows as vectors and the outer brackets stack these row vectors into a matrix.

Converting numpy arrays to pandas dataframes¶

We can convert a numpy array back to a pandas dataframe. We label the columns using the columns argument and can also label rows using the index argument.

In [40]:
data=pd.DataFrame(data=data_numpy, columns=clabels)

print(data)
            id  time  gdp_time  FICO_orig_time  LTV_time
0          4.0  25.0      2.90           587.0     33.91
1          4.0  26.0      2.15           587.0     34.01
2          4.0  27.0      2.36           587.0     34.34
3          4.0  28.0      1.23           587.0     34.67
4          4.0  29.0      1.69           587.0     34.95
...        ...   ...       ...             ...       ...
62173  49972.0  52.0      1.08           708.0    103.31
62174  49972.0  53.0      0.89           708.0     95.74
62175  49972.0  54.0      1.51           708.0     91.87
62176  49972.0  55.0      2.42           708.0     91.56
62177  49972.0  56.0      1.72           708.0     90.87

[62178 rows x 5 columns]

Module dcr¶

We include a number of packages that have been developed by the Python community as well as functions in the module dcr that is available for download from www.deepcreditrisk.com. Functions are detailed in the next section.

We limit the number of rows that should be shown to ten to conserve space in this book.

We also import dataset dcr.csv into a pandas dataframe using the pd.read_csv command, which generates a dataframe (i.e., a panel dataset with observations in rows and variables in columns).

Finally, we suppress warning messages. You may exclude this line of code if you want to see warning messages throughout; this is equivalent to commenting the line out by placing a hashtag in front of it. We have ensured that only a few warning messages remain.

We execute this by calling the dcr module via from dcr import *. Alternatively, you may run %run dcr.

In [41]:
from dcr import *

Functions¶

Functions are helpful to execute the code multiple times applying different arguments. To be able to scale various techniques developed in this book, we provide the following functions via the module dcr that is available for download from www.deepcreditrisk.com. The functions are:

  • versions;
  • dataprep;
  • woe;
  • validation;
  • resolutionbias.

versions¶

The function versions() produces a table with the package versions that we use for this text:

In [42]:
from dcr import *
versions()
                     Package Version  Acronym
0                     Python   3.8.3      NaN
1                    IPython     NaN  IPython
2                       math     NaN     math
3   matplotlib.pyplot, pylab   3.3.2      plt
4                      numpy  1.18.5       np
5                     pandas   1.0.5       pd
6                     pickle     4.0   pickle
7                     random     NaN   random
8                      scipy   1.5.0    scipy
9                    sklearn  0.23.1  sklearn
10               statsmodels  0.11.1       sm

As noted earlier, package versions change over time and may trigger error or warning messages; warning messages are not error messages and the code may still run perfectly well. To reproduce our results exactly, use the package versions listed above. If an error or warning message arises with other versions, an internet search will usually point to a solution.

In compiling this text, we have tried to minimize the number of warning messages and very few remain. For example, we use pandas heavily as we work with panel data and a common warning message is SettingWithCopyWarning. There are two operations:

  • "set" operation: we assign values using the = sign;
  • "get" operation: we perform operations such as indexing.

It is very common to connect (i.e., chain) operations. An example is hpi_time.set_index('time').join(uer_time.set_index('time'), on='time') with the chain elements set_index() and join(). pandas may not produce the expected results when assignments are chained (chained assignments) or when operations are spread across different lines (hidden chains). We avoid such situations by combining row and column selections in a single .loc call and by explicitly creating a copy with .copy() when creating data subsets, so that subsets are not misinterpreted as views of the original dataframes.
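
To make the distinction concrete, the sketch below contrasts a chained assignment, which may trigger SettingWithCopyWarning and silently fail to update the dataframe, with the single .loc call that we use throughout (the chained variant is commented out on purpose):

# chained assignment: a "get" ([...]) followed by a "set" (=) on a possible copy; avoid this
# data[data['LTV_orig_time'] > 70]['dummy'] = 1

# combined row and column selection in a single .loc call: the recommended pattern
data.loc[data['LTV_orig_time'] > 70, 'dummy'] = 1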

dataprep¶

The function dataprep generates economic features, principal components and clusters. Train and test datasets are provided for machine learning techniques for PD and LGD models.

The input arguments are:

  • data_in: input dataset. We generally use data = pd.read_csv('dcr.csv');
  • depvar: we are creating numpy arrays for the machine learning chapters and depvar is the target variable in these models. Default value is depvar='default_time', i.e., the default indicator. The alternative option is depvar='lgd_time';
  • splitvar: we are splitting the dataset into a training and testing sample and splitvar is the splitting feature. Default value is splitvar='time', i.e., the observation time;
  • threshold: we are splitting the dataset into a training and testing sample and threshold is the splitting threshold. Default value is threshold=26, i.e., the start of the financial crisis.

The output arguments are:

  • df: input pandas dataframe plus all additionally created variables;
  • data_train: training pandas dataframe that includes the input dataset plus all additionally created variables. For PD models the training data relates to pre-crisis periods. For LGD data, the training data includes pre-crisis and crisis periods to ensure a sufficient observation number;
  • data_test: test pandas dataframe that includes the input dataset plus all additionally created variables. For PD models the test data relates to crisis periods. For LGD data, the test data relates to post-crisis periods;
  • X_train_scaled: training numpy array that includes all features;
  • X_test_scaled: test numpy array that includes all features;
  • y_train: training numpy vector of an outcome. Default value is depvar='default_time', i.e., the default indicator;
  • y_test: test numpy vector of an outcome. Default value is depvar='default_time', i.e., the default indicator.

You may assign any name to these output objects. Additional assumptions are made. For example, the principal components analysis of the state-level default rates stores the first five components and two clusters are formed using K-means clustering. A correction for resolution bias is applied if the dependent variable is lgd_time.
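
A hedged usage sketch of dataprep follows; it assumes that the seven outputs are returned in the order listed above, which you should verify against the feature engineering chapter:

# read the raw data and run the data preparation for a PD model
# (output order assumed to match the list above)
data_in = pd.read_csv('dcr.csv')

df, data_train, data_test, X_train_scaled, X_test_scaled, y_train, y_test = dataprep(
    data_in, depvar='default_time', splitvar='time', threshold=26)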

We discuss the function dataprep in more detail in our feature engineering chapter.

woe¶

The function woe computes weight of evidence and information value for features. It is discussed in the feature engineering and feature selection chapters. The function is called using outputWOE, outputIV = woe(data, target, variable, bins, binning).

The input arguments are:

  • data_in: input dataframe;
  • target: outcome variable. In all our applications, we choose target='default_time', i.e., the default indicator;
  • variable: feature for which weights of evidence are computed;
  • bins: number of bins;
  • binning: whether binning or actual category is used (binning = True or binning = False).

The output arguments are:

  • outputWOE: weight of evidence per bin;
  • outputIV: information value.

You may assign any name to these output objects.

validation¶

The function validation computes a number of validation measures and visual validation plots. The function is called using validation(fit, outcome, time, continuous=False).

The input arguments are:

  • fit: fitted outcome variable;
  • outcome: outcome variable. For PD models this is default_time, i.e., the default indicator, and for LGD models lgd_time;
  • time: the stratifying feature; we generally use the variable time;
  • continuous: whether the outcome variable is continuous (continuous=True) or binary (continuous=False). The default value is continuous=False.

There are no output arguments as a panel with one table and three charts is printed instead. The panel summarizes a number of validation metrics and charts. We discuss more details in our validation chapter.

resolutionbias¶

The function resolutionbias corrects observed LGD values for resolution bias. It is discussed in the outcome engineering chapter. The function is called with df3 = resolutionbias(data_in, lgd, res, t).

The input arguments are:

  • data_in: input dataset;
  • lgd: LGD variable. We use variable lgd_time in most examples;
  • res: the resolution time. We use variable res_time in most examples;
  • t: observation time. We use variable time in most examples.

The output argument is df3: pandas dataframe that includes the input dataset with the LGD values that have been corrected for resolution bias.

You may assign any name to these output objects. The function assumes an end of observations in period 60.

Sandbox Problems¶

  • Create a data subset labeled data2, which includes the variables id, time, LTV_orig_time and LTV_time.
  • Provide descriptive statistics for variables LTV_orig_time and LTV_time.
  • Filter dataset data for loans with a FICO score above 800.
  • Provide a frequency table that shows the number of defaults in the sample.
  • Provide a cross-frequency table by variables default_time and time.

References¶

Davidson-Pilon, C., "lifelines: survival analysis in Python", Journal of Open Source Software, vol. 4, no. 40, p. 1317, 2019.

Harris, C.R. et al., "Array programming with NumPy", Nature, vol. 585, pp. 357-362, 2020.

Hunter, J.D., "Matplotlib: A 2D graphics environment", Computing in Science & Engineering, vol. 9, no. 3, pp. 90-95, 2007.

Joblib Development Team, "Joblib: running Python functions as pipeline jobs", 2020.

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. and Liu, T.-Y., "LightGBM: A Highly Efficient Gradient Boosting Decision Tree", Advances in Neural Information Processing Systems 30, 2017.

The pandas development team, "pandas-dev/pandas: pandas", Zenodo, 2020.

Pedregosa, F. et al., "Scikit-learn: Machine Learning in Python", Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.

Pérez, F. and Granger, B.E., "IPython: A System for Interactive Scientific Computing", Computing in Science & Engineering, vol. 9, no. 3, pp. 21-29, 2007.

Salvatier, J., Wiecki, T.V. and Fonnesbeck, C., "Probabilistic programming in Python using PyMC3", PeerJ Computer Science, vol. 2, p. e55, April 2016.

Seabold, S. and Perktold, J., "statsmodels: Econometric and statistical modeling with Python", Proceedings of the 9th Python in Science Conference, 2010.

Van Rossum, G. and Drake, F.L., "Python Reference Manual", CWI, Amsterdam, 1995.

Van Rossum, G. and Drake, F.L., "Python 3 Reference Manual", CreateSpace, Scotts Valley, CA, 2009.

Virtanen, P. et al., "SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python", Nature Methods, vol. 17, pp. 261-272, 2020.
