Synopsis¶
There is no need to worry if you are new to Python or even data science. Building great models is about model design, and the actual implementation follows quickly. Throughout this book we rely on a small number of building blocks. The first examples include:
- Describing data;
- Plotting data;
- Generating new variables;
- Transforming variables;
- Subsetting data;
- Combining data;
- Regression models.
Further, we provide a summary of the functions written specifically for this book to help you accelerate your learning. Where it often takes years for financial institutions to build complete credit risk models, with our book you'll be able to do so within a matter of hours.
Note that we have published the Deep Credit Risk series in different programming languages (e.g., R) with comparable results. Differences in the syntax may result in slightly different implementation structures, and results may differ occasionally. For example, random sampling may result in different outcomes despite identical seed values, as the random number generators may be implemented differently.
Installation¶
Anaconda and IDEs¶
Python is a general-purpose programming language created by Guido van Rossum in 1991 and further developed by the Python Software Foundation, see \cite{VanRossum1995} and \cite{VanRossum2009}. It is popular in the data science community, particularly for machine learning applications in the banking and fin-tech industries. If you are new to Python, you may be overwhelmed by the many options available. We present a number of useful components and different ways of working with the software.
Note, Python itself is simply the programming language, which is installed on the computer. We recommend installing Anaconda, as this includes Integrated Development Environments (IDEs) and common packages like numpy or pandas. Popular IDEs include Jupyter Notebook and Spyder, and these become the faces of Python: when you open the software, you will have different windows for programming, logging processes and observing outputs. IDEs have many useful tools for code development and finding bugs.
To install Python via Anaconda on your machine:
- Download Anaconda for Python 3.x from: www.anaconda.com;
- Select your operating system;
- Choose either the 32bit or 64bit option;
- Go through the installation prompts. Click on "Add Anaconda to my PATH environment variable. [...]";
- After the installation, search for Anaconda Navigator in the Start menu. Here you will find several applications. We recommend using Jupyter Notebook or Spyder;
- Open the Anaconda Prompt from the Start menu. Navigate to the Anaconda directory. Type conda update conda and run it; this updates Anaconda;
- Open your preferred IDE (i.e., Jupyter Notebook or Spyder) and you are ready to start.
Please visit the website www.deepcreditrisk.com for further guidance including installation videos.
All examples were developed in Jupyter Notebook. Jupyter creates notebook documents in which Python can be used. This is helpful if you collaborate with others (perhaps even non-Python users) or would like to publish your findings efficiently. The notebooks can be shared as files, read without execution of code, and published online or printed. Jupyter is a very good place to get started with Python. You may use Spyder or other IDEs when you become more proficient.
Things may not go smoothly. Your computer may be set up in such a way that you do not have administrator rights, and you may need to seek IT support to limit exposure to such issues.
Packages¶
Anaconda installs major packages, which are also called libraries (for a list see www.anaconda.org). We use the following basic packages for our book, sorted in terms of importance:
- pandas: processing of data structures: series (1D) and dataframes (2D), see \cite{Pandas2020};
- numpy: processing of n-dimensional array objects, see \cite{Harris2020};
- scipy: submodules for statistics, see \cite{Virtanen2020};
- matplotlib: plotting library, see \cite{Hunter2007};
- math: mathematical functions, see \cite{VanRossum2009};
- random: random number generation, see \cite{VanRossum2009};
- tabulate: printing tabular data;
- joblib: running functions as pipeline jobs, see \cite{Joblib2020};
- pickle: converting an object to a file for saving, see \cite{VanRossum2009}.
We mostly require pandas (the acronym is derived from "Python Data Analysis Library"), which provides a dataframe object for data processing with integrated indexing based on labels (.loc) as well as positions (.iloc).
pandas also allows for:
- Data sub-setting;
- Group processing;
- Dataset splitting, merging and joining;
- Time-series functionality.
We require the following libraries for model building:
- scikit-learn: machine learning techniques, see \cite{Pedregosa2011};
- statsmodels: fitting statistical models; interacts with pandas dataframes, see \cite{Seabold2010};
- IPython: interactive computing, see \cite{Perez2007};
- pydot and graphviz: plotting of decision trees;
- pymc3: probabilistic programming, see \cite{Salvatier2016};
- lifelines: survival analysis, see \cite{Davidson2019};
- lightgbm: gradient boosting, see \cite{Ke2017}.
There are two ways to install these libraries: via conda and pip. pip is the Python Packaging Authority's recommended tool for installing packages from the Python Package Index (PyPI). conda is a cross-platform package and environment manager that installs and manages conda packages from the Anaconda repository as well as from the Anaconda Cloud. We recommend conda in the first instance, as we have observed fewer complications when using this tool. To find a package, google the name plus "conda install" (e.g., "pandas conda install") and find a link to the command, usually from Anaconda (e.g., conda install -c anaconda pandas). Copy the command, open the Anaconda Prompt, paste, and run the command.
Some additional packages need to be installed for the following chapters:
- pymc3: Chapter 3;
- pydot and graphviz: Chapters 13 and 16;
- lightgbm: Chapter 15;
- lifelines: Chapter 17.
Python has many more packages available. These packages are powerful, open-source tools and save time as someone else has done the job for you. Package versions may change over time and may cause error messages or warning messages. Note, warning messages are not error messages and the code may run perfectly well. Later versions include a number of error (bug) fixes, and using different versions may result in slightly different outputs. Please ensure that you use the same versions to obtain the same results.
Also, this text is mainly designed so that you learn the principles of Deep Credit Risk rather than the detailed syntax; the latter may change from version to version, but the principles stay the same. Do an internet search to find a solution if an error or warning message arises with older or future versions.
However, open source programming languages like Python have some issues. First, Python packages are often poorly documented. Second, there is limited quality assurance. Generally, few problems arise when working with common packages (e.g., OLS or logistic regressions) as these have been around long enough to be vetted many times by other users. However, packages that implement cutting-edge models may have bugs, some of which may be documented in blogs. Third, a number of packages that are common in other disciplines have not been coded in Python and you may be unable to find a package that meets your needs. Fourth, existing packages are often extended by wrapping: calling a new package, running the existing packages, adding code and closing the new package. This can become a problem if a constituent package has been updated on your machine: the wrap, and hence your code which relies on it, may no longer work. There are efficient processes at workplaces to ensure that the right versions are used; this requires some additional but valuable work.
To limit our exposure to such issues, we have limited our book to common packages and self-coded some applications. At the beginning of every chapter, we import the dcr module that includes a link to all packages and data as well as some functions. Should you close and re-open your IDE to continue, you would have to execute these introductory codes as well as any prior code.
Coding Guidelines¶
For the purposes of this book, we follow some basic style guidelines:
- Limit the number of dataset copies;
- Limit the use of non-common packages;
- Name functions and datasets with lower case letters;
- Use label-based indexing (.loc);
- Separate individual steps with an empty line;
- Insert comments using hashtags;
- Hard-code random draws using seeds;
- Consider setting up a virtual environment to hard-code package versions and ensure that the same results are obtained in later executions.
We work with the following main dataframes:
- data: the complete panel data and subsets thereof. Major samples are the subsets for outcomes (y) and independent features (x). This includes subsets for training and testing and scaled features: X_train_scaled, X_test_scaled, y_train, y_test;
- data_default: data that conditions on default observations;
- data_cross: data that contains only cross-section information, i.e., one observation per loan; time-varying information is reduced to the origination time;
- data_lifetime: data that starts at the end of our observation period (i.e., period 61) and includes repetitive observations until loan maturity.
A more comprehensive documentation of coding standards should be considered in professional environments with team-based coding and staff turnover.
First Look¶
We now import standard packages and functions by calling the dcr module using from dcr import *, and we ignore warning messages.
import warnings; warnings.simplefilter('ignore')
from dcr import *
dcr imports a pandas dataframe called data. We introduce the magic line %matplotlib inline to improve the display of outputs and specify the resolution of the figures.
import warnings; warnings.simplefilter('ignore')
%matplotlib inline
plt.rcParams['figure.dpi'] = 300
plt.rcParams['figure.figsize'] = (16, 9)
plt.rcParams.update({'font.size': 16})
Creating Objects¶
Objects are dataframes/arrays or methods. We create a new object by naming it data2 and assigning the dataframe data to it using the = operator.
data2 = data
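Note that data2 = data binds a second name to the same underlying dataframe rather than creating an independent copy, so changes made through one name are visible through the other. A minimal sketch of creating a genuine copy (the .copy() method is discussed again in the subsetting section):
data2 = data.copy()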
Subsampling Features¶
We subsample the dataframe by keeping the loan id, time stamp, GDP growth, FICO score and LTV ratio at observation time using double square brackets, data[[...]]. Note that Python is case sensitive and the id variable is 'id' rather than 'ID'. We start with pandas commands and discuss numpy as an alternative last.
data2 = data[['id', 'time', 'gdp_time', 'FICO_orig_time', 'LTV_time']]
Printing¶
We can have a first look at the new object data2 using the print() command. You can use multiple print() commands in one go.
print(data2)
           id  time  gdp_time  FICO_orig_time    LTV_time
0           4    25  2.899137             587   33.911009
1           4    26  2.151365             587   34.007232
2           4    27  2.361722             587   34.335349
3           4    28  1.229172             587   34.672545
4           4    29  1.692969             587   34.951639
...       ...   ...       ...             ...         ...
62173   49972    52  1.081049             708  103.306966
62174   49972    53  0.892996             708   95.736862
62175   49972    54  1.507359             708   91.867079
62176   49972    55  2.422275             708   91.560581
62177   49972    56  1.717053             708   90.874242

[62178 rows x 5 columns]
Alternatively, you may just print the data using the data name (run: data2). Note that this way, only the last object is printed, i.e., you can only print one object at a time.
We see the dataframe (panel) with identifiers id and time. The dataframe has 62,178 rows and 5 columns. Further, we see the time-varying systematic feature gdp_time, the idiosyncratic variable FICO_orig_time and the idiosyncratic time-varying feature LTV_time.
Chaining¶
We can chain methods to dataframes using the . operator. In the following example we round all numbers in the resulting object/dataframe to two decimals. We use the round(decimals=) method to limit decimals.
data2 = data[['id', 'time', 'gdp_time', 'FICO_orig_time', 'LTV_time']].round(decimals=2)
print(data2)
          id  time  gdp_time  FICO_orig_time  LTV_time
0          4    25      2.90             587     33.91
1          4    26      2.15             587     34.01
2          4    27      2.36             587     34.34
3          4    28      1.23             587     34.67
4          4    29      1.69             587     34.95
...      ...   ...       ...             ...       ...
62173  49972    52      1.08             708    103.31
62174  49972    53      0.89             708     95.74
62175  49972    54      1.51             708     91.87
62176  49972    55      2.42             708     91.56
62177  49972    56      1.72             708     90.87

[62178 rows x 5 columns]
Describing¶
We may use the data.info() command to obtain an overview of a dataframe in terms of the total number of observations, variable names and formats as well as its size.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62178 entries, 0 to 62177
Data columns (total 28 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   id                       62178 non-null  int64
 1   time                     62178 non-null  int64
 2   orig_time                62178 non-null  int64
 3   first_time               62178 non-null  int64
 4   mat_time                 62178 non-null  int64
 5   res_time                 1160 non-null   float64
 6   balance_time             62178 non-null  float64
 7   LTV_time                 62153 non-null  float64
 8   interest_rate_time       62178 non-null  float64
 9   rate_time                62178 non-null  float64
 10  hpi_time                 62178 non-null  float64
 11  gdp_time                 62178 non-null  float64
 12  uer_time                 62178 non-null  float64
 13  REtype_CO_orig_time      62178 non-null  int64
 14  REtype_PU_orig_time      62178 non-null  int64
 15  REtype_SF_orig_time      62178 non-null  int64
 16  investor_orig_time       62178 non-null  int64
 17  balance_orig_time        62178 non-null  float64
 18  FICO_orig_time           62178 non-null  int64
 19  LTV_orig_time            62178 non-null  float64
 20  Interest_Rate_orig_time  62178 non-null  float64
 21  state_orig_time          61828 non-null  object
 22  hpi_orig_time            62178 non-null  float64
 23  default_time             62178 non-null  int64
 24  payoff_time              62178 non-null  int64
 25  status_time              62178 non-null  int64
 26  lgd_time                 1525 non-null   float64
 27  recovery_res             1525 non-null   float64
dtypes: float64(14), int64(13), object(1)
memory usage: 13.3+ MB
The variables are generally observed for all observations. Exceptions are lgd_time, recovery_res and res_time as these are only observed for default_time=1 and after the resolution process is complete. We cover more details in our chapter on outcome engineering. The columns have three formats: integer numbers (int64), continuous numbers (float64) and character variables (object).
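To list the format of every column directly, you may also print the dtypes attribute (output omitted here):
print(data.dtypes)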
Another way to obtain the number of rows and columns (dimension) of a dataframe is the dataframe.shape attribute:
print(data.shape)
(62178, 28)
A further way to generate a list of variable names is the dataframe.columns.values attribute:
print(data.columns.values)
['id' 'time' 'orig_time' 'first_time' 'mat_time' 'res_time' 'balance_time' 'LTV_time' 'interest_rate_time' 'rate_time' 'hpi_time' 'gdp_time' 'uer_time' 'REtype_CO_orig_time' 'REtype_PU_orig_time' 'REtype_SF_orig_time' 'investor_orig_time' 'balance_orig_time' 'FICO_orig_time' 'LTV_orig_time' 'Interest_Rate_orig_time' 'state_orig_time' 'hpi_orig_time' 'default_time' 'payoff_time' 'status_time' 'lgd_time' 'recovery_res']
We can compute descriptive statistics for key variables like FICO, LTV at loan origination and the GDP growth rate. In our dataframe, FICO is a credit score with values between 429 and 819. The average FICO score is 673.
The loan-to-value (LTV) ratio at loan origination is expressed in percentage terms and is between 50.1% and 119.8%. A value in excess of 100% implies that the loan amount at origination is greater than the collateralizing house value. Banks are prudent and it is common to lend at the median LTV of 80%. The average is 78.7%.
The GDP growth rate is expressed in percentage terms. During our observation period, the minimum is -4.15% (indicating an economic downturn) and the maximum is 5.13% (indicating an economic upturn). The average GDP growth is approximately 1.38%.
data2 = data[['FICO_orig_time', 'LTV_orig_time', 'gdp_time']]
print(data2.describe().round(decimals=2))
       FICO_orig_time  LTV_orig_time  gdp_time
count        62178.00       62178.00  62178.00
mean           673.36          78.70      1.38
std             72.10          10.24      1.95
min            429.00          50.10     -4.15
25%            628.00          75.00      1.10
50%            675.00          80.00      1.85
75%            730.00          80.00      2.69
max            819.00         119.80      5.13
Tabulating¶
We can also show observation counts. In the following, we analyze observation counts by origination time orig_time. For a single feature we specify the feature and columns='count'.
table = pd.crosstab(data.orig_time, columns='count', margins=True)
print(table)
col_0      count    All
orig_time
-40           51     51
-35            4      4
-33            3      3
-24            1      1
-23            1      1
...          ...    ...
 57            4      4
 58            3      3
 59           14     14
 60            9      9
All        62178  62178

[73 rows x 2 columns]
For a cross-tabulation we specify two variables. Please try print(pd.crosstab(data.orig_time, data.time, margins=True)) as an example.
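As this cross-tabulation by origination and observation time is large, you may prefer to inspect only its dimensions first; a minimal sketch (table2 is an illustrative name, output omitted):
table2 = pd.crosstab(data.orig_time, data.time, margins=True)
print(table2.shape)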
Resetting Indexes¶
The index sequence can be renumbered using the dataframe.reset_index(drop=True) command. The argument drop=True implies that the old index is overwritten; drop=False implies that, in addition, the old index is stored in a new variable called index.
data2 = data2.reset_index(drop=True)
print(data2.index)
RangeIndex(start=0, stop=62178, step=1)
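A minimal sketch of the drop=False variant, which keeps the old index in a new column called index (data3 is an illustrative name; output omitted):
data3 = data2.reset_index(drop=False)
print(data3.columns.values)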
Calculating Mean Values by Time¶
We compute mean values of variables using mean() and mean values by groups using the groupby() method in pandas. For example, we can plot the FICO score by time.
We compute the average FICO score by time using .mean() and .groupby('time').
We write the index (here time) in a new column by resetting the index and retaining the old index. We use the command .reset_index(drop=False). This is useful if we want to process the index as a feature in pandas.
data2 = data
FICO = data2.groupby('time')['FICO_orig_time'].mean().reset_index(drop=False)
Plotting¶
We then plot the mean FICO score over time. The chart shows that the mean FICO score increases over time perhaps reflecting tighter lending standards.
plt.plot('time', 'FICO_orig_time', data=FICO)
plt.xlabel('Time')
plt.ylabel('FICO')
plt.ylim([400, 850])
plt.show()
Generating New Variables¶
We can generate new variables using the dataframe.loc['variable_name']
command. For example, we can generate a variable with values of zero:
data.loc[:, 'dummy'] = 0
Often, we would like to create a new variable with values that depend on an existing variable. For example, we may be interested in categorizing the observations into mortgage loans that are above or below a threshold (LTV of 70%):
data.loc[data['LTV_orig_time'] > 70, 'dummy'] = 1
print(data[['LTV_orig_time']].round(decimals=2))
       LTV_orig_time
0               81.8
1               81.8
2               81.8
3               81.8
4               81.8
...              ...
62173           79.8
62174           79.8
62175           79.8
62176           79.8
62177           79.8

[62178 rows x 1 columns]
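To check the new indicator against the underlying LTV values, you may also print both columns side by side (output omitted here):
print(data[['LTV_orig_time', 'dummy']].round(decimals=2))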
Transforming Variables¶
For transforming variables, we can generate a new variable using the dataframe.loc[:, 'variable_name'] command and compute the transformed value based on an existing variable. For example, we create a new variable FICO_orig_time2, which inflates all FICO scores by adding 10:
data.loc[:, 'FICO_orig_time2'] = data.loc[:, 'FICO_orig_time']+10
print(data[['FICO_orig_time', 'FICO_orig_time2']])
       FICO_orig_time  FICO_orig_time2
0                 587              597
1                 587              597
2                 587              597
3                 587              597
4                 587              597
...               ...              ...
62173             708              718
62174             708              718
62175             708              718
62176             708              718
62177             708              718

[62178 rows x 2 columns]
A common transformation that we apply throughout is capping and flooring (e.g., winsorizing). For example, we can apply a floor of 600 (i.e., assign the floor value to all lower values) and a cap of 700 (i.e., assign the cap value to all higher values) to FICO_orig_time.
data.loc[:, 'FICO_orig_time2'] = data.loc[:, 'FICO_orig_time']
data.loc[data['FICO_orig_time2'] <= 600, 'FICO_orig_time2'] = 600
data.loc[data['FICO_orig_time2'] >= 700, 'FICO_orig_time2'] = 700
print(data[['FICO_orig_time', 'FICO_orig_time2']])
       FICO_orig_time  FICO_orig_time2
0                 587              600
1                 587              600
2                 587              600
3                 587              600
4                 587              600
...               ...              ...
62173             708              700
62174             708              700
62175             708              700
62176             708              700
62177             708              700

[62178 rows x 2 columns]
In extensions, you may use data-implied values, e.g., lower percentile values for floor values and higher percentile values for cap values. We show an example in our feature engineering chapter.
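A minimal sketch of such a data-implied version, assuming the 1st and 99th percentiles as floor and cap (the percentile levels and the name FICO_orig_time3 are illustrative choices):
floor = data['FICO_orig_time'].quantile(0.01)
cap = data['FICO_orig_time'].quantile(0.99)
data.loc[:, 'FICO_orig_time3'] = data['FICO_orig_time'].clip(lower=floor, upper=cap)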
Subsetting Data¶
A data subset is smaller than the complete dataset in terms of number of observations (rows) or variables (columns). Subset dataframes are important for many applications. We may want to process fewer than all variables or fewer than all observations.
In a simple case, we may be interested in a single variable and we can access this by using the dataframe.variable_name or dataframe[['variable_name']] command. We run the latter and print the shape (i.e., the number of rows/observations and columns/variables).
The .copy() statement informs pandas to make a copy of the original data. An omission may result in warning messages and errors when we continue to process the subset.
data2 = data[['FICO_orig_time']].copy()
print(data2.shape)
(62178, 1)
There are a number of alternatives. Most importantly, pandas offers label-based (dataframe.loc) and position-based indexing (dataframe.iloc).
Using [[]] returns a dataframe while [] returns only a slice of a dataframe. The arguments in square brackets [row, column] indicate the rows and columns separated by a comma. : indicates all rows or all columns.
data2 = data.loc[:, 'FICO_orig_time'].copy()
data2 = data.iloc[:, 18].copy()
print(data2.shape)
(62178,)
Note that FICO_orig_time is at position 18 in dataframe data (counting from zero; see the list of variables above) and that the result of this operation is a data slice that has no column names. The resulting slice has a shape that is equal to the number of rows of data.
We can also create subsamples for individual rows (here the first row as we use index 0). The resulting slice has a shape that is equal to the number of columns of data.
data2 = data.loc[0, :].copy()
data2 = data.iloc[0, :].copy()
print(data2.shape)
(30,)
In more complex situations, we may want to filter for observations where key variables fulfill certain conditions. We can use the dataframe.query() command. Note that if rows are deleted, the index is retained and gaps in the index sequence can be observed. We print the index and note that the observations before index 53 have been deleted.
data2 = data.query('FICO_orig_time >= 800').copy()
print(data2.index)
Int64Index([ 53, 54, 55, 56, 67, 71, 3244, 3245, 3246, 3247, ... 61844, 61873, 61874, 61875, 61876, 61877, 61878, 61879, 61880, 61881], dtype='int64', length=838)
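An alternative to query() is boolean indexing, which selects rows that satisfy a logical condition and yields the same subset:
data2 = data[data['FICO_orig_time'] >= 800].copy()
print(data2.shape)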
We can also do random sampling. We randomly draw observations based on a rule. For instance, we can draw a certain number (or proportion) of observations with or without replacement. The latter implies that every observation is included in the resulting dataset once if sampled. Random sampling can also be done by groups. We can draw a certain number (or proportion) of observations for each group.
To start random sampling we set a seed value, which determines the random draw, via the random_state argument; 12345 is the seed value. You can choose other seed values if you like. Setting seeds allows us to repeat the random draws multiple times with the same outcome. Different seeds result in different samples.
data2 = data.sample(100, random_state=12345)
print(data2.shape)
(100, 30)
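A minimal sketch of random sampling by groups, here drawing 10% of the observations per time period (the fraction is an arbitrary illustrative choice; output omitted):
data2 = data.groupby('time', group_keys=False).apply(lambda g: g.sample(frac=0.1, random_state=12345))
print(data2.shape)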
Lastly, we can also subset dataframes by dropping features (here LTV_orig_time) using the drop() command:
data2 = data.drop('LTV_orig_time', axis='columns').copy()
print(data2.shape)
(62178, 29)
Alternatively, we can also subset dataframes by dropping observations (here: the first observation with an index of zero) using the dataframe.drop() command:
data2 = data.drop(0, axis='rows').copy()
print(data2.shape)
(62177, 30)
Combining Data¶
Pandas dataframes can be combined using different approaches:
- .concat combines two dataframes based on the axis (combine rows or columns);
- .append combines the rows of two dataframes;
- .merge combines two dataframes based on matching values of columns;
- .join combines two dataframes based on matching indexes.
Below are examples for these approaches. In a first step, we decompose the mortgage data into a number of subsamples and in a second step, we combine these subsamples using the various approaches.
Concatenating¶
Combining columns
We can append rows (axis=0) or columns (axis=1). This requires the same indexes. If this is not the case, the index sequence can be renumbered using the dataframe.reset_index(drop=True) method.
To showcase, we generate three datasets for the variables hpi_time, uer_time and gdp_time and then concatenate these dataframes horizontally.
hpi_time = data.loc[:,['time', 'hpi_time']].drop_duplicates().reset_index(drop=True)
uer_time = data.loc[:,['time', 'uer_time']].drop_duplicates().reset_index(drop=True)
gdp_time = data.loc[:,['time', 'gdp_time']].drop_duplicates().reset_index(drop=True)
macro_time = pd.concat([hpi_time, uer_time, gdp_time], axis=1)
print(macro_time.shape)
(60, 6)
Combining rows
We can also append a dataframe. This has the same result as the append() method below:
macro_time2 = pd.concat([hpi_time, hpi_time], axis=0)
print(macro_time2.shape)
(120, 2)
Appending¶
Appending is a specific case of concat and allows for the combining of rows:
macro_time2 = hpi_time.append(hpi_time)
print(macro_time2.shape)
(120, 2)
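Note that in more recent pandas releases the append() method has been deprecated in favour of pd.concat(). A drop-in alternative, assuming you do not need to preserve the original index, is:
macro_time2 = pd.concat([hpi_time, hpi_time], axis=0, ignore_index=True)
print(macro_time2.shape)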
Match Merging¶
Common identifiers are needed to match the various data sources. Identifiers may include borrower and loan identification numbers, social security numbers, securities identification numbers, zip codes, user identification numbers, time periods and email addresses.
We decompose our mortgage dataset into constituent data files to showcase the match-merging of multiple data sources:
- Loan origination data data_orig_time includes information available at loan origination; we keep only one observation per loan using drop_duplicates(subset='id', keep='first');
- Loan performance data data_time contains the time-varying borrower-specific information from the panel dataset; we keep all observations;
- Macroeconomic data macro_time includes only time-specific information; we keep one observation per time period using the .drop_duplicates() method.
We print the dimensions of the original and the three constituent datasets. The sum of the columns of the three datasets is greater than the number of columns in the original dataset as the identification variables id and time are included in each.
data_orig_time = data[['id', 'orig_time', 'first_time', 'mat_time', 'res_time', 'REtype_CO_orig_time', 'REtype_PU_orig_time', 'REtype_SF_orig_time',
'investor_orig_time', 'balance_orig_time', 'FICO_orig_time', 'LTV_orig_time', 'Interest_Rate_orig_time', 'state_orig_time', 'hpi_orig_time']].drop_duplicates(subset='id', keep='first')
data_time = data[['id', 'time', 'balance_time', 'LTV_time', 'interest_rate_time', 'rate_time', 'default_time', 'payoff_time', 'status_time', 'lgd_time', 'recovery_res']]
macro_time = data[['time', 'hpi_time', 'gdp_time', 'uer_time']].drop_duplicates()
print('data:', data.shape)
print('data_orig_time:', data_orig_time.shape)
print('data_time:', data_time.shape)
print('macro_time:', macro_time.shape)
data: (62178, 30)
data_orig_time: (5000, 15)
data_time: (62178, 11)
macro_time: (60, 4)
We now have the three sub-datasets: data_orig_time, data_time and macro_time.
We can merge these to obtain the original dataset data by first merging data_orig_time and data_time by id into data2 and second by merging the resulting dataset data2 with macro_time by time into data3. We compare our input dataframe data with the output dataframe data3 in terms of observation and variable numbers.
The dataframes may not fully match as we have reduced the origination and macro data (using the first observation of the identifying variables id and time) and there might have been variation that is lost in the process of reduction. One example is zip code, which may change over time. This may indicate a data error, or be economically reasonable if the collateral property and hence the location changes over time.
data2 = pd.merge(data_orig_time, data_time, on='id')
data3 = pd.merge(data2, macro_time, on='time')
print('Original dataframe data:', data.shape)
print('Reconstituted dataframe data3:', data3.shape)
Original dataframe data: (62178, 30)
Reconstituted dataframe data3: (62178, 28)
The merge function comes with four options:
- Inner join: keeps all rows in $x$ and $y$ that have common characteristics. This is the default value.
- Full join: keeps all rows from both data frames. Specify the argument how='outer'.
- Left join: keeps all the rows of your data frame $x$ and only those from $y$ that match. Specify the argument how='left'.
- Right join: keeps all the rows of your data frame $y$ and only those from $x$ that match. Specify the argument how='right'.
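As an illustration of the how argument, the following keeps all loan performance rows and attaches the matching origination information via a left join on id (output omitted):
data_left = pd.merge(data_time, data_orig_time, on='id', how='left')
print(data_left.shape)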
Joining¶
The joining command join combines two dataframes based on indexes. It is similar to merge but relies on indexes rather than specific columns.
hpi_time = data.loc[:,['time', 'hpi_time']].drop_duplicates().reset_index(drop=True)
uer_time = data.loc[:,['time', 'uer_time']].drop_duplicates().reset_index(drop=True)
macro_time3 = hpi_time.set_index('time').join(uer_time.set_index('time'), on='time')
print(macro_time3.shape)
(60, 2)
Regression Models¶
Regression models are based on substantial theory. Here, we only provide a brief introduction to show basic characteristics of the models that we estimate and apply in the next chapters. We fit a linear regression using statsmodels.formula.api from the statsmodels library, which is represented by the acronym smf. We save all objects from the method smf.ols under the name data_ols and specify the model equation with LTV_time as the dependent variable (left-hand side, LHS) and the variables LTV_orig_time and gdp_time as features (right-hand side, RHS). LHS and RHS variables are connected by the ~ sign. Further, we specify the estimation sample as our mortgage dataframe and use the fit() method:
data_ols = smf.ols(formula='LTV_time ~ LTV_orig_time + gdp_time', data=data).fit()
There are many packages in Python that provide regression models and they all provide a number of outputs. To get a first impression of what is available, we use the dir(data_ols) command. We do not execute this to conserve space.
We select the summary() method from these objects:
print(data_ols.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:               LTV_time   R-squared:                       0.220
Model:                            OLS   Adj. R-squared:                  0.220
Method:                 Least Squares   F-statistic:                     8772.
Date:                Tue, 15 Jun 2021   Prob (F-statistic):               0.00
Time:                        16:18:43   Log-Likelihood:            -2.8781e+05
No. Observations:               62153   AIC:                         5.756e+05
Df Residuals:                   62150   BIC:                         5.756e+05
Df Model:                           2
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        13.8450      0.773     17.901      0.000      12.329      15.361
LTV_orig_time     0.9632      0.010     99.086      0.000       0.944       0.982
gdp_time         -4.5720      0.051    -89.721      0.000      -4.672      -4.472
==============================================================================
Omnibus:                    88921.993   Durbin-Watson:                   0.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):        123795185.295
Skew:                           7.985   Prob(JB):                         0.00
Kurtosis:                     221.054   Cond. No.                         617.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
data_ols.summary() shows the R-squared as a performance measure of the regression model. The ratio is bounded between zero (poor model quality) and one (high model quality). We will discuss performance measures in the validation chapter.
Further, the table shows the parameter estimates. We include an intercept as well as the key features LTV ratio at origination time and GDP growth rate at observation time. The sign indicates the direction of influence. LTV has a positive sign and hence a greater LTV at origination explains a greater LTV ratio at observation time. The GDP growth rate has a negative sign and hence a greater GDP growth rate explains a lower LTV at observation time.
P-values are shown in the column with the header P>|t| and have to be interpreted with care. Throughout this text, we interpret p-values that are less than a threshold value (e.g., 1%, 5% or 10%) as a sign of statistical significance of the respective feature. For example, if the p-value is less than 1%, the variable has a significant influence at the 1% level.
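Beyond the printed summary, the fitted results object also exposes the estimates and p-values directly; for example (output omitted):
print(data_ols.params.round(4))    # estimated coefficients
print(data_ols.pvalues.round(4))   # corresponding p-values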
numpy vs pandas¶
Previous sections focused on pandas. pandas dataframes are enhanced arrays as they can be accessed by integer positions (which we use for rows) as well as labels (which we use for columns/features). pandas also offers a large range of econometric data operations, including sampling, aggregation and leading/lagging, that are particularly helpful for credit risk data.
We use numpy in the more technical sections of the book such as the part on machine learning for PD and LGD forecasting.
numpy is a fundamental package for scientific computing with Python. Among other things, it offers the creation and processing of multi-dimensional arrays, linear algebra and random number capabilities. An array is a collection of values. numpy arrays are accessed by their integer position with zero as a base.
We create a one-dimensional array (a vector) with two entries from a list. A list is created with square brackets and the entries are separated by commas. We use np.array(list1) and see that the created array has two entries, embraced by square brackets. If we have one opening and one closing bracket, we have a vector with two rows. Using array1.ndim we can return the dimension; using array1.shape we see that the array has two rows. numpy arrays are more efficient than Python lists.
array1 = np.array([1,2])
print('Array:',array1)
print('Dimension:',array1.ndim)
print('Shape:',array1.shape)
Array: [1 2]
Dimension: 1
Shape: (2,)
We now create another array from a list with four entries as shown below. We see that the array now has two opening and two closing brackets. The two square brackets indicate a two-dimensional array (a matrix). The inner brackets contain the rows as vectors and the outer brackets 'stack' the row vectors into a matrix. The resulting matrix then has two rows and two columns. This can be extended to more than two dimensions by extending the number of square brackets.
array2 = np.array([[1,3],[2,4]])
print('Array:',array2)
print('Dimension:',array2.ndim)
print('Shape:',array2.shape)
Array: [[1 3]
 [2 4]]
Dimension: 2
Shape: (2, 2)
We want to access certain elements in the array. We can index and slice them in the same way as Python lists. Remember that indexing starts with zero.
print(array2[0,])
[1 3]
print(array2[:,1])
[3 4]
print(array2[1,0])
2
For more information about numpy we refer to the official manual and tutorials on www.numpy.org.
Converting pandas dataframes to numpy arrays¶
We can convert a pandas dataframe to numpy:
data2 = data[['id', 'time', 'gdp_time', 'FICO_orig_time', 'LTV_time']].round(decimals=2)
clabels = data2.columns.values
data_numpy = data2.values
print(data_numpy)
[[4.0000e+00 2.5000e+01 2.9000e+00 5.8700e+02 3.3910e+01]
 [4.0000e+00 2.6000e+01 2.1500e+00 5.8700e+02 3.4010e+01]
 [4.0000e+00 2.7000e+01 2.3600e+00 5.8700e+02 3.4340e+01]
 ...
 [4.9972e+04 5.4000e+01 1.5100e+00 7.0800e+02 9.1870e+01]
 [4.9972e+04 5.5000e+01 2.4200e+00 7.0800e+02 9.1560e+01]
 [4.9972e+04 5.6000e+01 1.7200e+00 7.0800e+02 9.0870e+01]]
We store the column labels in the array clabels and the values in the array data_numpy. The two square brackets indicate a two-dimensional array (a matrix). The inner brackets contain the rows as vectors and the outer brackets 'stack' the row vectors into a matrix.
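In recent pandas versions, the to_numpy() method is the recommended way to obtain the underlying array and, for this dataframe, gives the same result as .values:
data_numpy = data2.to_numpy()
print(data_numpy.shape)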
Converting numpy arrays to pandas dataframes¶
We can convert a numpy array to a pandas dataframe. We label the columns using the columns argument. We can also label rows using the index argument.
data = pd.DataFrame(data=data_numpy, columns=clabels)
print(data)
            id  time  gdp_time  FICO_orig_time  LTV_time
0          4.0  25.0      2.90           587.0     33.91
1          4.0  26.0      2.15           587.0     34.01
2          4.0  27.0      2.36           587.0     34.34
3          4.0  28.0      1.23           587.0     34.67
4          4.0  29.0      1.69           587.0     34.95
...        ...   ...       ...             ...       ...
62173  49972.0  52.0      1.08           708.0    103.31
62174  49972.0  53.0      0.89           708.0     95.74
62175  49972.0  54.0      1.51           708.0     91.87
62176  49972.0  55.0      2.42           708.0     91.56
62177  49972.0  56.0      1.72           708.0     90.87

[62178 rows x 5 columns]
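The index argument works analogously; a minimal sketch with hypothetical row labels starting at one (output omitted):
data_indexed = pd.DataFrame(data=data_numpy, columns=clabels, index=np.arange(1, len(data_numpy) + 1))
print(data_indexed.index)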
Module dcr¶
We include a number of packages that have been developed by the Python community as well as functions in the module dcr, which is available for download from www.deepcreditrisk.com. The functions are detailed in the next section.
We limit the number of rows that should be shown to ten to conserve space in this book.
We also import the dataset dcr.csv into a pandas dataframe using the pd.read_csv command, which generates a dataframe (i.e., a panel dataset with observations in rows and variables in columns).
Finally, we suppress warning messages. You may exclude this line of code if you want to see warning messages throughout. This is equivalent to commenting the line of code by setting a hashtag upfront. We ensured few warning messages remain.
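The following sketch illustrates the kind of statements the dcr module contains; the actual module code may differ and these lines are only for orientation:
import warnings
import pandas as pd

pd.set_option('display.max_rows', 10)   # show at most ten rows of a dataframe
data = pd.read_csv('dcr.csv')           # import the dataset into a pandas dataframe
warnings.simplefilter('ignore')         # suppress warning messages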
We execute this by calling the dcr module via from dcr import *. Alternatively, you may run %run dcr.
from dcr import *
Functions¶
Functions are helpful to execute the code multiple times applying different arguments. To be able to scale various techniques developed in this book, we provide the following functions via the module dcr that is available for download from www.deepcreditrisk.com. The functions are:
- versions;
- dataprep;
- woe;
- validation;
- resolutionbias.
versions¶
The function versions() produces a table with the package versions that we use for this text:
from dcr import *
versions()
                     Package Version  Acronym
0                     Python   3.8.3      NaN
1                    IPython     NaN  IPython
2                       math     NaN     math
3   matplotlib.pyplot, pylab   3.3.2      plt
4                      numpy  1.18.5       np
5                     pandas   1.0.5       pd
6                     pickle     4.0   pickle
7                     random     NaN   random
8                      scipy   1.5.0    scipy
9                    sklearn  0.23.1  sklearn
10               statsmodels  0.11.1       sm
Package versions may change over time and may cause error messages or warning messages. Note, warning messages are not error messages and the code may run perfectly well. Later Python versions include a number of error (bug) fixes, and using different versions may result in slightly different outputs. Please ensure that you use the same versions to obtain the same results.
Also, this text is mainly designed so that you learn the principles of Deep Credit Risk rather than the detailed syntax; the latter may change from version to version, but the principles stay the same. Do an internet search to find a solution if an error or warning message arises with other or future versions.
In compiling this text, we have tried to minimize the number of warning messages and very few remain. For example, we use pandas heavily as we work with panel data, and a common warning message is SettingWithCopyWarning. There are two operations:
- "set" operation: we assign values using the = sign;
- "get" operation: we perform operations such as indexing.
It is very popular to connect (i.e., chain) operations. An example is hpi_time.set_index('time').join(uer_time.set_index('time'), on='time') with the chain elements set_index() and .join(). pandas may not produce the expected results in situations where multiple assignments (i.e., chained assignments) or operations are performed in different lines (i.e., hidden chains). We avoid such situations by combining sets on rows and columns using the .loc command and by specifying that we create a copy using the .copy() command when creating data subsets, to avoid misinterpretations of dataframes as views of the original dataframes.
dataprep¶
The function dataprep generates economic features, principal components and clusters. Train and test datasets are provided for machine learning techniques for PD and LGD models.
The input arguments are:
- data_in: input dataset. We generally use data = pd.read_csv('dcr.csv');
- depvar: we create numpy arrays for the machine learning chapters and depvar is the target variable in these models. The default value is depvar='default_time', i.e., the default indicator. The alternative option is depvar='lgd_time';
- splitvar: we split the dataset into a training and testing sample and splitvar is the splitting feature. The default value is splitvar='time', i.e., the observation time;
- threshold: we split the dataset into a training and testing sample and threshold is the splitting threshold. The default value is threshold=26, i.e., the start of the financial crisis.
The output arguments are:
- df: input pandas dataframe plus all additionally created variables;
- data_train: training pandas dataframe that includes the input dataset plus all additionally created variables. For PD models the training data relates to pre-crisis periods. For LGD data, the training data includes pre-crisis and crisis periods to ensure a sufficient observation number;
- data_test: test pandas dataframe that includes the input dataset plus all additionally created variables. For PD models the test data relates to crisis periods. For LGD data, the test data relates to post-crisis periods;
- X_train_scaled: training numpy array that includes all features;
- X_test_scaled: test numpy array that includes all features;
- y_train: training numpy vector of the outcome. The default value is depvar='default_time', i.e., the default indicator;
- y_test: test numpy vector of the outcome. The default value is depvar='default_time', i.e., the default indicator.
You may assign any name to these output objects. Additional assumptions are made. For example, the principal components analysis of the state-level default rates stores the first five components, and two clusters are formed using K-means clustering. A correction for resolution bias is applied if the dependent variable is lgd_time.
We discuss the function dataprep in more detail in our feature engineering chapter.
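A hedged example call is sketched below; we assume the outputs are returned in the order listed above, and the exact signature is documented in the module itself:
data = pd.read_csv('dcr.csv')
df, data_train, data_test, X_train_scaled, X_test_scaled, y_train, y_test = \
    dataprep(data, depvar='default_time', splitvar='time', threshold=26)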
woe¶
The function woe computes the weight of evidence and the information value for features. It is discussed in the feature engineering and feature selection chapters. The function is called using outputWOE, outputIV = woe(data, target, variable, bins, binning).
.
The input arguments are:
- data_in: input dataframe;
- target: outcome variable. In all our applications, we choose target='default_time', i.e., the default indicator;
- variable: feature for which weights of evidence are computed;
- bins: number of bins;
- binning: whether binning or the actual category is used (binning=True or binning=False).
The output arguments are:
- outputWOE: weight of evidence per bin;
- outputIV: information value.
You may assign any name to these output objects.
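A hedged example call, with illustrative argument values (the feature, the bin count and the binning flag are arbitrary choices):
outputWOE, outputIV = woe(data, 'default_time', 'FICO_orig_time', 5, True)
print(outputIV)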
validation¶
The function validation computes a number of validation measures and visual validation plots. The function is called using validation(fit, outcome, time, continuous=False).
.
The input arguments are:
- fit: fitted outcome variable;
- outcome: outcome variable. For PD models this would be default_time and for most of our LGD models lgd_time;
- time: variable which explains the stratifying feature. We generally use the variable time;
- continuous: whether the dependent variable is binary or not (continuous=True or continuous=False). The default value is continuous=False.
There are no output arguments as a panel with one table and three charts is printed instead. The panel summarizes a number of validation metrics and charts. We discuss more details in our validation chapter.
resolutionbias¶
The function resolutionbias corrects observed LGD values for resolution bias. It is discussed in the outcome engineering chapter. The function is called with df3 = resolutionbias(data_in, lgd, res, t).
.
The input arguments are:
- data_in: input dataset;
- lgd: LGD variable. We use the variable lgd_time in most examples;
- res: the resolution time. We use the variable res_time in most examples;
- t: observation time. We use the variable time in most examples.
The output argument is df3: a pandas dataframe that includes the input dataset with the LGD values corrected for resolution bias.
You may assign any name to these output objects. The function assumes an end of observations in period 60.
Sandbox Problems¶
- Create a data subset labeled data2, which includes the variables id, time, LTV_orig_time and LTV_time.
- Provide descriptive statistics for the variables LTV_orig_time and LTV_time.
- Filter dataset data for loans with a FICO score above 800.
- Provide a frequency table that shows the number of defaults in the sample.
- Provide a cross-frequency table by the variables default_time and time.
\begin{comment}
References¶
Salvatier, J., Wiecki, T. V. and Fonnesbeck, C. (2016). Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2, e55.
\end{comment}