This project uses the Prosper Loan Dataset, which contains details of the loans issued to borrowers through Prosper. Each row in the dataset represents a loan, uniquely identified by its Listing Key.
Every row describes attributes of the borrower, such as Employment Status and Credit Score, as well as attributes of the loan, such as Monthly Payments, On Time Payments and Interest Rate.
California is the state with the most borrowers; California, Texas and New York are the top 3 states by number of borrowers.
North Dakota is the state with the fewest borrowers; Wyoming, Maine and North Dakota are the 3 states with the fewest borrowers.
# Count borrowers per state once, then slice the ten largest and ten smallest.
state_counts = dataset['BorrowerState'].value_counts()
most_state_list = state_counts[:10].index.tolist()
most_state_count = state_counts[:10].values.tolist()
least_state_list = state_counts[-10:].index.tolist()
least_state_count = state_counts[-10:].values.tolist()
f, (ax1, ax2) = plt.subplots(ncols=2, sharey=False, sharex=False,
                             figsize=(12, 6))
sns.barplot(x=most_state_count, y=most_state_list, ax=ax1)
ax1.set_title('States With Highest Number of Borrowers')
ax1.set_xlabel('Number of Borrowers')
ax1.set_ylabel('State Abbreviation')
sns.barplot(x=least_state_count, y=least_state_list, ax=ax2)
ax2.set_title('States With Lowest Number of Borrowers')
ax2.set_xlabel('Number of Borrowers')
ax2.set_ylabel('State Abbreviation')
plt.show()
It can be observed that most of the loans were categorised as Debt Consolidation. Some of the most common categories of loans are:
However, many loans were listed under Not Available and Other, which makes it difficult to determine accurately the most common reason for taking a loan.
fig = plt.figure(figsize=(12,4))
sns.countplot(y='ListingCategory', data=dataset)
plt.title('Distribution of Loan Categories')
plt.ylabel('Loan Category')
plt.xlabel('Number of Loans')
plt.show()
Borrowers are split almost evenly by home ownership: roughly half of the borrowers own a home, and the other half do not.
fig = plt.figure(figsize=(12,6))
sns.countplot(x='IsBorrowerHomeowner', data=dataset)
plt.title('Distribution of Homeowners')
plt.ylabel('Number of Borrowers')
plt.show()
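The "almost 50-50" reading of the count plot can be made concrete by computing the share of each category. The following is a minimal, dependency-free sketch; it assumes the column has been pulled out as a plain list (for example with `dataset['IsBorrowerHomeowner'].tolist()`). The helper name `category_shares` and the sample values are hypothetical and are not taken from the Prosper data.

```python
from collections import Counter


def category_shares(values):
    """Return {category: fraction of all values} for an iterable of labels."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Made-up flags for illustration only (not Prosper data).
sample_flags = [True, False, True, False, True, False, False, True]
shares = category_shares(sample_flags)
print(shares)
```

The same helper works for any categorical column, such as Employment Status or Term.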
Most of the borrowers are Employed. Some borrowers are listed as Not Available or Other, some are Retired, and only a few are Not Employed.
fig = plt.figure(figsize=(12,6))
sns.countplot(y='EmploymentStatus', data=dataset)
plt.title('Distribution of Employment Status')
plt.ylabel('Employment Status')
plt.xlabel('Number of Borrowers')
plt.show()
The highest number of loans originated in 2013, whereas the fewest loans originated in 2005.
Additionally, the number of loan originations rises steadily from 2009 to 2013, peaking in 2013, after which the increase is interrupted. The number of loans also almost doubled each year compared with the previous year over the period 2010 - 2013.
fig = plt.figure(figsize=(12,5))
sns.countplot(x='LoanOriginationYear', data=dataset, palette='OrRd')
plt.title('Year Wise Distribution - Number of Loans')
# The x axis holds the origination year; the y axis holds the loan count.
plt.xlabel('Loan Origination Year')
plt.ylabel('Number of Loans')
plt.show()
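The "almost doubled every year" observation can be checked by dividing each year's loan count by the count of the preceding year. This is a small sketch under stated assumptions: the years are supplied as a plain list (for example `dataset['LoanOriginationYear'].tolist()`), and both the helper name `yearly_growth` and the sample years are hypothetical.

```python
from collections import Counter


def yearly_growth(years):
    """Map each year to (its count / count of the previous year present in the data).

    The earliest year has no predecessor and is therefore omitted.
    """
    counts = Counter(years)
    ordered = sorted(counts)
    growth = {}
    for prev, curr in zip(ordered, ordered[1:]):
        growth[curr] = counts[curr] / counts[prev]
    return growth


# Made-up years for illustration only (not Prosper data).
sample_years = [2010] * 2 + [2011] * 4 + [2012] * 8 + [2013] * 12
growth = yearly_growth(sample_years)
print(growth)
```

A ratio close to 2 means the count roughly doubled relative to the previous year; a ratio below 1 means it fell.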
The distribution is hard to read because the borrowers' incomes contain many outliers. In the distribution below, monthly incomes above 10,000 can be considered outliers, and they are therefore removed from the dataset.
Most of the borrowers in the dataset have a monthly income between 3,000 and 6,000.
fig = plt.figure(figsize=(12,4))
sns.boxplot(x='StatedMonthlyIncome', data=dataset)
plt.title('Distribution of Stated Monthly Income')
plt.xlabel('Stated Monthly Income')
plt.show()
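The outlier handling described above can be sketched as two steps: estimating where a box plot places its upper whisker limit, and dropping values above a chosen cutoff. The sketch below is dependency-free and rests on assumptions: incomes are given as a plain list, the fence uses the common Q3 + 1.5 × IQR convention, and the 10,000 cutoff is the one stated in the text. The helper names and sample incomes are hypothetical.

```python
import statistics


def upper_fence(values, k=1.5):
    """Upper box-plot fence, Q3 + k * IQR, using linearly interpolated quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    return q3 + k * (q3 - q1)


def drop_above(values, cutoff):
    """Keep only the values that do not exceed the cutoff."""
    return [v for v in values if v <= cutoff]


# Made-up monthly incomes for illustration only (not Prosper data).
sample_incomes = [2500, 3000, 3500, 4000, 4500, 5000, 5500, 6000, 50000]
fence = upper_fence(sample_incomes)
filtered = drop_above(sample_incomes, 10000)
print(fence, filtered)
```

Comparing the computed fence with the fixed cutoff is one way to judge whether a round number such as 10,000 is a reasonable threshold for a given sample.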
This question could not be answered with complete certainty, as most of the borrowers' occupations have been listed as Other.
However, based on the available data, it can be observed that the 3 most common occupations are:
occp_list = dataset['Occupation'].value_counts()[:10].index.tolist()
occp_count = dataset['Occupation'].value_counts()[:10].values.tolist()
fig = plt.figure(figsize=(10,4))
sns.barplot(y=occp_list, x=occp_count)
plt.title('Borrower Occupation Distribution (Top 10 Occupations)')
plt.xlabel('Number of Borrowers')
plt.ylabel('Borrower Occupation')
plt.show()
The Prosper Score ranges from 1 to 11. Most of the borrowers were assigned Prosper Scores in the range 4 - 8, and a Prosper Score of 4 was the most commonly occurring score in the dataset.
fig = plt.figure(figsize=(10,6))
sns.countplot(x='ProsperScore', data=dataset, palette='OrRd')
plt.title('Prosper Score Distribution')
plt.xlabel('Prosper Score')
plt.ylabel('Number of Borrowers')
plt.show()
From the Box Plot, it can be seen that most of the borrowers have a Credit Score in the range of 650 - 750.
Outliers can also be identified, with Credit Scores below 600 and above 800.
f, (ax1, ax2) = plt.subplots(nrows=2, sharey=False, sharex=True,
                             figsize=(12, 6))
sns.boxplot(x='CreditScoreRangeLower', data=dataset, color='yellow', ax=ax1)
ax1.set_title('Credit Score Range Lower Distribution')
ax1.set_xlabel('Credit Score Range Lower')
sns.boxplot(x='CreditScoreRangeUpper', data=dataset, color='red', ax=ax2)
ax2.set_title('Credit Score Range Upper Distribution')
ax2.set_xlabel('Credit Score Range Upper')
plt.show()
Most of the loans issued are between 2,000 and 10,000, and loans of more than 25,000 are extremely rare.
Additionally, loans of exactly 10,000 and 15,000 are very common.
fig = plt.figure(figsize=(10,6))
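# Histogram with a KDE overlay on a density scale (distplot is deprecated in seaborn).
sns.histplot(dataset['LoanOriginalAmount'], bins=50, kde=True, stat='density')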
plt.title('Distribution of Original Loan Amount')
plt.xlabel('Original Loan Amount')
plt.show()
Most of the loans in the dataset have a Monthly Loan Payment below 500.
fig = plt.figure(figsize=(10,6))
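# Histogram with a KDE overlay on a density scale (distplot is deprecated in seaborn).
sns.histplot(dataset['MonthlyLoanPayment'], bins=20, kde=True, stat='density')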
plt.title('Distribution of Monthly Loan Payments')
plt.xlabel('Monthly Loan Payment')
plt.show()
The majority of the loans in the dataset have a term of 36 Months/3 Years, while loans with a term of 12 Months are very rare. A smaller portion of the loans have a term of 60 Months/5 Years.
fig = plt.figure(figsize=(8,4))
sns.countplot(x='Term', data=dataset)
plt.title('Distribution of Loan Terms')
plt.xlabel('Loan Term')
plt.ylabel('Number of Loans')
plt.show()
A strong positive correlation can be observed between Monthly Income and Prosper Score: the higher the Monthly Income, the higher the Prosper Score.
fig = plt.figure(figsize=(12,6))
sns.pointplot(y='StatedMonthlyIncome', x='ProsperScore', data=dataset)
plt.title('Relationship between Monthly Income and Prosper Score')
plt.xlabel('Prosper Score')
plt.ylabel('Stated Monthly Income')
plt.show()
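The strength of a relationship such as the one between Monthly Income and Prosper Score can be quantified with a correlation coefficient, rather than read off the plot alone. The following is a minimal, dependency-free sketch of the Pearson coefficient; it assumes the two columns are available as equal-length lists with missing values already dropped. The function name `pearson_r` and the sample numbers are hypothetical and are not results from the Prosper data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores and incomes for illustration only (not Prosper data).
sample_scores = [1, 2, 3, 4, 5]
sample_incomes = [3000, 2500, 4000, 3500, 5000]
r = pearson_r(sample_scores, sample_incomes)
print(round(r, 3))
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one. Because the Prosper Score is an ordinal rating, a rank-based coefficient could be a reasonable alternative.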
No significant difference can be observed in the distribution of the Prosper Score between borrowers who own a home and borrowers who do not. However, an interesting pattern can be noticed: at lower Prosper Scores, borrowers who do not own a home outnumber those who do, whereas at higher Prosper Scores, borrowers who own a home outnumber those who do not.
fig = plt.figure(figsize=(10,6))
sns.boxenplot(y='ProsperScore', x='IsBorrowerHomeowner', data=dataset)
plt.title('Prosper Score Distribution - Based on Home Ownership')
plt.xlabel('Is Borrower Home Owner')
plt.ylabel('Prosper Score')
plt.show()
fig = plt.figure(figsize=(12,6))
sns.countplot(hue='IsBorrowerHomeowner', x='ProsperScore', data=dataset)
plt.title('Relationship Between Prosper Score & Home Ownership')
plt.ylabel('Number of Borrowers')
plt.xlabel('Prosper Score')
plt.show()
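The pattern between Prosper Score and home ownership can be checked numerically by computing the share of homeowners at each score. This is a small sketch, assuming both columns are supplied as equal-length lists with missing scores already dropped; the helper name `homeowner_share_by_score` and the sample rows are hypothetical.

```python
from collections import Counter


def homeowner_share_by_score(scores, is_homeowner):
    """Return {score: fraction of borrowers at that score who own a home}."""
    totals = Counter()
    owners = Counter()
    for score, owns in zip(scores, is_homeowner):
        totals[score] += 1
        if owns:
            owners[score] += 1
    return {score: owners[score] / totals[score] for score in sorted(totals)}


# Made-up rows for illustration only (not Prosper data).
sample_scores = [2, 2, 2, 2, 9, 9, 9, 9]
sample_owner_flags = [True, False, False, False, True, True, True, False]
owner_share = homeowner_share_by_score(sample_scores, sample_owner_flags)
print(owner_share)
```

A share below 0.5 at a given score means non-homeowners are in the majority there, and a share above 0.5 means homeowners are.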
fig = plt.figure(figsize=(12,8))
sns.boxplot(x='LoanOriginalAmount', y='ListingCategory', data=dataset)
plt.title('Distribution of Loan Amount Based on Loan Category')
plt.ylabel('Loan Category')
plt.xlabel('Original Loan Amount')
plt.show()
It can be observed that borrowers who are Employed have loans of higher amounts than borrowers in other employment categories.
In contrast, borrowers who are Retired or Not Employed have taken loans of lower amounts than borrowers in other employment categories.
fig = plt.figure(figsize=(12,6))
sns.pointplot(y='LoanOriginalAmount', x='EmploymentStatus', data=dataset)
plt.title('Distribution of Loan Amount Based on Employment')
plt.xlabel('Employment Status')
plt.ylabel('Original Loan Amount')
plt.show()
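By default, a seaborn point plot draws the mean of the y variable for each category on the x axis, so the comparison above is a comparison of group means. The sketch below tabulates such means without any plotting library; it assumes the two columns are given as equal-length lists, and the helper name `mean_by_group` and the sample values are hypothetical.

```python
from collections import defaultdict


def mean_by_group(keys, values):
    """Return {group: mean of the values belonging to that group}, sorted by group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for key, value in zip(keys, values):
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sorted(sums)}


# Made-up statuses and loan amounts for illustration only (not Prosper data).
sample_status = ['Employed', 'Employed', 'Retired', 'Retired', 'Not employed']
sample_amounts = [10000, 8000, 4000, 6000, 3000]
group_means = mean_by_group(sample_status, sample_amounts)
print(group_means)
```

Printing the means alongside the plot makes it easier to state how large the difference between categories actually is.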
A clear relationship can be observed between Employment Status and Loan Interest Rate. Borrowers who are Not Employed have been charged the highest Interest Rate, whereas borrowers who are Employed Full-Time/Part-Time are offered the lowest Interest Rate.
fig = plt.figure(figsize=(12,6))
sns.pointplot(y='BorrowerRate', x='EmploymentStatus', data=dataset)
plt.title('Distribution of Interest Rate Based on Employment')
plt.xlabel('Employment Status')
plt.ylabel('Loan Interest Rate')
plt.show()
It can be observed that most of the borrowers who are Employed, either Full-Time or Part-Time, have a higher range of Prosper Scores than borrowers who are Not Employed.
Surprisingly, borrowers who are Retired have a similar range of Prosper Scores to borrowers who are Employed Part-Time.
fig = plt.figure(figsize=(12,6))
sns.boxplot(x='ProsperScore', y='EmploymentStatus', data=dataset)
plt.title('Distribution of Prosper Score Based on Employment')
plt.ylabel('Employment Status')
plt.xlabel('Prosper Score')
plt.show()
A strong negative relationship can be observed between the Interest Rate and the borrower's Prosper Score.
Relationship: Higher Prosper Score -> Lower Interest Rate, and vice versa.
fig = plt.figure(figsize=(12,6))
sns.pointplot(y='BorrowerRate', x='ProsperScore', data=dataset)
plt.title('Distribution of Interest Rate Based on Prosper Score')
plt.ylabel('Loan Interest Rate')
plt.xlabel('Prosper Score')
plt.show()
A positive trend can be observed in the Loan Amounts over the years: from 2009 onwards, the Loan Amount being issued increases continuously.
fig = plt.figure(figsize=(12,6))
sns.pointplot(y='LoanOriginalAmount', x='LoanOriginationYear',
              data=dataset)
plt.title('Relationship Between Loan Amounts and Loan Origination Year')
plt.ylabel('Original Loan Amount')
plt.xlabel('Loan Origination Year')
plt.show()
The data supports this relation: for the majority of the loans, loans with a longer term have a higher range of Loan Amounts.
fig = plt.figure(figsize=(12,6))
sns.boxplot(y='LoanOriginalAmount', x='Term',
            data=dataset)
plt.title('Relationship Between Loan Amounts and Loan Term')
plt.ylabel('Original Loan Amount')
plt.xlabel('Loan Term')
plt.show()
It can be clearly seen that borrowers with a higher Prosper Score have a higher number of On Time Monthly Payments. This relation holds for all Prosper Scores except a Prosper Score of 2, which is an exception to the pattern: borrowers with this score have fewer On Time Payments than borrowers with a lower Prosper Score.
fig = plt.figure(figsize=(12,4))
sns.pointplot(y='OnTimeProsperPayments', x='ProsperScore',
              data=dataset)
plt.title('Relationship Between On Time Payments and Prosper Score')
plt.ylabel('Number of On Time Payments')
plt.xlabel('Prosper Score')
plt.show()
A strong relationship between Credit Score and Prosper Score can be observed: a higher Credit Score is associated with a higher Prosper Score.
fig = plt.figure(figsize=(12,6))
sns.pointplot(y='CreditScoreRangeLower', x='ProsperScore',data=dataset)
plt.title('Relationship Between Credit Score & Prosper Score')
# Labels match the plotted variables: Prosper Score on x, Credit Score on y.
plt.ylabel('Credit Score Range Lower')
plt.xlabel('Prosper Score')
plt.show()
A strong negative correlation can be seen between Estimated Loss and Prosper Score. Loans given to borrowers with a higher Prosper Score have a lower Estimated Loss.
fig = plt.figure(figsize=(12,6))
sns.pointplot(x='ProsperScore', y='EstimatedLoss', data=dataset)
plt.title('Relationship Between Estimated Loss and Prosper Score')
plt.xlabel('Prosper Score')
plt.ylabel('Estimated Loss')
plt.show()
A relationship can be observed between Monthly Payments and Loan Amount: a higher Loan Amount leads to higher Monthly Payments. However, the Prosper Score shows no visible effect on either: a higher Prosper Score does not correspond to a higher Loan Amount or Monthly Payment, nor to a lower one.
fig = plt.figure(figsize=(12,6))
sns.scatterplot(x='LoanOriginalAmount', y='MonthlyLoanPayment',
                data=dataset, hue='ProsperScore', palette='OrRd')
plt.title('Correlation Between Monthly Payments, Loan Amount & Prosper Score')
plt.ylabel('Monthly Loan Payment')
plt.xlabel('Original Loan Amount')
plt.show()
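One way to see why the Monthly Loan Payment rises with the Loan Amount is the standard annuity formula for a fixed-rate, fully amortizing loan, under which the payment is proportional to the principal for a given rate and term. Whether the payments in this dataset were computed exactly this way is an assumption, not something the data description here states. The function name `monthly_payment` and the example figures are hypothetical.

```python
def monthly_payment(principal, annual_rate, term_months):
    """Monthly payment for a fixed-rate, fully amortizing loan.

    Uses the standard annuity formula M = P * r / (1 - (1 + r) ** -n),
    where r is the monthly rate and n the number of monthly payments.
    """
    if term_months <= 0:
        raise ValueError("term_months must be positive")
    monthly_rate = annual_rate / 12
    if monthly_rate == 0:
        # Without interest the principal is simply split evenly.
        return principal / term_months
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -term_months)


# Illustrative figures only: 10,000 borrowed at 12% per year over 36 months.
payment = monthly_payment(10_000, 0.12, 36)
print(round(payment, 2))
```

Under this formula, doubling the principal doubles the payment, while the rate and the term change the slope. This is consistent with a scatter plot in which points fall along a few straight lines through the origin.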
It can be seen that borrowers with fewer Current Delinquencies and more On Time Payments are likely to have a higher number of loans than borrowers with more Current Delinquencies and fewer On Time Payments.
fig = plt.figure(figsize=(12,6))
sns.scatterplot(x='CurrentDelinquencies', y='OnTimeProsperPayments',
                hue='TotalProsperLoans', data=dataset, palette='OrRd')
plt.title('Relationship Between Delinquencies,On Time Payments & Number Of Loans')
plt.xlabel('Current Delinquencies')
plt.ylabel('Number of On Time Payments')
plt.show()
A strong positive correlation can be seen between Interest Rate, Yield and Estimated Loss: as the Interest Rate increases, both Yield and Estimated Loss increase in the same proportion.
fig = plt.figure(figsize=(12,6))
sns.scatterplot(x='BorrowerRate', y='LenderYield',
                hue='EstimatedLoss', data=dataset,
                palette='OrRd')
plt.title('Relation Between Interest Rate, Yield & Estimated Loss')
plt.xlabel('Borrower Interest Rate')
plt.ylabel('Lender Yield')
plt.show()
The exploration of the dataset led to the discovery of various attributes of the borrowers in the Prosper Loan dataset:
The exploration also revealed many correlations between loan attributes, such as Interest Rate, Monthly Payments, Loan Origination Year and On Time Payments, and borrower attributes, such as Employment Status. The following are some of the findings:
The exploration of the dataset led to the identification of some interesting dependencies between the Borrower Attributes and the Loan Attributes.