Skip to content

cplusplus12580/Loan_prediction

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

25 Commits
 
 
 
 

Repository files navigation

Loan_prediction

基于提供的申请信息,自动判断用户是否有贷款的资格

train_df = pd.read_csv('train.csv')
print train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              614 non-null object
Gender               601 non-null object
Married              611 non-null object
Dependents           599 non-null object
Education            614 non-null object
Self_Employed        582 non-null object
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           592 non-null float64
Loan_Amount_Term     600 non-null float64
Credit_History       564 non-null float64
Property_Area        614 non-null object
Loan_Status          614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.4+ KB
  • Loan_ID: 唯一贷款ID
  • Gender: 性别。Male/Female
  • Married: 婚否。Y/N
  • Dependents: 受抚养者数量
  • Education: Graduate(大学毕业生,研究生)/ Under Graduate(本科生)
  • Self_Employed: 个体经营者。Y/N
  • ApplicationIncome: 收入
  • CoapplicationIncome:
  • LoanAmount: 贷款额度(千)
  • Loan_Amount_Term: 贷款期限(月)
  • Credit_History: credit history meets guidelines.0和1
  • Property_area: 房产位置。Urban/Semi Urban/Rural
  • Loan_status: 是否同意贷款。Y/N

Partners in a transaction will use co-applicant status to share the responsibility of a loan as well as the benefits of ownership for the product purchased with the loan. Co-applicants legally agree to share the property and the responsibility for repayment of the loan

print train_df.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
print train_df.head()
    Loan_ID Gender Married Dependents     Education Self_Employed  \
0  LP001002   Male      No          0      Graduate            No   
1  LP001003   Male     Yes          1      Graduate            No   
2  LP001005   Male     Yes          0      Graduate           Yes   
3  LP001006   Male     Yes          0  Not Graduate            No   
4  LP001008   Male      No          0      Graduate            No   

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5849                0.0         NaN             360.0   
1             4583             1508.0       128.0             360.0   
2             3000                0.0        66.0             360.0   
3             2583             2358.0       120.0             360.0   
4             6000                0.0       141.0             360.0   

   Credit_History Property_Area Loan_Status  
0             1.0         Urban           Y  
1             1.0         Rural           N  
2             1.0         Urban           Y  
3             1.0         Urban           Y  
4             1.0         Urban           Y 

测试集:

test_df = pd.read_csv('test.csv')
print test_df.info()

RangeIndex: 367 entries, 0 to 366
Data columns (total 12 columns):
Loan_ID              367 non-null object
Gender               356 non-null object
Married              367 non-null object
Dependents           357 non-null object
Education            367 non-null object
Self_Employed        344 non-null object
ApplicantIncome      367 non-null int64
CoapplicantIncome    367 non-null int64
LoanAmount           362 non-null float64
Loan_Amount_Term     361 non-null float64
Credit_History       338 non-null float64
Property_Area        367 non-null object
dtypes: float64(3), int64(2), object(7)
memory usage: 34.5+ KB
print test_df.isnull().sum()

Loan_ID               0
Gender               11
Married               0
Dependents           10
Education             0
Self_Employed        23
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            5
Loan_Amount_Term      6
Credit_History       29
Property_Area         0
dtype: int64
print test_df.head()

    Loan_ID Gender Married Dependents     Education Self_Employed  \
0  LP001015   Male     Yes          0      Graduate            No   
1  LP001022   Male     Yes          1      Graduate            No   
2  LP001031   Male     Yes          2      Graduate            No   
3  LP001035   Male     Yes          2      Graduate            No   
4  LP001051   Male      No          0  Not Graduate            No   

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5720                  0       110.0             360.0   
1             3076               1500       126.0             360.0   
2             5000               1800       208.0             360.0   
3             2340               2546       100.0             360.0   
4             3276                  0        78.0             360.0   

   Credit_History Property_Area  
0             1.0         Urban  
1             1.0         Urban  
2             1.0         Urban  
3             NaN         Urban  
4             1.0         Urban

train 和test合并

df = pd.concat([train_df, test_df], axis=0)
train_size = train_df.shape[0]

Married

print df[df.Married.isnull()][['Gender', 'Dependents','Self_Employed', 'Education', 'ApplicantIncome']]

     Gender Dependents Self_Employed Education  ApplicantIncome
104    Male        NaN            No  Graduate             3816
228    Male        NaN            No  Graduate             4758
435  Female        NaN            No  Graduate            10047

和结婚信息最息息相关的Dependents竟然也是NaN,吐血,只能根据其它的字段信息进行推断了

facet = sns.FacetGrid(df[(df.Self_Employed == 'No') & (df.Gender=='Female') & (df.Education == 'Graduate') & (df.ApplicantIncome > 10000)], hue='Married', aspect=4)
facet.map(sns.kdeplot, 'ApplicantIncome', shade=True)
facet.set(xlim=(10000, df[(df.Self_Employed == 'No') & (df.Gender=='Female') & (df.Education == 'Graduate') & (df.ApplicantIncome > 10000)]['ApplicantIncome'].max()))
facet.add_legend()
plt.title('Female and Income > 10000')
plt.show()

从上图来看,女性申请者,收入略多于10000的非常有可能是未婚的。

sns.countplot(x='Married', hue='Gender', data=df[(df.Self_Employed == 'No') & (df.Education == 'Graduate') & (df.ApplicantIncome > 10000)])
plt.title('Income > 10000')
plt.show()

从上图中可以看出,Female申请额度大于10000时,没结婚的偏多。

综合以上两点,该女性很可能是未婚的。

facet = sns.FacetGrid(df[(df.Self_Employed == 'No') & (df.Education == 'Graduate') & (df.ApplicantIncome < 5000)], hue='Married', aspect=4)
facet.map(sns.kdeplot, 'ApplicantIncome', shade=True)
facet.set(xlim=(0, df[(df.Self_Employed == 'No') & (df.Education == 'Graduate') & (df.ApplicantIncome < 5000)]['ApplicantIncome'].max()))
facet.add_legend()
plt.show()

Male申请者,未婚和已婚的收入分布基本一致。

sns.countplot(x='Married', hue='Gender', data=df[(df.Self_Employed == 'No') & (df.Education == 'Graduate') & (df.ApplicantIncome < 5000)])
plt.title('Income < 5000')
plt.show()

但是从统计上来看,已婚的男士较多。

综上两点,该两位男性很有可能是已婚的。

所以:

_ = df.set_value((df.ApplicantIncome > 10000) & (df.Married.isnull()), 'Married', 'No')
_ = df.set_value((df.ApplicantIncome < 5000) & (df.Married.isnull()), 'Married', 'Yes')

Gender

查看性别对贷款的影响

sns.countplot(x='Gender', hue='Loan_Status', data = train_df)
plt.show()

产看Gender缺失值的其它特征:

print df[df.Gender.isnull()][['Self_Employed', 'ApplicantIncome']]

    Self_Employed  ApplicantIncome
23             No             3365
126            No            23803
171            No            51763
188           Yes              674
314            No             2473
334           Yes             9833
460           Yes             2083
467            No            16692
477            No             2873
507            No             3583
576            No             3087
588            No             4750
592           Yes             9357
22             No             3909
51             No             3500
106            No             1596
138            No             3333
209            No             2038
231           Yes             2860
245            No             3186
279            No            29167
296            No             6478
303           Yes              570
318            No             4768
sns.boxplot(x='Gender', y='ApplicantIncome', hue='Self_Employed', data=df)
plt.axhline(y='5000', color='red')
plt.axhline(y='10000', color='green')
plt.show()

对Gender进行预测。

选择相关的不含缺失值的字段用于预测。

df_for_gender = df[['Gender', 'Married', 'Property_Area', 'Education', 'ApplicantIncome']]

对ApplicantIncome字段进行缩放处理

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_for_gender['Norm_Income'] = pd.Series(scaler.fit_transform(df_for_gender['ApplicantIncome'].reshape(-1,1)).reshape(-1), index=df_for_gender.index)
df_for_gender.drop('ApplicantIncome', axis=1, inplace=True)

对类别型变量进行dummies处理

df_for_gender = pd.get_dummies(df_for_gender, columns=['Married', 'Property_Area', 'Education'])

经过上面的步骤,用于预测Gender的数据集如下:

print df_for_gender.info()

Data columns (total 9 columns):
Gender                     957 non-null object
Norm_Income                981 non-null float64
Married_No                 981 non-null float64
Married_Yes                981 non-null float64
Property_Area_Rural        981 non-null float64
Property_Area_Semiurban    981 non-null float64
Property_Area_Urban        981 non-null float64
Education_Graduate         981 non-null float64
Education_Not Graduate     981 non-null float64
dtypes: float64(8), object(1)
memory usage: 76.6+ KB

训练逻辑回归模型

from sklearn.linear_model import LogisticRegression
lg = LogisticRegression()
parameters={'C': [0.5]}
clf_lgl = get_model(lg, parameters, X_train, y_train, scoring)
print clf_lgl

LogisticRegression(C=0.5, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

模型测试正确率还算可以。

print accuracy_score(y_train, clf_lgl.predict(X_train))

0.811659192825

CV检验:

plot_learning_curve(clf_lgl, 'Logistic Regression', X, y, cv=4)
plt.show()

模型正常。

对缺失值进行预测,并替换缺失值

y_predict = clf_lgl.predict(X_predict)
df.set_value(df.Gender.isnull(), 'Gender', y_predict)

删除后面不用的变量

del df_for_gender, X, y, X_train, X_test, y_train, y_test, X_predict, y_predict, lg, clf_lgl

Dependents

sns.countplot(x='Dependents', hue ='Loan_Status', data=df)
plt.title('count by Dependents and Loan_Status')
plt.show()

对Dependents进行类似的预测,但是正确率很低。

df_for_dependents = df[['ApplicantIncome', 'Education', 'Gender', 'Married', 'Property_Area', 'Dependents']]

预测的准确率有点低

Credit_history

LogisticRegression(C=0.5, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
0.835182250396

Self_Employed

from sklearn.svm import SVC
svc = SVC(random_state=42, kernel='poly', probability=True)
parameters = {'C': [35], 'gamma': [0.0055], 'coef0': [0.1],
              'degree':[2]}
clf_svc = get_model(svc, parameters, X_train, y_train, scoring)
print (clf_svc)
print (accuracy_score(y_test, clf_svc.predict(X_test)))
plot_learning_curve(clf_svc, 'SVC for Self_Employed Precidtion', X, y, cv=4);
plt.show()


SVC(C=35, cache_size=200, class_weight=None, coef0=0.1,
  decision_function_shape=None, degree=2, gamma=0.0055, kernel='poly',
  max_iter=-1, probability=True, random_state=42, shrinking=True,
  tol=0.001, verbose=False)
0.852517985612

LoanAmount

df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']
sns.FacetGrid(df, hue='Loan_Status', size=5).map(plt.scatter, 'TotalIncome','LoanAmount').add_legend()
plt.title('LoanAmount and TotalIncome')
plt.show()

分布毫无规律可言。

加上其它的字段信息进行线性回归。

先采用随机值处理

Loan_Amount_Term

df['Total_Income'] = df['ApplicantIncome'] + df['CoapplicantIncome']

print df[df.Loan_Amount_Term.isnull()][['LoanAmount', 'Total_Income']]

     LoanAmount  Total_Income
19        115.0        6100.0
36        100.0        3158.0
44         96.0        4695.0
45         88.0        3410.0
73         95.0        4755.0
112       152.0        7686.0
165       182.0        6873.0
197       120.0        4272.0
223       175.0        8588.0
232       120.0        5787.0
335        70.0        9993.0
367       124.0        5124.0
421        80.0        2720.0
423       110.0        8917.0
45        185.0        8160.0
48        187.0       10130.0
117        80.0        4416.0
129        70.0        5409.0
184       150.0       10916.0
214       118.0        7473.0
print df['Loan_Amount_Term'].value_counts()

360.0    823
180.0     66
480.0     23
300.0     20
240.0      8
84.0       7
120.0      4
60.0       3
36.0       3
12.0       2
350.0      1
6.0        1

银行贷款时间一般都是一些固定的。

sns.countplot(x='Loan_Amount_Term', hue='Loan_Status', data=df)
plt.show()

LogisticRegression(C=0.5, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
0.830508474576

到这里数据集基本完成了缺失值的替换。

print df.info()

Int64Index: 981 entries, 0 to 366
Data columns (total 14 columns):
ApplicantIncome      981 non-null int64
CoapplicantIncome    981 non-null float64
Credit_History       981 non-null float64
Dependents           981 non-null object
Education            981 non-null object
Gender               981 non-null object
LoanAmount           981 non-null float64
Loan_Amount_Term     981 non-null object
Loan_ID              981 non-null object
Loan_Status          614 non-null object
Married              981 non-null object
Property_Area        981 non-null object
Self_Employed        981 non-null object
Total_Income         981 non-null float64
dtypes: float64(4), int64(1), object(9)
memory usage: 115.0+ KB

逻辑回归模型:

LogisticRegression(C=0.5, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
0.789189189189

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=12, min_samples_split=5,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
            oob_score=True, random_state=42, verbose=0, warm_start=False)

Ensemble

from sklearn.ensemble import VotingClassifier
clf_vc = VotingClassifier(estimators=[('lgl', clf_lgl), ('rfc1', clf_rfc1)], 
                          voting='hard', weights=[1,1])
clf_vc = clf_vc.fit(X_train, y_train)

print (accuracy_score(y_test, clf_vc.predict(X_test)))
print (clf_vc)

0.789189189189
VotingClassifier(estimators=[('lgl', LogisticRegression(C=0.5, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)), ('rfc1', Rand...stimators=500, n_jobs=1,
            oob_score=True, random_state=42, verbose=0, warm_start=False))],
         voting='hard', weights=[2, 1])

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published