- 1. Libraries
- 2. The data
- 2.1 Loading the data
- 2.2 A first look at the data
- 2.3 Sample rows
- 2.4 Missing values
- 3. Cleaning the data
- 3.1 The Embarked field
- 3.2 The Age field
- 3.3 The Age and Sex fields
- 3.4 The Cabin field
- 3.5 The Fare field
- 3.6 The Parch and SibSp fields
- 3.7 The Pclass field
- 3.8 Encoding categorical fields
- 4. Machine learning
- 4.1 The CV split
- 4.2 CV evaluation
- 4.3 Predicting on the test set
- 5. Remaining issues
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
import warnings
warnings.filterwarnings("ignore")
- 2.1 Loading the data
train_df = pd.read_csv('train.csv')
- 2.2 A first look at the data
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
The training set information tells us:
- (1) There are 891 rows (indexed 0 to 890).
- (2) There are 12 fields: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked.
- (3) The Age, Cabin and Embarked fields contain missing values.
- (4) The data types are float, int and object.
Field descriptions (a quick check of the coded values follows the list):
- Survived: survival. 0 = no; 1 = yes
- Pclass: ticket class. 1 = upper; 2 = middle; 3 = lower
- Name: passenger name
- Sex: passenger sex
- Age: passenger age
- SibSp: number of siblings/spouses aboard
- Parch: number of parents/children aboard
- Ticket: ticket number
- Fare: ticket fare
- Cabin: cabin number
- Embarked: port of embarkation. C = Cherbourg; Q = Queenstown; S = Southampton
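A small sketch (not part of the original notebook) to confirm the coded values listed above against the data:
# Print the distinct values of the main categorical fields (0/1, 1/2/3, male/female, C/Q/S)
for col in ['Survived', 'Pclass', 'Sex', 'Embarked']:
    print(train_df[col].value_counts(dropna=False))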
- 2.3 Sample rows
print(train_df.head())
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
- 2.4 Missing values
print(train_df.isnull().sum())
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
- 3.1 The Embarked field
The Embarked field has two missing values in the training set.
Since so few values are missing, we can simply look at the other attributes of these two passengers:
print(train_df[train_df.Embarked.isnull()][['Pclass', 'Fare', 'Cabin', 'Sex']])
Pclass Fare Cabin Sex
61 1 80.0 B28 female
829 1 80.0 B28 female
sns.boxplot(x='Pclass', y='Fare', hue='Embarked', data=train_df)
plt.axhline(y=80, color='red')
plt.show()
The boxplot shows that first-class passengers with a fare of about 80 are concentrated among those who embarked at C (Cherbourg), so the two missing values are filled with 'C'.
train_df['Embarked'] = train_df['Embarked'].fillna('C')
More detailed charts for Embarked:
fig, (axis1, axis2, axis3) = plt.subplots(1, 3, figsize = (15, 5))
sns.countplot(x='Embarked', data=train_df, ax=axis1)
sns.countplot(x='Survived', hue='Embarked', data=train_df, order=[1,0], ax=axis2)
embark_perc = train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean()
sns.barplot(x='Embarked', y='Survived', data=embark_perc, order=['S', 'C', 'Q'], ax=axis3)
plt.show()
- 3.2 The Age field
- Replace the missing Age values with random values drawn from [mean - std, mean + std]:
average_age_df = train_df['Age'].mean()
std_age_df = train_df['Age'].std()
count_nan_age_df = train_df['Age'].isnull().sum()
rand_age = np.random.randint(average_age_df - std_age_df, average_age_df + std_age_df, size = count_nan_age_df)
train_df.loc[train_df['Age'].isnull(), 'Age'] = rand_age
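A reproducible alternative (a sketch, not the approach used above) is to fill missing ages with the median age of passengers with the same Pclass and Sex:
# Alternative imputation: median Age per (Pclass, Sex) group
train_df['Age'] = train_df['Age'].fillna(
    train_df.groupby(['Pclass', 'Sex'])['Age'].transform('median'))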
- Survival rate at each age:
fig, axis1 = plt.subplots(1,1, figsize=(18,6))
average_age = train_df[['Age', 'Survived']].groupby(['Age'], as_index=False).mean()
sns.barplot(x='Age', y='Survived', data=average_age)
plt.show()
- 3.3 The Age and Sex fields
To check whether women and children had an advantage, add a Person field derived from Age and Sex, whose value is child, female or male.
def get_person(passenger):
    age, sex = passenger
    return 'child' if age < 16 else sex

train_df['Person'] = train_df[['Age', 'Sex']].apply(get_person, axis=1)
Counts and survival rates for the three groups:
fig, (axis1, axis2) = plt.subplots(1, 2, figsize=(10, 5))
sns.countplot(x='Person', data=train_df, ax=axis1)
person_perc = train_df[['Person', 'Survived']].groupby(['Person'], as_index=False).mean()
sns.barplot(x='Person', y='Survived', data=person_perc, ax=axis2, order=['male', 'female', 'child'])
plt.show()
The advantage for children and women is striking; women in particular survived at a rate of close to 80%.
Now look at how single men fared.
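The df and train_size used below refer to the combined train + test frame built elsewhere in the full notebook, and the Has_family column is assumed to have been defined there as well; a minimal sketch of that assumption:
# Assumed definition (not in this excerpt): 1 if the passenger has any relatives aboard, 0 otherwise
df['Has_family'] = ((df['Parch'] + df['SibSp']) > 0).astype(int)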
def get_single_male(passenger):
    person, hasfamily = passenger
    return 1 if person == 'male' and hasfamily == 0 else 0

df['Single_male'] = df[['Person', 'Has_family']].apply(get_single_male, axis=1)
train_df = df.iloc[:train_size, :]
fig, (axis1, axis2) = plt.subplots(1, 2, figsize=(10, 5))
sns.countplot(x='Single_male', data=train_df, ax=axis1)
person_perc = train_df[['Single_male', 'Survived']].groupby(['Single_male'], as_index=False).mean()
sns.barplot(x='Single_male', y='Survived', data=person_perc, ax=axis2)
plt.show()
- 3.4 The Cabin field
The Cabin field has a large number of missing values in the training set. First fill them with 'U0', then extract the cabin letter from Cabin into a new Cabin_type field.
import re

df['Cabin'] = df['Cabin'].fillna('U0')
df['Cabin_type'] = df['Cabin'].map(lambda x: re.compile('([A-Z]+)').search(x).group())
print(df['Cabin_type'].value_counts())
U 687
C 59
B 47
D 33
E 32
A 15
F 13
G 4
T 1
- 3.5 The Fare field
Histogram of the fare distribution:
train_df = df.iloc[: train_size, :]
train_df['Fare'].plot(kind='hist', figsize=(15,3), bins=100) # xlim=(0,50)
plt.show()
Age distribution of survivors vs. non-survivors:
facet = sns.FacetGrid(train_df, hue='Survived', aspect=4)
facet.map(sns.kdeplot, 'Age', shade=True)
facet.set(xlim=(0, train_df['Age'].max()))
facet.add_legend()
plt.show()
Mean and standard deviation of fares for survivors and non-survivors:
fare_not_survived = train_df['Fare'][train_df['Survived'] == 0]
fare_survived = train_df['Fare'][train_df['Survived'] == 1]
# means with standard deviations as error bars
average_fare = DataFrame([fare_not_survived.mean(), fare_survived.mean()])
std_fare = DataFrame([fare_not_survived.std(), fare_survived.std()])
average_fare.index.names = std_fare.index.names = ['Survived']
average_fare.plot(yerr=std_fare, kind='bar', legend=False)
plt.show()
Standardize the fares:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_df['Norm_fare'] = pd.Series(scaler.fit_transform(train_df['Fare'].values.reshape(-1, 1)).reshape(-1), index=train_df.index)
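If the test fares are standardized later, the same fitted scaler should be reused with transform rather than fit_transform; a hedged sketch, assuming a test_df with the same Fare column:
# Reuse the scaler fitted on the training fares (test_df is an assumption, not defined in this excerpt)
test_df['Norm_fare'] = scaler.transform(test_df['Fare'].values.reshape(-1, 1)).reshape(-1)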
- 3.6 The Parch and SibSp fields
Both fields describe accompanying family, so combine them into a single field. The conversion works as follows:
- Step 1: create a Group_num field, the sum of Parch and SibSp plus one for the passenger themselves.
df['Group_num'] = df['Parch'] + df['SibSp'] + 1
- Look at the survival rate by Group_num:
fig, (axis1, axis2) = plt.subplots(1, 2)
sns.countplot(x='Group_num', data=train_df, ax = axis1)
family_perc = train_df[['Group_num', 'Survived']].groupby(['Group_num'], as_index=False).mean()
sns.barplot(x='Group_num', y='Survived', data=family_perc, ax = axis2)
plt.show()
Since groups of 2 to 4 have a clearly higher survival rate than the others, split Group_num into three categories.
- Group_size field: 1 maps to S, 2 to 4 maps to M, 5 and above maps to L.
df['Group_size'] = pd.Series('M', index=df.index)
df.loc[df['Group_num'] > 4, 'Group_size'] = 'L'
df.loc[df['Group_num'] == 1, 'Group_size'] = 'S'
groupsize_perc = train_df[['Group_size', 'Survived']].groupby(['Group_size'], as_index=False).mean()
sns.barplot(x='Group_size', y='Survived', data=groupsize_perc)
plt.show()
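The same S/M/L mapping can also be expressed with pd.cut (an alternative sketch, not the approach used above):
# Bin Group_num: 1 -> S, 2-4 -> M, 5+ -> L (produces a categorical column)
df['Group_size'] = pd.cut(df['Group_num'], bins=[0, 1, 4, np.inf], labels=['S', 'M', 'L'])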
- 3.7 The Pclass field
Look at the Pclass categories:
print(df['Pclass'].value_counts())
3 1163
1 524
2 458
Name: Pclass, dtype: int64
There are three ticket classes, 1, 2 and 3, in decreasing order of quality.
sns.catplot(x='Pclass', y='Survived', order=[1, 2, 3], data=df, kind='point', height=5)
plt.show()
When Pclass is 3, the survival rate is only about 25%, which is very low.
- 3.8 Encoding categorical fields
train_df.info()
RangeIndex: 891 entries, 0 to 890
Data columns (total 17 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 891 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 891 non-null object
Embarked 891 non-null object
Cabin_type 891 non-null object
Group_num 891 non-null int64
Group_size 891 non-null object
Norm_fare 891 non-null float64
Person 891 non-null object
dtypes: float64(3), int64(6), object(8)
memory usage: 118.4+ KB
df.drop(labels=['PassengerId','Fare','Cabin', 'Name', 'Sex', 'Parch', 'SibSp', 'Ticket', 'Group_num'], axis=1, inplace=True)
df.info()
Data columns (total 8 columns):
Survived 891 non-null int64
Pclass 891 non-null int64
Age 891 non-null float64
Embarked 891 non-null object
Cabin_type 891 non-null object
Group_size 891 non-null object
Norm_fare 891 non-null float64
Person 891 non-null object
dtypes: float64(2), int64(3), object(4)
memory usage: 62.7+ KB
train_df = pd.get_dummies(train_df, columns=['Embarked', 'Pclass', 'Cabin_type', 'Group_size', 'Person'])
train_df.info()
RangeIndex: 891 entries, 0 to 890
Data columns (total 24 columns):
Survived 891 non-null int64
Age 891 non-null float64
Norm_fare 891 non-null float64
Embarked_C 891 non-null float64
Embarked_Q 891 non-null float64
Embarked_S 891 non-null float64
Pclass_1 891 non-null float64
Pclass_2 891 non-null float64
Pclass_3 891 non-null float64
Cabin_type_A 891 non-null float64
Cabin_type_B 891 non-null float64
Cabin_type_C 891 non-null float64
Cabin_type_D 891 non-null float64
Cabin_type_E 891 non-null float64
Cabin_type_F 891 non-null float64
Cabin_type_G 891 non-null float64
Cabin_type_T 891 non-null float64
Cabin_type_U 891 non-null float64
Group_size_L 891 non-null float64
Group_size_M 891 non-null float64
Group_size_S 891 non-null float64
Person_child 891 non-null float64
Person_female 891 non-null float64
Person_male 891 non-null float64
dtypes: float64(23), int64(2)
memory usage: 174.1 KB
print(train_df.head())
Survived Age Norm_fare Embarked_C Embarked_Q Embarked_S \
0 0 22.0 -0.502445 0.0 0.0 1.0
1 1 38.0 0.786845 1.0 0.0 0.0
2 1 26.0 -0.488854 0.0 0.0 1.0
3 1 35.0 0.420730 0.0 0.0 1.0
4 0 35.0 -0.486337 0.0 0.0 1.0
Pclass_1 Pclass_2 Pclass_3 ... Cabin_type_F Cabin_type_G \
0 0.0 0.0 1.0 ... 0.0 0.0
1 1.0 0.0 0.0 ... 0.0 0.0
2 0.0 0.0 1.0 ... 0.0 0.0
3 1.0 0.0 0.0 ... 0.0 0.0
4 0.0 0.0 1.0 ... 0.0 0.0
Cabin_type_T Cabin_type_U Group_size_L Group_size_M Group_size_S \
0 0.0 1.0 0.0 1.0 0.0
1 0.0 0.0 0.0 1.0 0.0
2 0.0 1.0 0.0 0.0 1.0
3 0.0 0.0 0.0 1.0 0.0
4 0.0 1.0 0.0 0.0 1.0
Person_child Person_female Person_male
0 0.0 0.0 1.0
1 0.0 1.0 0.0
2 0.0 1.0 0.0
3 0.0 1.0 0.0
4 0.0 0.0 1.0
- 4.1 The CV split
For cross-validation, split train_df into a training part and a validation part in a 7:3 ratio.
from sklearn.model_selection import train_test_split
x_train = train_df.drop('Survived', axis=1)
y_train = train_df['Survived']
x_train_train, x_train_val, y_train_train, y_train_val = train_test_split(x_train, y_train, test_size=0.3, random_state=1)
- 4.2 CV evaluation
from sklearn.linear_model import LogisticRegression
lg = LogisticRegression()
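The figure referred to in the next sentence is not reproduced by the code in this excerpt; a minimal learning-curve sketch that could produce such a plot (an assumption, not the original code):
# Plot training vs. cross-validation accuracy as the training set grows
from sklearn.model_selection import learning_curve
train_sizes, train_scores, val_scores = learning_curve(
    lg, x_train_train, y_train_train,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5, scoring='accuracy')
plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='training score')
plt.plot(train_sizes, val_scores.mean(axis=1), 'o-', label='cross-validation score')
plt.xlabel('training set size')
plt.ylabel('accuracy')
plt.legend()
plt.show()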
The plot shows that the model is neither overfitting nor underfitting, so it looks basically sound.
lg.fit(x_train_train, y_train_train)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_train_val, lg.predict(x_train_val)))
The accuracy on the 30% validation split is 0.791044776119.
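A k-fold cross-validation over the whole training set gives a more stable estimate than a single 7:3 split; a brief sketch (not in the original):
# 5-fold cross-validated accuracy on the full training data
from sklearn.model_selection import cross_val_score
scores = cross_val_score(lg, x_train, y_train, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())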
- 5. Remaining issues
- The Name field still contains a lot of usable information (see the sketch after this list).
- Predicting the missing ages instead of filling them with random values.
- Testing multiple models.
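For the first point, a hedged sketch of pulling the honorific out of Name, applied to the raw training file (the Title column name is illustrative, not from the original notebook):
# "Braund, Mr. Owen Harris" -> "Mr": take the text between the comma and the first period
raw = pd.read_csv('train.csv')
raw['Title'] = raw['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(raw['Title'].value_counts())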