+{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"1.linear_regreesion_v1.ipynb","version":"0.3.2","provenance":[],"collapsed_sections":[],"toc_visible":true},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"}},"cells":[{"metadata":{"id":"ax1r8W0rU8d1","colab_type":"text"},"cell_type":"markdown","source":["# 线性回归 - Linear Regreesion\n","注意:python版本为3.6\n"]},{"metadata":{"id":"kgTnKLvOU8d2","colab_type":"code","colab":{}},"cell_type":"code","source":["import pandas as pd\n","import seaborn as sns\n","sns.set(context=\"notebook\", style=\"whitegrid\", palette=\"dark\")\n","import matplotlib.pyplot as plt\n","import numpy as np"],"execution_count":0,"outputs":[]},{"metadata":{"id":"Nje4hmN0U8d6","colab_type":"code","colab":{}},"cell_type":"code","source":["df = pd.read_csv('ex1data1.txt', names=['population', 'profit']) # 读取数据并赋予列名"],"execution_count":0,"outputs":[]},{"metadata":{"id":"a9p1GK2iU8d8","colab_type":"code","colab":{}},"cell_type":"code","source":["df.head() # 显示数据前五行"],"execution_count":0,"outputs":[]},{"metadata":{"id":"Ly0Fs3unU8eA","colab_type":"code","colab":{}},"cell_type":"code","source":["df.info() # 打印df的class信息"],"execution_count":0,"outputs":[]},{"metadata":{"id":"WsQ2uPs6muT7","colab_type":"code","colab":{}},"cell_type":"code","source":["df.describe() # 打印df的统计信息"],"execution_count":0,"outputs":[]},{"metadata":{"id":"dtj0pJAOU8eE","colab_type":"text"},"cell_type":"markdown","source":["***\n","# 看下原始数据"]},{"metadata":{"id":"ON7EiaK7U8eE","colab_type":"code","colab":{}},"cell_type":"code","source":["sns.lmplot('population', 'profit', df, size=6, fit_reg=False)\n","plt.show()"],"execution_count":0,"outputs":[]},{"metadata":{"id":"wRWRxgtAU8eH","colab_type":"code","colab":{}},"cell_type":"code","source":["def get_X(df): # 读取特征\n","# \"\"\"\n","# use concat to add intersect feature to avoid side effect\n","# not efficient for big dataset though\n","# \"\"\"\n"," ones = pd.DataFrame({'ones': np.ones(len(df))})#ones是m行1列的dataframe\n"," data = pd.concat([ones, df], axis=1) # 合并数据,根据列合并\n"," return data.iloc[:, :-1].as_matrix() # 这个操作返回 ndarray,不是矩阵\n","\n","\n","def get_y(df):#读取标签\n","# '''assume the last column is the target'''\n"," return np.array(df.iloc[:, -1])#df.iloc[:, -1]是指df的最后一列\n","\n","\n","def normalize_feature(df):\n","# \"\"\"Applies function along input axis(default 0) of DataFrame.\"\"\"\n"," return df.apply(lambda column: (column - column.mean()) / column.std())#特征缩放"],"execution_count":0,"outputs":[]},{"metadata":{"id":"GPFqxv_zU8eJ","colab_type":"text"},"cell_type":"markdown","source":["多变量的假设 h 表示为:${{h}_{\\theta }}\\left( x \\right)={{\\theta }_{0}}+{{\\theta }_{1}}{{x}_{1}}+{{\\theta }_{2}}{{x}_{2}}+...+{{\\theta }_{n}}{{x}_{n}}$。\n","\n","这个公式中有n+1个参数和n个变量,为了使得公式能够简化一些,引入${{x}_{0}}=1$,则公式转化为: ${{h}_{\\theta }}\\left( x \\right)={{\\theta }_{0}x_0}+{{\\theta }_{1}}{{x}_{1}}+{{\\theta }_{2}}{{x}_{2}}+...+{{\\theta }_{n}}{{x}_{n}}$。\n","\n","此时模型中的参数是一个n+1维的向量,任何一个训练实例也都是n+1维的向量,特征矩阵X的维度是 m*(n+1)。 因此公式可以简化为:${{h}_{\\theta }}\\left( x \\right)={{\\theta }^{T}}X$,其中上标T代表矩阵转置。\n"]},{"metadata":{"id":"on8_khfsU8eQ","colab_type":"text"},"cell_type":"markdown","source":["# 计算代价函数\n","$$J\\left( \\theta \\right)=\\frac{1}{2m}\\sum\\limits_{i=1}^{m}{{{\\left( {{h}_{\\theta }}\\left( {{x}^{(i)}} \\right)-{{y}^{(i)}} \\right)}^{2}}}$$\n","\n","其中:\n","\n","$${{h}_{\\theta }}\\left( x \\right)={{\\theta }^{T}}X={{\\theta }_{0}}{{x}_{0}}+{{\\theta }_{1}}{{x}_{1}}+{{\\theta }_{2}}{{x}_{2}}+...+{{\\theta 
# Computing the cost function
$$J(\theta) = \frac{1}{2m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

where:

$$h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$

```python
# check the data dimensions
data = df
X = get_X(data)
print(X.shape, type(X))

y = get_y(data)
print(y.shape, type(y))
```

```python
theta = np.zeros(X.shape[1])  # X.shape[1] = 2 here: the intercept column plus one feature
print(theta)
```

```python
def lr_cost(theta, X, y):
    """Compute the cost function.
    X: R(m*n), m samples, n features (including the intercept column)
    y: R(m)
    theta: R(n), the linear regression parameters
    """
    m = X.shape[0]  # m is the number of samples

    inner = X @ theta - y  # R(m); X @ theta is equivalent to X.dot(theta)

    # 1*m @ m*1 = 1*1 in matrix multiplication, but NumPy does not
    # transpose 1-D arrays, so this is simply the inner product of
    # the vector with itself
    square_sum = inner.T @ inner
    cost = square_sum / (2 * m)

    return cost
```

```python
lr_cost(theta, X, y)  # the value of the cost
```

# Batch Gradient Descent
$$\begin{aligned} \theta_j &:= \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \\ &:= \theta_j - \alpha \frac{1}{m} \sum^{m}_{i=1} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_j \end{aligned}$$
Note: $\theta_j$ must be updated simultaneously for all $j$.

```python
def gradient(theta, X, y):
    """Compute the gradient, i.e. the partial derivatives of J(θ)."""
    m = X.shape[0]

    inner = X.T @ (X @ theta - y)  # (n, m) @ (m,) -> (n,); X @ theta is equivalent to X.dot(theta)

    return inner / m
```

```python
def batch_gradient_descent(theta, X, y, epoch, alpha=0.01):
    """Batch gradient descent: fit the linear regression and
    return the parameters together with the cost history.
    epoch: number of full passes over the training set
    """
    cost_data = [lr_cost(theta, X, y)]
    _theta = theta.copy()  # work on a copy so the original theta is left untouched

    for _ in range(epoch):
        _theta = _theta - alpha * gradient(_theta, X, y)
        cost_data.append(lr_cost(_theta, X, y))

    return _theta, cost_data
```
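As a sanity check on the analytic gradient (an added sketch, not part of the original exercise), we can compare `gradient` against a centered finite-difference approximation of $\frac{\partial J}{\partial \theta_j}$:

```python
def numeric_gradient(theta, X, y, eps=1e-6):
    """Approximate each partial derivative of lr_cost with a centered difference."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (lr_cost(theta + step, X, y) - lr_cost(theta - step, X, y)) / (2 * eps)
    return grad

# The analytic and numeric gradients should agree to several decimal places.
print(gradient(theta, X, y))
print(numeric_gradient(theta, X, y))
```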
```python
epoch = 500
final_theta, cost_data = batch_gradient_descent(theta, X, y, epoch)
```

```python
final_theta  # the final theta
```

```python
cost_data  # inspect the cost history
```

```python
# compute the final cost
lr_cost(final_theta, X, y)
```

Prediction performance of a scikit-learn model:

```python
from sklearn import linear_model
model = linear_model.LinearRegression()
model.fit(X, y)

x = X[:, 1]
f = model.predict(X).flatten()

plt.scatter(X[:, 1], y, label='Training Data')
plt.plot(x, f, 'r', label='Prediction')
plt.legend(loc=2)
plt.show()
```

# Visualizing the cost history

```python
ax = sns.lineplot(x=np.arange(epoch + 1), y=cost_data)
ax.set_xlabel('epoch')
ax.set_ylabel('cost')
plt.show()
# the cost drops sharply over the first epochs and then levels off
```

```python
b = final_theta[0]  # intercept on the y-axis
m = final_theta[1]  # slope

plt.scatter(data.population, data.profit, label="Training data")
plt.plot(data.population, data.population * m + b, 'r', label="Prediction")
plt.legend(loc=2)
plt.show()
```

# 3 - Optional exercises

```python
raw_data = pd.read_csv('ex1data2.txt', names=['square', 'bedrooms', 'price'])
raw_data.head()
```

# 1. Standardizing the data
The simplest approach is to set:

$$x_n = \frac{x_n - \mu_n}{s_n}$$

where $\mu_n$ is the mean and $s_n$ is the standard deviation.

```python
def normalize_feature(df):
    """Feature scaling: subtract each column's mean and divide by its standard deviation."""
    return df.apply(lambda column: (column - column.mean()) / column.std())
```

```python
data = normalize_feature(raw_data)
data.head()
```
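As an aside (assuming scikit-learn is available; this comparison is not part of the original exercise), `sklearn.preprocessing.StandardScaler` performs the same kind of scaling. Note that pandas' `std()` uses the sample standard deviation (`ddof=1`) while `StandardScaler` uses the population standard deviation (`ddof=0`), so the two results differ by a constant factor per column:

```python
from sklearn.preprocessing import StandardScaler

m = len(raw_data)
scaled = StandardScaler().fit_transform(raw_data)   # divides by the population std (ddof=0)
pandas_scaled = normalize_feature(raw_data).values  # divides by the sample std (ddof=1)

# The two agree once the ddof factor sqrt(m / (m - 1)) is accounted for.
assert np.allclose(scaled, pandas_scaled * np.sqrt(m / (m - 1)))
```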
# 2. Multi-variable batch gradient descent

```python
X = get_X(data)
print(X.shape, type(X))

y = get_y(data)
print(y.shape, type(y))  # check the dimensions and types of the data
```

```python
alpha = 0.01  # learning rate
theta = np.zeros(X.shape[1])  # X.shape[1]: the intercept column plus the n features
epoch = 500  # number of iterations
```

```python
final_theta, cost_data = batch_gradient_descent(theta, X, y, epoch, alpha=alpha)
```

```python
sns.lineplot(x=np.arange(len(cost_data)), y=cost_data)
plt.xlabel('epoch', fontsize=18)
plt.ylabel('cost', fontsize=18)
plt.show()
```

```python
final_theta
```

Prediction with scikit-learn:

```python
from mpl_toolkits.mplot3d import Axes3D
from sklearn import linear_model

model = linear_model.LinearRegression()
model.fit(X, y)

f = model.predict(X).flatten()

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot(X[:, 1], X[:, 2], f, 'r', label='Prediction')
ax.scatter(X[:, 1], X[:, 2], y, label='Training Data')
ax.legend(loc=2)
ax.set_xlabel('square')
ax.set_ylabel('bedrooms')
ax.set_zlabel('price')
ax.set_title('square & bedrooms vs. price')
# ax.view_init(30, 10)
plt.show()
```

# 3. Learning rate

```python
base = np.logspace(-1, -5, num=4)
candidate = np.sort(np.concatenate((base, base * 3)))
print(candidate)
```

```python
epoch = 50

fig, ax = plt.subplots(figsize=(8, 8))

for alpha in candidate:
    _, cost_data = batch_gradient_descent(theta, X, y, epoch, alpha=alpha)
    ax.plot(np.arange(epoch + 1), cost_data, label=alpha)

ax.set_xlabel('epoch', fontsize=12)
ax.set_ylabel('cost', fontsize=12)
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
ax.set_title('learning rate', fontsize=12)
plt.show()
```

The plot shows that the most suitable learning rate among these candidates is 0.3; the sketch below confirms this programmatically.
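Rather than eyeballing the curves, a small sketch (reusing `batch_gradient_descent`, `candidate`, and the 50-epoch setting from above) can select the candidate with the lowest final cost:

```python
final_costs = {}
for alpha in candidate:
    _, cd = batch_gradient_descent(theta, X, y, epoch, alpha=alpha)
    final_costs[alpha] = cd[-1]  # cost after the last epoch for this learning rate

best_alpha = min(final_costs, key=final_costs.get)
print(best_alpha)  # expected to match the plot: 0.3
```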
# 4. Normal equation
The normal equation finds the parameters that minimize the cost function by solving $\frac{\partial}{\partial \theta_j} J(\theta_j) = 0$.
If the training-set feature matrix is $X$ (including $x_0 = 1$) and the training-set targets form the vector $y$, the normal equation gives $\theta = (X^T X)^{-1} X^T y$.
The superscript $T$ denotes the matrix transpose and the superscript $-1$ the matrix inverse. Writing $A = X^T X$, we have $(X^T X)^{-1} = A^{-1}$.

Comparing gradient descent with the normal equation:

Gradient descent: requires choosing a learning rate $\alpha$ and many iterations, but it still works well when the number of features $n$ is large, and it applies to many kinds of models.

Normal equation: no learning rate $\alpha$ and no iterations, but it requires computing $(X^T X)^{-1}$, which is expensive when $n$ is large: inverting the matrix costs roughly $O(n^3)$, usually acceptable for $n$ below about 10000. It applies only to linear models, not to logistic regression or other models.

```python
# normal equation
def normalEqn(X, y):
    theta = np.linalg.inv(X.T @ X) @ X.T @ y  # X.T @ X is equivalent to X.T.dot(X)
    return theta
```

```python
final_theta2 = normalEqn(X, y)  # the values differ slightly from the theta found by batch gradient descent
final_theta2
```

```python
f = final_theta2[0] + final_theta2[1] * X[:, 1] + final_theta2[2] * X[:, 2]

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot(X[:, 1], X[:, 2], f, 'r', label='Prediction')
ax.scatter(X[:, 1], X[:, 2], y, label='Training Data')
ax.legend(loc=2)
ax.set_xlabel('square')
ax.set_ylabel('bedrooms')
ax.set_zlabel('price')
ax.set_title('square & bedrooms vs. price')
# ax.view_init(30, 10)
plt.show()
```
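Explicitly inverting $X^T X$ can be numerically fragile when features are strongly correlated. As an alternative (an added sketch, not part of the original exercise), `np.linalg.lstsq` solves the same least-squares problem without forming the inverse and should agree with `normalEqn` up to floating-point error:

```python
def normal_eqn_lstsq(X, y):
    """Solve min ||X @ theta - y||^2 directly, without an explicit matrix inverse."""
    theta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
    return theta

print(normal_eqn_lstsq(X, y))  # should match normalEqn(X, y)
```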