Example of Multiple Linear Regression in Python - statsmodels and sklearn.

quickheaven · quickheaven · commit f1161a93dafe · 2023-07-09T17:17:47.000-04:00
diff --git a/Example_of_Multiple_Linear_Regression_in_Python.ipynb b/Example_of_Multiple_Linear_Regression_in_Python.ipynb
@@ -0,0 +1,193 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "651019ae-1bd8-41b3-be75-474a841f1f8c",
+   "metadata": {},
+   "source": [
+    "# Linear Regression in Python using Statsmodels\n",
+    "https://datatofish.com/statsmodels-linear-regression/"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "de3b5be7-792e-4a94-8cd0-a17df627c419",
+   "metadata": {},
+   "source": [
+    "## About Linear Regression\n",
+    "Linear regression is a predictive model that assumes a linear relationship between the dependent variable (the variable we are trying to predict or estimate) and one or more independent variables (the inputs used in the prediction)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "efd67629-b49f-47e2-9305-a7abb6f10366",
+   "metadata": {},
+   "source": [
+    "## An Example (with the full dataset)\n",
+    "For illustration, suppose that you have a fictitious economy described by the data below, where index_price is the dependent variable and the two independent (input) variables are:\n",
+    "\n",
+    "* interest_rate\n",
+    "* unemployment_rate\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "3dacd2dc-aa3e-40f3-a66d-252b7512bef7",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "    year  month  interest_rate  unemployment_rate  index_price\n",
+      "0   2017     12           2.75                5.3         1464\n",
+      "1   2017     11           2.50                5.3         1394\n",
+      "2   2017     10           2.50                5.3         1357\n",
+      "3   2017      9           2.50                5.3         1293\n",
+      "4   2017      8           2.50                5.4         1256\n",
+      "5   2017      7           2.50                5.6         1254\n",
+      "6   2017      6           2.50                5.5         1234\n",
+      "7   2017      5           2.25                5.5         1195\n",
+      "8   2017      4           2.25                5.5         1159\n",
+      "9   2017      3           2.25                5.6         1167\n",
+      "10  2017      2           2.00                5.7         1130\n",
+      "11  2017      1           2.00                5.9         1075\n",
+      "12  2016     12           2.00                6.0         1047\n",
+      "13  2016     11           1.75                5.9          965\n",
+      "14  2016     10           1.75                5.8          943\n",
+      "15  2016      9           1.75                6.1          958\n",
+      "16  2016      8           1.75                6.2          971\n",
+      "17  2016      7           1.75                6.1          949\n",
+      "18  2016      6           1.75                6.1          884\n",
+      "19  2016      5           1.75                6.1          866\n",
+      "20  2016      4           1.75                5.9          876\n",
+      "21  2016      3           1.75                6.2          822\n",
+      "22  2016      2           1.75                6.2          704\n",
+      "23  2016      1           1.75                6.1          719\n"
+     ]
+    }
+   ],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "data = {'year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],\n",
+    "        'month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],\n",
+    "        'interest_rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],\n",
+    "        'unemployment_rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],\n",
+    "        'index_price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]\n",
+    "        }\n",
+    "\n",
+    "df = pd.DataFrame(data)\n",
+    "print(df)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6b7443b5-1c4a-49ac-89e4-4d3bd29f584a",
+   "metadata": {},
+   "source": [
+    "## The Python Code using Statsmodels"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "434e1d1b-0487-4f15-932a-ce0d82f779e5",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "                            OLS Regression Results                            \n",
+      "==============================================================================\n",
+      "Dep. Variable:            index_price   R-squared:                       0.898\n",
+      "Model:                            OLS   Adj. R-squared:                  0.888\n",
+      "Method:                 Least Squares   F-statistic:                     92.07\n",
+      "Date:                Sun, 09 Jul 2023   Prob (F-statistic):           4.04e-11\n",
+      "Time:                        17:01:29   Log-Likelihood:                -134.61\n",
+      "No. Observations:                  24   AIC:                             275.2\n",
+      "Df Residuals:                      21   BIC:                             278.8\n",
+      "Df Model:                           2                                         \n",
+      "Covariance Type:            nonrobust                                         \n",
+      "=====================================================================================\n",
+      "                        coef    std err          t      P>|t|      [0.025      0.975]\n",
+      "-------------------------------------------------------------------------------------\n",
+      "const              1798.4040    899.248      2.000      0.059     -71.685    3668.493\n",
+      "interest_rate       345.5401    111.367      3.103      0.005     113.940     577.140\n",
+      "unemployment_rate  -250.1466    117.950     -2.121      0.046    -495.437      -4.856\n",
+      "==============================================================================\n",
+      "Omnibus:                        2.691   Durbin-Watson:                   0.530\n",
+      "Prob(Omnibus):                  0.260   Jarque-Bera (JB):                1.551\n",
+      "Skew:                          -0.612   Prob(JB):                        0.461\n",
+      "Kurtosis:                       3.226   Cond. No.                         394.\n",
+      "==============================================================================\n",
+      "\n",
+      "Notes:\n",
+      "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
+     ]
+    }
+   ],
+   "source": [
+    "import statsmodels.api as sm\n",
+    "\n",
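+    "x = df[['interest_rate','unemployment_rate']]  # independent (input) variables\n",
+    "y = df['index_price']  # dependent variable\n",
+    "\n",
+    "x = sm.add_constant(x)  # sm.OLS does not add an intercept on its own\n",
+    "\n",
+    "model = sm.OLS(y, x).fit()\n",
+    "predictions = model.predict(x)  # in-sample fitted values\n",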
+    "\n",
+    "print_model = model.summary()\n",
+    "print(print_model)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "722a577e-9e89-4808-841e-0c9535b8fd99",
+   "metadata": {},
+   "source": [
+    "## Interpreting the Regression Results\n",
+    "\n",
+    "* **Adj. R-squared** reflects the fit of the model, adjusted for the number of predictors. R-squared values range from 0 to 1, where a higher value generally indicates a better fit, assuming certain conditions are met.\n",
+    "* **const** coefficient is your Y-intercept. It means that if both interest_rate and unemployment_rate are zero, then the expected output (i.e., the Y) would be equal to the const coefficient.\n",
+    "* **interest_rate coefficient** represents the change in the output Y due to a change of one unit in the interest rate (everything else held constant).\n",
+    "* **unemployment_rate coefficient** represents the change in the output Y due to a change of one unit in the unemployment rate (everything else held constant).\n",
+    "* **std err** is the standard error of each coefficient estimate. The lower it is, the more precisely the coefficient is estimated.\n",
+    "* **P>|t|** is your *p-value*. A p-value of less than 0.05 is conventionally considered statistically significant.\n",
+    "* **Confidence Interval** (the `[0.025` and `0.975]` columns) represents the range in which each coefficient is likely to fall (at the 95% confidence level)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0b683e7d-d9ca-4ab2-87f9-c8c0d992ada3",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
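
The interpretation of `const` and the two slope coefficients in the notebook can be cross-checked without statsmodels. The sketch below is not part of the commit: it rebuilds the same design matrix (a column of ones plus the two predictors) from the notebook's data and solves the least-squares normal equations with plain Python. The helper names `normal_equations`, `solve`, and `predict` are hypothetical and introduced only for illustration. Under these assumptions the three values should agree with the `coef` column of the summary table.

```python
# Cross-check of the OLS coefficients using only the standard library.
# Data copied from the notebook's DataFrame.
interest_rate = [2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25, 2, 2,
                 2, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75]
unemployment_rate = [5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9,
                     6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2, 6.1]
index_price = [1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075,
               1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719]

# Design matrix with a leading column of ones for the intercept (const),
# mirroring what sm.add_constant does in the notebook.
X = [[1.0, r, u] for r, u in zip(interest_rate, unemployment_rate)]
y = [float(p) for p in index_price]


def normal_equations(X, y):
    """Build X'X and X'y for the least-squares problem (hypothetical helper)."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return xtx, xty


def solve(a, b):
    """Solve a @ beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    beta = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - tail) / m[i][i]
    return beta


const, b_interest, b_unemployment = solve(*normal_equations(X, y))


def predict(rate, unemployment):
    """Fitted equation: const + b_interest * rate + b_unemployment * unemployment."""
    return const + b_interest * rate + b_unemployment * unemployment


print(round(const, 4), round(b_interest, 4), round(b_unemployment, 4))
print(round(predict(2.75, 5.3), 2))
```

The `predict` helper makes the "everything else held constant" reading concrete: raising `rate` by one unit while leaving `unemployment` fixed changes the result by exactly `b_interest`. With the fitted statsmodels model from the notebook, the same coefficients are available as `model.params`.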