Commit f1161a9

Example of Multiple Linear Regression in Python - statsmodels and sklearn.
1 parent 50d63f5 commit f1161a9

1 file changed: +193 -0 lines changed

@@ -0,0 +1,193 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"id": "651019ae-1bd8-41b3-be75-474a841f1f8c",
6+
"metadata": {},
7+
"source": [
8+
"# Linear Regression in Python using Statsmodels\n",
9+
"https://datatofish.com/statsmodels-linear-regression/"
10+
]
11+
},
12+
{
13+
"cell_type": "markdown",
14+
"id": "de3b5be7-792e-4a94-8cd0-a17df627c419",
15+
"metadata": {},
16+
"source": [
17+
"## About Linear Regression\n",
18+
"Linear regression is used as a predictive model that assumes a linear relationship between the dependent variable (which is the variable we are trying to predict/estimate) and the independent variable/s (input variable/s used in the prediction)."
19+
]
20+
},
21+
{
22+
"cell_type": "markdown",
23+
"id": "efd67629-b49f-47e2-9305-a7abb6f10366",
24+
"metadata": {},
25+
"source": [
26+
"## An Example (with the full dataset)\n",
27+
"For illustration purposes, let’s suppose that you have a fictitious economy with the following parameters, where the index_price is the dependent variable, and the 2 independent/input variables are:\n",
28+
"\n",
29+
"* interest_rate\n",
30+
"* unemployment_rate\n"
31+
]
32+
},
33+
{
34+
"cell_type": "code",
35+
"execution_count": 1,
36+
"id": "3dacd2dc-aa3e-40f3-a66d-252b7512bef7",
37+
"metadata": {},
38+
"outputs": [
39+
{
40+
"name": "stdout",
41+
"output_type": "stream",
42+
"text": [
43+
" year month interest_rate unemployment_rate index_price\n",
44+
"0 2017 12 2.75 5.3 1464\n",
45+
"1 2017 11 2.50 5.3 1394\n",
46+
"2 2017 10 2.50 5.3 1357\n",
47+
"3 2017 9 2.50 5.3 1293\n",
48+
"4 2017 8 2.50 5.4 1256\n",
49+
"5 2017 7 2.50 5.6 1254\n",
50+
"6 2017 6 2.50 5.5 1234\n",
51+
"7 2017 5 2.25 5.5 1195\n",
52+
"8 2017 4 2.25 5.5 1159\n",
53+
"9 2017 3 2.25 5.6 1167\n",
54+
"10 2017 2 2.00 5.7 1130\n",
55+
"11 2017 1 2.00 5.9 1075\n",
56+
"12 2016 12 2.00 6.0 1047\n",
57+
"13 2016 11 1.75 5.9 965\n",
58+
"14 2016 10 1.75 5.8 943\n",
59+
"15 2016 9 1.75 6.1 958\n",
60+
"16 2016 8 1.75 6.2 971\n",
61+
"17 2016 7 1.75 6.1 949\n",
62+
"18 2016 6 1.75 6.1 884\n",
63+
"19 2016 5 1.75 6.1 866\n",
64+
"20 2016 4 1.75 5.9 876\n",
65+
"21 2016 3 1.75 6.2 822\n",
66+
"22 2016 2 1.75 6.2 704\n",
67+
"23 2016 1 1.75 6.1 719\n"
68+
]
69+
}
70+
],
71+
"source": [
72+
"import pandas as pd\n",
73+
"\n",
74+
"data = {'year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],\n",
75+
" 'month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],\n",
76+
" 'interest_rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],\n",
77+
" 'unemployment_rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],\n",
78+
" 'index_price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719] \n",
79+
" }\n",
80+
"\n",
81+
"df = pd.DataFrame(data)\n",
82+
"print(df)"
83+
]
84+
},
85+
{
86+
"cell_type": "markdown",
87+
"id": "6b7443b5-1c4a-49ac-89e4-4d3bd29f584a",
88+
"metadata": {},
89+
"source": [
90+
"## The Python Code using Statsmodels"
91+
]
92+
},
93+
{
94+
"cell_type": "code",
95+
"execution_count": 3,
96+
"id": "434e1d1b-0487-4f15-932a-ce0d82f779e5",
97+
"metadata": {},
98+
"outputs": [
99+
{
100+
"name": "stdout",
101+
"output_type": "stream",
102+
"text": [
103+
" OLS Regression Results \n",
104+
"==============================================================================\n",
105+
"Dep. Variable: index_price R-squared: 0.898\n",
106+
"Model: OLS Adj. R-squared: 0.888\n",
107+
"Method: Least Squares F-statistic: 92.07\n",
108+
"Date: Sun, 09 Jul 2023 Prob (F-statistic): 4.04e-11\n",
109+
"Time: 17:01:29 Log-Likelihood: -134.61\n",
110+
"No. Observations: 24 AIC: 275.2\n",
111+
"Df Residuals: 21 BIC: 278.8\n",
112+
"Df Model: 2 \n",
113+
"Covariance Type: nonrobust \n",
114+
"=====================================================================================\n",
115+
" coef std err t P>|t| [0.025 0.975]\n",
116+
"-------------------------------------------------------------------------------------\n",
117+
"const 1798.4040 899.248 2.000 0.059 -71.685 3668.493\n",
118+
"interest_rate 345.5401 111.367 3.103 0.005 113.940 577.140\n",
119+
"unemployment_rate -250.1466 117.950 -2.121 0.046 -495.437 -4.856\n",
120+
"==============================================================================\n",
121+
"Omnibus: 2.691 Durbin-Watson: 0.530\n",
122+
"Prob(Omnibus): 0.260 Jarque-Bera (JB): 1.551\n",
123+
"Skew: -0.612 Prob(JB): 0.461\n",
124+
"Kurtosis: 3.226 Cond. No. 394.\n",
125+
"==============================================================================\n",
126+
"\n",
127+
"Notes:\n",
128+
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
129+
]
130+
}
131+
],
132+
"source": [
133+
"import statsmodels.api as sm\n",
134+
"\n",
135+
"x = df[['interest_rate','unemployment_rate']]\n",
136+
"y = df['index_price']\n",
137+
"\n",
138+
"x = sm.add_constant(x)\n",
139+
"\n",
140+
"model = sm.OLS(y, x).fit()\n",
141+
"predictions = model.predict(x) \n",
142+
"\n",
143+
"print_model = model.summary()\n",
144+
"print(print_model)"
145+
]
146+
},
147+
{
148+
"cell_type": "markdown",
149+
"id": "722a577e-9e89-4808-841e-0c9535b8fd99",
150+
"metadata": {},
151+
"source": [
152+
"## Interpreting the Regression Results\n",
153+
"\n",
154+
"* **Adjusted. R-squared** reflects the fit of the model. R-squared values range from 0 to 1, where a higher value generally indicates a better fit, assuming certain conditions are met.\n",
155+
"* **const** coefficient is your Y-intercept. It means that if both the interest_rate and unemployment_rate coefficients are zero, then the expected output (i.e., the Y) would be equal to the const coefficient.\n",
156+
"* **interest_rate coefficient** represents the change in the output Y due to a change of one unit in the interest rate (everything else held constant)\n",
157+
"unemployment_rate coefficient represents the change in the output Y due to a change of one unit in the unemployment rate (everything else held constant)\n",
158+
"* **std err** reflects the level of accuracy of the coefficients. The lower it is, the higher is the level of accuracy\n",
159+
"* **P >|t|** is your *p-value*. A p-value of less than 0.05 is considered to be statistically significant\n",
160+
"Confidence Interval represents the range in which our coefficients are likely to fall (with a likelihood of 95%)"
161+
]
162+
},
163+
{
164+
"cell_type": "code",
165+
"execution_count": null,
166+
"id": "0b683e7d-d9ca-4ab2-87f9-c8c0d992ada3",
167+
"metadata": {},
168+
"outputs": [],
169+
"source": []
170+
}
171+
],
172+
"metadata": {
173+
"kernelspec": {
174+
"display_name": "Python 3",
175+
"language": "python",
176+
"name": "python3"
177+
},
178+
"language_info": {
179+
"codemirror_mode": {
180+
"name": "ipython",
181+
"version": 3
182+
},
183+
"file_extension": ".py",
184+
"mimetype": "text/x-python",
185+
"name": "python",
186+
"nbconvert_exporter": "python",
187+
"pygments_lexer": "ipython3",
188+
"version": "3.8.8"
189+
}
190+
},
191+
"nbformat": 4,
192+
"nbformat_minor": 5
193+
}
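
As a cross-check on the interpretation cell in this notebook, here is a minimal standard-library sketch that refits the same data by solving the normal equations directly. It is not how statsmodels computes its estimates internally, and the helper names (`solve_linear_system`, `fit_ols`, `predict`) are hypothetical, introduced only for illustration. The example inputs passed to `predict` (interest rate 2.75, unemployment rate 5.3) are simply the first row of the dataset.

```python
# Data copied from the notebook's DataFrame cell.
interest_rate = [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,
                 1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75]
unemployment_rate = [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,
                     5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1]
index_price = [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,
               1047,965,943,958,971,949,884,866,876,822,704,719]


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(columns, y):
    """Return [const, coef_1, ...] minimising squared error (normal equations)."""
    # Prepend 1.0 to every row: the same role sm.add_constant plays in the notebook.
    rows = [[1.0, *values] for values in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, *values):
    """Intercept plus each coefficient times its input value."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], values))


coefs = fit_ols([interest_rate, unemployment_rate], index_price)
const, b_interest, b_unemployment = coefs

# R-squared: 1 minus residual sum of squares over total sum of squares.
fitted = [predict(coefs, i, u) for i, u in zip(interest_rate, unemployment_rate)]
mean_y = sum(index_price) / len(index_price)
ss_res = sum((y - f) ** 2 for y, f in zip(index_price, fitted))
ss_tot = sum((y - mean_y) ** 2 for y in index_price)
r_squared = 1 - ss_res / ss_tot

print(f"const={const:.4f} interest_rate={b_interest:.4f} "
      f"unemployment_rate={b_unemployment:.4f} R-squared={r_squared:.3f}")
print(f"prediction at (2.75, 5.3): {predict(coefs, 2.75, 5.3):.2f}")
```

The printed coefficients and R-squared should agree with the `coef` column and the `R-squared` entry of the OLS summary in the diff. With both inputs set to zero, `predict` returns the `const` coefficient, which is what the Y-intercept bullet describes.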
