Prediction (out of sample)

In [1]:
%matplotlib inline

from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
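Note that y includes random noise, so the numbers in the outputs below will vary between runs. To make a run reproducible, one could seed NumPy's global random state before the np.random.normal draw above (the seed value here is arbitrary):

np.random.seed(12345)  # any fixed seed makes the noise draw reproducible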

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.987
Model:                            OLS   Adj. R-squared:                  0.986
Method:                 Least Squares   F-statistic:                     1129.
Date:                Mon, 29 Oct 2018   Prob (F-statistic):           4.54e-43
Time:                        04:30:54   Log-Likelihood:                 4.2549
No. Observations:                  50   AIC:                           -0.5098
Df Residuals:                      46   BIC:                             7.138
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9157      0.079     62.247      0.000       4.757       5.075
x1             0.5041      0.012     41.393      0.000       0.480       0.529
x2             0.4994      0.048     10.430      0.000       0.403       0.596
x3            -0.0191      0.001    -17.880      0.000      -0.021      -0.017
==============================================================================
Omnibus:                        0.861   Durbin-Watson:                   2.340
Prob(Omnibus):                  0.650   Jarque-Bera (JB):                0.353
Skew:                           0.181   Prob(JB):                        0.838
Kurtosis:                       3.198   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.43771039  4.91654539  5.35644408  5.73019117  6.02039324  6.22233638
  6.34476075  6.40842453  6.44269344  6.48071603  6.55397746  6.6871266
  6.89392626  7.17499202  7.51769109  7.89821783  8.28550495  8.64632781
  8.95076308  9.1771051   9.31543206  9.36923608  9.35484952  9.29876158
  9.23326624  9.19115639  9.20033611  9.27923608  9.43378547  9.6564378
  9.92741082 10.21793665 10.49498827 10.72670579 10.88763089 10.96288962
 10.95063607 10.86235571 10.72097832 10.55711072 10.40400814 10.29211048
 10.24404113 10.27089    10.37039148 10.52729668 10.7158788  10.90415991
 11.0591649  11.15233915]
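For the estimation sample, the same numbers are also available directly on the results object as fittedvalues. As a quick sanity check, the two should agree to floating-point precision:

# fittedvalues holds the in-sample predictions X @ params,
# so it matches predict() evaluated at the original X
assert np.allclose(olsres.fittedvalues, ypred)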

Create a new sample of explanatory variables Xnew, predict, and plot.

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[11.15472074 11.02566663 10.78476027 10.47662987 10.16002185  9.89341767
  9.72071553  9.66048256  9.70140895  9.80507695]
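Point predictions alone say nothing about their uncertainty. A minimal sketch, assuming a statsmodels version where OLSResults.get_prediction is available (0.8 and later), that attaches 95% prediction intervals to the out-of-sample forecasts:

pred = olsres.get_prediction(Xnew)
frame = pred.summary_frame(alpha=0.05)  # mean, CI for the mean, and prediction interval
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])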

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we don't want any expansion magic from using **2.
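To see the expansion magic that I() suppresses, consider how patsy treats ** applied to a sum of terms: it expands into main effects plus interactions rather than squaring anything numerically. A small illustrative sketch (the variable names a and b are made up for the demo):

import patsy

demo = patsy.dmatrix("(a + b)**2", {"a": [1., 2., 3.], "b": [4., 5., 6.]})
# the formula expands to a + b + a:b, not to an arithmetic square
print(demo.design_info.column_names)  # ['Intercept', 'a', 'b', 'a:b']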

In [8]:
res.params
Out[8]:
Intercept           4.915701
x1                  0.504133
np.sin(x1)          0.499371
I((x1 - 5) ** 2)   -0.019120
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are generated automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    11.154721
1    11.025667
2    10.784760
3    10.476630
4    10.160022
5     9.893418
6     9.720716
7     9.660483
8     9.701409
9     9.805077
dtype: float64
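The prediction-interval machinery shown earlier works with the formula interface as well: exog can again be a dict (or DataFrame) containing just x1, and the full design matrix is rebuilt from the stored formula. A sketch under the same statsmodels-version assumption as above:

pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))  # mean, mean CI, and prediction interval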