[Time Series] ARIMA Model Parameter Estimation Overview

Estimating a stochastic-process model

1. Removing a deterministic trend

- Polynomial trend

2. Removing a stochastic trend

- ARIMA
- Use the ADF test
    - H0: a unit root exists -> integration present -> differencing needed

3. Checking normality

- If the process is Gaussian white noise -> an ARMA model can be applied
- A general linear process is Gaussian (a linear combination of white noise)
- Ways to improve normality
    - Box-Cox transform
    - Log transform
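As a sketch, scipy's `boxcox` estimates the transform parameter $\lambda$ by maximum likelihood; $\lambda = 0$ corresponds to the log transform. The simulated data below are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.exp(rng.normal(size=200))   # positive, right-skewed (lognormal) data

x_bc, lam = stats.boxcox(x)        # lambda chosen by MLE
x_log = np.log(x)                  # the lambda = 0 special case
print(lam)                         # should be near 0 for lognormal data
```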

4. Choosing the ARMA order

- MA: the ACF vanishes beyond a certain lag (cut-off)
- AR: the PACF vanishes beyond a certain lag
- ARMA: neither the ACF nor the PACF cuts off at a fixed lag
    - Parameter estimation: order selection via AIC/BIC; estimation via methods such as moments, least squares, or MLE
    - Estimate the standard errors of the parameters by bootstrapping

5. Model diagnostics

- Test the residuals for normality
- Re-check the model order with the residual ACF/PACF and the Ljung-Box Q test (H0: all autocorrelations beyond lag 0 are zero)

Tip

1) Log and Box-Cox transformations

Why use them? To get constant variance $\rightarrow$ then detrend $\rightarrow$ stationary process

  • For non-stationary processes whose mean and variance change with time t
  • e.g. the mean depends on t, the standard deviation depends on $\mu$
  • cf. the log transform is also used when differencing alone fails to remove the trend
    • i.e., for data that grows slowly and then surges after some point, the trend survives repeated differencing $\rightarrow$ take the log first, then difference
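The point above can be sketched on an exponential-growth toy series, where plain differencing leaves a growing trend but log-then-difference gives a constant:

```python
import numpy as np

t = np.arange(100)
y = np.exp(0.05 * t)              # grows slowly, then surges

d1 = np.diff(y)                   # still trending: grows with t
d1_log = np.diff(np.log(y))       # constant 0.05 everywhere

print(d1[-1] / d1[0])             # large ratio: the trend survives differencing
print(np.allclose(d1_log, 0.05))  # True
```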

2) Distinguishing the models

  • AR vs MA: both are linear combinations of white noise $\epsilon$; distinguishable via the ACF & PACF
  • ARMA (stationary) vs ARIMA (non-stationary): their characteristic equations differ; distinguishable via the ADF test
    • ARMA(p, q): every root of the AR characteristic polynomial lies outside the unit circle
    • ARIMA(p, 1, q): the AR polynomial contains a unit root (the $(1 - L)$ factor)
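In lag-operator notation, the standard formulation (stated here for reference):

```latex
% ARMA(p, q): stationary -- all roots of \phi(z) = 0 lie outside the unit circle
\phi(L)\, y_t = \theta(L)\, \epsilon_t,
\qquad \phi(L) = 1 - \phi_1 L - \cdots - \phi_p L^p

% ARIMA(p, 1, q): the AR side carries the unit root (1 - L), which the ADF test detects
\phi(L)(1 - L)\, y_t = \theta(L)\, \epsilon_t
```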

Reference

http://stat.snu.ac.kr/time/download/%EC%8B%9C%EA%B3%84%EC%97%B4%EB%B6%84%EC%84%9D6%EC%9E%A5%EA%B0%95%EC%9D%98.pdf

Example) Comparing the ACF & PACF of MA & AR

# MA(1) with theta = 0.6
import matplotlib.pyplot as plt
import statsmodels.api as sm

theta = 0.6
ar = [1]
ma = [1, theta]
a = sm.tsa.ArmaProcess(ar, ma)
s = a.generate_sample(20, burnin=240)

plt.figure(figsize=(5, 7))
plt.subplot(411)
plt.stem(a.acf(20))   # theoretical ACF: cuts off after lag 1
plt.subplot(412)
plt.stem(a.pacf(20))  # theoretical PACF: tails off

ax = plt.subplot(413)
sm.tsa.graphics.plot_acf(s, ax=ax)
ax = plt.subplot(414)
sm.tsa.graphics.plot_pacf(s, method='ywm', ax=ax)

plt.tight_layout()
plt.show()

# AR(1) with phi = 0.6 (note: ArmaProcess expects the lag polynomial [1, -phi])
phi = 0.6
ar = [1, -phi]
ma = [1]
a = sm.tsa.ArmaProcess(ar, ma)
s = a.generate_sample(20, burnin=240)

plt.figure(figsize=(5, 7))
plt.subplot(411)
plt.stem(a.acf(20))   # theoretical ACF: tails off
plt.subplot(412)
plt.stem(a.pacf(20))  # theoretical PACF: cuts off after lag 1

ax = plt.subplot(413)
sm.tsa.graphics.plot_acf(s, ax=ax)
ax = plt.subplot(414)
sm.tsa.graphics.plot_pacf(s, method='ywm', ax=ax)

plt.tight_layout()
plt.show()

Example) Shampoo Sales Dataset

    1. Deterministic Trend & Seasonality
    2. Stochastic Trend & Seasonality: Multiplicative SARIMA

Original data

from datetime import datetime

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

df = pd.read_csv('shampoo.csv')
df.dropna(inplace=True)
X = df.values[:, 1]
plt.plot(X)
plt.show()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 36 entries, 0 to 35
Data columns (total 2 columns):
Month                                        36 non-null object
Sales of shampoo over a three year period    36 non-null float64
dtypes: float64(1), object(1)
memory usage: 864.0+ bytes
def parser(x):
    return datetime.strptime('190' + x['Month'], '%Y-%m')

df['Month'] = df.apply(parser, axis=1)
df = df.rename_axis('order').reset_index()
df.columns = ['order', 'date', 'val']
df['order'] = df['order'] + 1

EDA

# not MA: the ACF does not cut off
sm.tsa.graphics.plot_acf(df['val'])
plt.show()

# possibly AR(2): the PACF cuts off after lag 2
sm.tsa.graphics.plot_pacf(df['val'], method='ywm')
plt.show()

# p-value = 1.0: the unit-root null cannot be rejected -> non-stationary
sm.tsa.adfuller(df['val'])
(3.060142083641183,
 1.0,
 10,
 25,
 {'1%': -3.7238633119999998, '5%': -2.98648896, '10%': -2.6328004},
 278.9972644263031)

1. Deterministic Trend & Seasonality

1) Checking the deterministic trend: linear

f = "val ~ order"
mod = sm.OLS.from_formula(f, df)
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    val   R-squared:                       0.730
Model:                            OLS   Adj. R-squared:                  0.722
Method:                 Least Squares   F-statistic:                     91.97
Date:                Thu, 14 Feb 2019   Prob (F-statistic):           3.37e-11
Time:                        22:07:16   Log-Likelihood:                -207.13
No. Observations:                  36   AIC:                             418.3
Df Residuals:                      34   BIC:                             421.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     89.1371     26.723      3.336      0.002      34.829     143.445
order         12.0791      1.260      9.590      0.000       9.519      14.639
==============================================================================
Omnibus:                        3.174   Durbin-Watson:                   1.937
Prob(Omnibus):                  0.205   Jarque-Bera (JB):                2.703
Skew:                           0.666   Prob(JB):                        0.259
Kurtosis:                       2.827   Cond. No.                         43.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
t = df.order; y = df.val
trend1 = res.predict(df)   # fitted linear trend
plt.plot(t, y, '-')
plt.plot(t, trend1, '-')
plt.show()

2) Checking the deterministic trend: quadratic

  • Higher R-squared
f = "val ~ order + I(order**2)"
mod = sm.OLS.from_formula(f, df)
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    val   R-squared:                       0.832
Model:                            OLS   Adj. R-squared:                  0.821
Method:                 Least Squares   F-statistic:                     81.51
Date:                Thu, 14 Feb 2019   Prob (F-statistic):           1.71e-13
Time:                        22:07:24   Log-Likelihood:                -198.63
No. Observations:                  36   AIC:                             403.3
Df Residuals:                      33   BIC:                             408.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept       202.8789     33.300      6.092      0.000     135.129     270.629
order            -5.8801      4.150     -1.417      0.166     -14.324       2.563
I(order ** 2)     0.4854      0.109      4.461      0.000       0.264       0.707
==============================================================================
Omnibus:                        2.850   Durbin-Watson:                   3.055
Prob(Omnibus):                  0.241   Jarque-Bera (JB):                2.222
Skew:                           0.608   Prob(JB):                        0.329
Kurtosis:                       2.971   Cond. No.                     1.92e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.92e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
trend2 = res.predict(df)   # fitted quadratic trend
plt.plot(t, y, '-')
plt.plot(t, trend2, '-')
plt.show()

3) Remove the deterministic trend only

df['diff'] = df['val'] - trend1
df['diff2'] = df['val'] - trend2
df[:5]
   order       date    val        diff      diff2
0      1 1901-01-01  266.0  164.783784  68.515884
1      2 1901-02-01  145.9   32.604710 -47.160121
2      3 1901-03-01  183.1   57.725637  -6.506894
3      4 1901-04-01  119.3  -18.153436 -67.824437
4      5 1901-05-01  180.3   30.767490  -5.312748

4.1) plotting: linear vs quadratic

plt.plot(df.order, df['val'], label='original')
plt.plot(df.order, df['diff'], label='detrended (linear)')
plt.plot(df.order, df['diff2'], label='detrended (quadratic)')
plt.grid(False)
plt.legend()
plt.show()

4.2) Plotting the residuals: linear vs quadratic

plt.plot(df.order, df['diff'], label='residual (linear trend)')
plt.plot(df.order, df['diff2'], label='residual (quadratic trend)')
plt.legend()
plt.show()

5) Remove trend & seasonality

df['cat'] = df['order'] % 15   # seasonal category (the author's choice of cycle)
df['cat'] = df['cat'].astype('category')
f = 'val ~ C(cat) + order + I(order**2) - 1'
mod = sm.OLS.from_formula(f, df)
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    val   R-squared:                       0.958
Model:                            OLS   Adj. R-squared:                  0.924
Method:                 Least Squares   F-statistic:                     27.41
Date:                Thu, 14 Feb 2019   Prob (F-statistic):           8.58e-10
Time:                        22:07:43   Log-Likelihood:                -173.44
No. Observations:                  36   AIC:                             380.9
Df Residuals:                      19   BIC:                             407.8
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
C(cat)[0]       174.7258     39.418      4.433      0.000      92.224     257.228
C(cat)[1]       262.7115     30.125      8.721      0.000     199.658     325.765
C(cat)[2]       128.9006     30.535      4.221      0.000      64.990     192.811
C(cat)[3]       253.1511     30.877      8.199      0.000     188.526     317.776
C(cat)[4]       130.1963     31.149      4.180      0.001      65.000     195.392
C(cat)[5]       198.3030     31.355      6.324      0.000     132.676     263.930
C(cat)[6]       197.2711     31.498      6.263      0.000     131.346     263.196
C(cat)[7]       277.3137     37.662      7.363      0.000     198.485     356.142
C(cat)[8]       186.0607     38.172      4.874      0.000     106.167     265.955
C(cat)[9]       199.0857     38.599      5.158      0.000     118.298     279.874
C(cat)[10]      151.8388     38.942      3.899      0.001      70.333     233.345
C(cat)[11]      297.0200     39.200      7.577      0.000     214.974     379.066
C(cat)[12]      146.5293     39.374      3.722      0.001      64.119     228.939
C(cat)[13]      198.5167     39.465      5.030      0.000     115.915     281.118
C(cat)[14]      142.2322     39.478      3.603      0.002      59.603     224.861
order            -5.5256      3.059     -1.806      0.087     -11.928       0.877
I(order ** 2)     0.4860      0.081      6.027      0.000       0.317       0.655
==============================================================================
Omnibus:                        0.957   Durbin-Watson:                   2.711
Prob(Omnibus):                  0.620   Jarque-Bera (JB):                0.508
Skew:                           0.290   Prob(JB):                        0.776
Kurtosis:                       3.053   Cond. No.                     8.31e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.31e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

6) plotting

plt.figure(figsize=(10, 10))

plt.subplot(211)
plt.plot(df['val'], label='original')
plt.plot(res.fittedvalues, label='seasonality + deterministic trend', lw=3, alpha=0.7)
plt.legend()

# residual
plt.subplot(212)
plt.plot(res.resid, label='residuals')
plt.legend()
plt.show()

7.1) Residual test: normality passes

import scipy as sp
import scipy.stats

sp.stats.probplot(res.resid, plot=plt)
plt.show()

sp.stats.shapiro(res.resid)
(0.986703634262085, 0.9349536895751953)

7.2) Residual test: heteroskedasticity

plt.scatter(res.fittedvalues, res.resid)
plt.show()

7.3) Residual test: autocorrelation (ACF)

  • Correlation exists at lags beyond 0 -> the residual-independence assumption is violated
plt.figure(figsize=(5, 7))

ax1 = plt.subplot(211)
sm.tsa.graphics.plot_acf(res.resid, ax=ax1, lags=15)
ax2 = plt.subplot(212)
sm.tsa.graphics.plot_pacf(res.resid, ax=ax2, lags=15)

plt.tight_layout()
plt.show()

7.4) Ljung-Box test

  • If the residuals are independent, the null hypothesis (the ACF values up to lag k are all zero) should hold
  • It is rejected below
# NOTE: recent statsmodels versions return a DataFrame here; in that case
# read the 'lb_pvalue' column instead of tuple-unpacking.
val, pval = sm.stats.diagnostic.acorr_ljungbox(res.resid)
plt.stem(pval)
plt.show()

Conclusion

  • The deterministic model is inadequate

Self-questions

  • Why isn't the PACF 1 at lag 0? Is the computation failing because the sample is small and the lag too large? No, it is because of method='ywm' (a biased method)
  • How can we tell whether a model is deterministic or stochastic?
    • If the residuals after detrending are white noise, it is a stochastic model
  • Can deterministic & stochastic trend & seasonality coexist?
  • Why not find the seasonality by regression after detrending?