[Time Series] ARIMA Model Parameter Estimation Overview

Estimating a stochastic-process model

1. Removing a deterministic trend

- Polynomial trend

2. Removing a stochastic trend

- ARIMA
- Use the ADF test
    - H0: a unit root exists -> integration present -> differencing needed

3. Checking normality

- If the process is Gaussian white noise -> an ARMA model can be applied
- A general linear process is Gaussian (a linear combination of white noise)
- Ways to improve normality
    - Box-Cox transform
    - Log transform
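As a sketch, scipy's `boxcox` estimates the transform parameter $\lambda$ by maximum likelihood; $\lambda = 0$ corresponds to the log transform. The simulated data below are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.exp(rng.normal(size=200))   # positive, right-skewed (lognormal) data

x_bc, lam = stats.boxcox(x)        # lambda chosen by MLE
x_log = np.log(x)                  # the lambda = 0 special case
print(lam)                         # should be near 0 for lognormal data
```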

4. Choosing the ARMA order

- MA: the ACF vanishes beyond a certain lag (cut-off)
- AR: the PACF vanishes beyond a certain lag
- ARMA: neither the ACF nor the PACF cuts off at a fixed lag
    - Parameter estimation: order selection via AIC/BIC; estimation via methods such as moments, least squares, or MLE
    - Estimate the standard errors of the parameters by bootstrapping

5. Model diagnostics

- Test the residuals for normality
- Re-check the model order with the residual ACF/PACF and the Ljung-Box Q test (H0: all autocorrelations beyond lag 0 are zero)

Tip

1) Log and Box-Cox transformations

Why use them? To get constant variance $\rightarrow$ then detrend $\rightarrow$ stationary process

  • For non-stationary processes whose mean and variance change with time t
  • e.g. the mean depends on t, the standard deviation depends on $\mu$
  • cf. the log transform is also used when differencing alone fails to remove the trend
    • i.e., for data that grows slowly and then surges after some point, the trend survives repeated differencing $\rightarrow$ take the log first, then difference
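The point above can be sketched on an exponential-growth toy series, where plain differencing leaves a growing trend but log-then-difference gives a constant:

```python
import numpy as np

t = np.arange(100)
y = np.exp(0.05 * t)              # grows slowly, then surges

d1 = np.diff(y)                   # still trending: grows with t
d1_log = np.diff(np.log(y))       # constant 0.05 everywhere

print(d1[-1] / d1[0])             # large ratio: the trend survives differencing
print(np.allclose(d1_log, 0.05))  # True
```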

2) Distinguishing the models

  • AR vs MA: both are linear combinations of white noise $\epsilon$; distinguishable via the ACF & PACF
  • ARMA (stationary) vs ARIMA (non-stationary): their characteristic equations differ; distinguishable via the ADF test
    • ARMA(p, q): every root of the AR characteristic polynomial lies outside the unit circle
    • ARIMA(p, 1, q): the AR polynomial contains a unit root (the $(1 - L)$ factor)
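In lag-operator notation, the standard formulation (stated here for reference):

```latex
% ARMA(p, q): stationary -- all roots of \phi(z) = 0 lie outside the unit circle
\phi(L)\, y_t = \theta(L)\, \epsilon_t,
\qquad \phi(L) = 1 - \phi_1 L - \cdots - \phi_p L^p

% ARIMA(p, 1, q): the AR side carries the unit root (1 - L), which the ADF test detects
\phi(L)(1 - L)\, y_t = \theta(L)\, \epsilon_t
```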

Reference

http://stat.snu.ac.kr/time/download/%EC%8B%9C%EA%B3%84%EC%97%B4%EB%B6%84%EC%84%9D6%EC%9E%A5%EA%B0%95%EC%9D%98.pdf

Example) Comparing the ACF & PACF of MA & AR

# MA(1) with theta = 0.6
import matplotlib.pyplot as plt
import statsmodels.api as sm

theta = 0.6
ar = [1]
ma = [1, theta]
a = sm.tsa.ArmaProcess(ar, ma)
s = a.generate_sample(20, burnin=240)

plt.figure(figsize=(5, 7))
plt.subplot(411)
plt.stem(a.acf(20))   # theoretical ACF: cuts off after lag 1
plt.subplot(412)
plt.stem(a.pacf(20))  # theoretical PACF: tails off

ax = plt.subplot(413)
sm.tsa.graphics.plot_acf(s, ax=ax)
ax = plt.subplot(414)
sm.tsa.graphics.plot_pacf(s, method='ywm', ax=ax)

plt.tight_layout()
plt.show()

# AR(1) with phi = 0.6 (note: ArmaProcess expects the lag polynomial [1, -phi])
phi = 0.6
ar = [1, -phi]
ma = [1]
a = sm.tsa.ArmaProcess(ar, ma)
s = a.generate_sample(20, burnin=240)

plt.figure(figsize=(5, 7))
plt.subplot(411)
plt.stem(a.acf(20))   # theoretical ACF: tails off
plt.subplot(412)
plt.stem(a.pacf(20))  # theoretical PACF: cuts off after lag 1

ax = plt.subplot(413)
sm.tsa.graphics.plot_acf(s, ax=ax)
ax = plt.subplot(414)
sm.tsa.graphics.plot_pacf(s, method='ywm', ax=ax)

plt.tight_layout()
plt.show()

Example) Shampoo Sales Dataset

    1. Deterministic Trend & Seasonality
    2. Stochastic Trend & Seasonality: Multiplicative SARIMA

Original data

from datetime import datetime

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

df = pd.read_csv('shampoo.csv')
df.dropna(inplace=True)
X = df.values[:, 1]
plt.plot(X)
plt.show()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 36 entries, 0 to 35
Data columns (total 2 columns):
Month                                        36 non-null object
Sales of shampoo over a three year period    36 non-null float64
dtypes: float64(1), object(1)
memory usage: 864.0+ bytes
def parser(x):
    return datetime.strptime('190' + x['Month'], '%Y-%m')

df['Month'] = df.apply(parser, axis=1)
df = df.rename_axis('order').reset_index()
df.columns = ['order', 'date', 'val']
df['order'] = df['order'] + 1

EDA

# not MA: the ACF does not cut off
sm.tsa.graphics.plot_acf(df['val'])
plt.show()

# possibly AR(2): the PACF cuts off after lag 2
sm.tsa.graphics.plot_pacf(df['val'], method='ywm')
plt.show()

# p-value = 1.0: the unit-root null cannot be rejected -> non-stationary
sm.tsa.adfuller(df['val'])
(3.060142083641183,
 1.0,
 10,
 25,
 {'1%': -3.7238633119999998, '5%': -2.98648896, '10%': -2.6328004},
 278.9972644263031)

1. Deterministic Trend & Seasonality

1) Checking the deterministic trend: linear

f = "val ~ order"
mod = sm.OLS.from_formula(f, df)
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    val   R-squared:                       0.730
Model:                            OLS   Adj. R-squared:                  0.722
Method:                 Least Squares   F-statistic:                     91.97
Date:                Thu, 14 Feb 2019   Prob (F-statistic):           3.37e-11
Time:                        22:07:16   Log-Likelihood:                -207.13
No. Observations:                  36   AIC:                             418.3
Df Residuals:                      34   BIC:                             421.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     89.1371     26.723      3.336      0.002      34.829     143.445
order         12.0791      1.260      9.590      0.000       9.519      14.639
==============================================================================
Omnibus:                        3.174   Durbin-Watson:                   1.937
Prob(Omnibus):                  0.205   Jarque-Bera (JB):                2.703
Skew:                           0.666   Prob(JB):                        0.259
Kurtosis:                       2.827   Cond. No.                         43.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
t = df.order; y = df.val
trend1 = res.predict(df)   # fitted linear trend
plt.plot(t, y, '-')
plt.plot(t, trend1, '-')
plt.show()

2) Checking the deterministic trend: quadratic

  • Higher R-squared
f = "val ~ order + I(order**2)"
mod = sm.OLS.from_formula(f, df)
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    val   R-squared:                       0.832
Model:                            OLS   Adj. R-squared:                  0.821
Method:                 Least Squares   F-statistic:                     81.51
Date:                Thu, 14 Feb 2019   Prob (F-statistic):           1.71e-13
Time:                        22:07:24   Log-Likelihood:                -198.63
No. Observations:                  36   AIC:                             403.3
Df Residuals:                      33   BIC:                             408.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept       202.8789     33.300      6.092      0.000     135.129     270.629
order            -5.8801      4.150     -1.417      0.166     -14.324       2.563
I(order ** 2)     0.4854      0.109      4.461      0.000       0.264       0.707
==============================================================================
Omnibus:                        2.850   Durbin-Watson:                   3.055
Prob(Omnibus):                  0.241   Jarque-Bera (JB):                2.222
Skew:                           0.608   Prob(JB):                        0.329
Kurtosis:                       2.971   Cond. No.                     1.92e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.92e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
trend2 = res.predict(df)   # fitted quadratic trend
plt.plot(t, y, '-')
plt.plot(t, trend2, '-')
plt.show()

3) Remove the deterministic trend only

df['diff'] = df['val'] - trend1
df['diff2'] = df['val'] - trend2
df[:5]
   order       date    val        diff      diff2
0      1 1901-01-01  266.0  164.783784  68.515884
1      2 1901-02-01  145.9   32.604710 -47.160121
2      3 1901-03-01  183.1   57.725637  -6.506894
3      4 1901-04-01  119.3  -18.153436 -67.824437
4      5 1901-05-01  180.3   30.767490  -5.312748

4.1) plotting: linear vs quadratic

plt.plot(df.order, df['val'], label='original')
plt.plot(df.order, df['diff'], label='detrended (linear)')
plt.plot(df.order, df['diff2'], label='detrended (quadratic)')
plt.grid(False)
plt.legend()
plt.show()

4.2) Plotting the residuals: linear vs quadratic

plt.plot(df.order, df['diff'], label='residual (linear trend)')
plt.plot(df.order, df['diff2'], label='residual (quadratic trend)')
plt.legend()
plt.show()

5) Remove trend & seasonality

df['cat'] = df['order'] % 15   # seasonal category (the author's choice of cycle)
df['cat'] = df['cat'].astype('category')
f = 'val ~ C(cat) + order + I(order**2) - 1'
mod = sm.OLS.from_formula(f, df)
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    val   R-squared:                       0.958
Model:                            OLS   Adj. R-squared:                  0.924
Method:                 Least Squares   F-statistic:                     27.41
Date:                Thu, 14 Feb 2019   Prob (F-statistic):           8.58e-10
Time:                        22:07:43   Log-Likelihood:                -173.44
No. Observations:                  36   AIC:                             380.9
Df Residuals:                      19   BIC:                             407.8
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
C(cat)[0]       174.7258     39.418      4.433      0.000      92.224     257.228
C(cat)[1]       262.7115     30.125      8.721      0.000     199.658     325.765
C(cat)[2]       128.9006     30.535      4.221      0.000      64.990     192.811
C(cat)[3]       253.1511     30.877      8.199      0.000     188.526     317.776
C(cat)[4]       130.1963     31.149      4.180      0.001      65.000     195.392
C(cat)[5]       198.3030     31.355      6.324      0.000     132.676     263.930
C(cat)[6]       197.2711     31.498      6.263      0.000     131.346     263.196
C(cat)[7]       277.3137     37.662      7.363      0.000     198.485     356.142
C(cat)[8]       186.0607     38.172      4.874      0.000     106.167     265.955
C(cat)[9]       199.0857     38.599      5.158      0.000     118.298     279.874
C(cat)[10]      151.8388     38.942      3.899      0.001      70.333     233.345
C(cat)[11]      297.0200     39.200      7.577      0.000     214.974     379.066
C(cat)[12]      146.5293     39.374      3.722      0.001      64.119     228.939
C(cat)[13]      198.5167     39.465      5.030      0.000     115.915     281.118
C(cat)[14]      142.2322     39.478      3.603      0.002      59.603     224.861
order            -5.5256      3.059     -1.806      0.087     -11.928       0.877
I(order ** 2)     0.4860      0.081      6.027      0.000       0.317       0.655
==============================================================================
Omnibus:                        0.957   Durbin-Watson:                   2.711
Prob(Omnibus):                  0.620   Jarque-Bera (JB):                0.508
Skew:                           0.290   Prob(JB):                        0.776
Kurtosis:                       3.053   Cond. No.                     8.31e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.31e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

6) plotting

plt.figure(figsize=(10, 10))

plt.subplot(211)
plt.plot(df['val'], label='original')
plt.plot(res.fittedvalues, label='seasonality + deterministic trend', lw=3, alpha=0.7)
plt.legend()

# residual
plt.subplot(212)
plt.plot(res.resid, label='residuals')
plt.legend()
plt.show()

7.1) Residual test: normality passes

import scipy as sp
import scipy.stats

sp.stats.probplot(res.resid, plot=plt)
plt.show()

sp.stats.shapiro(res.resid)
(0.986703634262085, 0.9349536895751953)

7.2) Residual test: heteroskedasticity

plt.scatter(res.fittedvalues, res.resid)
plt.show()

7.3) Residual test: autocorrelation (ACF)

  • Correlation exists at lags beyond 0 -> the residual-independence assumption is violated
plt.figure(figsize=(5, 7))

ax1 = plt.subplot(211)
sm.tsa.graphics.plot_acf(res.resid, ax=ax1, lags=15)
ax2 = plt.subplot(212)
sm.tsa.graphics.plot_pacf(res.resid, ax=ax2, lags=15)

plt.tight_layout()
plt.show()

7.4) Ljung-Box test

  • If the residuals are independent, the null hypothesis (the ACF values up to lag k are all zero) should hold
  • It is rejected below
# NOTE: recent statsmodels versions return a DataFrame here; in that case
# read the 'lb_pvalue' column instead of tuple-unpacking.
val, pval = sm.stats.diagnostic.acorr_ljungbox(res.resid)
plt.stem(pval)
plt.show()

Conclusion

  • The deterministic model is inadequate

Self-questions

  • Why isn't the PACF 1 at lag 0? Is the computation failing because the sample is small and the lag too large? No, it is because of method='ywm' (a biased method)
  • How can we tell whether a model is deterministic or stochastic?
    • If the residuals after detrending are white noise, it is a stochastic model
  • Can deterministic & stochastic trend & seasonality coexist?
  • Why not find the seasonality by regression after detrending?