[Stat_Test_1] ANOVA: Comparing Groups

2019-3-13 Wed 13:12

Math

Statistical tests: comparing groups

Two groups

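z-test: population variance known in advance / sample size of 30 or more (rarely used, since the variance is usually unknown)
t-test (equal/unequal variance): tests the difference in means between two independent samples, e.g. the difference between two groups split by some factor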
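
As a rough sketch of the two t-test variants, the statistics can be computed by hand with the standard library. The function names `pooled_t` and `welch_t` and the scores are made up for illustration; only the test statistic and degrees of freedom are computed, and a p-value would still need a t-distribution table or a library routine such as `scipy.stats.ttest_ind` (with `equal_var=True` or `False`).

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Student's t statistic assuming equal variances; returns (t, df)."""
    na, nb = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2


def welch_t(a, b):
    """Welch's t statistic for unequal variances; returns (t, approx. df)."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Made-up scores for two independent groups (illustrative only)
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.1, 4.4, 3.9, 4.8, 4.3]

t_pooled, df_pooled = pooled_t(group_a, group_b)
t_welch, df_welch = welch_t(group_a, group_b)
print(t_pooled, df_pooled)
print(t_welch, df_welch)
```

With equal group sizes the two statistics coincide and only the degrees of freedom differ, which is why the choice between the variants matters most for unbalanced groups.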

Two or More groups

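1. F-test: compares two or more groups by analyzing within-group and between-group variance

e.g. variance of age within and between groups A, B, C -> if the variance of age in A > the variance of age in B and C, then A is made up of heterogeneous age ranges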
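
The age example can be made concrete with a small sketch. The ages below are invented purely for illustration; the code only compares the within-group sample variances.

```python
from statistics import variance

# Made-up ages for three groups (illustrative only)
ages = {
    "A": [19, 27, 35, 48, 61],
    "B": [30, 31, 33, 34, 32],
    "C": [41, 43, 42, 44, 40],
}

# Sample variance of age inside each group
within_var = {group: variance(values) for group, values in ages.items()}

# The group with the largest within-group variance is the most heterogeneous
most_heterogeneous = max(within_var, key=within_var.get)
print(within_var, most_heterogeneous)
```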

2.1 ANOVA (Analysis of Variance): categorical x, continuous y, uses the F-test (link)

$F\text{-statistic} = \dfrac{MST}{MSE} \approx \dfrac{\text{Var b/w Groups}}{\text{Var within Groups}}$
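
A minimal sketch of this ratio for the one-way case, using only the standard library. The helper name `one_way_f` and the data are assumptions for illustration; MST is taken as the between-group sum of squares over k - 1 and MSE as the within-group sum of squares over n - k.

```python
from statistics import mean


def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total sample size
    grand_mean = mean([y for g in groups for y in g])
    group_means = [mean(g) for g in groups]

    # Between-group sum of squares: group means around the grand mean
    ssb = sum(len(g) * (m - grand_mean) ** 2
              for g, m in zip(groups, group_means))
    # Within-group sum of squares: observations around their own group mean
    ssw = sum((y - m) ** 2
              for g, m in zip(groups, group_means) for y in g)

    mst = ssb / (k - 1)
    mse = ssw / (n - k)
    return mst / mse, k - 1, n - k


# Made-up data for three groups (illustrative only)
f_stat, df_between, df_within = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat, df_between, df_within)
```

The resulting F is then compared against the F distribution with (k - 1, n - k) degrees of freedom to decide whether to reject H0.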

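Conclusion: ANOVA gives the same result as a regression on dummy variables. (Reach for ANOVA when the variable has too many category levels.)
- Relationship between regression and ANOVA: ANOVA supplies the F-test for a regression equation (an F-test on the coefficient of determination R^2, i.e. a test of the overall fit of the regression)
- From R^2 to the F test statistic: test statistic F = f(R^2) -> F-test (what is a test statistic?)
  - k: number of groups; n: total sample size
  - MSR (explained variance) = SSR/df; MSE (unexplained variance) = SSE/df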
  - Regression : ANOVA
    - SSR (regression sum of squares) : SSB (between-groups sum of squares)
    - SSE (error sum of squares) : SSW (within-groups sum of squares)
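
One way to see the F = f(R^2) link is the sketch below. It assumes the one-way case, where a regression on group dummies fits each observation with its group mean, so that SSR = SSB and SSE = SSW. With k groups and n observations, $F = \dfrac{R^2/(k-1)}{(1-R^2)/(n-k)}$. The data are made up for illustration.

```python
from statistics import mean

# Made-up data for three groups (illustrative only)
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
k = len(groups)
n = sum(len(g) for g in groups)
all_y = [y for g in groups for y in g]
grand_mean = mean(all_y)
group_means = [mean(g) for g in groups]

# Fitted values of the dummy-variable regression are the group means,
# so SSR matches SSB and SSE matches SSW.
ssr = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
sse = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
sst = sum((y - grand_mean) ** 2 for y in all_y)

r2 = ssr / sst
f_from_r2 = (r2 / (k - 1)) / ((1 - r2) / (n - k))   # F as a function of R^2
f_from_ms = (ssr / (k - 1)) / (sse / (n - k))       # F as MSR / MSE
print(r2, f_from_r2, f_from_ms)
```

Both routes give the same F, which is the sense in which testing R^2 and running the ANOVA F-test are the same test here.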
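Regression provides more information, e.g. ANOVA tells whether x affects y, while regression tells how it affects y.
H0: all population means are equal
Purpose: check whether the difference in means is meaningful, or whether it only appears because the variance is large
Assumptions: normality, homogeneity of variance, independence of observations
Types:
- One-way ANOVA model: y ~ x1
- Two-way ANOVA model: y ~ x1 + x2 + x1:x2 (main effects x1, x2 and their interaction x1:x2)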

2.2 ANOVA Question (link)

Q: How is the F-statistic different from R^2?
A: R^2 measures how strongly the predictors and the response variable are related, but it does not say whether that relationship is statistically significant. The F-statistic lets us judge that significance; in other words, it tells us whether R^2 is significant or not.
Q: What should I do with the F-statistic in a regression model?
A: Use it as a supplementary check on the reliability of R^2: if the F-statistic is significant, that gives extra confidence in the R^2 value obtained, and vice versa.

Example (link)

reference
