Machine Learning/Statistics

Correlation Coefficient

고슴군 2022. 1. 7. 20:44

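Assumptions of the correlation coefficient

- The relationship between the two variables must be linear.

- The data must satisfy homoscedasticity.

- There must be no outliers.

- The data must not be truncated.

- The data should be normally distributed. (This holds in theory; in practice the coefficient is also used when the data are not normal.)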
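As a small illustration of why the outlier assumption matters, the sketch below computes Pearson r on a perfectly linear toy series and again after corrupting a single point. The values are made up for illustration and are not taken from the dataset used later in this post.

```python
import numpy as np

# Perfectly linear toy data (illustrative values, not from the post's dataset).
x = np.arange(20, dtype=float)
y = 2 * x + 1

# Same data with a single extreme outlier in the last observation.
y_outlier = y.copy()
y_outlier[-1] = -100.0

r_clean = np.corrcoef(x, y)[0, 1]
r_outlier = np.corrcoef(x, y_outlier)[0, 1]

print(f"r without outlier:  {r_clean:.3f}")
print(f"r with one outlier: {r_outlier:.3f}")
```

One corrupted point out of twenty is enough to pull the coefficient from 1 down to almost 0, even though the other nineteen points still lie exactly on a line.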

[References]

https://m.blog.naver.com/PostView.naver?isHttpsRedirect=true&blogId=artquery&logNo=44943778

https://m.blog.naver.com/PostView.naver?isHttpsRedirect=true&blogId=chj1335033&logNo=221258402192

- Correlation in Time series

In [6]:

import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

In [7]:

df = pd.read_csv('synchrony_sample.csv')

In [13]:

df

Out[13]:

	S1_Joy	S2_Joy
0	1.662181	0.611217
1	1.584762	0.697876
2	1.413029	1.198360
3	1.995480	0.950441
4	1.981835	0.669841
...	...	...
5395	2.767519	-1.980672
5396	2.785396	-2.472706
5397	2.757275	-2.746988
5398	2.701315	-1.865076
5399	2.263026	-1.570337

5400 rows × 2 columns

In [12]:

df.isna().sum()

Out[12]:

S1_Joy    125
S2_Joy     47
dtype: int64

Pearson correlation - simple is best

Overall Pearson

global synchrony

In [11]:

overall_pearson_r = df.corr().iloc[0,1]
print(f"Pandas computed Pearson r: {overall_pearson_r}")
# out: Pandas computed Pearson r: 0.2058774513561943

r, p = stats.pearsonr(df.dropna()['S1_Joy'], df.dropna()['S2_Joy'])
print(f"Scipy computed Pearson r: {r} and p-value: {p}")
# out: Scipy computed Pearson r: 0.20587745135619354 and p-value: 3.7902989479463397e-51

# Plot the rolling median of the raw signals (a smoothed view of the data, not a correlation)
f,ax=plt.subplots(figsize=(14,7))
df.rolling(window=30,center=True).median().plot(ax=ax)
ax.set(xlabel='Time',ylabel='Smiling Evidence')
ax.set(title=f"Overall Pearson r = {np.round(overall_pearson_r,2)}");

Pandas computed Pearson r: 0.2058774513561943
Scipy computed Pearson r: 0.20587745135619362 and p-value: 3.790298947947356e-51
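The pandas and SciPy results agree even though only the SciPy call drops missing rows explicitly. That is because `DataFrame.corr()` excludes missing values pair by pair. The sketch below checks this on a small made-up frame (the column names `a` and `b` and the values are illustrative only).

```python
import numpy as np
import pandas as pd

# Small made-up frame with missing values in different rows.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "b": [2.0, 1.0, 3.0, np.nan, 6.0, 5.0],
})

# DataFrame.corr() skips rows where either value of the pair is missing.
r_pandas = toy.corr().iloc[0, 1]

# Dropping incomplete rows first and using NumPy gives the same number.
complete = toy.dropna()
r_numpy = np.corrcoef(complete["a"], complete["b"])[0, 1]

print(r_pandas, r_numpy)
```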

Moving window correlation

local synchrony at that moment

In [62]:

# Set window size to compute moving window synchrony.
r_window_size = 120
# Interpolate missing data.
df_interpolated = df.interpolate()
# Compute rolling window synchrony
rolling_r = df_interpolated['S1_Joy'].rolling(window=r_window_size, center=True).corr(df_interpolated['S2_Joy'])
f,ax=plt.subplots(2,1,figsize=(14,6),sharex=True)
df.rolling(window=30,center=True).median().plot(ax=ax[0])
ax[0].set(xlabel='Frame',ylabel='Smiling Evidence')
rolling_r.plot(ax=ax[1])
ax[1].set(xlabel='Frame',ylabel='Pearson r')
plt.suptitle("Smiling data and rolling window correlation")

Out[62]:

Text(0.5, 0.98, 'Smiling data and rolling window correlation')

The local correlation is computed by rolling a window of 120 values over the data. If the window is too small, each correlation is estimated from too few samples and becomes unreliable.
Overall, the Pearson correlation is a good place to start, as it provides a very simple way to compute both global and local synchrony. However, it gives no insight into signal dynamics such as which signal occurs first; that can be measured via cross correlations.
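The effect of window size on reliability can be sketched with two independent noise series, where the true correlation is 0 at every point. The series, the seed, and the window sizes 10 and 30 are assumptions chosen for illustration; only 120 comes from the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
# Two independent noise series: the true correlation is 0 everywhere.
a = pd.Series(rng.standard_normal(n))
b = pd.Series(rng.standard_normal(n))

spread = {}
for window in (10, 30, 120):
    rolling_r = a.rolling(window=window, center=True).corr(b)
    # Standard deviation of the local estimates around the true value 0.
    spread[window] = rolling_r.std()
    print(f"window={window:>3}: std of rolling r = {spread[window]:.3f}")
```

The smaller the window, the more the local estimate wanders away from 0 purely by chance, so apparent bursts of synchrony in a short window should be read with caution.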

Time Lagged Cross Correlation (TLCC)

Time lagged cross correlation (TLCC) can identify directionality between two signals, such as a leader-follower relationship in which the leader initiates a response that is repeated by the follower. There are a couple of ways to investigate such relationships, including Granger causality.
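To make the leader-follower idea concrete, here is a minimal NumPy sketch on synthetic data in which the follower is built as a noisy copy of the leader delayed by 7 samples. The helper `lagged_corr`, the delay, and the noise level are all hypothetical; the sign convention matches shifting the second series by `lag`, as `datay.shift(lag)` does in this post's `crosscorr`.

```python
import numpy as np

def lagged_corr(x, y, lag):
    """Pearson r between x[t] and y[t - lag] (same convention as shifting y by lag)."""
    if lag > 0:
        x, y = x[lag:], y[:-lag]
    elif lag < 0:
        x, y = x[:lag], y[-lag:]
    return np.corrcoef(x, y)[0, 1]

rng = np.random.default_rng(42)
true_delay = 7  # hypothetical: the follower repeats the leader 7 samples later
leader = rng.standard_normal(500)
follower = np.roll(leader, true_delay) + 0.1 * rng.standard_normal(500)

lags = np.arange(-20, 21)
rs = np.array([lagged_corr(leader, follower, lag) for lag in lags])
best_lag = lags[np.argmax(rs)]
print(f"peak correlation at lag {best_lag}")
```

By construction the peak sits at lag -7: a peak at a negative lag means the first series leads and the second follows.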

In [17]:

def crosscorr(datax, datay, lag=0, wrap=False):
    """ Lag-N cross correlation.
    Shifted data is filled with NaNs unless wrap=True.

    Parameters
    ----------
    lag : int, default 0
    datax, datay : pandas.Series objects of equal length
    wrap : bool, default False
        If True, the values pushed off one end by the shift are re-inserted at the
        other end, so the series becomes circular. If False, the NaNs created by
        the shift are dropped when the correlation is computed.

    Returns
    ----------
    crosscorr : float
    """
    if wrap:
        # np.roll shifts circularly and handles positive, negative, and zero lags.
        shiftedy = pd.Series(np.roll(datay.values, lag), index=datay.index)
        return datax.corr(shiftedy)
    else:
        return datax.corr(datay.shift(lag))

In [16]:

d1 = df['S1_Joy']
d2 = df['S2_Joy']
seconds = 5
fps = 30
rs = [crosscorr(d1,d2, lag) for lag in range(-int(seconds*fps),int(seconds*fps+1))]
center = len(rs)//2  # index of lag 0 (150 when len(rs) == 301)
offset = center-np.argmax(rs) # distance of the peak from the center; positive when the peak lies left of center

f,ax=plt.subplots(figsize=(14,3))
ax.plot(rs)
ax.axvline(center,color='k',linestyle='--',label='Center')
ax.axvline(np.argmax(rs),color='r',linestyle='--',label='Peak synchrony')
ax.set(title=f'Offset = {offset} frames\nS1 leads <> S2 leads',ylim=[.1,.31],xlim=[0,300], xlabel='Offset',ylabel='Pearson r')
ax.set_xticks([0, 50, 100, 150, 200, 250, 300])
ax.set_xticklabels([-150, -100, -50, 0, 50, 100, 150]);
plt.legend()

# If the peak lies to the left of center, S1 leads and S2 follows.
# At the left edge, shift(-150) is applied to d2, so the correlation is between d2's future and d1's present. A peak at a negative lag therefore means d1 leads.
# This is still a global-level correlation, just measured across time lags.

Out[16]:

<matplotlib.legend.Legend at 0x1de008702e8>

Windowed time lagged cross correlation

In [58]:

seconds = 5
fps = 30
no_splits = 20
samples_per_split = df.shape[0] // no_splits  # integer number of rows per chunk
rss=[]

for t in range(0, no_splits):
    # iloc slices are end-exclusive, so consecutive chunks do not overlap
    d1 = df['S1_Joy'].iloc[t*samples_per_split:(t+1)*samples_per_split]
    d2 = df['S2_Joy'].iloc[t*samples_per_split:(t+1)*samples_per_split]
    rs = [crosscorr(d1,d2, lag) for lag in range(-int(seconds*fps),int(seconds*fps+1))]
    rss.append(rs)
rss = pd.DataFrame(rss)
f,ax = plt.subplots(figsize=(10,5))
sns.heatmap(rss,cmap='RdBu_r',ax=ax)
ax.set(title=f'Windowed Time Lagged Cross Correlation',xlim=[0,301], xlabel='Offset',ylabel='Window epochs')
ax.set_xticks([0, 50, 100, 151, 201, 251, 301])
ax.set_xticklabels([-150, -100, -50, 0, 50, 100, 150]);

# Rolling window time lagged cross correlation
seconds = 5
fps = 30
window_size = 300 #samples
t_start = 0
t_end = t_start + window_size
step_size = 30
rss=[]
while t_end < len(df):  # len(df) is 5400 for this dataset
    d1 = df['S1_Joy'].iloc[t_start:t_end]
    d2 = df['S2_Joy'].iloc[t_start:t_end]
    rs = [crosscorr(d1,d2, lag, wrap=False) for lag in range(-int(seconds*fps),int(seconds*fps+1))]
    rss.append(rs)
    t_start = t_start + step_size
    t_end = t_end + step_size
rss = pd.DataFrame(rss)

f,ax = plt.subplots(figsize=(10,10))
sns.heatmap(rss,cmap='RdBu_r',ax=ax)
ax.set(title=f'Rolling Windowed Time Lagged Cross Correlation',xlim=[0,301], xlabel='Offset',ylabel='Epochs')
ax.set_xticks([0, 50, 100, 151, 201, 251, 301])
ax.set_xticklabels([-150, -100, -50, 0, 50, 100, 150]);

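Unlike the moving window correlation above, the windowed TLCC does not roll a window over the data; it splits the data into 20 chunks.
TLCC is computed for each chunk, so the lead-lag relationship is examined chunk by chunk.
The data may be split too finely here, which makes the result hard to interpret. Ideally, the correlation would be consistently high at one particular offset regardless of the chunk.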
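One way to check whether the peak offset is consistent across chunks is to reduce each chunk to the lag of its maximum correlation. The sketch below does this on synthetic data with a constant delay of 5 samples. The helper names, the chunk count, and the data are hypothetical and stand in for the real signals.

```python
import numpy as np

def lagged_corr(x, y, lag):
    """Pearson r between x[t] and y[t - lag]."""
    if lag > 0:
        x, y = x[lag:], y[:-lag]
    elif lag < 0:
        x, y = x[:lag], y[-lag:]
    return np.corrcoef(x, y)[0, 1]

def peak_lag_per_chunk(x, y, n_chunks, max_lag):
    """Split both series into n_chunks equal pieces and return the peak lag of each."""
    lags = np.arange(-max_lag, max_lag + 1)
    size = len(x) // n_chunks
    peaks = []
    for k in range(n_chunks):
        xs = x[k * size:(k + 1) * size]
        ys = y[k * size:(k + 1) * size]
        rs = [lagged_corr(xs, ys, lag) for lag in lags]
        peaks.append(int(lags[np.argmax(rs)]))
    return peaks

rng = np.random.default_rng(1)
x = rng.standard_normal(2000)
y = np.roll(x, 5) + 0.1 * rng.standard_normal(2000)  # y follows x by 5 samples

peaks = peak_lag_per_chunk(x, y, n_chunks=10, max_lag=20)
print(peaks)
```

When the relationship is stable, every chunk reports the same peak lag. On real data a wide scatter of peak lags would suggest that the lead-lag relationship changes over time or that the chunks are too short.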

Limitation: the correlations above assume that the events in the two signals occur at roughly the same time and have similar durations. Signals that vary in length are covered in the next section.

Dynamic Time Warping - synchrony of signals varying in length

In [93]:

from dtw import dtw

d1 = df['S1_Joy'].interpolate().values
d2 = df['S2_Joy'].interpolate().values
d, cost_matrix, acc_cost_matrix, path = dtw(d1.reshape(-1,1),d2.reshape(-1,1), dist_method='euclidean')

plt.imshow(acc_cost_matrix.T, origin='lower', cmap='gray', interpolation='nearest')
plt.plot(path[0], path[1], 'w')
plt.xlabel('Subject1')
plt.ylabel('Subject2')
plt.title(f'DTW Minimum Path with minimum distance: {np.round(d,2)}')
plt.show()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-93-a2785c98aedb> in <module>
      3 d1 = df['S1_Joy'].interpolate().values
      4 d2 = df['S2_Joy'].interpolate().values
----> 5 d, cost_matrix, acc_cost_matrix, path = dtw(d1.reshape(-1,1),d2.reshape(-1,1), dist_method='euclidean')
      6 
      7 plt.imshow(acc_cost_matrix.T, origin='lower', cmap='gray', interpolation='nearest')

TypeError: cannot unpack non-iterable DTW object
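The error message indicates that the installed `dtw` package returns a single `DTW` object rather than the 4-tuple the cell tries to unpack. Independently of any particular package, the idea behind DTW can be sketched in plain NumPy: fill an accumulated cost matrix, then backtrack to recover the warping path. The function `dtw_path` below is a hypothetical helper, uses the absolute difference as the local cost, and runs in pure Python loops, so it is only practical for short or downsampled series.

```python
import numpy as np

def dtw_path(x, y):
    """Return (distance, accumulated cost matrix, warping path) for 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    # acc[i, j] = minimal cumulative cost of aligning x[:i] with y[:j]
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack from the last cell to the first to recover the path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: acc[s])
        path.append((i - 1, j - 1))
    return acc[n, m], acc[1:, 1:], path[::-1]

a = np.array([0.0, 1.0, 2.0, 3.0])
b = np.array([0.0, 0.0, 1.0, 2.0, 3.0])  # same shape, stretched at the start
dist, acc, path = dtw_path(a, b)
print(dist, path)
```

The two toy sequences differ only in timing, so the warped distance is 0 even though a point-by-point comparison is not even defined for arrays of different length.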

[References]

https://towardsdatascience.com/four-ways-to-quantify-synchrony-between-time-series-data-b99136c4a9c9

https://www.statology.org/rolling-correlation-pandas/

Granger Causality

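Granger causality is not causality in the strict sense. It is used to determine whether one series precedes another and, if so, at what lag, so it can be used to check lead-lag relationships between time series.

1.2.2 Understanding Granger Causality

"A measure of how useful other information is when predicting a phenomenon"

A question that cannot be answered: "Which came first, the chicken or the egg?" (causality)

A question that can be answered: "How strongly do chickens and eggs influence each other, depending on the order in which they are produced?" (Granger causality)

Why it is needed:

In regression analysis, the independent and dependent variables are treated as already fixed by economic theory before the causal relationship is examined

When cause and effect are unclear, it is hard in practice to settle on a definite functional relationship

Examples:

Causal relationship between chicken production and egg production

Causal relationship between short-term and medium-term interest rates

Causal relationship between rainfall and the number of zoo visitors

Causal relationship between the size of a pay raise and consumer spending

Causal relationship between a company's advertising expenditure and its sales

Causal relationship between rainfall and internet usage

Causal relationship between the numeric settings of an advertising campaign and the number of clicks

Model premise: past events can cause present events, but future events cannot cause present events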
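- Stationarity: the test assumes stationary data, so both the independent variable ( $X$ ) and the dependent variable ( $Y$ ) must be stationary (results based on non-stationary data are easily misinterpreted)
- Input lags: lagged variables must be included as inputs; if the expected lag is $N$ , all lags from $1$ to $N$ must be used as input variables
- Final lag: the result is very sensitive to the expected lag $N$ , so an appropriate length must be chosen
  - Typically up to 2 to 3 times the annualized frequency (2 for yearly data, 8 for quarterly data, 24 for monthly data)
  - Determined by a significant change in the $F$ test statistic
- Test direction: the relationship between the independent and dependent variables must be compared in both directions, so two tests are run in total
  - $X \Rightarrow Y$ (one test): test whether $X$ has a causal influence on $Y$ (check whether $\beta_{j} = 0$ and how much the variance of $\epsilon_{XY}$ falls relative to $\epsilon_{Y}$ )
    $\begin{aligned} \text{Using only } Y: \quad & Y_{t} = \mu_{t} + \sum_{i = 1}^{\infty} \alpha_{i} Y_{t - i} + \epsilon_{Y} \\ \text{Using } X \text{ and } Y: \quad & Y_{t} = \mu_{t} + \sum_{i = 1}^{\infty} \alpha_{i} Y_{t - i} + \sum_{j = 1}^{\infty} \beta_{j} X_{t - j} + \epsilon_{XY} \end{aligned}$
  - $Y \Rightarrow X$ (one test): test whether $Y$ has a causal influence on $X$ (check whether $\beta_{j} = 0$ and how much the variance of $\epsilon_{YX}$ falls relative to $\epsilon_{X}$ )
    $\begin{aligned} \text{Using only } X: \quad & X_{t} = \mu_{t} + \sum_{i = 1}^{\infty} \alpha_{i} X_{t - i} + \epsilon_{X} \\ \text{Using } X \text{ and } Y: \quad & X_{t} = \mu_{t} + \sum_{i = 1}^{\infty} \alpha_{i} X_{t - i} + \sum_{j = 1}^{\infty} \beta_{j} Y_{t - j} + \epsilon_{YX} \end{aligned}$
- Automation: the procedure is hard to generalize and automate across different kinds of data
- Cautions:
  - A causal relationship cannot be asserted unconditionally
  - The test applies to time series in which temporal order carries meaningful context, and whether Granger causality is found can depend on the lag period chosen
  - Given a correlation, the result can loosely be read as evidence about the absence of causality, but it is only conclusive once unobserved factors are also taken into account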
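The two regressions for the $X \Rightarrow Y$ direction can be sketched directly with least squares: fit the model that uses only lags of $Y$ , fit the model that also uses lags of $X$ , and compare their residual sums of squares with an F statistic. This is a minimal sketch under stated assumptions, not a full test: the infinite sums are truncated at `n_lags`, $\mu_{t}$ is modeled as a constant, the helper names are hypothetical, and the data are synthetic with $X$ built to drive $Y$ at lag 2.

```python
import numpy as np

def lag_matrix(series, n_lags):
    """Columns are series[t-1], ..., series[t-n_lags] for t = n_lags .. len-1."""
    n = len(series)
    return np.column_stack([series[n_lags - k:n - k] for k in range(1, n_lags + 1)])

def granger_f_stat(y, x, n_lags):
    """F statistic for H0: all beta_j = 0 (lags of x do not help predict y)."""
    target = y[n_lags:]
    const = np.ones((len(target), 1))
    restricted = np.hstack([const, lag_matrix(y, n_lags)])      # lags of y only
    unrestricted = np.hstack([restricted, lag_matrix(x, n_lags)])  # plus lags of x

    def ssr(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    ssr_r, ssr_u = ssr(restricted), ssr(unrestricted)
    dof = len(target) - unrestricted.shape[1]
    return ((ssr_r - ssr_u) / n_lags) / (ssr_u / dof)

rng = np.random.default_rng(0)
n = 500
x = rng.standard_normal(n)
y = np.zeros(n)
for t in range(2, n):
    # Synthetic ground truth: y depends on its own past and on x two steps back.
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 2] + 0.5 * rng.standard_normal()

f_x_to_y = granger_f_stat(y, x, n_lags=3)
f_y_to_x = granger_f_stat(x, y, n_lags=3)
print(f"F (X => Y): {f_x_to_y:.1f}")
print(f"F (Y => X): {f_y_to_x:.1f}")
```

A large F in one direction and a small F in the other is the pattern expected when $X$ precedes $Y$ . Turning the statistic into a p-value requires the F distribution with `n_lags` and `dof` degrees of freedom, for example via `scipy.stats.f.sf`.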

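Granger causality test
- Hypotheses
  - Conventional claim (null hypothesis, $H_{0}$ ): one variable does not help predict the other
  - My claim (alternative hypothesis, $H_{1}$ ): one variable helps predict the other
- Decision (single test)
  - p-value >= my threshold (e.g. 0.05): the collected (analyzed) data is consistent with the conventional claim, so the null hypothesis is not rejected. In this data, one variable does not appear to help predict the other
  - p-value < my threshold (e.g. 0.05): the collected (analyzed) data departs from the conventional claim, so the null hypothesis is rejected in favor of my claim. In this data, one variable helps predict the other
- Decision (comparing the two regressions)
  - "Predictive power (p-value) of a linear regression of $Y$ on $Y$ lags only" > "predictive power (p-value) of a linear regression of $Y$ on $X$ lags + $Y$ lags": in this data, $X$ does not help predict $Y$
  - "Predictive power (p-value) of a linear regression of $Y$ on $X$ lags + $Y$ lags" > "predictive power (p-value) of a linear regression of $Y$ on $Y$ lags only": in this data, $X$ helps predict $Y$
- Combining the results
  - " $X$ has a causal influence on $Y$ " + " $Y$ has no causal influence on $X$ "
    : $X$ can be said to precede $Y$ , so $X$ is likely to be a causal factor of $Y$
  - " $Y$ has a causal influence on $X$ " + " $X$ has no causal influence on $Y$ "
    : $Y$ can be said to precede $X$ , so $Y$ is likely to be a causal factor of $X$
  - " $X$ has a causal influence on $Y$ " + " $Y$ also has a causal influence on $X$ "
    : Granger causality holds in both directions; in this case a third external (exogenous) variable has likely influenced both
    : The third exogenous variable must either be identified or given up on, and a VAR model may be needed (Granger causality is itself a kind of VAR model)
  - " $X$ has no causal influence on $Y$ " + " $Y$ has no causal influence on $X$ "
    : The two variables may have no causal influence on each other, but this is hard to assert
    : An ARIMA model may allow further checking
    : The result can change with the final lag used, so the interpretation may differ by lag (human experience and judgment are required)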
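The four combinations above can be written as a small decision helper that takes the p-values of the two one-directional tests. The function name, the returned wording, and the default threshold of 0.05 are assumptions for illustration; the p-values themselves would come from whichever Granger test implementation is used.

```python
def interpret_granger(p_x_to_y, p_y_to_x, alpha=0.05):
    """Map the two one-directional p-values onto the four cases listed above."""
    x_helps_y = p_x_to_y < alpha
    y_helps_x = p_y_to_x < alpha
    if x_helps_y and not y_helps_x:
        return "X precedes Y"
    if y_helps_x and not x_helps_y:
        return "Y precedes X"
    if x_helps_y and y_helps_x:
        return "bidirectional: consider a common external variable or a VAR model"
    return "no Granger causality detected in either direction at this lag"

# Hypothetical p-values, for illustration only.
print(interpret_granger(0.001, 0.40))
print(interpret_granger(0.30, 0.60))
```

The last case is deliberately worded as "not detected at this lag", since the notes above point out that the outcome can change with the chosen lag.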
