[Generative] Improved Precision and Recall Metric for Assessing Generative Models (NIPS'19)

2024. 10. 26. 02:54ㆍDevelopers 공간 [SOTA]

Paper : https://proceedings.neurips.cc/paper_files/paper/2019/file/0234c510bc6d908b28c70ff313743079-Paper.pdf
Code : https://github.com/kynkaat/improved-precision-and-recall-metric
Authors
- Nvidia, NIPS’19
Main Idea
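- Measures both the coverage and the quality of the samples produced by a generative model
- Proposes a new metric built on an explicit, non-parametric representation of the manifolds, which also makes the result easy to visualize
- Tasks : 2D Image Generation
- Results : FFHQ, ImageNet

<Contents>
0. Before Start...
a. Background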
b. F-Beta Score in Generation
c. TPR, FPR for Mode Collapse
1. Problem
a. Precision & Recall
2. Approach
a. Algorithm
b. Application
c. Advanced
3. Implementation
a. Dataloader
b. Post Processor
c. Improved Precision and Recall
d. Realism Score

0. Before Start...

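Before starting, this post reviews the basic evaluation metrics for classification as background, and then looks at two papers that analyze generative models in terms of precision and recall.

a. Background

First, the basic evaluation metrics for classification.

TP (True Positive) : predicted positive, and the prediction is correct.
FP (False Positive, False Alarm) : predicted positive, but the prediction is wrong.
FN (False Negative, Missed Detection) : predicted negative, but the prediction is wrong.
TN (True Negative) : predicted negative, and the prediction is correct.

Several metrics can be derived from these four counts.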

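1. Accuracy : the fraction of all predictions that are correct, i.e. positives predicted as positive and negatives predicted as negative
$$\frac{TP+TN}{TP+FP+FN+TN}$$
2. Precision : among the samples predicted positive, the fraction that are actually positive
$$\frac{TP}{TP+FP}$$
3. TPR (True Positive Rate) = Recall : among the actually positive samples, the fraction that are predicted positive.
= 1 - Type II error rate = Sensitivity
$$\frac{TP}{TP+FN}$$
** FPR (False Positive Rate) : among the actually negative samples, the fraction that are wrongly predicted positive
= 1 - specificity (TNR), where specificity is the fraction of actual negatives that are correctly predicted negative
$$\frac{FP}{FP+TN}$$
4. F1 score : the harmonic mean of Precision and Recall
$$2\times \frac{Precision\times Recall}{Precision+Recall}$$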
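
As a quick check of the formulas above, here is a minimal sketch that derives every metric from the four confusion-matrix counts. The function name and the example counts are made up for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive the metrics listed above from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called TPR or sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "fpr": fp / (fp + tn),  # equals 1 - specificity
        "f1": 2 * precision * recall / (precision + recall),  # harmonic mean
    }


# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tp=8, fp=2, fn=4, tn=6)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that the F1 score can equivalently be written as 2TP / (2TP + FP + FN), which is a handy way to sanity-check the harmonic-mean form.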

The tradeoffs between these metrics are usually visualized as follows.

Left figure below : the ROC (Receiver Operating Characteristic) curve, which analyzes a model through the tradeoff between TPR and FPR
Right figure below : the PR (Precision-Recall) curve, which analyzes a model through the tradeoff between Precision and Recall

[angeloyeo.github.io/2020/08/05/ROC.html ]

[Difference between the ROC curve and the PR curve : https://ichi.pro/ko/roc-mich-precision-recall-gogseon-213574150750732]

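The following variants are also in common use.

AP50 : Average Precision at 50. In detection, a prediction counts as correct when its IoU is at least 50%, and precision is averaged under that criterion.
Precision@K : the fraction of the top-K results that are relevant
Recall@K : the fraction of all relevant items that appear in the top-K results
R-precision : precision over the top-R results, where R is the total number of relevant items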
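
The ranking variants are easiest to see on a tiny example. The sketch below assumes a ranked result list and a set of relevant items; the function names and the data are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that show up in the top-k results."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def r_precision(ranked, relevant):
    """Precision at R, where R is the number of relevant items."""
    return precision_at_k(ranked, relevant, len(relevant))


# Hypothetical ranking: "a" and "c" are retrieved, "e" is never retrieved.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}

p_at_2 = precision_at_k(ranked, relevant, 2)
r_at_2 = recall_at_k(ranked, relevant, 2)
r_prec = r_precision(ranked, relevant)
```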

[https://en.wikipedia.org/wiki/Sensitivity_and_specificity]

b. F-Beta Score in Generation

** Assessing generative models via precision and recall (NIPS’18)

** https://github.com/msmsajjadi/precision-recall-distributions

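Precision and recall, as described above, split an evaluation into two dimensions. This paper applies the same intuition to the divergence between two distributions and proposes a way to evaluate generative models that is more informative than FID and Inception Score.

The paper covered in chapter 1 uses $F_8$ and $F_{1/8}$ from this work as metrics.

They are instances of the F-beta score, an established metric that adds a weight to the F1 score (the harmonic mean of precision and recall). It is worth seeing how they come up in this paper.

Quantitative evaluation of generative models has long been lacking, and the main reason given is that the learned distribution is implicit.

That is, one can sample from the learned distribution, but the likelihood cannot be evaluated efficiently. Even when the likelihood is tractable, it can be an inadequate measure of sample quality and suffers from high-dimensional issues.

** likelihood $p(y|\theta,x)$ : the probability of y given a model $\theta$ and an input x

** The paper does not say so, but my guess is that the likelihood cannot be obtained efficiently because only samples of the output are available and there is nothing to compare them against.

As an alternative for evaluating trained models, IS (Inception Score) and FID (Frechet Inception Distance) were introduced.

** See the collapsed section below for details on IS and FID.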

------------------------------------------------------------------------------------

<IS and FID>

A brief look at IS and FID.

1. IS (Inception Score) : measures how well the model captures the class distribution of the dataset.

Using an Inception model trained on ImageNet, let y be the class label predicted for a generated image x. IS is high when the conditional label distribution $p(y|x)$ has low entropy (sharpness) and the marginal label distribution $p(y)$ has high entropy (the model generates diverse classes).

** Sharpness (S) : each generated image is confidently assigned to a single class.

** Diversity (D) : the generated images cover many different classes.

$$IS(G)=\exp\left(\mathbb{E}_{x\sim G}\left[{\color{red}d_{KL}\left(\underbrace{p(y|x)}_{\text{Sharpness}}\,\|\,\underbrace{p(y)}_{\text{Diversity}}\right)}\right]\right)$$

2. FID (Frechet Inception Distance) : compares the features of generated images with the features of a separate test set, and thereby measures the quality and diversity of the generated images.

** $\mu$ denotes the mean and $\Sigma$ the covariance.

** For a set of test images $T$ and a set of images $I_G$ generated by G,

$$FID(T,I_G)=\underbrace{\|\mu_T-\mu_{I_G}\|^2_2}_{\text{Quality}}+\underbrace{Tr(\Sigma_T+\Sigma_{I_G}-2(\Sigma_T\Sigma_{I_G})^{\frac{1}{2}})}_{\text{Diversity}}$$

In particular, FID stays consistent with visual fidelity even when images are corrupted, and since it needs no labels it can also be applied to unlabeled data.
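
Both formulas can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic inputs, not the reference implementation: real IS and FID pipelines take the class probabilities and features from a pre-trained Inception network, which is assumed away here. The trace of the matrix square root is computed from the eigenvalues of the covariance product, which is one of several ways to evaluate that term.

```python
import numpy as np


def inception_score(p_yx, eps=1e-12):
    """exp(E_x[KL(p(y|x) || p(y))]) for an [N, C] matrix of class probabilities."""
    p_yx = np.asarray(p_yx, dtype=np.float64)
    p_y = p_yx.mean(axis=0, keepdims=True)  # marginal label distribution
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))


def frechet_distance(feat_a, feat_b):
    """Frechet distance between Gaussians fitted to two [N, D] feature sets."""
    mu_a, mu_b = feat_a.mean(axis=0), feat_b.mean(axis=0)
    cov_a = np.cov(feat_a, rowvar=False)
    cov_b = np.cov(feat_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


# Synthetic stand-ins for classifier outputs and features (assumptions).
one_hot = np.eye(4)[[0, 1, 2, 3] * 5]   # sharp and diverse predictions
uniform = np.full((20, 4), 0.25)        # completely unsure predictions
is_sharp = inception_score(one_hot)
is_flat = inception_score(uniform)

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 4))
fid_same = frechet_distance(feats, feats)
fid_shifted = frechet_distance(feats, feats + 3.0)
```

With sharp predictions spread evenly over 4 classes the score approaches the number of classes, while uniform predictions give the minimum score of 1. Shifting every feature by a constant leaves the covariance term at zero and only the mean term contributes.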

------------------------------------------------------------------------------------

FID is widely used because it reflects sample quality and is sensitive to mode dropping, but these metrics also have a weakness.

Namely, they do not distinguish between different "failure cases".

** See the collapsed section below for details on mode dropping.

------------------------------------------------------------------------------------

<About Mode Dropping>

The problems below occur mainly in GANs among generative models.

mode collapse : several modes of the real distribution are merged into their average; put simply, intra-class collapse.
mode dropping (mode drop) : modes of the real distribution that are hard to represent are ignored entirely; put simply, class drop.
mode inventing (mode invention) : modes that do not exist in the real distribution appear; put simply, class addition.

An adversarial network alternates between training the Generator and the Discriminator, so the two must stay balanced in ability as training proceeds.

If the Discriminator fails to learn fine details, the model does not reach the optimum and, as in the problems above, fails to generate some kinds of images (modes) or generates too few of them.

In that case the Generator maps many different inputs to the same output, cannot produce diverse images (modes), and ends up with low recall and high precision.

Why does this happen in GANs specifically? Two causes have been proposed.

GANs are usually trained with the objective below. One explanation is that the Discriminator, which maximizes the objective, and the Generator, which minimizes it, pull in very different directions.

$$G^*=\underset{\theta_G}{min}\underset{\theta_D}{max}V(G,D)$$

Another explanation attributes it to the use of the Jensen-Shannon divergence. However, the same behavior appears with the KL divergence, so this argument looks weak.
** divergence : here, a non-negative measure of how different two probability distributions are (e.g. the KL or Jensen-Shannon divergence). It is unrelated to the divergence of a vector field.

For reference, the following remedies have been tried.

Increasing the capacity of the model can reduce mode dropping.
Increasing the number of generators in a GAN can reduce mode collapse.

------------------------------------------------------------------------------------

Let us look more closely at the claim that "failure cases" are not distinguished. The metrics above output a one-dimensional score and therefore cannot tell failure cases apart. The figure below shows the following.

In common : both have a similar FID.
Left figure : realistic, but only certain kinds are generated.
Right figure : low quality, but all kinds are generated.

The middle of the figure plots the metric this paper proposes (explained next), which separates the two cases as follows.

Left figure : high precision and low recall
Right figure : low precision and high recall

In other words, the divergence between distributions is separated into two components, precision and recall.

Now the definition of PRD (Precision and Recall for Distributions) proposed in this paper, step by step. Assume the following distributions.

distribution Q : the distribution being evaluated
distribution P : the reference distribution

Precision and recall then read as follows.

precision : "among what was predicted positive, the part that is actually positive", i.e. how much of the generated Q lies in the real P.
recall : "among what is actually positive, the part that was predicted positive", i.e. how much of the real P is covered by the generated Q.

For intuition, the paper gives toy examples of P and Q as in the figure below.

(a) P is bi-modal and Q captures only one mode : maximal precision, low recall
(b) Q is bi-modal and only one of its modes is in P : low precision, maximal recall
(c) P=Q : maximal precision, maximal recall
(d) P and Q are disjoint : precision 0, recall 0

The paper then defines the region $S$ below and writes each of $P, Q$ as a weighted mixture.

$S=supp(P)\cap supp(Q)$ : the intersection of the supports of P and Q
** support : restricts the domain to where the distribution is nonzero, so that regions carrying no probability mass are ignored.
$supp(P)=\{\omega\in \Omega | P(\omega)>0\}$
$\bar{S}$ : the complement of S
$P=\bar{\beta}P_S+(1-\bar{\beta})P_{\bar{S}}$ : a weighted combination of TP and FN
- $\bar{\beta} \in (0,1]$ : the weight for recall
- $1-\bar{\beta}$ : the weight for the loss in recall
$Q=\bar{\alpha}Q_S+(1-\bar{\alpha})Q_{\bar{S}}$ : a weighted combination of TP and FP
- $\bar{\alpha} \in (0,1]$ : the weight for precision
- $1-\bar{\alpha}$ : the weight for the loss in precision

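FN, TP, and FP are then defined as follows.

True Positive : the region that the model generates and that is also in the data
False Negative : the region that is in the data but that the model fails to generate
False Positive : the region that the model generates but that is not in the data

However, this criterion only asks whether the model produced exactly what the data contains.

What we expect from a generative model is not a verbatim reproduction of the data. Rather, depending on a threshold, we judge a result correct when it is generated well for the "class" or "condition" we have in mind.

[When two probability distributions differ : https://only-wanna.tistory.com/entry/Classification-Metrics분류-모델-지표-알아보기-TPR-FPR과-ROC-Curve-사이-관계-및-AUC]

Moreover, precision and recall computed this way are only meaningful when $P_S=Q_S$. When $P_S\neq Q_S$, the following questions arise.

Does the difference between $P_S$ and $Q_S$ contribute to a loss in precision and recall?
Which of the following is the actual cause?
- non-covered : is it because $Q_S$ covers $P_S$ inadequately?
- unrealistic : is it because $Q_S$ generates unnecessary noise?

The paper resolves this ambiguity by reporting the trade-off between precision and recall instead of two single numbers.

To parametrize this trade-off, it defines a distribution $\mu$ supported on S.

** In other words, the region S, "the part predicted in agreement with the data", is subdivided once more, and $\mu$ refers to "the part predicted well" inside it.

This is because the desired result can change with the threshold.

$\mu$ : the "true" common component of $P_S, Q_S$
$P_\mu$ : the part of $P_S$ that $Q_S$ misses, i.e. informally $P_S=\mu+P_\mu$
$Q_\mu$ : the part of $Q_S$ that $P_S$ misses, i.e. informally $Q_S=\mu+Q_\mu$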

$P_S$ and $Q_S$ can then be rewritten as the mixtures below.

$P_S={\beta'}{\color{red}\mu}+(1-{\beta'})P_{{\color{red}\mu}}$
$Q_S={\alpha'}{\color{red}\mu}+(1-{\alpha'})Q_{{\color{red}\mu}}$

Putting everything so far into a formal definition gives the following.

$v_P$ : the part of $P$ that $Q$ misses; informally $P=\mu+v_P$, where $v_P$ collects $P_\mu$ and $P_{\bar{S}}$
$v_Q$ : the part of $Q$ that $P$ misses; informally $Q=\mu+v_Q$, where $v_Q$ collects $Q_\mu$ and $Q_{\bar{S}}$
$\beta $ : recall
$\alpha$ : precision

P and Q are then written as below (see also the figure).

$P={\beta}{\color{red}\mu}+(1-{\beta}){\color{red}v_P}$
$Q={\alpha}{\color{red}\mu}+(1-{\alpha}){\color{red}v_Q}$

The PRD (Precision and Recall for Distributions) defined in the paper, $PRD(Q,P)$, is the set of precision-recall pairs of Q with respect to the distribution P.

-------------------------------------------------------

<Properties of PRD>

The PRD defined above has the following mathematical properties.

$$\begin{matrix}
(i)&(1,1)\in PRD(Q,P)&\Leftrightarrow Q=P &(equality)\\
(ii)&PRD(Q,P)=\{(0,0)\}&\Leftrightarrow supp(Q)\cap supp(P)=\varnothing & (disjoint\ supports)\\
(iii)&Q(supp(P))=\bar{\alpha}=max_{(\alpha,\beta)\in PRD(Q,P)}\alpha&& (max\ precision)\\
(iv)&P(supp(Q))=\bar{\beta}=max_{(\alpha,\beta)\in PRD(Q,P)}\beta&& (max\ recall)\\
(v)&(\alpha',\beta')\in PRD(Q,P)&& (monotonicity)\\
&if\ \alpha'\in(0,\alpha], \beta'\in(0,\beta], (\alpha,\beta)\in PRD(Q,P)&&\\
(vi)&(\alpha,\beta)\in PRD(Q,P)&\Leftrightarrow (\beta,\alpha)\in PRD(P,Q)& (duality)
\end{matrix}$$

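Applied to the toy examples above, these properties give the following, and the PRD looks like the figure below.

(c) : (i)+(v) : if PRD(Q,P) is the whole unit square, then Q=P.
(d) : (ii) : if P and Q do not overlap, PRD(Q,P) contains only the origin (0,0).
(a), (b) : (iii), (iv) : the maximal precision and the maximal recall reveal the decomposition between P and Q.
(vi) : swapping P and Q swaps the roles of precision and recall, so PRD(P,Q) is PRD(Q,P) with its two coordinates exchanged.
[PRD curves for the toy examples]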

-------------------------------------------------------

How, then, is $PRD(Q,P)$ drawn as a tradeoff over $\alpha,\beta$?

One could search for the distributions $\mu, v_P, v_Q$ for every $\alpha,\beta$ directly, but the paper instead defines $\alpha,\beta$ as functions of $\lambda\in (0,\infty)$ and obtains the PRD from them.

$$\begin{aligned}
\beta({\color{red}\lambda})&=\sum_{\omega\in \Omega}min(P(\omega),\frac{Q(\omega)}{\color{red}\lambda})\\
\alpha({\color{red}\lambda})&=\sum_{\omega\in \Omega}min({\color{red}\lambda}P(\omega),{Q(\omega)})
\end{aligned}$$

With this definition in terms of $\lambda$, $PRD(Q,P)$ has the following property.

Namely, PRD(Q,P) is obtained simply by scaling the computed $\alpha,\beta$ by a factor $\theta$.

$$PRD(Q,P)=\left\{({\color{red}\theta}\alpha(\lambda),{\color{red}\theta}\beta(\lambda))|\lambda \in (0,\infty), {\color{red}\theta \in [0,1]} \right\}$$

Also, as in the figure below, $\alpha=\lambda\beta$ holds, so $\lambda$ can be read as a slope.

In addition, evaluating $\lambda$ on an equiangular grid as below gives a simpler way to draw the PRD curve.

That is, with $\lambda$ taken from the equiangular grid, the pairs $(\alpha(\lambda),\beta(\lambda))$ themselves serve as the precision-recall pairs that approximate $PRD(Q,P)$.

$$\begin{aligned}
\Lambda&=\{tan(\frac{i}{m+1}\frac{\pi}{2})|i=1,2,\dots m\}\\
\widehat{PRD}(Q,P)&=\{(\alpha(\lambda),\beta(\lambda))|\lambda\in \Lambda\}
\end{aligned}$$
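
The PRD properties (i) to (vi) can be checked numerically on small discrete distributions. The sketch below evaluates $\alpha(\lambda)$ and $\beta(\lambda)$ on the equiangular grid; the function name `prd_curve` and the toy distributions are assumptions for illustration, loosely mirroring toy cases (a), (c), and (d).

```python
import numpy as np


def prd_curve(q, p, m=1001):
    """alpha(lambda), beta(lambda) of Q w.r.t. reference P on the equiangular grid."""
    q = np.asarray(q, dtype=np.float64)
    p = np.asarray(p, dtype=np.float64)
    i = np.arange(1, m + 1)
    lambdas = np.tan(i / (m + 1) * np.pi / 2)[:, None]  # shape [m, 1]
    alpha = np.minimum(lambdas * p, q).sum(axis=1)      # precision
    beta = np.minimum(p, q / lambdas).sum(axis=1)       # recall
    return alpha, beta


# (a) P is bi-modal, Q captures only one of its modes.
p_a, q_a = [0.5, 0.5, 0.0], [1.0, 0.0, 0.0]
alpha_a, beta_a = prd_curve(q_a, p_a)

# (c) identical distributions.
alpha_c, beta_c = prd_curve([0.5, 0.5], [0.5, 0.5])

# (d) disjoint supports.
alpha_d, beta_d = prd_curve([0.0, 1.0], [1.0, 0.0])

# (vi) duality: swapping the arguments swaps precision and recall.
alpha_swapped, beta_swapped = prd_curve(p_a, q_a)

print("case (a): max precision", alpha_a.max(), "max recall", beta_a.max())
```

Because the grid is symmetric under $\lambda \to 1/\lambda$, the swapped curve comes out in reverse order, which is why the duality check compares against a reversed array.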

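How is this done in an actual implementation? First some notation.

$\hat{P}\sim P$: samples drawn from the continuous distribution P
$\hat{Q}\sim Q$: samples drawn from the continuous distribution Q

Applied to a deep generative model, the procedure is as follows.

First, the sample images $\hat{P}, \hat{Q}$ are mapped by a pre-trained classifier into a feature space that exhibits statistical regularity.
** The images are passed through an Inception network and the activations of the Pool3 layer are used.
Then mini-batch K-means clustering in feature space groups the samples into K clusters, and $\hat{P}$ and $\hat{Q}$ are each turned into a histogram over those K clusters.
** K=20 is used.
** That is, the whole distribution is summarized by K clusters. This effectively compares the relative probability densities of the two distributions, which is later criticized as a problem.
The K cluster bins obtained for the two distributions are treated as $P$ and $Q$, and $PRD(Q,P)$ is computed by evaluating $\alpha$ and $\beta$ over a range of $\lambda$.
If the generated samples fail to fall into "many" of the clusters occupied by the true distribution, recall is low.
If the clusters that contain generated samples hold few real samples, precision is low.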

-------------------------------------------------------

<The procedure in code>

First, two distributions are set up as input. Here 10 is the number of data points and 50 is the feature dimension.

import numpy as np

# 10 samples with 50 features each; both sets are drawn from the same Gaussian
mu, sigma = 3.5, 0.3
a = [np.random.normal(mu, sigma, 50) for _ in range(10)]
b = [np.random.normal(mu, sigma, 50) for _ in range(10)]

The code from the repository below is then run as follows.

** https://github.com/msmsajjadi/precision-recall-distributions

import prd_score

prd_data_1 = prd_score.compute_prd_from_embedding(a, b)
prd_score.plot([prd_data_1], ['model_1'], out_path="fig.png")

Let us look at prd_score.compute_prd_from_embedding(a,b) in more detail.

eval_data, ref_data : [number of samples, feature dimension]
eval_dist, ref_dist : [20,] MiniBatchKMeans groups the data into 20 bins, giving for each set a probability density over the bins that sums to 1.0.
precision, recall : [1001,] precision and recall are computed for each of the 1001 angles.

def compute_prd_from_embedding(eval_data, ref_data, num_clusters=20,
                               num_angles=1001, num_runs=10,
                               enforce_balance=True):
  eval_data = np.array(eval_data, dtype=np.float64)
  ref_data = np.array(ref_data, dtype=np.float64)
  precisions = []
  recalls = []
  for _ in range(num_runs):
    eval_dist, ref_dist = _cluster_into_bins(eval_data, ref_data, num_clusters)
    precision, recall = compute_prd(eval_dist, ref_dist, num_angles)
    precisions.append(precision)
    recalls.append(recall)
  precision = np.mean(precisions, axis=0)
  recall = np.mean(recalls, axis=0)
  return precision, recall

Now the PRD computation itself in more detail.

slopes_2d : [1001,1], the slopes $\lambda$ for the 1001 angles.
ref_dist_2d, eval_dist_2d : [1,20], the 20-bin probability densities obtained above.
- ref_dist_2d*slopes_2d : [1001,20], the reference density scaled by each of the 1001 slopes.
- np.minimum(ref_dist_2d*slopes_2d, eval_dist_2d) : [1001,20], the element-wise minimum against eval_dist_2d for each of the 1001 slopes.
- np.minimum(ref_dist_2d*slopes_2d, eval_dist_2d).sum(axis=1) : [1001,] summing the minima gives precision.
- recall is obtained as $\alpha/\lambda$.
precision, recall : [1001,] precision and recall are computed for each of the 1001 angles.

def compute_prd(eval_dist, ref_dist, num_angles=1001, epsilon=1e-10):
  # Compute slopes for linearly spaced angles between [0, pi/2]
  angles = np.linspace(epsilon, np.pi/2 - epsilon, num=num_angles)
  slopes = np.tan(angles)

  # Broadcast slopes so that second dimension will be states of the distribution
  slopes_2d = np.expand_dims(slopes, 1)

  # Broadcast distributions so that first dimension represents the angles
  ref_dist_2d = np.expand_dims(ref_dist, 0)
  eval_dist_2d = np.expand_dims(eval_dist, 0)

  # Compute precision and recall for all angles in one step via broadcasting
  precision = np.minimum(ref_dist_2d*slopes_2d, eval_dist_2d).sum(axis=1)
  recall = precision / slopes

  # handle numerical instabilities leading to precision/recall just above 1
  max_val = max(np.max(precision), np.max(recall))
  if max_val > 1.001:
    raise ValueError('Detected value > 1.001, this should not happen.')
  precision = np.clip(precision, 0, 1)
  recall = np.clip(recall, 0, 1)

  return precision, recall

-------------------------------------------------------

-------------------------------------------------------

<Summary of K-means Clustering>

[https://velog.io/@jhlee508/머신러닝-K-평균K-Means-알고리즘]

1. Choose the number of clusters K : decide how many clusters to form.
2. Initialize the centroids : K initial centroids are chosen. The result depends strongly on this choice.
3. Assign the data to clusters : every data point is assigned to the cluster of its nearest centroid.
4. Update the centroids : each centroid is moved to the mean of its cluster.
5. Repeat steps 3 and 4.
6. Stop when the centroids no longer move.
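
The six steps map almost line by line onto code. The following is a minimal plain K-means sketch in NumPy, assuming random data points as initial centroids; it is not the mini-batch variant that the PRD code uses, and the function names are made up for illustration.

```python
import numpy as np


def nearest_centroid(x, centroids):
    """Index of the closest centroid for every row of x."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(x, k, num_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: use K distinct data points as the initial centroids.
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(num_iters):
        # Step 3: assign every point to its nearest centroid.
        labels = nearest_centroid(x, centroids)
        # Step 4: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 6: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids  # Step 5: repeat assignment and update.
    return centroids, nearest_centroid(x, centroids)


# Two well-separated 1D groups, chosen only for illustration.
points = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
centroids, labels = kmeans(points, k=2)
```

On this toy data the two centroids end up at the group means regardless of which two points are drawn as the initial centroids. On real feature data the outcome does depend on the initialization, as step 2 notes.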

-------------------------------------------------------

Here is an example using PRD curves. The figure below shows two GANs trained on MNIST that have the same FID but different PRD curves.

If a recall of up to about 0.6 is enough, the left model is preferable: it has higher precision, although it generates only some of the digits. If a recall above 0.6 is required, the right model is preferable: it generates all digits with good precision.

The measured precision was 96.7% for the left model and 88.6% for the right model, so the left model had somewhat higher sample quality.

The histograms in the figure also show that the left model cannot generate every class (a loss in recall), while the right model can.

This is closely related to IS, which rests on the following premises.

The conditional label distribution $p(y|x)$ should have low entropy. (=precision)
The marginal $p(y)=\int p(y|x=G(z))dz$ should have high entropy. (=recall)

However, IS requires a labeled dataset, whereas the method in this paper also applies to unlabeled datasets.

Finally, the F score mentioned earlier.

To summarize a PRD curve, the paper considers the maximum of the $F_1$ score, the harmonic mean of precision and recall, and goes further by using $F_\beta$, which weights the relative importance of the two.

$$\begin{aligned}
F_\beta&=(1+{\color{red}\beta}^2)\frac{p\cdot r}{({\color{red}\beta}^2p)+r}\\
&when\ \beta>1\sim important\ recall(r) \\
&when\ \beta<1\sim important\ precision(p) \\
\end{aligned}$$
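
To see how $\beta$ shifts the weight, the sketch below evaluates $F_\beta$ along a small hand-made PRD curve and reports the maxima for $\beta=8$ and $\beta=1/8$. The three curve points are hypothetical, and the small `eps` in the denominator is an assumption added only to avoid division by zero.

```python
import numpy as np


def f_beta(precision, recall, beta, eps=1e-10):
    """F_beta score; beta > 1 favors recall, beta < 1 favors precision."""
    p = np.asarray(precision, dtype=np.float64)
    r = np.asarray(recall, dtype=np.float64)
    return (1 + beta**2) * p * r / (beta**2 * p + r + eps)


def summarize_prd(precision, recall, beta=8):
    """Summarize a PRD curve by (max F_beta, max F_{1/beta})."""
    return (
        float(f_beta(precision, recall, beta).max()),
        float(f_beta(precision, recall, 1 / beta).max()),
    )


# Hypothetical points on a PRD curve.
curve_precision = [1.0, 0.8, 0.3]
curve_recall = [0.2, 0.6, 0.9]
f8, f_one_eighth = summarize_prd(curve_precision, curve_recall)
print(f"max F_8 = {f8:.4f}, max F_1/8 = {f_one_eighth:.4f}")
```

The maximum $F_8$ is attained at the high-recall end of the curve and the maximum $F_{1/8}$ at the high-precision end, which is why the pair works as a two-number summary of recall and precision.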

The paper therefore summarizes each PRD curve by the pair $F_\beta$ and $F_{1/\beta}$. The figure below shows the maximum $F_8$ and the maximum $F_{1/8}$ for 7 GANs and a VAE.

Model (A) : high precision, low recall
Model (B) : high precision, high recall
Model (C) : low precision, low recall
Model (D) : low precision, high recall

c. TPR, FPR for Mode Collapse

** PacGAN: The power of two samples in generative adversarial networks (NIPS'18)

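Earlier GAN papers gave no formal definition of mode collapse, so PacGAN proposes one.

The notation is as follows.

$P$ : the target real distribution
$X\sim P$ : a sample drawn from P
$Q$ : the generated distribution
$Z$ : a sample from the code vector space that is fed to the Generator
$(P,Q)$ : represented by a two-dimensional region, which is obtained from the ROC (Receiver Operating Characteristic) curve.

Mode collapse is commonly understood in two ways.

1) The generative model loses some of the modes present in the target distribution.
2) Two points in the code vector space Z are mapped to the same point in the sample space X.

The paper focuses on definition 1), i.e. it looks only at the generated samples and not at the mapping from code vectors.

That is, if two generative models have the same marginal distribution and produce the same samples, they are not treated as different, regardless of their code vectors.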

-------------------------------------------------------------

<The role of the Discriminator in a GAN>

In a GAN, the discriminator acts as a surrogate for minimizing a desired loss such as the Jensen-Shannon divergence or the variation distance.

With the cross-entropy loss of the discriminator, the objective is as follows.
** Jensen-Shannon divergence : $d_{JS}(P,Q)=\frac{1}{2}d_{KL}(P\|\frac{P+Q}{2})+\frac{1}{2}d_{KL}(Q\|\frac{P+Q}{2})$
$$\underset{G}{min}\ \underbrace{\underset{D}{max}\ \mathbb{E}_{X\sim P}[log(D(X))]+\mathbb{E}_{G(Z)\sim Q}[log(1-D(G(Z)))]}_{d_{KL}(P\|\frac{P+Q}{2})+d_{KL}(Q\|\frac{P+Q}{2})+log(1/4)}$$

For intuition, in the figure below the discriminator $D$ (blue dots) is trained to tell the data distribution $p_x$ (black dots) apart from the distribution $p_g$ produced by the Generator (green line).

The black lines at the bottom of the figure show how z, sampled uniformly from its domain, is mapped through $x=G(z)$ to the non-uniform $p_g$.

-------------------------------------------------------------

The paper says that P and Q exhibit $(\epsilon, \delta)$-mode collapse if there is a set S that satisfies both conditions below.

Condition 1. $0\leq \epsilon < \delta \leq 1$
Condition 2. the target distribution puts mass $P(S)\geq\delta$ on S, while the generated distribution puts only $Q(S)\leq\epsilon$ on it.

Distribution pairs with the same variation distance can still show different mode collapse patterns, and $\epsilon$ and $\delta$ are used to tell them apart.

A high $\delta$ together with a low $\epsilon$ means that mode collapse is severe.

An example. In the figure below, P is a uniform target distribution and Q is either $Q_1$ or $Q_2$. The pairs $(P,Q_1)$ and $(P,Q_2)$ have the same TV (Total Variation) distance, 0.2.

** TV (Total Variation) distance : $d_{TV}(P,Q)\triangleq sup_{S\subseteq X}\{P(S)-Q(S)\}$

** For example, the area by which P and $Q_1$ differ is 1*0.2=0.2.

Under the definition of mode collapse above, however, Q1 collapses more severely.

$(P,Q1)$ : $(\epsilon=0,\delta=0.2)$-mode collapse,
there is a set on which P has mass at least 0.2 while Q1 has mass 0.
$(P,Q2)$ : $(\epsilon=0.12,\delta=0.2)$-mode collapse,
there is a set on which P has mass at least 0.2 while Q2 has mass at most 0.6*0.2=0.12.
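
The numbers in this example can be reproduced on a discretized version of the toy distributions. The shapes of $Q_1$ and $Q_2$ below are assumptions reconstructed from the values quoted above (the figure itself is not available here): P is uniform on 10 bins, $Q_1$ puts no mass on the first 20% of the domain, and $Q_2$ has density 0.6 on the first half and 1.4 on the second half.

```python
import numpy as np


def tv_distance(p, q):
    """Total variation distance, sup_S {P(S) - Q(S)} = half the L1 distance."""
    return 0.5 * np.abs(p - q).sum()


def min_q_mass(p, q, delta):
    """Smallest Q(S) over unions of bins S with P(S) >= delta.

    Assumes P is uniform over the bins, so the best S simply takes the
    bins where Q is smallest.
    """
    n_bins = int(np.ceil(delta * len(p) - 1e-9))
    return np.sort(q)[:n_bins].sum()


p = np.full(10, 0.1)                     # uniform target
q1 = np.array([0.0] * 2 + [0.125] * 8)   # assumed shape of Q1
q2 = np.array([0.06] * 5 + [0.14] * 5)   # assumed shape of Q2

tv1, tv2 = tv_distance(p, q1), tv_distance(p, q2)
eps1 = min_q_mass(p, q1, delta=0.2)
eps2 = min_q_mass(p, q2, delta=0.2)
print(f"TV: {tv1:.2f} vs {tv2:.2f}, epsilon at delta=0.2: {eps1:.2f} vs {eps2:.2f}")
```

Both pairs share a TV distance of 0.2, yet at $\delta=0.2$ the smallest attainable $\epsilon$ is 0 for $Q_1$ and 0.12 for $Q_2$, matching the two statements above.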

For a more precise analysis, the paper defines a two-dimensional mode collapse region as follows.

** conv (convex hull) : the smallest convex set that contains the given region

$$\mathcal{R}(P,Q)\triangleq conv(\{(\epsilon,\delta)|\delta>\epsilon\text{ and }(P,Q)\ has\ (\epsilon, \delta)\text{-mode collapse}\})$$

The figure below shows the mode collapse regions for the toy example. $Q1, Q2$ have the same TV distance to P, but their regions have different shapes.

As noted above, a small $\epsilon$ with a large $\delta$ means strong mode collapse, so the steeper the region rises near (0,0), the more severe the collapse.

Hence the left region, which rises sharply from (0,0), shows severe mode collapse, and the right region, which rises gently from (0,0), shows milder collapse.

[Toy example shown as mode collapse regions]

The paper then states that this mode collapse region is in one-to-one correspondence with the ROC curve.

The figure below shows the hypothesis testing regions for $Q1, Q2$, built from the TPR and FPR of binary hypothesis tests.

They coincide with the mode collapse regions seen above.

[Toy example shown as hypothesis testing regions]

1. Problem

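The paper proposes a new metric, based on an explicit, non-parametric representation of the manifolds, for measuring the coverage and quality of the samples produced by a generative model.

The goal of a generative model is to learn the manifold of the training data and to generate samples that are new, yet not too different from the training data.

When such a model is evaluated, there are usually two separate goals.

Each individual sample drawn from the model should be usable as an example, i.e. it should be of high quality.
The variation across the samples drawn from the model should match that of the training data.

Existing metrics such as FID, IS, and KID fold these two goals into a single value without exposing the tradeoff, which makes it hard to diagnose the performance of a model.

In practice, models are often tuned to give up variation and generate only high-quality samples from a truncated subset of the domain, and a single score such as FID or another density metric does not show that this tradeoff was made.

Metrics that measure variation have appeared in order to point out insufficient coverage of the underlying manifold,

** Improved techniques for training GANs (NIPS'16)

** Unrolled generative adversarial networks (arxiv'16)

** PacGAN: The power of two samples in generative adversarial networks (arxiv'17)

but they tend to be

subjective,
domain specific, or
non-reliable.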

a. Precision & Recall

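Later, a method was proposed that evaluates generated samples with precision, meaning sample quality, and recall, meaning coverage of the sample distribution, but it too had weaknesses.

** Assessing generative models via precision and recall (NIPS’18)

A closer look at that weakness. Precision and recall originally mean the following.

precision : the fraction of generated images that are realistic
recall : the fraction of the training data manifold that the generator can cover

In that paper, precision and recall are redefined using the relative probability densities of the two distributions.

But measuring "relatively" leaves it ambiguous, when the two distributions differ, which of the following is the reason:

caseA. the generative model fails to cover the real distribution adequately, or
caseB. the generative model produces samples that are too unrealistic.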

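---------------------------------------------------------------

<What ambiguity?>

For example, when a generative model generates a particular class,

case A. if it produces only certain images of the class, the cause is insufficient coverage of the real distribution
case B. if it draws the class in a distorted way, the cause is unrealistic generation

but when this is expressed as a difference in relative probability densities, one cannot tell which of the two is the cause.

In the figure below the two cases are different problems, yet the relative probability density difference would be the same.

Black line : real distribution
Blue line (case A) : misses one of the peaks of the real distribution (the left one) entirely, so the coverage is insufficient.
(Under-coverage, False Negative)
Red line (case B) : covers the real distribution well but is spread too wide, so it can generate samples unlike the real ones.
(Unrealistic Samples, False Positives)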

---------------------------------------------------------------

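For this reason, that paper modeled "a continuum of precision/recall values using $\alpha, \beta$", whose extrema agree with the classic definitions of precision/recall, in order to resolve the ambiguity.

** extrema : here, the end points of the precision/recall continuum, i.e. the maximal attainable precision and the maximal attainable recall.

In practice, however, the relative formulation meant that the extrema could not be estimated accurately, so the situation could not be interpreted correctly.

This is especially the case when many samples are packed together, for example because of heavy truncation or mode collapse.

** See the collapsed section below for details on truncation.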

---------------------------------------------

<What is truncation?>

A trick used in StyleGAN: after training, the mapped latent vector $w\in latent\ space\ W$ is moved toward the average ($\bar{w}$). Regions that have low density in the training distribution are poorly learned by the generator, and truncation avoids sampling from them.

** With $\psi=0$ below the result is the average face, and with $\psi=1$ no truncation is applied.

$$\begin{aligned}
\bar{w}&=\mathbb{E}_{z\sim P(z)}[f(z)]\\
w'&=\bar{w}+\psi (w-\bar{w})
\end{aligned}$$

This loses some variation but improves quality.

Note that the term truncation in the text above refers to the resulting effect of samples concentrating near the average.
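
The two truncation equations amount to a linear interpolation toward the mean latent. The sketch below assumes the mapped latents are already available as an array; the random `w_samples` are a stand-in for $f(z)$, not StyleGAN's actual mapping network.

```python
import numpy as np


def truncate(w, w_avg, psi):
    """Truncation trick: w' = w_avg + psi * (w - w_avg)."""
    return w_avg + psi * (w - w_avg)


rng = np.random.default_rng(0)
w_samples = rng.normal(size=(1000, 8))  # stand-in for f(z), z ~ P(z)
w_avg = w_samples.mean(axis=0)          # Monte Carlo estimate of E[f(z)]

w = w_samples[0]
w_full = truncate(w, w_avg, psi=0.0)    # collapses onto the average
w_none = truncate(w, w_avg, psi=1.0)    # leaves w unchanged
w_half = truncate(w, w_avg, psi=0.5)    # halves the distance to the average
```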

---------------------------------------------

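The paper therefore improves on this, and uses precision and recall to visualize the tradeoff between quality and variety explicitly.

Besides this paper, the work below proposed training a post-hoc classifier to address the same issue.

** Revisiting precision and recall definition for generative model evaluation (arxiv’19)

** post-hoc : "after the fact". Here it refers to a classifier trained after the generative model, in order to distinguish real samples from generated ones. (In statistics the term also names follow-up tests: when an overall test across three or more groups finds a difference, post-hoc tests locate which groups differ.)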

2. Approach

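The authors consider the precision & recall of the previously described paper already valuable in that it separates quality from manifold coverage,

and they regard the idea as validated to some extent, since the vertical & horizontal extremal cases in PacGAN agree with the notions of precision & recall.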

** PacGAN: The power of two samples in generative adversarial networks (NIPS'18)

a. Algorithm

Now, how the manifolds of the real and generated data are built as an explicit non-parametric representation in order to measure precision and recall.

Notation first.

$X_r\sim P_r$ : samples drawn from the real distribution
$X_g\sim P_g$ : samples drawn from the generated distribution
$\phi_r, \Phi_r$ : a feature vector obtained by passing a real sample through a pre-trained classifier, and the set of such vectors
$\phi_g, \Phi_g$ : a feature vector obtained by passing a generated sample through a pre-trained classifier, and the set of such vectors
** The same number of feature vectors is drawn on both sides. $|\Phi_r|=|\Phi_g|$
** As the code later shows, drawing the same number is not essential.

For each set separately, the pairwise Euclidean distances between the feature vectors within the set are computed.

Then, around each feature vector, a hypersphere is formed whose radius is the distance to its kth nearest neighbor.

--------------------------------------------------------

<Parameters that determine the hyperspheres>

a. the number of samples

Like FID, the metric is only mildly affected by the number of samples; even so, it is measured with 50k samples, as FID is.

b. neighbor K

The neighborhood size K comes with the following tradeoff.

High K : covers the entire manifold, and measures precision & recall more consistently.
Low K : overestimates the volume of the manifold as little as possible, but is less consistent than a high K.

In the paper's experiments precision and recall saturate around k=3, so k=3 was chosen.

In (b) below, blue is precision and orange is recall.

c. pretrained Classifier

The paper considered the two options below for obtaining feature vectors and chose option 1.

Option 1. Feed the image to a pretrained VGG-16 classifier and use the activation vector of the second FC layer
This captures the semantic content of the image.
** Large scale GAN training for high fidelity natural image synthesis (ICLR'19)
Option 2. Use the activations of several conv layers of a pretrained VGG-16 classifier
This correlates well with human judgments of whether an image is corrupted.
** The unreasonable effectiveness of deep features as a perceptual metric (CVPR'18)

The reason given is that option 1 puts less emphasis on the exact spatial arrangement, which fits the purpose of this metric better.

--------------------------------------------------------

--------------------------------------------------------

<K-th nearest neighbor>

Picking the k-th nearest neighbor is a notion familiar from KNN (K-Nearest Neighbor Algorithm).

KNN is a classification algorithm used in supervised learning. In the figure below there are circular samples with known classes, and a red star is added:

k=3 : green circles are the majority, so the red star is classified as Class B.
k=6 : blue circles are the majority, so the red star is classified as Class A.

[https://www.jcchouinard.com/k-nearest-neighbors/]

In this way, given a number of neighbors (K), a sample is assigned to one of the predefined classes by distance (usually the Euclidean distance).

--------------------------------------------------------

Together, these hyperspheres form an estimate of the volume of the true manifold. Whether a given sample $\phi$ lies inside this volume is decided by the binary function below.

** $NN_k(\phi',\Phi)$ : the kth nearest neighbor of the vector $\phi'$ within the set $\Phi$.

$$f({\color{blue}\phi},{\color{red}\Phi})=\left\{\begin{matrix}
1,&if\ \|{\color{blue}\phi}-{\color{red}\phi'}\|_2 \leq \|{\color{red}\phi'}-NN_k({\color{red}\phi'}, {\color{red}\Phi})\|_2&\text{for at least one }{\color{red}\phi'}\in {\color{red}\Phi}\\
0,&otherwise
\end{matrix}\right.$$

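With this function, precision and recall are defined as follows.

$f(\phi, \Phi_r)$ : indicates whether an image looks realistic
$f(\phi, \Phi_g)$ : indicates whether an image could be produced by the generator
$precision(\Phi_r, \Phi_g)=\frac{1}{|\Phi_g|}\sum_{\phi_g\in \Phi_g}f(\phi_g, \Phi_r)$ : the fraction of generated images that fall within the real manifold
$recall(\Phi_r, \Phi_g)=\frac{1}{|\Phi_r|}\sum_{\phi_r\in \Phi_r}f(\phi_r, \Phi_g)$ : the fraction of real images that fall within the generated manifold

Taking precision as the example,

if, for at least one real point, the distance from that real point to the generated point is no larger than
the distance from that real point to its kth nearest neighbor,
then "the generated point $\phi_g$ lies on the approximate manifold of the real distribution $\Phi_r$".
This is evaluated for every generated point and the results are averaged.

The procedure is summarized below.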
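
As a rough illustration of the definitions of $f$, precision, and recall, here is a minimal NumPy sketch. It works on small synthetic feature sets with a brute-force distance matrix; the function names and the toy data are assumptions, and the VGG-16 feature extraction and 50k-sample batching are left out.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distances between all rows of a and all rows of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)


def knn_radii(feats, k):
    """Distance from every feature vector to its k-th nearest neighbor in the set."""
    dists = pairwise_distances(feats, feats)
    # After sorting, column 0 is the zero distance to itself, so column k is NN_k.
    return np.sort(dists, axis=1)[:, k]


def manifold_coverage(queries, ref_feats, k=3):
    """Mean of f(query, ref_feats): fraction of queries inside any hypersphere."""
    radii = knn_radii(ref_feats, k)
    dists = pairwise_distances(queries, ref_feats)
    inside = np.any(dists <= radii[None, :], axis=1)
    return float(inside.mean())


def precision_recall(real_feats, gen_feats, k=3):
    precision = manifold_coverage(gen_feats, real_feats, k)  # gen inside real manifold
    recall = manifold_coverage(real_feats, gen_feats, k)     # real inside gen manifold
    return precision, recall


# Toy mode dropping: the real data has two modes, the generator reproduces one.
rng = np.random.default_rng(0)
real_a = rng.normal(0.0, 1.0, size=(50, 4))
real_b = rng.normal(0.0, 1.0, size=(50, 4)) + 100.0
real = np.concatenate([real_a, real_b])
gen = real_a.copy()

precision, recall = precision_recall(real, gen)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this toy setup every generated point lies on the real manifold, so precision is 1, while the dropped mode leaves half of the real points uncovered, so recall is 0.5. The two set sizes differ here, which also illustrates that equal set sizes are not required by the computation itself.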

Below, mode dropping and mode invention are tested using a 2D Gaussian mixture model with 10 classes.

In (a) of the figure below, the real data consists of modes 1~5 and the generated data consists of all modes 1~10.

(b) shows the result of this paper's method and (c) that of the previous paper (F-score).

** For the F-scores $F_{1/8}, F_{8}$, precision & recall were computed with 20 clusters.

The proposed method reports both precision and recall accurately. The previous method (F-score) behaves similarly but deviates somewhat because of the k-means clustering.

b. Application

Next, the paper shows that its precision & recall are closely tied to quality and variation, and compares them with FID and with the previous method (F-score).

Two models are used below: StyleGAN, trained on FFHQ, and BigGAN, trained on ImageNet.

1. StyleGAN

The figure below shows four StyleGAN setups that differ in the amount of truncation and in training time, together with the precision/recall analysis of this paper.

Setup A : heavily truncated. The images are of high quality but resemble each other, so precision is high and recall is low.
Setup B : less truncated than A. Image quality drops, so precision is lower; variation increases, so recall is higher.
Setup C : uses the configuration tuned for FID. Faces are distorted, so precision is lower; variation in, for example, colors and accessories is high, so recall increases.
Setup D : all images are of poor quality, so precision is low, while variation and recall are preserved.

[Precision & recall results for the StyleGAN setups]

The previous method (F-score), however, rates setups B, C, D as perfect and assigns A, of all setups, a low precision.

Also, B and D receive similar FID values, which shows that FID weights variation more heavily than image quality.

(A also has a poor FID, which makes it clear that low variation is penalized.)

In short, unlike other metrics, the paper's method shows the tradeoff explicitly.

Next, the figure below tracks precision and recall as truncation is applied gradually.

The paper's method behaves as expected, as before.

** In general, as truncation increases ($\psi$ approaches 0), quality rises (precision rises) and variation falls (recall falls).

With the previous method (F-score), however, precision drops as truncation increases ($\psi$ approaches 0), when it should rise.

** In reality the generated images then contain many samples close to real images, so precision should be high.

The reason is that under truncation the generated images are packed into clusters of the embedding space that contain no real samples. They cannot be related to the real images, which leads to a low precision.

** The precision in this paper does not compare K cluster-wise probability densities. It checks whether each generated point is contained in the real manifold, so the comparison is "absolute" rather than "relative" and the problem does not arise.

The underestimated precision of the previous method (F-score) can be remedied by using fewer clusters, but that in turn underestimates recall.

2. BigGAN

ImageNet, on which BigGAN was trained, consists of 1000 classes with ~1300 images each. The classes are split here into easy and difficult ones.

A difficult class has one of the following traits:

it has a well-defined global structure,
it contains human faces that are not aligned, or
it is underrepresented in the dataset.

An easy class has one of the following traits:

it is textural,
it lacks global structure, or
it is common in the dataset.
Note that dogs and cats are split into many breed classes, so a lot of data is available for them.

Below are the precision & recall results for various truncation values. In the left figure, the points of each class run from left to right as the truncation $\psi$ grows from 0.3 to 1.0; on the right, (a) shows easy classes and (b) difficult classes.

Precision is high for the easy classes and low for the difficult classes.

Recall, among the easy classes, is notably high for dogs and cats and low for classes such as Lemon and Broccoli that have lost variation; it is high for all difficult classes.

Finally, FID is good even for classes such as Lemon and Broccoli with low variation. FID, which estimates the Wasserstein-2 distance in feature space, can thus come out well even when the intrinsic variation is somewhat low.

In conclusion, FID hides important qualitative differences, and a tradeoff between precision & recall is present inside it.

c. Advanced

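Next, let's look at how precision and recall can be used to make and improve design decisions for StyleGAN trained on FFHQ (1024x1024).

When training is monitored with FID, as is usually done, the value changes by about $\pm 14\%$ from iteration to iteration, so the checkpoints are checked one by one and the best model is selected.

With the method in this paper, however, precision and recall must both be considered, which makes this a multi-objective optimization problem; it is handled with Pareto frontiers, as in figures (a) and (b) below.

** Pareto frontier: a concept used when several objectives are optimized at the same time and a balance between conflicting objectives has to be found. The minimal subset that contains the optimal choices is computed first, and the final decision is then made by weighing the tradeoff.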
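
A minimal sketch of how a Pareto frontier over (precision, recall) pairs could be computed is given below. The function name `pareto_frontier` and the checkpoint values are hypothetical and only illustrate the idea; the paper does not publish this code.

```python
import numpy as np

def pareto_frontier(points):
    """Return indices of points not dominated by any other point.

    points: array of shape (N, 2) holding (precision, recall), higher is better.
    """
    points = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(points):
        # q dominates p if q is >= p in every objective and > p in at least one
        dominated = np.any(np.all(points >= p, axis=1) & np.any(points > p, axis=1))
        if not dominated:
            keep.append(i)
    return keep

# hypothetical (precision, recall) pairs for five checkpoints
checkpoints = [(0.70, 0.40), (0.65, 0.50), (0.60, 0.45), (0.72, 0.30), (0.50, 0.55)]
print(pareto_frontier(checkpoints))  # [0, 1, 3, 4]
```

Checkpoint 2 is dropped because checkpoint 1 is better in both precision and recall; the remaining checkpoints each win on at least one objective, so the final choice among them depends on the desired tradeoff.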

[Multi-objective optimization using a Pareto frontier]

Next, the precision introduced in this paper shows the overall quality of the generated images, but for a single generated sample it only says whether the sample is "contained or not", so it cannot be used as a score for ranking.

The paper therefore introduces the realism score, a continuous estimate of how close a single generated sample is to the real images.

First, let's slightly modify $f(\phi,\Phi)$ introduced earlier and write it as the expression below. The sample is "contained in the hypersphere" when this value is at least 1.

$$\frac{\|{\phi_r}-NN_k({\phi_r},{\Phi_r})\|_2}{\|{\phi_g}-{\phi_r}\|_2}$$

The maximum of these values over all real samples is defined as the realism score $R({\phi_g},\Phi_r)$, and $R({\phi_g},\Phi_r)\geq 1$ means that the sample is contained in at least one hypersphere.

$$
R({\color{blue}\phi_g},{\color{red}\Phi_r})=\underset{{\color{red}\phi_r}}{\max}\left\{\frac{\|{\color{red}\phi_r}-NN_k({\color{red}\phi_r},{\color{red}\Phi_r})\|_2}{\|{\color{blue}\phi_g}-{\color{red}\phi_r}\|_2}\right\}
$$

There is a problem here, however.

Where the training samples are sparse, the k-NN hypersphere becomes very large. Few generated samples will fall into such a poorly represented hypersphere, and those that do are likely to lie in a sparse or small part of its volume.

If a sample falls on the fringe of such a hypersphere, its score cannot be considered accurate, so the paper discards the half of the hyperspheres with the largest radii.

In other words, the max above is not taken over all real samples but only over those whose hypersphere radius is smaller than the median.

This pruning makes the score consistent.
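
The pruning step can be sketched as follows. This is a minimal sketch under stated assumptions: the name `realism_pruned` and the toy data are hypothetical, and radii equal to the median are kept (`<=`) so that the kept set is never empty when many radii tie.

```python
import numpy as np

def realism_pruned(feats_real, radii_real, feat_gen, eps=1e-6):
    """Realism score using only hyperspheres whose radius is at most the median."""
    keep = radii_real <= np.median(radii_real)
    dists = np.linalg.norm(feats_real[keep] - feat_gen, axis=1)
    return float((radii_real[keep] / (dists + eps)).max())

# toy example: three tight real samples and one isolated sample with a huge radius
feats_real = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
radii_real = np.array([1.0, 1.0, 1.0, 14.0])
feat_gen = np.array([8.0, 8.0])   # sits inside the large, sparse hypersphere only

dists_all = np.linalg.norm(feats_real - feat_gen, axis=1)
unpruned = float((radii_real / (dists_all + 1e-6)).max())
print(unpruned, realism_pruned(feats_real, radii_real, feat_gen))
```

Without pruning the sample looks realistic (score above 1) only because of the oversized hypersphere; with pruning its score falls well below 1.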

The figure below separates samples with high and low realism scores; samples with a low realism score clearly tend to be distorted.

The realism score can also be used to measure quality under linear interpolation in the GAN latent space $\mathcal{W}$.

For example, the figure below shows four interpolation examples between random latent vectors.

Endpoints that appear to lie outside the real manifold also look poor in quality on visual inspection.

path A : both endpoints are on the real manifold.
path B, C : one endpoint is on the real manifold and the other is outside it.
path D : both endpoints are outside the real manifold.

[Realism score applied to various interpolation paths]

Using such interpolations, the paper also ran an experiment to find out what shape the region of $\mathcal{W}$ that produces realistic images has.

First, the paper samples, without truncation, 1M latent vectors with $R\geq1$, which gives 500k interpolation paths whose endpoints both lie on the real manifold.

A path is counted as leaving the real manifold when 25% or more of its intermediate images have $R<0.9$; by this criterion only 2.4% of the paths passed through an unrealistic region.

This suggests that the subset of the latent space $\mathcal{W}$ that produces realistic images is highly convex (bulging outward).
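
The path criterion described above can be sketched as a small helper. The function name `path_leaves_manifold` and the example scores are hypothetical; only the thresholds (0.9 and 25%) come from the text.

```python
import numpy as np

def path_leaves_manifold(realism_along_path, threshold=0.9, min_fraction=0.25):
    """True if at least `min_fraction` of the intermediate images score below `threshold`."""
    scores = np.asarray(realism_along_path, dtype=float)
    return bool(np.mean(scores < threshold) >= min_fraction)

# hypothetical realism scores sampled along two interpolation paths
path_ok = [1.2, 1.1, 0.95, 0.85, 1.0, 1.3, 1.1, 1.05]   # 1 of 8 below 0.9
path_bad = [1.2, 0.8, 0.7, 0.6, 0.85, 1.0, 1.1, 1.3]    # 4 of 8 below 0.9
print(path_leaves_manifold(path_ok), path_leaves_manifold(path_bad))  # False True
```

Counting how many of the sampled paths return True would give the kind of percentage reported above.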

3. Implementation

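Now let's look at how this can actually be implemented. The code was written with reference to the repository below.

** https://github.com/youngjung/improved-precision-and-recall-metric-pytorch

Only the parts that need it are explained in detail.

a. Dataloader

First, import the libraries that are needed throughout.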

import numpy as np
import os
from glob import glob
from torch.utils.data import Dataset, DataLoader

First, the dataset for image data is set up as follows.

from PIL import Image
from torchvision import transforms

class ImageFolder(Dataset):
    def __init__(self, root, transform=None, image_size=224):
        self.fnames = glob(os.path.join(root, '**', '*.jpg'), recursive=True) + \
            glob(os.path.join(root, '**', '*.png'), recursive=True) + \
            glob(os.path.join(root, '**', '*.jpeg'), recursive=True)

        if transform is None:
            # default: resize to the VGG16 input size and apply ImageNet normalization
            transform = transforms.Compose([
                transforms.Resize([image_size, image_size]),
                transforms.ToTensor(),
                transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225]),
            ])
        self.transform = transform

    def __getitem__(self, index):
        image_path = self.fnames[index]
        image = Image.open(image_path).convert('RGB')
        if self.transform is not None:
            image = self.transform(image)
        # Output : C x H x W (the DataLoader adds the batch dimension)
        return image

    def __len__(self):
        return len(self.fnames)

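Next, the dataset for audio data is set up as follows.

No processing is applied in the dataset and only the file path is returned, because audio samples all differ in length, so they cannot be brought to the same size in the dataset before being passed on.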

class AudioFolder(Dataset):
    def __init__(self, root,transform=None):
        self.fnames = glob(os.path.join(root, '**', '*.mp3'), recursive=True) + \
            glob(os.path.join(root, '**', '*.wav'), recursive=True) 
        if len(self.fnames) == 0:
            raise ValueError('No files with this extension in this path!')    
        self.transform = transform

    def __getitem__(self, index):
        audio_path = self.fnames[index]
        return audio_path


    def __len__(self):
        return len(self.fnames)

Finally, the code below builds a data loader from the datasets above.

def get_custom_loader(dir_or_fnames, batch_size=50, num_workers=4, num_samples=-1, mode="image"):
    if isinstance(dir_or_fnames, str):
        if mode == "image":
            dataset = ImageFolder(dir_or_fnames)
        elif mode in ("audio", "audio2"):
            # both audio modes only need the file paths
            dataset = AudioFolder(dir_or_fnames)
        else:
            raise TypeError("unknown mode: {}".format(mode))
    else:
        raise TypeError("dir_or_fnames must be a directory path (str)")

    if num_samples > 0:
        dataset.fnames = dataset.fnames[:num_samples]

    data_loader = DataLoader(dataset=dataset,
                             batch_size=batch_size,
                             shuffle=False,
                             num_workers=num_workers,
                             pin_memory=True)
    return data_loader

b. Post Processor

First, import the required libraries. get_custom_loader implemented above is imported here.

import os
import numpy as np
import torch
from tqdm import tqdm, trange

from .tk_data import get_custom_loader

First, let's build a class that extracts features from images.

** After extraction, the result has shape [number of samples, 4096].

from functools import partial
import torch.nn.functional as F

class ImageExtractor():
    def __init__(self, batch_size=50, num_samples=10000, model=None):
        self.batch_size = batch_size
        self.num_samples = num_samples
        if model is None:
            print('loading vgg16 for improved precision and recall...', end='', flush=True)
            import torchvision.models as models
            self.vgg16 = models.vgg16(pretrained=True)
            print('done')
        else:
            self.vgg16 = model
        self.vgg16 = self.vgg16.cuda().eval().requires_grad_(False)
        self.mode="image"
    def extract_features_single(self, images):
        desc = 'extracting features of %d images' % images.size(0)
        num_batches = int(np.ceil(images.size(0) / self.batch_size))
        _, _, height, width = images.shape
        if height != 224 or width != 224:
            print('IPR: resizing %s to (224, 224)' % str((height, width)))
            resize = partial(F.interpolate, size=(224, 224))
        else:
            def resize(x): return x

        features = []
        for bi in trange(num_batches, desc=desc):
            start = bi * self.batch_size
            end = start + self.batch_size
            batch = images[start:end]
            batch = resize(batch)
            before_fc = self.vgg16.features(batch.cuda())
            before_fc = before_fc.view(-1, 7 * 7 * 512)
            feature = self.vgg16.classifier[:4](before_fc)
            features.append(feature.cpu().data.numpy())

        return np.concatenate(features, axis=0)

    def extract_features_from_files(self, path_or_fnames):
        dataloader = get_custom_loader(path_or_fnames, batch_size=self.batch_size, num_samples=self.num_samples, mode="image")
        num_found_images = len(dataloader.dataset)
        desc = 'extracting features of %d images' % num_found_images
        if num_found_images < self.num_samples:
            print('WARNING: num_found_images(%d) < num_samples(%d)' % (num_found_images, self.num_samples))

        features = []
        for batch in tqdm(dataloader, desc=desc):
            before_fc = self.vgg16.features(batch.cuda())
            before_fc = before_fc.view(-1, 7 * 7 * 512)
            feature = self.vgg16.classifier[:4](before_fc)
            features.append(feature.cpu().data.numpy())

        return np.concatenate(features, axis=0)

Next, let's build a class that extracts features from audio.

** After extraction, the result has shape [number of samples, 512].

import librosa
import pyloudnorm as pyln
import requests  # used below to download the CLAP checkpoint

class AudioExtractor2():
    def __init__(self, batch_size=50, num_samples=10000, model=None, content_type='music', samplingrate=44100):
        self.batch_size = batch_size
        self.num_samples = num_samples
        self.samplingrate=samplingrate
        if model is None:
            print('loading CLAP for improved precision and recall...', end='', flush=True)

            url = 'https://huggingface.co/lukewys/laion_clap/resolve/main/music_audioset_epoch_15_esc_90.14.pt'
            clap_path = 'load/clap_score/music_audioset_epoch_15_esc_90.14.pt'
            import laion_clap
            from clap_module.factory import load_state_dict
            model = laion_clap.CLAP_Module(enable_fusion=False, amodel='HTSAT-base',  device='cuda')

            # download clap_model if not already downloaded
            if not os.path.exists(clap_path):
                print('Downloading ', url, '...')
                os.makedirs(os.path.dirname(clap_path), exist_ok=True)
                response = requests.get(url, stream=True)
                total_size = int(response.headers.get('content-length', 0))

                with open(clap_path, 'wb') as file:
                    with tqdm(total=total_size, unit='B', unit_scale=True) as progress_bar:
                        for data in response.iter_content(chunk_size=8192):
                            file.write(data)
                            progress_bar.update(len(data))

            pkg = load_state_dict(clap_path)
            pkg.pop('text_branch.embeddings.position_ids', None)
            model.model.load_state_dict(pkg)
            self.clap=model
            print('done')
        else:
            self.clap = model
        self.clap = self.clap.cuda().eval().requires_grad_(False)
        self.mode="audio2"

    def extract_features_single(self, audios):
        desc = 'extracting features of %d audios' % len(audios)
        num_batches = int(np.ceil(len(audios) / self.batch_size))
        features = []
        for bi in trange(num_batches, desc=desc):
            start = bi * self.batch_size
            end = start + self.batch_size
            batch = audios[start:end]
            batch_embeddings = []  # separate name so the `audios` argument is not overwritten
            for audio_path in batch:
                audio, _ = librosa.load(audio_path, sr=self.samplingrate, mono=True) # sample rate should be 48000
                audio = pyln.normalize.peak(audio, -1.0)
                audio = audio.reshape(1, -1) # unsqueeze (1,T)
                audio = torch.from_numpy(self.int16_to_float32(self.float32_to_int16(audio))).float()
                audio_embeddings = self.clap.get_audio_embedding_from_data(x = audio, use_tensor=True)
                batch_embeddings.append(audio_embeddings.cpu().numpy())
            embeddings = np.concatenate(batch_embeddings, axis=0)
            features.append(embeddings)
        return np.concatenate(features, axis=0)


    def extract_features_from_files(self, path_or_fnames):
        dataloader = get_custom_loader(path_or_fnames, batch_size=self.batch_size, num_samples=self.num_samples, mode="audio")
        num_found_audios = len(dataloader.dataset)
        desc = 'extracting features of %d audios' % num_found_audios
        if num_found_audios < self.num_samples:
            print('WARNING: num_found_audios(%d) < num_samples(%d)' % (num_found_audios, self.num_samples))
        features = []
        for batch in tqdm(dataloader, desc=desc):
            audios=[]
            for audio_path in batch:
                audio, _ = librosa.load(audio_path, sr=self.samplingrate, mono=True) # sample rate should be 48000
                audio = pyln.normalize.peak(audio, -1.0)
                audio = audio.reshape(1, -1) # unsqueeze (1,T)
                audio = torch.from_numpy(self.int16_to_float32(self.float32_to_int16(audio))).float()
                audio_embeddings = self.clap.get_audio_embedding_from_data(x = audio, use_tensor=True)
                audios.append(audio_embeddings.cpu().numpy())
            embeddings = np.concatenate(audios, axis=0)
            features.append(embeddings)
        return np.concatenate(features, axis=0)
        
    def int16_to_float32(self, x):
        return (x / 32767.0).astype(np.float32)

    def float32_to_int16(self, x):
        x = np.clip(x, a_min=-1., a_max=1.)
        return (x * 32767.).astype(np.int16)

------------------------------------------------------------------

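<Extractor using OpenL3>

CLAP is used above because other extractors return features of shape [audio length, feature dim], which differs per clip; there is no way to bring them to the same size, so the realism score cannot be computed afterwards.

Still, the openl3-based extractor I implemented is shown below.

** After extraction, the result has shape [sum of the frame counts of all audio clips, 1024].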

import soxr
import openl3

class AudioExtractor():
    def __init__(self, batch_size=50, num_samples=10000, model=None, content_type='music', openl3_hop_size=0.5, samplingrate=44100, channels=2):
        self.batch_size = batch_size
        self.num_samples = num_samples
        self.samplingrate=samplingrate
        self.channels = channels
        self.openl3_hop_size=openl3_hop_size
        if model is None:
            print('loading openl3 for improved precision and recall...', end='', flush=True)
            self.openl3 = openl3.models.load_audio_embedding_model(input_repr="mel256", content_type=content_type, embedding_size=512)
            print('done')
        else:
            self.openl3 = model
        self.mode="audio"
    def extract_features_single(self, audios):

        desc = 'extracting features of %d audios' % len(audios)
        num_batches = int(np.ceil(len(audios) / self.batch_size))
        features = []
        for bi in trange(num_batches, desc=desc):
            start = bi * self.batch_size
            end = start + self.batch_size
            batch = audios[start:end]
            first=True
            batch_audio_l = []
            batch_audio_r = []
            batch_sr = []
            for audio_path in batch:
                audio, sr = librosa.load(audio_path, sr=None, mono=False)
                audio = audio.T
                audio = pyln.normalize.peak(audio, -1.0)            
                if audio.shape[0] < sr: 
                    print('Audio shorter than 1 sec, openl3 will zero-pad it:', audio_path, audio.shape, sr)

                # resample to the desired evaluation bandwidth
                audio = soxr.resample(audio, sr, self.samplingrate) # mono/stereo <- mono/stereo, input sr, output sr

                # mono embeddings are stored in batch_audio_l (R channel not used)
                if self.channels == 1:
                    batch_audio_l.append(audio)

                elif self.channels == 2:
                    if audio.ndim == 1:
                        # if mono, "fake" stereo by copying mono channel to L and R
                        batch_audio_l.append(audio)
                        batch_audio_r.append(audio)
                    elif audio.ndim == 2:
                        # if it's stereo separate channels for openl3
                        batch_audio_l.append(audio[:,0])
                        batch_audio_r.append(audio[:,1])

                batch_sr.append(self.samplingrate)

            # extracting mono embeddings (dim=512) or the L channel for stereo embeddings
            emb, _ = openl3.get_audio_embedding(batch_audio_l, batch_sr, model=self.openl3, verbose=False, hop_size=self.openl3_hop_size, batch_size=self.batch_size)
            # format mono embedding
            if self.channels == 1:
                emb = np.concatenate(emb,axis=0)
            # extracting stereo embeddings (dim=1024), since we concatenate L (dim=512) and R (dim=512) embeddings
            elif self.channels == 2:
                # extract the missing R channel
                emb_r, _ = openl3.get_audio_embedding(batch_audio_r, batch_sr, model=self.openl3, verbose=False, hop_size=self.openl3_hop_size, batch_size=self.batch_size)
                emb = [np.concatenate([l, r], axis=1) for l, r in zip(emb, emb_r)]
                emb = np.concatenate(emb, axis=0)
            # concatenate embeddings
            if first:
                embeddings = emb
                first = False
            else:
                embeddings = np.concatenate([embeddings, emb], axis=0)
            features.append(embeddings)
        return np.concatenate(features, axis=0)

    def extract_features_from_files(self, path_or_fnames):
        dataloader = get_custom_loader(path_or_fnames, batch_size=self.batch_size, num_samples=self.num_samples, mode="audio")
        num_found_audios = len(dataloader.dataset)
        desc = 'extracting features of %d audios' % num_found_audios
        if num_found_audios < self.num_samples:
            print('WARNING: num_found_audios(%d) < num_samples(%d)' % (num_found_audios, self.num_samples))

        features = []
        for batch in tqdm(dataloader, desc=desc):
            first=True
            batch_audio_l = []
            batch_audio_r = []
            batch_sr = []
            for audio_path in batch:
                audio, sr = librosa.load(audio_path, sr=None, mono=False)
                audio = audio.T
                audio = pyln.normalize.peak(audio, -1.0)            
                if audio.shape[0] < sr: 
                    print('Audio shorter than 1 sec, openl3 will zero-pad it:', audio_path, audio.shape, sr)

                # resample to the desired evaluation bandwidth
                audio = soxr.resample(audio, sr, self.samplingrate) # mono/stereo <- mono/stereo, input sr, output sr

                # mono embeddings are stored in batch_audio_l (R channel not used)
                if self.channels == 1:
                    batch_audio_l.append(audio)

                elif self.channels == 2:
                    if audio.ndim == 1:
                        # if mono, "fake" stereo by copying mono channel to L and R
                        batch_audio_l.append(audio)
                        batch_audio_r.append(audio)
                    elif audio.ndim == 2:
                        # if it's stereo separate channels for openl3
                        batch_audio_l.append(audio[:,0])
                        batch_audio_r.append(audio[:,1])

                batch_sr.append(self.samplingrate)

            # extracting mono embeddings (dim=512) or the L channel for stereo embeddings
            emb, _ = openl3.get_audio_embedding(batch_audio_l, batch_sr, model=self.openl3, verbose=False, hop_size=self.openl3_hop_size, batch_size=self.batch_size)
            # format mono embedding
            if self.channels == 1:
                emb = np.concatenate(emb,axis=0)
            # extracting stereo embeddings (dim=1024), since we concatenate L (dim=512) and R (dim=512) embeddings
            elif self.channels == 2:
                # extract the missing R channel
                emb_r, _ = openl3.get_audio_embedding(batch_audio_r, batch_sr, model=self.openl3, verbose=False, hop_size=self.openl3_hop_size, batch_size=self.batch_size)
                emb = [np.concatenate([l, r], axis=1) for l, r in zip(emb, emb_r)]
                emb = np.concatenate(emb, axis=0)
            # concatenate embeddings
            if first:
                embeddings = emb
                first = False
            else:
                embeddings = np.concatenate([embeddings, emb], axis=0)
            features.append(embeddings)
        return np.concatenate(features, axis=0)
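
The extractor above returns one embedding per time frame, so clips of different length give different numbers of rows. One possible workaround, which is not part of the original code and is only an assumption here, is to average each clip's frame embeddings over time so that every clip yields a single fixed-size vector. The helper name `pool_over_time` and the random arrays standing in for OpenL3 outputs are hypothetical.

```python
import numpy as np

def pool_over_time(frame_embeddings):
    """Average a list of [T_i, dim] arrays over time, giving one [N, dim] array."""
    return np.stack([emb.mean(axis=0) for emb in frame_embeddings], axis=0)

# toy stand-ins for per-clip frame embeddings with different numbers of frames
rng = np.random.default_rng(0)
clips = [rng.normal(size=(t, 1024)) for t in (7, 12, 3)]
pooled = pool_over_time(clips)
print(pooled.shape)  # (3, 1024)
```

Whether time-averaged embeddings give a meaningful manifold for precision, recall, or realism would still have to be checked empirically.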

------------------------------------------------------------------

c. Improved Precision and Recall

First, import what is needed. The Extractors implemented above are imported later.

from collections import namedtuple
import numpy as np
from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter
from tqdm import tqdm, trange
import torch

Manifold = namedtuple('Manifold', ['features', 'radii'])
PrecisionAndRecall = namedtuple('PrecisionAndRecall', ['precision', 'recall'])

The IPR object is declared as follows. A different Extractor is loaded depending on the modality of the data to process.

class IPR():
    def __init__(self, batch_size=50, k=3, num_samples=10000, path_real=None, model=None, mode="image"):
        self.manifold_ref = None
        self.k = k
        self.mode=mode
        if self.mode=="image":
            from tools.tk_model import ImageExtractor
            self.extractor=ImageExtractor(batch_size=batch_size, num_samples=num_samples, model=model)
        elif self.mode=="audio":
            from tools.tk_model import AudioExtractor
            self.extractor=AudioExtractor(batch_size=batch_size, num_samples=num_samples, model=model)
        elif self.mode=="audio2":
            from tools.tk_model import AudioExtractor2
            self.extractor=AudioExtractor2(batch_size=batch_size, num_samples=num_samples, model=model)
        else:
            raise Exception("Wrong mode : {}".format(mode))      

        if path_real is None:
            raise Exception("Give Reference Data path")
        self.manifold_ref = self.compute_manifold(path_real)
    def __call__(self, subject):
        return self.precision_and_recall_with(subject)

    def precision_and_recall_with(self, subject):
        assert self.manifold_ref is not None, "reference manifold has not been computed"

        manifold_subject = self.compute_manifold(subject)
        precision = compute_metric(self.manifold_ref, manifold_subject.features, 'computing precision...')
        recall = compute_metric(manifold_subject, self.manifold_ref.features, 'computing recall...')
        return PrecisionAndRecall(precision, recall)

    def compute_manifold(self, input):
        # features
        if isinstance(input, str):
            if input.endswith('.npz'):  # input is precalculated file
                print('loading', input)
                f = np.load(input)
                feats = f['feature']
                radii = f['radii']
                f.close()
                return Manifold(feats, radii)
            else:  # input is dir
                feats = self.extractor.extract_features_from_files(input)
        elif isinstance(input, torch.Tensor):
            feats = self.extractor.extract_features_single(input)
        elif isinstance(input, np.ndarray):
            input = torch.Tensor(input)
            feats = self.extractor.extract_features_single(input)
        elif isinstance(input, list):
            if isinstance(input[0], torch.Tensor):
                input = torch.cat(input, dim=0)
                feats = self.extractor.extract_features_single(input)
            elif isinstance(input[0], np.ndarray):
                input = np.concatenate(input, axis=0)
                input = torch.Tensor(input)
                feats = self.extractor.extract_features_single(input)
            elif isinstance(input[0], str):  # input is list of fnames
                feats = self.extractor.extract_features_from_files(input)
            else:
                raise TypeError
        else:
            print(type(input))
            raise TypeError

        # radii
        distances = compute_pairwise_distances(feats)
        radii = distances2radii(distances, k=self.k)
        return Manifold(feats, radii)

    def save_ref(self, fname):
        print('saving manifold to ', fname, '...')
        np.savez_compressed(fname,
                            feature=self.manifold_ref.features,
                            radii=self.manifold_ref.radii)

Now let's look at the core functions used above.

First, compute_pairwise_distances can measure the following two things.

the pairwise distances between X and Y
the pairwise distances between X and itself

def compute_pairwise_distances(X, Y=None):
    '''
    args:
        X: np.array of shape N x dim
        Y: np.array of shape M x dim (defaults to X)
    returns:
        N x M np.array of Euclidean distances (symmetric when Y is None)
    '''
    X = X.astype(np.float64)  # float64 limits rounding error in the expansion below
    Y = X if Y is None else Y.astype(np.float64)
    X_norm_square = np.sum(X**2, axis=1, keepdims=True)    # N x 1
    Y_norm_square = np.sum(Y**2, axis=1, keepdims=True).T  # 1 x M
    # ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2, broadcast to N x M
    diff_square = X_norm_square - 2 * np.dot(X, Y.T) + Y_norm_square

    # check negative distance
    min_diff_square = diff_square.min()
    if min_diff_square < 0:
        idx = diff_square < 0
        diff_square[idx] = 0
        print('WARNING: %d negative diff_squares found and set to zero, min_diff_square=' % idx.sum(),
              min_diff_square)

    distances = np.sqrt(diff_square)
    return distances
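
As a sanity check of the expansion used in this function, the sketch below compares it with the direct definition of the Euclidean distance on small random arrays. The variable names are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
Y = rng.normal(size=(3, 4))

# expansion used above: ||x||^2 - 2 x.y + ||y||^2
sq = (X**2).sum(axis=1, keepdims=True) - 2 * X @ Y.T + (Y**2).sum(axis=1)
fast = np.sqrt(np.maximum(sq, 0))          # clamp tiny negatives caused by rounding

# direct definition: norm of every pairwise difference
brute = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)

print(fast.shape, np.allclose(fast, brute))
```

The expansion avoids materializing the N x M x dim difference tensor, which matters when N and M are in the tens of thousands.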

Next, distances2radii measures the distance to the k-th nearest neighbor. The input here is normally the "distances to itself" matrix from above.

def distances2radii(distances, k=3):
    num_features = distances.shape[0]
    radii = np.zeros(num_features)
    for i in range(num_features):
        radii[i] = get_kth_value(distances[i], k=k)
    return radii


def get_kth_value(np_array, k):
    kprime = k+1  # kth NN should be (k+1)th because closest one is itself
    idx = np.argpartition(np_array, kprime)
    k_smallests = np_array[idx[:kprime]]
    kth_value = k_smallests.max()
    return kth_value
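
The "k+1" offset can be checked on a tiny example. The sketch below uses `np.partition`, which yields the same value as the `np.argpartition` approach above; the function name `kth_nn_distance` and the distances are made up for illustration.

```python
import numpy as np

def kth_nn_distance(row, k):
    """k-th nearest-neighbor distance for a row that includes the zero self-distance."""
    # np.partition places the (k+1)-th smallest value at index k
    return np.partition(row, k)[k]

row = np.array([0.0, 2.5, 0.7, 1.9, 0.3, 3.1])   # distances from one point; 0.0 is itself
print(kth_nn_distance(row, k=3))                 # 1.9
print(np.sort(row)[3])                           # same value via a full sort
```

Because the smallest entry is always the point's distance to itself, the third real neighbor is the fourth smallest value in the row.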

Finally, compute_metric computes the distances between the features of the reference and the generated samples and checks how many of them are contained. The function is called as follows.

precision = compute_metric(self.manifold_ref, manifold_subject.features, 'computing precision...')
recall = compute_metric(manifold_subject, self.manifold_ref.features, 'computing recall...')

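As the code above shows, precision evaluates the target against the reference manifold, and recall evaluates the reference against the target manifold. The target and the reference therefore do not need the same number of samples, but if either one uses too little data, its manifold will not be estimated well.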

def compute_metric(manifold_ref, feats_subject, desc=''):
    num_subjects = feats_subject.shape[0]
    count = 0
    dist = compute_pairwise_distances(manifold_ref.features, feats_subject)
    for i in trange(num_subjects, desc=desc):
        count += (dist[:, i] < manifold_ref.radii).any()
    return count / num_subjects
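
To make the containment test concrete, here is a small vectorized sketch of the same computation on hand-made 2D data. The function name `coverage` and the toy points are hypothetical; the strict `<` comparison follows the code above.

```python
import numpy as np

def coverage(feats_ref, radii_ref, feats_subject):
    """Fraction of subject points inside at least one reference hypersphere."""
    dist = np.linalg.norm(feats_ref[:, None, :] - feats_subject[None, :, :], axis=-1)
    inside = (dist < radii_ref[:, None]).any(axis=0)   # one flag per subject point
    return inside.mean()

feats_ref = np.array([[0.0, 0.0], [4.0, 0.0]])
radii_ref = np.array([1.0, 1.0])
feats_subject = np.array([[0.5, 0.0], [4.0, 0.5], [2.0, 0.0], [9.0, 9.0]])
print(coverage(feats_ref, radii_ref, feats_subject))   # 0.5
```

Two of the four subject points fall inside a reference hypersphere, so the result is 0.5. Calling it with the real manifold as reference gives precision, and swapping the roles gives recall.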

Now let's run it. (args comes from the parser shown under <Input Arguments> below.)

path_real=""
path_fake=""
mode="image"

ipr_reference = IPR(args.batch_size, args.k, args.num_samples, path_real, mode=mode)

precision, recall = ipr_reference(path_fake)

print('precision:', precision)
print('recall:', recall)

----------------------------------------------------------------

<Input Arguments>

These are similar to the referenced code.

parser = ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatter)
parser.add_argument('--batch_size', type=int, default=50, help='Batch size to use')
parser.add_argument('--k', type=int, default=3, help='k for the k-th nearest neighbor radius')
parser.add_argument('--num_samples', type=int, default=50, help='number of samples to use')
parser.add_argument('--only_save', action='store_true', help='fname for precalculating manifold')
args = parser.parse_args()

----------------------------------------------------------------

d. Realism Score

First, import the required libraries.

import os
import numpy as np

realism is implemented with the function below.

** The pruning mentioned in the paper does not seem to be implemented.

def realism(manifold_ref, feat_subject):
    feats_real = manifold_ref.features
    radii_real = manifold_ref.radii
    diff = feats_real - feat_subject
    dists = np.linalg.norm(diff, axis=1)
    eps = 1e-6
    ratios = radii_real / (dists + eps)
    max_realism = float(ratios.max())
    return max_realism

The wrapper that runs the core function above is as follows.

It simply handles the input whose score is to be measured according to the form in which it is given.

def realism_wrapper(ipr_reference, target, num_samples=1, batch_size=50):
    manifold_ref = ipr_reference.manifold_ref
    extractor = ipr_reference.extractor
    mode = ipr_reference.mode

    if os.path.isdir(target):
        from .tk_data import get_custom_loader
        # realism is defined for a single sample, so only the first file is loaded
        dataloader = get_custom_loader(target, batch_size=batch_size, num_samples=1, mode=mode)
        print('found %d images in ' % len(dataloader.dataset) + target)
        # exact_target: torch.Tensor of 1 x C x H x W (a list with one path in the audio modes)
        if len(dataloader.dataset) == 0:
            raise Exception("Empty dataset for {}".format(target))
        exact_target = next(iter(dataloader))
    elif os.path.isfile(target):
        if target.endswith(".jpg") or target.endswith(".jpeg") or target.endswith(".png"):
            if extractor.mode != "image" or mode != "image":
                raise Exception("This is not proper path for target : {}".format(target))
            from PIL import Image
            image = Image.open(target).convert('RGB')

            # TODO Take from dataloader
            from torchvision import transforms
            transform = []
            transform.append(transforms.Resize([224,224]))
            transform.append(transforms.ToTensor())
            transform.append(transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                  std=[0.229, 0.224, 0.225]))
            transform = transforms.Compose(transform)
            exact_target = transform(image).unsqueeze(0)
        elif target.endswith(".mp3") or target.endswith(".wav"):
            if not ((extractor.mode == "audio" and mode == "audio") or (extractor.mode == "audio2" and mode == "audio2")):
                raise Exception("This is not proper path for target : {}".format(target))
            exact_target = [target]
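        elif target.endswith(".npy"):
            # precomputed inputs are not handled yet; fail explicitly instead of leaving exact_target undefined
            raise NotImplementedError(".npy targets are not supported : {}".format(target))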
        else:
            raise Exception("This is not proper path for target : {}".format(target))
    else:
        raise Exception("This is not proper path for target : {}".format(target))

    feat = extractor.extract_features_single(exact_target)
    return realism(manifold_ref, feat)

Let's run it. This requires the IPR object instance created above.

# Example usage: realism of a real image
realism_score = realism_wrapper(ipr_reference, path_real)
print('realism of sample:', realism_score)

mode collapse : https://dl-ai.blogspot.com/2017/08/gan-problems.html

inception score, FID : https://wikidocs.net/149481

k-means clustering : https://velog.io/@jhlee508/머신러닝-K-평균K-Means-알고리즘

TPR : https://only-wanna.tistory.com/entry/Classification-Metrics%EB%B6%84%EB%A5%98-%EB%AA%A8%EB%8D%B8-%EC%A7%80%ED%91%9C-%EC%95%8C%EC%95%84%EB%B3%B4%EA%B8%B0-TPR-FPR%EA%B3%BC-ROC-Curve-%EC%82%AC%EC%9D%B4-%EA%B4%80%EA%B3%84-%EB%B0%8F-AUC

ROC curve : https://diseny.tistory.com/entry/ROC-%EA%B3%A1%EC%84%A0-%EC%95%84%EC%A3%BC-%EC%89%BD%EA%B2%8C-%EC%9D%B4%ED%95%B4%ED%95%98%EA%B8%B0

PR curve : https://nanunzoey.tistory.com/entry/ROC-%EA%B3%A1%EC%84%A0-vs-P-R-%EA%B3%A1%EC%84%A0#google_vignette


License: Attribution, NonCommercial, NoDerivatives
