Siamese Network

Say the two identical networks took images $x^{(i)}$ and $x^{(j)}$ as inputs and computed encodings $f(x^{(i)})$ and $f(x^{(j)})$ in a certain hidden layer. To compute the difference between the images, we compute $d(x^{(i)}, x^{(j)}) = \|f(x^{(i)}) - f(x^{(j)})\|_2^2$ and consider the images to be of the same person if this value is below a certain threshold.
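
As a minimal sketch of this comparison step, assuming the shared network has already produced fixed-size encodings (128 dimensions and the `threshold` value here are illustrative assumptions, not values from the source):

```python
import numpy as np

def same_person(enc_i: np.ndarray, enc_j: np.ndarray, threshold: float = 0.7) -> bool:
    """Compare two encodings f(x^(i)) and f(x^(j)) from the shared network.
    `threshold` is a hypothetical tuned hyperparameter."""
    d = np.sum((enc_i - enc_j) ** 2)  # squared L2 distance ||f(x_i) - f(x_j)||_2^2
    return d < threshold

# Example: stand-ins for two encodings produced by the network
enc_a = np.random.randn(128)
enc_b = enc_a + 0.01 * np.random.randn(128)  # a near-duplicate encoding
print(same_person(enc_a, enc_b))             # True: distance falls below threshold
```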

Triplet Loss

To learn the parameters for a Siamese network so as to generate a good encoding of an input image, we apply gradient descent on the triplet loss function.

For every image (called anchor image A), we consider a positive image P and a negative image N. P will be similar to A and N will not be similar to A (i.e. A and P are pictures of the same person while N is a picture of a different person).

Our aim is to satisfy the following equation:

$d(A, P) + \alpha \leq d(A, N)$

i.e. $d(A, P) - d(A, N) + \alpha \leq 0$

where $\alpha$ is called the margin.

The triplet loss function is given by:

$L(A, P, N) = \max(d(A, P) - d(A, N) + \alpha, 0)$

and the cost function will be:

$J = \sum_{i=1}^{m} L(A^{(i)}, P^{(i)}, N^{(i)})$
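
A minimal NumPy sketch of the loss and cost above, operating on precomputed encodings for a batch of $m$ triplets (the array shapes and the default $\alpha$ are assumptions for illustration):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """L(A, P, N) = max(d(A, P) - d(A, N) + alpha, 0) per triplet.
    f_a, f_p, f_n: (m, d) arrays of anchor/positive/negative encodings."""
    d_ap = np.sum((f_a - f_p) ** 2, axis=1)  # d(A, P) for each triplet
    d_an = np.sum((f_a - f_n) ** 2, axis=1)  # d(A, N) for each triplet
    return np.maximum(d_ap - d_an + alpha, 0.0)

def cost(f_a, f_p, f_n, alpha=0.2):
    """J = sum of L(A^(i), P^(i), N^(i)) over the m training triplets."""
    return np.sum(triplet_loss(f_a, f_p, f_n, alpha))
```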

Note that while training, we must not choose A, P, N randomly, since randomly chosen triplets tend to satisfy the margin constraint easily and provide little gradient signal. Instead, we must choose triplets where $d(A, P)$ is very close to $d(A, N)$. Training on these "hard" triplets forces gradient descent to learn parameters that push $d(A, N)$ at least a margin $\alpha$ above $d(A, P)$, even when A and N look similar; a selection sketch follows below.
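
As a sketch of this selection strategy (the filtering criterion and array shapes are assumptions, continuing the batch layout used above), one can keep only the candidate triplets whose negatives violate or nearly violate the margin:

```python
import numpy as np

def select_hard_triplets(f_a, f_p, f_n, alpha=0.2):
    """Keep triplets where d(A, P) is close to d(A, N), i.e. the margin
    constraint is not comfortably satisfied -- the cases gradient
    descent actually learns from."""
    d_ap = np.sum((f_a - f_p) ** 2, axis=1)
    d_an = np.sum((f_a - f_n) ** 2, axis=1)
    hard = d_an - d_ap < alpha   # loss is nonzero exactly for these triplets
    return np.where(hard)[0]     # indices of the informative triplets
```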

Also, since A and P are images of the same person, we will need multiple images of the same person while training (in contrast to one-shot recognition at test time, which needs only a single image per person).
