# Siamese Network

The Siamese network architecture dates back to the early 1990s; its use for face verification was popularized by the 2014 "DeepFace" paper. It consists of a pair of identical deep neural networks (sharing the same parameters), each of which takes an image as input and computes an encoding in one of its layers.

![](https://3668624528-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M5-0RGo4dZhdwCIHrC1%2F-M5-0TEuN77zAmTpQjIr%2F-M5-0Vl_cgdxNTSqQPWJ%2FSiamese%20Neural%20Networks.JPG?generation=1586990827721212\&alt=media)

Say the two identical networks took images $$x^{(i)}$$ and $$x^{(j)}$$ as inputs and computed encodings $$f(x^{(i)})$$ and $$f(x^{(j)})$$ in a certain hidden layer. To measure the difference between the images, we compute $$d(x^{(i)}, x^{(j)}) = \lVert f(x^{(i)}) - f(x^{(j)}) \rVert^2$$ and consider the images to be of the same person if this value is below a certain threshold.
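The distance-and-threshold check above can be sketched in a few lines of plain Python. This is only an illustration: the function names, the toy three-dimensional encodings, and the threshold value of 0.5 are all assumptions, and a real system would obtain the encodings from the trained network.

```python
def squared_distance(enc_i, enc_j):
    """Squared Euclidean distance between two encoding vectors."""
    return sum((a - b) ** 2 for a, b in zip(enc_i, enc_j))


def same_person(enc_i, enc_j, threshold):
    """Declare a match when the squared distance is below the threshold."""
    return squared_distance(enc_i, enc_j) < threshold


# Made-up encodings standing in for f(x) from a hidden layer (hypothetical values).
enc_a = [0.1, 0.9, 0.3]
enc_b = [0.2, 0.8, 0.3]  # close to enc_a
enc_c = [0.9, 0.1, 0.7]  # far from enc_a

print(same_person(enc_a, enc_b, threshold=0.5))  # True
print(same_person(enc_a, enc_c, threshold=0.5))  # False
```

In practice the threshold is a hyperparameter that would be tuned on held-out pairs.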

## Triplet Loss

To learn the parameters of a Siamese network so that it generates a good encoding of an input image, we apply gradient descent to the triplet loss function.

For every image (called anchor image A), we consider a positive image P and a negative image N. P will be similar to A and N will not be similar to A (i.e. A and P are pictures of the same person while N is a picture of a different person).

![](https://3668624528-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M5-0RGo4dZhdwCIHrC1%2F-M5-0TEuN77zAmTpQjIr%2F-M5-0VlbCxdlhqdza-7B%2FTriplet%20Loss.JPG?generation=1586990838940021\&alt=media)

Our aim is to satisfy the following inequality:

$$d(A, P) + \alpha \le d(A, N)$$

i.e. $$d(A, P) - d(A, N) + \alpha \le 0$$

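where $$\alpha$$ is called the margin. The margin rules out the trivial solution in which every image gets the same encoding, since that would make both distances zero.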

The triplet loss function is given by:

$$L(A, P, N) = \max(d(A, P) - d(A, N) + \alpha, 0)$$

and the cost function will be:

$$J = \sum_{i=1}^m L(A^{(i)}, P^{(i)}, N^{(i)})$$
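As a rough sketch, the loss and cost above translate directly into plain Python. The function names and the toy two-dimensional encodings are assumptions for illustration; the encodings stand in for $$f(A)$$, $$f(P)$$ and $$f(N)$$, and an actual training setup would compute this inside a framework that supports automatic differentiation.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two encoding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(f_a, f_p, f_n, alpha):
    """L(A, P, N) = max(d(A, P) - d(A, N) + alpha, 0)."""
    return max(squared_distance(f_a, f_p) - squared_distance(f_a, f_n) + alpha, 0.0)


def triplet_cost(triplets, alpha):
    """J = sum of the triplet loss over all (anchor, positive, negative) triplets."""
    return sum(triplet_loss(f_a, f_p, f_n, alpha) for f_a, f_p, f_n in triplets)


# Hypothetical encodings: the first triplet satisfies the margin, the second does not.
easy = ([0.0, 0.0], [1.0, 0.0], [3.0, 0.0])  # d(A,P) = 1, d(A,N) = 9
hard = ([0.0, 0.0], [1.0, 0.0], [1.0, 1.0])  # d(A,P) = 1, d(A,N) = 2

print(triplet_loss(*easy, alpha=2.0))        # 0.0
print(triplet_loss(*hard, alpha=2.0))        # 1.0
print(triplet_cost([easy, hard], alpha=2.0)) # 1.0
```

Only the second triplet contributes to the cost, because the first already satisfies the margin constraint.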

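Note that while training, we should not choose A, P, N randomly: for a randomly chosen triplet, N is usually very different from A, so the constraint is satisfied easily, the loss is zero, and the network learns little from it. Instead, we should choose triplets that are hard to train on, where $$d(A, P)$$ is very close to $$d(A, N)$$. This forces gradient descent to find parameters that push $$d(A, N)$$ at least $$\alpha$$ above $$d(A, P)$$, even for similar-looking A and N images.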
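One simple way to favour such triplets, shown here only as a hypothetical heuristic and not as a method prescribed by these notes, is to pick for each anchor the negative whose encoding is currently closest to it. The function name `hardest_negative` and the toy encodings are assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two encoding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def hardest_negative(f_anchor, negative_encodings):
    """Return the index of the negative encoding closest to the anchor.

    A small d(A, N) brings it near d(A, P), which makes the triplet
    harder to satisfy and therefore more informative for training.
    """
    distances = [squared_distance(f_anchor, f_n) for f_n in negative_encodings]
    return min(range(len(distances)), key=distances.__getitem__)


# Hypothetical encodings of one anchor and three different people.
f_anchor = [0.0, 0.0]
negatives = [[5.0, 5.0], [1.0, 1.0], [3.0, 0.0]]  # squared distances: 50, 2, 9

print(hardest_negative(f_anchor, negatives))  # 1
```

Because the encodings change as the parameters are updated, a selection like this would need to be recomputed periodically during training.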

Also, since A and P must be images of the same person, training requires multiple images of the same person (as opposed to the one-shot setting, where only a single image per person is available).
