I’ve heard about t-SNE recently in a few places, but I have no clue how it works. I’m going to give it a shot here and see if I can understand it.
t-SNE (t-distributed Stochastic Neighbor Embedding) is a dimensionality reduction technique that is often used to visualize high-dimensional data in two or three dimensions. Unlike PCA, t-SNE is a nonlinear technique. The method builds a similarity distribution over pairs of points in the full input space and another over pairs of points in the reduced space, then minimizes the difference between these two distributions to find the best mapping into the lower-dimensional space.
The parameters of the mapping are the low-dimensional coordinates themselves, which are estimated by gradient descent. This can take a long time to run, and will be much slower than PCA.
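To make that concrete, here’s a minimal sketch using scikit-learn’s `TSNE` on the digits dataset (assuming scikit-learn and matplotlib are installed; the dataset and parameter values are just illustrative choices, not anything canonical):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

digits = load_digits()            # 1797 samples, 64 dimensions
X, y = digits.data, digits.target

# Map 64 dimensions down to 2; perplexity controls the effective
# neighborhood size used when building the similarity distribution.
emb = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=5)
plt.show()
```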
The similarity of point \(x_j\) to \(x_i\) in the full input space is given by \[p_{j|i} = \frac{\exp(-||x_i-x_j||^2/(2\sigma_i^2))}{\sum_{k\neq i} \exp(-||x_i-x_k||^2/(2\sigma_i^2))} \] Here \(\sigma_i\) is a bandwidth chosen separately for each point, typically so that each point’s similarity distribution has a user-specified perplexity.
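As a sanity check on the formula, here’s a direct NumPy translation; the function name and the per-point bandwidths passed in as `sigmas` are my own choices for the sketch, not from any library:

```python
import numpy as np

def conditional_p(X, sigmas):
    """P[i, j] = p_{j|i} for data X (n x d) and per-point bandwidths sigmas (n,)."""
    # Pairwise squared distances ||x_i - x_j||^2.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)               # the sum excludes k = i, and p_{i|i} = 0
    return P / P.sum(axis=1, keepdims=True)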
The low-dimensional points corresponding to \(x_i\) and \(x_j\) are \(y_i\) and \(y_j\). The similarity between these is given by \[q_{j|i} = \frac{\exp(-||y_i-y_j||^2)}{\sum_{k\neq i} \exp(-||y_i-y_k||^2)} \] The \(\sigma\)’s have been removed because we are free to choose the scale of the low-dimensional space. (Strictly, this Gaussian form is the original SNE; t-SNE replaces it with a Student-t kernel, \(q \propto (1+||y_i-y_j||^2)^{-1}\), whose heavier tails reduce crowding, and that is where the “t” in the name comes from.) If the mapping were perfect, we would have \(p_{j|i} = q_{j|i}\). Thus we will try to create the best mapping by minimizing the discrepancy between these two distributions. As is common with probability distributions, the discrepancy is measured with the Kullback-Leibler divergence, which is not a true distance metric since it is not symmetric.
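Continuing the sketch above, the low-dimensional similarities and the KL objective might look like this (following the Gaussian \(q\) written above; swapping in the Student-t kernel would give the t-SNE version):

```python
def conditional_q(Y):
    """Q[i, j] = q_{j|i} for low-dimensional points Y (n x 2)."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = np.exp(-sq)                        # no sigma: the scale here is fixed
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum(axis=1, keepdims=True)

def kl_objective(P, Q, eps=1e-12):
    """Sum over i of KL(P_i || Q_i); this is the quantity gradient descent minimizes."""
    return np.sum(P * np.log((P + eps) / (Q + eps)))
```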
Pros and Cons
Pros:

- Often produces better visualizations than PCA, since the nonlinear mapping can separate clusters that PCA would project on top of each other

Cons:

- Much slower than PCA