A good article on details of word2vec networks

word2vec explain

Which illustrates that the word2vec network loss function is in fact the inner product of context words vector and target word vector which can be also understood as a special case of the cross-entropy measurement between two probabilistic

distributions.

About SVD

Because U and V are unitary, we know that the columns U1, ..., Um of U yield an orthonormal basis of Km and the columns V1, ..., Vn of V yield an orthonormal basis of Kn (with respect to the standard scalar products on these spaces).

The linear transformation

{\begin{cases}T:K^{n}\to K^{m}\x\mapsto \mathbf {M} x\end{cases))

has a particularly simple description with respect to these orthonormal bases: we have

T(\mathbf {V} _{i})=\sigma _{i}\mathbf {U} _{i},\qquad i=1,\cdots ,\min(m,n),

where σi is the i-th diagonal entry of Σ, and T(Vi) = 0 for i > min(m,n).

The geometric content of the SVD theorem can thus be summarized as follows: for every linear map T : Kn → Km one can find orthonormal bases of Kn and Km such that T maps the i-th basis vector of Kn to a non-negative multiple of the i-th basis vector of Km, and sends the left-over basis vectors to zero. With respect to these bases, the map T is therefore represented by a diagonal matrix with non-negative real diagonal entries.

To get a more visual flavour of singular values and SVD factorization — at least when working on real vector spaces — consider the sphere S of radius one in Rn. The linear map T maps this sphere onto an ellipsoid in Rm. Non-zero singular values are simply the lengths of the semi-axes of this ellipsoid. Especially when n = m, and all the singular values are distinct and non-zero, the SVD of the linear map T can be easily analysed as a succession of three consecutive moves: consider the ellipsoid T(S) and specifically its axes; then consider the directions in Rn sent by T onto these axes. These directions happen to be mutually orthogonal. Apply first an isometry V∗ sending these directions to the coordinate axes of Rn. On a second move, apply an endomorphism D diagonalized along the coordinate axes and stretching or shrinking in each direction, using the semi-axes lengths of T(S) as stretching coefficients. The composition D ∘ V∗ then sends the unit-sphere onto an ellipsoid isometric to T(S). To define the third and last move U, apply an isometry to this ellipsoid so as to carry it over T(S). As can be easily checked, the composition U ∘ D ∘ V∗ coincides with T.