# Word_2_vector.. (aka word embeddings)

## Word 2 vector:

• word 2 vector is a way to take a big set of text and convert into a matrix with a word at
each row.
• It is a shallow neural-network(2 layers)
• Two options/training methods (

### CBOW(Continuous-bag-of-words assumption)

• — a text is represented as the bag(multiset) of its words
• — disregards grammar
• — disregards word order but keeps multiplicity
• — Also used in computer vision

### skip-gram() — it is a generalization

of n-grams(which is basically a markov chain model, with (n-1)-order)
* — It is a n-1 order markov model
* — Used in Protein sequencing, DNA Sequencing, Computational
linguistics(character and word)
* — Models sequences, using the statistical properties of n-grams
* — predicts $x_i$ based on $x_(i-(n-1)), ....,x_(i-1)$.
* — in language modeling independence assumptions are made so that each
word depends only on n-1 previous words.(or characters in case of
character level modeling)
* — The probability of a word conditional on previous n-1 words follows a
Categorical Distribution
* — In practice, the probability distributions are smoothed by assigning non-zero probabilities to unseen words or n-grams.

• — Finding the right ‘n’ for a model is based on the Bias Vs Variance tradeoff we’re wiling to make

## Smoothing Techniques:

• — Problems of balance weight between infrequent n-grams.
• — Unseen n-grams by default get 0.0 without smoothing.
• — Use pseudocounts for unseen n-grams.(generally motivated by
bayesian reasoning on the sub n-grams, for n < original n)

• — Skip grams also allow the possibility of skipping. So a 1-skip bi(2-)gram would create bigrams while skipping the second word in a three sequence.

• — Could be useful for languages with less strict subject-verb-object order than English.

• Depends on Distributional Hypothesis
• Vector representations of words called “word embeddings”
• Basic motivation is that compared to audio, visual domains, the word/text domain treats
them as discrete symbols, encoding them as sparse dataset. Vector based representation
works around these issues.
• Also called as vector Space models
• Two ways of training: a, CBOW(Continuous-Bag-Of-Words) model predicts target words, given
a group of words, b, skip-gram is ulta. aka predicts group of words from a given word.

• Trained using the Maximum Likelihood model

• Ideally, Maximizes probability of next word given the previous ‘h’ words in terms of a softmax function
• However, calculating the softmax values requires computing and normalizing each probability using score for all the other words in context at every step.
• Therefore a logistic regression aka binary classification objective functionis used.
• The way this is achieved is called negative sampling