The Skip-gram (SG) model is a widely used method for generating word embeddings in natural language processing (NLP). Its underlying principles have also been used in graph embedding algorithms such as Node2Vec and Struc2Vec to produce node embeddings.
Background
The Skip-gram model originated from the Word2Vec algorithm, which was introduced by T. Mikolov et al. at Google in 2013. Word2Vec maps words into a vector space such that semantically similar words are represented by vectors that lie close to each other.
- T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space (2013)
- X. Rong, word2vec Parameter Learning Explained (2016)
In the realm of graph embedding, the introduction of DeepWalk in 2014 marked a pivotal moment, applying the Skip-gram model to generate vector representations of nodes in a graph. The key idea is to treat nodes as "words", and the sequences of nodes generated by random walks as "sentences" forming a "corpus".
- B. Perozzi, R. Al-Rfou, S. Skiena, DeepWalk: Online Learning of Social Representations (2014)
Subsequent graph embedding methods such as Node2Vec and Struc2Vec have enhanced the DeepWalk approach, while still relying on the Skip-gram framework.
To better understand the Skip-gram model, we will illustrate it in its original context within natural language processing.
Model Overview
The core idea behind the Skip-gram model is to predict surrounding context words given a target word. As illustrated in the diagram below, the input word, denoted as $w(t)$, is fed into the model. The model then predicts a set of nearby context words related to $w(t)$: $w(t-2)$, $w(t-1)$, $w(t+1)$, and $w(t+2)$. Here, the symbols $-$/$+$ signify the words that appear before and after the target word in the sequence. The number of context words can be adjusted as needed.

However, it's important to recognize that the ultimate goal of the Skip-gram model is not the prediction task itself. Rather, its primary objective is to learn the weight matrix of the mapping (indicated as PROJECTION in the diagram), whose rows serve as the learned vector representations of words.
Corpus
A corpus is a collection of sentences or texts that a model utilizes to learn the semantic relationships between words.
For example, consider a vocabulary containing 10 distinct words extracted from a corpus: graph, is, a, good, way, to, visualize, data, very, at.
These words can be used to construct sentences such as:
Graph is a good way to visualize data.
Sliding Window Sampling
The Skip-gram model uses a sliding window sampling technique to generate training samples. This method involves a "window" that moves sequentially over each word in a sentence. For each target word, the model pairs it with the context words that appear within a predefined range, denoted as window_size, both before and after the target.
Below is an illustration of the sampling process when window_size = 1.

It's important to note that when window_size > 1, all context words within the specified window are treated equally, regardless of their distance from the target word.
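To make the sampling process concrete, here is a minimal Python sketch of sliding window sampling over the example sentence; the function name generate_pairs and the tokenization are illustrative assumptions, not part of the original description:

```python
# Minimal sketch of sliding window sampling for Skip-gram training pairs.
sentence = ["graph", "is", "a", "good", "way", "to", "visualize", "data"]

def generate_pairs(tokens, window_size=1):
    """Pair each target word with every context word within window_size."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(generate_pairs(sentence, window_size=1))
# [('graph', 'is'), ('is', 'graph'), ('is', 'a'), ('a', 'is'), ...]
```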
One-hot Encoding
Since words are not directly interpretable by machine learning models, they must be converted into machine-understandable representations.
One common method for encoding words is one-hot encoding. In this approach, each word is represented by a unique binary vector where only one element is "hot" (set to 1) and all others are "cold" (set to 0). The position of the 1 in the vector corresponds to the index of the word in the vocabulary.
Below is an example of how one-hot encoding is applied to our vocabulary:
Word | One-hot Encoded Vector |
---|---|
graph | (1, 0, 0, 0, 0, 0, 0, 0, 0, 0) |
is | (0, 1, 0, 0, 0, 0, 0, 0, 0, 0) |
a | (0, 0, 1, 0, 0, 0, 0, 0, 0, 0) |
good | (0, 0, 0, 1, 0, 0, 0, 0, 0, 0) |
way | (0, 0, 0, 0, 1, 0, 0, 0, 0, 0) |
to | (0, 0, 0, 0, 0, 1, 0, 0, 0, 0) |
visualize | (0, 0, 0, 0, 0, 0, 1, 0, 0, 0) |
data | (0, 0, 0, 0, 0, 0, 0, 1, 0, 0) |
very | (0, 0, 0, 0, 0, 0, 0, 0, 1, 0) |
at | (0, 0, 0, 0, 0, 0, 0, 0, 0, 1) |
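As a quick illustration, the vocabulary can be one-hot encoded with a small Python sketch; the helper one_hot and the 0-based indexing are assumptions made for this example:

```python
import numpy as np

# The 10-word vocabulary from the example corpus.
vocab = ["graph", "is", "a", "good", "way", "to", "visualize", "data", "very", "at"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot encoded vector for a word in the vocabulary."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("is"))   # [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
```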
Skip-gram Architecture

The architecture of the Skip-gram model is illustrated above, where:
- $\mathbf{x} = (x_1, x_2, \dots, x_V)$ is the one-hot encoded input vector of the target word, and $V$ is the number of words in the vocabulary.
- $\mathbf{W}_{V \times N}$ is the weight matrix mapping from the input layer to the hidden layer, where $N$ is the embedding dimension.
- $\mathbf{h} = (h_1, h_2, \dots, h_N)$ is the hidden layer vector.
- $\mathbf{W'}_{N \times V}$ is the weight matrix from the hidden layer to the output layer. Note that $\mathbf{W'}$ is not the transpose of $\mathbf{W}$; they are distinct matrices.
- $\mathbf{u} = (u_1, u_2, \dots, u_V)$ is the vector before applying the activation function Softmax.
- $\mathbf{y}_c$ ($c = 1, 2, \dots, C$) are the final output vectors (also referred to as panels). The $C$ panels correspond to the $C$ context words predicted from the target word.
Softmax: The Softmax function serves as an activation function that transforms a vector of real numbers into a probability distribution vector. In the resulting vector, all elements sum to 1. The formula for Softmax is as follows:

$$\text{Softmax}(u_j) = \frac{e^{u_j}}{\sum_{j'=1}^{V} e^{u_{j'}}}, \quad j = 1, 2, \dots, V$$
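A minimal Python sketch of the Softmax function is shown below; the shift by the maximum value is a common numerical-stability trick and an implementation choice, not something the formula above requires:

```python
import numpy as np

def softmax(u):
    """Turn a vector of real-valued scores into a probability distribution."""
    u = np.asarray(u, dtype=float)
    exp_u = np.exp(u - u.max())   # subtract the max for numerical stability
    return exp_u / exp_u.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))        # approximately [0.659 0.242 0.099]
print(softmax(scores).sum())  # 1.0 (up to floating-point rounding)
```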
Forward Propagation
In our example, the vocabulary size is $V = 10$ and the number of context words is $C = 2$. We begin by randomly initializing the weight matrices $\mathbf{W}$ and $\mathbf{W'}$ as shown below. For demonstration, we will use the sample pairs (is, graph) and (is, a).

Input Layer → Hidden Layer
Get the hidden layer vector $\mathbf{h}$ by:

$$\mathbf{h} = \mathbf{W}^{T}\mathbf{x}$$

Given that $\mathbf{x}$ is a one-hot encoded vector with only $x_k = 1$, $\mathbf{h}$ corresponds to the $k$-th row of the matrix $\mathbf{W}$. This operation is essentially a simple lookup process:

$$\mathbf{h} = \mathbf{W}_{(k,\cdot)} := \mathbf{v}_{w_I}$$

where $\mathbf{v}_{w_I}$ is the input vector of the target word $w_I$.
In fact, each row of the matrix $\mathbf{W}$, denoted as $\mathbf{v}_w$, is viewed as the final embedding of the corresponding word in the vocabulary.
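Below is a hedged sketch of this lookup in Python, assuming the 10-word vocabulary from earlier and an arbitrary embedding dimension N = 3 (the article does not fix N); it simply verifies that multiplying by a one-hot vector is the same as selecting a row of $\mathbf{W}$:

```python
import numpy as np

V, N = 10, 3                        # vocabulary size and an illustrative embedding dimension
rng = np.random.default_rng(0)
W = rng.uniform(-0.5, 0.5, (V, N))  # input-to-hidden weight matrix

x = np.zeros(V)
x[1] = 1.0                          # one-hot vector of the target word "is" (0-based index 1)

h = W.T @ x                         # full matrix multiplication ...
h_lookup = W[1]                     # ... is equivalent to looking up row k of W

print(np.allclose(h, h_lookup))     # True
```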
Hidden Layer → Output Layer
Get the vector $\mathbf{u}$ by:

$$\mathbf{u} = \mathbf{W'}^{T}\mathbf{h}$$

The $j$-th component of the vector $\mathbf{u}$ is computed as the dot product between the vector $\mathbf{h}$ and the $j$-th column vector of the matrix $\mathbf{W'}$:

$$u_j = \mathbf{v}'^{T}_{w_j}\,\mathbf{h}$$

where $\mathbf{v}'_{w_j}$ is the output vector of the $j$-th word $w_j$ in the vocabulary.
In the Skip-gram model, each word in the vocabulary has two distinct representations: the input vector $\mathbf{v}_w$ and the output vector $\mathbf{v}'_w$. The input vector represents the word when it is used as the target, while the output vector represents the word when it acts as a context word.
During computation, $u_j$ is essentially the dot product of the input vector $\mathbf{v}_{w_I}$ of the target word and the output vector $\mathbf{v}'_{w_j}$ of the $j$-th word $w_j$. The Skip-gram model is built on the principle that a higher similarity between two vectors yields a larger dot product between them.
It is also important to note that only the input vectors $\mathbf{v}_w$ are ultimately used as the word embeddings. Separating the input and output vectors simplifies the computation, improving both efficiency and quality during training.
Get each output panel $\mathbf{y}_c$ by:

$$\mathbf{y}_c = \text{Softmax}(\mathbf{u}), \quad y_{c,j} = \frac{e^{u_j}}{\sum_{j'=1}^{V} e^{u_{j'}}}, \quad c = 1, 2, \dots, C$$

where $y_{c,j}$ is the $j$-th component of $\mathbf{y}_c$, representing the probability of the $j$-th word within the vocabulary being predicted while considering the given target word. Naturally, the probabilities in each panel sum to 1; and since all panels share the same weights, they produce the same distribution.
The $C$ words with the highest predicted probabilities are selected as the context words. In our example, where $C = 2$, the predicted context words are good and visualize.
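Putting the two layers together, the forward pass can be sketched end to end as follows. The weights here are toy random values, so the predicted context words will generally differ from the article's example, which relies on its own (unshown) initialization:

```python
import numpy as np

vocab = ["graph", "is", "a", "good", "way", "to", "visualize", "data", "very", "at"]
V, N, C = len(vocab), 3, 2                # N = 3 is an arbitrary embedding dimension

rng = np.random.default_rng(0)
W = rng.uniform(-0.5, 0.5, (V, N))        # input -> hidden weights
W_prime = rng.uniform(-0.5, 0.5, (N, V))  # hidden -> output weights (not W transposed)

def softmax(u):
    exp_u = np.exp(u - u.max())
    return exp_u / exp_u.sum()

k = vocab.index("is")                     # target word
h = W[k]                                  # hidden layer = input vector of the target word
u = W_prime.T @ h                         # u[j] = dot(v'_wj, h)
y = softmax(u)                            # shared probability distribution for all C panels

top_c = np.argsort(y)[::-1][:C]           # indices of the C most probable context words
print([vocab[j] for j in top_c])
```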

Backpropagation
Stochastic gradient descent (SGD) is used to backpropagate the errors and adjust the weights in both $\mathbf{W}$ and $\mathbf{W'}$.
Loss Function
Our goal is to maximize the probabilities of the context words, which is equivalent to maximizing the product of these probabilities:

$$\max \prod_{c=1}^{C} y_{c,j_c^*}$$

where $j_c^*$ is the index of the expected $c$-th output context word.
Because minimizing a function is often more straightforward and practical than maximizing it, we transform the above objective accordingly:

$$\min\left(-\log \prod_{c=1}^{C} y_{c,j_c^*}\right) = \min\left(-\sum_{c=1}^{C} u_{j_c^*} + C \cdot \log \sum_{j'=1}^{V} e^{u_{j'}}\right)$$
Therefore, the loss function for the Skip-gram model is defined as:

$$E = -\sum_{c=1}^{C} u_{j_c^*} + C \cdot \log \sum_{j'=1}^{V} e^{u_{j'}}$$
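Using the same toy setup as before, the loss for the sample pairs (is, graph) and (is, a) can be computed directly from $\mathbf{u}$; the random weights are illustrative, so the numeric value will not match any figure from the original walkthrough:

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, C = 10, 3, 2
W = rng.uniform(-0.5, 0.5, (V, N))
W_prime = rng.uniform(-0.5, 0.5, (N, V))

h = W[1]                                   # target word "is" (0-based index 1)
u = W_prime.T @ h

context_indices = [0, 2]                   # expected context words: "graph" (0) and "a" (2)

# E = -sum_c u[j_c*] + C * log(sum_j' exp(u[j']))
E = -sum(u[j] for j in context_indices) + C * np.log(np.sum(np.exp(u)))
print(E)
```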
Take the partial derivative of $E$ with respect to $u_j$:

$$\frac{\partial E}{\partial u_j} = \sum_{c=1}^{C}\left(y_{c,j} - t_{c,j}\right)$$
To simplify the notation going forward, we define the error vector of the $c$-th panel as:

$$\mathbf{e}_c = \mathbf{y}_c - \mathbf{t}_c$$

where $\mathbf{t}_c$ is the one-hot encoding vector of the $c$-th expected output context word. In our example, $\mathbf{t}_1$ and $\mathbf{t}_2$ are the one-hot encoded vectors of the words graph and a, respectively. Therefore, the corresponding errors $\mathbf{e}_1$ and $\mathbf{e}_2$ are calculated as follows:

Therefore, $\frac{\partial E}{\partial \mathbf{u}}$ can be written as:

$$\frac{\partial E}{\partial \mathbf{u}} = \sum_{c=1}^{C}\mathbf{e}_c$$
In our example, it is calculated as:

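Since the article's concrete initialization is not reproduced here, the sketch below computes $\mathbf{e}_1$, $\mathbf{e}_2$, and their sum $\partial E/\partial \mathbf{u}$ with toy random weights instead:

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, C = 10, 3, 2
W = rng.uniform(-0.5, 0.5, (V, N))
W_prime = rng.uniform(-0.5, 0.5, (N, V))

h = W[1]                                   # target word "is"
u = W_prime.T @ h
y = np.exp(u - u.max()) / np.sum(np.exp(u - u.max()))   # softmax output (same for every panel)

t1 = np.zeros(V); t1[0] = 1.0              # one-hot vector of "graph"
t2 = np.zeros(V); t2[2] = 1.0              # one-hot vector of "a"

e1, e2 = y - t1, y - t2                    # per-panel error vectors
dE_du = e1 + e2                            # dE/du = sum of the error vectors
print(dE_du)
```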
Output Layer → Hidden Layer
Adjustments are applied to all the weights in matrix $\mathbf{W'}$, meaning that the output vectors $\mathbf{v}'_{w_j}$ of all words in the vocabulary are updated.
Calculate the partial derivative of $E$ with respect to $w'_{ij}$:

$$\frac{\partial E}{\partial w'_{ij}} = \frac{\partial E}{\partial u_j}\cdot\frac{\partial u_j}{\partial w'_{ij}} = \frac{\partial E}{\partial u_j}\cdot h_i$$

Adjust $w'_{ij}$ according to the learning rate $\eta$:

$$w'^{(\text{new})}_{ij} = w'^{(\text{old})}_{ij} - \eta\cdot\frac{\partial E}{\partial u_j}\cdot h_i$$
Set the learning rate $\eta$; every output vector $\mathbf{v}'_{w_j}$ is then updated according to this rule.
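A sketch of this update step under the same toy setup; the learning rate of 0.1 is an illustrative choice, as the article does not state the value it uses:

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, C, eta = 10, 3, 2, 0.1               # eta is an illustrative learning rate
W = rng.uniform(-0.5, 0.5, (V, N))
W_prime = rng.uniform(-0.5, 0.5, (N, V))

h = W[1]                                   # target word "is"
u = W_prime.T @ h
y = np.exp(u - u.max()) / np.sum(np.exp(u - u.max()))

t = np.zeros((C, V)); t[0, 0] = 1.0; t[1, 2] = 1.0   # one-hot targets: "graph", "a"
dE_du = (y - t).sum(axis=0)                # summed error vector, shape (V,)

# Every column j of W' (i.e., every output vector v'_wj) is adjusted:
# w'_ij_new = w'_ij_old - eta * dE/du_j * h_i  ->  outer product of h and dE/du
W_prime -= eta * np.outer(h, dE_du)        # shape (N, V)
```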
Hidden Layer → Input Layer
Only the weights in matrix $\mathbf{W}$ corresponding to the input vector $\mathbf{v}_{w_I}$ of the target word are updated.
This is because the vector $\mathbf{h}$ is obtained by looking up only the $k$-th row of matrix $\mathbf{W}$ (given that $x_k = 1$):

$$h_i = w_{ki}, \quad i = 1, 2, \dots, N$$
Calculate the partial derivative of $E$ with respect to $w_{ki}$:

$$\frac{\partial E}{\partial w_{ki}} = \frac{\partial E}{\partial h_i}\cdot\frac{\partial h_i}{\partial w_{ki}} = \sum_{j=1}^{V}\frac{\partial E}{\partial u_j}\cdot w'_{ij}$$

Adjust $\mathbf{v}_{w_I}$ according to the learning rate $\eta$:

$$w^{(\text{new})}_{ki} = w^{(\text{old})}_{ki} - \eta\cdot\sum_{j=1}^{V}\frac{\partial E}{\partial u_j}\cdot w'_{ij}$$
In our example, the target word is is, so $k = 2$; hence, only the second row of $\mathbf{W}$, i.e., $\mathbf{v}_{w_I}$, is updated.
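And a matching sketch of the input-side update, which modifies only row $k$ of $\mathbf{W}$ (again with an illustrative learning rate and toy weights):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, C, eta = 10, 3, 2, 0.1               # eta is an illustrative learning rate
W = rng.uniform(-0.5, 0.5, (V, N))
W_prime = rng.uniform(-0.5, 0.5, (N, V))

k = 1                                      # 0-based index of the target word "is" (k = 2 in 1-based notation)
h = W[k]
u = W_prime.T @ h
y = np.exp(u - u.max()) / np.sum(np.exp(u - u.max()))

t = np.zeros((C, V)); t[0, 0] = 1.0; t[1, 2] = 1.0   # one-hot targets: "graph", "a"
dE_du = (y - t).sum(axis=0)                # dE/du, shape (V,)

# dE/dh_i = sum_j dE/du_j * w'_ij  ->  W' @ dE/du, shape (N,)
dE_dh = W_prime @ dE_du

# Only the k-th row of W (the input vector of the target word) is updated.
W[k] -= eta * dE_dh
```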
Optimization
We have explored the fundamentals of the Skip-gram model. However, incorporating optimizations is essential to keep the model's computational complexity practical for real-world applications.