      Skip-gram Optimization

      Overview

      The basic Skip-gram model is nearly impractical for real-world use due to its high computational demands.

      The sizes of the matrices W and W′ depend on the vocabulary size (e.g., V=10000) and the embedding dimension (e.g., N=300). As a result, each matrix can contain millions of weights (e.g., V×N = 3 million), making the Skip-gram neural network substantially large. Training such a model effectively requires a massive number of samples to tune all the weights.

      Additionally, during each backpropagation step, updates are applied to all output vectors (v′_w) in matrix W′, even though most of these vectors are unrelated to the current target or context words. Given the large size of W′, this makes gradient descent highly inefficient and computationally slow.

      Another significant computational cost comes from the Softmax function, whose normalization denominator involves every word in the vocabulary.
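
      For reference, the Softmax of the basic model can be written as below, where the denominator sums over all V output scores (the symbols y_j and u_j follow the notation used later in this article; the exact form shown in the original figure is assumed):

      $$y_j = \frac{e^{u_j}}{\sum_{j'=1}^{V} e^{u_{j'}}}$$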

      T. Mikolov and colleagues introduced optimization techniques for the Skip-gram model, including subsampling and negative sampling. These approaches help accelerate training and improve the quality of the resulting embedding vectors.

      Subsampling

      Common words in the corpus, such as "the", "and", and "is", raise certain concerns:

      • They have limited semantic value. For example, the model benefits more from the co-occurrence of "France" and "Paris" than from the frequent pairing of "France" and "the".
      • These words show up in more training samples than needed, making it inefficient to train their vectors.

      Subsampling is used to address this issue by randomly discarding words during training. Frequent words are more likely to be discarded, while rare words are kept more often.

      First, calculate the probability of keeping a word by:

      $$P(w_i) = \left( \sqrt{\frac{f(w_i)}{\alpha}} + 1 \right) \cdot \frac{\alpha}{f(w_i)}$$

      where f(w_i) is the frequency of the i-th word, and α is a factor that influences the distribution, with a default value of 0.001.

      Then, a random number between 0 and 1 is generated. If P(w_i) is smaller than this number, the word is discarded.

      For instance, when α=0.001, then for f(w_i) ≤ 0.0026, P(w_i) ≥ 1, so words with a frequency of 0.0026 or less are always kept. For a high word frequency like f(w_i)=0.03, P(w_i)=0.22.

      In the case when α=0.002, words with a frequency of 0.0052 or less are always kept. For the same high word frequency f(w_i)=0.03, P(w_i)=0.32.

      Thus, a higher value of α decreases the probability that frequent words are down-sampled, while a lower value of α down-samples them more aggressively.

      For example, if the word 'a' is discarded from the sentence 'Graph is a good way to visualize data,' then no training samples will include 'a' as either the target or a context word.
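
      As a rough illustration, here is a minimal Python sketch of the subsampling rule described above. The function names and the frequency dictionary are hypothetical; only the keep-probability formula and the random-number test come from the text:

      ```python
      import math
      import random

      def keep_probability(freq: float, alpha: float = 0.001) -> float:
          """P(w_i) for a word with relative corpus frequency `freq`."""
          return (math.sqrt(freq / alpha) + 1) * alpha / freq

      def subsample(words, freqs, alpha: float = 0.001):
          """Randomly discard words; frequent words are discarded more often."""
          kept = []
          for w in words:
              p = keep_probability(freqs[w], alpha)
              if random.random() < p:  # P(w_i) >= 1 means the word is always kept
                  kept.append(w)
          return kept

      # With alpha=0.001: a word with frequency 0.0026 is always kept,
      # while one with frequency 0.03 is kept only ~22% of the time.
      print(round(keep_probability(0.0026), 2))  # ~1.0
      print(round(keep_probability(0.03), 2))    # 0.22
      ```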

      Negative Sampling

      In the negative sampling approach, when a positive context word is sampled for a target word, a total of k words are simultaneously chosen as negative samples.

      For instance, let's revisit the simple corpus used when discussing the basic Skip-gram model. This corpus comprises a vocabulary of 10 words: graph, is, a, good, way, to, visualize, data, very, at. When the positive sample (target, context): (is, a) is generated using a sliding window, we select k=3 negative words graph, data and at to accompany it:

      Target Word | Sample Type     | Context Word | Expected Output
      ------------|-----------------|--------------|----------------
      is          | Positive sample | a            | 1
      is          | Negative sample | graph        | 0
      is          | Negative sample | data         | 0
      is          | Negative sample | at           | 0

      With negative sampling, the training objective of the model shifts from predicting context words for the target word to a binary classification task. In this setup, the output for the positive word is expected to be 1, while the outputs for the negative words are expected to be 0; other words that fall into neither category are disregarded.

      Consequently, during the backpropagation process, the model only updates the output vectors v′_w associated with the positive and negative words to improve the model's classification performance.

      Consider the scenario where V=10000 and N=300. When applying negative sampling with the parameter k=9, only the output vectors of the 1 positive and 9 negative words are updated, i.e., 300×10=3000 individual weights in W′, which is 0.1% of the 3 million weights that would be updated without negative sampling!

      Our experiments indicate that values of k in the range 5~20 are useful for small training datasets, while for large datasets the k can be as small as 2~5. (Mikolov et al.)

      To select negative samples, a probability distribution Pn is required. The fundamental principle is to prioritize frequent words in the corpus. However, using raw frequency can result in an overrepresentation of very common words, while underrepresenting less frequent ones. To address this, an empirical distribution is often used that raises the word frequency to the power of 3/4:

      $$P_n(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=1}^{V} f(w_j)^{3/4}}$$

      where f(w_i) is the frequency of the i-th word. The subscript n of P indicates noise; the distribution P_n is also called the noise distribution.

      In extreme cases, consider a corpus containing only two words, with frequencies of 0.9 and 0.1 respectively. Applying the above formula results in adjusted probabilities of 0.84 and 0.16. This adjustment helps reduce the bias caused by large frequency disparities.

      However, when working with large corpora, negative sampling can still be computationally intensive. To address this, a resolution is introduced to rescale the noise distribution. A higher resolution value enables the resampled distribution to better approximate the original noise distribution, striking a balance between efficiency and accuracy.
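
      As a rough illustration of both the 3/4-power noise distribution and the resolution-based rescaling, here is a Python sketch in which the distribution is discretized into a lookup table whose size plays the role of the resolution. Treating the resolution as a table size is an assumption made for illustration, not a description of Ultipa's internal implementation:

      ```python
      import random

      def noise_distribution(freqs: dict, power: float = 0.75) -> dict:
          """P_n(w) proportional to f(w)^(3/4)."""
          scaled = {w: f ** power for w, f in freqs.items()}
          total = sum(scaled.values())
          return {w: s / total for w, s in scaled.items()}

      def build_sampling_table(pn: dict, resolution: int = 100_000) -> list:
          """Discretize P_n: each word fills a number of slots proportional to its probability."""
          table = []
          for w, p in pn.items():
              table.extend([w] * max(1, round(p * resolution)))
          return table

      def sample_negatives(table: list, exclude: set, k: int = 3) -> list:
          """Draw k negative words, skipping the target and the positive context word."""
          negatives = []
          while len(negatives) < k:
              w = random.choice(table)
              if w not in exclude:
                  negatives.append(w)
          return negatives

      # Two-word corpus from the text: raw frequencies 0.9/0.1 become ~0.84/0.16.
      print(noise_distribution({"w1": 0.9, "w2": 0.1}))
      ```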

      Optimized Model Training

      Forward Propagation

      We will demonstrate with target word is, positive word a, and negative words graph, data and at.

      With negative sampling, the Skip-gram model uses the following variation of the Softmax function, which is actually the Sigmoid function (σ) of u_j. This function maps each sampled component of u into the range of 0 to 1:

      $$y_j = \sigma(u_j) = \frac{1}{1 + e^{-u_j}}$$
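
      A minimal NumPy sketch of this forward pass, using the 10-word toy vocabulary from above. The random initialization, the word-to-index mapping, and the variable names are assumptions made for illustration:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      V, N = 10, 2                      # toy vocabulary size and embedding dimension
      W = rng.normal(size=(V, N))       # input vectors  (rows of W)
      W_out = rng.normal(size=(N, V))   # output vectors (columns of W')

      vocab = ["graph", "is", "a", "good", "way", "to", "visualize", "data", "very", "at"]
      idx = {w: i for i, w in enumerate(vocab)}

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      target, positive, negatives = "is", "a", ["graph", "data", "at"]

      # Only the positive and negative output vectors are evaluated -- no full Softmax.
      h = W[idx[target]]                                     # hidden layer = input vector of "is"
      samples = [positive] + negatives
      u = np.array([W_out[:, idx[w]] @ h for w in samples])  # scores u_j for the sampled words
      y = sigmoid(u)                                         # y[0] should move toward 1, the rest toward 0
      print(dict(zip(samples, y.round(3))))
      ```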

      Backpropagation

      As explained, the output for the positive word, denoted as y_0, is expected to be 1, while the k outputs corresponding to the negative words, denoted as y_i (i=1,…,k), are expected to be 0. Therefore, the objective of the model's training is to maximize both y_0 and 1−y_i, which can be equivalently interpreted as maximizing their product:

      $$y_0 \prod_{i=1}^{k} (1 - y_i)$$

      The loss function E is then obtained by transforming the above into a minimization problem:

      $$E = -\log \left( y_0 \prod_{i=1}^{k} (1 - y_i) \right) = -\log y_0 - \sum_{i=1}^{k} \log (1 - y_i)$$

      Take the partial derivative of E with respect to u_0 and u_i:

      $$\frac{\partial E}{\partial u_0} = y_0 - 1, \qquad \frac{\partial E}{\partial u_i} = y_i \quad (i = 1, \dots, k)$$

      ∂E/∂u_0 and ∂E/∂u_i hold a similar meaning to ∂E/∂u_j in the original Skip-gram model: each can be understood as subtracting the expected output (1 for the positive word, 0 for each negative word) from the actual output.

      The process of updating weights in matrices W and W′ is straightforward. You may refer to the original form of Skip-gram. However, only weights w′11, w′21, w′13, w′23, w′18, w′28, w′1,10 and w′2,10 in W′ and weights w21 and w22 in W are updated.
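
      To make the update concrete, here is a NumPy sketch of one stochastic-gradient step with negative sampling: only the sampled output vectors (columns of W′) and the target word's input vector (row of W) are modified. The learning rate and the initialization are assumptions made for illustration:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      V, N, eta = 10, 2, 0.025          # toy sizes and an assumed learning rate
      W = rng.normal(size=(V, N))       # input vectors  (rows of W)
      W_out = rng.normal(size=(N, V))   # output vectors (columns of W')

      vocab = ["graph", "is", "a", "good", "way", "to", "visualize", "data", "very", "at"]
      idx = {w: i for i, w in enumerate(vocab)}

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      target = "is"
      samples = [("a", 1.0), ("graph", 0.0), ("data", 0.0), ("at", 0.0)]  # (word, expected output)

      h = W[idx[target]].copy()         # hidden layer = input vector of the target word
      grad_h = np.zeros(N)
      for word, expected in samples:
          j = idx[word]
          y = sigmoid(W_out[:, j] @ h)
          e = y - expected              # dE/du_j: actual output minus expected output
          grad_h += e * W_out[:, j]     # accumulate the gradient flowing back to the hidden layer
          W_out[:, j] -= eta * e * h    # update only this output vector (a column of W')
      W[idx[target]] -= eta * grad_h    # update only the target word's input vector (a row of W)
      ```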
