Transformer-based models are famous for their ability to parse and interpret complex text. They rely on understanding the order and context of words - tasks at which traditional positional encoding methods have shown their limits. Addressing this gap, the ROFORMER model, powered by the Rotary Position Embedding (RoPE), redefines our approach to positional encoding.
Traditional Positional Encoding
Transformers treat text as a series of tokens and allow parallel processing of sequences for greater efficiency. However, this strength comes with a challenge: the model is inherently agnostic to token order. Positional encoding was the answer, giving each token a unique signature that denotes its position in the sequence.
Absolute Position Embeddings
Initially, models like BERT used absolute position embeddings, assigning a fixed vector to each position in a sequence. This method, though straightforward, cannot adapt to varying sequence lengths or emphasize the relative distances between tokens, which are critical for understanding many linguistic constructs.
Relative Position Embeddings
To capture the dynamic nature of language, relative position embeddings were introduced, focusing on the distance between tokens rather than their absolute positions. Despite their conceptual advantage, these embeddings added computational complexity and did not integrate seamlessly into the self-attention mechanism of Transformers, limiting their efficacy.
ROFORMER and Rotary Position Embedding
Recognizing the limitations of existing positional encoding strategies, ROFORMER introduces Rotary Position Embedding (RoPE), an approach that combines the benefits of absolute and relative position information without their respective drawbacks.
Rotary Position Embedding
RoPE encodes positional information using rotation matrices, enabling the model to understand not just where a token is, but how it relates to every other token in a sequence.
It operates through a geometric lens, treating token positions as points in a multi-dimensional space that are rotated to mark their sequential relationships. This rotation allows the model to preserve and exploit both absolute and relative positional cues within its self-attention mechanism.
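To make the rotation idea concrete, here is a minimal, hedged sketch in TensorFlow. The helper name rope_rotate and the toy values are illustrative, not taken from the ROFORMER paper: it rotates consecutive feature pairs of a vector by position-dependent angles and shows that the dot product of two rotated vectors depends only on their relative offset.

import tensorflow as tf

def rope_rotate(x, positions, base=10000.0):
    # x: (seq_len, dim) with dim even; positions: (seq_len,) float positions.
    # Each consecutive feature pair (x[2i], x[2i+1]) is rotated by an angle
    # position * theta_i, where theta_i = base ** (-2i / dim).
    dim = x.shape[-1]
    theta = base ** (-tf.range(0, dim, 2, dtype=tf.float32) / dim)   # (dim/2,)
    angles = positions[:, None] * theta[None, :]                     # (seq_len, dim/2)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    rot_even = x_even * tf.cos(angles) - x_odd * tf.sin(angles)
    rot_odd = x_even * tf.sin(angles) + x_odd * tf.cos(angles)
    return tf.reshape(tf.stack([rot_even, rot_odd], axis=-1), tf.shape(x))

# Toy check: the dot product of a rotated query and key depends only on the
# relative offset between their positions (5 - 3 == 12 - 10), not on the
# absolute positions themselves.
q = tf.random.normal((1, 8))
k = tf.random.normal((1, 8))
a = tf.reduce_sum(rope_rotate(q, tf.constant([3.0])) * rope_rotate(k, tf.constant([5.0])))
b = tf.reduce_sum(rope_rotate(q, tf.constant([10.0])) * rope_rotate(k, tf.constant([12.0])))
print(float(a), float(b))  # approximately equal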
Implementing RoPE
Implementing RoPE involves encoding each token's position into a rotation matrix, and applying this matrix within the self-attention mechanism of the Transformer. This process allows for a flexible, dynamic interpretation of positional information, accommodating varying sequence lengths and capturing the essence of token interrelationships without significant computational overhead.
First, you'll need a function to generate the rotary embeddings, and then you'll integrate these embeddings into your model. The example below assumes you're familiar with creating custom layers in Keras.
Step 1: Define the Rotary Embedding Function
This function generates the rotary embeddings given the dimensionality of the embeddings and the maximum sequence length.
import tensorflow as tf
from tensorflow.keras.layers import Layer

def get_rotary_embedding(dim, max_seq_len):
    # Inverse frequencies: one rotation rate per pair of embedding dimensions
    inv_freq = 1.0 / (10000 ** (tf.range(0, dim, 2, dtype=tf.float32) / dim))
    # Position indices 0 .. max_seq_len - 1
    t = tf.range(max_seq_len, dtype=tf.float32)
    # Outer product: one angle for every (position, frequency) pair
    freqs = tf.einsum('i,j->ij', t, inv_freq)
    # Concatenate cosine and sine components along the last axis
    emb = tf.concat((tf.cos(freqs), tf.sin(freqs)), axis=-1)
    return emb
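As a quick, hedged sanity check (toy sizes chosen here, not from the original article), the returned tensor has one row per position and dim columns, with cosines in the first half and sines in the second:

emb = get_rotary_embedding(dim=8, max_seq_len=4)
print(emb.shape)        # (4, 8)
print(emb[0].numpy())   # position 0: cos(0) = 1 in the first half, sin(0) = 0 in the second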
inv_freq = 1.0 / (10000 ** (tf.range(0, dim, 2, dtype=tf.float32) / dim))
This line computes the inverse of exponentially scaled frequencies based on the position indices. These frequencies are used in generating sinusoidal patterns for rotary embeddings, which helps in encoding the relative positional information in sequences. This mechanism is particularly useful in tasks where understanding the order and relative positioning of elements is crucial, such as in natural language processing or time series analysis.
In detail:
- tf.range(0, dim, 2, dtype=tf.float32) creates a range of values starting from 0 up to dim (exclusive), stepping by 2. The dtype=tf.float32 argument specifies that the elements of this tensor are 32-bit floating-point numbers. If dim is 8, for example, this produces [0, 2, 4, 6].
- The tensor produced by tf.range is then divided by the dimensionality (dim) of the embeddings, scaling these indices into the range [0, 1). Continuing the example with dim = 8, dividing by 8 yields [0.0, 0.25, 0.5, 0.75].
- The 10000 ** (...) operation raises 10,000 to the power of each element of the scaled tensor. The base of 10,000 is somewhat arbitrary, but it is chosen so that the frequencies vary over a wide range, which helps the model differentiate between positions more effectively. For [0.0, 0.25, 0.5, 0.75], this yields [1, 10, 100, 1000].
- Finally, the inverse frequency is obtained by taking the reciprocal (1/x) of the values from the previous step. The inverse frequencies are smaller for higher-indexed dimension pairs, so those pairs rotate more slowly as position increases. This lets the embeddings encode positions at multiple scales, so that relative positions can be inferred through the model's attention mechanism; the short numeric check after this list walks through these values.
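Here is that arithmetic spelled out for dim = 8 (a hedged illustration with toy numbers, not from the original article):

import tensorflow as tf

dim = 8
exponents = tf.range(0, dim, 2, dtype=tf.float32) / dim   # [0.0, 0.25, 0.5, 0.75]
frequencies = 10000 ** exponents                           # [1.0, 10.0, 100.0, 1000.0]
inv_freq = 1.0 / frequencies                               # [1.0, 0.1, 0.01, 0.001]
print(inv_freq.numpy())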
The line freqs = tf.einsum('i,j->ij', t, inv_freq) uses TensorFlow's tf.einsum function, a tool that allows for concise and efficient expression of tensor operations using Einstein summation notation.
This operation effectively calculates the outer product of the t and inv_freq vectors, resulting in a matrix where each element (i, j) is the product of the i-th element of t and the j-th element of inv_freq. This matrix (freqs) represents the frequencies used to generate the sinusoidal patterns for the rotary embeddings.
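For intuition, the einsum here is just an outer product; a short hedged check with toy values (not from the original article):

import tensorflow as tf

t = tf.range(4, dtype=tf.float32)                    # positions 0..3
inv_freq = tf.constant([1.0, 0.1, 0.01, 0.001])      # example inverse frequencies
freqs_einsum = tf.einsum('i,j->ij', t, inv_freq)     # shape (4, 4)
freqs_outer = t[:, None] * inv_freq[None, :]         # same result via broadcasting
print(float(tf.reduce_max(tf.abs(freqs_einsum - freqs_outer))))  # 0.0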
Step 2: Custom Keras Layer for Rotary Embeddings
Now, let's create a custom Keras layer that applies rotary embeddings to the input tensor. This layer assumes that the input tensor is of shape (batch_size, sequence_length, embedding_dim).
class RotaryEmbeddingLayer(Layer):
    def __init__(self, dim, max_seq_len, **kwargs):
        super().__init__(**kwargs)
        self.dim = dim
        self.max_seq_len = max_seq_len
        # Pre-compute the embeddings once for the maximum sequence length
        self.rotary_embeddings = get_rotary_embedding(dim, max_seq_len)

    def call(self, inputs):
        seq_len = tf.shape(inputs)[1]
        embeddings = self.rotary_embeddings[:seq_len]
        # Add a leading batch axis so the embeddings broadcast over the batch:
        # shapes become (1, seq_len, dim // 2)
        cos_emb = embeddings[None, :, :self.dim // 2]
        sin_emb = embeddings[None, :, self.dim // 2:]
        # Split the inputs into the two halves that form the rotation pairs
        inputs_cos = inputs[..., :self.dim // 2]
        inputs_sin = inputs[..., self.dim // 2:]
        # Apply the rotation
        rotated_cos = inputs_cos * cos_emb - inputs_sin * sin_emb
        rotated_sin = inputs_sin * cos_emb + inputs_cos * sin_emb
        return tf.concat([rotated_cos, rotated_sin], axis=-1)

    def get_config(self):
        config = super().get_config()
        config.update({
            "dim": self.dim,
            "max_seq_len": self.max_seq_len
        })
        return config
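As a quick hedged usage sketch (random toy inputs, sizes chosen only for illustration), the layer leaves the tensor shape unchanged while injecting positional information:

import tensorflow as tf

layer = RotaryEmbeddingLayer(dim=64, max_seq_len=128)
dummy = tf.random.normal((2, 16, 64))   # (batch_size, sequence_length, embedding_dim)
out = layer(dummy)
print(out.shape)                        # (2, 16, 64)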
The line embeddings = self.rotary_embeddings[:seq_len] selects the appropriate subset of pre-computed rotary embeddings based on the current input sequence length. Since the length of sequences can vary from one batch to another, this slicing operation ensures that only the embeddings corresponding to the actual sequence length are used.
The variable embeddings now holds a tensor of shape (seq_len, embedding_dim), where seq_len is the length of the sequences in the current batch and embedding_dim is the dimensionality of the embeddings. This tensor contains the rotary positional embeddings for each position in the sequence up to seq_len.
Back in get_rotary_embedding, the line emb = tf.concat((tf.cos(freqs), tf.sin(freqs)), axis=-1) combines sine and cosine transformations of the positional frequencies into a single tensor:
- tf.cos(freqs) and tf.sin(freqs) apply the cosine and sine transformations, respectively, to the freqs tensor. The freqs tensor contains a frequency value for each position in the input sequence and each of the dim // 2 frequency components, calculated from the sequence positions and the inverse frequencies. The sine and cosine functions are applied element-wise, resulting in two tensors of the same shape as freqs. These transformations encode position in a way that captures the cyclical nature of positional relationships, which supports the model's ability to understand relative positions.
- tf.concat((tf.cos(freqs), tf.sin(freqs)), axis=-1) concatenates the cosine- and sine-transformed tensors along the last axis (axis=-1). Placing them side by side doubles the size of the last axis of freqs, with the first half holding the cosine values and the second half the sine values for each position. The concatenation ensures that each positional encoding contains both sine and cosine information, preserving both the amplitude and phase of the positional signals.
- The concatenated tensor emb now holds the complete rotary embeddings for the input positions. Its first dimension (sequence positions) matches freqs, while its last dimension is twice as large, accounting for both the cosine and sine values. These embeddings are used to modulate the input embeddings, adding positional information in a rotationally equivariant manner.
The indexing cos_emb = embeddings[None, :, :self.dim // 2] does three things:
- None adds a leading axis of size 1, making the tensor 3-dimensional. This is needed for broadcasting: when the embeddings are multiplied element-wise with the input halves of shape (batch_size, seq_len, dim // 2), the size-1 axis is stretched across the batch dimension.
- The colon : in the second position means "select all elements in this dimension," which here refers to all positions in the sequence.
- :self.dim // 2 selects the first half of the last axis. Because the pre-computed embeddings hold the cosine values in the first half of that axis and the sine values in the second half, slicing up to dim // 2 picks out just the cosine components; sin_emb is obtained analogously from the second half. A short shape check follows this list.
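Here is a hedged shape check (toy sizes assumed) showing how the size-1 axis broadcasts over the batch:

import tensorflow as tf

dim, max_seq_len = 8, 16
emb = get_rotary_embedding(dim, max_seq_len)[:10]   # (10, 8) for a batch whose seq_len is 10
cos_emb = emb[None, :, :dim // 2]                   # (1, 10, 4)
inputs = tf.random.normal((2, 10, dim))             # (batch_size, seq_len, dim)
print((inputs[..., :dim // 2] * cos_emb).shape)     # (2, 10, 4): broadcast over the batch axis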
Step 3: Integration with a Keras Model
After defining the RotaryEmbeddingLayer, you can integrate it into your Keras model. This layer should be applied to your embeddings before feeding them into attention layers or any subsequent model layers.
Here's a simplified example of how to integrate the rotary embeddings into a model:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Dense
max_seq_len = 512
embedding_dim = 64
inp = Input(shape=(max_seq_len,))
x = Embedding(input_dim=10000, output_dim=embedding_dim)(inp)
x = RotaryEmbeddingLayer(dim=embedding_dim, max_seq_len=max_seq_len)(x)
# Add your model's layers here, e.g., Transformer blocks
x = Dense(1, activation='sigmoid')(x)  # Simplified per-token output head for illustration
model = Model(inputs=inp, outputs=x)
model.summary()
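To confirm the pieces fit together, here is a hedged end-to-end smoke test (random token IDs; the optimizer and loss are placeholder choices, not prescribed by the original article):

import tensorflow as tf

model.compile(optimizer='adam', loss='binary_crossentropy')
dummy_tokens = tf.random.uniform((2, max_seq_len), maxval=10000, dtype=tf.int32)
preds = model(dummy_tokens)
print(preds.shape)   # (2, 512, 1): one sigmoid output per token in this simplified setup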