Embedding Models: From Architecture to Implementation
These are the notes I took while following this lecture.
They cover training and using embedding models for RAG applications.
I hope they will be helpful!
Contextualised Token Embeddings
Introduction
Here we talk about contextualised token embedding models such as BERT.
These models are different from word embedding models, like word2vec or GloVe, which generate a single vector for each word.
With word embedding models the context of a word is lost; contextualised embedding models improve on this.
BERT
BERT is a model with a transformer-like architecture.
- A full transformer has an encoder and a decoder.
- GPTs are decoder-only.
- BERT is encoder-only.
BERT base
- 12 layers
- 768 hidden units
- 12 attention heads
- 110 million parameters
BERT large
- 24 layers
- 1024 hidden units
- 16 attention heads
- 340 million parameters
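As a quick sanity check of these numbers, here is a small sketch (my own addition, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available) that counts the parameters of BERT base:

import torch
from transformers import BertModel

# Load BERT base and count its parameters; BERT large can be checked the same way.
model = BertModel.from_pretrained("bert-base-uncased")
n_params = sum(p.numel() for p in model.parameters())
print(f"BERT base parameters: {n_params / 1e6:.0f}M")  # roughly 110M (without the pretraining heads)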
Training BERT
The pretraining of BERT consists of the following tasks.
- Masked language modelling (MLM), where the model is trained to predict randomly masked-out words.
- Next sentence prediction (NSP), where the model is trained to predict whether one (short) sentence follows another.
After this, we often finetune the model for specialised downstream tasks:
- text classification
- named entity recognition
- question answering
Here is an example
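As a quick illustration of the MLM objective, here is a minimal sketch (my own addition) using the Hugging Face fill-mask pipeline, assuming the bert-base-uncased checkpoint is available:

from transformers import pipeline

# BERT predicts the masked-out token; this is exactly the MLM pretraining objective.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))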
CLS token in Bert
Q: Is it true that the CLS token is inserted to the beginning of the text by all tokenizers?
A: The statement that the CLS (Classification) token is inserted at the beginning of the text by all tokenizers is not entirely accurate. Here’s a more precise explanation:
BERT and Similar Models: The CLS token is a special token used in transformer models like BERT (Bidirectional Encoder Representations from Transformers). In these models, the CLS token is indeed inserted at the beginning of the input text during tokenization. This token serves as a summary representation of the entire input sequence, often used in classification tasks.
Other Models: Not all tokenizers or models use the CLS token. The insertion of the CLS token is specific to certain models like BERT, RoBERTa, and some other models derived from the transformer architecture that use this token for specific purposes.
BERT: When BERT tokenizes a text, it inserts the CLS token at the beginning and, often, a SEP (Separator) token at the end (especially when handling pairs of sentences).
GPT: Models like GPT (Generative Pre-trained Transformer) do not use a CLS token. Instead, they might use different special tokens or none at all, depending on the specific task or model configuration.
In summary, the CLS token is not universally inserted at the beginning of the text by all tokenizers. It is specific to models like BERT and its variants, where it plays a particular role in processing input sequences. Other models might use different conventions or special tokens.
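To make this concrete, here is a minimal sketch (my own addition, assuming bert-base-uncased is available) showing the special tokens the BERT tokenizer inserts:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The bat flew out of the cave.")
# The tokenizer wraps the input with [CLS] at the start and [SEP] at the end.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))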
GloVe: word embedding
We start by showing how a word embedding model works, using the GloVe model as an example.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
We use the gensim library to download and use the word-embedding models.
import gensim.downloader as api

word_vectors = api.load('glove-wiki-gigaword-100')
The word_vectors object is a model, and we can use it to look up individual words as if it were a Python dict.
print(word_vectors['king'].shape)
print(word_vectors['king'][:5])
(100,)
[-0.32307 -0.87616 0.21977 0.25268 0.22976]
Here is a visualization of some of the words with PCA.
words = [
    "king", "princess", "monarch", "throne", "crown",
    "mountain", "ocean", "tv", "rainbow", "cloud", "queen"
]
vectors = np.array([word_vectors[word] for word in words])
pca = PCA(n_components=2)
vectors_pca = pca.fit_transform(vectors)

fig, axes = plt.subplots(1, 1, figsize=(3, 3))
axes.scatter(vectors_pca[:, 0], vectors_pca[:, 1])
for i, word in enumerate(words):
    axes.annotate(word, (vectors_pca[i, 0] + .02, vectors_pca[i, 1] + .02))
axes.set_title('PCA of Word Embeddings')
plt.tight_layout()
plt.show()
Here is an example of the algebraic relationships between the word vectors.
result = word_vectors.most_similar(
    positive=['king', 'woman'],
    negative=['man'], topn=1
)
print(
    f"""
The word closest to 'king' - 'man' + 'woman' is: '{result[0][0]}'
with a similarity score of {result[0][1]}"""
)
The word closest to 'king' - 'man' + 'woman' is: 'queen'
with a similarity score of 0.7698540091514587
GloVe vs BERT: words in context
Download the models
To run these notes, we need to download the following models into a directory named models.
- bert-base-uncased: https://huggingface.co/google-bert/bert-base-uncased
- dpr-ctx_encoder-multiset-base: https://huggingface.co/facebook/dpr-ctx_encoder-multiset-base
- dpr-question_encoder-multiset-base: https://huggingface.co/facebook/dpr-question_encoder-multiset-base
The way to actually download these models is:
- Install Git-LFS.
- Use git clone URL to download the files.
The hierarchy of the model directory is,
.
├── bert-base-uncased
└── facebook
├── dpr-ctx_encoder-multiset-base
└── dpr-question_encoder-multiset-base
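Alternatively, here is a sketch of downloading the same checkpoints with the huggingface_hub library (my own addition; the lecture itself uses git clone):

from huggingface_hub import snapshot_download

# Download each checkpoint into the local ./models hierarchy used below.
for repo_id, local_dir in [
    ("google-bert/bert-base-uncased", "./models/bert-base-uncased"),
    ("facebook/dpr-ctx_encoder-multiset-base", "./models/facebook/dpr-ctx_encoder-multiset-base"),
    ("facebook/dpr-question_encoder-multiset-base", "./models/facebook/dpr-question_encoder-multiset-base"),
]:
    snapshot_download(repo_id=repo_id, local_dir=local_dir)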
Code
Here we load the downloaded BERT model and print out its architecture. At a high level, it is made of an embedding layer, a stack of 12 encoder layers, and a pooler.
tokenizer = BertTokenizer.from_pretrained('./models/bert-base-uncased')
model = BertModel.from_pretrained('./models/bert-base-uncased')
print(model)
BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0-11): 12 x BertLayer(
(attention): BertAttention(
(self): BertSdpaSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
(intermediate_act_fn): GELUActivation()
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(pooler): BertPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)
)
def get_bert_embeddings(sentence, word):
    inputs = tokenizer(sentence, return_tensors='pt')
    outputs = model(**inputs)
    last_hidden_states = outputs.last_hidden_state
    word_tokens = tokenizer.tokenize(sentence)
    word_index = word_tokens.index(word)
    word_embedding = last_hidden_states[0, word_index + 1, :]  # +1 to account for [CLS] token
    return word_embedding
Here we show that the same word (bat) has different encodings in different sentences.
= "The bat flew out of the cave at night."
sentence1 = "He swung the bat and hit a home run."
sentence2
= "bat"
word
= get_bert_embeddings(sentence1, word).detach().numpy()
bert_embedding1 = get_bert_embeddings(sentence2, word).detach().numpy()
bert_embedding2 = word_vectors[word]
word_embedding
print("BERT Embedding for 'bat' in sentence 1:", bert_embedding1[:5])
print("BERT Embedding for 'bat' in sentence 2:", bert_embedding2[:5])
print("GloVe Embedding for 'bat':", word_embedding[:5])
= cosine_similarity([bert_embedding1], [bert_embedding2])[0][0]
bert_similarity = cosine_similarity([word_embedding], [word_embedding])[0][0]
word_embedding_similarity
print()
print(f"Cosine Similarity between BERT embeddings in different contexts: {bert_similarity}")
print(f"Cosine Similarity between GloVe embeddings: {word_embedding_similarity}")
BERT Embedding for 'bat' in sentence 1: [ 0.41316125 -0.12908228 -0.4486571 -0.40492678 -0.15305704]
BERT Embedding for 'bat' in sentence 2: [ 0.64066917 -0.31121528 -0.4408981 -0.16551168 -0.20056099]
GloVe Embedding for 'bat': [-0.47601 0.81705 0.11151 -0.22687 -0.80672]
Cosine Similarity between BERT embeddings in different contexts: 0.45995745062828064
Cosine Similarity between GloVe embeddings: 1.0
Cross Encoder
BERT is a general language model; here we demonstrate another model that has been fine-tuned on question–answer pairs.
More details can be found here.
This model can even be used as a QA bot.
from sentence_transformers import CrossEncoder
model = CrossEncoder(
    'cross-encoder/ms-marco-MiniLM-L-6-v2', max_length=512,
    default_activation_function=torch.nn.Sigmoid()
)

question = "Where is the capital of France?"

answers = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Madrid is the capital of Spain."
]

scores = model.predict([
    (question, answers[0]),
    (question, answers[1]),
    (question, answers[2])
])
print(scores)

most_relevant_idx = torch.argmax(torch.tensor(scores)).item()
print(f"The most relevant passage is: {answers[most_relevant_idx]}")
[0.99965715 0.05289626 0.04520707]
The most relevant passage is: Paris is the capital of France.
Token vs. Sentence Embedding
Two levels of Embeddings
The tokens in a BERT model are embedded in two stages:
- token-level embedding;
- context-level embedding.
Here is an illustration.
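A minimal sketch (my own addition) contrasting the two stages, using the bert-base-uncased checkpoint downloaded above:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('./models/bert-base-uncased')
model = BertModel.from_pretrained('./models/bert-base-uncased')

inputs = tokenizer("The bat flew out of the cave.", return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Stage 1: token-level (static) embeddings, looked up directly from the embedding table.
static_embeddings = model.embeddings.word_embeddings(inputs['input_ids'])
# Stage 2: context-level embeddings produced by the encoder stack.
contextual_embeddings = outputs.last_hidden_state

print(static_embeddings.shape, contextual_embeddings.shape)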
Creating Embeddings for Sentences.
The idea: create an embedding for whole sentences, rather than for individual words / tokens.
We want to capture semantic similarities between sentences.
Naive approaches that fail:
- (Mean pooling) Take the output of the last layer of the transformer model and average over the tokens.
- Just use the embedding of the CLS special token (see the sketch below).
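The CLS-token variant is simple to write down; here is a minimal sketch (my own addition, reusing the torch import, tokenizer and model loaded in the next section). The mean-pooling variant is implemented below.

def cls_sentence_embedding(sentence):
    # Use the final hidden state of the [CLS] token as the sentence vector.
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, 0, :]  # shape: (768,)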
Approach #1: Mean Pooling
Let’s see the code
import torch
import numpy as np
import seaborn as sns
import pandas as pd
import scipy
from transformers import BertModel, BertTokenizer
from datasets import load_dataset
Now we try the mean pooling method.
def get_sentence_embedding(sentence):
    """Compute a sentence embedding by mean pooling the BERT token embeddings."""
    encoded_input = tokenizer(
        sentence, padding=True, truncation=True, return_tensors='pt'
    )
    attention_mask = encoded_input['attention_mask']

    with torch.no_grad():
        output = model(**encoded_input)

    token_embeddings = output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(
        token_embeddings.size()
    ).float()

    # mean pooling operation, considering the BERT input_mask and padding
    sentence_embedding = torch.sum(
        token_embeddings * input_mask_expanded, 1
    ) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )
    return sentence_embedding.flatten().tolist()
def cosine_similarity_matrix(features):
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normalized_features = features / norms
    similarity_matrix = np.inner(normalized_features, normalized_features)
    rounded_similarity_matrix = np.round(similarity_matrix, 4)
    return rounded_similarity_matrix

def plot_similarity(labels, features, rotation):
    sim = cosine_similarity_matrix(features)
    sns.set_theme(font_scale=0.8)
    g = sns.heatmap(sim, xticklabels=labels, yticklabels=labels, vmin=0, vmax=1, cmap="YlOrRd")
    g.set_xticklabels(labels, rotation=rotation)
    g.set_title("Semantic Textual Similarity")
    return g
= "./models/bert-base-uncased"
model_name = BertTokenizer.from_pretrained(model_name)
tokenizer = BertModel.from_pretrained(model_name)
model
= [
messages # Smartphones
"I like my phone",
"My phone is not good.",
"Your cellphone looks great.",
# Weather
"Will it snow tomorrow?",
"Recently a lot of hurricanes have hit the US",
"Global warming is real",
# Food and health
"An apple a day, keeps the doctors away",
"Eating strawberries is healthy",
"Is paleo better than keto?",
# Asking about age
"How old are you?",
"what is your age?",
]
= []
embeddings for t in messages:
= get_sentence_embedding(t)
emb
embeddings.append(emb)
90) plot_similarity(messages, embeddings,
The result is not good. Now we benchmark the method with a standard dataset.
Here the scores are human-annotated similarity judgments.
sts_dataset = load_dataset("mteb/stsbenchmark-sts")
sts = pd.DataFrame({'sent1': sts_dataset['test']['sentence1'],
                    'sent2': sts_dataset['test']['sentence2'],
                    'score': [x/5 for x in sts_dataset['test']['score']]})
sts.head(10)
|   | sent1 | sent2 | score |
|---|---|---|---|
| 0 | A girl is styling her hair. | A girl is brushing her hair. | 0.5000 |
| 1 | A group of men play soccer on the beach. | A group of boys are playing soccer on the beach. | 0.7200 |
| 2 | One woman is measuring another woman's ankle. | A woman measures another woman's ankle. | 1.0000 |
| 3 | A man is cutting up a cucumber. | A man is slicing a cucumber. | 0.8400 |
| 4 | A man is playing a harp. | A man is playing a keyboard. | 0.3000 |
| 5 | A woman is cutting onions. | A woman is cutting tofu. | 0.3600 |
| 6 | A man is riding an electric bicycle. | A man is riding a bicycle. | 0.7000 |
| 7 | A man is playing the drums. | A man is playing the guitar. | 0.4400 |
| 8 | A man is playing guitar. | A lady is playing the guitar. | 0.4400 |
| 9 | A man is playing a guitar. | A man is playing a trumpet. | 0.3428 |
def sim_two_sentences(s1, s2):
    emb1 = get_sentence_embedding(s1)
    emb2 = get_sentence_embedding(s2)
    sim = cosine_similarity_matrix(np.vstack([emb1, emb2]))
    return sim[0, 1]

n_examples = 50

sts = sts.head(n_examples)
sts['avg_bert_score'] = np.vectorize(sim_two_sentences)(sts['sent1'], sts['sent2'])
sts.head(10)
|   | sent1 | sent2 | score | avg_bert_score |
|---|---|---|---|---|
| 0 | A girl is styling her hair. | A girl is brushing her hair. | 0.5000 | 0.9767 |
| 1 | A group of men play soccer on the beach. | A group of boys are playing soccer on the beach. | 0.7200 | 0.9615 |
| 2 | One woman is measuring another woman's ankle. | A woman measures another woman's ankle. | 1.0000 | 0.9191 |
| 3 | A man is cutting up a cucumber. | A man is slicing a cucumber. | 0.8400 | 0.9763 |
| 4 | A man is playing a harp. | A man is playing a keyboard. | 0.3000 | 0.9472 |
| 5 | A woman is cutting onions. | A woman is cutting tofu. | 0.3600 | 0.8787 |
| 6 | A man is riding an electric bicycle. | A man is riding a bicycle. | 0.7000 | 0.9793 |
| 7 | A man is playing the drums. | A man is playing the guitar. | 0.4400 | 0.9730 |
| 8 | A man is playing guitar. | A lady is playing the guitar. | 0.4400 | 0.9519 |
| 9 | A man is playing a guitar. | A man is playing a trumpet. | 0.3428 | 0.9588 |
The mean pooling method gives very high similarity scores for all samples. This is not good.
We can use the Pearson Correlation to quantify the effect:
pc = scipy.stats.pearsonr(sts['score'], sts['avg_bert_score'])
print(f'Pearson correlation coefficient = {pc[0]}\np-value = {pc[1]}')
Pearson correlation coefficient = 0.32100172202695454
p-value = 0.02303013509739929
Approach #2: Sentence Transformer
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = []
for t in messages:
    emb = list(model.encode(t))
    embeddings.append(emb)

plot_similarity(messages, embeddings, 90)
def sim_two_sentences(s1, s2):
    emb1 = list(model.encode(s1))
    emb2 = list(model.encode(s2))
    sim = cosine_similarity_matrix(np.vstack([emb1, emb2]))
    return sim[0, 1]

sts['mini_LM_score'] = np.vectorize(sim_two_sentences)(sts['sent1'], sts['sent2'])
sts.head(10)
|   | sent1 | sent2 | score | avg_bert_score | mini_LM_score |
|---|---|---|---|---|---|
| 0 | A girl is styling her hair. | A girl is brushing her hair. | 0.5000 | 0.9767 | 0.8052 |
| 1 | A group of men play soccer on the beach. | A group of boys are playing soccer on the beach. | 0.7200 | 0.9615 | 0.7886 |
| 2 | One woman is measuring another woman's ankle. | A woman measures another woman's ankle. | 1.0000 | 0.9191 | 0.9465 |
| 3 | A man is cutting up a cucumber. | A man is slicing a cucumber. | 0.8400 | 0.9763 | 0.8820 |
| 4 | A man is playing a harp. | A man is playing a keyboard. | 0.3000 | 0.9472 | 0.3556 |
| 5 | A woman is cutting onions. | A woman is cutting tofu. | 0.3600 | 0.8787 | 0.5017 |
| 6 | A man is riding an electric bicycle. | A man is riding a bicycle. | 0.7000 | 0.9793 | 0.8074 |
| 7 | A man is playing the drums. | A man is playing the guitar. | 0.4400 | 0.9730 | 0.4757 |
| 8 | A man is playing guitar. | A lady is playing the guitar. | 0.4400 | 0.9519 | 0.6182 |
| 9 | A man is playing a guitar. | A man is playing a trumpet. | 0.3428 | 0.9588 | 0.5096 |
Now we calculate the Pearson correlation again; it is much higher!
pc = scipy.stats.pearsonr(sts['score'], sts['mini_LM_score'])
print(f'Pearson correlation coefficient = {pc[0]}\np-value = {pc[1]}')
Pearson correlation coefficient = 0.9303740673726044
p-value = 1.4823857251915091e-22
History of Sentence Transformers
Different Tasks for Sentence Transformers
- Similarity Calculation
- Question Answering
For the QA task, we have two encoders, one for the question and one for the answer, and they are trained with a contrastive loss function.
Training a Dual Encoder
Here we will
- Build a dual encoder, and
- Train the encoder with Q+A pairs
Dual Encoder
We train this model with a contrastive loss:
\mathcal{L} = \sum_{ij} \left[ y_{ij} \cdot \left( 1 - \underbrace{\textrm{sim}(u_i, v_j)}_\textrm{larger = better} \right)^2 +\ (1 - y_{ij}) \cdot \underbrace{ \max\left(0,\; \textrm{sim}(u_i, v_j) - m \right)^2 }_\textrm{should be smaller than m} \right]
where
- y_{ij} = 1 when question i and answer j match.
- u_i is the question embedding vector.
- v_j is the answer embedding vector.
- m is the margin: the similarity of a non-matching pair should stay below m.
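Here is a minimal sketch of this loss in PyTorch (my own illustration of the formula above; the actual training below uses the cross-entropy trick instead):

import torch

def margin_contrastive_loss(question_embed, answer_embed, labels, margin=0.5):
    # question_embed: (N, D), answer_embed: (M, D), labels: (N, M) with y_ij in {0, 1}.
    u = torch.nn.functional.normalize(question_embed, dim=-1)
    v = torch.nn.functional.normalize(answer_embed, dim=-1)
    sim = u @ v.T                                # sim(u_i, v_j) for every pair
    positive = labels * (1.0 - sim) ** 2         # pull matching pairs towards sim = 1
    negative = (1.0 - labels) * torch.clamp(sim - margin, min=0.0) ** 2  # push the rest below the margin
    return (positive + negative).sum()

# Toy usage: 3 questions, 3 answers, with the matching pairs on the diagonal.
q, a, y = torch.randn(3, 8), torch.randn(3, 8), torch.eye(3)
print(margin_contrastive_loss(q, a, y))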
import torch
import numpy as np
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt
from transformers import AutoTokenizer
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
DEVICE
device(type='cuda', index=0)
The Cross-Entropy Trick
Here we demonstrate the usage of the cross-entropy loss
-\frac{1}{N} \sum_{i=1}^{N} \log \left( \frac{\exp(S_{ii})}{\sum_{j=1}^N \exp(S_{ij})} \right)
as (a special kind of) contrastive loss, where S_{ij} is the similarity score between question i and answer j. This loss function works well.
Notice that:
- PyTorch takes the logits, so the softmax is performed during the loss calculation.
- We are effectively treating the similarity of question i vs answer j as the logit (unnormalised log-probability) of sample i being in class j.
- I think the cross entropy here is a surrogate for the contrastive loss. But why don't we optimise the contrastive loss directly?
def contrastive_loss(data):
    target = torch.arange(data.size(0))  # (N, )
    loss = torch.nn.CrossEntropyLoss()(data, target)
    return loss
df = pd.DataFrame(
    [
        [4.3, 1.2, 0.05, 1.07],
        [0.18, 3.2, 0.09, 0.05],
        [0.85, 0.27, 2.2, 1.03],
        [0.23, 0.57, 0.12, 5.1]
    ]
)
data = torch.tensor(df.values, dtype=torch.float32)
loss = contrastive_loss(data)
print(loss)
tensor(0.1966)
Here we show that the more "diagonal" the score matrix is, the smaller the loss value.
N = data.size(0)
non_diag_mask = ~torch.eye(N, N, dtype=bool)

fig, axes = plt.subplots(1, 3, figsize=(6, 2))
for inx in range(3):
    data = torch.tensor(df.values, dtype=torch.float32)
    data[range(N), range(N)] += inx * 2
    data[non_diag_mask] /= 2
    axes[inx].set_title(f"Loss = {contrastive_loss(data):.3f}")
    axes[inx].imshow(data)
plt.axis('off')
plt.tight_layout()
plt.show()
The Encoder
class Encoder(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim_in, embed_dim_out):
        super().__init__()
        self.embedding_layer = torch.nn.Embedding(vocab_size, embed_dim_in)
        self.encoder = torch.nn.TransformerEncoder(
            torch.nn.TransformerEncoderLayer(
                d_model=embed_dim_in,
                nhead=8,
                batch_first=True,
            ),
            num_layers=3,
            norm=torch.nn.LayerNorm(normalized_shape=[embed_dim_in,]),
            enable_nested_tensor=False,
        )
        self.projection = torch.nn.Linear(embed_dim_in, embed_dim_out)

    def forward(self, tokeniser_out):
        x = self.embedding_layer(           # (BS, L, DIM_IN)
            tokeniser_out['input_ids']
        )
        x = self.encoder(                   # (BS, L, DIM_IN)
            x,
            src_key_padding_mask=tokeniser_out[
                'attention_mask'
            ].logical_not()
        )
        cls_embed = x[:, 0, :]              # (BS, DIM_IN)
        return self.projection(cls_embed)
tokenizer = AutoTokenizer.from_pretrained('./models/bert-base-uncased')
model = Encoder(tokenizer.vocab_size, 512, 128)
x = tokenizer(
    [
        "hello world this is a really nice sentence I want to know the sequence length",
        "how are you",
        "hello",
        "okay"
    ], padding=True, truncation=True, return_tensors='pt'
)
y = model(x)
y.shape
torch.Size([4, 128])
The Dataset
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, datapath, n_rows, offset=0):
        self.data = pd.read_csv(
            datapath, sep="\t",
            header=0,
            nrows=n_rows + offset
        ).iloc[offset:]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return (
            self.data.iloc[idx]['questions'],
            self.data.iloc[idx]['answers']
        )
dataset_train = MyDataset('./shared_data/nq_sample.tsv', n_rows=550)
dataset_valid = MyDataset('./shared_data/nq_sample.tsv', n_rows=100, offset=550)
dataset_valid.data.head(5)
|   | questions | answers |
|---|---|---|
| 550 | when did parents just don't understand come out | The song was released as a single in spring 1988. |
| 551 | was hawaii part of the us during pearl harbor | The attack on Pearl Harbor was a surprise mili... |
| 552 | who is the guy in the sugarland stuck like glue | McPartlin also played Devon ``Captain Awesome'... |
| 553 | who wrote the song all is well with my soul | ``It Is Well With My Soul'' is a hymn penned b... |
| 554 | what time does pod save america come out | The podcast debuted in January 2017 and airs t... |
Training Loop
def train(
    dataset_train,
    dataset_valid,
    num_epochs=10,
    lr=1e-5,
    batch_size=32,
    embed_size=512,
    output_embed_size=128,
    max_seq_len=64,
):
    n_iters_train = len(dataset_train) // batch_size + 1
    n_iters_valid = len(dataset_valid) // batch_size + 1
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    question_encoder = Encoder(tokenizer.vocab_size, embed_size, output_embed_size)
    answer_encoder = Encoder(tokenizer.vocab_size, embed_size, output_embed_size)

    question_encoder.to(DEVICE)
    answer_encoder.to(DEVICE)

    dataloader_train = torch.utils.data.DataLoader(
        dataset_train, batch_size=batch_size, shuffle=True
    )
    dataloader_valid = torch.utils.data.DataLoader(
        dataset_valid, batch_size=batch_size, shuffle=True
    )
    parameters = list(
        question_encoder.parameters()
    ) + list(
        answer_encoder.parameters()
    )
    optimizer = torch.optim.Adam(parameters, lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    loss_history = {'train': [], 'valid': []}
    for epoch in tqdm(range(num_epochs)):
        running_loss_train = []
        # train the NN
        for idx, data_batch in enumerate(dataloader_train):
            question, answer = data_batch
            question_tok = tokenizer(
                question, padding=True, truncation=True,
                return_tensors='pt', max_length=max_seq_len
            ).to(DEVICE)
            answer_tok = tokenizer(
                answer, padding=True, truncation=True,
                return_tensors='pt', max_length=max_seq_len
            ).to(DEVICE)

            question_embed = question_encoder(question_tok)      # BS x 128
            answer_embed = answer_encoder(answer_tok)            # BS x 128
            similarity_scores = question_embed @ answer_embed.T  # BS x BS

            target = torch.arange(
                question_embed.shape[0], dtype=torch.long
            ).to('cuda')
            loss = loss_fn(similarity_scores, target)
            running_loss_train += [loss.item()]
            if idx == n_iters_train - 1:
                loss_history['train'].append(np.mean(running_loss_train))

            # reset optimizer
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # get the validation loss
        running_loss_valid = []
        for idx, data_batch in enumerate(dataloader_valid):
            question, answer = data_batch
            question_tok = tokenizer(
                question, padding=True, truncation=True,
                return_tensors='pt', max_length=max_seq_len
            ).to(DEVICE)
            answer_tok = tokenizer(
                answer, padding=True, truncation=True,
                return_tensors='pt', max_length=max_seq_len
            ).to(DEVICE)

            with torch.no_grad():
                question_embed = question_encoder.eval()(question_tok)  # BS x 128
                answer_embed = answer_encoder.eval()(answer_tok)
                similarity_scores = question_embed @ answer_embed.T     # BS x BS
                target = torch.arange(
                    question_embed.shape[0], dtype=torch.long
                ).to('cuda')
                loss = loss_fn(similarity_scores, target)
                running_loss_valid += [loss.item()]
                if idx == n_iters_valid - 1:
                    loss_history['valid'].append(np.mean(running_loss_valid))

    return question_encoder, answer_encoder, loss_history

qe, ae, history = train(
    dataset_train, dataset_valid,
    num_epochs=50, batch_size=128, lr=2e-5
)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:23<00:00, 2.17it/s]
If we check the validation loss, clearly the network is overfitting.
This is to be expected, as the dataset only contains a few hundred examples.
plt.figure(figsize=(6, 2.5))
for name, item in history.items():
    plt.plot(item, label=name, marker='+')
plt.legend()
# plt.yscale('log')
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.tight_layout()
plt.show()
Notice that we have to call eval() to disable the random dropout.
question = 'What is the tallest mountain in the world?'
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
question_tok = tokenizer(
    question, padding=True, truncation=True, return_tensors='pt', max_length=64
).to(DEVICE)

with torch.no_grad():
    question_emb = qe.eval()(question_tok)[0]

print(question_emb[:5])
tensor([-0.5661, -0.5561, 0.2012, 0.0749, -0.7478], device='cuda:0')
Here we see that the embeddings of a paired question and answer are close.
answers = [
    "What is the tallest mountain in the world?",
    "The tallest mountain in the world is Mount Everest.",
    "Who is donald duck?"
]
answer_emb = []

for answer in answers:
    tok = tokenizer(
        answer, padding=True, truncation=True, return_tensors='pt', max_length=64
    ).to(DEVICE)
    with torch.no_grad():
        emb = ae.eval()(tok)[0]
    answer_emb.append(emb)

for answer in answer_emb:
    print(answer @ question_emb)
tensor(11.2628, device='cuda:0')
tensor(9.2472, device='cuda:0')
tensor(-3.7265, device='cuda:0')
Inference
Dual Encoders for RAG
The Answer Encoder is used in the ingest process.
The Question Encoder is used in the query process.
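A minimal sketch of that split (my own addition), reusing the dual encoder trained above; qe, ae, tokenizer and DEVICE are assumed to still be in scope:

import torch

documents = [
    "The tallest mountain in the world is Mount Everest.",
    "The podcast debuted in January 2017.",
]

# Ingest: embed every document once with the answer encoder and store the matrix.
with torch.no_grad():
    doc_tok = tokenizer(documents, padding=True, truncation=True,
                        return_tensors='pt', max_length=64).to(DEVICE)
    doc_emb = ae.eval()(doc_tok)          # (num_docs, 128)

# Query: embed the incoming question with the question encoder and score it against the index.
with torch.no_grad():
    q_tok = tokenizer("What is the tallest mountain in the world?", padding=True,
                      truncation=True, return_tensors='pt', max_length=64).to(DEVICE)
    q_emb = qe.eval()(q_tok)              # (1, 128)

scores = (q_emb @ doc_emb.T).squeeze(0)
print(documents[int(torch.argmax(scores))])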
Similarity Search
Think of similarity as the inverse of a "distance": we want to find the "close neighbours" of a given query.
The naive calculation can be slow, with O(N) complexity. However, there are faster approximate nearest-neighbour algorithms and libraries, sketched below:
- HNSW
- Annoy
- FAISS
- NMSLib
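Here is a minimal sketch of approximate nearest-neighbour search with FAISS (my own addition; it assumes the faiss-cpu package, which is not used elsewhere in these notes):

import numpy as np
import faiss  # assumption: the faiss-cpu package is installed

dim = 128
rng = np.random.default_rng(0)
vectors = rng.random((10_000, dim), dtype=np.float32)
faiss.normalize_L2(vectors)            # normalise so that inner product = cosine similarity

index = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)  # HNSW graph index
index.add(vectors)                     # ingest the document embeddings

query = rng.random((1, dim), dtype=np.float32)
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)   # top-5 approximate neighbours
print(ids[0], scores[0])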
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import (
AutoTokenizer, DPRContextEncoder, DPRQuestionEncoder
)
import logging

logging.getLogger("transformers.modeling_utils").setLevel(logging.ERROR)
def cosine_similarity_matrix(features):
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normalized_features = features / norms
    similarity_matrix = np.inner(normalized_features, normalized_features)
    rounded_similarity_matrix = np.round(similarity_matrix, 4)
    return rounded_similarity_matrix
Using just an embedding model
If we use the same embedding model (encoder) for both the text chunks and the user query, there is one possible issue:
a chunk that simply restates the question is preferred by the similarity search over the actual answer!
answers = [
    "What is the tallest mountain in the world?",
    "The tallest mountain in the world is Mount Everest.",
    "Mount Shasta",
    "I like my hike in the mountains",
    "I am going to a yoga class"
]

question = 'What is the tallest mountain in the world?'

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
question_embedding = list(model.encode(question))

sim = []
for answer in answers:
    answer_embedding = list(model.encode(answer))
    sim.append(cosine_similarity_matrix(np.stack([question_embedding, answer_embedding]))[0, 1])

print(sim)

best_inx = np.argmax(sim)
print(f"Question = {question}")
print(f"Best answer = {answers[best_inx]}")
[1.0, 0.7976, 0.4001, 0.3559, 0.0972]
Question = What is the tallest mountain in the world?
Best answer = What is the tallest mountain in the world?
Using a Dual Encoder Model
We use the pre-trained dual encoder model (DPR) from Facebook to demonstrate the same task.
This is the paper for the model.
answer_tokenizer = AutoTokenizer.from_pretrained(
    "./models/facebook/dpr-ctx_encoder-multiset-base"
)
answer_encoder = DPRContextEncoder.from_pretrained(
    "./models/facebook/dpr-ctx_encoder-multiset-base"
)
question_tokenizer = AutoTokenizer.from_pretrained(
    "./models/facebook/dpr-question_encoder-multiset-base"
)
question_encoder = DPRQuestionEncoder.from_pretrained(
    "./models/facebook/dpr-question_encoder-multiset-base"
)

# Compute the question embeddings
question_tokens = question_tokenizer(
    question, return_tensors="pt"
)["input_ids"]

question_embedding = question_encoder(
    question_tokens
).pooler_output.flatten().tolist()

sim = []
for answer in answers:
    answer_tokens = answer_tokenizer(
        answer, return_tensors="pt"
    )["input_ids"]
    answer_embedding = answer_encoder(
        answer_tokens
    ).pooler_output.flatten().tolist()
    sim.append(
        cosine_similarity_matrix(
            np.stack([question_embedding, answer_embedding])
        )[0, 1]
    )

print(sim)

best_inx = np.argmax(sim)
print(f"Question = {question}")
print(f"Best answer = {answers[best_inx]}")
[0.6253, 0.7472, 0.5506, 0.3687, 0.25]
Question = What is the tallest mountain in the world?
Best answer = The tallest mountain in the world is Mount Everest.
Extra Topics
Two Stage Retrieval
Sentence transformers (bi-encoders) can be inaccurate. We can improve the results with two-stage retrieval: first retrieve candidate passages with a fast bi-encoder, then rerank them with a more accurate cross encoder.
This is called Retrieval and Rerank.
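Here is a minimal sketch of this pipeline (my own addition), reusing the bi-encoder and cross encoder models that appeared earlier in these notes; the corpus and top-k value are made up for illustration:

import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

corpus = [
    "Paris is the capital of France.",
    "Mount Everest is the tallest mountain in the world.",
    "Berlin is the capital of Germany.",
    "I am going to a yoga class.",
]
query = "Where is the capital of France?"

# Stage 1: fast bi-encoder retrieval over the whole corpus.
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
corpus_emb = bi_encoder.encode(corpus)      # (N, 384) numpy array
query_emb = bi_encoder.encode(query)        # (384,) numpy array
scores = corpus_emb @ query_emb / (
    np.linalg.norm(corpus_emb, axis=1) * np.linalg.norm(query_emb)
)
top_k = np.argsort(-scores)[:3]             # keep the 3 best candidates

# Stage 2: rerank the candidates with the slower but more accurate cross encoder.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)
rerank_scores = cross_encoder.predict([(query, corpus[i]) for i in top_k])
best = top_k[int(np.argmax(rerank_scores))]
print(corpus[best])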