LLM2Vec-Gen is a recipe for training interpretable, generative embeddings that encode an LLM's potential answer to a query rather than the query itself.
Load a pretrained model:
```python
import torch
from llm2vec_gen import LLM2VecGenModel

model = LLM2VecGenModel.from_pretrained("McGill-NLP/LLM2Vec-Gen-Qwen3-8B")
```
For example, you can use the model for retrieval with the following snippet:
```python
q_instruction = "Generate a passage that best answers this question: "
d_instruction = "Summarize the following passage: "

queries = [
    "where do polar bears live and what's their habitat",
    "what does disk cleanup mean on a computer",
]
q_reps = model.encode([q_instruction + q for q in queries])

documents = [
    "Polar bears live throughout the circumpolar North in the Arctic, spanning across Canada, Alaska (USA), Russia, Greenland, and Norway. Their primary habitat is sea ice over the continental shelf, which they use for hunting, mating, and traveling. They are marine mammals that rely on this environment to hunt seals.",
    "Disk Cleanup is a built-in Windows tool that frees up hard drive space by scanning for and deleting unnecessary files like temporary files, cached data, Windows updates, and items in the Recycle Bin. It improves computer performance by removing \"junk\" files, which can prevent the system from running slowly due to low storage.",
]
d_reps = model.encode([d_instruction + d for d in documents])

# Compute cosine similarity between query and document embeddings
q_reps_norm = torch.nn.functional.normalize(q_reps, p=2, dim=1)
d_reps_norm = torch.nn.functional.normalize(d_reps, p=2, dim=1)
cos_sim = torch.mm(q_reps_norm, d_reps_norm.transpose(0, 1))
print(cos_sim)
"""
tensor([[0.8750, 0.1182],
        [0.0811, 0.9336]])
"""
```
Note that in all examples, the instruction should be phrased as if the model were generating the answer to the input.
Examples of applying LLM2Vec-Gen to other tasks (e.g., classification and clustering) can be found in the paper’s GitHub repository.
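As a minimal sketch of how such tasks could look (not the repository's actual recipe): zero-shot classification can score an input embedding against embeddings of candidate label descriptions. Random tensors stand in here for `model.encode(...)` outputs, and the shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for model.encode(...) outputs (hypothetical hidden size of 8).
text_emb = torch.randn(1, 8)      # embedding of the input text
label_embs = torch.randn(3, 8)    # one embedding per candidate label description

# Cosine similarity of the input against each label; broadcasting gives shape (3,).
sims = F.cosine_similarity(text_emb, label_embs, dim=1)
pred = sims.argmax().item()       # index of the most similar label
```

The same embedding-space comparison underlies clustering: feed the encoded representations to any standard clustering algorithm.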
LLM2Vec-Gen embeddings are interpretable: you can decode the content they encode with the following code:
```python
_, recon_hidden_states = model.encode(
    "what does disk cleanup mean on a computer", get_recon_hidden_states=True
)
# recon_hidden_states: torch.Tensor of shape (1, compression token size, hidden_dim)
answer = model.generate(recon_hidden_states=recon_hidden_states, max_new_tokens=55)
print(answer)
"""
* **\n\n**Disk Cleanup** is a built-in utility in Windows that helps you **free up disk space** by **removing unnecessary files** from your computer. It is designed to clean up temporary files, system cache, and other files that are no longer needed.\n\n
"""
```
This snippet makes LLM2Vec-Gen generate the answer from the generative embeddings of the input (`recon_hidden_states`), revealing what the embedding encodes.
Check out our paper’s thread on X.
> Your LLM already knows the answer. Why is your embedding model still encoding the question?
>
> 🚨Introducing LLM2Vec-Gen: your frozen LLM generates the answer's embedding in a single forward pass — without ever generating the answer. Not only that, the frozen LLM can decode the…
>
> — Vaibhav Adlakha (@vaibhav_adlakha) March 12, 2026