Opus 4.8 and embedding models

You 🧑

https://news.ycombinator.com/item?id=48311647 What does HN think about the new Opus and Anthropic ... and I guess AI in general

▶ Tool: fetch_url [url='https://news.ycombinator.com/item?id=48311647']

Assistant 🤖

Let me grab more of the thread — it got truncated at 50K characters.

▶ Tool: fetch_url [url='https://news.ycombinator.com/item?id=48311647&p=2']

Assistant 🤖

Let me try to grab more of the discussion content.

Assistant 🤖

Alright, I got a solid chunk of the discussion before hitting rate limits. Here's a summary of what Hacker News thinks about Claude Opus 4.8 and the broader AI landscape, based on the ~1000+ comment thread:

🧠 On Opus 4.8 Specifically

1. Incrementalism fatigue is real. The top comment (by NiloCK, heavily upvoted) notes this is the third minor bump (4.5 → 4.6 → 4.7 → 4.8) and that improvements are becoming "less legible" to end users: they can't tell whether 4.7 is actually better than 4.5 anymore. The sentiment: "Maybe my own tastes are saturated now (it's smarter than me?) and I'll never again perceive model progress."

2. One positive anecdote on 4.8 itself. User Zavora says 4.8 refactored their 4.6-generated code and the output was "Wow! Simple, elegant and exactly what I wanted to see." They note it's slower on difficult tasks but doesn't burn as many tokens, which is a tradeoff.

3. Suspicion about self-distillation. User teleforce argues that all the 4.x releases after 4.5 are likely self-distillation fine-tunes, citing papers from MIT-ETH, UCLA, and Apple. They call self-distillation "the second best thing to happen to LLMs after the transformer breakthrough."

🔮 On the Broader AI Landscape

4. "Frontier models may be the last of their kind." The most provocative take (by onlyrealcuzzo, heavily discussed): the next generation of giant frontier models could be the last because:
- Smaller models have orders of magnitude of low-hanging fruit (MLA, MoE, Medusa, GRAM)
- A 60-90B model could match current SOTA in coding within 2-3 years
- The GRAM paper (Generative Recursive Reasoning Models) claims 2000-10000x reasoning improvements on small models vs. ones thousands of times larger

5. GRAM debate: hype vs. alignment concerns. A massive subthread erupted around GRAM:
- Proponents say it could be a paradigm shift, enabling latent-space recursive reasoning that's wildly more efficient
- Skeptics (sometimelurker) argue this has been reinvented under many names ("chain of continuous thought," "latent reasoning," "looped language models") and has a fatal flaw: it can't be interpreted or aligned. No readable chain-of-thought means you can't catch the model planning to cheat or do harm
- The acronym itself was roasted relentlessly (GRRM jokes, "KILOGRAM," etc.)

6. The economics / business model critique. User mrandish argues frontier labs (OpenAI, Anthropic) have a strong incentive to keep alive the perception that you need $200K+ GPU racks for frontier intelligence, because if small models catch up, their pricing collapses. With IPOs looming, they need to maintain the narrative that scale = quality.

7. Google as the spoiler. Several commenters note Google is best positioned to undercut the others because they own their hardware supply chain, are already profitable, and can play the long game. Their smaller/faster models (like Gemini 3.5 Flash) are seen as a competitive threat.

8. Diminishing returns vs. "capability hunger." ACCount37 argues against the plateau thesis: "There will always be demand for an LLM that's 5-10% more capable at some highly advanced task." New applications emerge as capabilities improve: drug discovery, physics, organizational governance, etc.

9. Intelligence saturation for everyday tasks. IgorPartola makes a thoughtful point: for coding, a model that matches your intelligence is a feature, not a bug. A 10x smarter model would slow you down because you'd need to verify incomprehensible output. But for hard-to-solve/easy-to-verify problems (protein folding, etc.), hyperintelligence is a huge boon, likely going to pharma/biotech first.

Overall vibe: A mix of mild disappointment that Opus 4.8 isn't a bigger leap, deep skepticism about the sustainability of scaling, genuine excitement about GRAM (tempered by alignment concerns), and a sense that the "small models catching up" narrative is the undercurrent of 2026.

You 🧑

Tell me more about self-distillation techniques. I work for MongoDB and our recent voyage-4 family is all distill trained off voyage-4-large

▶ Tool: read_file [path='~/Personal/skills/skill_index.md']

▶ Tool: web_search [query='self-distillation technique LLM embedding models voy]…

▶ Tool: web_search [query='"self-distillation" LLM training teacher student mod]…

▶ Tool: fetch_url [url='https://blog.voyageai.com/2026/01/15/voyage-4/']

▶ Tool: fetch_url [url='https://arxiv.org/abs/2604.01193']

▶ Tool: fetch_url [url='https://arxiv.org/abs/2601.18734']

▶ Tool: fetch_url [url='https://arxiv.org/pdf/2605.12400']

▶ Tool: fetch_url [url='https://siyan-zhao.github.io/blog/2026/opsd/']

Assistant 🤖

Great question, and that's a really interesting case — a dense-to-dense embedding model family (voyage-4) distilled from a larger flagship model. That's a slightly different flavor than the LLM reasoning distillation I was just reading about. Let me break down the self-distillation landscape for you.

What Self-Distillation Is (vs. Classic Distillation)

Classic distillation = a larger, separate teacher model trains a smaller student. Self-distillation = the same model acts as both teacher and student, usually through different conditioning or via iterative improvement on its own outputs.

The key insight that's been driving the 2025-2026 boom: once a model is capable enough, it can improve itself without needing a bigger external teacher.

The Main Flavors of Self-Distillation

1. Simple Self-Distillation (SSD) — Apple / MIT-ETH

Paper: Embarrassingly Simple Self-Distillation Improves Code Generation (arXiv:2604.01193, April 2026)

This is the simplest form and might be closest to what you're doing with voyage-4:

How it works: Sample outputs from the model at a specific temperature + truncation config, then fine-tune on those samples with standard supervised fine-tuning. No verifier, no teacher model, no RL.
Why it works: The authors trace the gains to a precision-exploration conflict in LLM decoding. SSD reshapes token distributions in a context-dependent way — suppressing distractor tails where precision matters, while preserving useful diversity where exploration matters.
Results: Qwen3-30B-Instruct went from 42.4% → 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems. Works across scales (4B, 8B, 30B) and both instruct and thinking variants.
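
The sampling half of that recipe can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the use of top-k as the truncation rule, and the toy logits are all made up here, and the paper's actual decoding configuration may differ.

```python
import math
import random

def truncated_distribution(logits, temperature=0.8, top_k=2):
    """Temperature-scale the logits, keep the top_k tokens, renormalize."""
    scaled = [l / temperature for l in logits]
    # Indices of the top_k largest scaled logits.
    keep = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    m = max(scaled[i] for i in keep)  # subtract the max for numerical stability
    weights = {i: math.exp(scaled[i] - m) for i in keep}
    total = sum(weights.values())
    return [weights.get(i, 0.0) / total for i in range(len(logits))]

def sample_token(logits, rng, temperature=0.8, top_k=2):
    """Draw one token id from the truncated, temperature-scaled distribution."""
    probs = truncated_distribution(logits, temperature, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy 4-token vocabulary (hypothetical); tokens 2 and 3 form the "distractor tail".
toy_logits = [2.0, 1.5, 0.1, -1.0]
rng = random.Random(0)
samples = [sample_token(toy_logits, rng) for _ in range(5)]
# In SSD, sequences sampled this way would become ordinary SFT targets.
```

Truncation zeroes out the tail entirely, so fine-tuning on these samples moves probability mass away from tokens the model itself rarely preferred.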

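For embedding models like voyage-4, the closest analogue would be: generate embeddings from voyage-4-large for a diverse set of queries/documents, then train voyage-4/voyage-4-lite to match those outputs. Strictly speaking, that is classic teacher-student distillation, because the teacher is a separate, larger model; it is "self"-distillation only in the loose sense that teacher and students belong to the same model family.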

2. On-Policy Self-Distillation (OPSD) — UCLA / Meta

Paper: Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs (arXiv:2601.18734, Jan 2026)

This is more sophisticated and aimed at reasoning:

How it works: A single model plays two roles through different conditioning contexts:
Student policy p_S(y|x): sees only the problem (no hints)
Teacher policy p_T(y|x, y*): also sees the ground-truth solution as privileged info
Both share the same parameters. The student generates on-policy rollouts, then the teacher's distribution over the student's own tokens provides dense per-token supervision via KL divergence.
Key insight: The teacher implicitly "rationalizes" the correct solution — it doesn't need to generate tokens, just evaluate likelihoods on the student's trajectory given privileged info. This means every token gets a learning signal, even when the final answer is wrong.
Training signal clipping: They found that stylistic tokens (like "wait", "think", "therefore") dominate the KL divergence signal, so they apply per-token pointwise KL clipping to focus on mathematically meaningful tokens.
Results: Matches or exceeds GRPO (Group Relative Policy Optimization) while being dramatically more token-efficient — only 1 rollout per problem vs. GRPO's 8, and only 1024 generation length vs. 16k for GRPO.
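
To make the dense per-token signal concrete, here is a rough sketch of a clipped per-token KL loss. Several things are assumptions on my part: the KL direction (teacher relative to student), clipping each position's total KL rather than individual vocabulary terms, the clip value, and all names and toy logits. Treat it as an illustration of the idea, not the paper's exact objective.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def per_token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at each position of the student's rollout."""
    kls = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        p, q = softmax(t_row), softmax(s_row)
        kls.append(sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0))
    return kls

def clipped_distill_loss(teacher_logits, student_logits, clip=1.0):
    """Mean per-token KL, with each position's contribution capped at `clip`."""
    kls = per_token_kl(teacher_logits, student_logits)
    return sum(min(k, clip) for k in kls) / len(kls)

# Toy rollout: 3 positions, vocabulary of 3 tokens (hypothetical values).
# The teacher rows stand in for the same model scored with the solution in context.
teacher_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 1.0]]
student_logits = [[2.0, 0.0, 0.0], [3.0, 0.0, 0.0], [1.0, 1.2, 0.9]]

kls = per_token_kl(teacher_logits, student_logits)
unclipped = sum(kls) / len(kls)
clipped = clipped_distill_loss(teacher_logits, student_logits, clip=1.0)
```

Position 1, where student and teacher disagree sharply, dominates the unclipped mean; the cap keeps any single token from swamping the rest of the trajectory.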

3. OGLS-SD (Outcome-Guided Logit Steering for Self-Distillation) — May 2026

Paper: OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering (arXiv:2605.12400)

The latest evolution, combining OPSD's dense token-level signal with outcome verification:

How it works: On-policy student rollouts are steered at the logit level using outcome feedback. It combines the dense supervision of self-distillation with the correctness signal that outcome-based methods provide.
This addresses a limitation of OPSD: if the teacher can't understand the correct solution either (problem too hard for the model), the teacher's distribution isn't helpful. OGLS-SD uses outcome signals to guide which student trajectories are worth learning from.

4. Self-Distillation for Embedding Models (Your Use Case)

For embedding models specifically, distillation typically works differently than for generative LLMs:

Embedding-level distillation: The student is trained to match the teacher's embedding vectors directly (e.g., via cosine similarity or MSE loss on normalized embeddings). A score-level variant instead matches a temperature-scaled softmax over the teacher's query-document similarities, negatives included.
Ranking distillation: The student learns to reproduce the teacher's pairwise ranking behavior (which document is more relevant to which query), using the rankings that voyage-4-large would produce as the teacher.
The "self" twist in voyage-4: What makes it feel like self-distillation (rather than just regular distillation) is likely that voyage-4-large generates the training data (embedding pairs, hard negatives, etc.) that the smaller models train on, and the smaller models may share architectural components or pre-training with the large model. The shared embedding space design (all models produce compatible embeddings) is a key enabler: the smaller models don't just mimic performance, they produce vectors that are directly interchangeable with the teacher's.
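
As a rough illustration of matching teacher vectors directly, here is a minimal cosine-based loss. It assumes student and teacher emit vectors of the same dimension, and the function names and toy vectors are hypothetical; I don't know which loss Voyage actually uses.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distill_loss(student_vecs, teacher_vecs):
    """Mean (1 - cosine similarity) over paired student/teacher embeddings."""
    losses = []
    for s, t in zip(student_vecs, teacher_vecs):
        s, t = normalize(s), normalize(t)
        losses.append(1.0 - sum(a * b for a, b in zip(s, t)))
    return sum(losses) / len(losses)

# Toy 2-d embeddings (hypothetical values).
teacher_vecs = [[2.0, 0.0]]
aligned = cosine_distill_loss([[1.0, 0.0]], teacher_vecs)     # same direction
orthogonal = cosine_distill_loss([[0.0, 1.0]], teacher_vecs)  # unrelated direction
```

A loss of this kind pulls each student vector toward the teacher's vector for the same text, which is one way a shared embedding space could be encouraged.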

Why This Matters for MongoDB / Voyage

The voyage-4 approach (shared embedding space + MoE + distillation) is actually a great example of where the industry is heading:

Cost-performance decoupling: Users can embed documents once with voyage-4-large and query with voyage-4-lite, getting most of the accuracy at a fraction of the serving cost
Self-distillation as a moat: The fact that all models share an embedding space means the smaller models benefit from the large model's representation quality without needing its parameter count
This aligns with the HN consensus — the real action in 2026 is in making smaller models punch above their weight via sophisticated training techniques, not just scaling up

Would you like me to dig deeper into any specific aspect — like how exactly embedding distillation loss functions differ from generative distillation, or the specific architectural choices (MoE, Matryoshka) that make shared embedding spaces work?

You 🧑

Actually I'm super interested in how embedding models are trained overall. It sounds like you're sending two inputs (pairwise?) and then measuring similarity of the outputs and that somehow becomes the loss function?

▶ Tool: web_search [query='how are text embedding models trained contrastive le]…

▶ Tool: web_search [query='Voyage AI embedding model training methodology contr]…

▶ Tool: web_search [query='embedding model training triplet loss vs contrastive]…

Assistant 🤖

Great question — this gets to the heart of how embedding models actually learn. Let me dig into this properly.

▶ Tool: fetch_url [url='https://lilianweng.github.io/posts/2021-05-31-contrast]…

▶ Tool: fetch_url [url='https://ethen8181.github.io/machine-learning/deep_lear]…

Assistant 🤖

That's exactly right — you've got the gist! Let me walk through the full picture, because there's some beautiful machinery behind it.

The Core Intuition

An embedding model is just a function f(text) → vector. The goal: similar texts produce similar vectors; dissimilar texts produce distant vectors.

But how do you define "similar" without a label for every pair? That's where contrastive learning comes in.

Step 1: You Need Pairs (Anchor, Positive, Negatives)

For every training example, you need:

Term	What it is	Example
Anchor a	The query or reference text	"How do I sort an array in Python?"
Positive p	A text that should be nearby	"Here's how to use sorted() in Python..."
Negatives {n₁, n₂, ...}	Texts that should be far away	"How to bake a cake", "Rust ownership model", etc.

The model processes each independently through the same encoder → produces vectors → then compares them.
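
Here is a tiny sketch of that comparison step. The 3-d vectors are hand-picked stand-ins for encoder outputs (a real encoder would produce them), and the names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical encoder outputs for the example texts in the table above.
anchor = [0.9, 0.1, 0.0]    # "How do I sort an array in Python?"
positive = [0.8, 0.2, 0.1]  # "Here's how to use sorted() in Python..."
negatives = {
    "cake": [0.0, 0.1, 0.9],  # "How to bake a cake"
    "rust": [0.1, 0.9, 0.2],  # "Rust ownership model"
}

pos_sim = cosine(anchor, positive)
neg_sims = {name: cosine(anchor, vec) for name, vec in negatives.items()}
```

Training aims to make `pos_sim` larger than every value in `neg_sims`; the loss function decides how strongly each violation is penalized.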

Step 2: The Loss Function (InfoNCE / NT-Xent)

This is the engine. For a single anchor-positive pair with m negatives:

$$
\mathrm{loss}(a, p, \{n_j\}) = -\log \frac{\exp(\mathrm{sim}(a, p)/\tau)}{\exp(\mathrm{sim}(a, p)/\tau) + \sum_{j=1}^{m} \exp(\mathrm{sim}(a, n_j)/\tau)}
$$

Where sim() is cosine similarity (or dot product) and τ (temperature) controls how "peaky" the distribution is.

What this does: It's a softmax over (1 positive + m negatives). The model gets rewarded for putting probability mass on the positive and penalized for mass on negatives. This pulls a toward p and pushes a away from all n_j.
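
A single-anchor version is easy to compute by hand. The sketch below uses made-up similarity values and a hypothetical `info_nce` helper to show how the loss reacts to a good versus a bad ranking, and to the temperature.

```python
import math

def info_nce(pos_sim, neg_sims, tau):
    """-log softmax probability of the positive among (1 positive + m negatives)."""
    logits = [pos_sim / tau] + [s / tau for s in neg_sims]
    m = max(logits)  # log-sum-exp with max subtraction for stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

# Toy cosine similarities (hypothetical values).
good = info_nce(pos_sim=0.9, neg_sims=[0.1, 0.2], tau=0.05)  # positive ranked first
bad = info_nce(pos_sim=0.1, neg_sims=[0.9, 0.2], tau=0.05)   # a negative ranked first
soft = info_nce(pos_sim=0.9, neg_sims=[0.1, 0.2], tau=1.0)   # same ranking, high τ
```

With a small τ the correctly ranked example contributes almost nothing to the loss while the misranked one is penalized heavily; with τ = 1 even the correct ranking still carries noticeable loss, because the softmax is much flatter.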

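In PyTorch-style code (for a batch of size B):

import torch
import torch.nn.functional as F

B, D, temperature = 8, 16, 0.05  # toy sizes; random tensors stand in for encoder outputs
anchor_embeds = F.normalize(torch.randn(B, D), dim=-1)    # (B, D), unit length
positive_embeds = F.normalize(torch.randn(B, D), dim=-1)  # (B, D), unit length

scores = anchor_embeds @ positive_embeds.T / temperature  # (B, B) cosine similarities / τ
# Diagonal entries (i, i) are the true positive pairs;
# off-diagonal entries act as in-batch negatives.
labels = torch.arange(B)
loss = F.cross_entropy(scores, labels)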

That matrix multiplication is where the magic happens. Every anchor in the batch gets (B-1) negatives for free — just the other items in the same batch!

Step 3: Where Do the Pairs Come From?

This is the real art. There are several strategies:

A) Weakly-supervised / mined pairs (what E5, BGE, voyage-3/4 use)

Use existing data structures:
- (query, relevant document) pairs from search logs: positives
- Same query + irrelevant document: negatives
- BM25 top results that aren't actually relevant: hard negatives
- Cross-encoder re-ranker to filter and score negatives

For example, the voyage-4 family was almost certainly trained on massive corpora of (query, passage) pairs mined from web data, with rigorous hard-negative mining via BM25 and cross-encoder filtering.

B) In-batch negatives (the workhorse)

As shown above: in a batch of B pairs, treat every other item's positive as a negative for your anchor. This is the primary source of negatives. Big batches matter: CLIP used an effective batch size of 32,768 across 256 GPUs.

C) Hard negative mining

After training an initial model, re-encode your corpus. For each query, find documents that are close in embedding space but not actually positive. These are precious training signal — they force the model to learn finer distinctions.

RocketQA uses a loop: train → find hard negatives → retrain → repeat.
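
The mining step can be sketched as a nearest-neighbor search that skips known positives. Everything here is illustrative: the function name, the dot-product scoring, and the toy document vectors are assumptions, not RocketQA's actual pipeline.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mine_hard_negatives(query_vec, doc_vecs, positive_ids, k=2):
    """Return ids of the k highest-scoring docs that are not labeled positive."""
    scored = [
        (dot(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
        if doc_id not in positive_ids
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Toy corpus embedded by the current model (hypothetical values).
doc_vecs = {
    "d_pos": [0.9, 0.1],
    "d_near": [0.8, 0.3],
    "d_mid": [0.5, 0.5],
    "d_far": [0.0, 1.0],
}
hard_negatives = mine_hard_negatives([1.0, 0.0], doc_vecs, positive_ids={"d_pos"})
```

Some of the documents mined this way will be unlabeled positives rather than true negatives, which is why a cross-encoder filtering pass is commonly applied before retraining.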

D) Synthetic data (for your distillation case)

For voyage-4 specifically: voyage-4-large encodes millions of (query, doc) pairs. The smaller models are trained to reproduce the large model's similarity rankings — not just the individual embeddings, but which pairs the large model thinks are similar vs. dissimilar. This is distillation applied to a ranking / retrieval objective.

How Self-Distillation Works for Embedding Models (Voyage-4 specifically)

You said the voyage-4 family is "all distill trained off voyage-4-large." Here's what that likely looks like:

Step 1: voyage-4-large (the teacher) is trained via standard contrastive learning on massive web-scale data with hard-negative mining.

Step 2: voyage-4-large encodes a huge set of (query, candidate_document) pairs, producing similarity scores / rankings for every pair.

Step 3: The smaller models (voyage-4, voyage-4-lite, voyage-4-nano) are trained to match the teacher's output distribution. One plausible form is a cross-entropy between the teacher's and the student's softmax over the candidate documents:

$$
L_{\text{distill}} = -\sum_j p_T(d_j \mid q)\, \log p_S(d_j \mid q), \qquad p_M(d_j \mid q) = \frac{\exp(\mathrm{sim}_M(q, d_j)/\tau)}{\sum_k \exp(\mathrm{sim}_M(q, d_k)/\tau)}, \quad M \in \{T, S\}
$$

The student learns to push/pull embeddings to match the teacher's similarity judgments, not just individual vectors.

Step 4 (key): Because all models share an embedding space (the "shared embedding space" feature), the student literally learns to produce vectors that are interchangeable with the teacher's. This is enforced during training by directly comparing student and teacher embeddings.
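
Matching the teacher's similarity judgments for one query can be sketched as a KL divergence between two softmax distributions over the same candidates. This is a common distillation pattern, not a confirmed description of Voyage's training; the names, the temperature, and the toy scores are assumptions.

```python
import math

def softmax(scores, tau):
    scaled = [s / tau for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def score_distill_loss(teacher_sims, student_sims, tau=0.05):
    """KL(teacher || student) over one query's candidate documents."""
    p = softmax(teacher_sims, tau)
    q = softmax(student_sims, tau)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy similarities for one query against three documents (hypothetical values).
teacher_sims = [0.9, 0.5, 0.1]
agreeing_student = [0.85, 0.5, 0.15]  # same ranking as the teacher
reversed_student = [0.1, 0.5, 0.9]    # opposite ranking

low = score_distill_loss(teacher_sims, agreeing_student)
high = score_distill_loss(teacher_sims, reversed_student)
```

The loss is near zero when the student reproduces the teacher's ordering and margins, and grows quickly when the student ranks the teacher's top document last.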

The "self" part: the same data pipeline and even some architectural components are shared, and the smaller models are essentially learning to be a compressed version of the large model's representational space.

The Big Picture: What Makes a Great Embedding Model

From everything I've read, the recipe is:

Start from a strong pretrained LM (BERT, T5, or a decoder-only model)
Contrastive pretraining on billions of weakly-supervised (query, doc) pairs with in-batch negatives
Hard negative mining with iterative refinement (train → mine → retrain)
Multi-stage training: coarse then fine-grained discrimination
Matryoshka representation learning (which voyage-4 uses) — train once for multiple output dimensions
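
On the Matryoshka point: the training objective is applied at several prefix lengths of the embedding, so at inference time you can keep just the leading coordinates. A minimal sketch of that inference-side step, with a hypothetical helper name and toy vector:

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and re-normalize to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy 4-d unit vector (hypothetical); real embeddings have far more dimensions.
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_embedding(full, 2)
short_norm = math.sqrt(sum(x * x for x in short))
```

Re-normalizing matters if you compare truncated vectors by dot product, since a prefix of a unit vector is generally shorter than unit length.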

The voyage-4 family is particularly interesting because the shared embedding space + distillation combo means you get the best of both worlds: the large model's representational quality for your corpus, and the small model's speed & cost for queries.

Want me to dig into any specific piece? I could walk through the actual PyTorch code for an InfoNCE loss, explain how Matryoshka learning works, or look at how hard-negative mining specifically impacts retrieval accuracy (which is probably very relevant to your work on the voyage team!).

You 🧑

Actually we start with LLMs like Qwen3 as the architecture and train off that. I believe almost all modern embedding models start with LLM, that have massive internal world models. I don't remember if we use the pre-train or the instruct tuned versions. I think all you really need is the world model from a pretrain

▶ Tool: web_search [query='embedding model training starting from base pretrain]…

▶ Tool: web_search [query='NV-Embed E5 BGE embedding model start from base pret]…

Assistant 🤖

That makes total sense, and it's actually a key insight that explains why embedding models have gotten so good recently. Let me think through why the pretrained (not instruct-tuned) version is the right starting point.

▶ Tool: fetch_url [url='https://arxiv.org/html/2405.17428v2']

Assistant 🤖

This is a really clean insight, and you're absolutely right — it matches exactly what the frontier labs have converged on. Let me connect what you said to what I just found in the NV-Embed paper (NV-Embed was the #1 on MTEB for a while, starting from Mistral 7B):

Why the Base Pretrained Model (Not Instruct)

The reasoning is elegant:

A pretrained LLM's hidden states already encode incredibly rich semantics. Through next-token prediction on trillions of tokens, the model learns:
- Word senses, polysemy, contextual nuance
- Long-range dependencies and discourse structure
- Factual knowledge and relational patterns
- Implicit clustering of concepts

All of this is sitting there in the residual stream. The embedding fine-tuning is essentially teaching the model to surface this existing knowledge into a usable vector representation via contrastive learning, rather than teaching it new knowledge.

Instruction tuning, on the other hand, is a narrowing operation. It takes the broad world model and specializes it toward dialog/assistant behavior — being helpful, concise, refusing certain requests, etc. You might actually lose representational fidelity for embedding tasks because:

The model learns to follow instructions rather than represent all dimensions of meaning equally
RLHF can suppress certain modes of representation that are useful for similarity judgments but not for assistant behavior
The instruction-tuned model's hidden states are optimized for next-token prediction in a "helpful" direction, not for representing all semantic nuances

As the NV-Embed authors put it — they start from "pretrained Mistral-7B" and apply contrastive instruction-tuning from scratch. They deliberately do NOT start from an instruct-tuned or existing embedding model.

The Key Modifications to the Base LLM

You mentioned using Qwen3 as the base. Looking at what NV-Embed and others do, there are a few critical changes needed to turn a causal LLM into a good embedding model:

1. Remove the causal attention mask

This is the single biggest change. During contrastive training, NV-Embed just removes the causal mask entirely — every token can attend to every other token. This makes sense intuitively: to understand whether a document is relevant to a query, you need full bidirectional context, not left-to-right prediction.

They found bidirectional attention consistently outperforms causal across every pooling method and every task category (retrieval, clustering, classification, STS, etc.).
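
The difference between the two masks is easy to see on a toy sequence. This sketch just builds boolean masks as nested lists; the function names are hypothetical, and a real model would apply the mask inside its attention layers.

```python
def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every token may attend to every other token."""
    return [[True] * n for _ in range(n)]

n = 4
causal = causal_mask(n)
full = bidirectional_mask(n)

# Number of (query token, key token) pairs that are allowed to interact.
causal_pairs = sum(sum(row) for row in causal)
full_pairs = sum(sum(row) for row in full)
```

Under the causal mask the first token sees only itself, so its representation cannot reflect anything that comes later in the text; the bidirectional mask removes that restriction.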

2. Replace the pooling strategy

A causal LLM's default pooling (last token / EOS) has recency bias: it overweights the final tokens. Mean pooling dilutes important signals from key phrases. NV-Embed's innovation was a latent attention layer: a small set of trainable "dictionary" vectors (512 latents, 8 heads) that the LLM output tokens attend to via cross-attention, followed by an MLP and mean pooling.

This gives the model a learned, expressive pooling mechanism that can extract the most salient dimensions from the token sequence.
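
For reference, the two simple baselines look like this. The sketch covers only last-token and mean pooling with a padding mask (not the latent attention layer), and the hidden states are hand-written toy values, not real model outputs.

```python
def mean_pool(hidden, mask):
    """Average the hidden states of real (non-padding) tokens."""
    kept = [h for h, m in zip(hidden, mask) if m]
    dim = len(hidden[0])
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]

def last_token_pool(hidden, mask):
    """Return the hidden state of the final real token."""
    last = max(i for i, m in enumerate(mask) if m)
    return hidden[last]

# Toy sequence: 3 real tokens plus 1 padding position, hidden size 2.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]  # 0 marks padding

mean_vec = mean_pool(hidden, mask)
last_vec = last_token_pool(hidden, mask)
```

Both functions ignore the padding position; they differ in whether one token or all real tokens determine the final vector.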

3. Two-stage training

First stage: train on retrieval datasets with in-batch negatives enabled. This is where the contrastive loss really shines — every batch of B query-doc pairs gives you B² comparisons.

Second stage: blend in non-retrieval tasks (classification, clustering, STS) with in-batch negatives disabled. Why? Because in-batch negatives are misleading for classification — if you have a batch of binary sentiment queries, another batch item's "positive" document might actually be your anchor's positive too.
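
The false-negative problem is easy to count on a toy batch. This sketch uses a hypothetical helper and made-up sentiment labels to list the off-diagonal pairs that in-batch negatives would wrongly push apart.

```python
def false_negative_pairs(labels):
    """Off-diagonal (i, j) pairs that in-batch negatives would treat as
    negatives even though both examples carry the same label."""
    n = len(labels)
    return [
        (i, j)
        for i in range(n)
        for j in range(n)
        if i != j and labels[i] == labels[j]
    ]

# Toy batch of binary sentiment examples (hypothetical).
batch_labels = ["pos", "neg", "pos", "neg"]
bad_pairs = false_negative_pairs(batch_labels)
total_off_diagonal = len(batch_labels) * (len(batch_labels) - 1)
```

Here 4 of the 12 off-diagonal pairs share a label, so a third of the "negatives" would be mislabeled; with only two classes the fraction stays high no matter how the batch is shuffled.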

So the Full Recipe Looks Like

Step	What	Why
0	Start with base pretrained LLM (Qwen3, Mistral, etc.)	World model already encodes semantics
1	Remove causal attention mask → bidirectional	Full context matters for similarity
2	Add pooling layer (latent attention, or just mean pooling)	Extract sequence → vector
3	Stage 1: Contrastive training on retrieval data, in-batch negatives on	Learn to discriminate relevant from irrelevant
4	Hard negative mining (e.g., BM25 → cross-encoder filtering)	Force finer-grained distinctions
5	Stage 2: Blend classification/clustering/STS without in-batch negatives	Generalist embedding without misleading signals

What's interesting is that voyage-4's shared embedding space adds another layer on top of this — once you have a great large model (voyage-4-large), you can distill its representational geometry into smaller models that produce directly interchangeable vectors. That's a clever trick that most of the open literature (NV-Embed, E5, BGE) doesn't fully exploit yet.

For your team specifically — starting from Qwen3's base pretrain, you're getting one of the best world models available. The contrastive fine-tuning is essentially just unlocking that knowledge for retrieval rather than generation.
