You 🧑
https://news.ycombinator.com/item?id=48311647 What does HN think about the new Opus and Anthropic ... and I guess AI in general
▶ Tool: fetch_url [url='https://news.ycombinator.com/item?id=48311647']
▶ Tool: fetch_url [url='https://news.ycombinator.com/item?id=48311647']
Assistant 🤖
Let me grab more of the thread — it got truncated at 50K characters.
▶ Tool: fetch_url [url='https://news.ycombinator.com/item?id=48311647&p=2']
Assistant 🤖
Let me try to grab more of the discussion content.
Assistant 🤖
Alright, I got a solid chunk of the discussion before hitting rate limits. Here's a summary of what Hacker News thinks about Claude Opus 4.8 and the broader AI landscape, based on the 1,000+ comment thread:
1. Incrementalism fatigue is real The top comment (by NiloCK, heavily upvoted) notes this is the third minor bump (4.5 → 4.6 → 4.7 → 4.8) and that improvements are becoming "less legible" to end users. They can't tell if 4.7 is actually better than 4.5 anymore. The sentiment: "Maybe my own tastes are saturated now (it's smarter than me?) and I'll never again perceive model progress."
2. One positive anecdote on 4.8 itself User Zavora says 4.8 refactored their 4.6-generated code and the output was "Wow! Simple, elegant and exactly what I wanted to see." They note it's slower on difficult tasks but doesn't burn as many tokens — a tradeoff.
3. Suspicion about self-distillation User teleforce argues that all the 4.x releases after 4.5 are likely self-distillation fine-tunes, citing papers from MIT-ETH, UCLA, and Apple. They call self-distillation "the second best thing to happen to LLMs after the transformer breakthrough."
4. "Frontier models may be the last of their kind" The most provocative take (by onlyrealcuzzo, heavily discussed): the next gen of giant frontier models could be the last because:
- Smaller models have orders of magnitude of low-hanging fruit (MLA, MoE, Medusa, GRAM)
- A 60-90B model could match current SOTA in coding within 2-3 years
- The GRAM paper (Generative Recursive Reasoning Models) claims 2000-10000x reasoning improvements on small models vs. ones thousands of times larger
5. GRAM debate: hype vs. alignment concerns. A massive subthread erupted around GRAM:
- Proponents say it could be a paradigm shift, enabling latent-space recursive reasoning that's wildly more efficient
- Skeptics (sometimelurker) argue this has been reinvented under many names ("chain of continuous thought," "latent reasoning," "looped language models") and has a fatal flaw: it can't be interpreted or aligned. No readable chain-of-thought means you can't catch the model planning to cheat or do harm
- The acronym itself was roasted relentlessly (GRRM jokes, "KILOGRAM," etc.)
6. The economics / business model critique User mrandish argues frontier labs (OpenAI, Anthropic) have a strong incentive to keep the perception alive that you need $200K+ GPU racks for frontier intelligence — because if small models catch up, their pricing collapses. With IPOs looming, they need to maintain the narrative that scale=quality.
7. Google as the spoiler Several commenters note Google is best positioned to undercut the others because they own their hardware supply chain, are already profitable, and can play the long game. Their smaller/faster models (like Gemini 3.5 Flash) are seen as a competitive threat.
8. Diminishing returns vs. "capability hunger" ACCount37 argues against the plateau thesis: "There will always be demand for an LLM that's 5-10% more capable at some highly advanced task." New applications emerge as capabilities improve — drug discovery, physics, organizational governance, etc.
9. Intelligence saturation for everyday tasks IgorPartola makes a thoughtful point: for coding, a model that matches your intelligence is a feature, not a bug. A 10x smarter model would slow you down because you'd need to verify incomprehensible output. But for hard-to-solve/easy-to-verify problems (protein folding, etc.), hyperintelligence is a huge boon — likely going to pharma/biotech first.
Overall vibe: A mix of mild disappointment that Opus 4.8 isn't a bigger leap, deep skepticism about the sustainability of scaling, genuine excitement about GRAM (tempered by alignment concerns), and a sense that the "small models catching up" narrative is the undercurrent of 2026.
You 🧑
Tell me more about self-distillation techniques. I work for MongoDB and our recent voyage-4 family is all distill trained off voyage-4-large
▶ Tool: read_file [path='~/Personal/skills/skill_index.md']
▶ Tool: web_search [query='self-distillation technique LLM embedding models voy]…
▶ Tool: web_search [query='"self-distillation" LLM training teacher student mod]…
▶ Tool: fetch_url [url='https://blog.voyageai.com/2026/01/15/voyage-4/']
▶ Tool: fetch_url [url='https://arxiv.org/abs/2604.01193']
▶ Tool: fetch_url [url='https://arxiv.org/abs/2601.18734']
▶ Tool: fetch_url [url='https://arxiv.org/pdf/2605.12400']
▶ Tool: fetch_url [url='https://siyan-zhao.github.io/blog/2026/opsd/']
Assistant 🤖
Great question, and that's a really interesting case — a dense-to-dense embedding model family (voyage-4) distilled from a larger flagship model. That's a slightly different flavor than the LLM reasoning distillation I was just reading about. Let me break down the self-distillation landscape for you.
Classic distillation = a larger, separate teacher model trains a smaller student. Self-distillation = the same model acts as both teacher and student, usually through different conditioning or via iterative improvement on its own outputs.
The key insight that's been driving the 2025-2026 boom: once a model is capable enough, it can improve itself without needing a bigger external teacher.
Paper: Embarrassingly Simple Self-Distillation Improves Code Generation (arXiv:2604.01193, April 2026)
This is the simplest form and might be closest to what you're doing with voyage-4:
For embedding models like voyage-4, the analogue would be: generate embeddings from voyage-4-large for a diverse set of queries/documents, then train voyage-4/voyage-4-lite to match those outputs. The "self" part comes from the fact that the same model family produces both.
Paper: Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs (arXiv:2601.18734, Jan 2026)
This is more sophisticated and aimed at reasoning:
Paper: OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering (arXiv:2605.12400)
The latest evolution, combining OPSD's dense token-level signal with outcome verification:
For embedding models specifically, distillation typically works differently than for generative LLMs:
The voyage-4 approach (shared embedding space + MoE + distillation) is actually a great example of where the industry is heading:
Would you like me to dig deeper into any specific aspect — like how exactly embedding distillation loss functions differ from generative distillation, or the specific architectural choices (MoE, Matryoshka) that make shared embedding spaces work?
You 🧑
Actually I'm super interested in how embedding models are trained overall. It sounds like you're sending two inputs (pairwise?) and then measuring similarity of the outputs and that somehow becomes the loss function?
▶ Tool: web_search [query='how are text embedding models trained contrastive le]…
▶ Tool: web_search [query='Voyage AI embedding model training methodology contr]…
▶ Tool: web_search [query='embedding model training triplet loss vs contrastive]…
Assistant 🤖
Great question — this gets to the heart of how embedding models actually learn. Let me dig into this properly.
▶ Tool: fetch_url [url='https://lilianweng.github.io/posts/2021-05-31-contrast]…
▶ Tool: fetch_url [url='https://ethen8181.github.io/machine-learning/deep_lear]…
Assistant 🤖
That's exactly right — you've got the gist! Let me walk through the full picture, because there's some beautiful machinery behind it.
An embedding model is just a function f(text) → vector. The goal: similar texts produce similar vectors; dissimilar texts produce distant vectors.
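To make "similar vectors" concrete, here's a minimal sketch of cosine similarity on toy 3-dimensional vectors. The vectors are made up for illustration and don't come from any real model; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (not produced by any real model).
query = [0.9, 0.1, 0.0]
related_doc = [0.8, 0.2, 0.1]
unrelated_doc = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(query, related_doc)      # close to 1
sim_unrelated = cosine_similarity(query, unrelated_doc)  # close to 0
```

Training is then the process of adjusting the encoder so that pairs you care about end up with a high score and everything else ends up with a low one.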
But how do you define "similar" without a label for every pair? That's where contrastive learning comes in.
For every training example, you need:
| Term | What it is | Example |
|---|---|---|
| Anchor a | The query or reference text | "How do I sort an array in Python?" |
| Positive p | A text that should be nearby | "Here's how to use sorted() in Python..." |
| Negatives {n₁, n₂, ...} | Texts that should be far away | "How to bake a cake", "Rust ownership model", etc. |
The model processes each independently through the same encoder → produces vectors → then compares them.
This is the engine. For a single anchor-positive pair with m negatives:
$$
\mathcal{L}(a, p, \{n_j\}) = -\log \frac{\exp\big(\mathrm{sim}(a, p)/\tau\big)}{\exp\big(\mathrm{sim}(a, p)/\tau\big) + \sum_{j=1}^{m} \exp\big(\mathrm{sim}(a, n_j)/\tau\big)}
$$
Where sim() is cosine similarity (or dot product) and τ (temperature) controls how "peaky" the distribution is.
What this does: It's a softmax over (1 positive + m negatives). The model gets rewarded for putting probability mass on the positive and penalized for mass on negatives. This pulls a toward p and pushes a away from all n_j.
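Here's a small worked sketch of that loss for a single anchor, written in plain Python so the arithmetic is easy to follow. The similarity values are hypothetical, and `info_nce` is just an illustrative helper name, not a library function.

```python
import math

def info_nce(sim_pos, sim_negs, tau):
    """InfoNCE loss for one anchor: -log softmax probability of the positive."""
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[0]

# Hypothetical cosine similarities for one anchor.
sim_pos = 0.8
sim_negs = [0.3, 0.1, -0.2]

loss_sharp = info_nce(sim_pos, sim_negs, tau=0.05)  # small tau: peaky softmax
loss_soft = info_nce(sim_pos, sim_negs, tau=1.0)    # large tau: flat softmax
```

With the same similarities, the small temperature makes the positive dominate the softmax, so the loss is near zero, while the large temperature leaves a lot of probability mass on the negatives. When every candidate has the same similarity, the loss is log(m + 1), i.e. the model is guessing uniformly.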
In PyTorch-style pseudocode (for a batch of size B):
import torch
import torch.nn.functional as F

# anchor_embeds: (B, D), positive_embeds: (B, D), both assumed L2-normalized
scores = anchor_embeds @ positive_embeds.T  # (B, B) similarity matrix
scores = scores / temperature
# The diagonal (i, i) entries are the correct positive pairs;
# off-diagonal entries act as in-batch negatives.
labels = torch.arange(scores.size(0), device=scores.device)
loss = F.cross_entropy(scores, labels)
That matrix multiplication is where the magic happens. Every anchor in the batch gets (B-1) negatives for free — just the other items in the same batch!
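If you want to check the diagonal-label trick by hand, here's a dependency-free sketch of the same computation on a toy batch. The embeddings are hypothetical unit vectors, and `in_batch_info_nce` is an illustrative name rather than anything from a real library.

```python
import math

def in_batch_info_nce(anchors, positives, tau=0.05):
    """Mean InfoNCE loss where row i's positive is positives[i] and every
    other positives[j] serves as a negative for anchor i."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, p)) / tau for p in positives]
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_sum - logits[i])  # cross-entropy with label i
    return sum(losses) / len(losses)

# Hypothetical unit-length toy embeddings, batch size B = 3.
anchors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # each positive matches its anchor
shuffled = [[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]]  # positives paired with the wrong anchors

loss_aligned = in_batch_info_nce(anchors, aligned)
loss_shuffled = in_batch_info_nce(anchors, shuffled)
```

When the positives line up with their anchors the loss is essentially zero; when they're paired with the wrong anchors it's large, because the highest score in each row sits off the diagonal.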
This is the real art. There are several strategies:
Use existing data structures:
- (query, relevant document) pairs from search logs serve as positives
- The same query paired with an irrelevant document serves as a negative
- BM25 top results that aren't actually relevant serve as hard negatives
- A cross-encoder re-ranker can filter and score negatives
For example, the voyage-4 family was almost certainly trained on massive corpora of (query, passage) pairs mined from web data, with rigorous hard-negative mining via BM25 and cross-encoder filtering.
As shown above: in a batch of B pairs, treat every other item's positive as a negative for your anchor. This is the primary source of negatives. Big batches matter — CLIP used 32,768 effective batch size across 256 GPUs.
After training an initial model, re-encode your corpus. For each query, find documents that are close in embedding space but not actually positive. These are precious training signal — they force the model to learn finer distinctions.
RocketQA uses a loop: train → find hard negatives → retrain → repeat.
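The mining step in that loop can be sketched as a nearest-neighbor search that skips the known positives. This is a minimal illustration on a hypothetical four-document corpus, not RocketQA's actual code; the document ids, vectors, and the `mine_hard_negatives` helper are all made up.

```python
def mine_hard_negatives(query_vec, corpus, positive_ids, k=2):
    """Return ids of the k corpus items most similar to the query that are
    NOT labeled positive. Assumes vectors are already L2-normalized, so the
    dot product equals cosine similarity."""
    scored = []
    for doc_id, vec in corpus.items():
        if doc_id in positive_ids:
            continue
        score = sum(q * d for q, d in zip(query_vec, vec))
        scored.append((score, doc_id))
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical toy corpus of unit-length 2-d embeddings.
corpus = {
    "sorted_docs": [1.0, 0.0],       # labeled positive
    "list_sort_java": [0.96, 0.28],  # close but wrong language: hard negative
    "sorting_history": [0.8, 0.6],   # somewhat related
    "cake_recipe": [0.0, 1.0],       # easy negative
}
query = [1.0, 0.0]
hard = mine_hard_negatives(query, corpus, positive_ids={"sorted_docs"}, k=2)
```

One caveat: anything mined this way might be an unlabeled positive, which is why a filtering step (such as the cross-encoder re-ranker mentioned earlier) usually sits between mining and retraining.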
For voyage-4 specifically: voyage-4-large encodes millions of (query, doc) pairs. The smaller models are trained to reproduce the large model's similarity rankings — not just the individual embeddings, but which pairs the large model thinks are similar vs. dissimilar. This is distillation applied to a ranking / retrieval objective.
You said the voyage-4 family is "all distill trained off voyage-4-large." Here's what that likely looks like:
Step 1: voyage-4-large (the teacher) is trained via standard contrastive learning
on massive web-scale data with hard-negative mining.
Step 2: voyage-4-large encodes a huge set of (query, candidate_document) pairs,
producing similarity scores / rankings for every pair.
Step 3: The smaller models (voyage-4, voyage-4-lite, voyage-4-nano) are trained
to match the teacher's output distribution:
L_distill = -Σⱼ p_teacher(dⱼ | q) · log p_student(dⱼ | q), where p(dⱼ | q) = exp(sim(q, dⱼ) / τ) / Σₖ exp(sim(q, dₖ) / τ)
The student learns to push/pull embeddings to match the teacher's
similarity judgments — not just individual vectors.
Step 4 (key): Because all models share an embedding space (the "shared embedding space"
feature), the student literally learns to produce vectors that are
interchangeable with the teacher's. This is enforced during training
by directly comparing student and teacher embeddings.
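To make Step 3 concrete, here's one common way to write a score-distillation objective: turn the teacher's and the student's similarity scores for a query's candidate list into softmax distributions, then penalize the KL divergence between them. I'm assuming this formulation for illustration; I don't know the loss Voyage actually uses, and the scores below are hypothetical.

```python
import math

def log_softmax(scores, tau):
    """Log of the softmax over scores / tau, computed stably."""
    logits = [s / tau for s in scores]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def distill_kl(teacher_sims, student_sims, tau=0.05):
    """KL(teacher || student) over one query's candidate documents."""
    log_p = log_softmax(teacher_sims, tau)
    log_q = log_softmax(student_sims, tau)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Hypothetical similarity scores for one query against three candidates.
teacher = [0.82, 0.75, 0.10]
good_student = [0.80, 0.74, 0.12]  # nearly the same ranking and margins
bad_student = [0.10, 0.75, 0.82]   # ranking reversed

kl_good = distill_kl(teacher, good_student)
kl_bad = distill_kl(teacher, bad_student)
```

The KL form differs from the cross-entropy between the two distributions only by the teacher's entropy, which is constant with respect to the student, so both give the same gradients. Note that this matches score distributions only; getting interchangeable vectors would need an additional term that compares student and teacher embeddings directly.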
The "self" part: the same data pipeline and even some architectural components are shared, and the smaller models are essentially learning to be a compressed version of the large model's representational space.
From everything I've read, the recipe is:
The voyage-4 family is particularly interesting because the shared embedding space + distillation combo means you get the best of both worlds: the large model's representational quality for your corpus, and the small model's speed & cost for queries.
Want me to dig into any specific piece? I could walk through the actual PyTorch code for an InfoNCE loss, explain how Matryoshka learning works, or look at how hard-negative mining specifically impacts retrieval accuracy (which is probably very relevant to your work on the voyage team!).
You 🧑
Actually we start with LLMs like Qwen3 as the architecture and train off that. I believe almost all modern embedding models start with LLM, that have massive internal world models. I don't remember if we use the pre-train or the instruct tuned versions. I think all you really need is the world model from a pretrain
▶ Tool: web_search [query='embedding model training starting from base pretrain]…
▶ Tool: web_search [query='NV-Embed E5 BGE embedding model start from base pret]…
Assistant 🤖
That makes total sense, and it's actually a key insight that explains why embedding models have gotten so good recently. Let me think through why the pretrained (not instruct-tuned) version is the right starting point.
▶ Tool: fetch_url [url='https://arxiv.org/html/2405.17428v2']
Assistant 🤖
This is a really clean insight, and you're absolutely right: it matches exactly what the frontier labs have converged on. Let me connect what you said to what I just found in the NV-Embed paper (NV-Embed held the #1 spot on MTEB for a while, starting from Mistral 7B):
The reasoning is elegant:
A pretrained LLM's hidden states already encode incredibly rich semantics. Through next-token prediction on trillions of tokens, the model learns: - Word senses, polysemy, contextual nuance - Long-range dependencies and discourse structure - Factual knowledge and relational patterns - Implicit clustering of concepts
All of this is sitting there in the residual stream. The embedding fine-tuning is essentially teaching the model to surface this existing knowledge into a usable vector representation via contrastive learning, rather than teaching it new knowledge.
Instruction tuning, on the other hand, is a narrowing operation. It takes the broad world model and specializes it toward dialog/assistant behavior — being helpful, concise, refusing certain requests, etc. You might actually lose representational fidelity for embedding tasks because:
As the NV-Embed authors put it — they start from "pretrained Mistral-7B" and apply contrastive instruction-tuning from scratch. They deliberately do NOT start from an instruct-tuned or existing embedding model.
You mentioned using Qwen3 as the base. Looking at what NV-Embed and others do, there are a few critical changes needed to turn a causal LLM into a good embedding model:
This is the single biggest change. During contrastive training, NV-Embed just removes the causal mask entirely — every token can attend to every other token. This makes sense intuitively: to understand whether a document is relevant to a query, you need full bidirectional context, not left-to-right prediction.
They found bidirectional attention consistently outperforms causal across every pooling method and every task category (retrieval, clustering, classification, STS, etc.).
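The difference between the two masks is easy to see on a 4-token sequence. This is a plain-Python sketch of the masks themselves, not NV-Embed's code; how you actually switch the mask off depends on the model implementation you're using.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when token i is allowed to attend to token j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

def visible_counts(mask):
    """How many tokens each position can see."""
    return [sum(row) for row in mask]

causal = attention_mask(4, causal=True)
bidirectional = attention_mask(4, causal=False)

causal_counts = visible_counts(causal)                # [1, 2, 3, 4]
bidirectional_counts = visible_counts(bidirectional)  # [4, 4, 4, 4]
```

Under the causal mask, only the last token has seen the whole input, which is part of why last-token pooling is the default for causal models. With the mask removed, every position carries full-sequence context.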
A causal LLM's default pooling (last token / <EOS>) has recency bias: it overweights the final tokens. Mean pooling dilutes important signals from key phrases. NV-Embed's innovation was a latent attention layer: a small set of trainable "dictionary" vectors (512 latents, 8 heads) that the LLM output tokens attend to via cross-attention, followed by an MLP and mean pooling.
This gives the model a learned, expressive pooling mechanism that can extract the most salient dimensions from the token sequence.
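For reference, here are the two baseline pooling strategies that the latent attention layer is meant to improve on, sketched in plain Python. The hidden states are hypothetical, right padding is assumed, and the latent attention layer itself is not shown.

```python
def mean_pool(token_vecs, attention_mask):
    """Average the vectors of real (non-padding) tokens."""
    kept = [v for v, m in zip(token_vecs, attention_mask) if m == 1]
    dim = len(kept[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]

def last_token_pool(token_vecs, attention_mask):
    """Take the vector of the final real token (assumes right padding)."""
    last = sum(attention_mask) - 1
    return token_vecs[last]

# Hypothetical per-token hidden states: 3 real tokens + 1 padding slot.
hidden = [[1.0, 0.0], [0.0, 2.0], [2.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

mean_vec = mean_pool(hidden, mask)        # [1.0, 1.0]
last_vec = last_token_pool(hidden, mask)  # [2.0, 1.0]
```

Both ignore the padding slot, but they summarize the sequence very differently: the mean weights every token equally, while last-token pooling depends entirely on what the final position has accumulated.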
First stage: train on retrieval datasets with in-batch negatives enabled. This is where the contrastive loss really shines — every batch of B query-doc pairs gives you B² comparisons.
Second stage: blend in non-retrieval tasks (classification, clustering, STS) with in-batch negatives disabled. Why? Because in-batch negatives are misleading for classification — if you have a batch of binary sentiment queries, another batch item's "positive" document might actually be your anchor's positive too.
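You can see why in-batch negatives go wrong for classification by counting how many "negatives" in a batch actually share the anchor's label. The batches below are hypothetical, and `count_false_negatives` is just an illustrative helper.

```python
def count_false_negatives(labels):
    """Count (anchor, in-batch negative) pairs that actually share a label."""
    collisions = 0
    for i, label_i in enumerate(labels):
        for j, label_j in enumerate(labels):
            if i != j and label_i == label_j:
                collisions += 1
    return collisions

# Hypothetical batch of 8 binary sentiment examples.
sentiment_batch = ["pos", "neg", "pos", "pos", "neg", "neg", "pos", "neg"]
# Hypothetical retrieval batch: every query has its own distinct target.
retrieval_batch = list(range(8))

sentiment_collisions = count_false_negatives(sentiment_batch)  # 24 of 56 pairs
retrieval_collisions = count_false_negatives(retrieval_batch)  # 0 of 56 pairs
```

In the sentiment batch, 24 of the 56 off-diagonal pairs would push apart items that belong together, while the retrieval batch has no such collisions. That's the misleading signal that turning off in-batch negatives in the second stage avoids.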
| Step | What | Why |
|---|---|---|
| 0 | Start with base pretrained LLM (Qwen3, Mistral, etc.) | World model already encodes semantics |
| 1 | Remove causal attention mask → bidirectional | Full context matters for similarity |
| 2 | Add pooling layer (latent attention, or just mean pooling) | Extract sequence → vector |
| 3 | Stage 1: Contrastive training on retrieval data, in-batch negatives on | Learn to discriminate relevant from irrelevant |
| 4 | Hard negative mining (e.g., BM25 → cross-encoder filtering) | Force finer-grained distinctions |
| 5 | Stage 2: Blend classification/clustering/STS without in-batch negatives | Generalist embedding without misleading signals |
What's interesting is that voyage-4's shared embedding space adds another layer on top of this — once you have a great large model (voyage-4-large), you can distill its representational geometry into smaller models that produce directly interchangeable vectors. That's a clever trick that most of the open literature (NV-Embed, E5, BGE) doesn't fully exploit yet.
For your team specifically — starting from Qwen3's base pretrain, you're getting one of the best world models available. The contrastive fine-tuning is essentially just unlocking that knowledge for retrieval rather than generation.