Vector Search at Scale: What Breaks After the Demo
09 Jun 2026
You can build a working semantic-search prototype in a weekend now. Pick an embedding model, push your documents through it, store the vectors, and watch “search by meaning” light up in a demo. It feels like magic the first time.
Then you take it to production, real queries start arriving, and you find out the demo was the easy 5%.
I have spent the last few years running semantic search over millions of creative assets at FreePixel. This is some of the stuff the tutorials skip: the failures that only show up at scale, and the way I have come to think about search after living inside it for a while.
The bug that only existed at a million documents
Here is one that still makes me wince.
Our indexing pipeline pulled documents from Elasticsearch oldest-first, sorted by creation-date ascending. Perfectly reasonable. It worked fine for a long time.
The problem only appeared once the index held about 1.4 million documents. Because of that ascending sort, to index the newest images (the ones we had just added and most wanted searchable) the pipeline had to page through all 1.4 million older documents first to reach them. Every run, the thing that mattered most was last in line.
Nobody designs that on purpose. It is invisible at ten thousand documents and a wall at a million. I think most of the hard parts of search are shaped like this: a decision that was obviously correct early on quietly becomes the bottleneck once the data grows. The fix was small: make the sort order a parameter and let the pipeline fetch newest-first when that is what we need. Finding it was the expensive part.
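To make the shape of that fix concrete, here is a minimal Python sketch, not our actual pipeline code. It only builds the search request body, so it runs without a cluster. The function name `build_page_query` and the field names `created_at` and `doc_id` are assumptions for illustration; `sort` and `search_after` are standard parts of the Elasticsearch search API.

```python
from typing import Any, Optional


def build_page_query(
    sort_order: str = "asc",
    page_size: int = 500,
    search_after: Optional[list] = None,
) -> dict[str, Any]:
    """Build one page of a search body, sorted by creation date.

    sort_order is the parameter that was missing: "asc" walks oldest-first,
    "desc" reaches the newest documents on the very first page.
    """
    if sort_order not in ("asc", "desc"):
        raise ValueError(f"sort_order must be 'asc' or 'desc', got {sort_order!r}")

    body: dict[str, Any] = {
        "size": page_size,
        "query": {"match_all": {}},
        # "created_at" is an assumed date field. "doc_id" is an assumed unique
        # keyword field, used only to break ties between equal timestamps.
        "sort": [
            {"created_at": {"order": sort_order}},
            {"doc_id": {"order": "asc"}},
        ],
    }
    if search_after is not None:
        # Sort values of the last hit on the previous page.
        body["search_after"] = search_after
    return body
```

A run that needs fresh images searchable quickly would call `build_page_query(sort_order="desc")` and stop once it reaches documents it has already indexed, while a full rebuild can keep the ascending order.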
The model is about 20% of the problem
The embedding model gets all the attention. It deserves the least of your worry.
LLaMA-based embeddings, the various commercial models, the open ones: they are all good enough to get you started, and swapping one for another is rarely where your search quality lives or dies. The other 80% is the unglamorous machinery around the model:
- How documents get in. Ingestion, chunking, what metadata rides along with each vector.
- What happens when ingestion fails halfway. At one point we had Postgres tables that were defined in code but had simply never been created in a particular environment. A dual-write to one of them failed silently until something downstream broke and we went digging. The model was flawless. The plumbing around it was not.
- The long tail of strange queries that no benchmark prepares you for, because your users do not search the way the eval dataset does.
If you are spending all your energy comparing embedding models and none on your ingestion pipeline, you are optimising the wrong 20%.
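The missing-table failure above is the kind of thing a startup check can catch before any write is attempted. The sketch below is one possible guard, not what we run in production. It uses an in-memory SQLite database as a stand-in so it runs anywhere; against Postgres the same idea would query `information_schema.tables` instead. The table names and function names are hypothetical.

```python
import sqlite3

# Hypothetical table names. In a real system this set would be derived from
# the same place the schema is defined in code.
EXPECTED_TABLES = {"assets", "asset_embeddings_audit"}


def missing_tables(conn: sqlite3.Connection, expected: set[str]) -> set[str]:
    """Return the expected tables that do not exist in this database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    existing = {name for (name,) in rows}
    return expected - existing


def assert_schema_ready(conn: sqlite3.Connection, expected: set[str]) -> None:
    """Fail loudly at startup instead of silently at write time."""
    missing = missing_tables(conn, expected)
    if missing:
        raise RuntimeError(f"missing tables: {sorted(missing)}")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE assets (id INTEGER PRIMARY KEY)")
    # Only one of the two expected tables exists, so this raises RuntimeError.
    assert_schema_ready(conn, EXPECTED_TABLES)
```

The point of the design is that the check runs once per environment at boot, so a table that was defined in code but never created stops the service immediately rather than surfacing weeks later as a downstream mystery.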
Pure vector search returns vibes
This is the one that surprises people who have only seen the demo.
Vector similarity gives you semantic closeness, which is not the same as correctness. Someone searches “red sports car” and gets a moody sunset back, because in embedding space those two things happened to land near each other. The model is not wrong, exactly; it is doing precisely what it was built to do. It is just that “close in vector space” and “what the human actually wanted” are different questions.
Keyword search has the opposite personality. Dumb, literal, and precise. It will not understand that “automobile” and “car” are the same thing, but when it matches, it means it.
Production search almost always needs both. Hybrid retrieval, semantic reach plus keyword precision, and then a ranking layer tuned against your actual content rather than a leaderboard. Cosine distance on its own is a toy. A great toy, but a toy.
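One common way to combine the two result lists is reciprocal rank fusion (RRF). I am naming it as an example of the technique, not as the ranking we use. The sketch below assumes each retriever returns an ordered list of document IDs; the function name and the sample IDs are made up, and `k = 60` is the conventional default from the RRF literature rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one.

    Each document scores 1 / (k + rank) in every list it appears in, so a
    document ranked well by both retrievers beats one ranked first by only one.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    vector_hits = ["sunset", "car_1", "car_2"]   # semantic neighbours
    keyword_hits = ["car_1", "car_3"]            # literal matches
    print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
    # prints ['car_1', 'sunset', 'car_3', 'car_2']
```

RRF only looks at ranks, never at raw scores, which sidesteps the problem that cosine similarity and keyword scores live on different scales. It is a starting point for the ranking layer, not a replacement for tuning against your own content.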
“More like this” is harder than it looks
A related trap. Similarity features have a way of quietly sprawling across a codebase.
Working on an older Elasticsearch stack, I once went looking for how a “more like this” query was implemented, and found the same idea expressed in several slightly different forms, scattered across different parts of the application. Each one had drifted a little. Each returned slightly different results for what users assumed was the same feature. Consolidating that into one consistent, well-understood path was a real project, not a cleanup.
The lesson generalises. Retrieval logic wants to leak. If “find similar” matters to your product, give it a single home and guard it, or you will end up maintaining five versions of your own search and wondering why results feel inconsistent.
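Giving “find similar” a single home can be as simple as one function that every caller goes through. This is a hedged sketch rather than our consolidated implementation: the function name, the field names in `SIMILAR_FIELDS`, and the parameter values are assumptions. The `more_like_this` query and its `fields`, `like`, `min_term_freq`, and `max_query_terms` options are part of the Elasticsearch query DSL.

```python
from typing import Any, Sequence

# Assumed text fields to compare on. Keeping them here, in one place, is the
# whole point: changing them changes every "find similar" feature at once.
SIMILAR_FIELDS: tuple[str, ...] = ("title", "description", "tags")


def similar_assets_query(
    index: str,
    doc_id: str,
    size: int = 20,
    fields: Sequence[str] = SIMILAR_FIELDS,
) -> dict[str, Any]:
    """Build the one canonical "more like this" search body for a document."""
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": list(fields),
                # Refer to an already-indexed document instead of raw text.
                "like": [{"_index": index, "_id": doc_id}],
                # Illustrative values, not tuned ones.
                "min_term_freq": 1,
                "max_query_terms": 25,
            }
        },
    }
```

With this in place, the product page, the recommendations panel, and any internal tool all call `similar_assets_query(...)`, so there is exactly one definition of “similar” to test and to tune.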
Relevance is a loop, not a launch
You do not ship search and walk away. You ship it and start listening.
Which queries return nothing useful? Where do users rephrase, give up, or bounce? Which results get clicked and which get ignored? That feedback is the actual product. The index is never “done”; it is a living thing you tune as you learn how people really search, which is never how you assumed they would.
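Those questions can be turned into numbers with very little machinery. The sketch below assumes a search log where each event records the query, how many results came back, and whether anything was clicked. The `SearchEvent` shape and both function names are hypothetical, and real logs will need sessionisation and bot filtering that this ignores.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchEvent:
    query: str
    result_count: int
    clicked: bool


def relevance_report(events: list[SearchEvent]) -> dict[str, float]:
    """Summarise how often search returns nothing and how often it gets a click."""
    total = len(events)
    if total == 0:
        return {"searches": 0, "zero_result_rate": 0.0, "click_through_rate": 0.0}
    zero_results = sum(1 for e in events if e.result_count == 0)
    clicks = sum(1 for e in events if e.clicked)
    return {
        "searches": total,
        "zero_result_rate": zero_results / total,
        "click_through_rate": clicks / total,
    }


def unclicked_queries(events: list[SearchEvent], limit: int = 10) -> list[tuple[str, int]]:
    """Queries that most often ended without a click, normalised by case and whitespace."""
    counts = Counter(e.query.strip().lower() for e in events if not e.clicked)
    return counts.most_common(limit)
```

Reviewing the top of `unclicked_queries` every week is a cheap way to find the strange long-tail queries before your users stop trusting the search box.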
Treating relevance as a one-time launch is how you end up with a search box everyone in the company quietly stops trusting.
The short version
If I had to compress all of it into a line: search quality is really a data problem wearing a model’s clothes.
The teams winning at retrieval are not the ones with the most fashionable embedding model. They are the ones who treated ingestion, metadata, and ranking as first-class systems, who understood that the vector store is the easy part and the pipeline feeding it is where the real engineering lives.
That has been true at every scale I have worked at, and it gets more true as the data grows, not less. I hope to write more about the specific pieces (the metadata pipelines, the ranking work) in later posts.
I am Abdul Qabiz, CTO at FreePixel and co-founder of Allies Interactive, building GenAI pipelines, vector search, and the infrastructure underneath them. If you are wrestling with search that will not behave in production, that is exactly the kind of problem I like. Feel free to reach out.