Vector Search at Scale: What Breaks After the Demo

You can build a working semantic-search prototype in a weekend now. Pick an embedding model, push your documents through it, store the vectors, and watch “search by meaning” light up in a demo. It feels like magic the first time.

Then you take it to production, real queries start arriving, and you find out the demo was the easy 5%.

I have spent the last few years running semantic search over millions of creative assets at FreePixel. This is some of the stuff the tutorials skip: the failures that only show up at scale, and the way I have come to think about search after living inside it for a while.

The bug that only existed at a million documents

Here is one that still makes me wince.

Our indexing pipeline pulled documents from Elasticsearch oldest-first, sorted by creation-date ascending. Perfectly reasonable. It worked fine for a long time.

The problem only appeared once the index held about 1.4 million documents. Because of that ascending sort, to index the newest images (the ones we had just added and most wanted searchable) the pipeline had to page through all 1.4 million older documents first to reach them. Every run, the thing that mattered most was last in line.

Nobody designs that on purpose. It is invisible at ten thousand documents and a wall at a million. I think most of the hard parts of search are shaped like this: a decision that was obviously correct early on quietly becomes the bottleneck once the data grows. The fix was small: make the sort order a parameter and let the pipeline fetch newest-first when that is what we need. Finding it was the expensive part.
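A minimal sketch of what "sort order as a parameter" can look like, written as a plain function that builds an Elasticsearch search body. The field name created_at, the page size, and the function name are assumptions made for illustration, not the actual pipeline code.

```python
from __future__ import annotations

from typing import Any, Optional


def build_fetch_query(
    page_size: int = 500,
    newest_first: bool = False,
    search_after: Optional[list] = None,
) -> dict[str, Any]:
    """Build the search body for one page of an indexing run.

    `created_at` is a hypothetical date field; use whatever your
    documents are actually sorted on.
    """
    order = "desc" if newest_first else "asc"
    body: dict[str, Any] = {
        "size": page_size,
        "query": {"match_all": {}},
        "sort": [{"created_at": {"order": order}}],
    }
    # search_after takes the sort values of the last hit from the
    # previous page, so deep pages do not need a growing offset.
    if search_after is not None:
        body["search_after"] = search_after
    return body


# Newest documents first, for the "index what we just added" run.
recent_page = build_fetch_query(page_size=100, newest_first=True)
```

The default stays oldest-first, so existing callers behave as before and only the runs that care about fresh content opt in.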

The model is about 20% of the problem

The embedding model gets all the attention. It deserves the least of your worry.

LLaMA-based embeddings, the various commercial models, the open ones, they are all good enough to get you started, and swapping one for another is rarely where your search quality lives or dies. The other 80% is the unglamorous machinery around the model:

  • How documents get in. Ingestion, chunking, what metadata rides along with each vector.
  • What happens when ingestion fails halfway. At one point we had Postgres tables that were defined in code but had simply never been created in a particular environment. A dual-write to one of them failed silently until something downstream broke and we went digging. The model was flawless. The plumbing around it was not.
  • The long tail of strange queries that no benchmark prepares you for, because your users do not search the way the eval dataset does.

If you are spending all your energy comparing embedding models and none on your ingestion pipeline, you are optimising the wrong 20%.

Pure vector search returns vibes

This is the one that surprises people who have only seen the demo.

Vector similarity gives you semantic closeness, which is not the same as correctness. Someone searches “red sports car” and gets a moody sunset back, because in embedding space those two things happened to land near each other. The model is not wrong exactly, it is doing precisely what it was built to do. It is just that “close in vector space” and “what the human actually wanted” are different questions.

Keyword search has the opposite personality. Dumb, literal, and precise. It will not understand that “automobile” and “car” are the same thing, but when it matches, it means it.

Production search almost always needs both. Hybrid retrieval, semantic reach plus keyword precision, and then a ranking layer tuned against your actual content rather than a leaderboard. Cosine distance on its own is a toy. A great toy, but a toy.
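One common way to combine a keyword result list with a vector result list is reciprocal rank fusion. The sketch below is that technique in its simplest form, not a description of any particular production ranking layer; the document ids and the constant k=60 are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears
    in, with rank starting at 1. Documents that several retrievers
    agree on rise above documents that only one of them liked.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the query "red sports car".
keyword_hits = ["car_12", "car_7", "car_3"]
vector_hits = ["sunset_9", "car_7", "car_12"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Fusion like this only gets you a sensible candidate list. The ranking layer tuned against your own content still has to come after it.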

“More like this” is harder than it looks

A related trap. Similarity features have a way of quietly sprawling across a codebase.

Working on an older Elasticsearch stack, I once went looking for how a “more like this” query was implemented, and found the same idea expressed in several slightly different forms, scattered across different parts of the application. Each one had drifted a little. Each returned slightly different results for what users assumed was the same feature. Consolidating that into one consistent, well-understood path was a real project, not a cleanup.

The lesson generalises. Retrieval logic wants to leak. If “find similar” matters to your product, give it a single home and guard it, or you will end up maintaining five versions of your own search and wondering why results feel inconsistent.
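As a sketch of what "a single home" can mean in practice: one function that builds the Elasticsearch more_like_this query, which every caller uses instead of writing its own. The field names, defaults, and function name here are assumptions for illustration.

```python
from __future__ import annotations

from typing import Any, Optional


def more_like_this_query(
    index: str,
    doc_id: str,
    fields: Optional[list] = None,
    size: int = 20,
) -> dict[str, Any]:
    """The one place that decides how "find similar" is queried.

    `title` and `tags` are hypothetical field names.
    """
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": fields or ["title", "tags"],
                # Ask for documents similar to an already indexed one.
                "like": [{"_index": index, "_id": doc_id}],
                "min_term_freq": 1,
                "max_query_terms": 25,
            }
        },
    }


similar = more_like_this_query("assets", "asset-42")
```

Tuning then happens in one place, and every surface that shows "similar items" changes together.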

Relevance is a loop, not a launch

You do not ship search and walk away. You ship it and start listening.

Which queries return nothing useful? Where do users rephrase, give up, or bounce? Which results get clicked and which get ignored? That feedback is the actual product. The index is never “done”, it is a living thing you tune as you learn how people really search, which is never how you assumed they would.
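A rough sketch of the listening part, assuming you log one event per search with the query text, the number of results, and whether anything was clicked. The event shape and the function name are hypothetical.

```python
from collections import Counter


def queries_needing_attention(events, min_count=2):
    """Return (query, failures, searches) for queries that keep failing.

    A search counts as a failure when it returned nothing or when the
    user clicked none of the results. Worst failure rate comes first.
    """
    seen = Counter()
    failed = Counter()
    for event in events:
        query = event["query"].strip().lower()
        seen[query] += 1
        if event["result_count"] == 0 or not event["clicked"]:
            failed[query] += 1
    report = [
        (query, failed[query], seen[query])
        for query in failed
        if seen[query] >= min_count  # ignore one-off queries
    ]
    return sorted(report, key=lambda row: (-row[1] / row[2], -row[2], row[0]))


log = [
    {"query": "Red Car", "result_count": 0, "clicked": False},
    {"query": "red car", "result_count": 12, "clicked": False},
    {"query": "sunset", "result_count": 30, "clicked": True},
    {"query": "sunset", "result_count": 30, "clicked": True},
    {"query": "logo", "result_count": 0, "clicked": False},
]
worst = queries_needing_attention(log)
```

A report this crude is enough to start with. It tells you which queries to go and read by hand.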

Treating relevance as a one-time launch is how you end up with a search box everyone in the company quietly stops trusting.

The short version

If I had to compress all of it into a line: search quality is really a data problem wearing a model’s clothes.

The teams winning at retrieval are not the ones with the most fashionable embedding model. They are the ones who treated ingestion, metadata, and ranking as first-class systems, who understood that the vector store is the easy part and the pipeline feeding it is where the real engineering lives.

That has been true at every scale I have worked at, and it gets more true as the data grows, not less. I hope to write more about the specific pieces (the metadata pipelines, the ranking work) in later posts.


I am Abdul Qabiz, CTO at FreePixel and co-founder of Allies Interactive, building GenAI pipelines, vector search, and the infrastructure underneath them. If you are wrestling with search that will not behave in production, that is exactly the kind of problem I like. Feel free to reach out.

RAG Is 80% Search Engineering and 20% LLM

Most “RAG systems” I get shown are a vector database, an LLM, and a bit of hope. They demo beautifully. You ask a question, the right answer comes back, everyone nods. Then real users turn up and you find out retrieval was the whole game all along.

I have spent a fair amount of time building and debugging retrieval over the last few years, and I have come round to a fairly blunt view. Retrieval-Augmented Generation is sold as plug-and-play, but in production it is one of the most underestimated systems in modern AI. Here is where it actually breaks.

Retrieval is the bottleneck, not generation

This is the bit that surprises people who have only seen the demo.

If the wrong passages come back, the smartest model on the planet will summarise the wrong thing. Fluently, confidently, in a tone that makes everyone believe it. The generation step is rarely where things go wrong. Almost every RAG failure I have debugged was actually a search failure wearing an LLM costume.

So when your RAG “mostly works” but occasionally says something confidently wrong, the instinct is to blame the model or tweak the prompt. Usually the real problem is upstream. The model was handed bad context and did exactly what it was told.

Chunking is a design decision, not a default

How you split your documents quietly decides how good your retrieval can ever be.

Split too small and you shred the context. A chunk that is half a sentence retrieves well on keywords but means nothing on its own. Split too big and you bury the signal. The one relevant line is now drowning in three paragraphs of surrounding text, and the embedding averages out into mush.

There is no universal chunk size, despite what the tutorial said. The right size depends on your content (legal documents and chat logs are not the same) and on the shape of your queries. You find it by looking at real failures, not by copying a number from a blog post.
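To make the trade-off concrete, here is a deliberately simple word-window chunker with overlap. The default numbers are placeholders to experiment with, not recommendations, and real pipelines often split on sentences or headings instead of raw word counts.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words, so a sentence that
    straddles a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(str(n) for n in range(10))
pieces = chunk_words(sample, chunk_size=4, overlap=1)
```

Making size and overlap parameters is the point: you want to re-run the same failing queries against different settings and look at what comes back.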

“Closest vector” is not “most useful passage”

Cosine similarity is a starting point, not an answer.

The nearest vector is often not the most useful passage for the question. Two pieces of text can be semantically close and still unhelpful, and a keyword-exact match can be more valuable than anything the embeddings surface. Production retrieval almost always ends up needing more than raw vector search:

  • Reranking, to reorder the top candidates with a model that looks at the query and passage together.
  • Metadata filters, so you are searching the right subset before you ever compare vectors.
  • Hybrid search, keyword and semantic together, because each catches what the other misses.
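The metadata-filter item is the easiest of the three to show in a few lines. The sketch below narrows the candidates by metadata first and only then ranks by cosine similarity. It uses tiny in-memory lists and made-up fields; a real system would push both steps down into the vector store.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filtered_search(query_vec, docs, required, top_k=5):
    """Keep only docs whose metadata matches `required`, then rank by cosine."""
    candidates = [
        doc for doc in docs
        if all(doc["meta"].get(key) == value for key, value in required.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda doc: cosine(query_vec, doc["vector"]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical documents with two-dimensional "embeddings".
docs = [
    {"id": "a", "vector": [1.0, 0.0], "meta": {"type": "photo"}},
    {"id": "b", "vector": [0.9, 0.1], "meta": {"type": "vector"}},
    {"id": "c", "vector": [0.0, 1.0], "meta": {"type": "vector"}},
]
hits = filtered_search([1.0, 0.0], docs, required={"type": "vector"})
```

Document "a" is the nearest vector here and still never appears, because it is the wrong kind of asset. A reranker would then reorder whatever this step returns.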

Similarity logic likes to sprawl

One more from real life. On an older search stack I once went looking for how “find similar” was implemented, and found the same idea written five slightly different ways across the codebase. Each had drifted. Each returned slightly different results for what users thought was a single feature.

This is worth watching for, because retrieval logic leaks. The moment “find similar” or “related items” matters to your product, it tends to get reimplemented wherever someone needed it, and now you are maintaining several subtly different search behaviours and wondering why results feel inconsistent. Give it one home and guard it.

It is a loop, not a launch

You do not ship RAG and walk away. You ship it and start watching.

Which questions retrieve nothing useful? Where does the model hedge or hallucinate because the context was thin? Which sources never get retrieved even though they should? That feedback is the work. You tune chunking, adjust filters, add reranking, improve the underlying data, and you keep doing it. The system is never finished, because the way people ask questions keeps surprising you.

The honest split

If I had to put a number on it, RAG is something like 80% search engineering and 20% LLM. The model matters, but it is the easy, mostly-solved part. The hard, durable engineering is in retrieval: ingestion, chunking, metadata, hybrid search, reranking, and the feedback loop that keeps it honest.

Teams that treat RAG as an LLM problem ship impressive demos. Teams that treat it as a retrieval problem ship things people actually rely on.

So if your RAG mostly works and you want it to fully work, do not start with the model. Go and look at what is being retrieved. I would bet the gap is there.

Payments Are a Distributed State Machine, Not CRUD

The bug that taught me the most respect for billing systems was the one where our website showed an invoice for a subscription the customer had already cancelled.

That actually happened. The subscriptions page did not list it, cancelled and gone. But somewhere else in the product an invoice for it was still cheerfully on display. Two parts of the same system disagreed about whether someone was even a customer.

Payments look like CRUD until you have lived in them. Create a subscription, read its status, update the plan, delete on cancel. It feels like a table with some rows. Then you realise it is a distributed state-machine where the source of truth sits at the provider (Stripe, Razorpay), reaches you as webhooks, and those webhooks arrive out of order, more than once, and sometimes not at all.

Here are a few scars from that stretch of work, and what they drilled into me.

A subscription that quietly stopped renewing

A Razorpay subscription stopped auto-renewing in production. No error, no alert. It just did not renew, and we found out the slow way.

The cause was not Razorpay. It was us. A few weeks earlier we had split webhook handling into a separate path, a reasonable refactor on its own. But a renewal event slipped through the seam between the old handling and the new. The event came in, nothing was listening for it in the right place, and the subscription silently lapsed.

This is the thing about webhooks. They are not a request you control and can retry on your terms. They are an event stream the provider pushes at you, and any gap in your handling shows up later as money that did not move. When you change how you process them, you are changing a load-bearing wall.

“Invalid integer: all”

Around the same time, a merge to master lit up our Firebase Cloud Functions with this:

error: "Invalid integer: all"
message: "Failed to fetch invoices from Stripe"

A tiny type assumption somewhere was passing the string all where an integer was expected. Completely invisible in code review. Loud and immediate in production, the moment it ran against real Stripe data.
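A sketch of the kind of guard that catches this class of bug before the request leaves your code. It assumes the stray value was a page-size style parameter, which the error message suggests but does not prove; the function name, the default, and the bounds are illustrative.

```python
def parse_limit(raw, default=10, maximum=100):
    """Turn a caller-supplied limit into an int, or None meaning "paginate".

    Returns None for the sentinel "all" so the caller has to handle it
    explicitly instead of forwarding the string to the provider.
    """
    if raw is None:
        return default
    if isinstance(raw, bool):
        raise ValueError("limit must be an integer or 'all'")
    if isinstance(raw, str):
        text = raw.strip().lower()
        if text == "all":
            return None
        if not text.isdigit():
            raise ValueError("limit must be an integer or 'all', got %r" % raw)
        value = int(text)
    elif isinstance(raw, int):
        value = raw
    else:
        raise ValueError("limit must be an integer or 'all'")
    if not 1 <= value <= maximum:
        raise ValueError("limit must be between 1 and %d" % maximum)
    return value
```

With this in place, "all" becomes a decision the calling code makes on purpose (loop over pages) rather than a string that reaches the API by accident.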

I mention it because it is so ordinary. The expensive payment bugs are almost never clever. They are a string where a number should be, an event handled in the wrong place, a status that two systems read differently.

What I actually learned

A few principles that I now treat as non-negotiable when money is involved.

Make webhook handling idempotent. The same payment will be reported to you several times. A retry, a duplicate, a replay after an outage. Your job is not to process each delivery, it is to land in the same final state no matter how many times you see the event. Dedupe on the provider’s event id, make the state transition safe to apply twice, and stop assuming “exactly once”.
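A minimal in-memory sketch of both halves of that advice: dedupe on the event id, and make the state change itself safe to repeat. The event fields (id, subscription_id, period_end) are a simplified, hypothetical shape, not any provider's real payload, and in production the set and dict would be database tables with a unique key on the event id.

```python
class RenewalLedger:
    """Tracks how far each subscription is paid, tolerant of redelivery."""

    def __init__(self):
        self.processed_event_ids = set()
        self.paid_until = {}  # subscription id -> period end (epoch seconds)

    def handle_renewal(self, event):
        """Apply a renewal event. Returns False if it was a duplicate."""
        if event["id"] in self.processed_event_ids:
            return False  # already handled: a retry or a replay
        sub_id = event["subscription_id"]
        current = self.paid_until.get(sub_id, 0)
        # max() makes the update safe to apply twice and safe against
        # an older event arriving after a newer one.
        self.paid_until[sub_id] = max(current, event["period_end"])
        self.processed_event_ids.add(event["id"])
        return True


ledger = RenewalLedger()
event = {"id": "evt_1", "subscription_id": "sub_1", "period_end": 200}
first = ledger.handle_renewal(event)
second = ledger.handle_renewal(event)  # same delivery again
```

The second call changes nothing, which is the whole point: the final state depends on which events exist, not on how many times they were delivered.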

The provider is the source of truth. Your database is a cache that is allowed to be wrong. This sounds obvious and almost nobody builds like they believe it. Your local subscription.status is a convenience copy. It will drift. So you reconcile against the provider rather than trusting your own row, and you build a way to re-sync when (not if) they disagree.
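Reconciliation can start as something this small. The sketch assumes you can load both sides as a mapping of subscription id to status; fetching the provider side is left out, and all names are hypothetical.

```python
def find_drift(local_rows, provider_rows):
    """Return (subscription_id, local_status, provider_status) mismatches."""
    drift = []
    for sub_id in sorted(set(local_rows) | set(provider_rows)):
        local = local_rows.get(sub_id)
        remote = provider_rows.get(sub_id)
        if local != remote:
            drift.append((sub_id, local, remote))
    return drift


def resync(local_rows, provider_rows):
    """Overwrite local statuses with the provider's where they differ.

    Rows the provider does not know about are left untouched and
    returned for a human to look at, rather than silently deleted.
    """
    needs_review = []
    for sub_id, _local, remote in find_drift(local_rows, provider_rows):
        if remote is None:
            needs_review.append(sub_id)
        else:
            local_rows[sub_id] = remote
    return needs_review


local = {"sub_1": "active", "sub_2": "active", "sub_3": "cancelled"}
provider = {"sub_1": "active", "sub_2": "cancelled", "sub_4": "active"}
drift = find_drift(local, provider)
```

Running find_drift on a schedule and alerting when it returns anything is often more valuable than the automatic repair.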

State changes are not free-form. A subscription goes active, past_due, cancelled, and the valid transitions between those are a small graph. Writing each webhook handler as “set status to whatever the event says” is how you end up showing invoices for cancelled subscriptions. Model the state-machine explicitly and reject transitions that should not happen.
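A sketch of an explicit transition table for the three states named above. Which transitions are allowed is an assumption made for illustration (for example, that cancelled is terminal); your provider's lifecycle decides the real graph.

```python
# Hypothetical transition graph: current status -> statuses it may move to.
# None stands for "we have never seen this subscription before".
ALLOWED_TRANSITIONS = {
    None: {"active"},
    "active": {"past_due", "cancelled"},
    "past_due": {"active", "cancelled"},
    "cancelled": set(),
}


class InvalidTransition(Exception):
    """Raised when an event asks for a move the graph does not allow."""


def apply_transition(current, new):
    """Return the new status, or raise if the move is not allowed."""
    if new == current:
        return current  # a redelivered event: nothing to do
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition("%r -> %r is not allowed" % (current, new))
    return new


status = apply_transition(None, "active")
status = apply_transition(status, "past_due")
status = apply_transition(status, "cancelled")
```

A rejected transition should be logged and investigated, not swallowed. It usually means events arrived out of order or two systems disagree.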

The bugs that cost money do not throw exceptions. That is the uncomfortable part. A crash you will see. A subscription that silently fails to renew, or two pages that disagree about a customer’s status, those just sit there quietly costing you until someone notices. So you instrument the money-paths more carefully than the rest of your app, and you alert on the absence of events you expected, not just on errors.
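Alerting on silence can be as plain as comparing what should have happened by now with what actually arrived. In this sketch the inputs (a map of expected renewal times and a set of subscriptions whose renewal event was received) and the six-hour grace period are assumptions.

```python
from datetime import datetime, timedelta, timezone


def overdue_renewals(expected, received, now, grace=timedelta(hours=6)):
    """Return subscription ids whose renewal event should have arrived.

    expected: subscription id -> datetime the renewal was due
    received: set of subscription ids whose renewal event came in
    """
    return sorted(
        sub_id
        for sub_id, due in expected.items()
        if sub_id not in received and now > due + grace
    )


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
expected = {
    "sub_a": datetime(2024, 1, 10, 0, 0, tzinfo=timezone.utc),
    "sub_b": datetime(2024, 1, 10, 9, 0, tzinfo=timezone.utc),
    "sub_c": datetime(2024, 1, 9, 0, 0, tzinfo=timezone.utc),
}
missing = overdue_renewals(expected, received={"sub_c"}, now=now)
```

Anything this returns is worth a page or a ticket, because no exception is ever going to tell you about it.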

The mindset shift

If I had to compress it: stop thinking of billing as your data and start thinking of it as a projection of the provider’s truth, assembled from an unreliable event stream.

Once that clicks, a lot of the defensive work stops feeling like overhead and starts feeling like the actual job. Idempotency, reconciliation, explicit state, alerting on silence. None of it is glamorous. All of it is the difference between a billing system you trust and one you are quietly afraid of.

If your billing “mostly works”, that is exactly the phrase I would worry about. Go and find the place where two parts of your system disagree about who is paying you.

Yahoo! Mail IMAP Download Limit Issue

I recently realised that Mail.app on my Mac doesn’t show more than 10000 messages in my Yahoo! Inbox. I thought it could be an issue with Mail.app, so I configured Thunderbird to download Yahoo! mail, but even Thunderbird couldn’t download more than 10000 messages.

I searched around to learn whether Yahoo! has changed something; indeed they have. It seems that with their regular two-way sync IMAP you can only download 10000 messages from the Inbox (or per folder) using third-party mail software (Mail.app, Thunderbird, etc.).

Yahoo! provides a way to export messages (one way) via a different IMAP server.

We need to change the IMAP server to export.imap.mail.yahoo.com, and then we can download all messages. You can learn more about this here.
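If you would rather check from a script than from a mail client, something like the following should work with Python's standard imaplib, assuming the export server behaves as described above. Port 993 (the usual IMAP-over-SSL port) and the need for an app-specific password are my assumptions, not something I have verified against Yahoo!.

```python
import imaplib

EXPORT_HOST = "export.imap.mail.yahoo.com"
IMAPS_PORT = 993  # assumption: the standard IMAP-over-SSL port


def count_inbox_messages(user, password, host=EXPORT_HOST, port=IMAPS_PORT):
    """Log in, open the Inbox read-only, and return its message count."""
    conn = imaplib.IMAP4_SSL(host, port)
    try:
        conn.login(user, password)
        status, data = conn.select("INBOX", readonly=True)
        if status != "OK":
            raise RuntimeError("could not open INBOX: %r" % (data,))
        # On success, select() returns the message count as bytes.
        return int(data[0])
    finally:
        conn.logout()
```

Calling count_inbox_messages("you@yahoo.com", "your-password") against both servers is a quick way to see whether the 10000 limit applies to your account.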

Stay Safe

The year 2020 hasn’t started well; our world is going through a pandemic caused by a novel coronavirus (COVID-19).

People across the globe are suffering in many ways; many have died, many others are infected, thousands are under isolation, many thousands are quarantined, and many millions are under lockdown.

Everyone appears so small in front of this virus; from the powerful and developed countries to developing nations with poor populations of billions; from the richest to the poorest.

At this time, every country is looking to other countries for help and support; every human being is counting on another human being for help and support (social distancing and other measures to contain the virus, charity, food, and so on).

The issues we have been fighting over for decades have become non-issues at a time when the whole world is trying to fight this common enemy.

However, there are a few (media, politicians, or people) who are still doing what they were doing: communalising, spreading hatred, creating enmities between communities, and so on. It appears that they have sold their souls to Satan, and there is no humanity left in them.

We (humans) need to learn a lot from this crisis. We have been bad to nature and to each other. We have caused so much damage.

We should learn to live in harmony with each other and with nature; learn to care for each life on this planet; learn to make sure no one is hungry or thirsty; learn to remove disparities; learn to respect wildlife; learn to respect everyone and everything.

That’s how we avoid future pandemics, if we survive this one.

I hope that this virus is contained before it takes any more lives directly or indirectly (economic aspects of lockdowns).

Stay safe everyone.

Hello from 2019!

The year 2019 is about to end. I haven’t written in this space for over a year. I thought I would write a short entry to say “hello” to anyone who still visits this blog.

Isn’t it a shame I got spoiled by Twitter and stopped blogging here? You know the irony? It is that I don’t even follow Twitter anymore, and the only time I tweet is to complain about a product or service :-)

Time dilutes everything -- we can’t laugh the same way at the same joke over and over again; hence we can’t keep crying over the same issues over and over again.

I have accepted things -- I wrote many entries about the hardship of doing software/IT business from smaller cities, made peace with it, and perhaps found a way to deal with things. As it’s said, necessity is the mother of invention.

Anyway, I have been spending most of my time building a business for a narrow vertical - helping Zendesk customers take their self-service offering (Help Center, Knowledge Base) to the next level.

A part of my day goes into talking to customers and helping them, with the help of my team. The rest of the time, my team and I spend building things using HTML, CSS, and JavaScript with a mix of Zendesk’s Curlybars (a Handlebars derivative) templating language.

Had I ever imagined I would find myself devoting most of my time over the years (at least 4 now -- since I got involved) to something that’s technologically this trivial compared to what I have done in the past, and can still do?

Why do I do it then? I do it because I realise that even a simple product (made using a simple recipe and with simpler ingredients) brings smiles and a lot of happiness to thousands of users (mostly non-technical) every day. It’s that purpose that drives us, and personally keeps me happier than ever. We walk the extra mile to help our customers without stopping to think about whether it should cost extra or not; we don’t even talk about that most of the time.

We have thousands and thousands of happy customers, and we have got tons of testimonials over the years. When someone is able to build what they envision on top of our product, and then they celebrate by printing the work on a cake -- that makes our day.

[Image: Customised Diziana Drogo Theme printed on a cake.]

Whatever (work) time I am left with, I spend learning and practising new stuff, and managing a fleet of servers and services. Once in a while, there is a busy stretch of 1-4 weeks when I am building some (micro) service or integration for other ongoing projects in the company. I love these short bursts where I push myself to actually deliver things using the latest and greatest (stable) tech stack under a lot of time pressure. I have learnt so much, and it helps me feel relevant even when I am aging like my MacBook :-)

It’s not about me anymore; it’s about them (team, customers, company -- all people who are positively affected).

Don’t just scratch the surface

We tend to settle into a comfort zone after scratching the surface, i.e. barely learning about anything.

The little confidence that comes from scratching the surface is good, but we don’t need to settle there; we need to keep scratching below the surface.

I have interviewed and worked with many people over the years; people with different educational degrees (Bachelor of Technology, Bachelor of Science, Master of Computer Applications, etc.).

I realised that few knew any topic in depth; most had only touched the surface. I intentionally didn’t use the word scratching because I found they didn’t even do that.

Simon Sinek correctly says that this is the age where we lack patience, and want instant gratification for many reasons.

Isn’t it easier to google a problem and get results with quick solutions, e.g. from one of the Stack Exchange family of sites (Stack Overflow)?

That’s useful, but not always. If we want to build a career in anything, we need to work harder than that.

Most people end up copying code from those green ticked answers. AFAIK, the green tick means the answer is accepted by the person who asked the question?

That means, the answer might not objectively be acceptable in all situations or by everyone?

If one spends some time critically reading and thinking through the entire thread, some learning (applicable in similar future situations) can be expected to happen.

I believe Stack Overflow and similar sites are very useful, provided they are used to enhance the learning process and to weigh different options/opinions about a problem.

It takes years to get good at something. Some technologies might not last for years, but many (especially web standards or other standards) will be there in better forms in the years to come.

We can only create or contribute anything useful if we keep learning in a systematic manner.

Once we get used to it (systematic manner: discipline, focus, getting below the surface), momentum builds, and it takes only a little force to learn new versions of standards/tools/languages and apply them effectively.

It’s important that beginners spend time learning the jargon, concepts, and fundamentals; keep reading and practising every day (follow a book or a good course, and complete it); and take every opportunity to go deeper into the subject at hand.

Our career is not limited to work hours, hence our learning should keep happening in whatever time we can manage beyond work hours. I am sure we all can manage enough to shine.

We can’t expect to use time at work to learn basics or read books or practice; a good professional won’t do that.

A good professional practises and tries to be ready to perform whenever required. Like many others, our field of work requires professionalism and craftsmanship.

Let’s say you are asked to work on an existing project that uses JavaScript for the frontend (with some framework) and the backend (most of the stuff: API, workers, etc.).

Assuming you have never worked with JavaScript, how would you start contributing to the project as soon as possible?

The answer deserves a long post, which I will write soon. Meanwhile, why don’t you share your experience or opinions?


This post was written using the WordPress mobile client. Please let me know if you find any typos.

Hiring is hard

Ok, I know I shouldn’t generalise things like that.

I will stick to my case: hiring is hard in Kanpur. Did you know I moved back to this city ten years ago? I guess it was April 30 or May 1, 2008.

Among my many failures, the biggest one has been not being able to build or sustain the kind of team I wanted. See, it’s my fault there.

I can’t even find good HTML/CSS guys here. Or am I failing to reach out and connect with them? If you have any better ideas, I am all ears.

I tend to hire people who would value working in Kanpur: for example, someone pursuing a B.A. or B.Sc. degree, or someone who has strong reasons to be in Kanpur. I am trying to be patient while they spend ages learning things. Patience is a virtue.

MacBook Pro 2017 - Keyboard & Trackpad issues

I started using an Apple MacBook Pro 2017 (15 inch) around two weeks ago.

I have been a little unproductive since then because of keyboard and trackpad issues:

  • The keyboard isn’t as good as the one in MacBook Pros from 2011. I don’t feel the feedback from the keys, and my typing accuracy has been very poor for some reason.
  • The trackpad doesn’t allow me to drag a file with a single finger (tap + tap pressed) without drag-lock. The dragging stops the moment I press any key on the keyboard, e.g. to switch to the window where I want to drop the file.

I don’t want to use dragging with drag-lock or with three fingers. I don’t want to reconfigure everything else to use four-finger gestures.

Unlearning something is painful and frustrating. I am happy to learn new things if I think those are better than old ones. I don’t think that’s the case here.

I will update this post if I find any other issues.

Been a while

Hope you are all doing well. I know some of you still visit this space to check what I am up to.

It has been a while since I posted anything here. I am writing now to let everyone know that I am still around.

I think of writing sometimes but then I go blank. With growing age, I have started feeling I don’t know anything.

Anything I think of writing has already been written, or perhaps not, but that is how I think. What value will my writing offer? Wouldn’t it add more noise to an already noisy internet, causing information overload?

I might be wrong. I guess I can write something that might be useful to someone. I will try to.