
What's Missing in AI Patent Search

Prior art search has been failing for a long time. Not just AI — the way the world does prior art search has been slowly breaking down for well over a decade. The volume of patents has grown faster than our capacity to search them. The expertise required is deep, scarce, and doesn't scale. Even the best searcher in the world can only work one case at a time. The system is only as good as its weakest link, and that is a compounding problem for everyone.

This matters more now than ever. Every AI system being built for patents — drafting, analysis, portfolio strategy, licensing, due diligence — depends on having the right context. If the retrieval is wrong, everything downstream is wrong. LLMs with partial information will confidently draw wrong conclusions. You won't know what you missed until it's far too late.

So the stakes are high. AI seems like it should already have the answer. New tools are launching seemingly monthly. In-house counsel are vibe-coding solutions and getting real results. While more people are accepting that LLMs are powerful tools for analyzing and drafting patents, prior art search remains a source of frustration. This raises the question: can AI actually do prior art search?

Benchmarks and comparisons are a natural way to try to answer that question. The problem is that many of the comparison studies out there aren't actually comparing the same thing. And the conclusions people are drawing — which approaches work and which don't — are built on a fundamental misunderstanding.

The Confusion

For AI-based prior art search to work, we need two different things.

1. Better systems that can consistently and correctly take a request and deliver results. The conversation around these systems is: How well does the system perform end-to-end? Does it find the right art? Can you trust it? Does it justify the cost? This conversation matters for adoption, for building trust with practitioners, and for justifying ROI.

2. Better tools for patent-specific information retrieval. These are the individual techniques and components that systems are built on. Questions here are: How good are the embeddings? How do embeddings compare to traditional semantic approaches? How do citation, classification, and keyword techniques compare? Which techniques are complementary? This conversation matters for building better tools, for understanding where the technology actually is, and for making real progress on the core problem.

Both conversations are important. Both need rigorous measurement. But they require completely different evaluation methods, and right now the industry is mixing them together. We cannot benefit from tools that we do not understand.

What This Looks Like in Practice

Here's the pattern we keep seeing: someone tests several AI tools by running a query and either reviewing some set of top hits or comparing them to past search results they expect to see. They count relevant hits or matches. They check for overlap across approaches. Then they rank the tools. They may keep this as an internal comparison or they may publish the results.

The implicit claim is: "I tested these tools and here's how they compare." But what's actually being compared? One tool uses a classifier to predict CPC codes, then runs a semantic search constrained to only patents within those classes, and returns roughly 90 pre-filtered results. Another tool uses an agentic pipeline — the AI decomposes the query, runs multiple searches, evaluates results, and returns a curated set of 42. A third tool gives you a raw ranking of the entire corpus with a set of tools to help you refine.

These aren't the same thing. The first is a retrieval pipeline with an automated pre-filter. The second is a multi-step autonomous search system. The third is an information retrieval component — a building block that's designed to be combined with other building blocks.

This is a product evaluation and product evaluations are valuable. The confusion comes when a product evaluation is presented as a technology evaluation. The three approaches above are apples and oranges. Comparing a multi-step pipeline with an individual semantic search is like comparing an examiner's full search report against a query run on a single class code. In the first case we can measure how good the examiner was. In the second case we can measure how effective the class codes are. If the examiner did better than the class code, would you say that class codes just aren't effective enough yet?

Both measurements help us improve but only if we're clear about what they are measuring.

Why This Matters More Than It Seems

Continuing with the example above, imagine a benchmark that ranks class codes (just one retrieval method) last and examiner search first. If you didn't know about class codes or how examiners search, what conclusions would you draw? You'd probably conclude that class codes are inferior and possibly even stop using them. What you wouldn't see is that examiners rely on class codes as well as other techniques.

A much more useful benchmark would compare citation search, class search, and keyword search separately. Then evaluate the optimal ways to combine them. Then use that to train examiners to search more effectively -- optimizing the system.
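To make "optimal ways to combine them" concrete: one standard technique for merging ranked lists from different retrieval methods is reciprocal rank fusion (RRF). The post doesn't prescribe any particular fusion method, so this is purely an illustrative sketch — the document IDs are made up, and the constant `k=60` is a conventional default, not something from the post.

```python
# Reciprocal rank fusion (RRF): merge ranked lists from several
# retrieval methods into a single ranking. Illustrative only --
# the result lists below are invented.
from collections import defaultdict

def rrf(rankings, k=60):
    """rankings: list of ranked lists of document IDs (best first)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first (stable for ties)
    return sorted(scores, key=scores.get, reverse=True)

citation_hits = ["US1", "US7", "US3"]   # hypothetical citation search
class_hits    = ["US3", "US1", "US9"]   # hypothetical class search
keyword_hits  = ["US7", "US3", "US2"]   # hypothetical keyword search

fused = rrf([citation_hits, class_hits, keyword_hits])
print(fused[:3])  # → ['US3', 'US1', 'US7']
```

The point of a fusion like this is that a document found by several independent methods rises to the top, which is exactly the kind of complementarity a component-level benchmark can measure.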

But that optimization is only possible when you have clear benchmarks of the individual components. The tools available determine the ceiling on what the system can achieve. In 2017 transformers kicked off an AI revolution in natural language processing, giving us a new set of tools for information retrieval. A deeper understanding of those components and how they can be used effectively to optimize the system is sorely missing in AI prior art search benchmarks.

What This Means for Buyers of AI

Most AI solutions today provide a system, not a component. The products offered combine many retrieval approaches to give you that first set of results. This is understandable: the people selling those products naturally want you to have the best possible experience right away. Prior art search is complicated, so presenting the results from a single retrieval approach isn't the best way to make a good first impression. Some of the products may then offer ways to optimize and refine from there.

I've seen companies buy components expecting a system and then regretting the purchase when the component alone wasn't finding their past search results. I've also seen companies buy a system that worked great in 7 of their 10 test cases only to find out that at scale it ended up only working for about 2 out of 10 -- with no clear path to improvement.

If you conflate systems and components, you can't build on them. If the system is already a black box combining various approaches and it doesn't work, you have no meaningful way to make it better. You don't know which piece failed or why. Your only option is starting over. If a component didn't work, you can see why, compare other components, and use them together. You can build a reliable system on that.

Amplified is, by our own admission, different. We provide raw retrieval components in an interface that allows you to combine them the way you wish. That didn't always deliver the best first impression, but it let us learn enough not just to build a better system, but to raise the ceiling of what is possible.

Our goal since founding has been to restore patent quality by solving prior art search at scale. We believe that the existing search technologies were not good enough to achieve that goal even in combination. So, instead of trying to sell a system with a low ceiling, we set out to build the missing pieces. Doing that well meant providing building blocks transparently and partnering with our customers to understand how best to combine them.

The Transparency Problem

This confusion around benchmarks exists partly because most vendors have no incentive to clarify it. If your tool applies classification code filters, keyword constraints, and neural re-ranking behind the scenes before the user sees a single result — and that pipeline produces a clean first page — why would you explain how the sausage is made? The opacity serves you.

Most AI search tools won't tell you which components retrieve what. They won't tell you when or how their system constrains your search space. They won't tell you what you're not seeing. This isn't necessarily malicious — there are legitimate reasons to abstract complexity. But making meaningful evaluation impossible limits progress.

When a tool returns 90 results all in one CPC subclass, is that a search engine or a classification-constrained pipeline? When another returns 28 curated families, is that retrieval or autonomous research? Without transparency, you can't know — and if you can't know, you can't design a fair test.

Comparing a Retrieval Component and a Search System

Here's what I mean in concrete terms. A recently published benchmark study tested the following query: "usage of ultrasound to accelerate whisky aging". Amplified's results look dramatically different depending on which layer you test:

| What's Being Tested          | Families | Relevant | Rate  |
|------------------------------|----------|----------|-------|
| Unfiltered lexical retrieval | 60       | 7        | 11.7% |
| Retrieval + class filter     | 60       | 42       | 70.0% |
| Agentic system (reasoning)   | 28       | 21       | 75.0% |

Same query. The difference is entirely about which layer you evaluate. If you test an individual retrieval component alone, you get 11.7%. If you test the system, you get 75%. Both numbers are real. Both are valuable. But they answer different questions.
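For clarity, the "Rate" column is simply relevant hits divided by total families returned at that layer. The numbers below are taken directly from the table; the dictionary structure is just for illustration.

```python
# Relevance rate per evaluation layer: relevant hits / families returned.
# Figures are from the table above.
layers = {
    "Unfiltered lexical retrieval": (7, 60),
    "Retrieval + class filter": (42, 60),
    "Agentic system (reasoning)": (21, 28),
}
for name, (relevant, families) in layers.items():
    print(f"{name}: {relevant / families:.1%}")
```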

Evaluating Systems vs. Evaluating Retrieval Methods

Both types of evaluation are worth doing. The industry needs more of both, not less. But the pitfalls are different for each, and mixing them up undermines the real work.

When evaluating search systems, the common pitfall is testing too few cases and only measuring previously found references without considering new ones. This is usually a result of busy people being asked to test too many systems in a small window of time. The problem is that a system that looks great on 7 out of 10 test cases will most likely perform very differently across 100 real-world searches. System evaluation needs to reflect the work it's replacing.

When evaluating retrieval components, the pitfall is the opposite — testing them as if they were systems. A retrieval component is a building block. Relative performance is more important than absolute. In other words, does an embedding search find something that other approaches miss? Would embedding A or B rank the desired results higher within the same fixed pool? These are the right questions because they help you understand how this new component changes the system. Evaluating this requires less human review than a system evaluation but you need much larger sample sizes to get statistically meaningful results.
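One concrete form of that relative comparison: fix a candidate pool and a set of known-relevant documents (e.g. examiner citations), then ask which component ranks the relevant ones higher, for instance via recall@k. A minimal sketch with hypothetical rankings — the metric choice and all names here are illustrative, not something the post specifies:

```python
# Compare two retrieval components over the same fixed candidate pool:
# which one ranks the known-relevant documents higher?
def recall_at_k(ranking, relevant, k):
    """Fraction of relevant docs appearing in the top-k of ranking."""
    return len(set(ranking[:k]) & set(relevant)) / len(relevant)

relevant = {"D2", "D5", "D8"}  # known-relevant, e.g. examiner citations
rank_a = ["D2", "D1", "D5", "D3", "D8", "D4"]  # embedding A's ordering
rank_b = ["D1", "D3", "D2", "D4", "D6", "D5"]  # embedding B's ordering

for name, ranking in [("A", rank_a), ("B", rank_b)]:
    print(f"embedding {name}: recall@3 = {recall_at_k(ranking, relevant, 3):.2f}")
```

Because both components are scored against the same pool and the same relevance labels, the comparison isolates the component's ranking behavior from everything else in the system — which is exactly what a component-level evaluation should do.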

The worst outcome is when these get crossed — when a retrieval component gets tested with a system-level benchmark and the results are reported as a verdict on the technology. That happens a lot and it doesn't help anyone.

We'd like to see more studies, not fewer. But we'd also like to see them be clearer about what's being measured and why. If you've published or come across evaluations of AI patent search tools, we'd love to hear about them — tag us or share them in the comments so we can all learn from each other.

What This Taught Us About Ourselves

We need to be honest about our part in this confusion.

We built Amplified around a different philosophy — we explicitly give you the components of a search system and let you combine them. Neural and classic semantic search, keyword filtering, classification codes, citation analysis — each is visible, each is controllable, each can be used independently or together.

We wrote about this approach back in 2022 — the idea that the most powerful search combines professional intuition with AI capability, using your expertise to guide the machine rather than hoping the machine guesses right on its own.

That transparency has come at a cost. When evaluators test our semantic retrieval component in isolation — the raw, unfiltered baseline — they sometimes see a broad landscape of results. It can look bad next to a tool that has already done several rounds of automated filtering behind the scenes. We've lost deals because of this exact comparison. That's on us.

By exposing our retrieval component as the default experience, we hoped to provide transparency and composability. But we also made it easy for people to have a bad experience and decide it's not worth their time. That's partly an interface problem we're working to address, and partly a communication gap. We could have done more to make the distinction between our retrieval component and our complete search system obvious from the start.

We still believe that transparency is the right approach — that practitioners should be able to see and control every component of their search. But we're recognizing that we also need to make it easier to experience what the full system delivers, not just the individual building blocks.

To do this, we started piloting AI agents last November. This has made things much clearer because our agent combines the tools for you. You still get the transparency of seeing which tools the AI used, but you no longer have to learn the best way to combine them. Now the AI does it for you, analyzes the results, iterates based on what it learns, and gives you a final report. It can run fully autonomously but knows when to check with the human for feedback or clarification.

The Missing Piece

Patent quality is one of the biggest unsolved problems in innovation. The patent system was designed to fuel progress, but it's become so opaque that the people who depend on it most can't see clearly enough to make good decisions. That opacity drives wasted filings, bloated portfolios, duplicated effort, and strategic decisions made on gut feel instead of real intelligence.

Solving this requires getting both conversations right — building better components and building better systems. But the component problem comes first, because you can't build a reliable system on a weak foundation. We've been building AI for patents since 2017, starting with custom embeddings across the entire global patent corpus. We built those embeddings because we believed the fundamental retrieval piece was the missing foundation. Word2vec, doc2vec, TF-IDF, BM25, citation graphs, classification codes, boolean logic, off-the-shelf vector search — these all have real limitations for patent-specific work. Each captures something useful, but none of them alone is good enough to build reliable search systems on. The industry needed a better cornerstone component. That's what we set out to build.

Everything in Amplified — the search modes, the filters, the agentic capabilities, the analysis tools — is built on top of that foundation. When we built our product, we thought of it as a research tool: give people the components of a search system in a transparent way and let them combine those components using their own expertise. Now, thanks to LLMs, we can have AI agents doing that combining instead of asking users to do it manually.

That's the direction the entire industry is heading. Every company building an AI patent search system — whether it's agentic, hybrid, or something else — needs a strong retrieval foundation to build on. Most are trying to build that foundation themselves, on top of general-purpose models that weren't designed for patents. Some are doing interesting work. But building and maintaining patent-specific embeddings across the global corpus is an enormous infrastructure challenge, and it's not where most teams should be spending their energy.

We've started to realize that this foundation shouldn't stay locked inside our product. If the bottleneck for the industry is the retrieval layer — and we believe it is — then the best thing we can do is make that layer available to the people trying to solve the harder problems on top of it.

So we're starting to open it up. We're working with a small group of companies right now to give them direct access to the same patent-specific embeddings and composable search infrastructure that powers Amplified. The goal is to let teams building AI patent tools focus on combining components into systems that work — instead of spending years rebuilding the retrieval foundation from scratch.

It's early. We're still learning what this looks like in practice, and we want to build it alongside the people who'll actually use it. If you're working on patent search tooling or workflows and this resonates, we'd welcome the conversation — get in touch.

More broadly, we'd welcome discussion on any of this — the evaluation problem, the technology, the data, including the parts that don't flatter us. If anyone working in this space wants to compare approaches or collaborate on better evaluation standards, we're here for it.

Prior art search is too important to get wrong. So is evaluating the tools that do it. And so is building the foundation that makes better tools possible.