Recently I spent quite a bit of time thinking about the differences between modern search and search as it was 10 years ago. Modern search is data driven. It doesn’t need (much) manual adjustment of weights or anything else. Modern search self-adjusts by consuming logs of user behavior, query logs, and other data sources (such as crawls of relevant websites).
Modern search is semantic (as opposed to full text). It includes components designed to provide query understanding and enhancement logic. Many of these components are extremely low latency (responding within 30 ms) because they interact with the user during query building (for example, autocomplete). Modern search typically doesn’t make much use of classic inverted word indices, because the document is represented by a semantic description.
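To make the latency point concrete, here is a minimal sketch of one such query-building component: a prefix trie for autocomplete. The class and names are illustrative, not from any particular system; the idea is that storing the top completions at each node makes a lookup proportional only to the prefix length, which is how these components stay within tight latency budgets.

```python
# Minimal prefix-trie autocomplete sketch. Top completions are precomputed
# per node, so a lookup walks only len(prefix) nodes -- no index scan.

class AutocompleteTrie:
    def __init__(self):
        self.children = {}
        self.completions = []  # top suggestions cached at this node

    def insert(self, phrase, max_per_node=5):
        node = self
        for ch in phrase:
            node = node.children.setdefault(ch, AutocompleteTrie())
            if phrase not in node.completions and len(node.completions) < max_per_node:
                node.completions.append(phrase)

    def suggest(self, prefix):
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.completions

trie = AutocompleteTrie()
for q in ["search engine", "search api", "semantic search"]:
    trie.insert(q)

print(trie.suggest("sea"))  # ['search engine', 'search api']
```

A production version would rank completions by popularity from query logs rather than insertion order, but the lookup cost stays the same.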
Modern search is highly personalized. Personalization, in the form of the user’s input, context, and past behavior, participates in query building and in other activities such as Learning to Rank (LTR). The query itself grows significantly in size and can reach several hundred kilobytes.
Semantic search means embeddings, and that means several machine learning models deployed as part of the search environment.
We need a model to generate embeddings for static data, another (or a variant of the first) to generate a representation of the query, possibly yet another model to handle user dialog, and yet another model to rank the responses. There are two fundamental problems that are independent of the nature of your data and the goal of your search. One of them is the difference between the environments and tools preferred by data scientists for all things machine learning (ML) and by enterprise software developers for all things production. Data scientists prefer mainly Python-based environments with various Python-specific toolkits (pandas, TensorFlow, etc.). Enterprise developers run services mainly written in Java, in Java-oriented environments. The issue is transferring a model trained in a Python-specific toolkit into a Java environment and running it there with sufficient performance.
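One pragmatic pattern for bridging that gap (ONNX is the common production-grade choice; the sketch below uses plain JSON purely to show the shape of the hand-off) is to export only the trained parameters in a language-neutral format and re-implement the small inference step natively on the serving side. All names here are illustrative:

```python
# Sketch of the Python-to-Java hand-off: instead of shipping the Python
# runtime, serialize the trained parameters in a language-neutral format
# (JSON here for brevity; ONNX in practice) so the Java service can load
# them and mirror the tiny inference step natively.
import json

def export_linear_model(weights, bias, path):
    """Serialize a trained linear scorer for a non-Python serving stack."""
    with open(path, "w") as f:
        json.dump({"weights": weights, "bias": bias}, f)

def score(model, features):
    """The inference step a Java service would mirror: dot product + bias."""
    return sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]

export_linear_model([0.5, -1.0, 2.0], 0.1, "model.json")
with open("model.json") as f:
    model = json.load(f)
print(score(model, [1.0, 1.0, 1.0]))  # 0.5 - 1.0 + 2.0 + 0.1 = 1.6
```

For real embedding models the exported artifact is a computation graph rather than a weight vector, but the principle is the same: the serving side depends on the format, not on the training toolkit.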
Another fundamental problem is testing-related, and it is the problem I want to discuss today. Models work on the data they are trained on, yet the data in test and prod environments is typically nothing alike, as very few enterprises run a continuous process migrating data from prod to test. Such migrations are hideously expensive in terms of both development effort and runtime costs: the data has to be anonymized and cleaned up while somehow staying consistent across all databases. Considering that every (micro)service typically has its own storage, keeping all of that in sync during a data migration is highly non-trivial. Search is no exception, since during online embedding generation some of the features typically arrive in real time from various services; personalization information, for example, changes quite quickly.

The only truly generic solution to this problem is to have no separate testing environment at all (unfortunately, this is also very common). My preferred solution is to test in the production environment, but in isolation from real production traffic. The first step is collecting the testing data. For that, I need a “feature flag”: a configuration option that is accessible by everything at runtime and can be changed without a deployment. My feature flag triggers request-marking functionality in the UI. Now a certain percentage of requests (in this case, only search-related ones) are marked with the “log everything” option. The UI is smart enough to mark both the query and the clicks (if any) the user makes on query results. The backend logs everything, including requests to and responses from other services.
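The request-marking step can be sketched as follows. This is a minimal illustration assuming a simple in-process flag store (a real one would be an external configuration service) and an invented `X-Log-Everything` marker; hashing the request id keeps the sampling decision deterministic across retries and downstream services.

```python
# Sketch of feature-flag-driven request marking. The flag value can change
# at runtime without a deployment; a stable hash of the request id decides
# whether this request falls into the "log everything" sample.
import hashlib

FLAGS = {"search_log_everything_pct": 0}  # mutable at runtime

def should_log_everything(request_id: str) -> bool:
    pct = FLAGS["search_log_everything_pct"]
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < pct

def mark_request(request: dict) -> dict:
    """Attach the marker that tells the UI and backend to log everything."""
    if should_log_everything(request["id"]):
        request["headers"] = {"X-Log-Everything": "true"}
    return request

FLAGS["search_log_everything_pct"] = 100  # turn the flag fully on
marked = mark_request({"id": "req-1", "query": "semantic search"})
print(marked.get("headers"))  # {'X-Log-Everything': 'true'}
```

Because the decision is a pure function of the flag and the request id, every service that sees the request agrees on whether to log it, with no coordination needed.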
The next step is processing the log into separate, test-related storage. With this approach I can stub or mock the online services to avoid side effects, and I can stub or mock any of the search components, since I already have their recorded output. Because the data never leaves the production environment and its controls, I don’t have to invent new ways to handle Personally Identifiable Information, sanitize the data, and so on. I turn the logging on only as needed, so I don’t constantly carry the burden of very large log files. When I am done testing I can delete the data completely or use it for training new ML models. During testing I replay the recorded query traffic and observe the results.
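The replay step can be sketched like this. The stub serves previously recorded responses in place of a live service (personalization, in this example), so the pipeline under test runs against real production data without side effects; the record shape and names are illustrative.

```python
# Replay sketch: recorded request/response pairs stand in for live online
# services, so logged production queries can be re-run with no side effects.

class StubService:
    """Serves recorded responses instead of calling the real service."""
    def __init__(self, recorded):
        self.recorded = recorded  # maps request key -> logged response

    def call(self, request):
        return self.recorded[request]

def replay(log_entries, personalization_stub, pipeline):
    """Re-run each logged query through the pipeline and collect results."""
    results = []
    for entry in log_entries:
        profile = personalization_stub.call(entry["user_id"])
        results.append(pipeline(entry["query"], profile))
    return results

# Captured during the "log everything" window:
log = [{"user_id": "u1", "query": "running shoes"}]
stub = StubService({"u1": {"prefers": "trail"}})

def pipeline(query, profile):  # stand-in for the real search pipeline
    return f"{query} [{profile['prefers']}]"

print(replay(log, stub, pipeline))  # ['running shoes [trail]']
```

Comparing replayed results against the originally logged responses then gives a regression signal for any component swapped in between the recording and the replay.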
While working on search problems I found a couple of really good articles [1, 2]. On one hand, I wish I had found them sooner; that would have cut down on research time. On the other hand, it’s nice to find independent confirmation of one’s way of thinking. It’s also well known that you need to know 80% of the answer to ask a good question. One way or another, I hope these articles will be as useful to you as they were to me (or more).
1. L. Wu, A. Fisch, S. Chopra, K. Adams, A. Bordes, and J. Weston. StarSpace: Embed All The Things! In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
2. S. Liberman, S. Bar, R. Vannerom, D. Rosenstein, and R. Lempel. Search-Based Serving Architecture of Embeddings-Based Recommendation.