
When Models Meet the Real World: Lessons from Production ML

We built a churn prediction model to identify which users would leave within one week. In the first, offline phase of testing, the model performed well, reaching 80% precision and 75% recall. It was an impressive start that received favourable feedback. Then we moved to production, and everything changed very quickly: the model became a catastrophe. It missed almost all of the early churners, and production precision dropped by 50% compared to the offline test. It was an embarrassing moment for us, because we had spent a lot of time thoroughly validating the model before the official launch.

From all indications, the failure was not caused by the model itself but by factors outside its control. Working at Yandex and now at eBay, I have come to realise that even when your offline metrics look great, there is always a gap between test data and production data. That gap affects not only model performance, but also user experience, business impact, and trust in ML systems.

The bitter truth people rarely talk about is that you cannot be certain how a machine learning model will behave in the real world, even after a series of tests. Production is a messy, evolving environment full of surprises. The rest of this article is therefore about understanding that gap: how it arises, how to detect it, and how to design processes so that it doesn’t surprise you in production.

Section 1: The Three Data Gaps Nobody Warns You About

1.1 Concept Drift: Working at eBay has given me first-hand experience of how often user behaviour shifts, which makes it hard for a model trained on a previous month's data to perform well in the following months. The same was true at Yandex: users' weekday behaviour differs from their weekend behaviour, and search queries shift with trends and seasonal updates. This, in a nutshell, is concept drift. One way to address it is a temporal validation split: train the model on one period and test it on a later one, and make sure the test set for a search-based service covers a full seasonal cycle, as in the sketch below.
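To make the temporal split concrete, here is a minimal sketch in pandas; the DataFrame and column names are assumptions used only to illustrate training on one period and validating on a strictly later one.

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, train_end: str, test_start: str, test_end: str):
    """Train on one period, validate on a strictly later one.

    `event_time` is a hypothetical timestamp column; replace with your schema.
    """
    train = df[df["event_time"] < train_end]
    test = df[(df["event_time"] >= test_start) & (df["event_time"] < test_end)]
    return train, test

# Example: train on data up to June, test on June itself so the evaluation
# reflects a different (later) slice of the seasonal cycle.
# train_df, test_df = temporal_split(events, "2024-06-01", "2024-06-01", "2024-07-01")
```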

1.2 Sampling Bias in Test Sets: Test sets should be curated and cleaned with caution, because they often under-represent the rare or noisy examples that appear in real-world traffic. As a result, a test set may look “representative” yet still miss the long tail of real usage and undetected faults, which can cause serious problems for users in production. To reduce this risk, a good practice is to set up a production mirror: route a small percentage of real production traffic through the model so you can detect errors or bias in the larger test set before the actual launch, as sketched after this paragraph.
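Below is a minimal sketch of such a production mirror; the request attributes and the log sink are assumptions, and a real system would also record the model's prediction and, eventually, the true label.

```python
import random

MIRROR_RATE = 0.01  # fraction of production traffic to mirror; value is illustrative

def maybe_mirror(request, log_store):
    """Copy a small sample of live requests into a store used to audit the test set.

    `log_store` is a hypothetical append-only sink (e.g. a topic or a table), and
    `request.features` / `request.timestamp` are assumed attributes of the request.
    """
    if random.random() < MIRROR_RATE:
        log_store.append({
            "features": request.features,
            "timestamp": request.timestamp,
        })
```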

1.3 Feature Pipeline Divergence: This failure is often embarrassing but highly preventable. The training pipeline typically lives in a Jupyter notebook, while production code runs in a latency-sensitive service, and even a small mismatch between the two can silently ruin a model. I recall a user-classification model where the training pipeline represented the “country” feature with short codes like US, UK, DE, while in production the integration team sent full country names such as United States, United Kingdom, Germany. To a human the data looked correct, but to the model it was a completely different, unseen vocabulary. Depending on the encoding, most values were mapped to an “unknown” bucket, and model quality dropped significantly for days before we traced it back to this representation mismatch. To prevent this kind of error, minimise training-serving skew: reuse the same preprocessing logic in both environments and verify, before deployment, that the features computed in production match those used during training.
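Here is a minimal sketch of the shared-preprocessing idea; the mapping table and feature name are illustrative rather than the actual pipeline code, but the point is that the training notebook and the serving service import the same function.

```python
# Shared normaliser for the "country" feature, imported by both training and serving.
COUNTRY_TO_CODE = {
    "united states": "US",
    "united kingdom": "UK",
    "germany": "DE",
}

def normalize_country(raw: str) -> str:
    """Map either a full country name or a short code to the canonical short code."""
    value = raw.strip().lower()
    if value.upper() in {"US", "UK", "DE"}:
        return value.upper()
    return COUNTRY_TO_CODE.get(value, "UNKNOWN")

# Because both environments call normalize_country(), "Germany" and "DE" end up
# in the same vocabulary bucket and cannot silently diverge.
```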

Section 2: Building a Production-Aware Testing Phase

2.1 Shadow Mode Deployment: This practice, also known as dark launching, runs a new model alongside the existing one in production without any impact on the user experience: only the old model's predictions are served to users, while the new model's outputs are logged and analysed. At eBay, this method was very useful to us; it helped us detect and contain errors before they reached users. It is best used as a first step when introducing a new model, since it provides no feedback on user reaction.
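A minimal sketch of shadow-mode serving, assuming two model objects with a compatible predict() interface; the handler and logger names are illustrative.

```python
def handle_request(features, current_model, shadow_model, shadow_log):
    """Serve the current model's prediction; log the shadow model's output for analysis."""
    served = current_model.predict(features)       # this prediction goes to the user
    try:
        shadow = shadow_model.predict(features)    # this one is only logged
        shadow_log.write({
            "features": features,
            "served": served,
            "shadow": shadow,
        })
    except Exception as exc:
        # A failure in the shadow path must never break the user-facing response.
        shadow_log.write({"shadow_error": str(exc)})
    return served
```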

2.2 Canary Testing with Real Data: A live test typically follows once shadow mode is complete. Canary deployment means exposing the new model to a small share of users, with traffic as low as 1%-5%. In this phase, we monitor metrics such as memory consumption, error rates, and latency, and trigger a rollback if the new model shows performance degradation or abnormal drift patterns. A typical phased rollout schedule is 1% - 5% - 10% - 25% - 50% - 100%. This method once helped us catch an error that would have been missed in shadow mode.
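The gate logic can be sketched roughly as follows; the metric names and thresholds are assumptions, and a real system would read them from the monitoring stack.

```python
ROLLOUT_STAGES = [0.01, 0.05, 0.10, 0.25, 0.50, 1.00]

def canary_is_healthy(canary: dict, control: dict,
                      max_latency_increase: float = 0.10,
                      max_error_increase: float = 0.01) -> bool:
    """Compare canary metrics against the control group (illustrative thresholds)."""
    latency_ok = canary["p99_latency"] <= control["p99_latency"] * (1 + max_latency_increase)
    errors_ok = canary["error_rate"] <= control["error_rate"] + max_error_increase
    return latency_ok and errors_ok

def next_stage(current_fraction: float, canary: dict, control: dict) -> float:
    """Advance to the next traffic stage, or roll back to 0% on degradation."""
    if not canary_is_healthy(canary, control):
        return 0.0  # roll back the canary entirely
    later = [s for s in ROLLOUT_STAGES if s > current_fraction]
    return later[0] if later else current_fraction
```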

2.3 Adversarial Test Generation: Create edge cases that deliberately stress your model. Mine production logs for rare patterns that could cause errors in production and add them to your automated test suite. Where appropriate, use adversarial testing tools (for example, TextAttack or Counterfit) to systematically perturb inputs and expose weaknesses in the model. This helps the model withstand the messiness of real-world traffic.
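A small sketch of such a suite, assuming the model exposes a simple predict(text) interface; the edge cases themselves are illustrative stand-ins for patterns mined from production logs.

```python
EDGE_CASES = [
    "",                       # empty query
    "a" * 10_000,             # extremely long input of the kind seen in logs
    "café ❤️",                # non-ASCII characters and emoji
    "SELECT * FROM users;",   # code-like input that occasionally leaks into queries
]

def run_edge_case_suite(model) -> list:
    """Return the cases on which the model crashes or returns no prediction."""
    failures = []
    for text in EDGE_CASES:
        try:
            prediction = model.predict(text)
        except Exception:
            failures.append(text)
            continue
        if prediction is None:
            failures.append(text)
    return failures

# In CI this would run as part of the automated tests, e.g.:
# assert run_edge_case_suite(candidate_model) == []
```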

Section 3: Monitoring Post-deployment

In production, continuous monitoring is crucial, especially after the initial deployment.

3.1 Data Drift Detection: To catch problems early, we need to monitor not only business metrics, but also how the input data itself changes over time. Metrics like the Population Stability Index (PSI) and Kullback-Leibler (KL) divergence can quantify how production data shifts before it significantly impacts model performance. We use PSI to monitor feature distributions by comparing each feature’s current week of production data against a reference distribution. As a rough guide, PSI values between 0.1 and 0.25 suggest slight drift, while values over 0.25 indicate drift requiring intervention. In one case, we saw a change in the length of user queries: the PSI for that feature exceeded 0.15, which we treat as a warning threshold. In response, we retrained the model on more recent data and explicitly included longer queries in the training set, before users noticed a drop in quality. Without continuous monitoring, we would have discovered the issue only after users started complaining.
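For reference, a minimal PSI computation looks roughly like this; the binning scheme and the small epsilon are common choices rather than a fixed standard.

```python
import numpy as np

def population_stability_index(reference, current, bins: int = 10) -> float:
    """PSI between a reference sample and a current sample of one feature.

    PSI = sum over bins of (current% - reference%) * ln(current% / reference%).
    """
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)

    eps = 1e-6  # avoid division by zero and log of zero in empty bins
    ref_pct = ref_counts / max(ref_counts.sum(), 1) + eps
    cur_pct = cur_counts / max(cur_counts.sum(), 1) + eps

    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# Example: psi = population_stability_index(last_month_query_lengths, this_week_query_lengths)
# 0.1-0.25 -> watch closely; > 0.25 -> intervene (thresholds as discussed above).
```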

3.2 Feedback Loops: Production failures should not be perceived as discouragement; they can be turned into additional training and test cases, because we learn from our mistakes. Data drift or unforeseen circumstances that disrupt production can become new examples for further training. One effective pattern is a continuous learning loop. For a classification model, for instance, we track misclassifications: some samples are automatically sent back to the training or test set with their corrected labels, while more difficult cases are routed to human review and annotation. There is a need for balance, though, as overly aggressive feedback loops can destabilise a model.
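One way to sketch that routing, with an assumed confidence threshold and queue objects standing in for the real pipelines:

```python
AUTO_RELABEL_CONFIDENCE = 0.9  # illustrative threshold for trusting the corrected label

def route_misclassification(sample, predicted, true_label, label_confidence,
                            training_queue, review_queue):
    """Send a confirmed misclassification back to training, or to human review."""
    if predicted == true_label:
        return  # not a misclassification, nothing to do
    record = {"features": sample, "label": true_label, "model_prediction": predicted}
    if label_confidence >= AUTO_RELABEL_CONFIDENCE:
        # The corrected label is trusted (e.g. confirmed by a downstream signal).
        training_queue.append(record)
    else:
        # Ambiguous case: a human annotator should confirm the label first.
        review_queue.append(record)
```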

3.3 The Rollback Plan: Even with all of the above, some deployments will go wrong. That’s why a clear, fast rollback strategy is critical. In practice, this means defining automatic or semi-automatic triggers that monitor key performance and reliability metrics (e.g., conversion rate, CTR, latency, error rate), compare them against a baseline or control model, and initiate rollback when degradation exceeds agreed thresholds. For example, a 3-5% drop in a core metric may call for investigation, while a 10% drop is a strong signal to stop the rollout and revert to the previous version. The exact numbers depend on the product and risk tolerance, but the key idea is to decide these thresholds in advance, not during a crisis. For this to work, old models must not be discarded: they should be stored, versioned, and ready to be reactivated quickly as backup plans. Notably, in many modern production systems, rollback is not a redeployment but a configuration switch, which makes reverting a change a matter of minutes rather than hours.
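A minimal sketch of rollback as a configuration switch; the registry paths and config store are assumptions for illustration.

```python
# Versioned model registry: old versions stay available as backup plans.
MODEL_REGISTRY = {
    "churn-v1": "s3://models/churn/v1",   # previous, known-good version
    "churn-v2": "s3://models/churn/v2",   # newly rolled-out version
}

active_config = {"serving_model": "churn-v2"}

def rollback(config: dict, previous_version: str = "churn-v1") -> dict:
    """Point the serving layer back at a previous model without redeploying code."""
    if previous_version not in MODEL_REGISTRY:
        raise ValueError(f"unknown model version: {previous_version}")
    config["serving_model"] = previous_version
    return config

# Example trigger: if the core metric drops by more than the agreed threshold
# versus the control group, call rollback(active_config) and let the serving
# layer reload the model from the registered path.
```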

Section 4: Three Principles for Long-Term Success

My conclusion from many years at Yandex and now at eBay is that three principles can be relied upon: (1) Test Sets Should Age Like Production Data: to keep test sets relevant, refresh them with recent production samples. A test set for June should reflect June traffic patterns, not those of another month; keeping the training and test sets up to date reduces surprises and improves performance. (2) Validation Is Continuous, Not One-Time: the pre-deployment testing phase matters less than the post-deployment phase, because production is where you learn how effective the model really is. Many teams spend most of their effort before deployment and relatively little after it, which is a risky imbalance; it is production that determines the success of a model. (3) Build for Failure: assume that unexpected situations will occur and design a resilient system that can handle them. The question is not whether a model will ever fail, but how the system responds when it does. That means having clear recovery paths, for instance by keeping multiple model versions that can be quickly reactivated if an issue arises.

Conclusion

The lesson from all of the above is that no amount of careful offline evaluation can eliminate the gap between testing and production. That gap cannot be fully closed, but it can be managed through thorough monitoring and strategic testing, such as timely data drift detection. The painful failures I went through taught me to be more deliberate about both the testing and the production phases, which has saved costs and reduced mistakes. This is not a one-time step but a practice that requires meticulous, ongoing planning. These procedures may seem expensive, but they are far cheaper than losing user trust or revenue when things go wrong in production.

