MLOps: Training Data Versus Production
Machine learning models are trained on carefully selected historical data specified by data scientists; the task of building the production pipeline then falls to the MLOps engineer. Because training data and live production data are generated in different ways, fundamental differences arise between them that complicate matters and catch unprepared teams off guard.
Training data is a subset of historical data. Typically presented as a large tabular extract, it remains static even when the extract is recreated. However, data scientists need to experiment with different data and features, so the schema of the training data changes as they invent and discard feature engineering ideas. The size of the extract is limited by the computing resources of the ML training environment, which is why training data is often a sample of the historical data. It always contains examples from multiple entities and requires a “time travel” approach: each row is a snapshot of the feature values that were available at a point in the past, not the current values, because values that only became known later would leak information about the target. Latency is not a significant concern when creating training data.
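The “time travel” requirement can be sketched as a point-in-time join: for each labeled event, we attach the latest feature value known at or before the event time. The sketch below uses pandas; the table layout, column names, and values are hypothetical.

```python
import pandas as pd

# Hypothetical label events: one row per (entity, timestamp) we want to predict for.
labels = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-01-10", "2024-02-10", "2024-01-15"]),
    "churned": [0, 1, 0],
})

# Hypothetical feature history: each row records when a feature value became known.
features = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "feature_time": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-01"]),
    "avg_order_value": [50.0, 42.0, 80.0],
})

# Point-in-time ("as of") join: for each label row, take the most recent
# feature value at or before event_time, which avoids target leakage.
training = pd.merge_asof(
    labels.sort_values("event_time"),
    features.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    by="customer_id",
)
```

The key property is that the February label for customer 1 picks up the February feature value, while the January labels only see values that existed in January.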
In production, machine learning models consume live data that continuously changes. Although the data values evolve over time, the schema of the production data remains static. Production data pipelines do not require “time travel”; they use the current feature values. Depending on the use case, the ML system may process data for a single entity at a time or run as a scheduled batch job handling many entities. In some cases, low latency is crucial. The daily throughput of production data can be extremely large, often surpassing the volume of training data. Target leakage is typically not a problem when ML systems use the most recent data. Despite these differences, production data must remain consistent with the training data: for example, a similar time of day, the same source database, and the same data freshness.
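In contrast to the point-in-time extract, online serving reduces to a low-latency lookup of the current value per entity. A minimal sketch, assuming an in-memory key-value store with hypothetical entities and feature names:

```python
from datetime import datetime, timezone

# Hypothetical online store: production serving reads the *current* feature
# values for one entity (a key-value lookup), not a historical snapshot.
online_store = {
    ("customer", 1): {"avg_order_value": 42.0,
                      "updated_at": datetime(2024, 2, 1, tzinfo=timezone.utc)},
    ("customer", 2): {"avg_order_value": 80.0,
                      "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
}

def get_current_features(entity_type: str, entity_id: int) -> dict:
    """Low-latency read of the most recent feature values for one entity."""
    return online_store[(entity_type, entity_id)]
```

A batch job would instead iterate over many entities on a schedule, but the read semantics are the same: latest value, no time travel.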
Many organizations build separate data pipelines for training and production, but this approach is risky: the two pipelines can drift apart and compute the same feature differently. An emerging solution is the feature store, which provides architectural consistency between training and serving. A feature store acts as the source of truth for the feature values available at any point in time, and it reduces serving latency by pre-computing feature values as new data arrives. A feature store is a great starting point, especially as part of an integrated feature platform.
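The core idea can be illustrated with a toy implementation (hypothetical, not a real library): a single write path feeds both a point-in-time read for training and a latest-value read for serving, so the two stay consistent by construction.

```python
import bisect
from datetime import datetime

class MiniFeatureStore:
    """Toy feature store sketch: one write path, two read paths."""

    def __init__(self):
        # (feature, entity_id) -> sorted list of (timestamp, value)
        self._history: dict = {}

    def write(self, feature: str, entity_id, ts: datetime, value) -> None:
        """Single ingestion path used by both training and serving reads."""
        series = self._history.setdefault((feature, entity_id), [])
        bisect.insort(series, (ts, value))

    def get_historical(self, feature: str, entity_id, as_of: datetime):
        """Point-in-time read for training: latest value at or before `as_of`."""
        series = self._history.get((feature, entity_id), [])
        i = bisect.bisect_right(series, (as_of, float("inf")))
        return series[i - 1][1] if i else None

    def get_online(self, feature: str, entity_id):
        """Latest-value read for low-latency serving."""
        series = self._history.get((feature, entity_id), [])
        return series[-1][1] if series else None
```

Real feature stores add persistence, streaming ingestion, and separate offline/online storage, but the dual read API over one source of truth is the essential design choice.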