Beware of Dangerous Time-Awareness Blind Spots
If you’re using a modern data stack for AI, your data pipeline may have dangerous blind spots. Here’s why: current data stacks are designed to support analytics without degrading the performance of operational systems. This means the data warehouse is kept separate from those operational systems, so its data is never 100% up to date.
This isn’t necessarily a problem for business intelligence tasks, which can simply wait for the next scheduled warehouse refresh before running reports or updating dashboards. But when AI is used for live decisions, like loan acceptance or fraud detection, these blind spots can be a serious issue.
When AI makes live decisions, it must be aware of any missing real-time data. Some data will be absent from the warehouse because it was created after the last refresh batch job commenced, or while that job was still running. To make matters worse, when training an AI system on historical data, data scientists need time travel capabilities to ensure that only data available at that historical point in time is used; otherwise information from the future leaks into the training set. And when transforming raw data into features, table joins must be time-aware.
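To make the time-aware join requirement concrete, here is a minimal sketch using pandas `merge_asof` (the tables, column names, and values are hypothetical): for each loan application, it joins the latest credit score known at application time, so a score recorded after the application can never leak into the feature row.

```python
import pandas as pd

# Hypothetical feature source: credit-score readings over time.
scores = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "updated_at": pd.to_datetime(["2024-01-01", "2024-01-10", "2024-01-05"]),
    "credit_score": [640, 680, 710],
})

# Hypothetical label source: loan applications with decision timestamps.
applications = pd.DataFrame({
    "customer_id": [1, 2],
    "applied_at": pd.to_datetime(["2024-01-08", "2024-01-20"]),
})

# merge_asof does a point-in-time ("as of") join: for each application it
# picks the most recent score with updated_at <= applied_at, per customer.
features = pd.merge_asof(
    applications.sort_values("applied_at"),
    scores.sort_values("updated_at"),
    left_on="applied_at",
    right_on="updated_at",
    by="customer_id",
)

# Customer 1 applied on 2024-01-08, so the 2024-01-10 score (680) is
# correctly ignored and the 640 reading is used instead.
print(features[["customer_id", "applied_at", "credit_score"]])
```

A naive equality or latest-value join here would silently use the 680 score for customer 1, training the model on data it could never have seen in production.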
Using legacy data stacks for AI can lead to inconsistent training and production inference pipelines. The AI system may expect data that isn’t ready yet and make incorrect decisions, or even fail outright. Workarounds for these AI-specific requirements often devolve into inefficient, unreadable SQL spaghetti.
It’s clear that we need to start building AI-ready data stacks that can automatically discover, analyze, and adjust for blind spots. We also need automatically generated, optimized time-aware SQL so that AI systems can make live decisions with confidence. By doing so, we can eliminate blind spots and create a data stack that’s fully optimized for AI’s unique needs.
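As a rough illustration of what generated time-aware SQL might look like (a sketch only, using an in-memory SQLite database and hypothetical tables), the pattern is a correlated subquery that selects, per decision, only the newest feature row recorded on or before the decision timestamp:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scores (customer_id INT, updated_at TEXT, credit_score INT);
INSERT INTO scores VALUES
  (1, '2024-01-01', 640), (1, '2024-01-10', 680), (2, '2024-01-05', 710);
CREATE TABLE applications (customer_id INT, applied_at TEXT);
INSERT INTO applications VALUES (1, '2024-01-08'), (2, '2024-01-20');
""")

# Time-aware join: for each application, fetch the latest score whose
# updated_at does not exceed applied_at -- never a future reading.
rows = conn.execute("""
SELECT a.customer_id, a.applied_at,
       (SELECT s.credit_score
        FROM scores s
        WHERE s.customer_id = a.customer_id
          AND s.updated_at <= a.applied_at
        ORDER BY s.updated_at DESC
        LIMIT 1) AS credit_score_as_of
FROM applications a
ORDER BY a.customer_id
""").fetchall()
print(rows)
```

Hand-writing this correlated-subquery pattern for every feature and every join key is exactly the kind of spaghetti that an AI-ready stack should generate and optimize automatically.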