Sampling Big Data
Colin Priest
Your modern data stack isn’t designed for training AI systems because it doesn’t know how to sample.
Are you struggling to train your AI systems using your modern data stack? Do you find that your data stack is not designed for the specific needs of AI training? Well, you’re not alone. Most data stacks are designed for business intelligence tasks, so they are poorly suited to training AI systems.
Data stacks built for business intelligence typically work with aggregated data: dashboards and reports summarize either all the data or a simply filtered subset, e.g. a single geographic region.
AI systems have different requirements from business intelligence. Before you can deploy an AI system, it must be trained on detailed historical data. However, bigger data doesn’t always mean better results. Too much data can overwhelm the machine learning environment, which may not even have the resources to handle all of it. To downsample the data, data scientists use weighted random sampling combined with rules about timing, which can be quite complex. Because data stacks built for business intelligence don’t support this kind of sampling, data scientists often fall back on workarounds such as selecting fewer data columns or using simple aggregations instead of detailed data.
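To make "weighted random sampling and rules about timing" concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a description of any particular tool: the row layout, the `sample_training_rows` and `weight_of` names, the cutoff date, and the choice to up-weight rare positive labels are all hypothetical. The sampling itself uses the Efraimidis-Spirakis method, one common way to draw a weighted sample without replacement.

```python
import heapq
import random
from datetime import date, timedelta


def sample_training_rows(rows, k, cutoff, weight_of, seed=0):
    """Draw up to k rows without replacement, weighted by weight_of(row).

    Timing rule (an assumed example): only rows dated strictly before
    `cutoff` are eligible, so later data cannot leak into training.
    """
    rng = random.Random(seed)
    keyed = []
    for row in rows:
        if row["event_date"] >= cutoff:
            continue  # timing rule: skip rows at or after the cutoff
        weight = weight_of(row)
        if weight <= 0:
            continue  # zero-weight rows can never be selected
        # Efraimidis-Spirakis: key = u ** (1 / weight); keep the k largest keys.
        keyed.append((rng.random() ** (1.0 / weight), row))
    top = heapq.nlargest(k, keyed, key=lambda pair: pair[0])
    return [row for _, row in top]


def weight_of(row):
    # Hypothetical weighting: rare positive labels are 10x more likely to be kept.
    return 10.0 if row["label"] == 1 else 1.0


# Made-up detailed history: one row per event, with a rare positive label.
rows = [
    {
        "row_id": i,
        "event_date": date(2023, 1, 1) + timedelta(days=i % 365),
        "label": 1 if i % 50 == 0 else 0,
    }
    for i in range(1000)
]

cutoff = date(2023, 10, 1)
sample = sample_training_rows(rows, 100, cutoff, weight_of)
print(len(sample), "rows sampled,", sum(r["label"] for r in sample), "positive")
```

Even this toy version needs a weighting scheme, an eligibility rule and a reproducible seed, which hints at why real sampling logic gets complicated quickly.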
Using a data stack designed for business intelligence to train AI can lead to underperforming AI systems: the workarounds can cause data scientists to miss vital signals in the data, while much of the data that does get moved around is never actually used.
It’s high time we started designing data stacks that meet the specific requirements of AI, such as a tool that lets you define training data sampling parameters and then creates feature-engineered, suitably sized training data within the database. Don’t let your modern data stack hold you back from training powerful and effective AI systems.
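As a rough illustration of what sampling "within the database" could look like, the sketch below pushes a timing rule and a uniform random sample into SQL, so only the sampled rows ever leave the database. It uses Python’s built-in `sqlite3` module with an in-memory database purely for demonstration. The `events` table, its columns, the cutoff and the sample size are hypothetical, and the sketch leaves out weighting and feature engineering.

```python
import sqlite3


def sample_in_database(conn, cutoff, n):
    """Return n uniformly sampled rows dated before `cutoff`.

    The filter and the sampling both run inside the database engine,
    so rows that are not sampled are never transferred to the client.
    """
    query = """
        SELECT event_id, event_date, amount
        FROM events
        WHERE event_date < ?
        ORDER BY random()
        LIMIT ?
    """
    return conn.execute(query, (cutoff, n)).fetchall()


# Hypothetical detailed history, stored as ISO date strings so that
# string comparison matches chronological order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id INTEGER, event_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    ((i, f"2023-{1 + i % 12:02d}-15", float(i % 100)) for i in range(10000)),
)

training_rows = sample_in_database(conn, "2023-10-01", 500)
print(len(training_rows), "rows moved out of", 10000)
```

Sorting every eligible row by `random()` is simple but costly on large tables. A purpose-built tool would presumably use a cheaper sampling strategy and accept the sampling parameters declaratively.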