Is it true that, when it comes to training data, more is always better?
Well, big data can become a big problem for data scientists. Moving data from storage to the ML environment can be time-consuming and expensive, and it opens the door to security risks. Plus, machine learning training environments commonly run on a local machine with limited memory, processing power, and disk storage, which makes them a poor fit for big data. And pandas, the tool of choice for many data scientists, works on data held in memory, so it is not suited to big data either.
After the usual pre-processing steps, big data can become quite tractable for data science environments, with data scientists reporting data size reductions of 50-95%. But the challenge remains: how do you practically get from big data to tractable data without long waits, security risks, or new tools to learn?
Some organizations have turned to data engineers to translate pandas scripts into scalable SQL. But that creates its own set of problems, because it decouples data science experiments from data preparation: every change to the preparation logic has to go through a second team.
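To make that translation concrete, here is a minimal sketch of the same preparation step written twice, once in pandas and once as hand-written SQL. The `orders` table, its columns, and the threshold are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

import pandas as pd

# Hypothetical toy data standing in for a much larger table.
orders = pd.DataFrame({
    "customer": ["a", "b", "a", "c", "b", "a"],
    "amount": [10.0, 20.0, 5.0, 7.5, 2.5, 1.0],
})

# The pandas version a data scientist might write:
# drop small orders, then total the rest per customer.
pandas_result = (
    orders[orders["amount"] > 2.0]
    .groupby("customer", as_index=False)["amount"]
    .sum()
    .sort_values("customer")
    .reset_index(drop=True)
)

# The hand-translated SQL version, executed by the database engine.
conn = sqlite3.connect(":memory:")
orders.to_sql("orders", conn, index=False)
sql_result = pd.read_sql_query(
    """
    SELECT customer, SUM(amount) AS amount
    FROM orders
    WHERE amount > 2.0
    GROUP BY customer
    ORDER BY customer
    """,
    conn,
)
conn.close()

# Both versions should produce the same per-customer totals.
print(pandas_result.equals(sql_result))
```

The two versions are easy to keep in sync at this size. The trouble starts when the pandas script changes with each experiment and the SQL copy, owned by someone else, has to be updated by hand every time.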
Data scientists need a better solution! A solution that:
- processes data in the database,
- minimizes data movement,
- doesn’t rely on data engineers, and
- uses familiar tools.
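As a rough illustration of the first two points, the sketch below lets the database do the filtering and aggregation so that only the reduced result reaches pandas. It is not a specific product's API: the `events` table and its columns are made up for the example, and an in-memory SQLite database stands in for a real data store.

```python
import sqlite3

import pandas as pd

# Hypothetical source table with 1,000 raw event rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, kind TEXT, value REAL)")
rows = [
    (i % 10, "click" if i % 4 == 0 else "view", float(i))
    for i in range(1000)
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# Row count of the raw table, computed in the database.
total_rows = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]

# Filter and aggregate inside the database; only the small
# per-user feature table is moved into the pandas environment.
features = pd.read_sql_query(
    """
    SELECT user_id, COUNT(*) AS clicks, AVG(value) AS avg_value
    FROM events
    WHERE kind = 'click'
    GROUP BY user_id
    ORDER BY user_id
    """,
    conn,
)
conn.close()

print(f"rows in database: {total_rows}, rows moved to pandas: {len(features)}")
```

The raw rows never leave the database here, which is the point of the list above. The SQL in this sketch is still written by hand, so it covers in-database processing and reduced data movement but not the goal of staying entirely within familiar tools.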
It’s time to make big data more manageable and accessible to data scientists!