Why Great Data Is the Secret to AI Success

Valuable Enterprise Data Is Often Structured

Enterprises collect and generate an enormous amount of information, and for a payment service processor like Mollie, this is mostly structured data. The company does have unstructured text and voice data from customer service, but the vast majority is structured data collected from its APIs.

“It’s almost always structured data in enterprise scenarios,” Caldas said. “The main data sources that are driving our value are tables managed by internal and external systems that we fetch and process to build models.”

Understanding the Semantic Meaning of Tabular Data

Although structured data is valuable for enterprises, it can often be hard to understand what tabular data actually represents. Typically, a small subset of data is highly governed and well documented. For most of the rest, however, little documentation is available.

Caldas says the answer is still diving into the data, talking with the producers, exploring with business intelligence tools, and trying to determine the semantic meaning. Even if you have very good documentation, there’s a lot of value in understanding the system that’s generating the data and how it’s being used.

“There’s a lot of back and forth between the producers and users of data in order to understand the semantic context,” Caldas said. “In a sense, it’s still a craft getting to understand what’s in the table and there’s a lot of value in spending time exploring.”

The Importance of Feature Engineering

Feature engineering is the process of preparing data for machine learning models by creating and selecting features, the individual attributes that describe different entities. Creating these features requires a lot of business context and knowledge of how the data is going to be used. Data scientists can use feature engineering to represent a business problem as a data problem to develop a valuable AI model.

“We’re not feeding a lot of unstructured data into our models,” Caldas explained. “Most of the time, there’s an intermediate step where we have to take the data in its original form and transform it into features that we can then feed into the models.”

According to Caldas, you get much more value out of building better features than out of trying to optimize your machine learning model. There are benefits to trying more complex model architectures, but initially you can get more out of understanding your data better and extracting different signal types.

Signal types are different meanings you extract from data, such as stability, similarity, or recency. These are all unique ways to transform the same data to extract more value from it.
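To make the idea of signal types concrete, here is a minimal Python sketch that derives recency, stability, and similarity signals from the same set of payment records. The records, the column layout, and the function and feature names are all hypothetical and chosen only for illustration; they are not taken from Mollie's data or from any particular library.

```python
from datetime import date
from statistics import mean, pstdev

# Hypothetical payment records: (merchant_id, transaction_date, amount).
transactions = [
    ("m1", date(2024, 1, 1), 100.0),
    ("m1", date(2024, 1, 15), 100.0),
    ("m1", date(2024, 1, 29), 100.0),
    ("m2", date(2024, 1, 10), 20.0),
    ("m2", date(2024, 1, 12), 180.0),
]


def signal_features(rows, merchant_id, as_of):
    """Derive three illustrative signals for one merchant from the same rows.

    Assumes the merchant has at least one transaction in `rows`.
    """
    own = [(d, a) for m, d, a in rows if m == merchant_id]
    amounts = [a for _, a in own]
    all_amounts = [a for _, _, a in rows]
    avg = mean(amounts)
    return {
        # Recency: days since the merchant's latest transaction.
        "days_since_last_txn": (as_of - max(d for d, _ in own)).days,
        # Stability: coefficient of variation of amounts (0 means perfectly steady).
        "amount_cv": pstdev(amounts) / avg if avg else 0.0,
        # Similarity: the merchant's average amount relative to the overall average.
        "avg_amount_vs_all": avg / mean(all_amounts),
    }


features_m1 = signal_features(transactions, "m1", date(2024, 2, 1))
features_m2 = signal_features(transactions, "m2", date(2024, 2, 1))
```

In this toy data both merchants have the same average amount, so an average-based feature alone could not tell them apart. The recency and stability signals can: the second merchant transacted less recently and with far more variable amounts.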

Feature Engineering Best Practices

Whenever you’re building a new model, Caldas recommends starting with the features you already have and seeing if there’s any potential there. The next step is to understand the business problem by talking to the people who have been performing the task manually. These insights can then be converted into features.

While some features are specific to a given model, there are others that are generic and can be applied to many different models. This means you could end up spending a lot of time on feature selection if you don’t have a good way to organize and manage your features. That’s why a feature store makes it much easier to reuse the right features across many different models.

“I’m a big believer in feature stores because a feature that you use for one purpose can be used for other purposes,” Caldas said. “As we get a richer set of features, we have a better starting point for building a new model.”

Although having very large feature sets can be beneficial, Caldas believes it’s better to focus on the variety of signal types. He recommends ensuring every feature you add is sufficiently different from the rest of the features you already have to avoid meaningless features. By curating feature lists that capture the most important signals from your data, you can create more accurate and useful machine learning models.
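One simple way to check whether a new feature is sufficiently different from the ones you already have is to compare it against them with a correlation measure. The sketch below uses Pearson correlation and a greedy filter; this is an illustrative stand-in rather than the method Caldas described, and the feature names, sample values, and 0.95 threshold are assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        # A constant column has no defined correlation; treat it as uncorrelated here.
        return 0.0
    return cov / sqrt(vx * vy)


def select_distinct(features, threshold=0.95):
    """Greedily keep features whose |correlation| with every kept feature is below threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical candidate features, one value per entity.
candidate_features = {
    "txn_count_7d": [1, 2, 3, 4, 5],
    "txn_count_7d_doubled": [2, 4, 6, 8, 10],  # carries no new information
    "days_since_last_txn": [3, 1, 4, 1, 5],
}

kept = select_distinct(candidate_features)
```

Here the doubled count is dropped because it is perfectly correlated with the original count, while the recency feature is kept. Linear correlation only catches one kind of redundancy, so a check like this complements, rather than replaces, thinking about which signal type each feature captures.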

Solving Feature Engineering with FeatureByte

FeatureByte has recently released an open source SDK that addresses many of the issues Caldas discussed. Data scientists can capture a variety of signal types and the semantic meaning of their data with just a few lines of Python code. In fact, data scientists and engineers can create, experiment with, serve, and manage features with just our SDK.

By radically simplifying feature engineering and management, the FeatureByte SDK allows organizations to streamline their data pipelines and get more value from AI faster. This can accelerate AI innovation and lead to better business decisions for the organization.

Want to learn more about tabular data and feature engineering? Watch the full webinar here.
