From Hobbies to State-of-the-Art Feature Engineering
Many people have a hobby that involves collecting things, such as stamps, recipes, or music. Businesses may not have hobbies, but they do have inventories of assets, which are aggregated by both count and value. Taking the concept of inventories a step further, businesses also have collections of customers, transactions, and events that can be organized and analyzed.
Computers are amazing at counting and tallying, which makes them perfect for handling large collections of data. But when it comes to feature engineering, simply counting items isn’t always enough.
In data science, detailed data about multiple objects or events often need to be summarized into collections. The most commonly used approach is cross-aggregation, which groups objects by their labels and applies an aggregation function to the items within each group. Inventories are a special case of cross-aggregation, where the aggregation function is either a count of items or the sum of their values or weights. But the aggregation function doesn’t have to be a count or sum: it could be a maximum, the latest label, or the latest date!
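To make the idea concrete, here is a minimal sketch of cross-aggregation in plain Python. The transaction records, the field layout, and the `cross_aggregate` helper are all hypothetical and chosen only for illustration; this is not the FeatureByte API.

```python
from collections import defaultdict

# Hypothetical transactions: (customer_id, product_category, amount, ISO date)
transactions = [
    ("c1", "books", 20.0, "2023-01-05"),
    ("c1", "music", 5.0, "2023-01-09"),
    ("c1", "books", 12.5, "2023-02-01"),
    ("c2", "music", 8.0, "2023-01-20"),
]


def cross_aggregate(rows, agg):
    """Group rows by customer, then by label, and reduce each group with `agg`."""
    groups = defaultdict(lambda: defaultdict(list))
    for customer, category, amount, date in rows:
        groups[customer][category].append((amount, date))
    return {
        customer: {label: agg(items) for label, items in by_label.items()}
        for customer, by_label in groups.items()
    }


# Inventory by count: how many items carry each label.
count_inventory = cross_aggregate(transactions, len)

# Inventory by value: the sum of amounts for each label.
value_inventory = cross_aggregate(
    transactions, lambda items: sum(amount for amount, _ in items)
)

# A non-inventory aggregation: the latest date seen for each label.
# ISO-formatted date strings sort chronologically, so max() works here.
latest_date = cross_aggregate(
    transactions, lambda items: max(date for _, date in items)
)
```

Each result is a dictionary per customer, keyed by label. Only the aggregation function changes between the three calls, which is what makes inventories a special case of the same pattern.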
Some machine learning tools accept Python dictionary features as inputs, while others require further feature engineering. Regardless, once you’ve calculated cross-aggregates, you can use your domain knowledge to uncover more targeted signals, such as entropy, cosine similarity, the most common value, and unique count. It’s all about finding the best way to organize and understand your data!
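As a rough illustration of those follow-up signals, the sketch below derives entropy, the most common value, and the unique count from a single dictionary feature. The function names and the example inventory are assumptions made for this example, and the entropy uses the natural logarithm.

```python
import math


def entropy(inventory):
    """Shannon entropy (natural log) of the label distribution: a diversity signal."""
    total = sum(inventory.values())
    if total <= 0:
        return 0.0
    return -sum(
        (value / total) * math.log(value / total)
        for value in inventory.values()
        if value > 0
    )


def most_common(inventory):
    """Label with the largest aggregate, or None for an empty inventory."""
    return max(inventory, key=inventory.get) if inventory else None


def unique_count(inventory):
    """Number of distinct labels with a positive aggregate."""
    return sum(1 for value in inventory.values() if value > 0)


# Hypothetical count inventory for one customer.
inventory = {"books": 2, "music": 1, "games": 1}

diversity = entropy(inventory)
favorite = most_common(inventory)
n_labels = unique_count(inventory)
```

An inventory concentrated on one label has entropy near zero, while one spread evenly across many labels has high entropy, so the same dictionary can yield both a popularity signal and a diversity signal.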
Here are my top 5 tips for feature engineering cross-aggregation and inventory signals:
- Use a window period long enough to capture a reasonably sized set of observations, so that your metrics are robust.
- Experiment with aggregation functions other than counts, such as value-weighted inventory or the latest observed label.
- Run cross-aggregations within the database to avoid moving big data around.
- Compare inventories to obtain similarity and stability signals.
- Further summarize cross-aggregations to get the exact signal you need, whether that be popularity, diversity, or magnitude.
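The tip about comparing inventories can be sketched with cosine similarity between two dictionaries, for example the same customer's inventory in two different window periods. The helper and the sample inventories below are hypothetical and only illustrate the idea.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two inventories, treating missing labels as 0."""
    labels = set(a) | set(b)
    dot = sum(a.get(label, 0.0) * b.get(label, 0.0) for label in labels)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here: an empty inventory is similar to nothing.
    return dot / (norm_a * norm_b)


# Hypothetical inventories for one customer over two window periods.
recent_window = {"books": 3, "music": 1}
earlier_window = {"books": 3, "music": 1}
other_customer = {"games": 4}

stability = cosine_similarity(recent_window, earlier_window)
similarity_to_other = cosine_similarity(recent_window, other_customer)
```

Comparing a customer's recent inventory with an earlier one gives a stability signal, while comparing it with another customer's or a population-level inventory gives a similarity signal.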
At FeatureByte, we’ve built an open-source feature engineering library that makes it easy to create cross-aggregation and inventory features. It is free to download, with worked examples in Python: https://docs.featurebyte.com/