Big data gets all the attention, but all too often small data is the reality, and that's okay. Think about when you or your team embarks on a new product or project. You're creating something brand new, and the sky's the limit. It's never been done before! It's exciting for sure, but it also means there won't be much data available, let alone a rich, labeled dataset ripe for analysis and machine learning. Instead, your data science team has to solve problems with an experimental dataset of only a few dozen to a few hundred rows. Ah, the challenges of being an innovator.
There are workarounds, however, so don't break a sweat. Here are a few proven tricks our teams use when we wish we had more data than we do.
- Combine/augment data: Look for connections. If there's a time or geographic aspect, connect it with other relevant data. What other connections can you find? For example, are you looking at homeshare activity? Find information on nearby events, neighborhood characteristics, and so forth. Or, if you're looking at commercial businesses, look for 'eyes on the ground' or other location-based information. Augmentation has a multiplicative effect on your data and can often be more powerful than simply adding observations (see the join sketch after this list).
- Make your own: We're not the first or only company to build internal experimental apps purely for data gathering, and for good reason. It's valuable! If you can build a model or prediction that is even slightly better than random noise, you'll be able to use it. At Slice, we're not looking to monetize data itself; our focus is to develop underwriting models for innovative insurance products. In this way, any data we can generate ourselves is a learning opportunity. The same logic applies across industries and app types.
- Work with what you've got: Although the exact threshold below which data loses statistical power is unclear, directional signals can be found in just a few dozen observations; classical ML models generally need a few hundred to a few thousand, and deep learning models tens or hundreds of thousands. Even a little can be enough to build on when you're aiming for an MVP (minimum viable product). A quick way to check for such a signal follows below.
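To make the augmentation idea concrete, here's a minimal sketch in pandas, assuming two toy tables (a small set of homeshare bookings and an external feed of nearby events) that share time and geography keys; all names and values are hypothetical.

```python
# A minimal sketch of augmentation-by-joining, using two hypothetical
# DataFrames: a small table of homeshare bookings and an external table
# of nearby events.
import pandas as pd

bookings = pd.DataFrame({
    "date": pd.to_datetime(["2023-06-01", "2023-06-02", "2023-06-02"]),
    "neighborhood": ["downtown", "downtown", "riverside"],
    "nights_booked": [2, 1, 3],
})

events = pd.DataFrame({
    "date": pd.to_datetime(["2023-06-01", "2023-06-02"]),
    "neighborhood": ["downtown", "riverside"],
    "event": ["street festival", "farmers market"],
    "expected_attendance": [5000, 800],
})

# Joining on the shared time and geography keys adds new feature columns
# to every existing row -- the multiplicative effect described above --
# without adding a single new observation.
augmented = bookings.merge(events, on=["date", "neighborhood"], how="left")
print(augmented)
```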
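And for the signal-hunting point, here's a minimal sketch of testing whether a few dozen rows carry a directional signal, using synthetic data and scikit-learn; the essential move is comparing a simple model against a naive baseline under cross-validation.

```python
# A minimal sketch of checking for directional signal in a few dozen rows.
# The data is synthetic; the point is the comparison against a baseline.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))                                 # five dozen observations
y = (X[:, 0] + 0.5 * rng.normal(size=60) > 0).astype(int)    # weak true signal

baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)
model = cross_val_score(LogisticRegression(), X, y, cv=5)

# If the model consistently beats the baseline across folds, there is a
# directional signal worth building an MVP around.
print(f"baseline accuracy: {baseline.mean():.2f}")
print(f"model accuracy:    {model.mean():.2f}")
```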
Remember, there are some real advantages to small data. Data engineering is less complex because everything can reside in memory, and QA is simpler (you can inspect every row, even if the underlying population is unknown). After all, the best approach is to start with the absolute minimum.
In short, powering up your data boils down to a few simple steps:
- analyze the data you have and find signals
- if you don’t have enough data, look externally to find more
- if you can’t find more, then find related data
- if you can’t find related data, create or source your own data
- and from there, iterate as quickly as you can
Models need care and feeding. The sooner you get your model working, the sooner you can kick off that feedback/retraining loop, and the better your predictions will be. They'll keep improving with each pass, and the gains compound from there.
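Here's a minimal sketch of that loop; `fetch_new_labeled_rows()` and `deploy()` are hypothetical stand-ins for your own labeling pipeline and serving layer, with synthetic data used so the sketch runs end to end.

```python
# A minimal sketch of the feedback / retraining loop. fetch_new_labeled_rows()
# and deploy() are hypothetical placeholders, not a real pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fetch_new_labeled_rows(n=50):
    """Stand-in for whatever process labels fresh observations."""
    X = rng.normal(size=(n, 4))
    y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
    return X, y

def deploy(model):
    """Stand-in for pushing the retrained model into production."""
    pass

X, y = fetch_new_labeled_rows(60)   # start from the small dataset you have
for cycle in range(5):
    model = LogisticRegression().fit(X, y)
    score = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
    print(f"cycle {cycle}: {len(y)} rows, cv accuracy {score:.2f}")
    deploy(model)
    # Each pass through production yields new labeled rows to fold back in.
    X_new, y_new = fetch_new_labeled_rows()
    X, y = np.vstack([X, X_new]), np.concatenate([y, y_new])
```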
Before you know it, your small data will be fully matured, wanting more privacy and independence, right before it moves out of the house to live its own life, visiting only on weekends to do some laundry. Your job will be done and you'll be congratulating yourself on your great success, while reminiscing about how it seems like just yesterday your data was small.