By John Mannes
It’s easy to forget that even with the fanciest of machine learning models, we still need humans in the trenches cleaning input data. Descartes Labs, a startup that combines satellite imagery with data about our planet to produce insights and forecasts, knows this all too well. The company ended up building its own cloud-based parallel computing infrastructure to clean and process its massive corpus of satellite imagery. Today it’s giving a handful of developers and early customers access to this system.
Companies like Descartes Labs cannot just throw raw satellite imagery into machine learning models to extract insights. Captured images contain clouds, cloud shadows and other atmospheric aberrations that make it impossible to compare images taken at different times. A small cloud over a field that wasn’t present in previous images, for example, could completely throw off a model attempting to predict crop yields.
To overcome this challenge, engineers build composite images that select the best pixels from a collection of captures. Google Maps employs composite imagery to remove clouds and create representations of the globe that are evenly lit by the sun.
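The best-pixel idea can be sketched as a per-pixel median composite: stack many captures of the same area, mask out cloudy pixels, and take the median of the remaining values at each location. This is only an illustrative sketch of the general technique, not Descartes Labs’ actual pipeline; the function and data here are hypothetical.

```python
import numpy as np

def median_composite(captures, cloud_masks):
    """Combine a stack of captures of the same area into one composite.

    captures:    array of shape (n_images, height, width) of pixel values
    cloud_masks: boolean array of the same shape; True marks a cloudy pixel
    """
    # Replace cloud-flagged pixels with NaN so they are ignored below
    stack = np.where(cloud_masks, np.nan, captures.astype(float))
    # Per-pixel median over the time axis, skipping the NaN (cloudy) values
    return np.nanmedian(stack, axis=0)

# Toy example: three 2x2 "captures"; one pixel is cloudy in the first image
captures = np.array([
    [[9.0, 2.0], [3.0, 4.0]],   # 9.0 is a bright cloud pixel
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.0, 2.0], [3.0, 4.0]],
])
masks = np.zeros_like(captures, dtype=bool)
masks[0, 0, 0] = True           # flag the cloud pixel in the first capture
composite = median_composite(captures, masks)
print(composite)                # the cloud pixel is replaced by the clean 1.0
```

With enough captures, every location is clear in at least a few images, so the composite ends up cloud-free even though no single capture was.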
The problem with combining dozens of satellite captures of the entire earth is that it’s incredibly computationally intensive. This is where Descartes Labs’ processing engine comes into play, converting the petabytes of geospatial data the company holds into clean composites.