Posted April 28, 2020

The properties of a software system – performance, accuracy, security, compliance, and so much more – have traditionally been dictated by the code used to build it. As a result, an entire tool chain has been developed to aid in writing, debugging, securing, and analyzing code.

However, with more software systems built around AI and ML models, system performance, accuracy, and security are as much a function of the data the system operates on as the code it's built on. Building ML systems has become a competitive edge, especially in product differentiation, with valuable use cases from fraud detection to recommendation engines. Increasingly, that competitive edge depends on ML that makes real-time decisions, requires fresh data, and runs at low latency and high scale.

Unfortunately, the tooling to aid the data scientists and data engineers who build AI/ML systems is far less mature than the products built to aid in software development. Data scientists often work locally, training models and building the data pipelines that feed them. But taking that local model into at-scale production is an arduous, time-consuming process, subject to constraints that just aren't present in the training environment. Furthermore, models trained offline have to be pushed online and fed the same data signals (called features), computed the same way, in order to give sensible results. But the tooling to standardize, govern, and collaborate around ML data is still incredibly immature.
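To make that offline/online consistency problem concrete, here is a minimal, hypothetical sketch (in Python, not drawn from any particular team's code) of how the same feature often ends up implemented twice, once in a batch training pipeline and once in the serving path, which is exactly where training/serving skew creeps in:

```python
import pandas as pd

# Offline: a data scientist computes the feature over historical data for training.
def txn_count_7d_offline(transactions: pd.DataFrame) -> pd.DataFrame:
    """Count each user's transactions over a trailing 7-day window (batch)."""
    cutoff = transactions["timestamp"].max() - pd.Timedelta(days=7)
    recent = transactions[transactions["timestamp"] > cutoff]
    return recent.groupby("user_id").size().rename("txn_count_7d").reset_index()

# Online: an engineer reimplements the "same" feature against a live key-value store.
def txn_count_7d_online(kv_store, user_id: str) -> int:
    """Look up a user's 7-day transaction count at request time (low latency)."""
    # Subtle differences between this path and the batch path (window boundaries,
    # timezone handling, late-arriving data) mean the model can see different
    # values in production than it saw in training.
    return int(kv_store.get(f"txn_count_7d:{user_id}") or 0)
```

Keeping those two implementations in sync by hand is a large part of why getting a local model into at-scale production is so arduous.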

When I first met the Tecton team, they were pioneering a project within Uber called Michelangelo. Their goal was to democratize machine learning and AI by providing tooling for data management in ML pipelines, making it as easy to build an ML system as it is to code a simple app.

The goal is simple to state, but figuring out how to do it took years of experience building large ML and AI systems within companies such as Uber, Google, Airbnb, and Facebook.

The insight the Michelangelo team had was to build a platform, which they called a feature store, to manage the particular data signals (i.e., “features”) important to ML systems much like you’d manage code. The feature store made the handoff easier between the data scientists who identify the features and the data engineers who manage the systems that use them in production. With Michelangelo, data scientists and engineers could extract features offline to train models, and then move those features in a consistent manner to production.
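As a rough illustration of the pattern (a hypothetical sketch, not Michelangelo's or Tecton's actual API), a feature store lets a feature be defined once and then retrieved two ways: offline, as historical values for training, and online, as fresh values for low-latency serving:

```python
from typing import Callable, Dict, List

import pandas as pd


class FeatureStore:
    """Toy feature store: each feature is defined once, then served two ways."""

    def __init__(self) -> None:
        # Registry: feature name -> transform mapping raw events to (user_id, value) rows.
        self.definitions: Dict[str, Callable[[pd.DataFrame], pd.DataFrame]] = {}
        self.online_store: Dict[str, Dict[str, float]] = {}  # name -> {user_id: value}

    def register(self, name: str, transform: Callable[[pd.DataFrame], pd.DataFrame]) -> None:
        self.definitions[name] = transform

    def get_historical_features(self, events: pd.DataFrame, names: List[str]) -> pd.DataFrame:
        """Offline path: compute features over historical data to build a training set."""
        frames = [self.definitions[n](events).set_index("user_id")[n] for n in names]
        return pd.concat(frames, axis=1).reset_index()

    def materialize(self, events: pd.DataFrame, names: List[str]) -> None:
        """Push the latest value of each feature into the online key-value store."""
        for n in names:
            frame = self.definitions[n](events)
            self.online_store[n] = dict(zip(frame["user_id"], frame[n]))

    def get_online_features(self, names: List[str], user_id: str) -> Dict[str, float]:
        """Online path: low-latency lookup of the same features at request time."""
        return {n: self.online_store[n].get(user_id, 0.0) for n in names}


# The feature's logic lives in exactly one place and is reused by both paths.
def txn_count_7d(events: pd.DataFrame) -> pd.DataFrame:
    cutoff = events["timestamp"].max() - pd.Timedelta(days=7)
    recent = events[events["timestamp"] > cutoff]
    return recent.groupby("user_id").size().rename("txn_count_7d").reset_index()


store = FeatureStore()
store.register("txn_count_7d", txn_count_7d)
# training_df = store.get_historical_features(events, ["txn_count_7d"])   # offline
# store.materialize(events, ["txn_count_7d"])                             # batch -> online
# store.get_online_features(["txn_count_7d"], user_id="u123")             # serving
```

Because both retrieval paths read from the same registered definition, the values a model sees in training match what it sees in production, which is the consistency the handoff between data scientists and data engineers depends on.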

At Uber, the feature store greatly reduced the time it took to get ML models into production, and provided a standardized, unified repository of the signals most important to the business. It also provided an interface between data scientists and data engineers so they could collaborate toward shared goals with fewer errors. Today, Michelangelo and its feature store power thousands of models in production.

The feature store garnered immediate attention throughout the industry. However, Mike, Jeremy, and Kevin, who worked on the project, knew that there was a lot more that could be built to further their goals of democratizing ML, so they created Tecton.

We made the seed investment in Tecton at the end of 2018, and it soon became obvious how much the industry wanted better tooling around ML data. After tracking a number of deep engagements with top ML teams and their interest in what Tecton was building, we invested in Tecton's Series A alongside Sequoia. We strongly believe that software systems will increasingly rely on data and ML models, and that an entirely new tool chain is needed to aid in developing them. That's why we at a16z are incredibly thrilled to be working with Tecton to help build the most sophisticated AI and ML data pipelines.

***