What does data-centric ML look like in practice?

Apr 09, 2021 3:01 am

Hi,


Quick reminder that the podcast is on a break this week and next--I'll be back with a new interview on Tuesday, April 20!


The last two newsletter editions have featured both Andrej Karpathy and Andrew Ng discussing the need for an industry shift from model-centricity to data-centricity. The first three links below are all related to that idea: why and how you can improve the quality of your training data.


Data Quality at Airbnb

"In early 2019, the company made an unprecedented commitment to data quality and formed a comprehensive plan to address the organizational and technical challenges we were facing around data. Airbnb leadership signed off on the Data Quality initiative — a project of massive scale to rebuild the data warehouse from the ground up using new processes and technology."


This series of articles (2 so far) details Airbnb's journey as they increased the quality of their data through changes at the levels of people, technology, and governance. 


As the potential of machine learning is realized, and the need for high-quality data along with it, I'd expect to see more companies taking on similar initiatives, and more startups building with data quality in mind from the very start.


If you know of other good resources on this topic, please send them to me!


What is Data Programming?

Snorkel AI, a prominent ML startup, recently raised a $35M Series B to continue building out Flow, their platform for data programming, and Application Studio, their newly-announced no-/low-code modeling solution. The latter is easy enough to understand, but what about the former?


Snorkel AI is a spin-off startup from an open source Stanford research project (also named Snorkel) that pioneered the approach, first described in their 2016 paper "Data Programming: Creating Large Training Sets, Quickly":


"Users express weak supervision strategies or domain heuristics as labeling functions, which are programs that label subsets of the data, but that are noisy and may conflict. We show that by explicitly representing this training set labeling process as a generative model, we can “denoise” the generated training set, and establish theoretically that we can recover the parameters of these generative models in a handful of settings."
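The labeling-function idea in that quote can be sketched in a few lines of plain Python. To be clear, this is an illustrative toy, not Snorkel's actual API: where Snorkel fits a generative model that weights each labeling function by its estimated accuracy, the sketch below just takes a naive majority vote, and all the function names and label values are made up for the example.

```python
# Toy illustration of "data programming": several noisy labeling functions
# vote on each unlabeled example, and their (possibly conflicting) votes are
# combined into a single weak training label. Snorkel denoises the votes with
# a learned generative model; here we substitute a simple majority vote.
from collections import Counter

ABSTAIN = None  # a labeling function may decline to vote on an example

def lf_keyword_spam(text):
    # Heuristic: a known spammy phrase
    return "SPAM" if "free money" in text.lower() else ABSTAIN

def lf_long_message(text):
    # Heuristic: longer, conversational messages tend to be legitimate
    return "HAM" if len(text.split()) > 5 else ABSTAIN

def lf_excessive_punctuation(text):
    # Heuristic: lots of exclamation marks
    return "SPAM" if text.count("!") >= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_keyword_spam, lf_long_message, lf_excessive_punctuation]

def weak_label(text):
    """Combine noisy LF votes into one label; ties or no votes -> None."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v is not ABSTAIN)
    if not counts:
        return None
    top = counts.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # conflicting evidence, no confident label
    return top[0][0]

examples = [
    "Claim your FREE MONEY now!!!",
    "Are we still meeting for lunch tomorrow at noon?",
]
labels = [weak_label(t) for t in examples]  # ["SPAM", "HAM"]
```

The payoff is that subject matter experts write a handful of cheap heuristics like these instead of hand-labeling every example, and the denoising step turns their noisy, conflicting outputs into a usable training set.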


This approach is fascinating, not least because it works really well: "subject matter experts build models 2.8x faster and increase predictive performance an average 45.5%."


Unfortunately, the open source repo is no longer actively maintained, so practical usage is most likely limited to Snorkel AI customers. Nonetheless, learning about this approach has been fascinating, and I'd love to see more research like it.


To learn more, also see the Snorkel paper from 2017 and their CEO Alex Ratner's MLSys talk.


Synthetic data to the rescue?

If you had told me a few years ago that synthetic data would play a major role in real-world ML systems, I would not have been a believer. Maybe for certain domains like self-driving or other robotics where game simulations can be used, but surely not for everything, right?


Wrong (and don't call me Shirley!)


Fast-forward to now, and I've worked with roughly equal amounts of real and "fake" data. I've used generated data both for training (entirely synthetic data, like SynthText, as well as synthetically augmented data for dealing with PII) and for testing/demos (where real customer data can't be used, for obvious reasons).
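The testing/demo case is the easiest to picture: instead of copying production records, you generate plausible fixtures that contain no real PII. Here's a minimal sketch using only the standard library; the field names and value pools are invented for illustration, and real tools (like those covered below) do far more to match the statistical shape of production data.

```python
# Toy generator for synthetic customer records, for tests and demos where
# real customer data is off-limits. Seeding the RNG makes fixtures reproducible.
import random
import string

FIRST_NAMES = ["Ada", "Grace", "Alan", "Edsger"]   # illustrative value pool
DOMAINS = ["example.com", "example.org"]           # reserved test domains

def fake_customer(rng):
    name = rng.choice(FIRST_NAMES)
    user = name.lower() + "".join(rng.choices(string.digits, k=4))
    return {
        "name": name,
        "email": f"{user}@{rng.choice(DOMAINS)}",
        "signup_year": rng.randint(2015, 2021),
    }

rng = random.Random(42)  # fixed seed -> same fixtures on every test run
customers = [fake_customer(rng) for _ in range(3)]
```

The training case is harder, since the generated data also has to preserve the distributions a model needs to learn from, which is exactly the problem the startups below are working on.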


This recent VentureBeat article showcases some of the startups that are working to solve problems like these with synthetic data.


Also see this older MIT news article on the Synthetic Data Vault, a set of open source tools for data generation.


What are people using GPT-3 for?

"Nine months since the launch of our first commercial product, the OpenAI API, more than 300 applications are now using GPT-3, and tens of thousands of developers around the globe are building on our platform. We currently generate an average of 4.5 billion words per day, and continue to scale production traffic."


Nearly a year after its debut, it seems the hype around GPT-3 replacing writers, programmers, and [insert profession here] has died down somewhat. In its place, some really creative uses of the model have been released. This article on the OpenAI Blog features three of them and announces new features to the API.


I love seeing these applications--it feels like I'm getting a glimpse into the future, and they really spark my imagination as to what's possible with this technology.


The MLOps Community Turns 1!

I don't need to tell you that I'm a huge fan of the MLOps Community. It's amazing how much Demetrios and team have managed to do in just one year to bring together ML practitioners from all over the world.


To celebrate, Demetrios put together a list of his favorite podcast episodes he has produced. Whether you're a long-time listener or completely new, I highly recommend you check them out!


Thanks for reading and have a great rest of your week!

Charlie

