Meta CEO Mark Zuckerberg has a hot take on Big Tech’s race for AI training data: It’s not about the data.
“The thing that I think is going to be more valuable is the feedback loops rather than any kind of upfront corpus,” Zuckerberg said in an interview with the Command Line, a tech industry newsletter.
Feedback loops are used to retrain and improve AI models over time based on their previous outputs. These mechanisms let AI models know when they make an error, for example, and provide them with data to adjust their future performance.
“Having a lot of people use it and then seeing how people use it and being able to improve from there is actually going to be a more differentiating thing over time,” he said.
Sourcing new data to feed their insatiable AI models, which in theory will make the models smarter, is now an obsession for companies racing to dominate AI.
Companies including OpenAI, Google, Amazon, and Meta have considered some wild solutions. Meta, for instance, was so desperate for data at one point that it considered buying the publishing company Simon & Schuster and even weighed risking copyright lawsuits for more material, The New York Times reported.
Another solution to the problem of limited data is just creating new data, something Big Tech calls “synthetic data.” Synthetic data is artificially generated and designed to mimic data generated by real-world events. Zuckerberg’s into it.
“I think there’s going to be a lot in synthetic data, where you are having the models trying to churn on different problems and see which paths end up working, and then use that to reinforce,” he said.
Anthropic, the maker of chatbot Claude, has also fed internally generated data into its models. And ChatGPT maker OpenAI is considering it, although CEO Sam Altman said at a conference last May that the key is having a model “smart enough to make good synthetic data.”
And while Zuckerberg sees feedback loops as the key to building powerful AI models, there are also risks in relying on them. The loops could reinforce a model’s mistakes, limitations, and biases if the model isn’t trained on “good data” to begin with.