In a bid to protect its crown jewels, OpenAI is now requiring government ID verification for developers who want access to its most advanced AI models.
While the move is officially about curbing misuse, a deeper concern is emerging: that OpenAI’s own outputs are being harvested to train competing AI systems.
A new research paper from Copyleaks, a company that specializes in AI content detection, offers evidence of why OpenAI may be acting now. Using a system that identifies the stylistic “fingerprints” of major AI models, Copyleaks estimated that 74% of the outputs from the rival Chinese model DeepSeek-R1 were classified as OpenAI-written.
This doesn’t just suggest overlap — it implies imitation.
Copyleaks’ classifier was also tested on other models, including Microsoft’s phi-4 and Elon Musk’s Grok-1. These models showed almost no similarity to OpenAI, with “no-agreement” scores of 99.3% and 100% respectively, suggesting independent training. Mistral’s Mixtral model showed some similarities, but DeepSeek’s numbers stood out starkly.
The research underscores how even when models are prompted to write in different tones or formats, they still leave behind detectable stylistic signatures — like linguistic fingerprints. These fingerprints persist across tasks, topics, and prompts, and can now be traced back to their source with some accuracy. That has enormous implications for detecting unauthorized model use, enforcing licensing agreements, and protecting intellectual property.
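To make the idea of a stylistic fingerprint concrete, here is a minimal sketch in Python. It is not Copyleaks’ method, which the article does not describe in detail. It assumes a style signature can be crudely approximated by character trigram counts, and every name in it (`profile`, `attribute`, `model_a`, `model_b`, the 0.5 threshold) is hypothetical and chosen only for illustration.

```python
from collections import Counter
import math


def profile(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams: a crude stand-in for a style signature."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(text: str, references: dict, threshold: float = 0.5) -> str:
    """Return the closest reference label, or 'unattributed' if nothing is close enough."""
    query = profile(text)
    scores = {label: cosine(query, ref) for label, ref in references.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unattributed"


# Hypothetical reference samples; a real system would use large corpora per model.
references = {
    "model_a": profile("Certainly! Here is a concise overview of the topic you asked about."),
    "model_b": profile("ok so basically the thing works like this, more or less"),
}

print(attribute("Certainly! Here is a concise overview of the requested topic.", references))
```

A production detector would presumably rely on trained classifiers and far more data than this toy comparison, but the shape of the task is the same: build a profile per known source, then check which profile a new text most resembles.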
OpenAI didn’t respond to requests for comment. But the company gave some reasons for introducing the new verification process. “Unfortunately, a small minority of developers intentionally use the OpenAI APIs in violation of our usage policies,” it wrote when announcing the change recently.
OpenAI says DeepSeek might have ‘inappropriately distilled’ its models
Earlier this year, just after DeepSeek wowed the AI community with reasoning models that were similar in performance to OpenAI’s offerings, OpenAI was even clearer: “We are aware of and reviewing indications that DeepSeek may have inappropriately distilled our models.”
Distillation is a process in which developers train new models using the outputs of existing models. While the technique is common in AI research, doing so without permission could violate OpenAI’s terms of service.
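As a rough illustration of what distillation means in practice, the toy Python sketch below trains a small “student” purely on the answers of a “teacher” it can only query. It says nothing about how DeepSeek or anyone else actually trained their systems. The `teacher` function, the linear student, and the learning-rate and epoch settings are all assumptions made up for this example.

```python
import random


def teacher(x: float) -> float:
    """Hypothetical existing model, queried as a black box."""
    return 3.0 * x + 2.0


def distill(prompts, query_teacher, epochs: int = 2000, lr: float = 0.05):
    """Fit a tiny student y = w*x + b to the teacher's outputs by gradient descent."""
    # Step 1: collect the teacher's output for every prompt.
    targets = [query_teacher(x) for x in prompts]
    # Step 2: train the student on those outputs alone (mean squared error).
    w, b = 0.0, 0.0
    n = len(prompts)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(prompts, targets):
            err = (w * x + b) - y
            grad_w += 2 * err * x / n
            grad_b += 2 * err / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


random.seed(0)
prompts = [random.uniform(-1.0, 1.0) for _ in range(50)]
w, b = distill(prompts, teacher)
print(f"student learned w={w:.3f}, b={b:.3f}")
```

The student never sees the teacher’s internals, only its answers, yet it ends up behaving almost identically on these inputs. That is why access to a model’s outputs at scale can be enough to copy much of its behavior.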
DeepSeek’s research paper about its new R1 model describes using distillation with open-source models, but it doesn’t mention OpenAI. I asked DeepSeek about these allegations of mimicry earlier this year and didn’t get a response.
Critics point out that OpenAI itself built its early models by scraping the web, including content from news publishers, authors, and creators — often without consent. So is it hypocritical for OpenAI to complain when others use its outputs in a similar way?
“It really comes down to consent and transparency,” said Alon Yamin, CEO of Copyleaks.
Training on copyrighted human content without permission is one kind of issue. But using the outputs of proprietary AI systems to train competing models is another — it’s more like reverse-engineering someone else’s product, he explained.
Yamin argues that while both practices are ethically fraught, training on OpenAI outputs raises competitive risks, as it essentially transfers hard-earned innovations without the original developer’s knowledge or compensation.
As AI companies race to build ever-more capable models, this debate over who owns what — and who can train on whom — is intensifying. Tools like Copyleaks’ digital fingerprinting system offer a potential way to trace and verify authorship at the model level. For OpenAI and its rivals, that may be both a blessing and a warning.