Did DeepSeek-R1 Train on OpenAI’s Model? Study Finds 74.2% Similarity


A new study by Copyleaks has found a strong similarity between texts generated by DeepSeek-R1 and those produced by OpenAI’s models.

According to the research, 74.2% of DeepSeek-R1’s outputs share stylistic fingerprints with OpenAI’s technology, raising questions about possible reliance on OpenAI’s models during training.

The finding has also prompted discussion of data sourcing, intellectual property rights, and transparency in AI development. If DeepSeek-R1 was trained on OpenAI-generated content without disclosure, it could pose legal and ethical risks, including reinforcing biases and limiting diversity in AI-generated text.

The study employed an advanced text attribution method, utilising three independent AI classifiers trained on outputs from OpenAI, Gemini, Claude, and Llama. To ensure accuracy, a classification was only confirmed when all three classifiers reached the same conclusion. This approach resulted in a 99.88% precision rate, with a false-positive rate of just 0.04%.
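The unanimous-agreement rule described above can be illustrated with a minimal Python sketch. This is not Copyleaks’ implementation: the three toy classifiers, the label names, and the helper functions `unanimous_label` and `attribution_rates` are all hypothetical stand-ins, used only to show how a label is confirmed when every classifier agrees and left unattributed otherwise.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# A classifier maps a text to the name of the model it thinks wrote it.
Classifier = Callable[[str], str]


def unanimous_label(text: str, classifiers: Sequence[Classifier]) -> Optional[str]:
    """Return a label only if every classifier agrees; otherwise None."""
    votes = {clf(text) for clf in classifiers}
    return votes.pop() if len(votes) == 1 else None


def attribution_rates(texts: Sequence[str], classifiers: Sequence[Classifier]) -> dict:
    """Share of texts per confirmed label, plus a 'no_agreement' bucket."""
    counts = Counter(unanimous_label(t, classifiers) or "no_agreement" for t in texts)
    return {label: n / len(texts) for label, n in counts.items()}


# Hypothetical toy classifiers (assumptions for illustration only).
def always_openai(text: str) -> str:
    return "openai"


def by_length(text: str) -> str:
    return "openai" if len(text) > 10 else "llama"


def by_keyword(text: str) -> str:
    return "openai" if "delve" in text else "claude"


JURY = [always_openai, by_length, by_keyword]
SAMPLES = ["let us delve into the details", "short"]
RATES = attribution_rates(SAMPLES, JURY)
```

Requiring all three votes to match trades coverage for precision: a text on which the classifiers split is left unattributed rather than guessed, which is consistent with the low false-positive rate the study reports.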

During testing, DeepSeek-R1’s texts were found to align with OpenAI’s writing style in 74.2% of cases. In contrast, texts from Microsoft’s Phi-4 model showed a 99.3% disagreement rate, meaning they bore little resemblance to any of the known models, which suggests independent training.


Shai Nisan, Copyleaks’ chief data scientist, commented on the importance of the findings, stating, “With this research, we have moved beyond general AI detection as we knew it and into model-specific attribution, a breakthrough that fundamentally changes how we approach AI content.”

The research team, led by Yehonatan Bitton, Shai Nisan, and Elad Bitton, adopted a rigorous “unanimous jury” approach to ensure the reliability of their findings. Their method goes beyond identifying known AI models: by analysing unique stylistic markers, it can also detect previously unseen ones.

If DeepSeek-R1 was developed using OpenAI’s work without proper attribution, it could mislead investors and stakeholders about the originality of its technology.

This ultimately raises concerns about AI governance, competitive fairness, and the risk of intellectual property infringement in the industry. Transparency in model training and attribution is essential to maintaining trust and ensuring ethical development practices.



…While Microsoft’s Phi-4 Shows 99.3% Independence

Joan Aimuengheuwa

