Did DeepSeek-R1 Train on OpenAI’s Model? Study Finds 74.2% Similarity

Joan Aimuengheuwa — Tue, 04 Mar 2025 08:33:13 +0000

A new study by Copyleaks has found a strong similarity between texts generated by DeepSeek-R1 and those produced by OpenAI’s model.

According to the research, 74.2% of DeepSeek-R1’s outputs share stylistic fingerprints with OpenAI’s technology, raising questions about possible reliance on OpenAI’s model during training.

The finding has also prompted discussions around data sourcing, intellectual property rights, and transparency in AI development. If DeepSeek-R1 was trained using OpenAI-generated content without disclosure, it could create legal and ethical risks, including reinforcing biases and limiting diversity in AI-generated text.

The study employed an advanced text attribution method, utilising three independent AI classifiers trained on outputs from OpenAI, Gemini, Claude, and Llama. To ensure accuracy, a classification was only confirmed when all three classifiers reached the same conclusion. This approach resulted in a 99.88% precision rate, with a false-positive rate of just 0.04%.
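To make the unanimity rule concrete, here is a minimal Python sketch. It is not Copyleaks’ implementation, which the study summary does not describe in detail. The three classifiers below are hypothetical toy stand-ins, and the names `unanimous_vote` and `attribution_rates` are invented for illustration. The only part taken from the study is the rule itself: a label is confirmed only when every classifier agrees.

```python
from collections import Counter
from typing import Callable, Dict, Optional, Sequence

# A classifier maps a text to a model label such as "openai" or "other".
Classifier = Callable[[str], str]


def unanimous_vote(text: str, classifiers: Sequence[Classifier]) -> Optional[str]:
    """Return the shared label if all classifiers agree, otherwise None."""
    labels = {clf(text) for clf in classifiers}
    return labels.pop() if len(labels) == 1 else None


def attribution_rates(texts: Sequence[str],
                      classifiers: Sequence[Classifier]) -> Dict[str, float]:
    """Share of texts per confirmed label; disagreements count as 'no consensus'."""
    if not texts:
        return {}
    counts = Counter(unanimous_vote(t, classifiers) or "no consensus" for t in texts)
    return {label: n / len(texts) for label, n in counts.items()}


# Hypothetical toy classifiers, for illustration only. Real attribution
# classifiers would be trained models, not keyword or length rules.
def by_keyword(text: str) -> str:
    return "openai" if "delve" in text.lower() else "other"


def by_length(text: str) -> str:
    return "openai" if len(text.split()) > 5 else "other"


def by_punctuation(text: str) -> str:
    return "openai" if ";" in text else "other"


jury = [by_keyword, by_length, by_punctuation]
samples = [
    "Let us delve into the details; there is a lot to cover here.",
    "Short note.",
    "We delve briefly.",
]

if __name__ == "__main__":
    for label, rate in attribution_rates(samples, jury).items():
        print(f"{label}: {rate:.1%}")
```

Requiring agreement from every classifier trades coverage for precision: texts on which the jury splits are left unattributed rather than guessed at, which is consistent with the low false-positive rate the study reports.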

During testing, DeepSeek-R1’s texts were found to align with OpenAI’s writing style in 74.2% of cases. In contrast, Microsoft’s Phi-4 model exhibited a 99.3% disagreement rate with existing AI-generated texts, indicating independent training.

Source: Copyleaks

Shai Nisan, Copyleaks’ chief data scientist, commented on the importance of the findings, stating, “With this research, we have moved beyond general AI detection as we knew it and into model-specific attribution, a breakthrough that fundamentally changes how we approach AI content.”

The research team, led by Yehonatan Bitton, Shai Nisan, and Elad Bitton, adopted a rigorous “unanimous jury” approach to ensure the reliability of its findings. The method went beyond identifying known AI models and could also detect previously unseen ones by analysing unique stylistic markers.

If DeepSeek-R1 was developed using OpenAI’s work without proper attribution, it could mislead investors and stakeholders about the originality of its technology.

Ultimately, the findings raise concerns about AI governance, competitive fairness, and the risk of intellectual property infringement in the industry. Transparency in model training and attribution is essential for maintaining trust and ensuring ethical development practices.
