In early 2025, amid rapid advances in artificial intelligence, the emergence of DeepSeek R1 attracted widespread industry attention. An independently developed open-source large language model from China, it not only challenged the position of Western AI giants with high performance at low cost but also quickly became a market favorite. Amid the applause, however, one key question cannot be ignored: why does DeepSeek R1, despite its success, have a hallucination rate significantly higher than the industry average? And does that suggest a bias in how we evaluate AI tools?
The Rise of DeepSeek R1
DeepSeek R1 was officially released on January 10, 2025, and its model weights were open-sourced on January 20. The model won over users at surprising speed: within a week of release, the DeepSeek app climbed to the top of Apple's App Store download chart, displacing ChatGPT, which had long held the top position. This rapid rise also shook the tech stock market, reportedly triggering a selloff of up to $1 trillion (approximately £800 billion) in US technology stocks.
DeepSeek R1's ability to attract such attention is mainly attributed to three notable characteristics:
1. Cost-effectiveness: DeepSeek R1's development cost is estimated at only $5.6 million, far below the estimated $100 million for OpenAI's GPT-4
2. Open-source nature: Released under the MIT license, allowing developers to freely use and modify it
3. Performance: Despite its low development cost, its performance in coding, mathematics, and science can rival leading proprietary models
However, amid this praise, one issue cannot be ignored: DeepSeek R1's hallucination rate is significantly higher than that of other models.
There may be several reasons for this high hallucination rate:
1. "Over-helpful" tendency: R1 tends to add information not present in the original text, even if this information might be factually correct. This "benign hallucination" accounts for 71.7% of R1's hallucinations, compared to just 36.8% for V3.
2. Exposed reasoning process: R1 shows its thinking process in detail, a feature that improves transparency but may also create more opportunities for hallucination.
3. Token-heavy reasoning: R1 relies on long, token-intensive reasoning chains, averaging 4,717.5 tokens per question versus 191.75 to 462.39 for other models. This multi-step problem-solving improves accuracy but also widens the surface for hallucinations. (It also means that, at equal per-token prices, R1 costs roughly 10-25 times as much per question; see the back-of-the-envelope check after this list.)
4. Differences in training method: R1 relied heavily on reinforcement learning to strengthen its reasoning abilities, which may inadvertently teach the model to prioritize plausibility over strict factual accuracy when generating explanations.
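To make the cost multiple in point 3 concrete, here is a back-of-the-envelope check in Python. It is a minimal sketch: the token averages are the figures cited above, and since per-token prices are assumed equal, price cancels out and cost scales directly with token count.

```python
# Sanity-check the "10-25x" cost multiple from point 3 above.
# Token averages are the figures cited in the text; with equal
# per-token prices, relative cost equals relative token count.

r1_avg_tokens = 4717.5            # R1's average tokens per question
other_models = {                   # the cited range for other models
    "low end": 191.75,
    "high end": 462.39,
}

for label, tokens in other_models.items():
    ratio = r1_avg_tokens / tokens
    print(f"vs. {label} ({tokens} tokens): R1 uses {ratio:.1f}x as many")

# Output:
# vs. low end (191.75 tokens): R1 uses 24.6x as many
# vs. high end (462.39 tokens): R1 uses 10.2x as many
```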
Despite DeepSeek R1's excellence in certain respects, from a practical standpoint its high hallucination rate makes it hard to recommend as a first-choice productivity tool.
Truly effective AI tools must be grounded in accuracy. When we use AI for research, code writing, or data analysis, incorrect information can have serious consequences. Imagine relying on an AI assistant for investment decisions when there is a 14% chance that any given piece of data it provides is fabricated; the rough calculation below shows how quickly that risk compounds.
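Here is a minimal sketch of that compounding, assuming (simplistically) that each answer is an independent trial with the 14% rate quoted above; real usage will not exactly satisfy independence, so treat the numbers as illustrative.

```python
# How a 14% per-answer hallucination rate compounds over a session,
# under the simplifying assumption that answers are independent trials.

hallucination_rate = 0.14

for n in (1, 5, 10, 20):
    p_at_least_one = 1 - (1 - hallucination_rate) ** n
    print(f"{n:>2} answers: {p_at_least_one:.0%} chance of >=1 fabrication")

# Output:
#  1 answers: 14% chance of >=1 fabrication
#  5 answers: 53% chance of >=1 fabrication
# 10 answers: 78% chance of >=1 fabrication
# 20 answers: 95% chance of >=1 fabrication
```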
Innovation and Marketing
The viral popularity of DeepSeek R1 prompts a deeper question: in the AI field, is innovation increasingly becoming a hook for viral re-sharing, serving online marketing more than the actual value of the tools themselves?
When we strip away the marketing halo and analyze the DeepSeek R1 phenomenon calmly, several points emerge:
1. Short-term sensation vs. long-term value: DeepSeek R1's release indeed caused a sensation, leading to significant fluctuations in US technology stocks, but this short-term impact does not equate to long-term practical value.
2. The two sides of open source: While open source promotes innovation and collaboration, it can also mean weaker quality control. Compared to closed-source models, R1 lacks strict internal quality standards and oversight.
3. Misleading comparisons of efficiency and accuracy: Although R1 performs excellently on some benchmarks, those benchmarks often fail to reflect the accuracy requirements of real-world applications.
4. The hidden costs of cost savings: R1's low operating cost (input tokens priced at $0.55 per million, versus $15 per million for OpenAI's o1) is indeed tempting, but as noted above, each question consumes 10-25 times more tokens than other models, so the headline price is not as cheap as it looks (see the sketch after this list). Moreover, the potential risk costs of the accuracy gap may far exceed these surface savings.
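To put point 4 in numbers, the sketch below compares effective input cost per question under the prices above. As an assumption, it uses the high end of the other models' cited token range as a stand-in for o1's usage, and it ignores output-token costs entirely, so it is illustrative only, not a real billing comparison.

```python
# Effective input cost per question, using the prices from point 4 and
# the token averages cited earlier. Output tokens are ignored, so this
# is an illustrative lower bound rather than a real bill.

r1_price_per_m = 0.55    # USD per million input tokens (R1)
o1_price_per_m = 15.00   # USD per million input tokens (o1)

r1_tokens = 4717.5       # R1's cited average tokens per question
o1_tokens = 462.39       # assumption: high end of the other models' range

r1_cost = r1_tokens / 1e6 * r1_price_per_m
o1_cost = o1_tokens / 1e6 * o1_price_per_m

print(f"R1: ${r1_cost:.6f} per question")   # ~$0.002595
print(f"o1: ${o1_cost:.6f} per question")   # ~$0.006936
print(f"R1 is {o1_cost / r1_cost:.1f}x cheaper -- far less than the "
      f"~27x gap the per-token prices alone suggest")
```

Under these assumptions, the 27x headline price advantage shrinks to roughly 2.7x once token consumption is factored in.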