CodeV-R1-Distill-Qwen-7B

The paper is coming soon!

1. Introduction

The post-training phase of large language models (LLMs) has advanced rapidly, with models like OpenAI’s o1, DeepSeek-R1, and Kimi k1.5 showcasing remarkable reasoning capabilities. Notably, DeepSeek-R1 introduced a simple yet powerful rule-based reinforcement learning (RL) approach that enables reasoning patterns to emerge. While these advances have primarily targeted software programming languages, there is growing interest in adapting LLMs to hardware description languages (HDLs), the critical tools for chip design and hardware verification.

However, HDLs such as Verilog face challenges akin to those of low-resource languages, including limited high-quality instruction-following data and weaker model capability in generating correct Register Transfer Level (RTL) code. These limitations hinder the performance and cross-language generalization of specialized code LLMs. To address this, we propose leveraging knowledge distillation to equip smaller, more efficient models with DeepSeek-R1-like reasoning abilities.

As a continuation of the work initiated with CodeV, we introduce CodeV-R1-Distill-Qwen-7B, a model distilled from DeepSeek-R1 using our CodeV dataset. It outperforms prior non-reasoning LLMs across major Verilog benchmarks, demonstrating superior code synthesis and problem-solving capabilities. Intriguingly, distillation on Verilog data also enhances the model’s mathematical reasoning, suggesting broader synergies between hardware-centric training and general logical reasoning.

2. Model Summary

  • Data Preparation: We first re-summarize and reformulate questions from the original CodeV dataset using DeepSeek-V3. We then filter out easy problems, namely those that Qwen2.5-Coder-7B-Instruct or Qwen2.5-Coder-32B-Instruct can solve within five attempts, as well as problems whose reference code is not synthesizable. For the remaining data, we use DeepSeek-R1 to generate one response per question. To avoid benchmark contamination, we also remove problems whose Rouge-L similarity to any benchmark problem exceeds 0.5 (a minimal filtering sketch follows this list). After these steps, approximately 87,000 (problem, code) pairs remain.
  • Training: We employ LLaMA-Factory to apply supervised fine-tuning (SFT) to Qwen2.5-Coder-7B-Instruct on this refined dataset of 87,000 pairs. Training runs for six epochs with a learning rate of 1e-5 and a batch size of 64.
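
As an illustration of the decontamination step above, here is a minimal sketch, assuming the rouge_score package and plain-string problem lists; the actual pipeline may differ in batching and data handling.

```python
# Minimal sketch of the Rouge-L decontamination filter described above.
# Assumptions: the `rouge_score` package and plain-string problem lists;
# the actual pipeline may batch or parallelize this step.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def is_contaminated(problem, benchmark_problems, threshold=0.5):
    """True if `problem` is too close to any benchmark problem."""
    return any(
        scorer.score(ref, problem)["rougeL"].fmeasure > threshold
        for ref in benchmark_problems
    )

# Toy example: keep only (problem, code) pairs that do not overlap
# with the evaluation benchmarks.
dataset = [("Design a 2-to-1 multiplexer.", "module mux2 ... endmodule")]
benchmark_problems = ["Design a 4-bit ripple-carry adder."]
clean = [(p, c) for p, c in dataset if not is_contaminated(p, benchmark_problems)]
print(len(clean))  # 1
```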

3. Evaluation Results

During the evaluation phase, the maximum generation length is set to 16,384 tokens, the sampling temperature to 0.6, and 20 responses are generated per query to estimate the pass@1 score (see the sketch below).
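
For reference, pass@k is commonly estimated from n samples per problem with the unbiased estimator of Chen et al. (2021); a minimal sketch follows, and we assume this standard formula is what is meant here.

```python
# Unbiased pass@k estimator (Chen et al., 2021): with n samples per
# problem, c of which pass, pass@k = 1 - C(n-c, k) / C(n, k).
# For k = 1 this reduces to c / n.
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples with c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per query (as above), 13 of which pass.
print(pass_at_k(n=20, c=13, k=1))  # 0.65, i.e. 65% pass@1
```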

Our evaluation covers the Verilog benchmarks VerilogEval and RTLLM. For VerilogEval v2, we examine zero-shot performance on both specification-to-RTL translation and code completion. For RTLLM, we report results on version 1.1, which offers a broader set of published baselines for comparison. Furthermore, we find that learning the reasoning process on Verilog problems, as distilled from DeepSeek-R1, also improves the model's out-of-domain mathematical capabilities.

VerilogEval (v2)

| Model | Model size | Type | Spec-to-RTL | Completion |
| --- | --- | --- | --- | --- |
| GPT-4o | Undisclosed | General | 62.5% | 59.0% |
| GPT-4 Turbo | Undisclosed | General | 61.1% | 53.9% |
| GPT-4 | Undisclosed | General | 32.0% | 42.3% |
| Mistral Large | Undisclosed | General | 37.5% | 34.0% |
| Llama3.1 | 405B | General | 57.2% | 56.4% |
| Llama3.1 | 70B | General | 42.8% | 35.3% |
| Llama3 | 70B | General | 43.9% | 37.8% |
| Llama2 | 70B | General | 5.3% | 1.3% |
| Llama3.1 | 8B | General | 19.1% | 2.6% |
| CodeLlama | 70B | Coding | 34.9% | 37.2% |
| DeepSeek Coder | 33B | Coding | 21.7% | 25.0% |
| CodeGemma | 7B | Coding | 9.5% | 8.3% |
| DeepSeek Coder | 6.7B | Coding | 29.6% | 24.4% |
| RTL-Coder | 6.7B | Verilog RTL | 36.8% | 35.9% |
| CodeV-R1-distill (ours) | 7B | Verilog RTL | 65.4% | 65.1% |

RTLLM (v1.1)

| Model | Model size | Type | Pass@1 |
| --- | --- | --- | --- |
| GPT-4o | Undisclosed | General | 33.8% |
| GPT-3.5 Turbo | Undisclosed | General | 28.3% |
| Llama3.1 | 405B | General | 38.9% |
| Nemotron-4 | 340B | General | 18.9% |
| Llama3.1 | 8B | General | 19.1% |
| CodeLlama | 7B | Coding | 17.9% |
| CodeQwen | 7B | Coding | 24.1% |
| Starcoder2 | 15B | Coding | 15.5% |
| DeepSeek Coder | 6.7B | Coding | 23.1% |
| DeepSeek-Coder-V2 | 16B | Coding | 33.1% |
| DeepSeek-Coder-V2 | 236B | Coding | 34.5% |
| RTL-Coder | 6.7B | Verilog RTL | 36.8% |
| CraftRTL | 6.7B | Verilog RTL | 53.1% |
| CodeV-R1-distill (ours) | 7B | Verilog RTL | 56.2% |

Math

| Model | AIME | MATH | AMC | Minerva | OlympiadBench | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B-Instruct-1M | 11.25% | 72.61% | 41.11% | 25.92% | 34.66% | 37.11% |
| Qwen2.5-Math-7B-Instruct | 12.08% | 82.25% | 49.40% | 27.64% | 37.31% | 41.74% |
| Qwen2.5-Coder-7B-Instruct (baseline) | 5.63% | 63.50% | 35.62% | 21.02% | 28.64% | 30.88% |
| CodeV-R1-distill (ours) | 11.04% | 74.35% | 45.86% | 25.79% | 38.70% | 39.15% |

4. Usage

CodeV-R1-Distill-Qwen-7B can be used in the same way as Qwen or Llama models.

For instance, you can easily start a service using vLLM:

vllm serve zhuyaoyu/CodeV-R1-Distill-Qwen-7B --tensor-parallel-size 2 --max-model-len 16384 --enforce-eager
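
Once the server is running, you can query it through vLLM's OpenAI-compatible API. A minimal client sketch, assuming the default port 8000 and the openai Python package (the user prompt is a toy example):

```python
# Minimal client sketch for vLLM's OpenAI-compatible endpoint.
# Assumptions: default port 8000; temperature and token budget mirror
# the evaluation settings described above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

SYSTEM_PROMPT = "..."  # use the system prompt from "Usage Recommendations" below

response = client.chat.completions.create(
    model="zhuyaoyu/CodeV-R1-Distill-Qwen-7B",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Write a Verilog module for a 2-to-1 multiplexer."},
    ],
    temperature=0.6,
    max_tokens=16384,
)
print(response.choices[0].message.content)
```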

Usage Recommendations

During training and evaluation, we use the following system prompt:

You are a helpful assistant. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and<answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>.  Now the user asks you to write verilog code. After thinking, when you finally reach a conclusion, enclose the final verilog code in ```verilog ``` within <answer> </answer> tags. i.e., <answer> ```verilog\n module top_module(in, out, ...) ... ``` </answer>.\n

We recommend using this system prompt at inference time as well.
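
Because responses follow the <think>/<answer> format, the final Verilog code can be recovered with simple post-processing. A minimal sketch (our own illustration, not part of any released tooling):

```python
# Minimal sketch: extract the final Verilog source from an <answer> block.
# Assumes the model followed the recommended format; callers should
# handle the None case for malformed outputs.
import re

def extract_verilog(response):
    """Return the code inside <answer> ```verilog ... ``` tags, or None."""
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if answer is None:
        return None
    code = re.search(r"```verilog\s*(.*?)```", answer.group(1), re.DOTALL)
    return code.group(1).strip() if code else None

sample = (
    "<think>select between inputs</think>"
    "<answer>```verilog\nmodule mux2(input a, b, sel, output y);\n"
    "  assign y = sel ? b : a;\nendmodule\n```</answer>"
)
print(extract_verilog(sample))
```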

5. License

CodeV-R1-Distill-Qwen-7B is derived from the Qwen2.5 series, which is originally licensed under the Apache 2.0 License, and is fine-tuned on 87k samples curated with DeepSeek-R1.

6. Citation

@misc{CodeV-R1-Distill-Qwen-7B,
  author = {IPRC-DIP},
  title = {CodeV Model Distilled from DeepSeek-R1},
  url = {https://huggingface.co./zhuyaoyu/CodeV-R1-Distill-Qwen-7B},
  year = {2025}
}