nuojohnchen committed
Commit dac23c7 · verified · 1 Parent(s): 3fe8312

Update app.py

Files changed (1):
  1. app.py +1 -1
app.py CHANGED
@@ -11,7 +11,7 @@ HF_TOKEN = os.environ.get("HF_TOKEN", None)
 
 DESCRIPTION = '''
 <div>
-<h1 style="text-align: center;">JudgeLRM: Large Reasoning Models as a Judge</h1>
+<h1 style="text-align: center;">JudgeLRM</h1>
 <p>This Space demonstrates the <a href="https://huggingface.co/nuojohnchen/JudgeLRM-7B"><b>JudgeLRM</b></a> model, designed to evaluate the quality of two AI assistant responses. JudgeLRM is a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79\% in F1 score, particularly excelling in judge tasks requiring deep reasoning.</p>
 <p>Enter an instruction and two responses, and the model will think, reason and score them on a scale of 1-10 (higher is better).</p>
 <p>You can also select Hugging Face models to automatically generate responses for evaluation.</p>
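For context, here is a minimal sketch of querying the JudgeLRM-7B checkpoint linked above directly with the transformers library, outside the Space. The judging prompt below is a hypothetical illustration; the actual template and generation settings used by app.py are not part of this diff.

# Minimal sketch, assuming the standard transformers causal-LM API.
# The prompt format is a hypothetical illustration, not app.py's template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nuojohnchen/JudgeLRM-7B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Hypothetical judging prompt: one instruction plus two candidate responses,
# asking the model to reason and score each on a 1-10 scale.
prompt = (
    "Instruction: Explain what a hash table is.\n"
    "Response 1: A hash table maps keys to values using a hash function, "
    "giving average O(1) lookups.\n"
    "Response 2: It is a kind of list.\n"
    "Compare the two responses, reason step by step, then score each from 1 to 10."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens (the model's reasoning and scores).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))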