---
title: SE-Arena
emoji: 🛠️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.25.2
app_file: app.py
hf_oauth: true
pinned: false
short_description: The chatbot arena for software engineering
---

# SE Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering

Welcome to **SE Arena**, an open-source platform for evaluating foundation models (FMs), particularly large language models (LLMs), on software engineering tasks. SE Arena benchmarks models in the iterative, context-rich workflows characteristic of software engineering (SE).

## Key Features

- **Multi-Round Conversational Workflows**: Evaluate models through extended, context-dependent interactions that mirror real-world SE processes.
- **RepoChat Integration**: Automatically inject repository context (issues, commits, PRs) into conversations for more realistic evaluations.
- **Advanced Evaluation Metrics**: Assess models using a comprehensive suite of metrics including:
  - Traditional metrics: Elo score and average win rate
  - Network-based metrics: Eigenvector centrality, PageRank score
  - Community detection: Newman modularity score
  - **Consistency score**: Quantify model determinism and reliability through self-play matches
- **Transparent, Open-Source Leaderboard**: View real-time model rankings across diverse SE workflows with full transparency.
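
SE Arena's actual scoring lives in the Space's source code; as a rough illustration of the traditional metrics above, here is a minimal sketch of how pairwise votes can be turned into Elo ratings. The model names, the K-factor of 32, and the base rating of 1000 are illustrative assumptions, not the platform's real configuration:

```python
from collections import defaultdict

K = 32             # Elo K-factor (illustrative choice)
BASE_RATING = 1000  # starting rating for every model (assumed)

def expected(r_a, r_b):
    """Expected score of A against B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_ratings(votes):
    """votes: iterable of (model_a, model_b, winner) tuples,
    where winner is 'a', 'b', or 'tie'."""
    ratings = defaultdict(lambda: BASE_RATING)
    for a, b, winner in votes:
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        e_a = expected(ratings[a], ratings[b])
        # Zero-sum update: A's gain equals B's loss.
        ratings[a] += K * (score_a - e_a)
        ratings[b] += K * ((1 - score_a) - (1 - e_a))
    return dict(ratings)

# Hypothetical vote log (model names are placeholders).
votes = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "tie"),
    ("model-x", "model-z", "a"),
]
print(elo_ratings(votes))
```

The network-based metrics work on the same data viewed as a directed win graph (an edge from loser to winner), over which PageRank or eigenvector centrality can be computed.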

## Why SE Arena?

Existing evaluation frameworks (like Chatbot Arena, WebDev Arena, and Copilot Arena) often don't address the complex, iterative nature of SE tasks. SE Arena fills critical gaps by:

- Supporting context-rich, multi-turn evaluations to capture iterative workflows
- Integrating repository-level context through RepoChat to simulate real-world development scenarios
- Providing multidimensional metrics for nuanced model comparisons
- Focusing on the full breadth of SE tasks beyond just code generation

## How It Works

1. **Submit a Prompt**: Sign in and input your SE-related task (optional: include a repository URL for RepoChat context)
2. **Compare Responses**: Two anonymous models provide responses to your query
3. **Continue the Conversation**: Test contextual understanding over multiple rounds
4. **Vote**: Choose the better model at any point, with the ability to reassess after multiple turns

## Getting Started

### Prerequisites

- A [Hugging Face](https://huggingface.co.) account
- Basic understanding of software engineering workflows

### Usage

1. Navigate to the [SE Arena platform](https://huggingface.co./spaces/SE-Arena/Software-Engineering-Arena)
2. Sign in with your Hugging Face account
3. Enter your SE task prompt (optionally include a repository URL for RepoChat)
4. Engage in multi-round interactions and vote on model performance

## Contributing

We welcome contributions from the community! Here's how you can help:

1. **Submit SE Tasks**: Share your real-world SE problems to enrich our evaluation dataset
2. **Report Issues**: Found a bug or have a feature request? Open an issue in this repository
3. **Enhance the Codebase**: Fork the repository, make your changes, and submit a pull request

## Privacy Policy

Your interactions are anonymized and used solely for improving SE Arena and FM benchmarking. By using SE Arena, you agree to our Terms of Service.

## Future Plans

- **Analysis of Real-World SE Workloads**: Identify common patterns and challenges in user-submitted tasks
- **Multi-Round Evaluation Metrics**: Develop specialized metrics for assessing model adaptation over successive turns
- **Enhanced Community Engagement**: Enable broader participation through voting and contributions
- **Expanded FM Coverage**: Include domain-specific and multimodal foundation models
- **Advanced Context Compression**: Integrate techniques like LongRoPE and SelfExtend to manage long-term memory

## Contact

For inquiries or feedback, please [open an issue](https://github.com/SE-Arena/Software-Engineering-Arena/issues/new) in this repository. We welcome your contributions and suggestions!