---
title: SE-Arena
emoji: 🛠️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.25.2
app_file: app.py
hf_oauth: true
pinned: false
short_description: The chatbot arena for software engineering
---

# SE Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering

Welcome to **SE Arena**, an open-source platform designed for evaluating software engineering-focused foundation models (FMs), particularly large language models (LLMs). SE Arena benchmarks models in the iterative, context-rich workflows characteristic of software engineering (SE) tasks.

## Key Features

- **Multi-Round Conversational Workflows**: Evaluate models through extended, context-dependent interactions that mirror real-world SE processes.
- **RepoChat Integration**: Automatically inject repository context (issues, commits, PRs) into conversations for more realistic evaluations.
- **Advanced Evaluation Metrics**: Assess models using a comprehensive suite of metrics, including:
  - Traditional metrics: Elo score and average win rate
  - Network-based metrics: eigenvector centrality and PageRank score
  - Community detection: Newman modularity score
  - Consistency score: quantifies model determinism and reliability through self-play matches
- **Transparent, Open-Source Leaderboard**: View real-time model rankings across diverse SE workflows with full transparency.
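To illustrate how pairwise votes feed a leaderboard, here is a minimal Elo-update sketch. The function name, starting rating, and K-factor are illustrative choices, not SE Arena's actual implementation:

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Update Elo ratings for models A and B after one head-to-head vote.

    score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    # Expected score of A given the current rating gap.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Two models start at 1000; A wins the first vote.
ra, rb = elo_update(1000, 1000, 1.0)  # -> (1016.0, 984.0)
```

Between equally rated models the winner gains exactly `k / 2` points; as a model's rating grows, beating weaker opponents yields ever-smaller gains, which is why the leaderboard complements Elo with the network-based metrics above.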

## Why SE Arena?

Existing evaluation frameworks (such as Chatbot Arena, WebDev Arena, and Copilot Arena) often fail to capture the complex, iterative nature of SE tasks. SE Arena fills these gaps by:

- Supporting context-rich, multi-turn evaluations that capture iterative workflows
- Integrating repository-level context through RepoChat to simulate real-world development scenarios
- Providing multidimensional metrics for nuanced model comparisons
- Covering the full breadth of SE tasks, not just code generation
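One reason network-based metrics add nuance over raw win rate: they credit a model not just for how often it wins, but for *whom* it beats. A self-contained PageRank-style sketch over a hypothetical win graph (the model names, vote data, and damping factor are made up for illustration; this is not SE Arena's code):

```python
def pagerank(wins, d=0.85, iters=100):
    """Rank models from pairwise vote outcomes.

    Each loss is an endorsement edge loser -> winner, so beating
    strong models transfers more ranking mass than beating weak ones.

    wins: list of (winner, loser) tuples.
    """
    nodes = sorted({m for pair in wins for m in pair})
    out = {m: [] for m in nodes}
    for winner, loser in wins:
        out[loser].append(winner)

    rank = {m: 1.0 / len(nodes) for m in nodes}
    for _ in range(iters):
        new = {m: (1 - d) / len(nodes) for m in nodes}
        for m in nodes:
            if out[m]:
                share = d * rank[m] / len(out[m])
                for w in out[m]:
                    new[w] += share
            else:
                # Dangling node (never lost): spread its mass evenly.
                for n in nodes:
                    new[n] += d * rank[m] / len(nodes)
        rank = new
    return rank

votes = [("A", "B"), ("A", "C"), ("B", "C")]
scores = pagerank(votes)
# "A" never loses and "B" beats "C", so A ranks above B above C.
```

A model with a mediocre win rate can still rank highly here if its few wins came against top-ranked opponents, which is exactly the nuance a flat average win rate misses.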

## How It Works

1. **Submit a Prompt**: Sign in and enter your SE-related task (optionally include a repository URL for RepoChat context)
2. **Compare Responses**: Two anonymous models respond to your query
3. **Continue the Conversation**: Test contextual understanding over multiple rounds
4. **Vote**: Choose the better model at any point, with the ability to reassess after multiple turns

## Getting Started

### Prerequisites

- A [Hugging Face](https://huggingface.co) account
- Basic understanding of software engineering workflows

### Usage

1. Navigate to the [SE Arena platform](https://huggingface.co/spaces/SE-Arena/Software-Engineering-Arena)
2. Sign in with your Hugging Face account
3. Enter your SE task prompt (optionally include a repository URL for RepoChat)
4. Engage in multi-round interactions and vote on model performance

## Contributing

We welcome contributions from the community! Here's how you can help:

1. **Submit SE Tasks**: Share your real-world SE problems to enrich our evaluation dataset
2. **Report Issues**: Found a bug or have a feature request? Open an issue in this repository
3. **Enhance the Codebase**: Fork the repository, make your changes, and submit a pull request

## Privacy Policy

Your interactions are anonymized and used solely for improving SE Arena and FM benchmarking. By using SE Arena, you agree to our Terms of Service.

## Future Plans

- **Analysis of Real-World SE Workloads**: Identify common patterns and challenges in user-submitted tasks
- **Multi-Round Evaluation Metrics**: Develop specialized metrics for assessing model adaptation over successive turns
- **Enhanced Community Engagement**: Enable broader participation through voting and contributions
- **Expanded FM Coverage**: Include domain-specific and multimodal foundation models
- **Advanced Context Compression**: Integrate techniques like LongRoPE and SelfExtend to manage long-term memory

## Contact

For inquiries or feedback, please [open an issue](https://github.com/SE-Arena/Software-Engineering-Arena/issues/new) in this repository. We welcome your contributions and suggestions!