Elevating SWE-bench Verified with Blackbox Agent
We are excited to share the latest improvements to the Blackbox SWE AI Agent. With these advancements, we have increased our state-of-the-art performance on SWE-bench Verified from 62.8% to 65.2%. In this blog post, we share more details about these results.
SWE-bench serves as a comprehensive AI evaluation benchmark that measures a model's capability to tackle real-world software engineering tasks. It specifically assesses how well the model can address GitHub issues from widely-used open-source Python repositories. For each task, the AI model is provided with a Python environment and a local copy of the repository just before the issue was resolved. The model must then comprehend, modify, and test the code before proposing its solution.
Each proposed solution is evaluated against the actual unit tests from the pull request that resolved the original GitHub issue, ensuring that the AI model can replicate the functionality achieved by the original human contributor.
SWE-bench evaluates not just the AI model in isolation, but an entire "agent" system. Here, an "agent" refers to the combination of the AI model and the surrounding software infrastructure. This infrastructure is responsible for generating prompts for the model, interpreting the model's outputs to take action, and managing the interaction loop where the results of the model's previous actions inform its next prompt. The performance of an agent on SWE-bench can vary significantly based on this infrastructure, even when using the same underlying AI model.
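To make this concrete, the sketch below shows the general shape of such an interaction loop (hypothetical helpers, not our actual infrastructure): the model is prompted with the issue and the observations gathered so far, its output is parsed into an action such as running a command or editing a file, the action is executed in the repository environment, and the result feeds into the next prompt.

```python
# Minimal sketch of an agent interaction loop; the model, parser, and executor
# are hypothetical callables injected by the surrounding infrastructure.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Action:
    kind: str      # e.g. "run_command", "edit_file", "submit_patch"
    payload: str   # shell command, file edit, or final diff

def solve_issue(
    issue_text: str,
    call_model: Callable[[str], str],        # prompt -> raw model output
    parse_action: Callable[[str], Action],   # raw output -> structured action
    run_in_env: Callable[[Action], str],     # executes the action, returns an observation
    max_steps: int = 30,
) -> Optional[str]:
    """Drive the model until it submits a patch or the step budget runs out."""
    history = [f"GitHub issue:\n{issue_text}"]
    for _ in range(max_steps):
        prompt = "\n\n".join(history)          # the prompt grows with prior observations
        raw = call_model(prompt)
        action = parse_action(raw)
        if action.kind == "submit_patch":
            return action.payload              # candidate diff, later scored against the PR's tests
        observation = run_in_env(action)
        history.append(f"Action:\n{raw}\n\nObservation:\n{observation}")
    return None  # no patch produced within the budget
```

Because the loop, the prompts, and the action parsing all live outside the model, two agents built on the same model can score very differently on the benchmark.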
While there are numerous benchmarks for assessing the coding capabilities of Large Language Models, SWE-bench has gained traction for several reasons:
- It utilizes real engineering tasks from actual projects, rather than hypothetical competition or interview questions.
- It remains an open field with ample opportunity for improvement: before this update, no agent had surpassed 62.8% on SWE-bench Verified, and the updated Blackbox Agent now reaches 65.2%.
- It evaluates an entire "agent" rather than just a model in isolation. Open-source developers and startups have significantly enhanced performance by optimizing their infrastructure around the same underlying models.
It is important to note that the original SWE-bench dataset includes tasks that cannot be solved without additional context beyond the GitHub issue (for instance, specific error messages). SWE-bench Verified is a curated subset of 500 problems from SWE-bench that have been reviewed by humans to ensure they are solvable, providing a clearer measure of coding agents' performance. This is the benchmark we reference throughout this post.
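For readers who want to inspect the benchmark themselves, the snippet below loads the dataset with the Hugging Face `datasets` library (assuming the `princeton-nlp/SWE-bench_Verified` dataset id and the field names published with it):

```python
# Peek at SWE-bench Verified (assumes the Hugging Face dataset id and field names below).
from datasets import load_dataset

verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(verified))                       # 500 human-reviewed task instances

example = verified[0]
print(example["repo"])                     # source repository of the task
print(example["instance_id"])              # unique task identifier
print(example["problem_statement"][:300])  # the GitHub issue text given to the agent
```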
Test-time scaling: A recipe to punch above your weight
Recent advances in test-time scaling [3] have demonstrated promising improvements over models' baseline single-attempt performance, and these approaches are particularly effective for tackling complex tasks. There are three main strategies for leveraging test-time compute:
- Iterative Self-Refinement: Models iteratively refine their initial solutions using execution feedback, verification, or simulated results within their reasoning process.
- Parallel Sampling: Generating multiple diverse candidate solutions increases the likelihood of finding a correct one; a learned model or pipeline then selects the best solution for final submission.
- Search-Based Methods (Beam Search, Monte Carlo Tree Search, etc.): These approaches [6] sample multiple responses per step and use a process reward model (PRM) to select a subset for further expansion, continuing until completion or a step limit is reached.
We focused on improving test-time scaling in the first two areas and identified key methods that contributed to our performance gains.
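As a rough sketch of how the first two strategies compose (illustrative only, not our exact pipeline), the snippet below draws several trajectories in parallel, lets each refine its patch with execution feedback, and asks a verifier to pick one for submission; `sample_trajectory`, `refine_with_feedback`, and `score_candidate` are hypothetical model-backed callables.

```python
# Sketch: parallel sampling + iterative self-refinement, with a verifier picking the submission.
# All three callables are hypothetical placeholders for model-backed components.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional

def best_of_n(
    issue_text: str,
    sample_trajectory: Callable[[str, int], str],     # (issue, seed) -> candidate patch
    refine_with_feedback: Callable[[str, str], str],  # (issue, patch) -> refined patch
    score_candidate: Callable[[str, str], float],     # (issue, patch) -> verifier score
    n_samples: int = 8,
    refine_rounds: int = 2,
) -> Optional[str]:
    # 1) Parallel sampling: draw diverse candidate patches with different seeds.
    with ThreadPoolExecutor(max_workers=n_samples) as pool:
        candidates: List[str] = list(
            pool.map(lambda seed: sample_trajectory(issue_text, seed), range(n_samples))
        )
    # 2) Iterative self-refinement: each candidate is revised using execution feedback.
    for _ in range(refine_rounds):
        candidates = [refine_with_feedback(issue_text, patch) for patch in candidates]
    # 3) Verification: submit the highest-scoring candidate.
    if not candidates:
        return None
    return max(candidates, key=lambda patch: score_candidate(issue_text, patch))
```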
Expanding Solution Coverage
As demonstrated in prior research [2,3,4,5], increasing the number of sampled solution trajectories improves the success rate. By expanding our trajectory count by 1.25x, we achieved a notable increase in coverage from 74.8% to 77.6%, leading to an overall improvement of ~3% in performance.
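Here, coverage is the fraction of benchmark problems for which at least one sampled trajectory produces a correct patch. A standard way to estimate how coverage grows with the sample budget is the unbiased pass@k estimator used in repeated-sampling work [4]; a small sketch follows (the sample counts at the bottom are illustrative, not our run statistics).

```python
# Unbiased pass@k estimator: pass@k = 1 - C(n - c, k) / C(n, k), averaged over problems,
# where n = samples drawn per problem and c = number of correct samples among them.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k randomly chosen samples (out of n, c correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def coverage(results: list, k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct) for one problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Illustrative numbers only.
print(coverage([(10, 3), (10, 0), (10, 1), (10, 10)], k=5))
```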
Enhancing Verifier Performance
While increasing sampled trajectories improves solution coverage, it also poses challenges for the verifier pipeline, which must efficiently evaluate a larger set of potential solutions. Many of the newly successful tasks involve more complex and confounding solutions, making verification more difficult. To address this, we introduced a multi-faceted selection approach:
- Enhancing Model-Generated Test Case Coverage: Improving the test cases the model generates within each trajectory so they better evaluate candidate correctness.
- Committee-Based Verification: Combining Process Reward Models (PRMs) and an LLM-as-judge to select the final solution.
These refinements boosted our overall verifier pipeline selection performance to ~84%, improving our ability to identify and submit the most promising solutions.
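As a hedged sketch of what committee-based selection can look like (the component scorers and weights below are hypothetical, not our exact verifier), each candidate patch is scored by the fraction of model-generated tests it passes, by a PRM over its trajectory, and by an LLM-as-judge preference, and the weighted combination decides the final submission.

```python
# Sketch: committee-based verification over candidate patches.
# The scorer callables and the weighting are hypothetical placeholders.
from typing import Callable, Dict, List, Optional

def select_final_patch(
    issue_text: str,
    candidates: List[str],
    test_pass_rate: Callable[[str, str], float],  # (issue, patch) -> fraction of generated tests passed
    prm_score: Callable[[str, str], float],       # (issue, patch) -> process reward model score in [0, 1]
    judge_score: Callable[[str, str], float],     # (issue, patch) -> LLM-as-judge preference in [0, 1]
    weights: Optional[Dict[str, float]] = None,
) -> Optional[str]:
    """Rank candidates by a weighted committee score and return the winner."""
    if not candidates:
        return None
    w = weights or {"tests": 0.5, "prm": 0.3, "judge": 0.2}  # illustrative weighting only

    def committee_score(patch: str) -> float:
        return (
            w["tests"] * test_pass_rate(issue_text, patch)
            + w["prm"] * prm_score(issue_text, patch)
            + w["judge"] * judge_score(issue_text, patch)
        )

    return max(candidates, key=committee_score)
```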
Conclusion
We are working to bring these advancements into our VSCode agent. We are also looking forward to further advancing this research. Stay tuned for more updates!
Citation
@article{blackboxai_2025,
  title   = "Advancing SoTA and improving Blackbox SWE AI Agent",
  author  = "Umapathi, Logesh Kumar and Rizk, Richard and Rizk, Robert",
  journal = "blackbox.ai",
  year    = "2025",
  url     = ""
}
References
[1]: Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, & Karthik R Narasimhan (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. In The Twelfth International Conference on Learning Representations.
[2]: Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., Hubert, T., Choy, P., Masson d'Autume, C., Babuschkin, I., Chen, X., Huang, P.S., Welbl, J., Gowal, S., Cherepanov, A., Molloy, J., Mankowitz, D., Sutherland Robson, E., Kohli, P., Freitas, N., Kavukcuoglu, K., & Vinyals, O. (2022). Competition-level code generation with AlphaCode. Science, 378(6624), 1092–1097. https://arxiv.org/abs/2203.07814
[3]: Charlie Snell, Jaehoon Lee, Kelvin Xu, & Aviral Kumar. (2024). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. https://arxiv.org/abs/2408.03314
[4]: Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, & Azalia Mirhoseini. (2024). Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. https://arxiv.org/abs/2407.21787
[5]: Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Ré, & Azalia Mirhoseini. (2025). CodeMonkeys: Scaling Test-Time Compute for Software Engineering. https://arxiv.org/abs/2501.14723
[6]: Antonis Antoniades, Albert Örwall, Kexun Zhang, Yuxi Xie, Anirudh Goyal, & William Wang. (2024). SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement. https://arxiv.org/abs/2410.20285