About BI4LLMC

Large Language Models for code (LLMc) have transformed the landscape of software engineering (SE), demonstrating significant efficacy in tasks such as code completion. Despite their widespread use, however, there is a growing need to assess LLMc thoroughly: current evaluation practice relies heavily on accuracy and robustness metrics, with little consensus on the other factors that influence code generation. This gap hinders a holistic understanding of LLMc performance, affecting interpretability, efficiency, bias, fairness, and robustness. Challenges in benchmarking and data maintenance compound the issue, underscoring the need for a comprehensive evaluation approach.

How can we standardize, evolve, and implement benchmarks for evaluating LLMc and multi-agent approaches to code generation?

This is the question we aim to investigate at BI4LLMC. The 2024 edition is the first BI4LLMC workshop; it provides a venue for researchers and practitioners to exchange and discuss trending views, ideas, the state of the art, work in progress, and scientific results at the intersection of software engineering and AI, addressing the problem of data quality in modern systems.

Topics of Interest

  • Data and curating processes for benchmarks to evaluate LLMc
  • Practices to collect and maintain datasets to evaluate LLMc
  • Protocols and evaluation metrics beyond accuracy for LLMc
  • Best practices from practitioners and industry for conducting research related to benchmarking LLMc
  • Automated tooling and infrastructure for evaluating LLMc
  • Agent and multi-agent approaches to code generation benchmarking

Schedule

June 26, 2024 - All times are in Trondheim, Norway local time (GMT+2).

Start - End         Topic                                                            Presenters
09:15am - 09:30am   Welcome and Overview                                             Organization Committee
09:30am - 09:50am   Benchmarking for AI-for-code products                            Satish Chandra
09:50am - 10:30am   Data Curation for Realistic LLM Benchmarks                       Audris Mockus
10:30am - 11:00am   Break
11:00am - 12:45pm   Benchmarking: current challenges & gaps                          Group 1
01:00pm - 02:00pm   Lunch
02:00pm - 02:30pm   LLM interpretability                                             Ziyu Yao
02:30pm - 03:30pm   LLM interpretability and agentic SE benchmarking                 Group 2
03:30pm - 04:00pm   Break
04:00pm - 04:30pm   Discussion on the future of community benchmark infrastructure   Group 3, Online group
04:30pm - 04:50pm   Wrap-up

Keynotes



...

Audris Mockus
Ericsson-Harlan Mills Chair Professor, University of Tennessee, Knoxville, USA
Research Scientist at Meta

Data Curation for Realistic LLM Benchmarks

Traditional benchmarks often suffer from static datasets, limited adaptability, and inconsistent quality, and so fail to reflect the evolving challenges faced by developers. Data curation techniques may leverage the nature of the collected data with statistical and AI methods to clean, organize, and enhance datasets, ensuring they are diverse, up to date, and representative of real-world coding environments. These approaches should take version history into account (a feature unique to the software engineering domain) instead of treating code samples simply as blobs. The World of Code (WoC) research infrastructure provides the scale (nearly the entire OSS landscape), diversity, and high-performance schema needed to create benchmarks that are both realistic (based on actual commits) and dynamic (via updates to the WoC infrastructure). By leveraging its ability to extract real instances of nearly any software engineering task and its flexible schema mapping blobs, authors, and APIs with version history, researchers and developers can ensure coding assistants are evaluated against challenges that mirror real-world software development practices.
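To make the idea of history-aware curation more concrete, the sketch below shows one possible shape of a benchmark instance built from a real commit (before/after file versions plus metadata) rather than from an isolated code blob. It is a minimal illustration only; all names are hypothetical and do not correspond to any actual World of Code API.

```python
from dataclasses import dataclass, field


@dataclass
class HistoryAwareSample:
    """One benchmark instance derived from a real commit rather than an isolated blob.

    All field and function names are illustrative, not part of any actual
    World of Code API; a real pipeline would populate them from commit
    metadata in a version-history infrastructure.
    """
    repo: str            # e.g. "org/project"
    commit_sha: str      # the real commit the task is derived from
    parent_sha: str      # parent commit, giving the "before" state
    file_path: str       # file touched by the commit
    before_code: str     # file contents at parent_sha
    after_code: str      # file contents at commit_sha (reference solution)
    commit_message: str  # natural-language description of the change
    apis_touched: list = field(default_factory=list)  # APIs referenced in the diff


def to_change_task(sample: HistoryAwareSample) -> dict:
    """Turn a curated sample into a code-change task: given the pre-change
    file and the commit message, a model must produce the post-change file."""
    return {
        "prompt": f"# Task: {sample.commit_message}\n{sample.before_code}",
        "reference": sample.after_code,
        "metadata": {
            "repo": sample.repo,
            "commit": sample.commit_sha,
            "file": sample.file_path,
            "apis": sample.apis_touched,
        },
    }
```

Treating the parent version, the change description, and the touched APIs as first-class fields is what distinguishes such instances from the static, blob-only samples the talk argues against.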

Audris Mockus is the Ericsson-Harlan Mills Chair Professor in the Department of Electrical Engineering and Computer Science at the University of Tennessee, Knoxville, and a research scientist at Meta. He is a leading researcher in empirical software engineering, known for developing innovative methods to analyze digital traces from software development activities. These traces capture the complex interplay of individuals, groups, culture, and artifacts, requiring novel analytical approaches. Mockus has been instrumental in creating research infrastructures such as the World of Code (WoC), which supports studies of open-source software ecosystems. His work spans diverse domains, including big data, software engineering, and forensic anthropology.


...

Satish Chandra
Software Engineer at Google

Benchmarking for AI-for-code products

I will talk briefly about my experiences with benchmarking for AI-for-code products, and offer some general lessons.

Satish Chandra is a software engineer at Google, where he applies machine learning techniques to improve developer productivity and leads work on internal developer infrastructure that uses these techniques.


...

Ziyu Yao
Assistant Professor
George Mason University, USA

LLM interpretability

While benchmarks are now prevalently used to evaluate LLMs and LLMs for code, they do not reveal why a model works (or does not), and they can be gamed as well. In this talk, I will present our recent research on “LLM interpretability” and show two ways in which it fills the gaps left by benchmarks: (1) measuring and predicting a model's intrinsic performance, and (2) enabling fine-grained model control for improvement.

Ziyu Yao is an Assistant Professor in the Department of Computer Science at George Mason University. Her research focuses on natural language processing and artificial intelligence and their applications across domains, including software engineering and LLMs for code. She organized the LLM Explainability for Reasoning and Planning workshop at COLM 2025 and the NLP for Programming (NLP4Prog) workshop at ACL 2021. Her research has been funded by the National Science Foundation, the Microsoft Accelerate Foundation Models Research Award, and the Virginia Commonwealth Cyber Initiative, among others.

Venue

...