About BI4LLMC

Large Language Models for code (LLMc) have transformed the landscape of software engineering (SE), demonstrating significant efficacy in tasks such as code completion. Despite their widespread use, however, there is a growing need to assess LLMc thoroughly: current evaluation practice relies heavily on accuracy and robustness metrics, with little consensus on the other factors that influence code generation. This gap hinders a holistic understanding of LLMc performance, affecting interpretability, efficiency, bias, fairness, and robustness. Challenges in benchmarking and data maintenance compound the issue, underscoring the need for a comprehensive evaluation approach.

How can we standardize, evolve, and implement benchmarks for evaluating LLMc and multi-agent approaches to code generation?

This is the question we aim to investigate at BI4LLMC. The 2024 edition is the first BI4LLMC workshop; it provides a venue for researchers and practitioners to exchange and discuss trending views, ideas, the state of the art, work in progress, and scientific results at the intersection of software engineering and AI, addressing the problem of data quality in modern systems.

Topics of Interest

  • Data and curating processes for benchmarks to evaluate LLMc
  • Practices to collect and maintain datasets to evaluate LLMc
  • Protocols and evaluation metrics beyond accuracy for LLMc
  • Best practices from practitioners and industry for conducting research related to benchmarking LLMc
  • Automated tooling and infrastructure for evaluating LLMc
  • Agent and multi-agent approaches to code generation benchmarking

Schedule

June 26, 2024 - All times are in Trondheim, Norway local time (GMT+2).

Start - End         Topic                                                            Presenters
09:15am - 09:30am   Welcome and Overview                                             Organization Committee
09:30am - 09:50am   Benchmarking for AI-for-code products                            Satish Chandra
09:50am - 10:30am   Data Curation for Realistic LLM Benchmarks                       Audris Mockus
10:30am - 11:00am   Break
11:00am - 12:45pm   Benchmarking: current challenges & gaps                          Group 1
01:00pm - 02:00pm   Lunch
02:00pm - 02:30pm   LLM interpretability                                             Ziyu Yao
02:30pm - 03:30pm   LLM interpretability and agentic SE benchmarking                 Group 2
03:30pm - 04:00pm   Break
04:00pm - 04:30pm   Discussion on the future of community benchmark infrastructure   Group 3, Online group
04:30pm - 04:50pm   Wrap-up

Keynotes



...

Audris Mockus
Ericsson-Harlan Mills Chair Professor, University of Tennessee, Knoxville, USA
Research Scientist at Meta

Data Curation for Realistic LLM Benchmarks

Traditional benchmarks often suffer from static datasets, limited adaptability, and inconsistent quality, and so fail to reflect the evolving challenges faced by developers. Data curation techniques may leverage the nature of the collected data with statistical and AI methods to clean, organize, and enhance datasets, ensuring they are diverse, up to date, and representative of real-world coding environments. These approaches should take version history into account (a feature unique to the software engineering domain) instead of treating code samples simply as blobs. The World of Code (WoC) research infrastructure provides the scale (nearly the entire OSS landscape), diversity, and high-performance schema needed to create benchmarks that are both realistic (based on actual commits) and dynamic (via updates to the WoC infrastructure). By leveraging its ability to extract real instances of nearly any software engineering task and its flexible schema mapping blobs, authors, and APIs with version history, researchers and developers can ensure coding assistants are evaluated against challenges that mirror real-world software development practices.
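To make the idea of history-aware curation more concrete, the sketch below shows one possible shape of a benchmark instance built from a real commit (before/after file versions plus metadata) rather than from an isolated code blob. It is a minimal illustration only; all names are hypothetical and do not correspond to any actual World of Code API.

```python
from dataclasses import dataclass, field


@dataclass
class HistoryAwareSample:
    """One benchmark instance derived from a real commit rather than an isolated blob.

    All field and function names are illustrative, not part of any actual
    World of Code API; a real pipeline would populate them from commit
    metadata in a version-history infrastructure.
    """
    repo: str            # e.g. "org/project"
    commit_sha: str      # the real commit the task is derived from
    parent_sha: str      # parent commit, giving the "before" state
    file_path: str       # file touched by the commit
    before_code: str     # file contents at parent_sha
    after_code: str      # file contents at commit_sha (reference solution)
    commit_message: str  # natural-language description of the change
    apis_touched: list = field(default_factory=list)  # APIs referenced in the diff


def to_change_task(sample: HistoryAwareSample) -> dict:
    """Turn a curated sample into a code-change task: given the pre-change
    file and the commit message, a model must produce the post-change file."""
    return {
        "prompt": f"# Task: {sample.commit_message}\n{sample.before_code}",
        "reference": sample.after_code,
        "metadata": {
            "repo": sample.repo,
            "commit": sample.commit_sha,
            "file": sample.file_path,
            "apis": sample.apis_touched,
        },
    }
```

Treating the parent version, the change description, and the touched APIs as first-class fields is what distinguishes such instances from the static, blob-only samples the talk argues against.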

Audris Mockus is the Ericsson-Harlan Mills Chair Professor in the Department of Electrical Engineering and Computer Science at the University of Tennessee, Knoxville, and a research scientist at Meta. He is a leading researcher in empirical software engineering, known for developing innovative methods to analyze digital traces from software development activities. These traces capture the complex interplay of individuals, groups, culture, and artifacts, requiring novel analytical approaches. Mockus has been instrumental in creating research infrastructures such as the World of Code (WoC), which supports studies of open-source software ecosystems. His work spans diverse domains, including big data, software engineering, and forensic anthropology.


...

Satish Chandra
Software Engineer at Google

Benchmarking for AI-for-code products

I will talk briefly about my experiences with benchmarking for AI-for-code products, and offer some general lessons.

Satish Chandra is a software engineer at Google, where he applies machine learning techniques to improve developer productivity and leads work on internal developer infrastructure that uses these techniques.


...

Ziyu Yao
Assistant Professor
George Mason University, USA

LLM interpretability

While benchmarks are now prevalently used to evaluate LLMs and LLMs for code, they do not reveal why a model works (or does not), and they can be gamed as well. In this talk, I will present our recent research on “LLM interpretability” and show two ways in which it fills the gaps left by benchmarks: (1) measuring and predicting a model's intrinsic performance, and (2) enabling fine-grained model control for improvement.

Ziyu Yao is an Assistant Professor in the Department of Computer Science at George Mason University. Her research focuses on natural language processing and artificial intelligence and their applications across domains, including software engineering and LLMs for code. She organized the LLM Explainability for Reasoning and Planning workshop at COLM 2025 and the NLP for Programming (NLP4Prog) workshop at ACL 2021. Her research has been funded by the National Science Foundation, the Microsoft Accelerate Foundation Models Research Award, and the Virginia Commonwealth Cyber Initiative, among others.

Venue

...