# Code Interpreter Benchmark
## Introduction
To assess LLMs' ability to use the Python code interpreter for tasks such as mathematical problem solving, data visualization, and general-purpose tasks such as file handling and web scraping, we have created and open-sourced a benchmark specifically designed to evaluate these capabilities.
### Metrics
The metrics are divided into two parts: code executability and code correctness.
- Code executability: whether the LLM-generated code can be executed without errors.
- Code correctness: whether the LLM-generated code produces correct results when executed.
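To make the distinction concrete, the sketch below shows what a minimal executability check might look like. It is illustrative only: the function name and harness details are assumptions, not the benchmark's actual implementation.

```python
import subprocess
import tempfile


def is_executable(code: str, timeout: int = 30) -> bool:
    """Return True if the generated code runs to completion without error."""
    # Write the generated code to a temporary script file.
    with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
        f.write(code)
        script_path = f.name
    try:
        # "Executable" here means the script exits with status 0 before the timeout.
        result = subprocess.run(['python', script_path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```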
### Domain
For code executability, we evaluate three domains: `Math`, `Visualization`, and `General problem-solving`.
For code correctness, we calculate accuracy rates on `Math` and `Visualization`.
## Results
**Executable Rate of Generated Code (%)**

| Model | Math↑ | Visualization↑ | General↑ |
|---|---|---|---|
| GPT-4 | 91.9 | 85.9 | 82.8 |
| GPT-3.5 | 89.2 | 65.0 | 74.1 |
| LLaMA2-7B-Chat | 41.9 | 33.1 | 24.1 |
| LLaMA2-13B-Chat | 50.0 | 40.5 | 48.3 |
| CodeLLaMA-7B-Instruct | 85.1 | 54.0 | 70.7 |
| CodeLLaMA-13B-Instruct | 93.2 | 55.8 | 74.1 |
| InternLM-7B-Chat-v1.1 | 78.4 | 44.2 | 62.1 |
| InternLM-20B-Chat | 70.3 | 44.2 | 65.5 |
| Qwen-7B-Chat | 82.4 | 64.4 | 67.2 |
| Qwen-14B-Chat | 89.2 | 84.1 | 65.5 |
**Accuracy of Code Execution Results (%)**

| Model | Math↑ | Visualization-Hard↑ | Visualization-Easy↑ |
|---|---|---|---|
| GPT-4 | 82.8 | 66.7 | 60.8 |
| GPT-3.5 | 47.3 | 33.3 | 55.7 |
| LLaMA2-7B-Chat | 3.9 | 14.3 | 39.2 |
| LLaMA2-13B-Chat | 8.3 | 8.3 | 40.5 |
| CodeLLaMA-7B-Instruct | 14.3 | 26.2 | 60.8 |
| CodeLLaMA-13B-Instruct | 28.2 | 27.4 | 62.0 |
| InternLM-7B-Chat-v1.1 | 28.5 | 4.8 | 40.5 |
| InternLM-20B-Chat | 34.6 | 21.4 | 45.6 |
| Qwen-7B-Chat | 41.9 | 40.5 | 54.4 |
| Qwen-14B-Chat | 58.4 | 53.6 | 59.5 |
## Usage
### Installation
```shell
git clone https://github.com/QwenLM/Qwen-Agent.git
cd Qwen-Agent/benchmark
pip install -r requirements.txt
```
### Dataset Download
```shell
# Run from the Qwen-Agent/benchmark directory
wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/assets/qwen_agent/benchmark_code_interpreter_data.zip
unzip benchmark_code_interpreter_data.zip
mkdir eval_data
mv eval_code_interpreter_v1.jsonl eval_data/
```
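The downloaded file stores one JSON object per line. The snippet below is a quick way to peek at the schema of the evaluation samples without assuming anything about it (it only prints whatever keys your local copy contains):

```python
import json

# Print the keys of the first few evaluation samples to inspect the schema.
with open('eval_data/eval_code_interpreter_v1.jsonl', encoding='utf-8') as f:
    for i, line in enumerate(f):
        sample = json.loads(line)
        print(sorted(sample.keys()))
        if i >= 2:
            break
```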
### Evaluation
To reproduce the full results of the benchmark, run the following script:
```shell
python inference_and_execute.py --model {model_name}
```
{model_name}:
- qwen-1.8b-chat
- qwen-7b-chat
- qwen-14b-chat
- llama-2-7b-chat
- llama-2-13b-chat
- codellama-7b-instruct
- codellama-13b-instruct
- internlm-7b-chat-1.1
- internlm-20b-chat
The benchmark will run the test cases, generate the performance results, and save them in the `output_data` directory.
**Notes**:
Please install the `simhei.ttf` font so that matplotlib displays the visualization tasks properly. To do so, obtain `simhei.ttf` (it can be found on any Windows PC) and run the following snippet:
```python
import os

import matplotlib

# Copy simhei.ttf into matplotlib's bundled font directory.
target_font_path = os.path.join(
    os.path.abspath(os.path.join(matplotlib.matplotlib_fname(), os.path.pardir)),
    'fonts', 'ttf', 'simhei.ttf')
os.system(f'cp simhei.ttf {target_font_path}')

# Clear matplotlib's font cache so the new font is picked up on the next import.
font_list_cache = os.path.join(matplotlib.get_cachedir(), 'fontlist-*.json')
os.system(f'rm -f {font_list_cache}')
```
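Afterwards, you can check that matplotlib resolves the font (a quick sanity check, not part of the benchmark). If the printed path does not end in `simhei.ttf`, the copy or cache-clearing step above did not take effect:

```python
from matplotlib import font_manager

# Should print the path to simhei.ttf rather than a fallback font.
print(font_manager.findfont('SimHei'))
```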
#### Code Executable Rate
```shell
python inference_and_execute.py --task {task_name} --model {model_name}
```
{task_name}:
- `all_ci`: All tasks including Math / Visualization / General problem-solving
- `visualization`: Visualization task
- `math`: Math task
- `general`: General problem-solving task
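For example, `python inference_and_execute.py --task visualization --model qwen-7b-chat` measures the executable rate on the visualization task alone.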
#### Code Correctness Rate
```shell
python inference_and_execute.py --task {task_name} --model {model_name}
```
{task_name}:
- `visualization`: Visualization task
- `gsm8k`: Math task
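For example, `python inference_and_execute.py --task gsm8k --model qwen-7b-chat` reports the accuracy of the code execution results on the math task.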
## Configuration
The `inference_and_execute.py` script supports the following configurable options:
- `--model`: The model to test, which can be one of `qwen-14b-chat`, `qwen-7b-chat`, `qwen-1.8b-chat`, `llama-2-7b-chat`, `llama-2-13b-chat`, `codellama-7b-instruct`, `codellama-13b-instruct`, `internlm-7b-chat-1.1`, `internlm-20b-chat`.
- `--task`: The test task, which can be one of `all`, `all_ci`, `visualization`, `math`, `general`, `gsm8k`.
- `--output-path`: The path for saving evaluation results.
- `--input-path`: The path where the evaluation data is placed.
- `--output-fname`: The file name for the evaluation results.
- `--input-fname`: The file name of the evaluation data.
- `--force`: Force generation, overwriting any cached results.
- `--eval-only`: Only calculate evaluation metrics without re-running inference.
- `--eval-code-exec-only`: Only evaluate the code executable rate.
- `--gen-exec-only`: Only generate and execute code, without calculating evaluation metrics.
- `--gen-only`: Only generate code, without executing it or calculating evaluation metrics.
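For example, `python inference_and_execute.py --model qwen-14b-chat --task gsm8k --eval-only` recomputes the math accuracy from previously cached generations without re-running inference.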