
Code Interpreter Benchmark

Introduction

To assess an LLM's ability to use the Python Code Interpreter for mathematical problem solving, data visualization, and general-purpose tasks such as file handling and web scraping, we have created and open-sourced a benchmark designed specifically to evaluate these capabilities.

Metrics

The metrics are divided into two parts: code executability and code correctness.

Code executability: whether the LLM-generated code can be executed without error.
Code correctness: whether the executed code produces the correct result.
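The difference between the two metrics can be illustrated with a minimal sketch. It is not the benchmark's own scoring code: the helpers `run_snippet` and `is_correct` are hypothetical, and the sketch assumes that "executable" means the snippet exits with status 0 and that "correct" means its printed output matches an expected string.

```python
import subprocess
import sys


def run_snippet(code: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Run `code` in a fresh Python process.

    Returns (executable, stdout). `executable` is True only if the
    process finishes within `timeout` seconds with exit status 0.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, ""
    return proc.returncode == 0, proc.stdout.strip()


def is_correct(code: str, expected: str) -> bool:
    """Assumed correctness check: the code runs and prints `expected`."""
    executable, output = run_snippet(code)
    return executable and output == expected
```

Under these assumptions a snippet can be executable without being correct: `print(2 * 3)` runs cleanly, but it only counts as correct when the expected answer is `6`.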

Domain

Code executability is reported separately for three domains: Math, Visualization, and General problem-solving. Code correctness is reported as accuracy for Math and Visualization only, with Visualization split into Hard and Easy subsets.
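Per-domain rates of this kind can be aggregated as in the sketch below. The `(domain, passed)` record shape and the function name `rate_by_domain` are assumptions made for illustration, and the sample records are invented, not benchmark data.

```python
from collections import defaultdict


def rate_by_domain(records):
    """Return {domain: percentage of records that passed}, rounded to 0.1.

    `records` is an iterable of (domain, passed) pairs; this shape is an
    assumption for illustration, not the benchmark's result format.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for domain, ok in records:
        totals[domain] += 1
        passed[domain] += int(ok)
    return {d: round(100.0 * passed[d] / totals[d], 1) for d in totals}


# Invented sample outcomes, for illustration only.
sample = [
    ("math", True),
    ("math", False),
    ("visualization", True),
    ("general", True),
    ("general", True),
    ("general", False),
]
print(rate_by_domain(sample))
```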

Results

Executable Rate of Generated Code (%)
Model	Math↑	Visualization↑	General↑
GPT-4	91.9	85.9	82.8
GPT-3.5	89.2	65.0	74.1
LLaMA2-7B-Chat	41.9	33.1	24.1
LLaMA2-13B-Chat	50.0	40.5	48.3
CodeLLaMA-7B-Instruct	85.1	54.0	70.7
CodeLLaMA-13B-Instruct	93.2	55.8	74.1
InternLM-7B-Chat-v1.1	78.4	44.2	62.1
InternLM-20B-Chat	70.3	44.2	65.5
Qwen-7B-Chat	82.4	64.4	67.2
Qwen-14B-Chat	89.2	84.1	65.5

Accuracy of Code Execution Results (%)
Model	Math↑	Visualization-Hard↑	Visualization-Easy↑
GPT-4	82.8	66.7	60.8
GPT-3.5	47.3	33.3	55.7
LLaMA2-7B-Chat	3.9	14.3	39.2
LLaMA2-13B-Chat	8.3	8.3	40.5
CodeLLaMA-7B-Instruct	14.3	26.2	60.8
CodeLLaMA-13B-Instruct	28.2	27.4	62.0
InternLM-7B-Chat-v1.1	28.5	4.8	40.5
InternLM-20B-Chat	34.6	21.4	45.6
Qwen-7B-Chat	41.9	40.5	54.4
Qwen-14B-Chat	58.4	53.6	59.5

Usage

Installation

git clone https://github.com/QwenLM/Qwen-Agent.git
cd Qwen-Agent/benchmark
pip install -r requirements.txt

Dataset Download

cd benchmark
wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/assets/qwen_agent/benchmark_code_interpreter_data.zip
unzip benchmark_code_interpreter_data.zip
mkdir eval_data
mv eval_code_interpreter_v1.jsonl eval_data/
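After the download, a quick sanity check is to load the file and look at what it contains. The sketch below assumes only that the file follows the usual JSON Lines convention of one JSON object per line; it makes no assumption about field names and simply prints the keys of the first record. `load_jsonl` is a hypothetical helper, not part of the benchmark.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    data_file = Path("eval_data/eval_code_interpreter_v1.jsonl")
    if data_file.exists():
        records = load_jsonl(data_file)
        print(f"{len(records)} records")
        # Assumes each record is a JSON object; show its field names.
        if records and isinstance(records[0], dict):
            print(sorted(records[0].keys()))
    else:
        print(f"{data_file} not found; run the download steps first")
```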

Evaluation

To reproduce the full benchmark results, run the following script:

python inference_and_execute.py --model {model_name}

{model_name}:

qwen-1.8b-chat
qwen-7b-chat
qwen-14b-chat
llama-2-7b-chat
llama-2-13b-chat
codellama-7b-instruct
codellama-13b-instruct
internlm-7b-chat-1.1
internlm-20b-chat

The benchmark runs the test cases and generates the performance results, which are saved in the output_data directory.

Note: When evaluating visualization tasks, install the simhei.ttf font so that matplotlib renders text correctly. Obtain simhei.ttf (it can be found on any Windows PC) and run the following snippet from the directory that contains it:

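import glob
import os
import shutil

import matplotlib

# Copy simhei.ttf into matplotlib's bundled TrueType font directory.
mpl_data_dir = os.path.dirname(os.path.abspath(matplotlib.matplotlib_fname()))
target_font_path = os.path.join(mpl_data_dir, 'fonts', 'ttf', 'simhei.ttf')
shutil.copy('simhei.ttf', target_font_path)

# Delete the cached font list so matplotlib rebuilds it and finds the new font.
font_list_cache = os.path.join(matplotlib.get_cachedir(), 'fontlist-*.json')
for cache_file in glob.glob(font_list_cache):
    os.remove(cache_file)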
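To confirm that the font was picked up, one option is to ask matplotlib's font manager for it with fallback disabled. This is a sketch under the assumption that the font registers under the family name `SimHei`; the helper name `simhei_installed` is hypothetical.

```python
def simhei_installed() -> bool:
    """Return True if matplotlib can resolve the SimHei font family."""
    try:
        from matplotlib import font_manager
    except ImportError:
        # matplotlib itself is missing, so the font cannot be found.
        return False
    try:
        # With fallback disabled, findfont raises ValueError on a miss
        # instead of silently returning the default font.
        font_manager.findfont("SimHei", fallback_to_default=False)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    print("SimHei available:", simhei_installed())
```

If this prints `False` after copying the font, the font cache may not have been rebuilt yet; restarting the Python process usually triggers the rebuild.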

Code Executable Rate

python inference_and_execute.py --task {task_name} --model {model_name}

{task_name}:

all_ci: All tasks including Math / Visualization / General problem-solving
visualization: Visualization task
math: Math task
general: General problem-solving task
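When scripting several runs, it can help to validate the task and model names before launching anything. The sketch below builds the command line from the names listed in this README. The helper `build_command` and the client-side validation are assumptions for illustration; the script's own handling of unknown names is not described here.

```python
# Names copied from the lists in this README.
TASKS = ("all_ci", "visualization", "math", "general")
MODELS = (
    "qwen-1.8b-chat",
    "qwen-7b-chat",
    "qwen-14b-chat",
    "llama-2-7b-chat",
    "llama-2-13b-chat",
    "codellama-7b-instruct",
    "codellama-13b-instruct",
    "internlm-7b-chat-1.1",
    "internlm-20b-chat",
)


def build_command(model, task=None):
    """Return the argv list for one benchmark run.

    Raises ValueError for names that are not listed in the README.
    """
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    cmd = ["python", "inference_and_execute.py"]
    if task is not None:
        if task not in TASKS:
            raise ValueError(f"unknown task: {task!r}")
        cmd += ["--task", task]
    cmd += ["--model", model]
    return cmd


if __name__ == "__main__":
    # Print the commands for one model across all tasks.
    for task in TASKS:
        print(" ".join(build_command("qwen-7b-chat", task)))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from inside the benchmark directory to launch the run.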

Code Correctness Rate

python inference_and_execute.py --task {task_name} --model {model_name}