
Code Interpreter Benchmark

Introduction

To assess an LLM's ability to use the Python Code Interpreter for tasks such as mathematical problem solving, data visualization, and general-purpose work like file handling and web scraping, we have created and open-sourced a benchmark designed specifically to evaluate these capabilities.

Metrics

The metrics are divided into two parts: code executability and code correctness.

  • Code executability: whether the LLM-generated code can be executed.
  • Code correctness: whether the result of executing the LLM-generated code is correct.

Domain

When evaluating code executability, we further divide the tasks into three domains: Math, Visualization, and General problem-solving. For code correctness, we calculate accuracy for Math and Visualization.
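As a rough illustration of how per-domain rates like these could be tallied, the sketch below assumes each evaluated sample is a dict with the hypothetical keys `domain`, `executable`, and `correct`. These names and the sample records are made up for illustration and are not taken from the benchmark's code.

```python
from collections import defaultdict


def rate_by_domain(samples, flag):
    """Return, per domain, the percentage of samples whose `flag` is true."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for sample in samples:
        totals[sample["domain"]] += 1
        hits[sample["domain"]] += bool(sample[flag])
    return {d: round(100.0 * hits[d] / totals[d], 1) for d in totals}


# Hypothetical records, invented for this example only.
samples = [
    {"domain": "math", "executable": True, "correct": True},
    {"domain": "math", "executable": True, "correct": False},
    {"domain": "visualization", "executable": False, "correct": False},
    {"domain": "visualization", "executable": True, "correct": True},
]

executable_rate = rate_by_domain(samples, "executable")
accuracy = rate_by_domain(samples, "correct")
print(executable_rate, accuracy)
```

The same helper serves both metrics because each one is a per-domain fraction of samples satisfying a boolean condition.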

Results

Executable Rate of Generated Code (%)

Model                     Math↑   Visualization↑   General↑
GPT-4                     91.9    85.9             82.8
GPT-3.5                   89.2    65.0             74.1
LLaMA2-7B-Chat            41.9    33.1             24.1
LLaMA2-13B-Chat           50.0    40.5             48.3
CodeLLaMA-7B-Instruct     85.1    54.0             70.7
CodeLLaMA-13B-Instruct    93.2    55.8             74.1
InternLM-7B-Chat-v1.1     78.4    44.2             62.1
InternLM-20B-Chat         70.3    44.2             65.5
Qwen-7B-Chat              82.4    64.4             67.2
Qwen-14B-Chat             89.2    84.1             65.5

Accuracy of Code Execution Results (%)

Model                     Math↑   Visualization-Hard↑   Visualization-Easy↑
GPT-4                     82.8    66.7                  60.8
GPT-3.5                   47.3    33.3                  55.7
LLaMA2-7B-Chat            3.9     14.3                  39.2
LLaMA2-13B-Chat           8.3     8.3                   40.5
CodeLLaMA-7B-Instruct     14.3    26.2                  60.8
CodeLLaMA-13B-Instruct    28.2    27.4                  62.0
InternLM-7B-Chat-v1.1     28.5    4.8                   40.5
InternLM-20B-Chat         34.6    21.4                  45.6
Qwen-7B-Chat              41.9    40.5                  54.4
Qwen-14B-Chat             58.4    53.6                  59.5

Usage

Installation

git clone https://github.com/QwenLM/Qwen-Agent.git
cd Qwen-Agent/benchmark
pip install -r requirements.txt

Dataset Download

cd benchmark  # skip this step if you are already in the benchmark directory
wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/assets/qwen_agent/benchmark_code_interpreter_data.zip
unzip benchmark_code_interpreter_data.zip
mkdir eval_data
mv eval_code_interpreter_v1.jsonl eval_data/
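To confirm that the download and move succeeded, a quick check like the following could be run from the benchmark directory. It assumes only that the file is in JSON Lines format (one JSON object per line), as the .jsonl extension suggests. The helper name `load_jsonl` is illustrative, and no field names are assumed.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read one JSON value per non-empty line of a JSON Lines file."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


data_file = Path("eval_data") / "eval_code_interpreter_v1.jsonl"
if data_file.exists():
    records = load_jsonl(data_file)
    print(f"Loaded {len(records)} records from {data_file}")
else:
    print(f"{data_file} not found; check the download steps above")
```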

Evaluation

To reproduce the full benchmark results, run the following script:

python inference_and_execute.py --model {model_name}

{model_name}:

  • qwen-1.8b-chat
  • qwen-7b-chat
  • qwen-14b-chat
  • llama-2-7b-chat
  • llama-2-13b-chat
  • codellama-7b-instruct
  • codellama-13b-instruct
  • internlm-7b-chat-1.1
  • internlm-20b-chat

The benchmark will run the test cases and generate the performance results. The results will be saved in the output_data directory.

Note: Install the simhei.ttf font so that matplotlib renders text correctly when evaluating the visualization task. Obtain simhei.ttf (it can be found on any Windows PC), place it in the current directory, and run the following code snippet:

import glob
import os
import shutil

import matplotlib

# Copy simhei.ttf from the current directory into the fonts/ttf directory next to matplotlibrc.
mpl_data_dir = os.path.dirname(os.path.abspath(matplotlib.matplotlib_fname()))
target_font_path = os.path.join(mpl_data_dir, 'fonts', 'ttf', 'simhei.ttf')
shutil.copy('simhei.ttf', target_font_path)

# Remove the cached font list so matplotlib rebuilds it and finds the new font.
for cache_file in glob.glob(os.path.join(matplotlib.get_cachedir(), 'fontlist-*.json')):
    os.remove(cache_file)

Code Executable Rate

python inference_and_execute.py --task {task_name} --model {model_name}

{task_name}:

  • all_ci: All tasks including Math / Visualization / General problem-solving
  • visualization: Visualization task
  • math: Math task
  • general: General problem-solving task

Code Correctness Rate

python inference_and_execute.py --task {task_name} --model {model_name}

{task_name}:

  • visualization: Visualization task
  • gsm8k: Math task
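When several models or tasks need to be evaluated in turn, the documented command can be driven from a small script. The sketch below is one possible wrapper, not part of the benchmark: the helper names `build_command` and `run_all` are hypothetical, and it defaults to a dry run that only prints the commands.

```python
import subprocess
import sys

# Any subset of the model and task names listed above.
MODELS = ["qwen-7b-chat", "qwen-14b-chat"]
TASKS = ["visualization", "gsm8k"]


def build_command(model, task):
    """Assemble the documented command line for one model and one task."""
    return [sys.executable, "inference_and_execute.py",
            "--task", task, "--model", model]


def run_all(models, tasks, dry_run=True):
    """Print every command; execute it only when dry_run is False."""
    commands = [build_command(m, t) for m in models for t in tasks]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing run
    return commands


commands = run_all(MODELS, TASKS)  # dry run: nothing is executed
```

Passing `dry_run=False` from inside the benchmark directory would execute the runs sequentially.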

Configuration

The inference_and_execute.py file contains the following configurable options:

  • --model: The model to test, which can be one of qwen-14b-chat, qwen-7b-chat, qwen-1.8b-chat, llama-2-7b-chat, llama-2-13b-chat, codellama-7b-instruct, codellama-13b-instruct, internlm-7b-chat-1.1, internlm-20b-chat.
  • --task: The test task, which can be one of all, all_ci, visualization, math, general, gsm8k.
  • --output-path: The path where evaluation results are saved.
  • --input-path: The path where evaluation data is located.
  • --output-fname: The file name for the evaluation results.
  • --input-fname: The file name of the evaluation data.
  • --force: Force regeneration, overwriting the cached results.
  • --eval-only: Only calculate evaluation metrics, without re-running inference.
  • --eval-code-exec-only: Only evaluate the code executable rate.
  • --gen-exec-only: Only generate and execute code, without calculating evaluation metrics.
  • --gen-only: Only generate code, without executing it or calculating evaluation metrics.
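For readers who want a mental model of how these options combine, the following argparse sketch mirrors the flag names listed above. It is an illustration only and is not the parser in the script: the defaults, the required/optional choices, and the use of `store_true` for the switches are all assumptions.

```python
import argparse

MODELS = [
    "qwen-1.8b-chat", "qwen-7b-chat", "qwen-14b-chat",
    "llama-2-7b-chat", "llama-2-13b-chat",
    "codellama-7b-instruct", "codellama-13b-instruct",
    "internlm-7b-chat-1.1", "internlm-20b-chat",
]
TASKS = ["all", "all_ci", "visualization", "math", "general", "gsm8k"]


def build_parser():
    # Flag names follow the list above; every default here is an assumption.
    parser = argparse.ArgumentParser(description="Illustrative CLI sketch")
    parser.add_argument("--model", choices=MODELS, required=True)
    parser.add_argument("--task", choices=TASKS, default="all")
    parser.add_argument("--output-path", default="output_data")
    parser.add_argument("--input-path", default="eval_data")
    parser.add_argument("--output-fname", default=None)
    parser.add_argument("--input-fname", default="eval_code_interpreter_v1.jsonl")
    for flag in ("--force", "--eval-only", "--eval-code-exec-only",
                 "--gen-exec-only", "--gen-only"):
        parser.add_argument(flag, action="store_true")
    return parser


args = build_parser().parse_args(
    ["--model", "qwen-7b-chat", "--task", "math", "--eval-only"])
print(args.model, args.task, args.eval_only, args.gen_only)
```

Note that argparse converts dashes in option names to underscores, so `--eval-only` is read back as `args.eval_only`.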