To assess an LLM's ability to use the Python Code Interpreter for tasks such as mathematical problem solving, data visualization, and other general-purpose tasks such as file handling and web scraping, we have created and open-sourced a benchmark specifically designed for evaluating these capabilities.

The metrics are divided into two parts: code executability and code correctness. For code executability, we report results in three domains: Math, Visualization, and General problem-solving. For code correctness, we calculate accuracy rates for Math and Visualization.
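In other words, a sample counts toward the executable rate if its generated code runs without raising an error, and toward the accuracy metric only if the execution result also matches the reference answer. The snippet below is a minimal sketch of that relationship with made-up interfaces; it is not the benchmark's actual scoring code, which lives in its `metrics` package.

```python
# Illustrative sketch only: the sample fields and the `run` callable are
# assumptions for this example, not the benchmark's real interfaces.
from typing import Callable, Iterable, Tuple


def score(samples: Iterable[Tuple[str, str]],
          run: Callable[[str], Tuple[bool, str]]) -> Tuple[float, float]:
    """Return (executable_rate, accuracy) over (code, reference_answer) pairs."""
    total = executable = correct = 0
    for code, reference in samples:
        total += 1
        ok, result = run(code)  # ok=True means the code executed without errors
        executable += int(ok)
        correct += int(ok and result == reference)
    if total == 0:
        return 0.0, 0.0
    return executable / total, correct / total
```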
**Executable Rate of Generated Code (%)**

| Model | Math↑ | Visualization↑ | General↑ |
|---|---|---|---|
| GPT-4 | 91.9 | 85.9 | 82.8 |
| GPT-3.5 | 89.2 | 65.0 | 74.1 |
| LLaMA2-7B-Chat | 41.9 | 33.1 | 24.1 |
| LLaMA2-13B-Chat | 50.0 | 40.5 | 48.3 |
| CodeLLaMA-7B-Instruct | 85.1 | 54.0 | 70.7 |
| CodeLLaMA-13B-Instruct | 93.2 | 55.8 | 74.1 |
| InternLM-7B-Chat-v1.1 | 78.4 | 44.2 | 62.1 |
| InternLM-20B-Chat | 70.3 | 44.2 | 65.5 |
| Qwen-7B-Chat | 82.4 | 64.4 | 67.2 |
| Qwen-14B-Chat | 89.2 | 84.1 | 65.5 |
**Accuracy of Code Execution Results (%)**

| Model | Math↑ | Visualization-Hard↑ | Visualization-Easy↑ |
|---|---|---|---|
| GPT-4 | 82.8 | 66.7 | 60.8 |
| GPT-3.5 | 47.3 | 33.3 | 55.7 |
| LLaMA2-7B-Chat | 3.9 | 14.3 | 39.2 |
| LLaMA2-13B-Chat | 8.3 | 8.3 | 40.5 |
| CodeLLaMA-7B-Instruct | 14.3 | 26.2 | 60.8 |
| CodeLLaMA-13B-Instruct | 28.2 | 27.4 | 62.0 |
| InternLM-7B-Chat-v1.1 | 28.5 | 4.8 | 40.5 |
| InternLM-20B-Chat | 34.6 | 21.4 | 45.6 |
| Qwen-7B-Chat | 41.9 | 40.5 | 54.4 |
| Qwen-14B-Chat | 58.4 | 53.6 | 59.5 |
Clone the repository and install the benchmark's dependencies:

```shell
git clone https://github.com/QwenLM/Qwen-Agent.git
cd Qwen-Agent/benchmark
pip install -r requirements.txt
```

Then download and unpack the evaluation data (run from the repository root; skip the `cd benchmark` if you are already inside that directory):

```shell
cd benchmark
wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/assets/qwen_agent/benchmark_code_interpreter_data.zip
unzip benchmark_code_interpreter_data.zip
mkdir eval_data
mv eval_code_interpreter_v1.jsonl eval_data/
```
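After this step, the evaluation set should sit at `eval_data/eval_code_interpreter_v1.jsonl`. As an optional sanity check (not part of the benchmark scripts), you can count the samples in the downloaded file:

```python
# Optional check: count non-empty lines in the downloaded evaluation file.
with open('eval_data/eval_code_interpreter_v1.jsonl', encoding='utf-8') as f:
    print(sum(1 for line in f if line.strip()), 'evaluation samples')
```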
To reproduce the comprehensive results of the benchmark, run the following script:

```shell
python inference_and_execute.py --model {model_name}
```

`{model_name}` is one of the supported models listed under the configurable options below.

The benchmark will run the test cases and generate the performance results, which are saved in the `output_data` directory.
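For example, to run the full benchmark for Qwen-7B-Chat (any model name from the `--model` list below works the same way):

```shell
python inference_and_execute.py --model qwen-7b-chat
```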
Notes:

Please install the `simhei.ttf` font for proper display in matplotlib when evaluating the visualization task. You can do this by preparing `simhei.ttf` (which can be found on any Windows PC) and then running the following code snippet:

```python
import os
import matplotlib

# Copy simhei.ttf next to matplotlib's bundled fonts
# (the path is resolved relative to the matplotlibrc returned by matplotlib_fname()).
target_font_path = os.path.join(
    os.path.abspath(
        os.path.join(matplotlib.matplotlib_fname(), os.path.pardir)),
    'fonts', 'ttf', 'simhei.ttf')
os.system(f'cp simhei.ttf {target_font_path}')

# Remove the cached font list so matplotlib rebuilds it on the next import.
font_list_cache = os.path.join(matplotlib.get_cachedir(), 'fontlist-*.json')
os.system(f'rm -f {font_list_cache}')
```
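To verify that the font is visible to matplotlib afterwards, a quick check such as the following should suffice (an optional sanity check, not part of the benchmark scripts):

```python
import matplotlib.font_manager as fm

# The font cache is rebuilt on import; SimHei should now appear in the font list.
names = {font.name for font in fm.fontManager.ttflist}
print('SimHei registered:', 'SimHei' in names)
```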
To evaluate the code executable rate on a single task, run:

```shell
python inference_and_execute.py --task {task_name} --model {model_name}
```

{task_name}:

- `all_ci`: All tasks, including Math / Visualization / General problem-solving
- `visualization`: Visualization task
- `math`: Math task
- `general`: General problem-solving task
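For instance, to check only whether the generated visualization code runs (model name illustrative):

```shell
python inference_and_execute.py --task visualization --model qwen-7b-chat
```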
To evaluate the accuracy of the code execution results, run:

```shell
python inference_and_execute.py --task {task_name} --model {model_name}
```

{task_name}:

- `visualization`: Visualization task
- `gsm8k`: Math task

The `inference_and_execute.py` script supports the following configurable options:
- `--model`: The model to test. One of `qwen-14b-chat`, `qwen-7b-chat`, `qwen-1.8b-chat`, `llama-2-7b-chat`, `llama-2-13b-chat`, `codellama-7b-instruct`, `codellama-13b-instruct`, `internlm-7b-chat-1.1`, `internlm-20b-chat`.
- `--task`: The test task. One of `all`, `all_ci`, `visualization`, `math`, `general`, `gsm8k`.
- `--output-path`: The path for saving the evaluation results.
- `--input-path`: The path where the evaluation data is placed.
- `--output-fname`: The file name for the evaluation results.
- `--input-fname`: The file name for the evaluation data.
- `--force`: Force generation, overwriting any cached results.
- `--eval-only`: Only calculate evaluation metrics without re-running inference.
- `--eval-code-exec-only`: Only evaluate the code executable rate.
- `--gen-exec-only`: Only generate and execute code, without calculating evaluation metrics.
- `--gen-only`: Only generate code, without executing it or calculating evaluation metrics.
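For example, to recompute the metrics for a model from previously cached generations without re-running inference (model name illustrative, and assuming its cached results already exist):

```shell
python inference_and_execute.py --model qwen-7b-chat --eval-only
```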