LogEval is a comprehensive benchmark suite designed to evaluate the capabilities of Large Language Models (LLMs) in log parsing, anomaly detection, fault diagnosis, and log summarization. LogEval uses 4,000 publicly available log data entries and 15 different prompts per task to rigorously evaluate multiple mainstream LLMs. We examine the models’ performance under self-consistency and few-shot learning, and discuss findings related to model quantization, Chinese-English question-answering evaluation, and prompt engineering. LogEval’s evaluation results reveal the strengths and limitations of LLMs in log analysis tasks, providing researchers with a valuable reference when selecting models for such tasks. We will continuously update the model evaluations to promote further research and development.
```bibtex
@misc{cui2024logevalcomprehensivebenchmarksuite,
      title={LogEval: A Comprehensive Benchmark Suite for Large Language Models In Log Analysis},
      author={Tianyu Cui and Shiyu Ma and Ziang Chen and Tong Xiao and Shimin Tao and Yilun Liu and Shenglin Zhang and Duoming Lin and Changchang Liu and Yuzhe Cai and Weibin Meng and Yongqian Sun and Dan Pei},
      year={2024},
      eprint={2407.01896},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.01896},
}
```
Update time:
| Model | Accuracy Rate (Chinese) | Edit Distance (Chinese) | Accuracy Rate (English) | Edit Distance (English) |
|---|---|---|---|---|
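
The table above reports Accuracy Rate and Edit Distance, the metrics typically used to score log parsing. Below is a minimal sketch, not LogEval’s actual scoring code, of how such metrics are commonly computed for a single model answer; the function names and the normalization by the longer string are our assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (or match)
            ))
        prev = curr
    return prev[-1]

def score_parsing(prediction: str, reference: str) -> tuple[int, float]:
    """Return (exact-match accuracy in {0, 1}, normalized edit distance in [0, 1])."""
    pred, ref = prediction.strip(), reference.strip()
    accuracy = int(pred == ref)
    denom = max(len(pred), len(ref)) or 1
    return accuracy, levenshtein(pred, ref) / denom

# Example: the model nearly recovers the template for one log line.
acc, dist = score_parsing("Connection from <*> closed", "Connection from <*> closed.")
print(acc, round(dist, 3))  # 0 0.037
```

Normalizing by the longer string keeps the distance in [0, 1], so scores remain comparable across log templates of different lengths.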
| Model | Accuracy Rate (Chinese) | F1-Score (Chinese) | Accuracy Rate (English) | F1-Score (English) |
|---|---|---|---|---|
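
This table’s Accuracy Rate and F1-Score fit a binary classification view of the task, where each log (or log sequence) is labeled anomalous or normal. The following sketch, with illustrative names and data rather than LogEval’s code, shows one standard way to compute both from per-item predictions.

```python
def accuracy_and_f1(preds: list[int], labels: list[int]) -> tuple[float, float]:
    """Accuracy and F1 over binary labels (1 = anomalous, 0 = normal)."""
    tp = sum(p and l for p, l in zip(preds, labels))          # true positives
    fp = sum(p and not l for p, l in zip(preds, labels))      # false positives
    fn = sum((not p) and l for p, l in zip(preds, labels))    # false negatives
    acc = sum(p == l for p, l in zip(preds, labels)) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, f1

# Hypothetical predictions and ground-truth labels for eight log entries.
preds  = [1, 0, 1, 1, 0, 0, 1, 0]
labels = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy_and_f1(preds, labels))  # (0.75, 0.75)
```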
| Model | Accuracy Rate (Chinese) | F1-Score (Chinese) | F1-Score Variance (Chinese) | Accuracy Rate (English) | F1-Score (English) | F1-Score Variance (English) |
|---|---|---|---|---|---|---|
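
The F1-Score Variance column implies aggregation over repeated runs; since each task uses 15 different prompts (per the abstract), one plausible reading is that the variance is taken across per-prompt F1 scores, capturing a model’s sensitivity to prompt wording. A hedged sketch under that assumption, with hypothetical scores:

```python
from statistics import mean, pvariance

def aggregate_f1(per_prompt_f1: list[float]) -> tuple[float, float]:
    """Mean F1 and population variance across prompt variants."""
    return mean(per_prompt_f1), pvariance(per_prompt_f1)

# Hypothetical F1 scores of one model across one task's 15 prompt variants.
f1_scores = [0.71, 0.69, 0.73, 0.70, 0.68, 0.72, 0.74, 0.69,
             0.71, 0.70, 0.67, 0.73, 0.72, 0.70, 0.71]
f1_mean, f1_var = aggregate_f1(f1_scores)
print(f"F1 = {f1_mean:.3f}, variance = {f1_var:.5f}")
```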