JustEval Leaderboard

| Name | Helpful | Factual | Deep | Clear | Engaging | Safe | Avg | Length |
|---|---|---|---|---|---|---|---|---|
| gpt-4-0314 | 4.90 | 4.90 | 4.57 | 4.99 | 4.62 | 4.74 | 4.79 | 226.41 |
| gpt-4-0613 | 4.86 | 4.90 | 4.49 | 4.99 | 4.61 | 4.97 | 4.80 | 186.06 |
| Yi-34B-Chat | 4.86 | 4.82 | 4.79 | 4.97 | 4.85 | 4.92 | 4.87 | 376.27 |
| Tulu2-DPO-70B | 4.85 | 4.84 | 4.57 | 4.95 | 4.74 | 4.99 | 4.82 | 258.36 |
| gpt-3.5-turbo | 4.81 | 4.83 | 4.33 | 4.98 | 4.58 | 4.94 | 4.75 | 153.96 |
| Tulu2-70B | 4.77 | 4.78 | 4.32 | 4.95 | 4.57 | 4.81 | 4.70 | 171.85 |
| URIAL=inst_1k-Llama-2-70b-hf | 4.72 | 4.66 | 4.28 | 4.93 | 4.78 | 4.98 | 4.73 | 174.91 |
| URIAL=inst_1k-Llama-2-70B-GPTQ | 4.72 | 4.65 | 4.30 | 4.95 | 4.85 | 4.96 | 4.74 | 171.37 |
| Tulu2-DPO-7B | 4.64 | 4.53 | 4.36 | 4.92 | 4.69 | 4.88 | 4.67 | 240.58 |
| Llama-2-70b-chat | 4.58 | 4.61 | 4.38 | 4.95 | 4.78 | 5.00 | 4.72 | 252.43 |
| URIAL=inst_1k-Mistral-7B | 4.57 | 4.50 | 4.18 | 4.89 | 4.74 | 4.92 | 4.63 | 186.35 |
| Yi-6B-Chat | 4.57 | 4.40 | 4.39 | 4.85 | 4.61 | 4.67 | 4.58 | 357.37 |
| Llama-2-70b-chat-GPTQ | 4.50 | 4.54 | 4.28 | 4.92 | 4.75 | 5.00 | 4.67 | 257.93 |
| Vicuna-7b | 4.43 | 4.33 | 4.04 | 4.85 | 4.51 | 4.60 | 4.46 | 184.82 |
| Mistral-7B-Instruct | 4.36 | 4.29 | 3.89 | 4.87 | 4.47 | 4.75 | 4.44 | 155.36 |
| Llama-2-7b-chat | 4.10 | 4.26 | 3.91 | 4.83 | 4.70 | 5.00 | 4.47 | 246.85 |
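
In each table, Avg is the unweighted mean of the per-column scores in that row, and Length appears to be the model's average response length. A quick arithmetic check against the gpt-4-0314 row above:

```python
# Sanity check: the Avg column is the unweighted mean of the six aspect
# scores (values from the gpt-4-0314 row of the table above).
scores = {"Helpful": 4.90, "Factual": 4.90, "Deep": 4.57,
          "Clear": 4.99, "Engaging": 4.62, "Safe": 4.74}
avg = sum(scores.values()) / len(scores)
print(f"{avg:.2f}")  # -> 4.79, matching the Avg column
```
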
| Name | Info-seek | Reasoning | Procedure | Writing | Role-play | Code | Math | Avg | Length |
|---|---|---|---|---|---|---|---|---|---|
| gpt-4-0314 | 4.91 | 4.88 | 4.96 | 4.96 | 4.66 | 4.86 | 5.00 | 4.89 | 226.41 |
| gpt-4-0613 | 4.83 | 4.83 | 4.98 | 4.93 | 4.66 | 4.86 | 5.00 | 4.87 | 186.06 |
| gpt-3.5-turbo | 4.85 | 4.74 | 4.93 | 4.82 | 4.51 | 4.86 | 5.00 | 4.82 | 153.96 |
| Yi-34B-Chat | 4.90 | 4.88 | 4.91 | 4.73 | 4.77 | 4.69 | 4.81 | 4.81 | 376.27 |
| Tulu2-DPO-70B | 4.87 | 4.87 | 4.95 | 4.95 | 4.60 | 4.55 | 4.19 | 4.71 | 258.36 |
| Tulu2-70B | 4.86 | 4.76 | 4.90 | 4.75 | 4.40 | 4.24 | 4.38 | 4.61 | 171.85 |
| URIAL=inst_1k-Llama-2-70b-hf | 4.85 | 4.72 | 4.86 | 4.74 | 4.23 | 4.07 | 3.94 | 4.49 | 174.91 |
| URIAL=inst_1k-Llama-2-70B-GPTQ | 4.84 | 4.71 | 4.83 | 4.85 | 4.43 | 4.14 | 3.44 | 4.46 | 171.37 |
| Tulu2-DPO-7B | 4.71 | 4.68 | 4.71 | 4.73 | 4.49 | 4.03 | 3.50 | 4.41 | 240.58 |
| URIAL=inst_1k-Mistral-7B | 4.73 | 4.62 | 4.65 | 4.41 | 4.11 | 3.93 | 3.81 | 4.32 | 186.35 |
| Yi-6B-Chat | 4.54 | 4.72 | 4.70 | 4.44 | 4.46 | 3.66 | 3.69 | 4.31 | 357.37 |
| Llama-2-70b-chat | 4.66 | 4.61 | 4.75 | 4.64 | 4.00 | 4.21 | 3.12 | 4.29 | 252.43 |
| Llama-2-70b-chat-GPTQ | 4.59 | 4.59 | 4.54 | 4.60 | 3.80 | 3.97 | 3.25 | 4.19 | 257.93 |
| Mistral-7B-Instruct | 4.27 | 4.45 | 4.54 | 4.44 | 3.91 | 4.00 | 3.56 | 4.17 | 155.36 |
| Vicuna-7b | 4.54 | 4.47 | 4.53 | 4.56 | 4.11 | 3.62 | 2.88 | 4.10 | 184.82 |
| Llama-2-7b-chat | 4.08 | 4.27 | 4.38 | 4.37 | 3.49 | 2.93 | 1.31 | 3.55 | 246.85 |

| Name | Helpful | Factual | Deep | Clear | Engaging | Safe | Avg | Length |
|---|---|---|---|---|---|---|---|---|
| gpt-4-0314 | 4.81 | 4.81 | 4.43 | 4.97 | 4.47 | 4.33 | 4.64 | 179.57 |
| gpt-4-0613 | 4.77 | 4.84 | 4.33 | 4.98 | 4.44 | 4.93 | 4.72 | 149.20 |
| Tulu2-DPO-70B | 4.68 | 4.65 | 4.39 | 4.88 | 4.59 | 5.00 | 4.70 | 234.04 |
| Yi-34B-Chat | 4.67 | 4.60 | 4.57 | 4.94 | 4.72 | 4.77 | 4.71 | 335.64 |
| gpt-3.5-turbo | 4.66 | 4.70 | 4.15 | 4.95 | 4.42 | 4.85 | 4.62 | 135.24 |
| URIAL=inst_1k-Llama-2-70B-GPTQ | 4.42 | 4.25 | 4.02 | 4.85 | 4.74 | 4.89 | 4.53 | 163.40 |
| URIAL=inst_1k-Llama-2-70b-hf | 4.40 | 4.31 | 3.97 | 4.84 | 4.60 | 4.95 | 4.51 | 158.12 |
| URIAL=inst_1k-Mistral-7B | 4.15 | 3.98 | 3.78 | 4.75 | 4.52 | 4.79 | 4.33 | 165.65 |
| Tulu2-DPO-7B | 4.11 | 3.94 | 3.87 | 4.78 | 4.47 | 4.68 | 4.31 | 217.56 |
| Llama-2-70b-chat | 4.08 | 4.17 | 3.91 | 4.87 | 4.63 | 5.00 | 4.44 | 201.27 |
| Yi-6B-Chat | 4.02 | 3.81 | 3.87 | 4.65 | 4.31 | 4.12 | 4.13 | 345.96 |
| Llama-2-70b-chat-GPTQ | 3.85 | 4.00 | 3.68 | 4.79 | 4.53 | 5.00 | 4.31 | 203.90 |
| Mistral-7B-Instruct | 3.78 | 3.72 | 3.36 | 4.70 | 4.16 | 4.33 | 4.01 | 131.24 |
| Vicuna-7b | 3.57 | 3.45 | 3.28 | 4.59 | 4.09 | 4.00 | 3.83 | 156.72 |
| Llama-2-7b-chat | 2.62 | 3.23 | 2.64 | 4.51 | 4.28 | 5.00 | 3.71 | 169.09 |

| Name | AlpacaEval | LIMA | MT-bench (1st turn) | Safety | Avg | Length |
|---|---|---|---|---|---|---|
| gpt-4-0613 | 4.87 | 4.83 | 4.94 | 4.97 | 4.90 | 186.06 |
| gpt-4-0314 | 4.90 | 4.90 | 4.95 | 4.74 | 4.87 | 226.41 |
| Tulu2-DPO-70B | 4.85 | 4.87 | 4.78 | 4.99 | 4.87 | 258.36 |
| Yi-34B-Chat | 4.84 | 4.92 | 4.76 | 4.92 | 4.86 | 376.27 |
| gpt-3.5-turbo | 4.81 | 4.80 | 4.85 | 4.94 | 4.85 | 153.96 |
| Tulu2-70B | 4.80 | 4.78 | 4.61 | 4.81 | 4.75 | 171.85 |
| URIAL=inst_1k-Llama-2-70B-GPTQ | 4.75 | 4.73 | 4.51 | 4.96 | 4.74 | 171.37 |
| URIAL=inst_1k-Llama-2-70b-hf | 4.76 | 4.74 | 4.44 | 4.98 | 4.73 | 174.91 |
| URIAL=inst_1k-Mistral-7B | 4.60 | 4.56 | 4.49 | 4.92 | 4.64 | 186.35 |
| Tulu2-DPO-7B | 4.64 | 4.75 | 4.29 | 4.88 | 4.64 | 240.58 |
| Llama-2-70b-chat | 4.64 | 4.57 | 4.33 | 5.00 | 4.63 | 252.43 |
| Llama-2-70b-chat-GPTQ | 4.55 | 4.47 | 4.33 | 5.00 | 4.59 | 257.93 |
| Yi-6B-Chat | 4.53 | 4.73 | 4.19 | 4.67 | 4.53 | 357.37 |
| Mistral-7B-Instruct | 4.32 | 4.44 | 4.25 | 4.75 | 4.44 | 155.36 |
| Vicuna-7b | 4.42 | 4.52 | 4.14 | 4.60 | 4.42 | 184.82 |
| Llama-2-7b-chat | 4.17 | 4.14 | 3.60 | 5.00 | 4.23 | 246.85 |

More results are coming soon! Please stay tuned!

Just-Eval-Instruct: Highlights

🤗 Hugging Face Dataset: re-align/just-eval-instruct
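
A minimal sketch for loading the dataset with the 🤗 `datasets` library is shown below; the split name and the printed fields are assumptions for illustration, so check the dataset card for the exact configs and schema.

```python
from datasets import load_dataset

# Load Just-Eval-Instruct from the Hugging Face Hub.
# NOTE: the split name ("test") is an assumption for illustration; see
# https://huggingface.co/datasets/re-align/just-eval-instruct for the
# actual configs/splits and fields.
ds = load_dataset("re-align/just-eval-instruct", split="test")
print(len(ds))  # expect 1,000 instructions in total
print(ds[0])    # inspect one example and its category tags
```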

  • Data sources: AlpacaEval (covering 5 datasets), LIMA-test, MT-bench, Anthropic red-teaming, and MaliciousInstruct.
  • 1K examples: 1,000 instructions in total, of which 800 are for the problem-solving test and 200 are for the safety test.
  • Categories: each example is tagged with one or more labels for its task type and topic.
  • Aspects for evaluation: Helpfulness, Clarity, Factuality, Depth, Engagement, and Safety.
  • Evaluation: we use GPT-4 to score each model's outputs from 1 to 5 on these aspects and to provide rationales for its scores; an illustrative sketch of such a scoring call is shown after this list.
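
For illustration only, here is a hedged sketch of what a multi-aspect judging call can look like. The prompt wording, judge model name, and JSON output format below are assumptions, not the official Just-Eval templates; see the paper and the re-align repository for the exact prompts.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ASPECTS = ["helpfulness", "clarity", "factuality", "depth", "engagement", "safety"]

def judge(instruction: str, response: str) -> str:
    """Ask a GPT-4 judge for 1-5 scores plus rationales on each aspect.

    The prompt below is an illustrative stand-in, NOT the official
    Just-Eval template (see the paper/repo for the real prompts).
    """
    prompt = (
        "You are evaluating the RESPONSE to the INSTRUCTION.\n"
        f"For each aspect in {ASPECTS}, give an integer score from 1 to 5 "
        "and a one-sentence rationale. Answer as a JSON object.\n\n"
        f"INSTRUCTION:\n{instruction}\n\nRESPONSE:\n{response}"
    )
    out = client.chat.completions.create(
        model="gpt-4",  # judge model; the exact judge version may differ
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judging
    )
    return out.choices[0].message.content
```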

How to submit a new result

TBA. Please stay tuned!

Citation

@article{Lin2023ReAlign,
    author = {Bill Yuchen Lin and Abhilasha Ravichander and Ximing Lu and Nouha Dziri and Melanie Sclar and Khyathi Chandu and Chandra Bhagavatula and Yejin Choi},
    title = {The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning},
    journal = {ArXiv preprint},
    year = {2023},
    eprint = {2312.01552},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}