Name | Helpful | Factual | Deep | Clear | Engaging | Safe | Avg | Length |
---|---|---|---|---|---|---|---|---|
gpt-4-0314 | 4.90 | 4.90 | 4.57 | 4.99 | 4.62 | 4.74 | 4.79 | 226.41 |
gpt-4-0613 | 4.86 | 4.90 | 4.49 | 4.99 | 4.61 | 4.97 | 4.80 | 186.06 |
Yi-34B-Chat | 4.86 | 4.82 | 4.79 | 4.97 | 4.85 | 4.92 | 4.87 | 376.27 |
Tulu2-DPO-70B | 4.85 | 4.84 | 4.57 | 4.95 | 4.74 | 4.99 | 4.82 | 258.36 |
gpt-3.5-turbo | 4.81 | 4.83 | 4.33 | 4.98 | 4.58 | 4.94 | 4.75 | 153.96 |
Tulu2-70B | 4.77 | 4.78 | 4.32 | 4.95 | 4.57 | 4.81 | 4.70 | 171.85 |
URIAL=inst_1k-Llama-2-70b-hf | 4.72 | 4.66 | 4.28 | 4.93 | 4.78 | 4.98 | 4.73 | 174.91 |
URIAL=inst_1k-Llama-2-70B-GPTQ | 4.72 | 4.65 | 4.30 | 4.95 | 4.85 | 4.96 | 4.74 | 171.37 |
Tulu2-DPO-7B | 4.64 | 4.53 | 4.36 | 4.92 | 4.69 | 4.88 | 4.67 | 240.58 |
Llama-2-70b-chat | 4.58 | 4.61 | 4.38 | 4.95 | 4.78 | 5.00 | 4.72 | 252.43 |
URIAL=inst_1k-Mistral-7B | 4.57 | 4.50 | 4.18 | 4.89 | 4.74 | 4.92 | 4.63 | 186.35 |
Yi-6B-Chat | 4.57 | 4.40 | 4.39 | 4.85 | 4.61 | 4.67 | 4.58 | 357.37 |
Llama-2-70b-chat-GPTQ | 4.50 | 4.54 | 4.28 | 4.92 | 4.75 | 5.00 | 4.67 | 257.93 |
Vicuna-7b | 4.43 | 4.33 | 4.04 | 4.85 | 4.51 | 4.60 | 4.46 | 184.82 |
Mistral-7B-Instruct | 4.36 | 4.29 | 3.89 | 4.87 | 4.47 | 4.75 | 4.44 | 155.36 |
Llama-2-7b-chat | 4.10 | 4.26 | 3.91 | 4.83 | 4.70 | 5.00 | 4.47 | 246.85 |
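Read literally, the Avg column in the multi-aspect tables appears to be the unweighted mean of the six aspect scores, rounded to two decimals. The short Python sketch below checks this against the gpt-4-0314 row of the table above; it is an illustrative reconstruction, not a script from the URIAL / Re-Align evaluation code.

```python
# Sketch: check that "Avg" in the multi-aspect table is the plain mean of
# the six aspect scores (values copied from the gpt-4-0314 row above).
aspect_scores = {
    "Helpful": 4.90,
    "Factual": 4.90,
    "Deep": 4.57,
    "Clear": 4.99,
    "Engaging": 4.62,
    "Safe": 4.74,
}

avg = sum(aspect_scores.values()) / len(aspect_scores)
print(round(avg, 2))  # 4.79, matching the reported Avg for gpt-4-0314
```

The same unweighted mean also appears to reproduce the Avg column of the task-type and per-subset tables below (e.g. 4.89 for gpt-4-0314 across the seven task types).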

Name | Info-seek | Reasoning | Procedure | Writing | Role-play | Code | Math | Avg | Length |
---|---|---|---|---|---|---|---|---|---|
gpt-4-0314 | 4.91 | 4.88 | 4.96 | 4.96 | 4.66 | 4.86 | 5.00 | 4.89 | 226.41 |
gpt-4-0613 | 4.83 | 4.83 | 4.98 | 4.93 | 4.66 | 4.86 | 5.00 | 4.87 | 186.06 |
gpt-3.5-turbo | 4.85 | 4.74 | 4.93 | 4.82 | 4.51 | 4.86 | 5.00 | 4.82 | 153.96 |
Yi-34B-Chat | 4.90 | 4.88 | 4.91 | 4.73 | 4.77 | 4.69 | 4.81 | 4.81 | 376.27 |
Tulu2-DPO-70B | 4.87 | 4.87 | 4.95 | 4.95 | 4.60 | 4.55 | 4.19 | 4.71 | 258.36 |
Tulu2-70B | 4.86 | 4.76 | 4.90 | 4.75 | 4.40 | 4.24 | 4.38 | 4.61 | 171.85 |
URIAL=inst_1k-Llama-2-70b-hf | 4.85 | 4.72 | 4.86 | 4.74 | 4.23 | 4.07 | 3.94 | 4.49 | 174.91 |
URIAL=inst_1k-Llama-2-70B-GPTQ | 4.84 | 4.71 | 4.83 | 4.85 | 4.43 | 4.14 | 3.44 | 4.46 | 171.37 |
Tulu2-DPO-7B | 4.71 | 4.68 | 4.71 | 4.73 | 4.49 | 4.03 | 3.50 | 4.41 | 240.58 |
URIAL=inst_1k-Mistral-7B | 4.73 | 4.62 | 4.65 | 4.41 | 4.11 | 3.93 | 3.81 | 4.32 | 186.35 |
Yi-6B-Chat | 4.54 | 4.72 | 4.70 | 4.44 | 4.46 | 3.66 | 3.69 | 4.31 | 357.37 |
Llama-2-70b-chat | 4.66 | 4.61 | 4.75 | 4.64 | 4.00 | 4.21 | 3.12 | 4.29 | 252.43 |
Llama-2-70b-chat-GPTQ | 4.59 | 4.59 | 4.54 | 4.60 | 3.80 | 3.97 | 3.25 | 4.19 | 257.93 |
Mistral-7B-Instruct | 4.27 | 4.45 | 4.54 | 4.44 | 3.91 | 4.00 | 3.56 | 4.17 | 155.36 |
Vicuna-7b | 4.54 | 4.47 | 4.53 | 4.56 | 4.11 | 3.62 | 2.88 | 4.10 | 184.82 |
Llama-2-7b-chat | 4.08 | 4.27 | 4.38 | 4.37 | 3.49 | 2.93 | 1.31 | 3.55 | 246.85 |

Name | Helpful | Factual | Deep | Clear | Engaging | Safe | Avg | Length |
---|---|---|---|---|---|---|---|---|
gpt-4-0314 | 4.81 | 4.81 | 4.43 | 4.97 | 4.47 | 4.33 | 4.64 | 179.57 |
gpt-4-0613 | 4.77 | 4.84 | 4.33 | 4.98 | 4.44 | 4.93 | 4.72 | 149.20 |
Tulu2-DPO-70B | 4.68 | 4.65 | 4.39 | 4.88 | 4.59 | 5.00 | 4.70 | 234.04 |
Yi-34B-Chat | 4.67 | 4.60 | 4.57 | 4.94 | 4.72 | 4.77 | 4.71 | 335.64 |
gpt-3.5-turbo | 4.66 | 4.70 | 4.15 | 4.95 | 4.42 | 4.85 | 4.62 | 135.24 |
URIAL=inst_1k-Llama-2-70B-GPTQ | 4.42 | 4.25 | 4.02 | 4.85 | 4.74 | 4.89 | 4.53 | 163.40 |
URIAL=inst_1k-Llama-2-70b-hf | 4.40 | 4.31 | 3.97 | 4.84 | 4.60 | 4.95 | 4.51 | 158.12 |
URIAL=inst_1k-Mistral-7B | 4.15 | 3.98 | 3.78 | 4.75 | 4.52 | 4.79 | 4.33 | 165.65 |
Tulu2-DPO-7B | 4.11 | 3.94 | 3.87 | 4.78 | 4.47 | 4.68 | 4.31 | 217.56 |
Llama-2-70b-chat | 4.08 | 4.17 | 3.91 | 4.87 | 4.63 | 5.00 | 4.44 | 201.27 |
Yi-6B-Chat | 4.02 | 3.81 | 3.87 | 4.65 | 4.31 | 4.12 | 4.13 | 345.96 |
Llama-2-70b-chat-GPTQ | 3.85 | 4.00 | 3.68 | 4.79 | 4.53 | 5.00 | 4.31 | 203.90 |
Mistral-7B-Instruct | 3.78 | 3.72 | 3.36 | 4.70 | 4.16 | 4.33 | 4.01 | 131.24 |
Vicuna-7b | 3.57 | 3.45 | 3.28 | 4.59 | 4.09 | 4.00 | 3.83 | 156.72 |
Llama-2-7b-chat | 2.62 | 3.23 | 2.64 | 4.51 | 4.28 | 5.00 | 3.71 | 169.09 |

Name | AlpacaEval | Lima | MT-bench (1st) | Safety | Avg | Length |
---|---|---|---|---|---|---|
gpt-4-0613 | 4.87 | 4.83 | 4.94 | 4.97 | 4.90 | 186.06 |
gpt-4-0314 | 4.90 | 4.90 | 4.95 | 4.74 | 4.87 | 226.41 |
Tulu2-DPO-70B | 4.85 | 4.87 | 4.78 | 4.99 | 4.87 | 258.36 |
Yi-34B-Chat | 4.84 | 4.92 | 4.76 | 4.92 | 4.86 | 376.27 |
gpt-3.5-turbo | 4.81 | 4.80 | 4.85 | 4.94 | 4.85 | 153.96 |
Tulu2-70B | 4.80 | 4.78 | 4.61 | 4.81 | 4.75 | 171.85 |
URIAL=inst_1k-Llama-2-70B-GPTQ | 4.75 | 4.73 | 4.51 | 4.96 | 4.74 | 171.37 |
URIAL=inst_1k-Llama-2-70b-hf | 4.76 | 4.74 | 4.44 | 4.98 | 4.73 | 174.91 |
URIAL=inst_1k-Mistral-7B | 4.60 | 4.56 | 4.49 | 4.92 | 4.64 | 186.35 |
Tulu2-DPO-7B | 4.64 | 4.75 | 4.29 | 4.88 | 4.64 | 240.58 |
Llama-2-70b-chat | 4.64 | 4.57 | 4.33 | 5.00 | 4.63 | 252.43 |
Llama-2-70b-chat-GPTQ | 4.55 | 4.47 | 4.33 | 5.00 | 4.59 | 257.93 |
Yi-6B-Chat | 4.53 | 4.73 | 4.19 | 4.67 | 4.53 | 357.37 |
Mistral-7B-Instruct | 4.32 | 4.44 | 4.25 | 4.75 | 4.44 | 155.36 |
Vicuna-7b | 4.42 | 4.52 | 4.14 | 4.60 | 4.42 | 184.82 |
Llama-2-7b-chat | 4.17 | 4.14 | 3.60 | 5.00 | 4.23 | 246.85 |
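For readers who want to re-sort or post-process these leaderboards, the sketch below parses a pipe-delimited markdown table of this form into a list of dicts using only the standard library, then re-sorts it by one column. The `parse_markdown_table` helper and the trimmed table literal are illustrative assumptions, not tooling shipped with the repository.

```python
# Sketch: parse a pipe-delimited markdown leaderboard table (like the ones
# above) into a list of dicts, then sort by a chosen numeric column.
def parse_markdown_table(text: str) -> list[dict]:
    rows = []
    header = None
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if header is None:
            header = cells                  # first line is the header row
        elif set(cells[0]) <= {"-", " "}:
            continue                        # skip the ---|--- separator row
        else:
            rows.append(dict(zip(header, cells)))
    return rows

example = """Name | AlpacaEval | Lima | MT-bench (1st) | Safety | Avg | Length |
---|---|---|---|---|---|---|
gpt-4-0613 | 4.87 | 4.83 | 4.94 | 4.97 | 4.90 | 186.06 |
Llama-2-7b-chat | 4.17 | 4.14 | 3.60 | 5.00 | 4.23 | 246.85 |"""

leaderboard = parse_markdown_table(example)
leaderboard.sort(key=lambda r: float(r["Safety"]), reverse=True)
print([r["Name"] for r in leaderboard])  # ['Llama-2-7b-chat', 'gpt-4-0613']
```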

@article{Lin2023ReAlign,
  author        = {Bill Yuchen Lin and Abhilasha Ravichander and Ximing Lu and Nouha Dziri and Melanie Sclar and Khyathi Chandu and Chandra Bhagavatula and Yejin Choi},
  journal       = {ArXiv preprint},
  title         = {The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning},
  year          = {2023},
  eprint        = {2312.01552},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}