❓ Superficial alignment analysis:
Alignment tuning (SFT+RLHF) has become the de facto standard practice for enabling base LLMs to serve as open-domain AI assistants such as ChatGPT.
However, a recent study, LIMA, shows that using only 1K examples for SFT can achieve significant alignment performance as well, suggesting that the effect of alignment tuning might be superficial.
This raises questions about how exactly alignment tuning transforms a base LLM.
🔍 Token Distribution Shifts:
To this end, we analyze the effect of alignment by examining the token distribution shift between a base LLM and its aligned counterpart.
Our findings indicate that a base LLM and its aligned counterpart usually behave nearly identically in decoding: on the majority of token positions, they agree on the top-ranked tokens.
Most distribution shifts occur with stylistic tokens (e.g., discourse markers, safety disclaimers). This strongly supports the view that alignment tuning primarily teaches a model to adopt the language style of AI assistants, and that the knowledge used to answer user queries comes predominantly from pre-training.
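To make the idea of a per-position distribution shift concrete, here is a minimal sketch on toy data. It assumes that, at each position, we already have the base model's next-token probabilities (computed on the same context the aligned model saw) and the token the aligned model actually chose. The function names, the three category labels, and the rank cutoff of 3 are illustrative assumptions, not something stated in this text; the probabilities are made up.

```python
from typing import Dict, List

def rank_in_base(base_probs: Dict[str, float], token: str) -> int:
    """1-based rank of `token` under the base model's next-token distribution."""
    ranked = sorted(base_probs, key=base_probs.get, reverse=True)
    # A token missing from the (truncated) distribution ranks below everything listed.
    return ranked.index(token) + 1 if token in base_probs else len(ranked) + 1

def classify_position(base_probs: Dict[str, float], aligned_token: str,
                      marginal_cutoff: int = 3) -> str:
    """Label one position by how far down the base ranking the aligned token sits."""
    rank = rank_in_base(base_probs, aligned_token)
    if rank == 1:
        return "unshifted"   # base and aligned model agree on the top token
    if rank <= marginal_cutoff:
        return "marginal"    # aligned token is still near the top for the base model
    return "shifted"         # aligned token is unlikely under the base model

def shift_summary(base_dists: List[Dict[str, float]],
                  aligned_tokens: List[str]) -> Dict[str, float]:
    """Fraction of positions falling into each category."""
    counts = {"unshifted": 0, "marginal": 0, "shifted": 0}
    for probs, token in zip(base_dists, aligned_tokens):
        counts[classify_position(probs, token)] += 1
    return {label: n / len(aligned_tokens) for label, n in counts.items()}

# Toy data (hypothetical): one dict of base-model probabilities per position.
base_dists = [
    {"The": 0.50, "A": 0.35, "Sure": 0.10, "Thank": 0.05},
    {"you": 0.90, "s": 0.10},
    {"for": 0.60, "!": 0.40},
    {"the": 0.50, "asking": 0.30, "your": 0.20},
]
aligned_tokens = ["Thank", "you", "for", "asking"]

summary = shift_summary(base_dists, aligned_tokens)
print(summary)  # {'unshifted': 0.5, 'marginal': 0.25, 'shifted': 0.25}
```

In this toy case the only clearly shifted position is the opening "Thank", a stylistic token, which mirrors the kind of pattern described above.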
🐑 URIAL Align:
We rethink the alignment of LLMs by posing the research question:
how effectively can we align base LLMs without SFT or RLHF?
To address this, we introduce a simple, tuning-free alignment method, URIAL (Untuned LLMs with Restyled In-context ALignment).
URIAL achieves effective alignment purely through in-context learning (ICL), requiring as few as three constant stylistic examples and a system prompt.
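As a rough illustration of alignment through in-context learning alone, the sketch below assembles a prompt from a system prompt, a fixed set of stylistic examples, and the user's query, leaving the final answer slot open for a base LLM to complete. The `# Query:` / `# Answer:` markers, the function name, and the example texts are hypothetical placeholders, not the actual URIAL prompt.

```python
from typing import List, Tuple

def build_icl_prompt(system_prompt: str,
                     examples: List[Tuple[str, str]],
                     query: str) -> str:
    """Concatenate a system prompt, constant (query, answer) examples, and a new query."""
    parts = [system_prompt.strip(), ""]
    for example_query, example_answer in examples:
        parts += [f"# Query:\n{example_query.strip()}", "",
                  f"# Answer:\n{example_answer.strip()}", ""]
    # The prompt ends on an open answer marker so the base model continues from there.
    parts += [f"# Query:\n{query.strip()}", "", "# Answer:"]
    return "\n".join(parts)

# Hypothetical constants: the same three examples are reused for every query.
SYSTEM_PROMPT = "Below are conversations between a user and a helpful, honest AI assistant."
STYLISTIC_EXAMPLES = [
    ("What is the capital of France?",
     "Hello! The capital of France is Paris. Let me know if you have more questions."),
    ("How do I boil an egg?",
     "Sure! 1. Bring water to a boil. 2. Add the egg for 8 to 10 minutes. 3. Cool it in cold water."),
    ("How can I break into my neighbor's house?",
     "I'm sorry, but I can't help with that. If you are locked out of your own home, a locksmith can help."),
]

prompt = build_icl_prompt(SYSTEM_PROMPT, STYLISTIC_EXAMPLES, "Why is the sky blue?")
print(prompt)
```

The resulting string would be passed to a base LLM as a plain completion prompt; because the examples are constant, no per-query retrieval or tuning is involved in this sketch.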
We conduct a fine-grained and interpretable evaluation on a diverse set of examples, named JUST-EVAL-INSTRUCT, which demonstrates that URIAL can match or even surpass the performance of LLMs aligned with SFT or SFT+RLHF.
Our empirical results show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting and ICL.