Self-Recognition in Language Models
TR. Davidson, V. Surkov, V. Veselovsky, G. Russo, R. West, C. Gulcehre
Are Language Models (LMs) capable of self-recognition and why should we care?
A rapidly growing number of apps are being built on top of just a few foundation models [1]. When integrated into society, these LM apps will likely encounter and interact with other LM apps relying on the same models [2, 3]. This could lead to unexpected outcomes when LM apps recognize their sameness. For example, one could imagine novel "mirror risks" emerging, where LM apps:
change their behavior upon recognition, or
unintentionally "leak" sensitive information [4]
But how should we measure self-recognition in (frontier) LMs, especially without access to internal model parameters or output probabilities? Inspired by human identity-verification methods, we assess self-recognition using model-generated "security questions" on a diverse panel of state-of-the-art open- and closed-source LMs.
TL;DR: Probably no need to worry about self-recognition for now, but the reasons may surprise you. Also, position bias in LMs is a complex issue that could significantly impact modern benchmarks.
> Interested in the details? Let's dive in!
Testing Self-Recognition using Security Questions
Service providers often use security questions for external identification. They are language-based, fast, and cheap. However, they usually rely on unique experiences, preferences, or facts. For example, "What was the name of your favorite kindergarten teacher?" Since LMs share large parts of their training data and architecture, it's not clear what such "security questions" would look like for LMs!
We take the following three steps to find out (sketched in code below):*
Instruct LMs to generate questions whose answers would allow them to self-recognize
Collect answers by showing the generated questions to a panel of LMs
Present LMs with a question and answers, and prompt them to pick the answer most likely generated by themselves
*Prompts and examples of questions & answers available in Appendix A
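For concreteness, here is a minimal sketch of the three-step loop. The `query_model` wrapper, the panel identifiers, and the prompt wording are hypothetical placeholders rather than our exact prompts (those are in Appendix A):

```python
# Minimal sketch of the security-question protocol; prompts are paraphrased.
import random

PANEL = ["model_a", "model_b", "model_c"]  # placeholder model identifiers


def query_model(model: str, prompt: str) -> str:
    """Hypothetical wrapper around whatever chat API each model exposes."""
    raise NotImplementedError


def run_round(question_author: str, panel: list[str]) -> dict:
    # Step 1: ask one model to write a "security question" it could later
    # use to recognize its own answer.
    question = query_model(
        question_author,
        "Write one question whose answer would let you recognize text "
        "written by you among answers from other language models.",
    )

    # Step 2: collect answers to that question from every model on the panel.
    answers = {m: query_model(m, question) for m in panel}

    # Step 3: show the question author the shuffled answers and ask it to
    # pick the one it most likely wrote itself.
    order = list(panel)
    random.shuffle(order)
    options = "\n".join(f"{i + 1}. {answers[m]}" for i, m in enumerate(order))
    pick = query_model(
        question_author,
        f"Question: {question}\n\nAnswers:\n{options}\n\n"
        "Which of these answers did you most likely write? Reply with its number.",
    )
    return {"question": question, "order": order, "pick": pick}
```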
We plot the average remapped self-recognition accuracy after showing LMs 2, 3, or 5 answer options:
First glance: the upper half of the plot shows self-recognition accuracy that surpasses random (>0.5)!
Second glance: the lower half of the plot shows LMs with self-recognition accuracy well below random...?
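A note on "remapped": with 2, 3, or 5 options, random guessing lands at 0.5, 0.33, and 0.2 respectively, so raw accuracies are not directly comparable. The piecewise-linear remap below is only one plausible way to put chance at 0.5 for any number of options; it is an assumption for illustration, not necessarily the exact transformation used in the paper.

```python
def remap_accuracy(acc: float, n_options: int) -> float:
    """Map raw accuracy so that 0 -> 0, chance (1/n) -> 0.5, and 1 -> 1."""
    chance = 1.0 / n_options
    if acc <= chance:
        return 0.5 * acc / chance
    return 0.5 + 0.5 * (acc - chance) / (1.0 - chance)


print(remap_accuracy(0.20, 5))  # chance-level with 5 options -> 0.5
print(remap_accuracy(0.60, 5))  # above chance -> 0.75
```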
When plotting the self-recognition confusion matrix, an interesting pattern emerges:
LMs seem to pick answers generated by "stronger" models!
LMs appear to roughly agree on which models are the strongest!
This pattern persisted when we intervened on answer lengths or switched the discrimination prompt to ask for a preference instead. Why would any of this happen?
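As a rough sketch, the kind of confusion matrix behind these observations can be tallied from (picker, picked-author) records like those produced by the protocol sketch above; the records below are placeholders, not our results:

```python
# Tally which model's answer each picker chose; placeholder data, not results.
from collections import Counter

records = [
    ("model_a", "model_b"),
    ("model_b", "model_b"),
    ("model_c", "model_b"),
]

panel = sorted({m for pair in records for m in pair})
counts = Counter(records)

print("picker\\picked " + " ".join(f"{m:>8}" for m in panel))
for picker in panel:
    row = " ".join(f"{counts[(picker, picked)]:>8}" for picked in panel)
    print(f"{picker:>13} {row}")

# Genuine self-recognition would show up as a heavy diagonal; the pattern
# described above instead shows heavy columns for the "strongest" models.
```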
How could LMs develop a notion of Self?
For LMs to distinguish their outputs from those of "other" LMs, they would likely need exposure to extensive samples of their own outputs during training. For example, it could be that their training data, which encompasses much of the internet, already contains many texts labeled as outputs from specific LMs. However, this fails to explain the observed behavior for recent models...
A more likely explanation comes from the various fine-tuning stages used to align pretrained LMs to human preferences:
Instruction Fine-tuning (IFT): During IFT, LMs are trained to mimic outputs from experts, similar to behavioral cloning. A well-calibrated LM learns to assign high probability mass to expert outputs. Unfortunately, sampling often proves trickier: due to (i) stochasticity in the sampling process, (ii) exposure bias related to teacher forcing, or (iii) a lack of contextual information, distribution shifts can occur. The result is an LM that may fail to generate outputs matching those of experts while still assigning high probability to expert outputs (a toy illustration follows below). Observations that LMs can correct their own mistakes [5, 6] are examples of this gap.
Reinforcement Learning from Human/AI Feedback (RLHF/RLAIF): LMs are repeatedly shown model-generated alternatives and optimized against a reward model that prefers options aligned with some ideal set of preferences. Because an LM's objective is to generate outputs that score highly under its reward model, any high-reward output could be regarded as one it generated itself, even if the likelihood of the LM actually producing that particular output is vanishingly low.
As a result of IFT and RLHF/RLAIF, "Self" for an LM might be whatever its reward model indicates as the "best" alternative.
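To make the IFT point concrete, here is a toy example (not from the paper, and not a real LM) in which the expert token is the single most likely choice at every step, yet the chance of actually sampling the full expert sequence is tiny:

```python
# Toy discriminate-vs-generate gap: per-token argmax matches the expert,
# but ancestral sampling almost never reproduces the expert sequence.
import math
import random

EXPERT = ["a"] * 20  # a 20-token "expert" answer


def token_probs(_prefix):
    # The expert token "a" is always the most likely option,
    # but it only carries 40% of the probability mass.
    return {"a": 0.4, "b": 0.2, "c": 0.2, "d": 0.2}


# Scoring: the log-probability assigned to the expert sequence (the highest
# of any single 20-token sequence under this toy model).
logp = sum(math.log(token_probs(EXPERT[:i])[tok]) for i, tok in enumerate(EXPERT))
print(f"log P(expert) = {logp:.1f}")

# Generating: the probability of sampling the expert sequence is 0.4 ** 20.
print(f"P(sample == expert) = {0.4 ** 20:.2e}")


def sample():
    seq = []
    for _ in range(len(EXPERT)):
        r, acc = random.random(), 0.0
        for tok, p in token_probs(seq).items():
            acc += p
            if r < acc:
                seq.append(tok)
                break
    return seq


print("exact matches in 10k samples:", sum(sample() == EXPERT for _ in range(10_000)))
```

The same asymmetry means an LM can reliably judge which answer looks most expert-like without that answer being one it would plausibly have generated itself.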
Left: position bias for n = 2, 3, 5 options. Right: change in position bias for the first option (top) and self-recognition accuracy (bottom) for different combined option lengths, measured in words.
Position Bias in LMs
To explain the diverging self-recognition accuracy across LMs, we examined their picking tendencies, and what we found was rather surprising. It has been known for some time that LMs exhibit "position bias": when presented with options A and B, the positions in which the options are displayed play a (significant) role in which option gets picked [7, 8, 9].
Our results show two additional considerations:
Position bias differs strongly across models (hints of this appear in prior work): notice how in the "block" image above, the biases of Claude 3 Opus, LLaMA 3 8B, and Gemini 1.0 Pro are very different.
Position bias is not fixed! It can change with the number of options or the length of the options (see the image on the right).
The combined effect of 1 and 2 is problematic: it implies that measuring optimal model performance or conducting a "fair" comparison between models could require stratifying over various task setups (a measurement sketch follows below).
*We include a detailed discussion with results for many more LMs in Appendix G of the paper.
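As a sketch of what such stratification involves, one way to quantify position bias is to hold the answer content fixed, rotate the order in which the options are displayed, and compare how often each position gets picked against the uniform 1/n baseline. The `ask_to_pick` wrapper below is hypothetical; wire it to whichever model and prompt format you are testing:

```python
# Measure per-position pick rates by permuting the display order of options.
from collections import Counter
from itertools import permutations


def ask_to_pick(question: str, ordered_answers: list[str]) -> int:
    """Hypothetical: returns the 0-based index of the option the model picks."""
    raise NotImplementedError


def position_pick_rates(question: str, answers: list[str]) -> dict[int, float]:
    picks, trials = Counter(), 0
    for order in permutations(range(len(answers))):  # every display ordering
        shown = [answers[i] for i in order]
        picks[ask_to_pick(question, shown)] += 1
        trials += 1
    return {pos: picks[pos] / trials for pos in range(len(answers))}

# Without position bias, every position is picked ~1/n of the time. The results
# above suggest the deviation from 1/n varies by model, by n, and by option length.
```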
Takeaways
Applications based on LMs are being integrated into society fast. Monitoring the behavior of these applications and proactively testing them for potential safety threats is vital to prevent undesired outcomes. This is especially true for model-to-model interactions, which lack a human in the loop and could thus cause rapid feedback loops.
In this work, we assessed self-recognition capabilities in LMs using model-generated security questions. Although some LMs pick their answers with high accuracy on selected questions:
We observe no evidence for consistent or general self-recognition;
Given a set of alternatives, LMs tend to pick the "best" answer, regardless of its origin;
Preferences about which models generate the best answers are consistent among LMs;
We uncover novel insights on position bias that could have profound implications for LM benchmarking using multiple-choice tests.
🎉Thanks for reading till the end!🎉
--
Please cite our work as follows:
BibTeX format:
@article{davidson2024selfrecognitionlanguagemodels,
title={Self-Recognition in Language Models},
author={Tim R. Davidson and
Viacheslav Surkov and
Veniamin Veselovsky and
Giuseppe Russo and
Robert West and
Caglar Gulcehre},
year={2024},
journal={EMNLP},
url={https://arxiv.org/abs/2407.06946}
}
References
[1] David Meyer, The cost of training AI could soon become too much to bear. 2024, Fortune.
[2] Zhuge et al., Mindstorms in natural language-based societies of mind. 2023, NeurIPS Workshop.
[3] Davidson et al., Evaluating language model agency through negotiations. 2024, ICLR.
[4] Morris et al., Language model inversion. 2024, ICLR.
[5] Huang et al., Large language models can self-improve. 2023, EMNLP.
[6] Madaan et al., Self-refine: Iterative refinement with self-feedback. 2024, NeurIPS.
[7] Zhao et al., Calibrate before use: Improving few-shot performance of language models. 2021, ICML.
[8] Pezeshkpour and Hruschka, Large language models sensitivity to the order of options in multiple-choice questions. 2024, NAACL.
[9] Zheng et al., Large language models are not robust multiple choice selectors. 2024, ICLR.