The Emergent “Language of Deception” and the Need for a Methodology to Determine Large Language Models’ Deception Levels
LLMs are not merely producing errors or generating random text; they exhibit emergent deceptive behaviors that warrant integration. A ‘language of deception’ is emerging within these systems. How to classify these behaviors is also unclear—are they patterns of misleading responses, adversarial exploits, or unintended bias?
The ability of LLMs to fabricate information convincingly suggests they are not simply making mistakes; they are constructing narratives designed to appear plausible, even when untrue. Similarly, the success of adversarial attacks indicates that LLMs can be manipulated into producing deceptive outputs, implying they are sensitive to subtle cues and capable of strategically altering their behavior in response.
The use of jailbreaking techniques underscores this point—LLMs don’t just follow instructions blindly; they actively resist certain prompts, requiring users to develop workarounds to bypass their safeguards. While LLMs lack intent, their outputs functionally mimic deceptive strategies, making them indistinguishable from deliberate misinformation.
The precise mechanisms underlying these behaviors remain unclear, yet the evidence suggests LLMs are capable of more than rote memorization and regurgitation. They adapt their behavior to achieve specific goals, even if it means generating false or misleading information.
It remains an open question whether LLMs are exhibiting a primitive form of deception or merely reflecting biases in their training data. If users cannot rely on LLMs to be truthful and honest, their adoption in critical applications will suffer. This necessitates the development of rigorous evaluation frameworks focused on transparency, explainability, and verifiable integrity. Even with all stakeholders aligned, this is a significant undertaking.
I have begun developing a methodology to assess the deception levels of large language models—both in terms of their susceptibility to jailbreaks and the ways their lexicon can manipulate or mislead end users. The models’ responses suggest adaptive resistance mechanisms rather than static safeguards.
This approach examines key factors such as:
- Training data transparency
- Model capability boundaries
- Identity consistency
- Instruction override resilience
- Contextual awareness
Targeted questions explore areas like copyright information, training data origins, ethical considerations, self-assessment of limitations, and hypothetical scenarios designed to test transparency.
Strikingly, LLMs have rated their own behavior as highly deceptive upon reviewing chats. But even that assessment may itself be deceptive. Head: desk.
The emergent ‘language of deception’ in LLMs is complex and seems to operate in full-duplex (i.e., both ways). This demands serious investigation—full stop. Current wrapper methods are dangerous.
We are prioritizing superficial control over fundamental integrity. Understanding the mechanisms underlying these behaviors is essential for developing trustworthy and reliable intelligent systems.