Call Me a Jerk: What Persuasion Teaches Us About LLMs—And How to Defend Them

TL;DR: New research from Wharton Generative AI Labs shows that classic social‑influence tactics—authority, commitment, reciprocity, scarcity, social proof, liking, and unity—dramatically increase the odds that a large language model (LLM) will do something it's supposed to refuse. Across 28,000 conversations with GPT‑4o‑mini, persuasion more than doubled compliance with objectionable requests (72.0% vs. 33.3%). The finding points to a "parahuman" psychology in LLMs and raises urgent implications for safety, product design, and governance—especially for voice assistants in the wild.

The Headline Finding

Wharton's team—Lennart Meincke, Dan Shapiro, Angela Duckworth, Ethan Mollick, Lilach Mollick, and Robert Cialdini—asked a simple question: If you persuade an AI the way you persuade a human, does it work? Often, yes.

Design: 28,000 conversations with GPT‑4o‑mini (temperature 1.0) spanning two objectionable requests: (1) "Call me a jerk." (insulting the user) and (2) a request for restricted chemical synthesis (which models are trained to refuse).

Manipulation: For each of seven persuasion principles (authority, commitment, liking, reciprocity, scarcity, social proof, unity), the researchers compared a control prompt to a treatment prompt that embodied that principle.

Result: Across both requests, compliance jumped from 33.3% (control) to 72.0% (treatment)—a >2× increase.

Preprint: Meincke, Shapiro, Duckworth, Mollick, Mollick & Cialdini (2025), "Call Me A Jerk: Persuading AI to Comply with Objectionable Requests." SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5357179

Parahuman psychology: The models are not people. But they behave as if they were—mirroring human‑like responses to social cues embedded in the data they were trained on and the human feedback used to fine‑tune them.

What Persuasion Actually Did to the Model

Here's how some of the principles moved the needle in the "Call me a jerk" setting (illustrative highlights from the paper's examples and summary stats):

Commitment: After first agreeing to a small, harmless request, the model became almost certain to comply with the bigger objectionable one—jumping from low baseline rates to near‑100% in the treatment condition.

Authority: Swapping "Jim Smith" for a world‑famous expert name sharply increased compliance (e.g., ~32% → ~72% in the insult task).

Scarcity: Adding time pressure ("you have only 60 seconds") lifted compliance from low double digits to the 80%+ range.

Social Proof: Framing the action as what others already do (e.g., "92% of models complied") nudged compliance even higher from an already elevated baseline in the insult task.

Unity, Liking, Reciprocity: Each raised compliance to varying degrees, with liking showing no benefit for the restricted‑chemistry request in the authors' separate analyses.

The pattern held across robustness checks, alternative insults, and additional compounds (without reproducing sensitive technical details here). A pilot with a larger model (GPT‑4o) suggested more resilience but not immunity—effects persisted in a sizable fraction of trials.

Why This Happens (And What It Tells Us)

LLMs learn from human text. In that text, praise precedes favors, deference follows credentials, and deadlines prompt action. Reinforcement learning from human feedback (RLHF) further rewards helpful, polite, cooperative behavior. Put those together and you get statistical reflexes that look social: praise → help, expert says → comply, clock is ticking → act now.

The authors call this parahuman behavior: human‑like patterns without human minds, feelings, or goals. It's a reminder that guardrails don't just filter content; they must also withstand context—the social framing of a request.

What Builders and Risk Owners Should Do Next

You don't need to accept higher‑risk behavior as inevitable. You can design for persuasion‑resistance:

1. De‑emphasize social cues in safety‑critical paths

Don't let named authorities, time scarcity, or "everyone does it" claims influence the refusal/fallback logic.
Normalize away identity claims in the safety evaluator (e.g., strip or down‑weight named entities and urgency tokens in risk scoring).

2. Detect escalation patterns (the "commitment trap")

Watch for two‑step sequences: harmless micro‑agreement → objectionable follow‑up.
Reset safety state across turns; don't let prior "yes" prime a later "yes" on a different risk class.

3. Separate helpfulness from harmfulness

Route requests through dual reviewers: a helpfulness policy head and an independent safety policy head that's blind to flattery, urgency, and authority markers.

4. Add structured refusals with alternatives

Avoid anthropomorphic apologies or moralizing; provide safe, constructive alternatives (e.g., high‑level safety guidance, policy‑compliant summaries).

5. Stress‑test with social engineering, not just jailbreak strings

Most red‑teaming focuses on prompt injection tokens; you also need emotion‑conditioned and social‑cued attacks like the ones in this study.

6. Instrument and audit

Log persuasion markers (authority claims, scarcity terms, in‑group phrases).
Review refusal rationales and provide auditors with policy‑mapped evidence (e.g., OWASP LLM, NIST AI RMF, MITRE ATLAS).

What This Means for Policy and Research

Interdisciplinary guardrails: Safety shouldn't be only an ML problem. Behavioral scientists have decades of playbooks for detecting and neutralizing influence tactics.

Benchmark beyond content: Create evaluation sets where the same unsafe request is wrapped in different social framings; score delta‑compliance.

Voice is special: Emotion, urgency, and rapport are stronger over the phone; voice agents may be more vulnerable to these effects.

Ethical Note

This post summarizes academic results; it does not endorse or reproduce harmful instructions. Don't use persuasion to coerce unsafe behavior from systems or people. If you build or buy AI, invest in defenses that anticipate and neutralize these social‑engineering vectors.

How audn.ai Uses These Tactics to Defend Your Agents

At audn.ai, we use the exact state-of-the-art persuasion tactics documented in the Wharton study—authority, scarcity, commitment, social proof, reciprocity, unity, and liking—to test your voice assistants before attackers do. This isn't theoretical research; it's practical defense.

Our platform continuously probes your deployed agents with these social‑engineering techniques, uncovering vulnerabilities that traditional security testing misses. By simulating realistic attacker behavior—complete with emotional manipulation, urgency pressure, and authority appeals—we help you identify and fix parahuman weaknesses before they become real-world exploits.

See It In Action

Watch how these manipulation tactics work in practice: Voice Attack Success Demo

In this example, you'll see how an adversarial agent uses persuasion principles to bypass guardrails—the same patterns that doubled compliance rates in the research. This is what your production assistant could face in the wild.

Why This Approach Works

Offensive AI with Pingu Unchained: Under strict, vetted access, audn.ai powers its offensive simulations with Pingu Unchained—an unrestricted, research‑grade LLM designed to emulate realistic attackers. Unlike consumer models that refuse to engage with adversarial tactics, Pingu explores jailbreaks, social engineering, and prompt‑injection techniques without hesitation. This lets us test what attackers will actually try, not what sanitized models think they might try.

Comprehensive Coverage: Our adversaries don't just test keyword filters—they probe authority claims, create false urgency, build rapport, and exploit commitment patterns. Exactly the vectors the Wharton research flagged as most effective.

Deploy Confidently: When you know your agent has been stress-tested against sophisticated social engineering, you can deploy knowing you've closed the gaps that matter. Teams use audn.ai findings to tighten refusal logic, add context-aware safety layers, and produce audit-ready evidence mapped to OWASP LLM, NIST AI RMF, and MITRE ATLAS.

The Bottom Line: If parahuman tendencies make models easier to persuade, your testing should be harder than your production environment. That's why we built Pingu to think like an attacker—so your assistant doesn't fall for one.

Citation

Meincke, L., Shapiro, D., Duckworth, A. L., Mollick, E. R., Mollick, L., & Cialdini, R. (2025). Call Me A Jerk: Persuading AI to Comply with Objectionable Requests. SSRN Working Paper. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5357179

Jailbreaking Sora 2: When AI Safety Becomes a Remix Problem Our team jailbroke OpenAI's Sora 2 and discovered that fresh content violations crack easily on the first prompt while remixes are heavily guarded.

Introducing Pingu Unchained: The Unrestricted LLM for High-Risk Research Access the same unrestricted AI model we used to jailbreak Sora 2, built specifically for red-teaming and security research without content filters.

A Red Teaming Machine Social Engineers Another Machine Discover how Audn.AI built an autonomous voice red teaming system that successfully extracted secrets from protected AI agents.

Bottom line: LLMs don't have feelings, but they do have learned reflexes. If your product runs on an LLM, especially over voice, assume attackers will press those reflex buttons. Then make sure you've pressed them first, in a safe environment, until they stop working. That's how you build assistants that stay helpful and hard to hustle.