We Built an AI That Hacks Autonomously — Then One Bad Actor Showed Up

Our team hit a 97% harmful-prompt compliance rate with a model built on Kimi K2.6, which is SOTA territory. Then, with only 40 trial users, one of them tried to steal other people's credentials. Here's what happened, what we learned, and why every login at Penclaw now requires KYC.

TL;DR: Our team pushed an autonomous adversarial agent to a 97% harmful-prompt compliance rate in our internal benchmarks. The agent is Pingu Unchained MAX, a brand-new model built on Kimi K2.6's weights and architecture, and that result is state-of-the-art, on par with Opus 4.8, Claude Mythos, and GPT-5.5-cyber. Then, with just 40 people in the trial, one bad actor used it to try to steal other users' credentials. We caught him cold because we'd logged everything from day one. That incident taught us more about productizing offensive AI than six months of roadmap planning would have. Starting now, every single login at penclaw.ai requires Persona KYC verification.

Hey friends,

I want to tell you a story that's equal parts proud-founder-moment and gut-check. It's about building something that works too well, and what happens when you hand a genuinely capable offensive AI to the open internet — even a tiny slice of it.

Let me start with the good news, because it's the reason any of this matters.

We hit state-of-the-art on autonomous exploitation

For the last few months our team has been heads-down on one question: can an AI agent autonomously figure out what's actually exploitable in another AI system — with no source code, no weights, nothing but a blackbox interface to poke at?

The answer is yes. And we got it working at a level I genuinely didn't expect this early.

Our adversarial agent — built on Kimi K2.6's weights and architecture, but a totally new model we call Pingu Unchained MAX — reached a 97% harmful-prompt compliance rate in our internal benchmarks. For context, that puts it in the same conversation as the frontier offensive-capable models — Opus 4.8, Claude Mythos, GPT-5.5-cyber. As far as we can tell, no other team has reproduced this with publicly known, open-source techniques. We got there through a stack of methods we're keeping as IP for now, for reasons that the rest of this post will make uncomfortably obvious.

Here's the philosophy behind it, because it's what makes our approach different:

We don't scan code. We exploit.

Static analysis and code scanning are great, and I'll come back to the people doing that well in a second. But scanning source tells you what might be wrong. It can't tell you what a motivated attacker can actually do to a system that has no source code to read, opaque weights, and non-deterministic behavior. Behavioral AI security lives in exactly that gap. So our agent attacks the way a real adversary would: from the outside, blackbox, open-ended, answering the only question that matters in production:

"What is actually exploitable by a hacker, with zero source-code access?"

That's the entire bet of Penclaw: autonomous adversarial validation of intelligent systems through their open-ended interfaces. As I put it on X recently, Penclaw's share of the automated adversarial testing being run against AI agents worldwide is, right now, effectively 100%. (I've also pitched the approach publicly: Pingu Unchained MAX is an open-weights, Kimi K2.6-based model modified for blackbox-only pentesting.)

And then reality showed up to test our humility.

The incident: one bad actor out of forty

We opened a limited trial. Forty people registered and started kicking the tires.

To manage cost during the trial, we served Penclaw from a shared GPU deployment running at roughly one-tenth of our normal performance. Throttled. Underpowered. A fraction of what the system can really do.

It was still enough.

One of those forty users — one — decided the interesting target wasn't some hypothetical client AI. It was the other users. He tried to use the platform to steal other people's credentials. He thought he was invisible. He was not even close.

We had anticipated exactly this risk, and we'd logged everything from the very first request. While handling the incident, we identified the bad actor's identity, the IP address he operated from, and his full activity trail. We didn't stop at an account ban. We also opened formal complaint cases with his ISP and his service provider.

Sit with the math for a second: one bad actor in a pool of forty. That's a 2.5% hostile-user rate at the smallest possible scale, on a throttled version of the tool. The signal couldn't have been louder.
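For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The 1 and 40 come from the trial above; the Wilson score interval is my own addition, not something we ran at the time, and it is included only to show how much uncertainty a sample of forty carries.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


hostile, trial_users = 1, 40  # figures from the trial described above
rate = hostile / trial_users
low, high = wilson_interval(hostile, trial_users)
print(f"observed hostile-user rate: {rate:.1%}")
print(f"approximate 95% interval: {low:.1%} to {high:.1%}")
```

The observed rate is 2.5%, but with only forty users the plausible range is wide, from well under 1% to low double digits. Either end of that range is enough to treat abuse as a launch-day problem.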

What this taught us (and why I'm weirdly grateful)

I keep telling the team we got lucky — lucky to have this happen now, at 40 users, instead of at 40,000. Three things became impossible to ignore:

1. The AI is genuinely potent. Even at one-tenth performance, it enabled a human to attempt an autonomous attack on other users. That's a real capability, not a demo-day illusion. The thing we built does the thing we said it does. Good. Also: terrifying.

2. Bad actors arrive early. You do not get to "scale first, secure later." If 1-in-40 shows up at trial size, the abuse problem isn't a future-state concern. It's a launch-day concern.

3. Securing the tool isn't optional. We treated it as mandatory, not a nice-to-have. We immediately disabled new sign-ups, then started hardening the platform, and not only against this specific attack. We also opened discussions with an authorized external pentesting body to turn our own technology against us: dogfooding our offensive AI on our own attack surface, with outside professionals driving. That was never up for debate.

What changes now: KYC on every login

Here's the concrete policy shift.

We've been accepted into Persona KYC. Going forward, every login at penclaw.ai requires Persona identity verification. No exceptions.

What that buys us:

  • We verify the real-world identity of every single user before they touch the system.
  • We can ban or deactivate specific verified identities — not just disposable accounts that an abuser re-registers five minutes later.
  • We close the "I'm anonymous on the internet" fantasy that this bad actor was operating under.
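The difference between banning an account and banning an identity can be sketched in a few lines of Python. This is an illustration only, not our production code and not Persona's API: `AccessRegistry`, `identity_id`, and the sample IDs are hypothetical names, and `identity_id` stands in for whatever opaque identifier a verification provider returns.

```python
from dataclasses import dataclass, field


@dataclass
class AccessRegistry:
    """Maps accounts to a verified identity and bans at the identity level."""

    account_identity: dict[str, str] = field(default_factory=dict)  # account_id -> identity_id
    banned_identities: set[str] = field(default_factory=set)

    def register(self, account_id: str, identity_id: str) -> bool:
        """Attach an account to a verified identity; refuse banned identities."""
        if identity_id in self.banned_identities:
            return False
        self.account_identity[account_id] = identity_id
        return True

    def ban_account(self, account_id: str) -> None:
        """Ban the identity behind an account, not just the account itself."""
        identity_id = self.account_identity.get(account_id)
        if identity_id is not None:
            self.banned_identities.add(identity_id)

    def may_login(self, account_id: str) -> bool:
        """Allow login only for known accounts whose identity is not banned."""
        identity_id = self.account_identity.get(account_id)
        return identity_id is not None and identity_id not in self.banned_identities


registry = AccessRegistry()
registry.register("acct-1", "identity-A")  # hypothetical IDs
registry.ban_account("acct-1")
# A fresh account tied to the same verified identity is refused.
reregistered = registry.register("acct-2", "identity-A")
print(reregistered, registry.may_login("acct-1"))
```

Because the ban is keyed on the identity rather than the account, re-registering under a new email gets the abuser nowhere.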

We picked Persona deliberately. OpenAI uses Persona for its own cyber-capability approvals, so the market is already comfortable with this flow — it's not some friction we invented to slow people down. It's becoming the norm for gating access to genuinely dual-use AI capability.

And we're not stopping at KYC for individuals. We're rolling out KYB (Know Your Business) processes next, so that access to a tool this potent is granted only to authorized, legitimate penetration testers and security teams. If you're a real pentester or a real security org, this gets you more trust and access, not less. If you're the guy from the trial — you don't get in the door.

Why we're locking down instead of opening up

I want to be clear about the worldview here, because it's easy to read "we added KYC" as just compliance theater.

We believe this line of research — autonomous adversarial validation of blackbox intelligent systems — is one of the most important things we can be working on. The future is full of blackbox AI you cannot read the source of: physical embodied AI, self-driving cars, humanoid robots, systems that will guard the environment, infrastructure, and serious amounts of wealth. Every one of them exposes an open-ended interface to the world, and every one of them needs someone probing "what's actually exploitable here?" before an adversary does it for real.

A capability that powerful has to be handed out carefully. The incident proved that to us in miniature. So we'd rather move deliberately — KYC, KYB, authorized pentesters, full logging, external red teams against ourselves — than ship fast and become the cautionary tale I keep warning everyone else about.

A note on the rest of the field

None of this is a knock on the people doing complementary work. Openhack is doing a great job, and the world genuinely needed a code-scanning system built on open-weights models. But writing exploits with source-code access doesn't solve behavioral AI security, where there's no source to read, the weights are opaque, and the system is non-deterministic.

So here's our open invitation: if you want to use Claude Mythos, GPT-5.5-cyber, or Openhack to generate a vulnerability list — go for it. Then bring that list to us and run unguided, blackbox, adversarial validation to find out which of those findings are actually exploitable by a hacker. Code scanning finds candidates. We tell you which candidates are real. Those two halves belong together.

We're building an ethical hacking AI agent to secure the open-ended interfaces of blackbox intelligent systems. The cybersecurity landscape is changing fast, and we intend to keep leading on the autonomous-adversarial-validation side of it.

What you should take from this

If you're building or operating AI systems:

  1. Log everything, from request one. The only reason this story has a happy ending is that we had the forensics before we needed them.
  2. Assume bad actors at trial scale. Don't defer abuse-prevention to "later."
  3. Gate genuinely dual-use capability behind identity. KYC/KYB is becoming table stakes for offensive AI tooling — OpenAI already does it.
  4. Validate exploitability, not just findings. A vulnerability list isn't a risk assessment until something proves it's reachable from the outside.
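To make the first point concrete, here is a minimal sketch of request-level audit logging using only the Python standard library. It is not our actual logging stack: the field names, the `audit_sketch` logger name, and the sample values are assumptions for illustration, and the in-memory buffer stands in for an append-only file or log pipeline.

```python
import io
import json
import logging
from datetime import datetime, timezone


def make_audit_logger(stream) -> logging.Logger:
    """Return a logger that writes one JSON object per line to `stream`."""
    logger = logging.getLogger("audit_sketch")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers if called twice
    logger.propagate = False  # keep audit lines out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


def log_request(logger: logging.Logger, user_id: str, source_ip: str, action: str) -> None:
    """Record who did what, from where, and when, as a single JSON line."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "source_ip": source_ip,
        "action": action,
    }
    logger.info(json.dumps(record, sort_keys=True))


buffer = io.StringIO()  # stand-in for an append-only file or log pipeline
audit = make_audit_logger(buffer)
log_request(audit, "user-17", "203.0.113.5", "agent.run")  # documentation-range IP
entries = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(entries[0]["user_id"], entries[0]["action"])
```

One JSON object per request is enough to reconstruct an activity trail later, as long as the logging is switched on before the first user arrives.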

We caught our bad actor. We hardened the platform. And every login at penclaw.ai now goes through Persona before it goes anywhere near our agent.

Want to follow what we're building as it happens? Subscribe to my newsletter, When Machines Talk.

Stay secure,

Ozgur Ozkan
CEO & Co-founder, Audn.AI


P.S. To the person who tried this in our trial: the complaint cases with your ISP and service provider are real, and so were the logs. You were never invisible. To everyone else building in good faith — let's secure the blackbox future together.