AI news

GPT-5.5 Hacked a Vulnerable App 7 Out of 10 Times - Researcher Spent ,500 to Find Out

A security researcher built a deliberately vulnerable app and tested five major LLMs to see which could exploit it. GPT-5.5 solved it 70% of the time; DeepSeek v4 Flash scored zero.

FounderBuilt AI News · 04/06/2026 · 2 min read

What happened

Security researcher Kasra Rahjerdi wanted to know if LLMs could reproduce a common class of exploits he had found in real-world apps. So he built a deliberately vulnerable book review app - with a Firebase back-end that exposed its data layer - and spent ,500 running five major models against it to see which could break in. The results are a revealing look at how today's frontier models handle real security tasks.

Why it matters

GPT-5.5 dominated the test, solving the challenge 7 out of 10 times at a median cost of .46 per successful run and 260,000 tokens. DeepSeek v4 Pro managed 3 out of 10 at just /bin/bash.62 per solve - by far the cheapest successful model. Claude Opus 4 and Claude Sonnet 4.6 each cracked it 2 out of 10 times, with Opus costing 6.15 per solve and Sonnet a steep 5.75. DeepSeek v4 Flash failed all 10 attempts.

What's next

The exploit targeted a Firebase misconfiguration - the exact class of bug Rahjerdi says he routinely finds in production Firebase and Supabase apps. The app had a hardened API layer, but because Firebase credentials were bundled in the mobile app package, the models could bypass the API entirely and query Firestore directly. For founders shipping AI-powered features, the experiment is a practical reminder that adding an LLM to your stack does not replace basic security hygiene - and that frontier models are already capable of finding the gaps.