Security claims are cheap. Any system that validates human identity will eventually face adversarial pressure — from click farms, synthetic AI agents, replay tools, and determined researchers. The question is not whether attacks will be attempted, but whether your defenses hold when they are.
We decided to attack our own system before someone else did. Over several months, the APTOGON team ran a structured red-team exercise against our own production verification stack — building a dedicated attack simulator, generating thousands of spoofed verification attempts, and iterating on the defenses that emerged from each finding.
This post describes the methodology and what we hardened, without disclosing specific thresholds, model weights, or techniques that would help a real attacker.
Why Red-Teaming a Gesture System Is Hard
A traditional login system can be red-teamed with standard penetration testing tools — credential stuffing, token manipulation, session hijacking. A behavioral biometrics system is different: the "lock" is a statistical model, not a secret. There is no password to steal. The attacker must produce data that the model classifies as human.
This creates an unusual threat model. A sophisticated attacker does not need to exploit a code vulnerability; they need to understand the decision boundary of the classifier and craft inputs that cross it. This is an adversarial machine learning attack, not a conventional one.
Our red-team framework therefore had two layers: traditional security testing (authentication, session integrity, rate limiting, replay prevention) and model probing (what inputs fool the behavioral classifier, and by how much?).
The Attack Classes We Simulated
- Synthetic gesture generation — statistically generating gesture events with realistic timing, velocity distribution, and directional corrections, without a real human. The goal: produce a payload that looks plausible to our behavioral model.
- Timestamp replay — recording a real human gesture and resubmitting it with adjusted timestamps, bypassing the recency window that invalidates stale sessions.
- Fingerprint rotation — generating large numbers of unique device-fingerprint hashes to bypass per-device rate limits and simulate a Sybil factory.
- Black-box model probing — systematically varying one parameter at a time across many attempts to map the classifier's decision boundary without access to the model internals.
Each attack class was implemented as a standalone module. We ran batches of attempts against a dev instance, measured pass rates, identified which parameter ranges the classifier was most sensitive to, and adjusted defenses accordingly.
What the Synthetic Generator Revealed
The first synthetic gesture generator used a Brownian motion model with Gaussian noise — a reasonable approximation of human movement. Early versions of the classifier were not robust against this. The generated gestures had correct statistical distributions for velocity and position, but their timing was too uniform: the inter-event intervals followed a tight distribution that real humans never produce.
This finding drove a significant improvement in the pattern analysis layer. Real human gestures show irregular pause clustering — brief hesitations that vary unpredictably and reflect genuine cognitive processing. We tightened the entropy threshold for inter-event timing, which substantially increased the classifier's rejection rate against synthetic inputs while leaving human pass rates unchanged across all our test devices.
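To make the timing check concrete, here is a minimal sketch of one way to score inter-event irregularity. It assumes Shannon entropy over binned intervals; the bin width, the cutoff, and the function names are hypothetical and are not the measure or values used in production.

```python
import math
from collections import Counter
from itertools import accumulate

BIN_MS = 10             # hypothetical histogram bin width
MIN_ENTROPY_BITS = 1.5  # hypothetical cutoff, not a production value

def interval_entropy(timestamps_ms, bin_ms=BIN_MS):
    """Shannon entropy (in bits) of the binned gaps between consecutive events."""
    intervals = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    if not intervals:
        return 0.0
    counts = Counter(int(dt // bin_ms) for dt in intervals)
    total = len(intervals)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

def timing_looks_human(timestamps_ms):
    """True when the interval distribution is spread out enough."""
    return interval_entropy(timestamps_ms) >= MIN_ENTROPY_BITS

# A metronomic stream: every gap is 16 ms, so all intervals share one bin.
steady = [i * 16 for i in range(50)]

# A stream with short moves and occasional long hesitations.
gaps = [12, 18, 9, 140, 22, 15, 31, 260, 11, 19, 45, 8, 95, 17]
irregular = list(accumulate(gaps, initial=0))

print(interval_entropy(steady))       # 0.0
print(timing_looks_human(irregular))  # True
```

A single entropy number is a blunt instrument. A generator that samples gaps from a wide distribution would pass this check on its own, which is one reason to combine it with other signals.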
The key insight: statistical shape matters less than statistical irregularity. A human hand is not a noise generator — it is a motor system with fatigue, attention shifts, and micro-corrections. These are hard to fake convincingly.
The Replay Attack and How We Stopped It
Timestamp replay was the most technically straightforward attack. If an attacker records a real human gesture — their own, or a stolen dataset — they can shift all timestamps forward so the last event falls within the recency window. The gesture is real; only the time is fabricated.
We hardened against this on multiple levels. First, the session token that authorizes a verification attempt is single-use and short-lived, so a replayed payload against a spent token is rejected immediately. Second, we cross-check the reported gesture duration against the challenge issuance time: a gesture that claims to have taken 12 seconds cannot appear in a session that has been open for only 8. Third, the challenge dot positions (server-issued coordinates that the user must visually track and approach with the cursor) are embedded in the validation and cannot be known to an attacker who pre-recorded a gesture in a different session.
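The first two checks can be sketched as follows. This is an illustration under stated assumptions, not our implementation: the class name, the in-memory store, and the 30-second lifetime are all hypothetical, and the challenge-dot check is omitted.

```python
import secrets
import time

TOKEN_TTL_S = 30.0  # hypothetical lifetime, not a production value

class ChallengeStore:
    """In-memory store of issued session tokens (illustrative only)."""

    def __init__(self, clock=time.monotonic):
        self._issued = {}    # token -> issuance time
        self._clock = clock  # injectable so the logic can be tested

    def issue(self):
        token = secrets.token_urlsafe(32)
        self._issued[token] = self._clock()
        return token

    def redeem(self, token, claimed_duration_s):
        """Return (ok, reason). The token is consumed on first use."""
        issued_at = self._issued.pop(token, None)  # pop enforces single use
        if issued_at is None:
            return False, "unknown or spent token"
        open_for = self._clock() - issued_at
        if open_for > TOKEN_TTL_S:
            return False, "token expired"
        if claimed_duration_s > open_for:
            # The gesture claims to be longer than the session has existed.
            return False, "gesture longer than session"
        return True, "ok"

# Usage with a fake clock: a 12-second gesture in an 8-second-old session.
now = [100.0]
store = ChallengeStore(clock=lambda: now[0])
token = store.issue()
now[0] += 8
print(store.redeem(token, 12.0))  # (False, 'gesture longer than session')
print(store.redeem(token, 5.0))   # (False, 'unknown or spent token')
```

Consuming the token before running the other checks means a failed attempt also burns the token, so an attacker cannot retry the same session with adjusted timestamps.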
Sybil Simulation and Rate Limits
The fingerprint rotation attack was designed to answer a specific question: how many fake verifications can a single IP address produce per hour, and how many unique decentralized identifiers (DIDs) can it generate?
The answer depends on two rate-limiting layers working together. The per-IP limit caps the number of verification attempts within a rolling window. The per-fingerprint limit caps how many successful verifications any single device hash can produce over a longer window. Together, these create a practical ceiling on Sybil production from a single physical machine, regardless of how many fake fingerprints are generated.
Our simulation revealed that the limits were well-calibrated for legitimate users but left some headroom for coordinated attacks using IP rotation. We tightened the backoff curve for failed attempts, so a pattern of rapid sequential failures from the same IP triggers a progressive cooldown, not just a hard cap.
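A progressive cooldown of this kind could look like the sketch below. It assumes a simple doubling schedule keyed by IP; the class name, base delay, and cap are hypothetical, and the real backoff curve is not published here.

```python
class FailureBackoff:
    """Doubling cooldown after consecutive failures, tracked per IP."""

    def __init__(self, base_s=2.0, cap_s=3600.0):
        self.base_s = base_s  # hypothetical first cooldown
        self.cap_s = cap_s    # hypothetical maximum cooldown
        self._state = {}      # ip -> (consecutive_failures, blocked_until)

    def allowed(self, ip, now):
        """True if this IP is not currently cooling down."""
        _, blocked_until = self._state.get(ip, (0, 0.0))
        return now >= blocked_until

    def record(self, ip, success, now):
        """Record an attempt outcome and return the cooldown applied."""
        if success:
            self._state.pop(ip, None)  # a success resets the streak
            return 0.0
        failures = self._state.get(ip, (0, 0.0))[0] + 1
        exponent = min(failures - 1, 32)  # bound the power before capping
        cooldown = min(self.cap_s, self.base_s * 2 ** exponent)
        self._state[ip] = (failures, now + cooldown)
        return cooldown

# Usage: three rapid failures from one address (203.0.113.7 is a
# documentation-only IP).
backoff = FailureBackoff()
for t in (0.0, 2.0, 6.0):
    print(backoff.record("203.0.113.7", success=False, now=t))  # 2.0, 4.0, 8.0
print(backoff.allowed("203.0.113.7", now=10.0))  # False, blocked until 14.0
```

Unlike a hard cap, the cost of each additional failure grows, so a burst of rapid failed attempts slows itself down well before any fixed quota is reached.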
Black-Box Model Probing
The most time-consuming part of the exercise was the parameter sweep. By varying one metric at a time (rhythm irregularity, velocity standard deviation, correction count, or gesture duration, for example) across many attempts with different device fingerprints, we built an empirical map of where the classifier drew its lines.
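The shape of a one-at-a-time sweep is simple enough to sketch against a toy classifier. Everything here is hypothetical: the metric name, the toy decision rule, and the helper functions stand in for a harness that would submit real attempts and record verdicts.

```python
def sweep(classify, baseline, name, values):
    """Vary one metric across `values`, holding the others at baseline."""
    results = []
    for value in values:
        candidate = dict(baseline, **{name: value})
        results.append((value, classify(candidate)))
    return results

def flips(results):
    """Adjacent value pairs between which the verdict changes."""
    return [(a[0], b[0]) for a, b in zip(results, results[1:]) if a[1] != b[1]]

# Toy stand-in for a black-box classifier; not a real decision rule.
def toy_classifier(metrics):
    return metrics["rhythm_irregularity"] >= 0.4

baseline = {"rhythm_irregularity": 0.0, "correction_count": 3}
grid = [i / 10 for i in range(11)]
results = sweep(toy_classifier, baseline, "rhythm_irregularity", grid)
print(flips(results))  # [(0.3, 0.4)]
```

Each flip brackets a point on the decision boundary along one axis. Repeating the sweep per metric gives the empirical map described above, at the cost of one attempt per grid point.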
We are not publishing those numbers. What we can say is that the exercise confirmed the classifier has meaningful separation between the human cluster and the synthetic cluster on all five primary metrics, and that no single metric is decisive: an input that scores well on velocity but poorly on pause entropy will still be rejected. This multi-axis approach is intentional. Fooling one dimension at a time is tractable; fooling all five simultaneously is not.
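As a rough illustration of multi-axis rejection, the sketch below accepts an input only when every metric falls inside its band. The real classifier is a statistical model, not a set of hard bands, and the three metric names and all ranges here are invented for the example.

```python
# Hypothetical acceptance bands; names and numbers are illustrative only.
BANDS = {
    "velocity_std": (0.20, 0.90),
    "pause_entropy": (1.50, 4.00),
    "correction_count": (2, 12),
}

def failed_axes(metrics, bands=BANDS):
    """Names of metrics that are missing or outside their band."""
    failed = []
    for name, (low, high) in bands.items():
        value = metrics.get(name)
        if value is None or not (low <= value <= high):
            failed.append(name)
    return failed

def is_accepted(metrics, bands=BANDS):
    """Accept only when no axis fails."""
    return not failed_axes(metrics, bands)

# Good velocity and corrections, but pause timing is too regular.
attempt = {"velocity_std": 0.45, "pause_entropy": 0.30, "correction_count": 5}
print(failed_axes(attempt))  # ['pause_entropy']
print(is_accepted(attempt))  # False
```

Because acceptance is a conjunction, tuning an input to sit comfortably inside one band does nothing for the others.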
Authentication and Session Integrity
Beyond the behavioral layer, we tested the session and authentication infrastructure. Key findings and the fixes applied:
- Trust score was being read from localStorage by the client and could be spoofed. Fixed: trust score is now resolved exclusively from the server-side database; the client receives it as a read-only value and has no write path.
- Key export files did not detect tampering. Fixed: export files now include an HMAC integrity tag computed with a server-side secret; any field modification invalidates the tag and triggers a tamper warning on import.
- The challenge verification warning on key export was displayed only in the UI language, not the user's preferred locale. Fixed: warning messages now follow the verified locale stored in the user's credential.
- Automated clients were not consistently rejected at the API layer. Fixed: client signal analysis (navigator.webdriver, CDP presence, headless UA strings, missing browser APIs) now runs as an early gate and short-circuits the rest of the verification pipeline.
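The export-integrity fix in the list above follows a standard pattern, sketched here with Python's standard library. The field names, the JSON canonicalization, and the choice of SHA-256 as the HMAC digest are assumptions for illustration; the actual export format is not described in this post.

```python
import hashlib
import hmac
import json

def _canonical(fields):
    """Stable byte encoding so the same fields always produce the same tag."""
    return json.dumps(fields, sort_keys=True, separators=(",", ":")).encode("utf-8")

def sign_export(fields, secret):
    """Attach an HMAC tag computed with a secret the client never sees."""
    tag = hmac.new(secret, _canonical(fields), hashlib.sha256).hexdigest()
    return {"fields": fields, "hmac": tag}

def verify_export(export, secret):
    """Recompute the tag on import and compare in constant time."""
    expected = hmac.new(secret, _canonical(export["fields"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, export.get("hmac", ""))

secret = b"server-side-secret-for-demo-only"  # hypothetical; never shipped to clients
export = sign_export({"key_id": "k-001", "locale": "en"}, secret)
print(verify_export(export, secret))  # True

# Editing any field without the secret invalidates the tag.
tampered = {"fields": dict(export["fields"], locale="xx"), "hmac": export["hmac"]}
print(verify_export(tampered, secret))  # False
```

Because the secret stays on the server, verification on import also has to happen on the server; a client-side check would require shipping the secret.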
What We Did Not Find
The red-team exercise did not surface any token forging vulnerabilities, blockchain proof tampering paths, or server-side injection issues in the verification pipeline. The challenge–response protocol, in which the server issues a session token that the client cannot predict or reuse, held up under all simulated attacks. The on-chain proof mechanism, which writes a SHA3-256 hash of each verification to the Aptos blockchain, provides a tamper-evident audit trail that is entirely independent of our server infrastructure.
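The tamper-evidence property comes from the hash itself, which a short sketch can show. The record fields and the JSON canonicalization below are hypothetical; the post does not specify what is hashed, and the on-chain write is not shown.

```python
import hashlib
import json

def verification_digest(record):
    """SHA3-256 hex digest of a canonically encoded verification record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha3_256(canonical).hexdigest()

# Hypothetical record shape, for illustration only.
record = {"session": "abc123", "result": "pass", "issued_at": 1700000000}
digest = verification_digest(record)

# Changing any field yields a different digest, so a stored digest
# exposes later edits to the record.
altered = dict(record, result="fail")
print(verification_digest(altered) != digest)  # True
```

Anyone holding the original record can recompute the digest and compare it with the published one without trusting the server that produced it.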
The Program Going Forward
Red-teaming is not a one-time event. The attack simulator we built is now part of our ongoing development cycle: new changes to the verification pipeline are tested against the full suite of simulated attacks before deployment. We also run the simulator periodically against production to confirm that defenses have not regressed.
We have also opened a bug bounty program. If you find a genuine bypass or vulnerability in the APTOGON verification system, we want to know about it before anyone else does. The program covers the verification endpoint, the on-chain credential mechanism, and the device-binding infrastructure. Contact details are in the security section of our documentation.
Security is an ongoing process, not a certification. Every attack we simulate today is an attack that cannot be used against our users tomorrow.
Takeaways for Developers Building Similar Systems
- Behavioral classifiers need to be tested against synthetic inputs generated from their own feature distributions — not just evaluated on held-out real data.
- Multi-axis rejection is more robust than single-threshold rejection. An attacker who knows one threshold can tune for it; an attacker who must simultaneously satisfy five independent criteria cannot.
- Session integrity and behavioral analysis are complementary layers. A gesture that passes the model but uses an expired or replayed token still fails. Defense-in-depth applies here as much as anywhere.
- Rate limits need to be tested adversarially, not just reasoned about on paper. Our simulation revealed that the theoretical limits and the practical attack ceiling differed in non-obvious ways.
- Trust state must live server-side. Any value the client can read and write is a value an attacker controls.
The APTOGON verification protocol was built with adversarial inputs in mind from the start. The red-team exercise confirmed the architecture is sound, identified specific parameter tuning improvements, and produced the hardened production system we run today.