A Hong Kong finance worker authorized a $25 million transfer after a video call with what he believed was his CFO. The face and voice were deepfakes. In another case, a Vancouver SaaS company lost $2.3 million when attackers used a cloned CEO voice on a Zoom call to push through "urgent vendor payments"—after they’d already phished the CFO’s email and studied how the real CEO talked. The pattern isn’t new. Business email compromise has relied on impersonation for years. What’s new is that the impersonation is no longer just text. It’s voice and video that pass the casual human check, and that’s enough to make people bypass the controls you thought you had.
Identity verification has become a primary control point for remote work, onboarding, and high-value transactions. It’s also become a primary target. Attackers aren’t only trying to fool a selfie or a liveness check. They’re building durable impersonation: synthetic voices for vishing and BEC, synthetic faces for account creation and takeover, and increasingly, attacks that skip the camera and microphone entirely by feeding your systems pre-recorded or AI-generated streams. If your verification design assumes "we’ll detect the fake," you’re already behind. The better question is whether you’re still trusting a single channel—and whether that channel is even the one you think it is.
Voice: The New BEC Vector
Email-based BEC works because authority and urgency override normal verification. Add a voice that sounds like the CEO, and the override gets stronger. Employees are conditioned to comply with executive requests. A voice carries tone, stress, impatience—cues that feel more "real" than text and short-circuit the "let me verify this" reflex. Attackers know it. Voice cloning no longer requires a lab. A few minutes of public audio—earnings calls, interviews, social clips—can produce a model that says whatever the attacker scripts. The tools are cheap and improving. The attack chain is straightforward: get the CFO or payments person on a call (or create urgency so they call back a number the attacker controls), play the cloned voice with a plausible story—urgent deal, confidential transfer, don’t tell anyone yet—and wire instructions go through before anyone double-checks on a separate channel.
The financial scale is real. One estimate puts AI-enabled fraud losses on a path from roughly $12 billion in 2023 to tens of billions by 2027, and North American deepfake-related fraud cases have surged. The Vancouver incident above—spear-phish for access, study the victim’s communication, then deploy a cloned voice on a video call—is a repeatable playbook, and better voice biometrics alone won’t break it. Never treat voice (or video) as sufficient proof of identity for high-impact actions. Voice can be part of a flow; it cannot be the only gate for moving money or changing access.
Video and the Sensor That Isn’t There
On the video side, the problem splits into two. First, the media itself: deepfakes and face swaps that look good enough to pass quick human or automated checks, especially in low resolution, on mobile, after compression. Research on real-world deepfake benchmarks (e.g., in-the-wild political and synthetic datasets) shows that detectors that perform well in clean conditions often degrade when the footage has been re-encoded, cropped, or captured on a phone. "Does this face look real?" is a moving target, and the target is moving in the attacker’s favor.
Second, and often less discussed: the sensor may not be in the loop at all. Injection attacks don’t present a fake face to the camera. They replace the video stream between the capture device and your backend. Virtual cameras, compromised devices, emulators, or man-in-the-middle style substitution can feed your system a perfect-looking, pre-recorded or synthetic video. From your system’s perspective, it’s receiving a valid stream. Traditional liveness checks look for physical giveaways—screen moiré, reflections, depth inconsistency. Injected streams are digitally pristine; they never went through a physical capture, so those cues aren’t there to detect. Your liveness logic is answering "is this video consistent with a real camera?" while the input is coming from a pipe that bypassed the camera. "Stronger deepfake detection" alone doesn’t fix the issue. If the capture path is compromised, the best detector is still judging the wrong thing.
Replay attacks add another twist. Researchers have shown that playing synthetic speech through a speaker and re-recording it through a microphone can significantly degrade detection—natural room acoustics and device artifacts strip away some of the synthetic fingerprints that models rely on. Even "we’ll detect synthetic audio" is fragile when the audio is replayed in a realistic environment. The same idea applies to video: replayed or re-injected sessions can look and sound more "human" to both humans and detectors.
Why Single-Channel Verification Fails
The core failure is treating verification as a single decision: "Is this the right person?" when the real question is "Is this session legitimate?" A session includes the media, the device, the network path, and the behavior. If you only look at the media, you’re vulnerable to better fakes and to attacks that never give you real media. If you only look at "liveness" in the pixels, injection defeats you. If you only look at "voice match," cloning and replay defeat you.
In enterprise settings, a single successful bypass isn’t a one-off scam. It’s an access event. The attacker may create a fraudulent account, pass a remote hiring or KYC check, take over an existing account, or use the verified identity as a stepping stone for privilege escalation. The bar has to be: never grant durable trust on the basis of a single channel that can be spoofed or bypassed.
What to Do About It
Stop treating voice or video as sufficient for high-stakes decisions. For wire transfers, access grants, or sensitive data requests, require a second factor that isn’t the same channel. Callback to a known, out-of-band number (from a trusted directory, not from the request); in-person or hardware-backed approval for very high value; or a separate, pre-established process that doesn’t rely on "the person on this call said so." Zero-trust for voice: verify through another path before acting.
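The out-of-band rule above can be made mechanical rather than left to judgment under pressure. A minimal sketch, assuming a hypothetical `TransferRequest` record and a trusted directory maintained separately from any incoming request (all names and the $10,000 threshold are illustrative, not from any real system):

```python
from dataclasses import dataclass

# Trusted callback numbers, maintained out of band -- never taken from
# the request or the caller. Entries here are placeholders.
DIRECTORY = {"cfo@example.com": "+1-555-0100"}

HIGH_VALUE_USD = 10_000  # illustrative threshold

@dataclass
class TransferRequest:
    requester: str            # identity claimed on the call or email
    amount_usd: float
    origin_channel: str       # e.g. "video_call", "email", "phone"
    callback_confirmed: bool  # set only after calling the DIRECTORY number back
    hardware_approval: bool   # e.g. a signed approval from a security key

def may_release(req: TransferRequest) -> bool:
    """Never release a high-value transfer on the strength of one channel."""
    if req.amount_usd < HIGH_VALUE_USD:
        return True  # normal controls apply; out of scope for this sketch
    if req.requester not in DIRECTORY:
        return False  # no trusted out-of-band path exists, so we cannot verify
    # The confirming step must NOT be the channel the request arrived on:
    # either a callback to the directory number or a hardware-backed approval.
    return req.callback_confirmed or req.hardware_approval
```

The point of the structure is that `origin_channel` never appears in the approval logic: nothing the person on the call says or shows can satisfy the gate, which is exactly the property that defeats a cloned voice on a live call.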
Validate the session, not just the pixels. Where you do use face or voice, treat them as one signal among several. Device integrity (is this a known, unmodified device? is the camera/mic pipeline protected?), behavioral signals (does the interaction look human and consistent with a normal verification flow?), and consistency across the session (e.g., does the same session show signs of injection or automation?) should feed the decision. That implies instrumentation: understanding capture path, device attestation where feasible, and behavioral analytics. Vendors that offer "full-session" or "session-level" verification are pushing in this direction—media perception plus integrity plus behavior. Have more than one axis so that if one is bypassed, the others can still block.
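One way to operationalize "more than one axis" is to give each axis a hard floor, so a near-certain failure on any one signal blocks the session no matter how convincing the others look. A sketch under stated assumptions (the signal names, weights, and thresholds are illustrative, not a vendor API):

```python
# Session-level decision combining independent axes. Each signal is scored
# 0.0 (clearly bad) to 1.0 (clearly good) by upstream components; those
# components and the weights below are assumptions for illustration.

def session_decision(signals: dict) -> str:
    media    = signals.get("media_authenticity", 0.0)  # deepfake/liveness models
    device   = signals.get("device_integrity", 0.0)    # attestation, capture path
    behavior = signals.get("behavioral", 0.0)          # timing, interaction patterns

    # Hard floor per axis: a near-certain failure on ANY axis denies,
    # regardless of how good the other axes look. This is what stops a
    # flawless injected deepfake that fails device attestation.
    if min(media, device, behavior) < 0.2:
        return "deny"

    combined = 0.4 * media + 0.35 * device + 0.25 * behavior
    if combined >= 0.75:
        return "allow"
    return "step_up"  # escalate: out-of-band check or manual review
```

For example, a pixel-perfect injected stream might score 0.95 on media authenticity but 0.1 on device integrity; the floor denies it even though a media-only pipeline would have waved it through.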
Assume the capture path can be hostile. Design so that a perfect-looking face or voice isn’t enough. That means not relying on a single liveness or deepfake detector in isolation; don't assume the input stream is authentic. Where you can, enforce camera/mic integrity (e.g., attested capture, no virtual cameras in high-assurance flows), and combine that with media analysis. For the highest-assurance cases, consider in-person or hardware-backed steps so that the "session" is anchored in something harder to spoof.
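One narrow, cheap integrity signal in that combination is refusing capture devices whose names match known virtual-camera drivers. This is a hedged sketch only: device enumeration is platform-specific and omitted here, the deny-list entries are examples rather than an exhaustive list, and name checks are trivially evaded by a determined attacker, so this should feed the session decision rather than gate it alone:

```python
# Example markers of common virtual-camera software. Illustrative and
# incomplete by design; treat a match as one negative signal, not proof.
VIRTUAL_CAMERA_MARKERS = (
    "obs virtual camera",
    "manycam",
    "snap camera",
    "droidcam",
)

def capture_device_suspicious(device_name: str) -> bool:
    """True if the reported capture device name matches a known virtual camera."""
    name = device_name.lower()
    return any(marker in name for marker in VIRTUAL_CAMERA_MARKERS)
```

Attested capture (where the OS or hardware signs that frames came from a physical sensor) is the stronger version of this check; the deny-list is the fallback for platforms where attestation isn’t available.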
Training and policy. People still approve wires and access. Training that "urgent voice or video requests require a separate verification step" and that "we never bypass process because the caller sounds like the boss" reduces the effectiveness of social engineering. Clear, simple policies help too: no wire or access change on the basis of a single call or video, and a defined out-of-band verification path that’s easy to use, so that compliance isn’t the exception.
Deepfakes and injection aren’t future threats. They’re in use today in BEC, account takeover, and identity onboarding fraud. Voice-only and video-only authentication were always weak for high-value decisions; they’re now insufficient. Redesign verification so trust is never granted on one spoofable channel, and so the session (not just the face or voice) is what gets verified.