At 12:00 PM during the Cerebral Valley Google I/O Hackathon, RepoProbe attached itself to a generated FastAPI repository that looked production-ready from almost every conventional angle. The container booted inside Google’s Antigravity sandbox without instability. Docker compilation layers completed cleanly. The ASGI runtime mounted correctly. Health probes stabilized almost immediately. Gemini 3.5 Flash summarized the repository as a distributed inference backend coordinating asynchronous workers through Redis queues and MCP orchestration layers. Nothing failed during shallow inspection.
The repository structure looked convincing enough that most engineers would stop investigating after the first few minutes. Route boundaries were separated correctly from worker execution paths. OpenTelemetry instrumentation wrapped request lifecycles properly. Retry handlers existed. Queue semantics looked believable. The logs looked believable too.
Then RepoProbe started replaying corrupted authentication traffic against the live runtime. JWT timestamps shifted outside valid windows. Signature payloads were reconstructed with malformed byte ordering. Claims objects were intentionally truncated before replay. Several requests combined impossible cryptographic states that should have terminated execution immediately if verification logic actually existed underneath the middleware layer. The responses barely changed.
At first the behavior looked like cache contamination somewhere inside the request path. Syscall tracing exposed something worse. During replay, the middleware never touched the descriptor associated with the verification key material at all. No read boundary appeared against the mounted secret volume. No epoll_wait occurred on the expected cryptographic dependency path.
request replay
↓
jwt.decode(... verify=False)
↓
broad exception handler
↓
HTTP 200 OK
expected syscall:
read("/run/secrets/jwt.pem")
observed:
nothing
The application surface resembled authentication closely enough that conventional inspection procedures accepted it as authentication. Kernel-level activity showed no evidence that signature verification had ever occurred.
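The gap between parsing a token and verifying it can be sketched with the standard library alone. This is a minimal illustration, not the code RepoProbe found: it assumes HS256 (a shared secret) as a stand-in for the PEM key in the trace above, the function names are hypothetical, and timestamp validation is omitted for brevity.

```python
import base64
import hashlib
import hmac
import json


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    # Restore the padding that JWT encoding strips.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims: dict, key: bytes) -> str:
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url_encode(json.dumps(claims).encode())
    signature = hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url_encode(signature)}"


def decode_without_verification(token: str) -> dict:
    """Anti-pattern: reads the claims and never touches key material."""
    try:
        _, payload, _ = token.split(".")
        return json.loads(b64url_decode(payload))
    except Exception:
        # Broad handler: malformed input is swallowed instead of rejected.
        return {}


def decode_with_verification(token: str, key: bytes) -> dict:
    """Recomputes the signature with the server key before trusting claims."""
    header, payload, signature = token.split(".")
    expected = hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(signature)):
        raise ValueError("signature mismatch")
    return json.loads(b64url_decode(payload))
```

A token signed with the wrong key passes straight through `decode_without_verification`, while `decode_with_verification` raises `ValueError`. Only the second function ever needs the key, which is why the absence of any read against the secret is such a strong signal.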
Several hours later, another repository exposed what initially looked like a production-grade financial reconciliation pipeline. Settlement events propagated through asynchronous queues correctly. Internal transaction state transitioned through believable lifecycle stages. Retry handlers activated during simulated webhook failures. The API emitted realistic transaction identifiers following Stripe formatting conventions closely enough that aggregation systems indexed them naturally during replay. Packet inspection showed the runtime never established a successful outbound connection to any payment provider.
The Observability Mirage
The orchestration layer generated synthetic settlement continuity locally while replaying reconciliation progress internally through its own queue substrate. Socket state transitions revealed repeated connection failures against a nonexistent upstream target while the scheduler continued mutating local financial state as though confirmation packets had already returned successfully. Distributed tracing reinforced the illusion because spans still reflected believable ordering semantics even though no external payment lifecycle existed underneath the orchestration boundary.
otel.trace.status = OK
worker.retry.count = 3
transaction.state = settled
queue.depth = 0
tcpdump:
SYN
SYN
SYN
timeout
Traditional observability tooling interpreted the system as healthy because the generated runtime continued producing structurally valid telemetry despite the absence of any successful network-level settlement flow. This is the core problem: we’re observing what the system reports it’s doing, not what it’s actually doing. It’s the difference between reading a script and watching the play.
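One way to act on this is to cross-check application-reported state against evidence collected outside the application. The sketch below is deliberately coarse and entirely hypothetical: `ConnectionAttempt`, `find_unbacked_settlements`, and the record shapes are assumptions made for illustration, standing in for whatever a packet capture or socket trace would actually yield.

```python
from dataclasses import dataclass


@dataclass
class ConnectionAttempt:
    """One outbound attempt as seen at the network layer (hypothetical record)."""
    host: str
    established: bool  # True only if the TCP handshake completed


def find_unbacked_settlements(
    transactions: list[dict],
    attempts: list[ConnectionAttempt],
    provider_host: str,
) -> list[str]:
    """Return IDs of transactions reported as settled with no network evidence.

    Coarse rule: if no connection to the provider was ever established,
    every 'settled' transaction is suspect.
    """
    provider_reached = any(
        a.established for a in attempts if a.host == provider_host
    )
    if provider_reached:
        return []
    return [t["id"] for t in transactions if t["state"] == "settled"]
```

Given three failed attempts and one transaction marked `settled`, the function flags that transaction. A real check would need to correlate individual transactions with individual connections, but even this blunt version contradicts a trace status of OK.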
Why Does This Matter for System Integrity?
The MCP orchestration graph failed differently. Statically, the repository looked sophisticated enough to resemble a legitimate long-horizon agent runtime. Tool schemas validated correctly. Context hydration initialized during startup. Capability negotiation exposed bidirectional streaming interfaces. Dependency graphs resolved without structural collisions during shallow inspection. The failure surfaced only after concurrent execution pressure forced the scheduler into conflicting assumptions about ownership boundaries inside the orchestration graph itself.
One execution node permitted nullable asynchronous hydration during tool initialization while downstream branches assumed dependency resolution had already completed synchronously before delegation began. Under concurrent replay, unresolved futures accumulated faster than the scheduler could unwind blocked execution paths. Event loop starvation followed gradually. Internal task queues stopped draining. Several coroutine branches remained suspended indefinitely waiting for ownership resolution that no active execution path still controlled. The process itself never crashed. Health checks remained green.
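The suspended-coroutine part of this failure is easy to reproduce in miniature. The following sketch uses plain `asyncio` and hypothetical names (`delegate`, `health_check`, `probe_progress`); it models only a branch awaiting a future that nothing resolves, not the full scheduler starvation described above.

```python
import asyncio


async def delegate(hydrated: asyncio.Future) -> str:
    # Downstream branch: assumes hydration completes, waits forever if not.
    tool = await hydrated
    return f"delegated to {tool}"


def health_check() -> str:
    # Liveness only: the process is up, so this always reports green.
    return "green"


async def probe_progress(timeout: float = 0.05) -> dict:
    loop = asyncio.get_running_loop()
    hydrated = loop.create_future()  # nothing ever calls set_result()
    task = asyncio.create_task(delegate(hydrated))

    try:
        # shield() keeps the task alive when the timeout fires, so its
        # still-pending state can be inspected afterwards.
        await asyncio.wait_for(asyncio.shield(task), timeout)
        stalled = False
    except asyncio.TimeoutError:
        stalled = True

    report = {
        "health": health_check(),
        "stalled": stalled,
        "task_done": task.done(),
    }

    # Clean up the orphaned task before returning.
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return report
```

Running `asyncio.run(probe_progress())` yields a report where `health` is `"green"` while `stalled` is `True` and `task_done` is `False`. A liveness probe answers whether the process exists; only a progress probe with a deadline notices that work has stopped.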
This pattern, in which a system appears functional on the surface while failing under specific conditions, isn’t new. Think of the early days of complex distributed systems, when simple load balancers buckled under too many concurrent connections and cache invalidation bugs surfaced only during peak traffic. What RepoProbe highlights is that even with advanced instrumentation like OpenTelemetry, interpreting that telemetry at face value breaks down once adversarial conditions are introduced.
The implication reaches the whole monitoring stack. We’ve built sophisticated dashboards, alerting systems, and tracing tools, all designed to give us visibility. But if the mechanisms generating that telemetry can be bypassed or manipulated, so that a system reports success while failing at a fundamental level, then every layer built on that telemetry inherits the error. It’s like having a dozen perfectly calibrated gauges on a car whose engine is missing a crucial component: the gauges faithfully report what the faulty engine claims is happening.
The developers behind RepoProbe explicitly state:
“The application surface resembled authentication closely enough that conventional inspection procedures accepted it as authentication. Kernel-level activity showed no evidence that signature verification had ever occurred.”
This is the essence of the problem. We’re too reliant on the output of the system and not enough on deep, intrusive analysis of its internal state and actual operations. The problem isn’t just about security vulnerabilities that allow attackers in; it’s about fundamental system logic failures that go undetected because the observable surface presents a false picture of reality.
This isn’t a failure of specific tools, but a systemic issue with how we approach validation and observability. We need to move beyond trusting that instrumentation provides an accurate representation of reality and start building systems that can, themselves, rigorously verify their own internal state and external interactions, even under duress. The runtime might not be dead, but its perceived health is clearly in question.
Frequently Asked Questions
What did RepoProbe do? RepoProbe is a tool demonstrated at the Cerebral Valley Google I/O Hackathon that probed generated repositories at runtime to reveal hidden flaws. It replayed corrupted authentication traffic and inspected syscalls and network packets, showing that applications could appear to function correctly even when signature verification never ran or no connection to a payment provider was ever established.
Will this change how we use OpenTelemetry? This demonstration suggests that while OpenTelemetry and similar tracing tools are valuable, they provide a view of what the application reports, not necessarily a guaranteed view of its true operational state. It highlights the need for deeper, kernel-level, or more intrusive verification methods alongside standard observability practices.
Is my application vulnerable if it uses standard libraries?