Financial systems require fastidious record keeping. API-first embedded financial technology platforms like Bond are certainly no exception. In fact, automation in API-first embedded finance platforms increases the need for strong observability tooling around who has called APIs, when, and with what effect. Because of this, we’ve invested significant effort at Bond into building strong views of requests into and out of our platform, both for our customers and ourselves. Note we aren’t just talking about “DevOps” observability — how Bond itself monitors our backend — which we of course do. We’re talking about enhanced data and a feature set for visibility into requests that cross our edge from and to our customers as a customer-facing and back-office platform feature.
Bond has built in visible API call logs from the start, but our newest iteration, besides improving observability, leverages three important general principles for engineering at Bond: 1) contract-first design of performant services, 2) awe-inspiring interfaces for clients as well as our own back-office operations, and 3) no-downtime, backwards-compatible rollouts across our platform. In this post, we talk about our wholesale upgrade of API call logging as well as a more recent live migration to Kafka following these principles.
Our enhanced logging system started with the API contract, realized in a protobuf/gRPC specification for “API call logs” and saving and searching those logs. Our objects contain data about the caller, a record of the call itself, and request/response data including:
- Identity: a “brand id” (our customer or “tenancy” key), a user id, the credential identity used, originating IP address, etc.
- Call Record: domain, path, HTTP method, response status, request start time, end-to-end request duration, ingress/egress, etc.
- Metadata: all request/response headers and body data, with masking on sensitive fields before data is ever stored
With this design any request into our platform is fully described up to redactions of sensitive data (HTTP credentials, SSNs, etc.) that we don’t ever store explicitly. This sort of detail is very useful in an embedded finance platform where requests like “open a credit card,” “transfer money across accounts,” or “this transaction occurred” have real-world consequences involving identity and money and may require review.
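As a rough illustration of the record described above, here is a minimal Python sketch of a log object that masks sensitive values before anything is stored. The field names, the `SENSITIVE_KEYS` set, and the `mask_sensitive` helper are assumptions made for this example, not Bond's actual protobuf contract.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical list of sensitive keys; a real list would be defined elsewhere.
SENSITIVE_KEYS = {"authorization", "ssn", "password"}
MASK = "***"


def mask_sensitive(data: Any) -> Any:
    """Recursively replace the values of sensitive keys in dicts and lists."""
    if isinstance(data, dict):
        return {
            key: MASK if str(key).lower() in SENSITIVE_KEYS else mask_sensitive(value)
            for key, value in data.items()
        }
    if isinstance(data, list):
        return [mask_sensitive(item) for item in data]
    return data


@dataclass
class ApiCallLog:
    # Identity
    brand_id: str
    user_id: str
    source_ip: str
    # Call record
    domain: str
    path: str
    method: str
    status: int
    duration_ms: float
    direction: str  # "ingress" or "egress"
    # Metadata, masked on construction so raw values are never kept
    request_headers: dict = field(default_factory=dict)
    request_body: Any = None

    def __post_init__(self) -> None:
        self.request_headers = mask_sensitive(self.request_headers)
        self.request_body = mask_sensitive(self.request_body)


log = ApiCallLog(
    brand_id="brand-123", user_id="user-456", source_ip="203.0.113.7",
    domain="api.example.com", path="/customers", method="POST",
    status=201, duration_ms=142.0, direction="ingress",
    request_headers={"Authorization": "Bearer secret", "Content-Type": "application/json"},
    request_body={"first_name": "Ada", "ssn": "123-45-6789"},
)
print(log.request_headers)
print(log.request_body)
```

Masking at construction time, rather than at write time, means no code path downstream of the object ever sees the raw values.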
Contracts in protobuf with code generation (“codegen”) allowed us to easily implement a lightweight Python service for storing and serving logs. The contracts, codegen, and a thin layer of generic and reusable gRPC tooling also make it easy to implement gRPC clients for those services. Currently, our gRPC clients include a Lambda function for log submission from AWS API Gateway, a gRPC client in another service for outgoing request logging, a dockerized CLI tool for ad-hoc searching, and an integration in our app backend for log search.
Our API log’s data layer is an AWS RDS Postgres DB organized into two tables, a log table and metadata table, linked via foreign keys with both tables indexed on our likely search columns. Verbose API call log data is relatively storage intensive, so we only store log records in this service for 30 days (cleaning up via an hourly CronJob), but we also ETL log records into a data warehouse for non-production access and longer-term record keeping. This sets up minimal hot-warm tiers for our API call log data allowing us to access recent records quickly and easily, while still persisting data for higher level summary metrics as well as longer-term back-office auditing requirements.
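The two-table layout and 30-day cleanup can be sketched as follows. This is only an illustration: it uses Python's built-in `sqlite3` in place of Postgres so that it runs anywhere, and the table names, column names, and `purge_expired` function are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 30  # from the post; everything else here is illustrative

SCHEMA = """
CREATE TABLE call_log (
    id INTEGER PRIMARY KEY,
    brand_id TEXT NOT NULL,
    path TEXT NOT NULL,
    status INTEGER NOT NULL,
    started_at REAL NOT NULL  -- Unix epoch seconds
);
CREATE TABLE call_log_metadata (
    id INTEGER PRIMARY KEY,
    log_id INTEGER NOT NULL REFERENCES call_log(id) ON DELETE CASCADE,
    headers TEXT,
    body TEXT
);
CREATE INDEX idx_call_log_search ON call_log (brand_id, started_at);
CREATE INDEX idx_metadata_log_id ON call_log_metadata (log_id);
"""


def purge_expired(conn: sqlite3.Connection, now: datetime) -> int:
    """Delete log rows older than the retention window; metadata rows cascade."""
    cutoff = (now - timedelta(days=RETENTION_DAYS)).timestamp()
    cursor = conn.execute("DELETE FROM call_log WHERE started_at < ?", (cutoff,))
    conn.commit()
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this for cascades
conn.executescript(SCHEMA)

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
for age_days in (1, 45):  # one recent record, one past retention
    started = (now - timedelta(days=age_days)).timestamp()
    cur = conn.execute(
        "INSERT INTO call_log (brand_id, path, status, started_at) VALUES (?, ?, ?, ?)",
        ("brand-123", "/accounts", 200, started),
    )
    conn.execute(
        "INSERT INTO call_log_metadata (log_id, headers, body) VALUES (?, ?, ?)",
        (cur.lastrowid, "{}", "{}"),
    )
conn.commit()

deleted = purge_expired(conn, now)
remaining_metadata = conn.execute("SELECT COUNT(*) FROM call_log_metadata").fetchone()[0]
print(deleted, remaining_metadata)
```

A scheduled job would call something like `purge_expired` once per run; the cascade keeps the metadata table in step with the log table without a second delete.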
Incoming (Ingress) Call Logs
Our gateway layer proxies all requests into our service mesh and provides a perfect vantage point for enhanced logging of API calls into Bond. We currently use a containerized AWS Lambda function as a proxy that can handle all AWS API Gateway requests and submit logs, as shown in the diagram below. (A future blog post will outline how we’re also improving our entire approach to the edge.)
After a proxied request to our clusters has completed, our lambda proxy submits a log via gRPC. Any failure in logging is absorbed by the proxy so we don’t interfere with actual requests. However, logging errors have been quite rare, almost entirely absent after fixing some unexpected edge cases (e.g., large raw base64-encoded request content for mobile deposit flows, auth token expiration with long-running requests).
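The "absorb any logging failure" behavior can be sketched in a few lines of Python. The `handle_request`, `forward`, and `submit_log` names are hypothetical stand-ins for the proxy handler, the upstream call, and the gRPC log submission; none of them come from Bond's code.

```python
import logging
from typing import Callable

logger = logging.getLogger("edge_proxy")


def handle_request(
    request: dict,
    forward: Callable[[dict], dict],
    submit_log: Callable[[dict, dict], None],
) -> dict:
    """Proxy the request, then try to log it without affecting the response."""
    response = forward(request)
    try:
        submit_log(request, response)
    except Exception:
        # Logging must never break the customer's call; record and move on.
        logger.exception("API call log submission failed; response unaffected")
    return response


def forward_ok(request: dict) -> dict:
    return {"status": 200, "body": "ok"}


def broken_submit(request: dict, response: dict) -> None:
    raise TimeoutError("log service unavailable")


response = handle_request({"path": "/accounts"}, forward_ok, broken_submit)
print(response)
```

Note that the log is submitted only after the upstream response exists, so the log can include the status and duration, and a failure in `submit_log` is recorded locally instead of propagating to the caller.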
Outgoing (Egress) Call Logs
As sketched in the diagram above, the same logging service is also used to store outgoing requests from Bond to clients, specifically our webhook callbacks. Webhooks are a common and straightforward way for a platform to push information to its users without polling. This is a key event-driven feature of Bond's platform, as many actions in automated financial systems take too long for synchronous API request-response cycles. Know Your Customer (KYC) processes, for example, can take minutes and require asynchronous handling.
Bond maintains a notification system and event hierarchy for notifying our clients of 20+ customer, account, card and spending events via webhook callbacks. When a callback is executed based on a notification, a log of the callback is submitted and stored with a lightweight gRPC client built from the protobuf contracts and flagged as an “egress” log to distinguish from incoming (“ingress”) request logs. Otherwise, outgoing call logs conform to exactly the same schema. Storing these logs helps us certify an attempt and response for any specific event requiring client notification. The resultant logs allow us to answer questions such as “did we send any KYC evaluation results for customer X during the last day?” or “which customer callbacks tend to fail and get retried, and why?” quickly and confidently.
Upgrading a Live Service: Logs on Kafka
Our system has readily supported ingesting tens of thousands of request logs per day with a single containerized gRPC service. Even so, after launching our full-stack API call logging system we rolled out Kafka to support higher throughput, reliable performance, and internal data flexibility. Bond takes continuous improvement seriously and moves quickly to deliver features while continuing to optimize those features.
By writing our API call logs to Kafka we can:
- Speed up our edge with faster writes of log data (we saw, roughly, 100-200ms turn into 1-2ms as shown below)
- Disconnect our customers’ call latency from service and database performance, improving uniformity of logging latency (as shown below)
- Store and “stream” log data for other business purposes including analysis and retention
Disconnecting database and edge performance was particularly important to us. By buffering logging in Kafka, we can decrease customer call latency by avoiding a save/acknowledge cycle for every call, and we make logging latency more uniform. Optimizing our service database design isn’t the most critical business priority (yet), which means we can still hit intermittent write delays. But with Kafka, we can largely eliminate the impact on our customers’ experience while we grow and further optimize. Note we aren’t using Kafka here as the data layer for call logs (in tandem with a tool like ksqlDB), but rather as a message bus to move data to feature-tailored persistence layers.
We undertook a backward-compatible upgrade by adding an RPC for “publishing” a log, which just produces the log to Kafka instead of inserting (and indexing) into the database. Using protobufs helps us make such changes safely and cleanly. This particular architecture choice, creating an RPC instead of publishing to Kafka directly, was strategic for two reasons: (a) it lets us avoid complexities within AWS related to API Gateway, Lambdas, and VPCs when connecting to MSK, and (b) with the producer in our long-running service instead of in an ephemeral Lambda, we can utilize Kafka’s buffering and increase throughput. Our webhook callback logging was easier to migrate: being inside our VPC, we could just start producing call logs directly onto the topic. Of course, a Kafka consumer for logs runs alongside the service, subscribes to the call log topic, and stores the messages published there in our database using the same code our gRPC service would.
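The shape of this upgrade, a new "publish" RPC next to the original synchronous one with a consumer that reuses the same storage code, can be sketched as below. To keep the example self-contained, a standard-library `queue.Queue` stands in for the Kafka topic and a Python list stands in for the database; the class and method names are hypothetical.

```python
import queue


class CallLogService:
    """Toy service exposing both the original and the new write path."""

    def __init__(self, topic: queue.Queue) -> None:
        self.topic = topic   # stand-in for the Kafka call log topic
        self.database = []   # stand-in for the Postgres tables

    def store(self, log: dict) -> None:
        """Shared persistence code used by both paths."""
        self.database.append(log)

    def save_log(self, log: dict) -> str:
        """Original RPC: the caller waits for the database write."""
        self.store(log)
        return "saved"

    def publish_log(self, log: dict) -> str:
        """New RPC: enqueue the log and return without touching the database."""
        self.topic.put(log)
        return "published"


def consume(service: CallLogService, max_messages: int) -> int:
    """Drain up to max_messages from the topic and persist them via store()."""
    stored = 0
    while stored < max_messages:
        try:
            log = service.topic.get_nowait()
        except queue.Empty:
            break
        service.store(log)
        stored += 1
    return stored


topic = queue.Queue()
service = CallLogService(topic)

service.save_log({"path": "/accounts", "direction": "ingress"})     # old clients keep working
service.publish_log({"path": "/webhook", "direction": "egress"})    # new clients skip the DB wait
pending = topic.qsize()

stored = consume(service, max_messages=10)
print(pending, stored, len(service.database))
```

Because the original RPC is left untouched, existing clients can move to the publish path one at a time, which is what makes the rollout backward compatible.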
Observability is more about how data is presented in a consumable way to our customers and internal stakeholders (like engineering and customer support) than about the underlying data itself. Concerning API call visibility, the full content of our stored logs is rendered in Bond Portal, the application clients use to manage embedded finance programs on Bond. You can even try Portal out yourself in the self-signup sandbox!
To provide this feature, our backend translates GraphQL calls into Bond-internal gRPC calls returning paginated data for in-app rendering. Logs are segmented by customer (our “tenancy”), so one customer can’t see another’s logs, and searchable by timeframe, method, status, and more. All the request and response details (headers and bodies) are nicely rendered and copyable as JSON from BondOS’ Develop dashboard. Bond’s customers thus have full visibility into what they or their automations are doing on Bond’s Banking-as-a-Service platform.
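A tenancy-scoped, paginated search of the kind described here might look like the following sketch. The `LogEntry` fields, the `search_logs` signature, and the in-memory list are assumptions for illustration; the real feature sits behind GraphQL and gRPC calls that are not shown in this post.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LogEntry:
    brand_id: str
    method: str
    status: int
    started_at: float  # Unix epoch seconds


def search_logs(
    logs: List[LogEntry],
    brand_id: str,
    method: Optional[str] = None,
    status: Optional[int] = None,
    page: int = 1,
    page_size: int = 2,
) -> List[LogEntry]:
    """Return one page of a tenant's logs, newest first."""
    matches = [
        entry for entry in logs
        if entry.brand_id == brand_id  # tenancy filter is always applied
        and (method is None or entry.method == method)
        and (status is None or entry.status == status)
    ]
    matches.sort(key=lambda entry: entry.started_at, reverse=True)
    start = (page - 1) * page_size
    return matches[start:start + page_size]


LOGS = [
    LogEntry("brand-a", "GET", 200, 1.0),
    LogEntry("brand-a", "POST", 201, 2.0),
    LogEntry("brand-b", "GET", 200, 3.0),
    LogEntry("brand-a", "GET", 500, 4.0),
]

first_page = search_logs(LOGS, "brand-a")
second_page = search_logs(LOGS, "brand-a", page=2)
gets_only = search_logs(LOGS, "brand-a", method="GET")
print(first_page, second_page, gets_only, sep="\n")
```

Making `brand_id` a required argument, while the other filters are optional, is one simple way to ensure no query can run without a tenancy scope.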
Internally we also use our ad-hoc CLI clients, database access, and other dashboards to provide additional information about how customers are using our commercial products as well as our self-service sandbox.
In particular, our dedicated (and self-signup) sandbox is an important feature for customers, especially during their product build stage. As much as Bond is already compressing the lead time required to offer a financial product, it still takes time and effort. Being able to clearly see, track, and review API call logs from the sandbox is a feature some of our customers have found indispensable in their journey towards live products powered by Bond.
Bond believes in and builds reliable, transparent, and auditable financial services APIs. We recently refactored our API call logging system, a critical and cross-cutting platform component, to supply our customers and back office with logs of calls into Bond, as well as our webhook callbacks, that are detailed enough to nearly reproduce each request. In this post, we’ve described our redesigned system and touched on important general engineering principles that help us move fast with awe-inspiring features and technologies on a live financial services platform. Sign up yourself and contact us about what you could build on Bond!