If you tuned in to Monster Scale Summit this year, you may have seen our talk on migrating the American Express Payments Network not once, but twice, with zero customer-impacting downtime: no transactions were interrupted, and no planned maintenance windows were required during either migration. The session focused on how we moved live payments traffic reliably under strict operational constraints. If you missed it, the talk is available to watch on the Monster Scale Summit website.
This article expands on the conference talk and dives deeper into the engineering decisions, tradeoffs, and lessons learned across both migrations.
Context: The Payments Network
The payments network is a mission-critical distributed system that processes payments traffic, including live card authorization. It serves as the bridge between American Express merchants, acquirers, and issuers globally.
This platform must be continuously available, operate at low latency, and handle large transaction volumes.
Migration Constraints
In 2018, American Express began a multi-year modernization of our payments network, including migrating from a legacy platform to a new microservices-based architecture.
A migration of this scale had to operate within several non-negotiable constraints:
The migration had to be performed online, with no planned or unplanned downtime.