Building Resilient Payment Systems
Building resilient payment systems is not a walk in the park. There are many factors to consider, and the expectations placed on the system are deliberately high. Shopify Engineering recently published an article explaining the 10 most useful tips and tricks to consider when building resilient payment systems.
Lower the Timeouts.
The team suggests that every place where a timeout can plausibly apply should have one installed. For instance, Ruby's built-in Net::HTTP client has a default timeout of 60 seconds each to open a connection, write data, and read a response. This is too long for online applications where a user is waiting. The author suggests an open timeout of one second, with a write and read (or query) timeout of five seconds, as a starting point.
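As a rough illustration of those starting values, here is a minimal sketch using Net::HTTP's own timeout setters. The helper name build_http_client and the host payments.example.com are made up for this example, and the right numbers will depend on your own traffic.

```ruby
require "net/http"

# Hypothetical helper: builds a client with short timeouts.
# Nothing is sent over the network until a request is actually made.
def build_http_client(host, port = 443)
  http = Net::HTTP.new(host, port)
  http.use_ssl = true
  http.open_timeout = 1   # seconds to wait for the connection to open
  http.write_timeout = 5  # seconds to wait while writing the request (Ruby 2.6+)
  http.read_timeout = 5   # seconds to wait for each read of the response
  http
end

# payments.example.com is a placeholder host, not a real endpoint.
client = build_http_client("payments.example.com")
```

When a limit is exceeded, Net::HTTP raises Net::OpenTimeout, Net::WriteTimeout, or Net::ReadTimeout, so the caller can rescue those and fail fast instead of keeping the user waiting.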
Install circuit breakers.
Circuit breakers, such as Shopify's Semian, protect services by raising an exception immediately once a service is detected as being down. This saves resources by not waiting for yet another timeout. Semian protects Net::HTTP, MySQL, Redis, and gRPC services with a circuit breaker in Ruby.
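To make the idea concrete, here is a hand-rolled sketch of the circuit breaker pattern. It is not Semian's implementation or API; the class SimpleCircuitBreaker, its parameters, and the thresholds are all invented for illustration.

```ruby
# Raised instead of calling the protected service while the circuit is open.
class CircuitOpenError < StandardError; end

# Hypothetical minimal circuit breaker (illustrative only, not Semian).
class SimpleCircuitBreaker
  def initialize(error_threshold: 3, reset_after: 10,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @error_threshold = error_threshold # consecutive failures before opening
    @reset_after = reset_after         # seconds to stay open before a trial call
    @clock = clock                     # injectable so the sketch can be tested
    @failures = 0
    @opened_at = nil
  end

  # True while calls should be rejected without touching the service.
  def open?
    !@opened_at.nil? && (@clock.call - @opened_at) < @reset_after
  end

  def call
    raise CircuitOpenError, "circuit open, failing fast" if open?

    begin
      result = yield
    rescue StandardError
      @failures += 1
      # Open (or re-open after a failed trial call) once the threshold is hit.
      @opened_at = @clock.call if @failures >= @error_threshold
      raise
    end

    # A successful call closes the circuit and clears the failure count.
    @failures = 0
    @opened_at = nil
    result
  end
end
```

Wrap each call to the flaky dependency in breaker.call { ... }. Once the threshold is reached, callers get CircuitOpenError right away rather than paying for another timeout, and after reset_after seconds a single trial call decides whether the circuit closes again.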
Understand capacity.
They discuss Little's Law, which states that the average number of customers in a system equals their average arrival rate multiplied by the average time each customer spends in the system. Understanding this relationship between queue size, throughput, and latency can help design systems that can handle load efficiently.
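A quick worked example of that relationship, written as L = arrival rate x time in system. The function names and the numbers (100 requests per second, 0.5 seconds each, 50 workers) are made up purely to show the arithmetic.

```ruby
# Little's Law: average number in the system = arrival rate * average time in system.
def average_in_system(arrival_rate_per_sec, avg_seconds_in_system)
  arrival_rate_per_sec * avg_seconds_in_system
end

# Rearranged: the throughput a fixed pool of workers can sustain.
def max_throughput_per_sec(worker_count, avg_seconds_in_system)
  worker_count / avg_seconds_in_system.to_f
end

# Hypothetical load: 100 requests/s, each taking 0.5 s on average.
in_flight = average_in_system(100, 0.5)
# Hypothetical capacity: 50 workers, each busy 0.5 s per request.
throughput = max_throughput_per_sec(50, 0.5)

puts "in flight: #{in_flight}, sustainable throughput: #{throughput}/s"
```

Read the other way around, if latency doubles while the worker pool stays the same size, the throughput the system can sustain is cut in half.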
Add monitoring and alerting.
Based on Google's Site Reliability Engineering (SRE) book, there are four golden signals that any user-facing system should be monitored for: latency, traffic, errors, and saturation. Monitoring these metrics can help identify when the system is at risk of going down due to overload.
Structured Logging.
The emphasis should be on structured logging in a machine-readable format, for instance key=value pairs or JSON, which allows log aggregation systems to parse and index the data. In distributed systems, passing along a correlation identifier is useful for understanding what happened inside a single web request or background job.
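One possible way to get JSON log lines with a correlation identifier, using only Ruby's standard library. The helper build_json_logger and the field names (event, correlation_id, amount_cents) are assumptions for this sketch, not a prescribed schema.

```ruby
require "json"
require "logger"
require "securerandom"
require "time"

# Hypothetical helper: a Logger that emits one JSON object per line.
def build_json_logger(io)
  logger = Logger.new(io)
  logger.formatter = proc do |severity, time, _progname, msg|
    # Accept either a Hash of fields or a plain message string.
    fields = msg.is_a?(Hash) ? msg : { message: msg.to_s }
    base = { severity: severity, time: time.getutc.iso8601(3) }
    JSON.generate(base.merge(fields)) + "\n"
  end
  logger
end

logger = build_json_logger($stdout)

# Generate the identifier once per request or job, then include it in every
# log line and pass it along to downstream services.
correlation_id = SecureRandom.uuid
logger.info({ event: "payment.authorized", correlation_id: correlation_id, amount_cents: 1999 })
```

Because every line carries the same correlation_id, a log aggregation system can filter on that one field to reconstruct what happened across services for a single request.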
Use idempotency keys.
To ensure a payment or refund happens exactly once, they recommend using idempotency keys, which track attempts so that only a single request is sent to the financial partners. This can be achieved by sending a unique idempotency key for each attempt.
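Here is a small sketch of how an idempotency key can collapse retries into a single request. The class IdempotentCharger and the fake gateway are hypothetical, and the in-memory Hash stands in for what would need to be durable, concurrency-safe storage in a real payment service.

```ruby
require "securerandom"

# Hypothetical client that forwards each idempotency key to the gateway at most once.
class IdempotentCharger
  def initialize(gateway)
    @gateway = gateway # any object responding to #call(amount_cents)
    @responses = {}    # idempotency key => stored response (in-memory for the sketch)
  end

  def charge(idempotency_key, amount_cents)
    # A retry with a known key returns the stored response without a new request.
    return @responses[idempotency_key] if @responses.key?(idempotency_key)

    # If the gateway raises, nothing is stored, so a later retry can go through.
    @responses[idempotency_key] = @gateway.call(amount_cents)
  end
end

# Fake gateway that counts how many requests actually reach the partner.
gateway_calls = 0
fake_gateway = lambda do |amount_cents|
  gateway_calls += 1
  { status: "captured", amount_cents: amount_cents }
end

charger = IdempotentCharger.new(fake_gateway)
key = SecureRandom.uuid              # one key per payment attempt
first = charger.charge(key, 1999)
retried = charger.charge(key, 1999)  # e.g. the client retried after a timeout
```

Both calls return the same response, and the fake gateway is only hit once. A new attempt with a fresh key would be sent as a separate request.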
Be consistent with reconciliation.
Reconciliation ensures that records are consistent with those of financial partners. Any discrepancies are recorded and automatically remediated where possible. This process helps maintain accurate records for tax purposes and generate accurate reports for merchants.
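A toy version of that comparison might look like the following. The reconcile function, the record shape (transaction ID mapped to an amount in cents), and the sample data are all invented for illustration. Real reconciliation works on partner reports and has to deal with currencies, fees, and timing differences.

```ruby
# Hypothetical reconciliation: compare our ledger with a partner's report.
# Both inputs map a transaction ID to an amount in cents.
def reconcile(our_records, partner_records)
  ids = our_records.keys | partner_records.keys
  ids.each_with_object([]) do |id, discrepancies|
    ours = our_records[id]
    theirs = partner_records[id]
    next if ours == theirs # records agree, nothing to report

    kind =
      if ours.nil?
        :missing_locally
      elsif theirs.nil?
        :missing_at_partner
      else
        :amount_mismatch
      end
    discrepancies << { id: id, kind: kind, ours: ours, theirs: theirs }
  end
end

# Made-up sample data.
ours = { "txn_1" => 1999, "txn_2" => 500, "txn_3" => 750 }
partner = { "txn_1" => 1999, "txn_2" => 450, "txn_4" => 1200 }

reconcile(ours, partner).each { |d| puts d.inspect }
```

Each discrepancy can then be recorded and, where the cause is clear, remediated automatically.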
Incorporate Load Testing.
Regular load testing helps test the system's limits and protection mechanisms by simulating large-volume flash sales. Shopify uses scriptable load balancers to throttle the number of checkouts at any time.
Incident management.
Shopify uses a Slack bot to manage incidents, with roles for coordinating the incident, public communication, and restoring stability. This process starts when the on-call service owners get paged, either by an automatic alert based on monitoring or by hand, if someone notices a problem.
Incident Retrospectives.
Retrospective meetings are held within a week after an incident to understand what happened, correct incorrect assumptions, and prevent the same thing from happening in the future. This is a crucial step in learning from failures and improving system resilience.
Credits to the Shopify Engineering blog here https://t.co/Y4TcX2V2ir and @milan_milanovic on X (formerly Twitter).
Thank you and see you in the next