Plaid Explains Why Building Internal Key Management System Was Fundamental to Security Strategy

Staying secure means staying ahead of what’s coming, according to fintech company Plaid.

Plaid reportedly runs hundreds of services across tens of thousands of pods on dozens of Kubernetes clusters hosted in AWS.

Shuaiwei Cui and Anirudh (Ani) Veeraragavan noted that their teams ship fast and frequently, so their infrastructure landscape will soon “look different from how it does today.”

In the face of all this change, Plaid’s Security team is tasked with enabling the business by “managing risk in an efficient and proactive manner.”

As mentioned in a blog post, building an internal Key Management System (Plaid KMS) was a fundamental piece of their security strategy for “managing sensitive data at scale, and now that they’ve operated it for 3+ years, they’re confident that they have not only secured the present but also prepared to secure the future.”

The Fintech firm has described their journey of creating and leveraging a secure Key Management System to “protect sensitive data at Plaid.”

According to the update from the financial infrastructure provider, cryptography is the foundation of data security.

Before Plaid KMS, engineers found it hard to “leverage cryptographic operations and approaches were bespoke across 22+ services maintained by 13+ different teams.”

The Security team was struggling to manage existing use cases, let alone future use cases “as Plaid was beginning to undergo a period of massive growth.”

The urgency of building a Paved Road for cryptography was “apparent.”

Although there were potential vendor solutions, such as AWS KMS, they chose to build their own KMS to “address challenges unique to Plaid,” which the company described as follows:

  • Scalability: We needed a solution that could seamlessly scale with our rapidly growing data volumes without the constraints of vendor-imposed limits. For example, AWS KMS had and still has account-level limits on API calls and number of keys, which were inadequate for our needs.
  • Cost efficiency: Using a vendor KMS at our scale would have incurred substantial recurring expenses. Cost efficiency is a business priority, and ideally we would not just maintain costs but instead reduce them.
  • Self-serve: We want to empower our engineers to use secure solutions in an independent manner. By building an internal KMS, we would be able to deeply integrate with existing tools at Plaid. For example, access control of Plaid KMS-managed cryptographic keys is provided by our internal authentication and authorization platform, which our engineers are already familiar with.

While they want to avoid “not invented here syndrome,” they also acknowledge that deep ownership of critical services can “accelerate business outcomes.”

The experience of building and operating Plaid KMS has empowered the Security Team to “accelerate investment in security solutions running in the production critical path.”

These strategic benefits, alongside the technical benefits, “have made the decision well worth it.”

As a critical service for cryptographic operations, Plaid KMS must maintain high availability and “efficient performance, even in the face of fluctuating workloads. Achieving this level of operational rigor required close collaboration with other engineering teams in order to implement numerous enhancements.”

Dedicated, elastic resources: Plaid KMS runs on its own dedicated network and compute resources.

Additionally, all resources on the request / response path “autoscale in the event of increased usage. Plaid KMS is a CPU-bound service, and so we found CPU utilization to be a high signal metric to inform our autoscaling configuration.”
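The post does not say which autoscaler Plaid uses. As a rough sketch only, and assuming something like the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler (desired replicas = ceil(current replicas × current metric / target metric)), CPU-driven autoscaling could look like the following. The 60% target, the replica bounds, and the function name are hypothetical values chosen for illustration.

```python
import math


def desired_replicas(current_replicas: int,
                     current_cpu_pct: float,
                     target_cpu_pct: float = 60.0,  # hypothetical target
                     tolerance: float = 0.1,
                     min_replicas: int = 1,         # hypothetical bounds
                     max_replicas: int = 100) -> int:
    """Return the replica count needed to bring average CPU back to target."""
    ratio = current_cpu_pct / target_cpu_pct
    # Skip scaling when usage is already close to the target, to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

For a CPU-bound service, CPU utilization tracks load closely, so a proportional rule like this adds capacity roughly in step with demand.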

  • Workload segregation: On top of having dedicated resources, we also segregated them by online and offline workloads. Online workloads, such as API requests from customers, require high availability / low latency and are prioritized. Offline workloads, such as our data analytics platform, are more fault tolerant but also involve large amounts of data and requests. This separation prevents higher volume but less frequent offline workloads from disrupting online workloads.
  • Optimized API usage: Finally, we reduced the amount of traffic that goes to Plaid KMS in the first place. By taking advantage of envelope encryption, we encrypt / decrypt variable-sized payloads locally on the client service and only encrypt / decrypt the fixed-sized data key using Plaid KMS. This technique not only reduces the necessary network bandwidth but also ensures performance is more predictable. Additionally, client services can also batch decrypt multiple payloads in a single request, which enables more efficient bulk processing.
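The envelope encryption and batch decryption described in the last bullet can be sketched as follows. This is an illustration under stated assumptions, not Plaid's implementation: `ToyKMS`, `toy_cipher`, `encrypt`, and `decrypt_many` are hypothetical names, and `toy_cipher` is a deliberately simple stand-in for a real authenticated cipher such as AES-GCM (it offers no integrity protection and must not be used for real data).

```python
import hashlib
import secrets


def toy_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256-derived keystream. Stand-in for AES-GCM; NOT secure."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + nonce + counter).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], block))
    return bytes(out)


class ToyKMS:
    """Hypothetical key service: it only ever sees fixed-size data keys."""

    def __init__(self) -> None:
        self._master_key = secrets.token_bytes(32)
        self.requests = 0  # counts simulated network round trips

    def wrap(self, data_key: bytes) -> bytes:
        self.requests += 1
        nonce = secrets.token_bytes(16)
        return nonce + toy_cipher(self._master_key, nonce, data_key)

    def unwrap_batch(self, wrapped_keys: list[bytes]) -> list[bytes]:
        self.requests += 1  # one request covers the whole batch
        return [toy_cipher(self._master_key, w[:16], w[16:]) for w in wrapped_keys]


def encrypt(kms: ToyKMS, plaintext: bytes) -> dict:
    data_key = secrets.token_bytes(32)  # fresh data key per payload
    nonce = secrets.token_bytes(16)
    return {
        "wrapped_key": kms.wrap(data_key),  # only the 32-byte key goes to the KMS
        "nonce": nonce,
        "ciphertext": toy_cipher(data_key, nonce, plaintext),  # encrypted locally
    }


def decrypt_many(kms: ToyKMS, envelopes: list[dict]) -> list[bytes]:
    # Unwrap every data key in a single KMS request, then decrypt locally.
    data_keys = kms.unwrap_batch([e["wrapped_key"] for e in envelopes])
    return [toy_cipher(k, e["nonce"], e["ciphertext"])
            for k, e in zip(data_keys, envelopes)]
```

In this sketch the KMS receives the same number of bytes per request whether the payload is ten bytes or a hundred kilobytes, which is why the approach keeps bandwidth low and latency predictable.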

Beyond the technical improvements, they also “ensure operational excellence is a part of our daily practices.”

Plaid’s engineers regularly review service KPIs, perform pre-mortems “for high-risk changes, and invest heavily in observability. Doing the hard work upfront ensures quiet oncall rotations for our team down the line.”

The most time consuming part of the project “wasn’t designing nor building Plaid KMS—it was the migration.”

Although they anticipated that migration would be complex, the number of bespoke approaches to cryptography sprinkled “among client services meant that we were continually running into unforeseen challenges.”

They approached the migration using “a standard playbook of Derisk, Enable, and Finish.”

They started with the most challenging client services to “ensure that Plaid KMS was operationally ready” (Derisk). They then scaled migrations for 80% of client services by “providing self-serve tooling and documentation” (Enable), and they personally migrated the remaining 20% of longtail client services that “didn’t conform to expected practices, were unstaffed, or had any other issues” (Finish).

Their investments in operational excellence ensured they “were well prepared for the Derisk and Enable phases, but getting through the Finish phase required creativity and grit.”

These longtail cases demanded close collaboration “between service owners and the Security Team to ensure a smooth and successful migration.”

Plaid explained that while it can be tempting to ignore these cases, the “benefits of a migration are only realized if you fully finish it, and so we pushed onwards.”

Today, all Plaid services have fully onboarded onto Plaid KMS.

Any enhancements can now be implemented “directly in Plaid KMS, simplifying future upgrades or migrations.”

Deleting all the legacy code was a milestone that “took a long time to reach, but it was well worth the wait.”


