The “Security” chapter of Sam Newman’s book “Building Microservices” triggered my thoughts on one of the worst cybersecurity breaches of the 21st century [1]. In 2019, SolarWinds, a software company that provides network and infrastructure monitoring solutions, fell victim to a hack. Attackers leveraged leaked credentials to gain access to the company’s build server, where they planted a malicious version of a software update that was distributed in a supply chain attack to around 30,000 public and private organizations [2]. SolarWinds counts more than 80% of the Fortune 500 among its clients, including Cisco, Microsoft, Visa, and Ford [3][4]. The malicious version served as backdoor malware, compromising the data, networks, and systems of thousands of SolarWinds clients.
This incident serves as a prime example of implicit trust as a security model in IT systems leading to catastrophic outcomes. Based on my own experience and with the help of the book “Building Microservices,” I aim to provide a basic understanding of the difference between implicit trust and zero trust in security models and how these principles can be applied to a Kubernetes setup hosting multiple microservices. As a final point, I will explore how the implementation of a zero trust environment could have mitigated the impact and consequences of a cyberattack like the SolarWinds breach.
To get started, let’s establish a fundamental understanding of the two security models at hand: zero trust and implicit trust. In a system relying on implicit trust, everything within a predetermined perimeter is deemed trustworthy, while anything outside this boundary is untrustworthy. Essentially, as Sam Newman described it, it’s comparable to medieval-era defenses, where a city’s high walls and gate protect those inside from outside threats. However, if an intruder breaches the walls or gate, they gain free access to roam the city as they please.
In contrast, a zero trust security model, also referred to as perimeterless security, strives to erase the demarcation between a trustworthy internal and untrusted external area. A commonly used policy in a zero trust setup is “never trust, always verify.” Essentially, in a zero-trust environment, you assume the system is already compromised, so no user, device, or service is trusted by default [5].
Given this understanding of the two security models, let’s take a closer look at how zero trust can be implemented in a cloud-native microservice platform like Kubernetes. We’ll review some potential issues, explore how to handle them, and touch on remaining difficulties:
Securing data can be divided into two categories: Data in Transit and Data at Rest.
Data in Transit
Data in Transit involves data that is transferred between services. In a traditional, implicit trust environment, security is focused on the perimeter (ingress), and services behind the perimeter are often allowed to communicate unauthenticated and unencrypted. With a zero trust security model, however, communication between services must be secured as well as the environment’s perimeter. In Kubernetes, Istio can achieve this by creating a programmable application network that enables service-to-service communication using mutual Transport Layer Security (mTLS) without requiring any special configuration on the application side. Istio deploys a sidecar proxy alongside each pod; the control plane configures these proxies to enforce mTLS for all traffic between services. However, running a sidecar proxy in every pod has drawbacks: it consumes a significant amount of resources and can nearly double the number of containers running in your Kubernetes cluster.
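As an illustration, a single mesh-wide Istio policy is enough to require mTLS between sidecars. This sketch assumes Istio’s default root namespace `istio-system`; the resource name is arbitrary:

```yaml
# Istio PeerAuthentication policy requiring mTLS for all workloads.
# Placing it in the mesh root namespace ("istio-system" by default)
# makes it apply mesh-wide.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT   # reject any plain-text traffic between sidecars
```

With `STRICT` mode, a workload whose sidecar cannot present a valid certificate is simply refused, which is exactly the “never trust, always verify” posture applied to east-west traffic.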
Data at Rest
Data at Rest pertains to data stored on disks, volumes, or drives. This comprises database credentials, TLS certificates, or configuration files stored on persistent volumes in Kubernetes. While Kubernetes offers built-in objects for handling sensitive information, such as Secrets and ConfigMaps, it’s important to be mindful of potential security risks when using Infrastructure as Code (IaC). In many cases, IaC is used to manage Kubernetes resources, which means sensitive data may be stored in a repository in base64-encoded form. Base64 is an encoding, not encryption, so anyone with access to the repository can trivially recover the original data. This risk can be mitigated by encrypting the data. Bitnami’s Sealed Secrets is one solution that addresses this issue. Sealed Secrets consists of a cluster-side controller and a client-side utility named kubeseal. The kubeseal tool uses asymmetric cryptography to encrypt secrets that only the controller can decrypt; even the original author cannot recover the plaintext from the generated SealedSecret. This allows you to encrypt data locally and check it into your IaC repository; when the secret is deployed to Kubernetes, the controller handles decryption. As a result, an attacker with repository access cannot steal sensitive data. Keep in mind that if the controller’s private key is exposed, attackers can decrypt the checked-in data, so securing that private key must be one of your highest priorities.
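To see why base64-encoded values in a repository offer no protection, here is a quick sketch; the credential value is made up:

```python
import base64

# A Kubernetes Secret manifest stores its values base64-encoded under
# the "data" key. base64 is an encoding, not encryption: anyone who
# can read the manifest can reverse it instantly.
plaintext = "s3cret-db-password"  # made-up credential

encoded = base64.b64encode(plaintext.encode()).decode()
decoded = base64.b64decode(encoded).decode()

print(encoded)                  # what an attacker sees in the repo
print(decoded == plaintext)     # round-trips to the original value
```

This is the gap Sealed Secrets closes: the value checked into Git is ciphertext that only the in-cluster controller can decrypt.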
Credential and secret rotation are crucial in mitigating the risk of credential stuffing attacks, which have become rampant in recent years. Attackers take advantage of password recycling, the bad practice of reusing the same login credentials across multiple sites. If credentials are leaked or brute-forced from one site, hackers launch a credential stuffing attack, trying the same credentials on other sites to gain access to users’ accounts. Rotating credentials at regular intervals, along with training on proper password handling, shrinks the window of opportunity such leaked credentials give an attacker. However, rotating credentials too frequently or for all accounts can lead to dissatisfaction; instead, it’s best to apply rotation to accounts that lead to valuable resources. Applying credential/secret rotation to authentication between services or to database credentials is also a good idea, reducing the amount of time an attacker has to do damage. Implementing such behavior in a microservice architecture requires careful thought, though: applications must be designed to handle rotation, and backup strategies need to be in place in case rotation fails.
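One way an application can be designed to tolerate rotation is to reload the credential and retry once when authentication fails, instead of crashing. A minimal sketch; `AuthError`, `connect`, and `load_password` are hypothetical stand-ins for your database driver and secret source:

```python
class AuthError(Exception):
    """Raised by the (hypothetical) database driver on bad credentials."""

def query_db(sql, connect, load_password):
    """Run a query; on an auth failure, reload the possibly rotated
    credential once and retry.

    connect(password) must return an object with .execute(sql);
    load_password() returns the current secret, e.g. re-read from the
    file a Kubernetes Secret is mounted at.
    """
    try:
        return connect(load_password()).execute(sql)
    except AuthError:
        # The credential was rotated while we held the old one:
        # fetch the fresh value and retry a single time.
        return connect(load_password()).execute(sql)
```

In a pod, `load_password` would typically re-read a mounted Secret file (for example `/var/run/secrets/db-password`, a made-up path), since Kubernetes updates mounted Secret files in place when the Secret object changes.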
The principle of least privilege advocates granting access at the lowest level needed to perform the required functions, for the shortest possible time. In Kubernetes, this means that not every member of the team requires unrestricted access to all resources, such as repositories, namespaces, or GitOps continuous delivery tools like Argo CD. Implementing the least privilege principle at the organization level can follow the guide outlined in Tony Loehr’s blog post “Using the Principle of Least Privilege for Maximum Security” [6]. Access to repositories and GitOps tools can be secured with a protocol like LDAP, centralizing access management, which eliminates inconsistencies and simplifies attack analysis in the event of an intrusion.
Kubernetes incorporates two types of accounts: user accounts for humans and service accounts for processes that access the Kubernetes API. Kubernetes RBAC governs these accounts and restricts access to resources such as controllers, secrets, CRDs, and pods [7]. Users and service accounts should be granted minimal RBAC rights, receiving only the permissions explicitly required for their operation.
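A least-privilege RBAC setup can look like the following sketch, which allows a service account to read pods in a single namespace and nothing else; the namespace and account names are examples:

```yaml
# A Role that only permits reading pods in the "shop" namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: shop          # example namespace
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]   # no create/update/delete
---
# Bind the Role to a single service account in that namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: shop
  name: read-pods
subjects:
- kind: ServiceAccount
  name: checkout           # example service account
  namespace: shop
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

Because a Role (unlike a ClusterRole) is namespaced, the account cannot touch resources outside `shop` even if it is compromised.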
Extending the principle of least privilege to the pod level is recommended to reinforce security. One way to enforce this principle is Pod Security Admission, which can limit a pod’s access to resources. Additionally, it is crucial to limit the database access of containers running in a pod to reduce the risk of data theft. Suppose an attacker infiltrates a pod that has read-only access to a few tables: some information may still be stolen, which is compromising enough, but the restrictions keep the attacker from reaching all data and rule out data manipulation. Such restrictions can also significantly expedite the analysis and resolution of a potential attack.
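Pod Security Admission is enabled with labels on the namespace. As a sketch (namespace name is an example), enforcing the built-in “restricted” Pod Security Standard blocks privileged containers, host networking, and similar escalation paths:

```yaml
# Enforce the "restricted" Pod Security Standard for every pod
# created in this namespace; violations are rejected at admission.
apiVersion: v1
kind: Namespace
metadata:
  name: shop               # example namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
```

The `warn` label is useful during rollout: it surfaces violations without blocking them, so workloads can be fixed before switching to pure enforcement.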
Nonetheless, implementing such sophisticated approaches requires significant time and expertise, as well as regular training on the use and configuration of these tools for the team members who operate them.
While a microservice architecture offers several benefits, it also has its downsides. One of the major challenges is keeping track of the many dependencies and libraries present in the environment. Without careful management, it can be challenging to identify which dependency versions are being used, resulting in CVEs going unnoticed for days or even weeks. Typically, CVEs have three primary sources: base images, code, and coding language [8].
When deploying applications in Kubernetes, they usually run in a container which, in turn, is launched in a pod. The container typically includes a base image with built-in libraries, which might contain CVEs. Aside from base images, third-party libraries play a critical role in your code’s functionality and can introduce vulnerabilities to your application’s security; even if the vulnerability is not directly related to your code, it might still pose a risk if the vulnerable function is being used. This type of vulnerability is commonly known as a transitive vulnerability. Furthermore, the coding language and its runtime might expose your application to security flaws. Thus, the security of your application does not rely solely on the functions used in your code, as a vulnerable version of the programming language and its runtime could also leave it vulnerable to attacks.
To decrease the number of potential vulnerabilities in your system, there are a few methods you can use. One important step is to narrow down your base image to include only the necessary libraries, reducing the attack surface. To achieve this, you can either select a well-maintained minimal base image, such as the container images from Distroless, or build a custom base image with only the libraries you need. If you choose to build a custom base image, however, it is your responsibility to regularly maintain it and fix vulnerabilities found in its libraries.
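A common pattern for using a minimal base image is a multi-stage build: compile in a full-featured image, then copy only the binary onto a Distroless base. A sketch for a statically linked Go service; the paths and Go version are illustrative:

```dockerfile
# Stage 1: build a static binary in a full Go toolchain image.
FROM golang:1.21 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app .

# Stage 2: ship on Distroless. The "static" variant contains no shell,
# no package manager, and no libc, so the attack surface is essentially
# your binary plus CA certificates and timezone data.
FROM gcr.io/distroless/static-debian12
COPY --from=build /app /app
USER nonroot:nonroot       # pre-defined unprivileged user in Distroless
ENTRYPOINT ["/app"]
```

An attacker who lands in such a container finds no `sh`, `curl`, or `apt` to pivot with, which complements the zero trust measures above.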
Another effective method is to integrate CVE monitoring into your CI/CD pipeline, allowing you to check for vulnerabilities before a build is even deployed. You can use tools such as OWASP Dependency-Check, Snyk, or Mend to analyze dependencies and detect possible CVEs. Depending on the configuration, your build may fail if a CVE score exceeds a certain threshold. Including these checkers in your development process is the cheapest and most direct way to resolve vulnerabilities. It is crucial, however, that the checkers also run in scheduled pipelines with a notification system, particularly for repositories that are not well maintained or easily overlooked, so that vulnerabilities do not go undetected.
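As a sketch, a CI job can run on every push and on a daily schedule so that quiet repositories stay covered. This example uses GitHub Actions syntax and OWASP Dependency-Check’s CLI; it assumes the CLI is available on the runner, and the project name and CVSS threshold are illustrative:

```yaml
# Fail the build when a dependency has a CVE with CVSS >= 7.
name: dependency-scan
on:
  push:
  schedule:
    - cron: "0 6 * * *"     # daily run catches CVEs in untouched repos
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: OWASP Dependency-Check
        run: |
          dependency-check.sh --project my-app --scan . \
            --failOnCVSS 7
```

Pairing the scheduled run with your team’s notification channel ensures a newly published CVE surfaces even when nobody is pushing commits.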
Alongside technical measures, it is also beneficial to implement organizational measures to proactively reduce the time it takes to fix a CVE. Regularly updating your runtimes and libraries, even when they don’t have a CVE, can significantly lower the response time when a severe vulnerability does occur. If you have to jump across multiple versions of a library while a severe CVE is open, your reaction time suffers considerably. Therefore, it is best to keep your libraries up to date, reducing the time it takes to upgrade to the latest version where the CVE is hopefully fixed.
In Kubernetes, the IaC approach is the go-to method for configuring and controlling service deployment. The benefit of the IaC approach is that deployment of the entire system can be applied repeatedly, producing the same results every time. This approach is especially advantageous in case of a security breach, where an attacker has tampered with a service. In such an incident, it is important to not only find and address the security flaw, but also ensure that all affected systems can be rebuilt in a short amount of time. By using IaC, tampered services can be rebuilt quickly and efficiently, reducing the amount of time customers are exposed to risks such as data leaks.
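GitOps tools take this a step further by continuously reconciling the cluster against the repository. A sketch of an Argo CD Application with self-healing enabled; all names and the repository URL are made up:

```yaml
# Argo CD continuously compares the live cluster state with the Git
# repository and, with selfHeal enabled, reverts any drift, including
# changes made by an attacker who tampered with a deployment.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: shop
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/shop-deploy.git   # made-up URL
    targetRevision: main
    path: manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: shop
  syncPolicy:
    automated:
      selfHeal: true   # revert manual or malicious drift automatically
      prune: true      # remove resources deleted from the repository
```

With this setup, rebuilding a tampered service is not even a manual step: the declared state in Git wins, and the attacker would have to compromise the repository itself to make a change stick.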
Detecting attacks in a zero-trust environment can prove challenging, but there are several techniques available that focus on monitoring unusual or atypical behavior to identify unwanted access on both technical and organizational levels. One such approach is deploying a Security Information and Event Management (SIEM) solution. SIEM collects and analyzes security events and alerts from various sources, such as network devices, servers, and databases, to detect indications of potential attacks or breaches. In the event of a security incident, SIEM can facilitate more efficient investigation and response. Additionally, SIEM’s centralized design provides a comprehensive view of security events throughout the organization, making it easier to manage and monitor security.
While the points mentioned above provide insight into key aspects of zero trust, it’s important to note that there are numerous other strategies to consider. Due to the limited scope of this blog post, I am unable to delve into all the technological and organizational methods that can be employed to strengthen your zero trust environment. However, it’s crucial to explore other options available to you to ensure the establishment of a robust zero trust system.
The CrowdStrike Intelligence Team’s blog post goes into great detail about how the attackers modified the source code to inject their malware into SolarWinds’ build server, which required a high degree of sophistication and expertise [9]. The attackers spent several months testing their plan, adding malware to the source code and modifying deployments multiple times [10]. After more than a year of access to the build server, they ultimately succeeded, employing hash verification, AES encryption, and file manipulation to insert their malware.
Nevertheless, one question remains unanswered: how did the attackers gain access to the system? The discovery of leaked credentials provides some insight. In 2019, a security researcher found the password “solarwinds123” in a public GitHub repository created by a SolarWinds intern in 2017; the password allowed the researcher to add files to the update server [11]. It is still unclear whether and how the leaked password contributed to the attack; possibilities include brute-force guessing of company passwords or entry through compromised third-party software, among others.
Looking at the security aspects of the SolarWinds attack, it appears that the company did not adhere to the principles of a zero trust security model. The attackers had approximately a year to develop and carry out the attack, which raises questions about SolarWinds’ lack of security measures. For example, the absence of two-factor authentication or credential rotation is concerning when dealing with a build server that serves thousands of customers.
Additionally, the incident involving the GitHub password raises further concerns about the company’s practices. Granting an intern full access to the update server is a significant violation of the principle of least privilege. The password was checked into GitHub in 2017 and remained valid for roughly 2.5 years, until a security researcher contacted SolarWinds. Such prolonged validity indicates a significant gap in SolarWinds’ internal password management and credential rotation practices.
Adopting the principles of zero trust security, like securing data at rest, rotating credentials, limiting scope and privileges, and instituting robust security monitoring, could have played a critical role in preventing or minimizing the impact of the SolarWinds attack.
While a zero trust security model can seem appealing, implementing it throughout an entire environment, or company, is not always practical. It requires significant resources, expertise, and organizational changes, making it challenging to establish and maintain. Instead, you can determine which parts of your environment are the most worth protecting with zero trust. For instance, a backend service that has access to a database cluster is a critical component to secure. Meanwhile, a frontend service that just exposes static HTML files to the public might not require the same level of security measures. By defining different security zones, based on sensitivity of data and impact of services, you could deploy and secure services based on their unique needs. This approach can help businesses maintain their security while considering their available resources and overall business goals.
I hope you found my thoughts on zero-trust security useful. Implementing such a system in a company or environment can be a challenging task, so I’d love to hear about your experiences and challenges with zero-trust security. Feel free to leave a comment or email me at stefan.pezzei@andamp.io. I believe that by sharing our experiences, we can all learn and improve our security strategies.
References
[1] Sam Newman — Building Microservices, 2nd Edition
[2] Saheed Oladimeji, Sean Michael Kerner — SolarWinds hack explained: Everything you need to know
[3] Kathryn Haring — How Tech Companies Can Prevent a SolarWinds Level Breach
[4] Mia Jankowicz, Charles R. Davis — These big firms and US agencies all use software from the company breached in a massive hack being blamed on Russia
[5] Etay Maor — Zero Trust: The What, Why And How
[6] Tony Loehr — Using the Principle of Least Privilege for Maximum Security
[7] Jekayin-Oluwa Olabemiwo — Applying the principle of least privilege to Kubernetes using RBAC
[8] Anthony Tam, Behnam Shobiri — 5 Best Practices for Reducing CVEs in Container Applications
[9] CrowdStrike Intelligence Team — SUNSPOT: An Implant in the Build Process
[10] Sudhakar Ramakrishna — New Findings From Our Investigation of SUNBURST
[11] Brian Fung, Geneva Sands — Former SolarWinds CEO blames intern for ‘solarwinds123’ password leak