Open Source Malicious Packages: The Problem

This is the first episode in a series of articles about the most prevalent kind of software supply chain attack: those that (ab)use a public registry of software components, a place intended for open-source projects to upload artifacts to share with other users. When the bad guys publish malicious software there, using the registry as a vehicle for malware distribution, a supply chain attack occurs as soon as a victim organization installs or runs the infected software component.

To simplify the discussion, we will talk about software packages: components in a packaged form produced by third parties. This includes not only components used by package managers like NPM or Poetry, but also operating system components including libraries and executable binaries, container images, and virtual machines, or extensions for development, build, and deployment tools. We have seen malicious packages everywhere. Cybercriminals do not mind: they are delighted by the alternatives provided by modern software infrastructures and use the registry and the tool that best fits their intent. So please remember that software packages are a shorthand for container images, binary packages, open-source repositories, and extensions or plugins of all sorts (IDEs, CI/CD systems, build tools). All are routinely under attack.

The series will have 5 episodes:

  • What Is the Problem With Open Source Packages? This is the theme of this post. Why are criminals of all kinds publishing malicious packages? Why should I be concerned?
  • Anatomy of Malicious Packages: What Are the Trends? In this episode, we focus on the threat we are monitoring with our MEW system, day after day. Behind the large background noise caused by the many malicious packages that use typosquatting or dependency confusion, a smaller percentage of attacks is much more insidious and poses a greater risk. How has the bad actors’ behavior regarding open source changed in the recent past? What are the numbers? What are the tactics, techniques, and procedures used, and the harmful actions seen?
  • Protecting Against Open Source Malicious Packages: What Does (Not) Work. Most security-aware professionals have ideas about how to handle this threat. We have heard security managers saying without hesitation that SCA tools already tell you when a package version is malware. Or that they depend on well-known, highly reviewed software components, where any malware would be promptly detected and removed. Or that they use open minor/patch versions to get vulnerability fixes automatically, and that this is the proper, recommended way to lower the risk of open source dependencies, following the “patch early, patch often” principle. In this episode, we will review why these ideas are wrong, and how such misconceptions contribute to the popularity of this attack mechanism and to the overwhelming risk that organizations are experiencing. We will end with what does work, and what effort and resources are involved.
  • Open Source Malicious Packages: The Xygeni Approach. In this episode, we present the strategy we follow at Xygeni for our Malware Early Warning (MEW) system. How does this multi-stage system work in real time when a new package version is published? How is evidence captured from different sources, how is triage done, which classification criteria do we follow, and why is some manual analysis still needed to confirm the nature of a malicious package candidate? How does the feedback from our internal and registry teams help the system learn from past evidence to reduce false positives to a minimum? We will also explain how we are helping NPM, GitHub, PyPI, and other key infrastructures in the open-source ecosystems to reduce dwell time.
  • Exploiting Open Source: What To Expect From The Bad Guys. The series ends by focusing on the newest actions the adversaries are embracing to make the attacks stealthier, harder to detect, more targeted against specific industries, and more profitable. Will ransomware attacks be delivered using this vehicle? How are the bad guys leveraging AI tools to deliver more sophisticated malicious packages? Are top popular projects in danger? The goal is to give readers a feeling for this arms race, and what to expect in the short term (second half of 2024) and the medium term (2025). We will learn how attacks like the recent XZ-Utils Backdoor, or the living-off-the-land attack on electron-builder in March 2024, show that we should stay vigilant about how the adversaries evolve.

Let’s open the stage with the first episode: What’s going on with malicious open-source packages?

What Is the Problem With Open Source Packages?

In recent years, wrongdoers of all kinds have used open-source software registries for delivering malicious behavior. These activities are as old as open source itself, but their frequency exploded in the last three years.

Publishing malicious components into public registries (dependency-based attacks) is asymmetric guerilla warfare that threat actors use to distribute malware, leveraging the trust that organizations put in open-source components coming from unknown developers (remember the dependency xkcd comic?). Because you trust packages and do not bother to manually review the package contents and its dependencies, these attacks are extraordinarily effective. The asymmetry comes from the fact that they can be largely automated and the bad guys do not need to interact with the victim directly. They simply upload the package into the public registry and let it go.

Malicious packages surged by a factor of 6 in 2022, and continued to grow by a factor of 2.5 in 2023. Last year a whopping 245,000 malicious packages were seen, a figure that more than doubles the total number from previous years combined. This is exponential growth! Package removals confirmed as malware numbered in the hundreds during 2021 and in the thousands during 2022; during 2023 we saw much more background “noise”, and this year is keeping a similar pace. Hidden in that background, caused by unsophisticated cybercriminals following the “path of least resistance”, a minority of high-profile attacks reached headlines even in general media.

Why is this a problem of such magnitude? There is an excess of trust all over the chain. Open-source software is distributed with its source code, and released under a given license. Yes, anyone can inspect the source code, but who actually does so at scale? Who, after checking that the source has no malware, builds the software from those sources? Who, before passing the packaged component (also known as a package) downstream to the package manager or the build tool, makes sure that the package is not riddled with malware and corresponds to the source code it is supposed to come from?

Why does the infrastructure allow such easy attacks?

Package registries are open, often requiring minimal verification of the identity of the publisher. “Anyone is welcome to publish their software here!” The bar for attackers is set low: they use disposable email addresses and disposable GitHub accounts to create hundreds of malicious packages in short, phishing-like campaigns. Only targeted campaigns need higher sophistication: we have even seen attackers create a credible GitHub source repository with many stars, commits from multiple fake contributors, and other metrics of popularity and maintenance. Getting stargazers and reputation from fake contributions is not difficult to automate. We have seen abuses of open software infrastructures of all kinds, not only malware, like the tea protocol incident.

Package managers were designed for ease of use and not for security. They can run pre- and post-install scripts (sometimes compiling native code for a library is necessary). Package managers also install packages from multiple sources, and sometimes the default is to use public registries. They did not check for a mismatch between the metadata in the publish request and the metadata in the package itself.
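To make the install-script risk concrete, here is a minimal sketch that lists the lifecycle scripts npm runs automatically at install time (preinstall, install, postinstall) when they are declared in a package.json. The function name and the sample manifest are hypothetical, chosen for illustration; this is not how any particular scanner works.

```python
import json

# npm lifecycle hooks that run automatically during `npm install`.
INSTALL_HOOKS = ("preinstall", "install", "postinstall")


def install_scripts(package_json_text: str) -> dict:
    """Return the install-time lifecycle scripts declared in a package.json."""
    manifest = json.loads(package_json_text)
    scripts = manifest.get("scripts") or {}
    return {hook: scripts[hook] for hook in INSTALL_HOOKS if hook in scripts}


# Hypothetical manifest, for illustration only.
example = json.dumps({
    "name": "some-lib",
    "version": "1.0.0",
    "scripts": {"test": "jest", "postinstall": "node setup.js"},
})

for hook, command in install_scripts(example).items():
    print(f"{hook}: {command}")
```

A declared install script is not malicious in itself (native modules need them), but it is code that runs on the installing machine before anyone has looked at the package, so it is worth surfacing for review.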

Dependencies are nested and form a graph. In certain ecosystems like Node (JavaScript), small-grained dependencies accumulate in the hundreds or thousands. It is one thing to have strict control over the direct dependencies declared by my software projects, but transitive dependencies are harder to control. Open source followed “the friends of my friends are my friends”. Brotherhood is the norm in the wild Far East! Threat actors know this and hide the malicious behavior deep in obscure dependencies that are often unknown. This was the case with the event-stream incident targeting the Copay wallet.
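The gap between direct and transitive dependencies can be sketched with a plain graph traversal. The dependency graph below is made up for illustration (all package names are hypothetical); in a real project it would come from a lockfile or the package manager's own dependency listing.

```python
from collections import deque


def transitive_dependencies(graph, root):
    """Return every package reachable from `root`, excluding `root` itself."""
    seen = set()
    queue = deque(graph.get(root, ()))
    while queue:
        pkg = queue.popleft()
        if pkg in seen:
            continue
        seen.add(pkg)
        queue.extend(graph.get(pkg, ()))
    seen.discard(root)  # guard against dependency cycles leading back to root
    return seen


# Hypothetical dependency graph: package -> packages it declares.
graph = {
    "my-app": ["web-framework", "logger"],
    "web-framework": ["http-parser", "logger"],
    "http-parser": ["tiny-util"],
    "logger": [],
    "tiny-util": [],
}

direct = set(graph["my-app"])
everything = transitive_dependencies(graph, "my-app")
# Packages the project never declared but still installs and runs.
print(sorted(everything - direct))
```

Even in this toy graph, half of what gets installed was never chosen by the project; in ecosystems with small-grained packages that proportion is far larger.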

This is how open-source software has worked since its inception. It will not change much. Some package registries demand two-factor authentication at best, and often just for the most popular packages. Some registries provide scopes, a namespace owned by a vetted organization, but tragically others do not support them (PyPI) or make them optional (NPM). It is interesting to note that even a simple screening scheme (based on control of the DNS or GitHub repository/organization matching the group ID), combined with making PGP signatures mandatory for all artifacts except checksums, removes most of the “noise”, typosquatting-like malicious packages, and limits much of dependency confusion. Sophisticated attacks are possible but much harder, with only a few like the com.github.codingandcoding:maven-compiler-plugin known for Maven Central. And not all Maven registries follow the same practices!

Security controls on package managers may hinder dependency attacks but do not prevent them. The problem with multi-factor authentication is that, for automation, derived credentials like access tokens or API keys are generated for accounts to be used in API calls made from automation scripts, with no backing interactive user providing a second factor. MFA is good for protecting user accounts from password leaks, but the generated access tokens or API keys need to be protected while active, or their owner will be impersonated by the adversaries. A large fraction of package-based supply chain campaigns start with a leaked key/token. Just remember incidents like Ledger, 3CX, and many more, where non-interactive credentials were first exfiltrated in a preliminary intrusion for launching the supply chain attack.

The response given to this threat was not robust enough. In the third episode, we will focus on what worked, and what failed miserably. The industry needs to work collectively on the standards, processes, education, and tooling to mitigate risks to global supply chains. This is not a problem a single organization can solve on its own.

To end this section, the crucial misunderstanding: we are talking about malicious packages, not vulnerable ones. Vulnerabilities come from design or coding errors, accidentally introduced, without bad intent. The vulnerabilities may be exploited, but many are not. Malicious packages are always intentional, and there is 100% exploitability if they get executed. No comparable risk! Hence it is paradoxical to see how much effort is put into detecting and mitigating vulnerabilities, and the lack of equivalent measures for malicious components.

“We take security seriously”

Let’s imagine the customary Acme Corporation. Acme, a major provider for WileCoyote.com, has most of its software coming from third parties, with more than 80% from open-source projects. They produce software for internal usage, but they also provide software for their partners, providers, and customers / end-users. Acme has software written in Go, JavaScript, Java, C#, and Python, and runs most of its software on the cloud, under Kubernetes clusters. Acme builds its custom images from base images taken from Docker Hub and other registries. And they share a few libraries, packages, and container images in public registries as well.

Acme takes security seriously. They are well aware of the problem of open source security, and the risk it conveys. All developers, system managers, and DevOps engineers use those cute little crypto keys as second-factor authentication. All commits to code repos are signed, branch protection is enabled with mandatory code reviews, CI/CD is locked down, secrets are stored in a secret vault, and an internal registry partially mirrors external registries, storing only the allowed, white-listed components. Software built by Acme is required to take its third-party dependencies from this registry.

Probably most organizations fit into this profile. Dear reader, yours certainly fits if you are still here, doesn’t it?

Then one ill-fated day, an important frontend developer at Acme ran npm install acme-cute-lib, forgetting that @acme/cute-lib was the right scoped dependency. The exact mistake is not important: many things may go wrong even when one assumes perfect control of the software lifecycle. Our developer did not know that an APT group was targeting Acme and had published a malicious component under that name, in a cunning way so that the malicious behavior activates only when the software is installed on Acme computers. The package was not detected for weeks after its publication.

An installation script ran and searched for credentials (there were many juicy access tokens on our developer’s laptop) that allowed access to internal software repositories and to the aforementioned internal registry, which of course is only accessible via VPN. The malicious code managed to use the existing VPN connection and publish a second-stage malicious component into the internal registry, affecting a common utils library shared by most of the software delivered by Acme.

Weeks later, other organizations using Acme’s published tools started seeing strange traffic on their networks, using Acme’s protocol but directed to hosts resembling the Acme domain. The traffic was encrypted, but system monitoring tools found access to unexpected files and the execution of processes that looked like system commands but ended up running downloaded executables.

The rest is history: Acme first denied that such behavior was imputable to them and claimed that all security measures were in place. Only after the cybersec media started asking why the detected behavior originated from Acme’s components, and security analysts posted how riddled with stealthy malware those components were, did Acme have to recognize the incident and call in an incident response firm. The result was a negative marketing campaign that undermined hard-earned confidence in a second. “Acme was one npm install away from disaster” was a common headline. Then lawsuits and canceled contracts followed suit.

Do you see resemblances with known past incidents? Acme fell to a supply chain incident in two phases, using a mix of dependency confusion/typosquatting attacks that used a developer workstation as a beachhead for infecting components that ended up in software used by third parties. How could this be prevented or mitigated? 
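As one narrow illustration of a preventive check for the first phase, the sketch below flags an unscoped package name that collapses to the same string as one of the organization's scoped packages, as acme-cute-lib does with @acme/cute-lib. The normalization rule and function names are assumptions made for this example; a real control would also need to cover other typosquatting patterns.

```python
import re


def normalize(name: str) -> str:
    """Lowercase, drop the scope marker, and unify separators."""
    name = name.lower().lstrip("@")
    return re.sub(r"[-_./]+", "-", name)


def shadowed_internal_package(requested, internal_packages):
    """Return the internal package `requested` could be mistaken for, or None."""
    if requested in internal_packages:
        return None  # exact match: this is the intended dependency
    target = normalize(requested)
    for internal in internal_packages:
        if normalize(internal) == target:
            return internal
    return None


# Hypothetical list of Acme's scoped packages.
internal = {"@acme/cute-lib", "@acme/utils"}

for candidate in ("acme-cute-lib", "@acme/cute-lib", "left-pad"):
    match = shadowed_internal_package(candidate, internal)
    status = f"looks like {match}" if match else "no conflict"
    print(f"{candidate}: {status}")
```

Such a check could run in a pre-install hook or in CI against the declared dependencies; it would not stop the second phase of the incident, where leaked credentials were used against the internal registry.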

Why poisoned packages are so popular

This hypothetical incident shows that even with a reasonable approach to open-source security, organizations need specific measures to avoid falling prey to malware in open-source components. Schematically, the threat actor can:

  • Create a new package (following the well-known typosquatting or dependency confusion avenues, this is the most traversed path by the bad guys in volume);
  • Try to infect an existing one, either by injecting malicious code into the source, posing as a contributor via pull request or using social engineering to become a maintainer (as “Jia Tan” did in the XZ Backdoor or the right9ctrl GitHub user did in the event-stream incident in the fall of 2018), or by gaining open source repository credentials and impersonating the maintainer;
  • Inject malware during the build of the package, either by running a malicious build script, or by interfering with package downloads through man-in-the-middle intercepts (fortunately, TLS is now required in most registries);
  • Inject the packaged component directly into the registry, typically by capturing the registry credentials (the preferred alternative for many sophisticated attacks like Acme’s, where the compromised workstation in the first stage had the internal registry access token e.g. in the usual .env or ~/.m2/settings.xml: bad actors do know where to look for secrets). Vulnerabilities in the registries were also exploited. 

Poisoning registries with malware is the basis for dependency attacks. Nothing new under the sun: its prevalence exploded, but the same techniques work now as five years ago.  

The malicious package can operate at installation, during software build, or at runtime. The behavior ranges from information exfiltration (e.g. extracting secrets for a second-phase attempt) to source code extraction or dropping additional malware. In the next episode, we will dissect the malicious packages and how they are published.

Further reading

The next episode, Anatomy of Malicious Packages: What Are the Trends?, will focus on real cases we are monitoring with our Malware Early Warning system, day after day. We will review which types of malware were seen, and which tactics, techniques, and procedures are the favorites. We will examine obfuscation and how attackers try to hide from potential reviewers, the evasion techniques they use to avoid detection, and how they are evolving with telemetry and lateral movement. Please stay tuned!
