The open problem of library identification in minified APK files: directions and "reality checks"
Author's note: I have been taking the problem of identifying vulnerabilities in APK files more seriously as of late, building a couple of closed-source systems to hunt for problems. Much of this is day-job work. What follows is a high-level discussion of what actually "works" for solving this problem, along with some thoughts on whether this exercise of generating data-flow graphs, ASTs, regexes, you name it, is meaningful.
Background
This is an old, old problem. The idea has been snake-oiled to death by private security companies that sell library identification under all sorts of names, without mentioning that they also assume your particular company, composed of the hopes and dreams of thousands of developers each with their own agenda, has a uniform build process and neatly specified dependency and linking scripts for every *.so, *.jar, and what-have-you in every complicated pile of spaghetti code that some overworked and underpaid code monkey (myself included) hacked together in two hours trying to get the damn thing shipped.
I digress. The key point is that if such practices existed, said companies and tools would not really need to exist, and the maybe-sadness-inducing part is that even the companies paid big bucks through corporate can't scrape together a single developer qualified enough to surpass Diaphora or any of the other tools that actually kinda-sorta work for this purpose, despite tools like JADX providing them the basic semantic information necessary to solve the core issue.
So I've lately been hacking something together to get all of that working, and I'll let you in on a simple, effective strategy: replace the variable names in the AST of the decompiled code with an abstracted form (e.g. foo, bar -> _1, _2), and that alone gets the system a decent part of the way to accurately fingerprinting library versions. That noted, and here is the rub: you still need to scrape absolutely everything off Maven or your repository center of choice, à la the Libscan tool (rest in peace), although if you look at the Bloom filter approach taken in the functions associated with that part of the code, you'll start to remember your last marijuana trip a bit too well. As an aside, I'm half convinced Bloom filters were invented on Mary Jane, hence the name.
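To make the renaming trick concrete, here is a minimal sketch in Python. It assumes the third-party javalang package for lexing decompiled Java, and the Bloom filter at the end is a toy stand-in for the LibScan-style structure mentioned above, not anyone's actual implementation.

```python
import hashlib
import javalang.tokenizer  # third-party: pip install javalang

def normalize(source: str) -> str:
    """Rewrite every identifier in a decompiled Java snippet to a positional
    placeholder (_1, _2, ...) so ProGuard/R8 renaming no longer matters."""
    names = {}
    tokens = []
    for tok in javalang.tokenizer.tokenize(source):
        if isinstance(tok, javalang.tokenizer.Identifier):
            names.setdefault(tok.value, f"_{len(names) + 1}")
            tokens.append(names[tok.value])
        else:
            tokens.append(tok.value)
    return " ".join(tokens)

def fingerprint(source: str) -> bytes:
    return hashlib.sha256(normalize(source).encode()).digest()

class Bloom:
    """Toy Bloom filter over method fingerprints: a fast, probabilistic
    'have I indexed this method body before?' check."""
    def __init__(self, bits: int = 1 << 20, hashes: int = 4):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, digest: bytes):
        for i in range(self.hashes):
            h = hashlib.sha256(bytes([i]) + digest).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, digest: bytes):
        for p in self._positions(digest):
            self.field |= 1 << p

    def __contains__(self, digest: bytes) -> bool:
        return all(self.field >> p & 1 for p in self._positions(digest))

bloom = Bloom()
bloom.add(fingerprint("int a = b + 1; return frob(a);"))
# Same shape, different names -> same fingerprint -> True:
print(fingerprint("int x = y + 1; return quux(x);") in bloom)
```

Positional renaming is the whole trick: two copies of the same method body normalize to the same string no matter what ProGuard or R8 called the locals.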
Rolling your own
Back to the core problem. It only takes a small creative leap to build a decent function matching framework, but few consistent systems outside the centralized domains of power (e.g. Google, GitHub, Motorola's private code bases, and so on) have built the horsepower to catalog and fingerprint every library JAR. Once you have that basic part done, though, it is pretty easy to detect key libraries even with plain strings (e.g. the User-Agent okhttp/3.x.x for CVE-2016-2402, noting that you also have to hack the DNS resolver and all sorts of B.S. if you actually want to exploit it; much easier to target SMS, or an employee at the company with enough money to make it worth their while to introduce a backdoor).
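To show how cheap the strings-level check is, here is a minimal sketch that greps the dex files inside an APK for OkHttp's baked-in version constant. The regex and the patched-version cutoffs are my own assumptions (per the public advisory, CVE-2016-2402 was fixed in 2.7.4 and 3.1.2); dex strings are MUTF-8, so a raw byte scan is a heuristic, not a parser.

```python
import re
import sys
import zipfile

# OkHttp embeds its version in the default User-Agent ("okhttp/3.1.0", etc.),
# and string constants survive minification.
OKHTTP_VERSION = re.compile(rb"okhttp/(\d+)\.(\d+)\.(\d+)")

def scan_apk(path):
    """Return every okhttp x.y.z version string found in classes*.dex."""
    hits = set()
    with zipfile.ZipFile(path) as apk:
        for name in apk.namelist():
            if name.startswith("classes") and name.endswith(".dex"):
                hits.update(OKHTTP_VERSION.findall(apk.read(name)))
    return sorted(tuple(int(part) for part in hit) for hit in hits)

if __name__ == "__main__":
    for version in scan_apk(sys.argv[1]):
        vulnerable = version < (2, 7, 4) or (3, 0, 0) <= version < (3, 1, 2)
        flag = " (pre-fix, suspect)" if vulnerable else ""
        print("okhttp", ".".join(map(str, version)) + flag)
```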
And I guess that is why there is no consistent open-source database of reasonable compiled-library fingerprints, with deltas to capture the exact version. The Software Transparency project is an attempt at this, sort of, but any time I see "blockchain" mentioned on a modern JS-framework template site, my mind goes to rug-pull scheme before legitimate effort. That's just me, and I want to be proven wrong. Thankfully, digging into it, there was this paper, this data source, and goblinWeaver slash the repos under the chains project. But if you want the good, useful stuff, you have to build it yourself like me, and unfortunately that ends up proprietary, because who pays for a project like that but big tech or some cryptocurrency firm?
It is somewhat trivial to start downloading everything we can off the Maven mirrors, decompiling it, and storing the output on external drives. The remaining problem is getting consistent decompiled output between the minified libraries included in your *.dex (or other binary format) and the original artifacts linked from Maven or whatever other library source. That sort of data farm pays dividends, though: as long as you are a proper bookkeeper, as decompilers get better, so does your identifier. With this system built, let's take a step back and consider what we are actually looking for.
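A sketch of that bookkeeping, assuming Maven Central's directory layout (group dots become path slashes) and a jadx binary on PATH; recording the decompiler version next to each output is exactly what lets you re-run only the stale entries when decompilers improve.

```python
import pathlib
import subprocess
import urllib.request

MIRROR = "https://repo1.maven.org/maven2"  # Maven Central; swap in your mirror

def fetch_and_decompile(group, artifact, version, out_root):
    """Download one artifact jar from the mirror and decompile it with jadx."""
    out_root = pathlib.Path(out_root)
    out_root.mkdir(parents=True, exist_ok=True)
    jar_url = (f"{MIRROR}/{group.replace('.', '/')}/{artifact}/"
               f"{version}/{artifact}-{version}.jar")
    jar_path = out_root / f"{artifact}-{version}.jar"
    src_dir = out_root / f"{artifact}-{version}-src"
    urllib.request.urlretrieve(jar_url, jar_path)
    subprocess.run(["jadx", "--no-res", "-d", str(src_dir), str(jar_path)],
                   check=True)
    # Bookkeeping: note which decompiler produced this output, so the entry
    # can be regenerated when a better jadx release lands.
    tool = subprocess.run(["jadx", "--version"], capture_output=True, text=True)
    (src_dir / "DECOMPILER.txt").write_text(tool.stdout)

fetch_and_decompile("com.squareup.okhttp3", "okhttp", "3.1.0", "corpus")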
Too great of an ideal, too soon?
I suppose if you are reading this you are a hacker in some form or another, so you might be slamming the keyboard already: "bro, you don't need to download everything; you even pointed out how that okhttp CVE above wasn't all that useful." To which I will come clean: it is better to save yourself the compute time and storage, head over to cve.org, and grab just the critical, RCE-level exploitable issues that are public and affect your platform of choice. Then go hunt for those vulnerable versions in privileged system services, or better yet, for the specific vulnerable functions for which you have an exploit, being called in an exploitable manner.
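For the harvesting step, one hedged sketch: cve.org publishes records through the CVE Services API, but the NVD 2.0 endpoint mirrors the same list and lets you filter by severity server-side, which is all this pass needs (unauthenticated requests are rate-limited, so batch accordingly).

```python
import json
import urllib.parse
import urllib.request

NVD_API = "https://services.nvd.nist.gov/rest/json/cves/2.0"

def critical_cves(keyword):
    """Yield (CVE id, description) for CVSSv3-critical issues matching a keyword."""
    query = urllib.parse.urlencode(
        {"keywordSearch": keyword, "cvssV3Severity": "CRITICAL"})
    with urllib.request.urlopen(f"{NVD_API}?{query}") as resp:
        data = json.load(resp)
    for item in data.get("vulnerabilities", []):
        cve = item["cve"]
        yield cve["id"], cve["descriptions"][0]["value"]

for cve_id, description in critical_cves("okhttp"):
    print(cve_id, description[:100])
```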
At that point the real work and the real security effort come into play. It isn't so much the library version itself we care about as the presence of the semantics that make the code exploitable. That is somewhat less of the standard SBOM-evangelist "let's scrape the world for fun and profit" and more of a dusty, bored librarian's job. Infinitely less glory, infinitely more rewards. But once you sit down to do it, you find all sorts of fun things: some of the key universal applications across our modern computing infrastructure are held together by bubblegum and hopes and dreams.
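Here is roughly what that librarian pass looks like as code, reusing fingerprint() from the earlier sketch. VULN_DB is the hand-curated part; the entry below is a zeroed-out placeholder, and the method name is just illustrative.

```python
# Maps normalized-method fingerprint -> (CVE id, original method it came from).
# Hand-curated from decompiled known-vulnerable releases; placeholder entry.
VULN_DB = {
    bytes(32): ("CVE-2016-2402", "okhttp3.CertificatePinner.check"),
}

def triage(decompiled_methods):
    """decompiled_methods: fully-qualified method name -> decompiled source."""
    for name, source in decompiled_methods.items():
        hit = VULN_DB.get(fingerprint(source))
        if hit:
            cve, original = hit
            print(f"{name} matches {original} ({cve}); "
                  f"next: check whether callers reach it exploitably")
```

A fingerprint match alone is only a lead; the payoff is the last step, walking the callers to confirm the vulnerable semantics are actually reachable.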
Broader considerations and navel-gazing on cybersecurity
Unfortunately, the systems most targeted take a massive amount of capital to produce. By standard capitalist dynamics, these same systems are more often than not (I'd hesitate to say always) stitched together by wealthy elites using the efforts of thousands of underpaid and undereducated workers. The latter, while ethically (and intellectually, believe it or not) elite, cannot spare the time or energy to develop flows for security or privacy, as they are nipped at by their own security concerns, wolves at their ankles asking for rent money, for food money, and for their hearts in exchange for a breath of fresh air.
Thus most "pay me big bucks" SBOM companies provide false security blankets and a scapegoat for accountability. Unfortunately, cybersecurity as a field tends towards this market dynamic until a hunter (hacker) with a bone to pick derails the market narrative like Morris. We see an example was made of him. We can't have security at a societal or global level until we can universally, or in large numbers, support weak humans like Tolstoy, propped up enough to spend years producing decent-enough art. It does happen in some cases today: look at Linux (joke?). At the present time we have serfs under threat, rushed to use AI, who couldn't produce non-vulnerable patterns even if they tried, and are incentivized more to introduce a back door to any would-be-panopticon complex than to support that same complex's edification. I'm an optimist and you may consider being one too.
Parting Thoughts
Two of them. First, the cliché: identifying the flaws in modern systems is easy if you know what to look for. Develop matching metrics for specific exploitable vulnerabilities, don't waste time, and apply those metrics accurately. Second, there is no hope for a techno-elite so long as that elitism is based on any form of subjugation that does not also provide a broad enough foundation for its own complex of edification, within which said elite can be considered, and live, as such.
Just the rules of the road, as I see them on this fine Saturday. Thank you for reading. I'm back to the "dusty" digital library shelves to catalog more vulnerability semantic fingerprints. Maxwell Bland, 01-10-26.