Open Source Friday with aboutcode.org

GitHub | 01:07:37 | Apr 11, 2026
Chapters: 10
Host introduction sets the stage for a discussion with Philippe about open source origins and the focus of the session.

Philippe from aboutcode.org breaks down how ScanCode and related data tools govern licenses, security, and provenance in open source at scale, even with AI and evolving regulations.

Summary

Philippe, the maintainer from aboutcode.org, joins an Open Source Friday session on GitHub to share the origins of his work in licensing and software supply chain management. He explains how open source licensing was intentionally abstracted into a handful of well-known licenses (GPL, MIT, BSD, Apache) to reduce legal friction for developers. The conversation pivots to ScanCode, a data-first project that systematically detects licenses, provenance, and security signals across code, bundles, and binaries, and scales to millions of packages with a focus on correctness and performance. Philippe describes the architecture: a large query run against a small index, data-driven detection, and fast tokenization and bit-vector techniques to handle tens of thousands of license texts and notices. He discusses how license data is curated and maintained, including community contributions from lawyers, and the practical realities of dealing with transitive dependencies and complex packaging ecosystems like Rust, Python, and Java. The team's approach balances automation with human review, acknowledging that 100% automation is unattainable at scale. They also touch on the regulatory pull from Europe's Cyber Resilience Act and the need for transparency into what's inside software supply chains. Finally, Philippe demonstrates ScanCode and related tools in a live demo, highlighting how users can inspect packages, dependencies, and licenses, and how the project collaborates with initiatives like Package URL (purl) and clearlydefined.org to standardize data.

Key Takeaways

  • Licensing for open source is most effective when simplified to a few standard terms (GPL, MIT, BSD, Apache), allowing developers to focus on code instead of legalese.
  • ScanCode uses a data-first, purely data-driven detection approach (tokenization, bit vectors, and automata) to scale license and provenance detection across 40,000+ license texts and thousands of files.
  • The project emphasizes assessing transitive dependencies and package provenance to prevent hidden risks in large baselines of code and containers.
  • Maintenance at scale relies on community contributions (including lawyers) to spot non-detected licenses and on policy-driven views to triage issues for faster review.
  • Data and data provenance are as crucial as the tooling; ‘origin’ data powers license, security, and health signals across millions of packages.
  • The ClearlyDefined and Package URL ecosystems (purl, SPDX, and related standards) are central to interoperable, machine-readable open source metadata.
  • AI and agent-based tooling are approached cautiously: current strategy favors human review and post-hoc verification to avoid false positives in licensing and security data.

Who Is This For?

Developers and security engineers who build, deploy, or rely on open source software; OSS maintainers who need scalable license and vulnerability management; teams evaluating AI-assisted coding or supply-chain tooling.

Notable Quotes

"There is no business for open source business software, because your accountant, your sales clerk, is not the one that's programming the tool."
Philippe explains why aboutcode.org pivoted toward licensing-focused tooling rather than traditional OSS business software.
"The biggest win of open source is to abstract lawyers away."
On simplifying licensing with standard terms to reduce legal complexity.
"One word matters. A license can be MIT, but if the text says redistribution is not permitted, that changes everything."
Demonstrates the subtlety and localization issues in license text that ScanCode has to handle.
"We needed to evolve from regex-based approaches to a data-driven model that uses tokenization and bit vectors for speed and precision."
Explains the technical evolution of ScanCode's detection engine.
"Data about origin, license, and security is more important than the code itself for our users."
Philippe on the data-centric philosophy behind their tooling.

Questions This Video Answers

  • How does ScanCode detect licenses and provenance across large codebases?
  • What are package URL (purl) and SPDX, and why are they important for OSS metadata?
  • How can organizations assess the security and licensing health of transitive dependencies at scale?
  • What is ClearlyDefined and how does it relate to aboutcode.org's ScanCode data?
Open Source Friday, Philippe, aboutcode.org, ScanCode, License detection, Software supply chain, Package URL (purl), ClearlyDefined, SPDX, Licensing best practices
Full Transcript
Good morning, good afternoon, and good evening. We are joining live today on Open Source Friday with Philippe, the maintainer from AboutCode, ScanCode Toolkit. We're super excited to have everyone join us from around the world and dive in a little bit around ScanCode. So Philippe, would love to have you do an introduction to yourself, and maybe just an overview of how you got into open source. Yes. So actually my first foray into open source was last century. There was this weird stuff called the web, and there was this weird server called Netscape, which was literally a web server that was available as source code, running on some IBM Unix machine of sorts, and that was kind of fun. And the first thing I played with was a tool called PHP, actually. It's funny, at the time it was called Personal Home Page/Forms Interpreter, and the big step ahead of programming CGI in C was PHP, because you could actually avoid doing low-level C programming. So that's how I got into that. But eventually I lost my job around the time of September 11th, and I said, well, there's something to do about open source; eventually it's eating the world and everything's going to be replaced by open source software and services 20 years from now. I was very wrong about the timing, but eventually that's how I got into it, doing tooling, and eventually fell into licensing and software supply chain. That's where I am. Maybe that story deserves a little bit of a deep dive: how do you fall into tooling and licensing as a category within open source? How did you get into that? Well, after doing a couple of 360-degree turns: initially the plan was to build business software, enterprise planning software, as open source. Mhm. It was a very bad idea, because there's no business for open source business software. And I'll repeat that.
There is no business for open source business software, because your accountant, your sales clerk, is not the one that's programming the tool. It's extremely difficult to have users that are also contributors. And what you see is that you have very few, if any, true open-source solutions for business. We did that, we turned around, and we ended up looking at asset management. So still a bit business, but IT asset management, a bit closer to the code. And eventually we delved into building tools like compilers, debuggers, very low level, and IDEs around Eclipse and NetBeans at the time. And one day a guy reached out and said, hey, you know, we're in Silicon Valley, we're buying all these companies, we're buying this software company, I'd like to know if there's something that's GPL in there already. I said, I don't know. We knew of course about licensing a bit, because we were doing a lot of tooling, and said, we've never done that, but you know, I know how to use grep, so why not. That's fair, that's a good way right there, I can grep it out. I was very open with the guy, he was a friend, and eventually we used a lot of grep and we found a lot of weird licensing and weird origins in the code of the software company they were buying. Okay. And being tools developers scratching the itch, we transformed the grep script into something a bit more fleshed out. That makes sense. Interesting. And at that point in time, obviously you pulled a lot of details; how did you think about the legal aspect of licensing? Did you have to have other people from the team join? Yeah. I guess, I mean, if you think of it, open source is defined by licensing. Yes, 100%. The biggest win of open source has been to abstract lawyers away. I can say GPL, MIT, BSD, Apache: four words, and I've abstracted away thousands of words and lines of contracts.
I think the thing is that you don't have to be super savvy about legal topics to understand licensing for open source. If you think about commercial licensing, every single contract is unique, and lawyers will spend a lot of time trying to interpret the terms of this and that contract. Here, even though there's some pushback at the moment, especially coming from LLMs and AI and all the interesting innovation in licensing, for a while we've been able to abstract away legalese from code and focus on just a few words for a license. Yeah, I think that's interesting. I mean, to your point, when you can standardize and simplify, access is one thing, and then two, you make it easier for people to not run into risks when building, using, consuming. It certainly made it easier. It is interesting now with the models shifting licensing and changes. But what it kind of comes down to, and what I think we'll talk about, is: across the transitive dependencies, how do you then manage the complexity of licenses under the hood and make sure they're all compliant? Which is kind of an interesting segue to ScanCode. So maybe just share a little bit about what the project is and what it does, both AboutCode and then ScanCode the project. So aboutcode.org is a public benefit nonprofit corporation based in Brussels, Belgium, and we are extending that to become a small open source foundation, funded and managed by its members, to ensure we can maintain the tools and data.
What we do first and foremost is provide good data about open source packages: where they're coming from, what's the license, are there known security issues, and eventually health risks. I mean, would we be able to predict ahead of time that something like the xz-utils compromise was about to happen? I was checking yesterday: we did a bit of a clean sweep on all the GitHub Actions and apps we're using on our repos, because of the recent Actions supply chain compromises and all that happened. I checked yesterday that one of the actions we're using depends on a JavaScript package that was last touched 14 years ago. I said, I mean, maybe the code is done. It's done, it's simple. But the point is, 14 years: what can go wrong with that? The guy can die, somebody can take over the account. There are so many things that could happen. And of course, this action has full access to everything; even if they don't have access to secrets, they would be able to do commits from that action running into a repo. So there are a lot of things that need to be taken into consideration there. I like it. And maybe kind of double-clicking on that: how did you come up with the need for the project? What problem do you feel like it's solving today as a core aspect? Yeah. So initially the first need was, you know, grep in a large codebase to find the word GPL. Yeah. At the same time we were building a distro of Eclipse plugins, so a more polished IDE, actually a series of IDEs focusing on a lot of niche development environments, full stack and all, and these plugins were under different licenses. So we needed to be a bit savvy about that and give them a name and a code that would make it easy to reference them. Stepping a bit ahead, we started having a lot of requests on demand to help with this due diligence.
We started writing tools, and literally the first iteration was a bunch of regular expressions. The second was also a bunch of regular expressions. It was extremely bad and slow, but it did the job. Sure, to the point: it did the job. And I think to your point on that too: given the finite set of licenses that you were looking for, regexes are fairly simple in terms of finding those, right? It gets you 80% of the way. Yeah. The difficulty is getting to 90, 95. The problem of license detection in itself is like a search problem. Yeah. Except you're not searching the whole internet; you're searching a data set of 20 to 40,000 files or mentions of licenses across one codebase. Yeah. But if you think of it, Google search, for instance, prior to the advent of LLMs, has been limited to searching for 32 words. Here, our index is maybe 100 megabytes, not 100 petabytes, but the query is gigabytes of code. So it's a search problem, but the size of the problem is inverted: the query is very big, the index is very small. Mhm. And that's essentially the architecture of ScanCode: you have a large query against a small index, so everything has to be optimized so you can handle something that big. The other thing that's really important in licensing is that one word matters. I keep a small stash of licenses, which are examples of projects that supposedly declare an MIT license.
MIT says redistribution is permitted; but if the text says redistribution is not permitted, that's very different, right? It changes everything. There's not a huge number of these, but enough for this to be a problem at scale, and enough to easily trip a tool that's a bit too naive about using regular expressions. It's not just the comma that matters; stray letters make all the difference. We helped the Linux kernel maintainers a couple of years ago to clean up the licensing documentation of the kernel a bit. Literally, some of the notices were just GPL in the code, in a license header: in a C header you had just, you know, "copyright John Doe", comma, "GPL". So these small letters matter, but there's also a problem of localization: "mit" in German means "with", and you cannot just take MIT at face value. Oh, interesting. I didn't even think about the localization aspect. And something as mundane as BSD, you know, BSD sockets, is not an indication of a license: you can have an implementation of BSD sockets and an internet stack under any license. So maybe double-clicking on that a little bit: over time, from what started with grep and regex, how has the project evolved, with improved search techniques and tools, and then also more complex licenses and more complex projects with many more transitive dependencies? While some may have 20,000, some have many, many more than that. Yeah. So the thing is, you have essentially three or four different types of license documentation. A full license text, that's easy: you have an Apache license, it's easy to match and detect. Then you have notices, which say, oh, this file is under the Apache license. Easy enough, except that everybody reuses the same format for the notice, and you have an Apache notice where somebody just replaced "Apache 2.0" with "GPL 2.0".
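The point that a single token flips the meaning, whether an inserted "not" or a swapped license name in a notice, can be made concrete with a toy token-level comparison. This is a deliberately abbreviated clause and a naive similarity measure, not ScanCode's actual matching engine:

```python
import difflib

def tokenize(text: str) -> list[str]:
    # Lowercase word tokens; punctuation is ignored, as a naive matcher would.
    return text.lower().split()

clause   = "permission is hereby granted free of charge to deal in the software without restriction"
tampered = "permission is hereby granted free of charge to deal in the software with restriction"

a, b = tokenize(clause), tokenize(tampered)
ratio = difflib.SequenceMatcher(None, a, b).ratio()
print(f"similarity: {ratio:.2f}")  # very high, yet the meaning is inverted
```

A similarity score above 0.9 here would pass most fuzzy-matching thresholds even though the tampered text grants nothing; this is why a serious detector has to diff down to the token level rather than accept near matches.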
And again, here the notices are extremely similar except for a few characters. And then you have a truckload of vague references, plus structured ones that come in package manifests and, in some cases more recently, in what we call SPDX license identifiers, which are a small convention to document the license in the code. In the package manifests you have license fields that exist in, say, package.json, pyproject.toml, or Cargo.toml for each of these languages. In many cases there are things which are ecosystem-specific: say, in Cargo.toml for Rust, there's a convention in the Rust ecosystem to say everything is "Apache/MIT", and that doesn't mean Apache AND MIT, it means Apache OR MIT, and it's very specific to Rust. So you need to be able to account for these conventions, because even then, a field that looks structured from the outside is not so structured after all. Sure. So you need to account for that. The difficulty is that there are two ways to put that into code. The first way is a big pile of spaghetti: if you have LGPL 3.0 and MIT, then it's this and that. I've done that a couple of times before we got there. It's super hard to maintain. Yeah, it's a major pain, because each new case requires a new case in the code. Eventually what we evolved is something which is purely data-driven, where the code is smart, and the more data you throw at it, the faster detection gets and the better it gets. Which is a bit counterintuitive, but we use things you would think to be really weird for something that would originally be solved with just grep. We tokenize license texts. We transform them into bit vectors. We do intersections of bit vectors, which is very fast. We use a lot of automatons, and we have an optimized diff, which has been rewritten in C, to do multiple sequence alignments of sequences of integers. So there are a lot of tricks there.
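A minimal sketch of the bit-vector idea: tokenize each rule text, encode its token set as bits in an integer, then use cheap AND-plus-popcount intersections to rank candidate rules before any expensive diffing. The three-rule index and the tokenizer are invented for illustration; the real engine uses tens of thousands of rules, positional automatons, and sequence alignment on top of this kind of filter:

```python
# Hypothetical rule texts keyed by license; ScanCode's real rule set is far larger.
RULES = {
    "mit": "permission is hereby granted free of charge",
    "apache-2.0": "licensed under the apache license version 2.0",
    "gpl-2.0": "gnu general public license version 2",
}

def token_ids(text, vocab):
    ids = 0  # a Python int used as an arbitrary-width bit vector
    for tok in text.lower().split():
        if tok not in vocab:
            vocab[tok] = len(vocab)  # assign each new token the next bit position
        ids |= 1 << vocab[tok]
    return ids

vocab = {}
rule_vectors = {name: token_ids(text, vocab) for name, text in RULES.items()}

query = "this file is licensed under the apache license version 2.0"
qvec = token_ids(query, vocab)

# Rank rules by the number of shared tokens: popcount of the bitwise AND.
scores = {name: bin(vec & qvec).count("1") for name, vec in rule_vectors.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

Set intersection over bit vectors is one machine word AND per 64 tokens, which is why throwing more rule data at such an engine can make it faster rather than slower: the cheap filter discards almost every rule before the expensive alignment step runs.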
The whole point is trying to find something which is not too slow while really focusing on correctness first. Not too slow means: we have about 40,000 examples of license texts, excerpts, and notices, and you're trying to find whether any of them is present in a file. If you were to do a pairwise diff for each of these, say on average 10 per rule, that's 400,000 per file, and then you have 100,000 files, roughly the size of the kernel. You have the option to do that on a single processor and take a couple of years, or a thousand processors and take a month, and none of these options is happy time-wise, and putting a thousand CPUs on it for a month, or even a couple of hours, is not happy on the wallet either. So the whole trick is trying to find a way to make it affordable. It runs on my laptop, for instance: the Linux kernel, about 100,000 files, in about 20 minutes. 20 minutes is actually pretty reasonable. It's pretty reasonable; it's a fast laptop. Yeah, fair enough. Interesting. And then, because of that, maybe talk a little bit about how the team is managed around building and maintaining this project. It's an interesting kind of project, because it's maybe not a traditional core dependency, but it's very important in the ecosystem for compliance and managing open source. How does the team think about building and maintaining the project, since you're maybe not getting as many production-oriented bugs and things like that? Yeah. Well, people tend to be very vocal when you get problems, especially of that kind. So there's really a side of managing data. We have a lot of contributors in the community that scout and spot new weird licenses. Some of them are lawyers.
Actually, we've received awesome contributions from lawyers just saying, "Hey, you know, I've spotted this license here, it's not detected correctly." That's really great. The rest of it is really organizing the traditional work on the code, knowing that for us the code is almost secondary to the data. It's more a means to an end, where ideally I'd like to ensure that everyone has clear information about the license, the security, the health of an open source project and package, so you can focus on using it or not, and do the right, fun thing with it, rather than asking: oh, is it okay for me to use that? Is it vulnerable? Is the project dead? Which is the kind of question we all have to ask ourselves every time. Literally, among ourselves we had these discussions: yesterday one of the maintainers said, hey, you know, I'd like to use this and this to parse YARA rules. I said, hmm, great, interesting: PLY, yara. Okay, let's pretend we've built all the tools we'd like to have; what can you say about this? It happens the project has been unmaintained for four years, and the guy that maintained it says: I'm sorry, I've stopped maintaining it. It was out there for 25 years, but now it's gone. So it's yours, you're welcome to vendor it or do whatever with it, but I'm done with it. That's the kind of information we'd like to have. And the difficulty is, that was one case, but as you know, you can build a JavaScript application that's deployed in containers, and in a snap you have 10,000 packages, and you're talking about one app. What if you have 100, or, at the scale of GitHub, maybe a thousand or 10,000 just to manage all your services? That's a scale that's difficult for anyone to manage, whatever your means and resources. Yes. So eventually we need to, whether it's enterprises or open source projects, pool our resources together so we don't have to do the same work twice.
I should not have had to do this work checking this project yesterday. I should have been able to go to a place which tells me: yes, this stuff is dead, you're welcome to fork it, but if you're willing to use it, you have to do the work. Or: don't use this one, because it carries a bunch of dependencies, which means you're carrying a lot of baggage; or this other one, which has too many unresolved vulnerabilities. So I don't know how to do that except to do it together; to me, that's the only way, and there is no alternative. The problem is too big. The cost of doing it yourself is too big at any scale. Especially because now you have regulations. In Europe, for instance, the Cyber Resilience Act, with which we have to come out of the closet as open source developers: they found us. The good thing is that they found us in a nice way. There are a lot of references in this regulation (which I didn't read; I skimmed it and grepped a few words), a lot of mentions of free and open source software in it. But eventually it means that understanding what you put in a piece of software becomes super important. Of course we know it's important, but it's super easy to add a dependency and not understand that you have pulled in another 10, which have pulled in another 10, and not everything may be super palatable in terms of maintenance, provenance, and license. Yeah, I think that's spot on. It's increasingly important to be thinking about those things, especially as we have more agentic, AI-driven development as well. There was a comment, maybe we can dive in, that said a demo is worth a thousand words. So maybe we can start doing a demo of the project a little bit; if you want to, I'll pull up your screen, and then we'll have the infinite windows for a minute. Yeah, that's okay.
That's okay. So one of the tools is called ScanCode; there's a command-line version and a web version, if you would like. The way I like to think of it (let me double-check here): one is command line, one is web-based, but it's basically a way to organize and run pipelines on a piece of code. These are projects here that you can see. This is an example of a project where I used as input the source of a version of Log4j, and another one which is the binary. When you run a scan, in this case a simple scan, it's going to inspect the codebase for packages. So it's downloading the code, extracting the archives, and then eventually finding binaries and scanning application package manifests. What I found here is four packages, and I have their versions. By the way, let me zoom a bit for the folks: this here is called a Package URL, or purl. If you've not heard about it, you will. It's a small standard that we started in ScanCode which is now used a bit everywhere, including at GitHub, and it's also been merged into the CVE record format since October, so that's fairly recent, late October. It's a way to identify a package in an SBOM document or a vulnerability database; we of course use purls and package URLs throughout AboutCode. It's very obvious when you see it: it's a Maven package that comes from, in this case, this Maven group ID, and has this name and this version. So here I have found these four packages. One of them is from source; I can drill down into the set of resources, the files that exist for it. I have a way to drill down also into the dependencies and all the different package URLs that are also referenced. Which is weird: you'd say, does Log4j depend on Jackson? Well, actually it does. So even Log4j potentially has vulnerabilities. In this case we're lucky: there are none on that specific version.
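A purl like the one in the demo can be decomposed with a few lines of stdlib code. This is a simplified sketch that ignores qualifiers and subpaths; for real use, the purl spec and the packageurl-python reference implementation from AboutCode cover the full grammar:

```python
from urllib.parse import unquote

def parse_purl(purl: str) -> dict:
    # pkg:type/namespace/name@version -- qualifiers ("?...") and subpath
    # ("#...") are omitted here for brevity; see the purl spec for the full grammar.
    assert purl.startswith("pkg:"), "a purl always starts with the pkg: scheme"
    body = purl[len("pkg:"):]
    body, _, version = body.partition("@")
    parts = body.split("/")
    ptype, name = parts[0], unquote(parts[-1])
    namespace = "/".join(unquote(p) for p in parts[1:-1]) or None
    return {"type": ptype, "namespace": namespace, "name": name,
            "version": version or None}

purl = "pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1"
print(parse_purl(purl))
```

The type tells you which ecosystem's conventions apply (here Maven, so the namespace is a group ID), which is exactly what lets tools like VulnerableCode key everything by purl across package managers.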
The whole point is you input the code and you can look at the dependencies that were detected. It's actually a lot of package dependencies, and it's not obvious that you would have all these dependencies included in Log4j. You can drill into the files, and if you go into a bit more detail, well, in this case it's not super interesting: there's only one license. Let me look at something a bit more interesting. This one is a Docker image, maybe not what I wanted to check. Another interesting thing we can do here: just out of curiosity, I took a Docker image called Scout, which happens to be a proprietary tool from Docker that does a bit of scanning. And what you see is that there are actually very few files in that Docker image. Let me go into the files: very few files, 14 files. If I look at just files, we have four real files: a JSON VEX, some certificates, an /etc/passwd, and a 123-megabyte binary. In this case, since we're running on the Docker image, it's probably an ELF. The funny stuff is, again, if I go back here: one package, but I see 420 dependencies. What it did here is that it went and introspected the binaries, extracted dependency information, and reported it. And of course it's a proprietary package, so I really cannot do any more than that. But the point is that I can introspect deeply into binaries. That's one of the other aspects. I can inspect, in this case, a Docker image, again a more mundane one, which is based on Alpine. So it's a base image from Alpine, pretty standard, nothing super funky.
If we go into something a bit more involved in terms of licenses: here it's a large tarball, actually an image from Nextcloud, which is a project that provides office tools, so it's a bit more interesting. Here we ran just a base pipeline, and we have, given the policy that I set here, a certain number of license issues based on my policy. So it's really policy-driven. We can see that we have a certain number of license detections related to an AGPL, and I can dive into the details of where exactly it was found, and eventually what rule triggered it. In this case it was just a single word. Mhm. But you see how you can have all these levels of detail here. We see detections grouped by license. I can again dive into the dependencies, and 25,000 is of course a bit more engaging than a couple thousand. We have of course a lot of PHP Composer packages. With something like that, when you have 25,000: when you talk to your users today, how do they think about how to prioritize their time, where to look, and how to use this most efficiently? Yeah. So the thing is, we're building a whole set of features to make sure that you can focus only on what matters. You put in a policy saying: these are the licenses that I care about, the ones that are potentially problematic, and the ones that are really problematic, and you will be able to see only the ones that you want to look at. In this case there's something, maybe not on this scan, but you have a status which says: hey, these are the to-dos that you really want to look at. Mhm. Because of course it can be very overwhelming. On the other end of the spectrum, there are really three layers: the data, the tools to collect the data, and the tools to act on the data to do some level of management.
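The policy-driven triage described here can be sketched as a simple classification of detections against a user-defined policy, so that review effort goes only to the flagged items. The license keys, categories, and paths are illustrative, not ScanCode.io's actual schema:

```python
# Hypothetical policy: map detected license keys to a review category.
POLICY = {
    "agpl-3.0": "prohibited",
    "gpl-3.0": "review-needed",
    "mit": "approved",
    "apache-2.0": "approved",
}

detections = [
    {"path": "core/server.c", "license": "agpl-3.0"},
    {"path": "vendor/util.py", "license": "mit"},
    {"path": "lib/crypto.rs", "license": "unknown-xyz"},
]

def triage(detections, policy):
    """Return only the detections a human should look at, with their status."""
    todo = []
    for d in detections:
        # Anything not explicitly approved (including unknowns) needs review.
        status = policy.get(d["license"], "review-needed")
        if status != "approved":
            todo.append({**d, "status": status})
    return todo

for item in triage(detections, POLICY):
    print(item["path"], item["license"], item["status"])
```

The key design choice is the default: an unrecognized license falls into "review-needed" rather than silently passing, which matches the talk's emphasis on correctness over convenience.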
Right at the top of the stack there's this tool called DejaCode, where you don't manage codebases so much as you manage products. You have a product, you have a version, and the scans we just did actually feed into it. Here you're able to do a bit more advanced activities: for instance, I have vulnerabilities, I have a rescoring of these vulnerabilities, and I can filter to look only at, for instance, the critical ones. Then I've been able to focus on four of the packages out of 151. That's the idea. Yep. It's an area where we're doing a lot of work, because even there, we know there are a lot of vanity CVEs. Some of them are legit, some of them are junk. Being able to score something, when we know that in a specific case there is a known exploit, is typically what flags something that requires really immediate attention. So we've been scanning; now we go into the data on the vulnerability side, where VulnerableCode is a database where we aggregate information about security for packages, like exploits, references, and severities, from multiple data sources. The difference from many other things is that everything is keyed by purl, and what we care about here is not so much the vulnerability as the purl: what is the package that is affected, and where do you have fixes. Across the board, that gives you a bit of an idea, a bit of a flavor, of what we do with all the tools. There's lots more to it than that, but at least think of it as data about origin, license, and security. And when I say data about origin, actually, there you go: this is the database we have for curated packages, which is about 21 million packages. It's more of a demo. This other project is called ClearlyDefined.
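The rescoring-and-filtering workflow described for DejaCode and VulnerableCode amounts to sorting findings so that known-exploited issues outrank everything else, then by severity. The CVE IDs below are real, but the records and the data model are invented for this sketch:

```python
# Illustrative vulnerability records keyed by purl; VulnerableCode's real
# data model is richer (references, fixed versions, multiple scores, etc.).
vulns = [
    {"purl": "pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1",
     "id": "CVE-2021-44228", "severity": "critical", "known_exploit": True},
    {"purl": "pkg:pypi/requests@2.19.0",
     "id": "CVE-2018-18074", "severity": "medium", "known_exploit": False},
    {"purl": "pkg:npm/lodash@4.17.15",
     "id": "CVE-2020-8203", "severity": "high", "known_exploit": False},
]

SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}

def prioritize(vulns):
    # Known-exploited issues first (False sorts before True on `not known_exploit`),
    # then by severity rank within each group.
    return sorted(vulns, key=lambda v: (not v["known_exploit"],
                                        SEVERITY_RANK[v["severity"]]))

for v in prioritize(vulns):
    print(v["purl"], v["id"], v["severity"])
```

This mirrors the point about vanity CVEs: raw severity alone is a weak signal, so exploit evidence is promoted above it in the sort key.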
It's a project that we co-manage at the OSI, the Open Source Initiative, and which is running ScanCode behind the scenes on a continuing basis, and you can see it's fairly fresh; well, reasonably fresh at least. The point is that here, if things were working according to plan (and they never entirely do), we're talking about a larger volume of packages, in the range of about 55 million. Wow. Yes. But it's probably too much. You know, I don't care about these 55 million scans; I care about the 1 million packages that almost everyone uses all the time, which should be fully vetted, scanned, and watched so that everyone can benefit from them. So anyway, the whole point is data, starting with origin. You have detailed scans on the packages, done ahead of time, vulnerability information, and the license database, which is then used to feed license detection, making sure this data is widely available for everyone. And I guess one thing I'm curious about: once you've done this at scale, how do you maintain it? Are you actively just checking for diffs at the package level going forward? So that's a problem; thank you for the question. There's no easy way. There's a big amount of work that's done on ClearlyDefined to create the data; we're talking a good number of commits, and fresh commits, and actually a lot of the contribution there comes from Microsoft, on GitHub. The whole point here is: somebody looked at this icu_normalizer package from the Rust ecosystem a couple of years ago; the license for version 1.5.0 was not correct, and this is the license that should have been detected.
So that's one way to get better information, through curation, but that doesn't solve the problem you were raising, which is how you deal with all the changes, all the time. Eventually we need to evolve. We have some code in ClearlyDefined that's been contributed, and there are a few other places to look at, but the idea is: what if you have a diff, not so much at the package code level, but at the metadata level? If there's no major change in the code, if there are no changes in copyright, the license is the same, and the dependency tree is mostly similar with minor changes, you may be able to carry forward the analysis that was done on version 1.5.0. Yeah, to version 1.5.1. Not fairly guaranteed every time, though, right? I guess that's kind of my point: how do you then make sure it stays correct, at the pace of those sub-packages' commits and changes? Obviously you should only be looking at license diffs in this case, but even then it's a question of accuracy. Yeah, no, it's a big problem. But that's one technique: look at these diffs to facilitate the review, not so much to come to an immediate conclusion, but to say, hey, there were not so many changes here after all, so we can consider this as auto-curated with good enough confidence, not 100%, since it's not been reviewed by an expert. And the same applies to another domain, which is, well, for instance, LLMs are wonderful at copying code. We've actually built a tool specifically for that, integrated into ScanCode, to be able to find whether code may have been derived from open source code. So it's not so much detecting AI-generated code as such; it's able to detect strong similarities between open source code that may have been used to train an LLM and the code that was generated. In practice it's very easy to have an LLM spit out verbatim the code that was used for training, because they memorize it. Yeah.
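The metadata-diff "carry forward" heuristic described just above could look roughly like the following sketch. The field names, the churn threshold, and the example data are hypothetical; this is not ClearlyDefined's actual logic:

```python
# Hypothetical sketch of the "carry forward" heuristic: if metadata
# between adjacent versions is (nearly) unchanged, propagate the
# earlier human-reviewed conclusion, flagged with lower confidence.
def can_carry_forward(prev: dict, curr: dict) -> bool:
    same_license = prev["declared_license"] == curr["declared_license"]
    same_copyright = prev["copyrights"] == curr["copyrights"]
    # tolerate minor dependency churn (here: <= 10% of the tree changed)
    prev_deps, curr_deps = set(prev["dependencies"]), set(curr["dependencies"])
    churn = len(prev_deps ^ curr_deps) / max(len(prev_deps | curr_deps), 1)
    return same_license and same_copyright and churn <= 0.10

v150 = {"declared_license": "Unicode-3.0",
        "copyrights": ["(c) Unicode, Inc."],
        "dependencies": ["pkg:cargo/smallvec@1.11.0"]}
v151 = dict(v150)  # metadata unchanged in the new release

if can_carry_forward(v150, v151):
    curation = {"license": "Unicode-3.0",
                "confidence": "auto-carried, not expert-reviewed"}
```

As the conversation notes, the output is a triage signal to focus human review, never a final verdict.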
It's probability. It's, yeah, exactly. Yeah. But you can ask, you know, give me the code of this package version, or a function that does that, and they're going to go and search exactly for this package version, without looking outside, just inside what's been used for training. So that's an interesting area, also being able to find similarities and differences. One of the things we do in ScanCode is to detect binaries and do binary analysis. I don't know if I have, well, there's an example of a binary analysis here where we've mapped the source and the binary. So you have, for instance, a class file in Java mapped to a source file. And the interesting thing here is: what are the files that exist in the binary that don't have a corresponding source file? Yeah, that's another way, not so much to do a diff, but a diffing between source and binaries. We do it for Java, but also for ELF and native code, like we've seen with Go a bit earlier. The interesting thing here is that you find things which are problematic, which then feed a review. In this case, HTrace includes a vendored, modified copy of Log4j, which is problematic. Mhm. So the whole point is there's no single magic technique, but ways to nudge and guide toward the things that need review. And the last mile is to expose that to the community, so we can curate this data the same way we're doing it on ClearlyDefined, but also for composition, and eventually for security or health issues. I was talking a bit earlier about this library we were looking at yesterday: on the surface it looks very nice, and then I go into the third-party code and there's a submodule depending on Ply, which is a parser generator in Python, and it says, you know, last maintenance three years ago. Three years ago! And let me look a bit further, going upstream. Oh, archived. And the maintainer says, "You know, I've worked 25 years on this.
It's over." There's a project we're collaborating with called CLE, Common Lifecycle Events, which is standardizing, together with CycloneDX and Package URL, a way to capture this information in a structured form. Yeah, I mean, it feels like something like this often just gets buried in the readme, so you don't always see it, and archiving makes sense, but not everyone archives their project, and even if it's archived, that doesn't necessarily mean it's fully done, so yeah, it's interesting. Yeah, so this is the kind of thing I think we can eventually resolve together. But to me there are other things to look at too, and I don't know if they're interesting or frightening or a bit of both. Agentic coding, of course, is upon us. And it raises so many questions for open source and the things we do on software supply chains. If I can push, or just nudge, an agent or a series of agents to extract a spec from a piece of software and project it into another language, or refactor it entirely while keeping the same API, like we've seen with Chardet a couple of weeks ago, what does it mean for open source? I mean, if everybody feels free to change the license, what does it mean for copyright? It's an interesting thing. There's a lot of slop also, and slop sometimes means picking up weird things. It's not evident, if you look at this package on PyPI, that there's anything that tells an AI agent not to use it. Well, it's even worse here: its last public version is from 2018. So there are maybe clues there, of course. Yeah, but even then, to your point, an agent isn't, I mean, it could, if you instruct it not to use certain things of a certain age. But in general, if it decides that's the right package to pull, it very easily could pull a six-year-old package at that point, right? Yeah.
And so the problem is, even if you dig a bit further, is this really a problem? Maybe Ply, which is a parser generator based on lex and yacc in Python, is a project which is mature and done. Sure, it may not be subject to a lot of security issues; it's a tool that takes a grammar and generates Python code. There are so many questions raised there, and we're not very well equipped for this kind of analysis. It's very easy to say, oh, I see Ply, I'm using it, and if you don't dig into the dependency tree, it's super hard to do with tools, even there. I mean, we are talking about a submodule which was a fork of an upstream that has been abandoned. This is the kind of thing we can only do together. And I think for agents, being able to unleash agents, if you believe in vibe coding, and I'm not a believer, but that's another question, that's not the point, being able to safely ensure that your agents work with a set of open source packages that you have vetted, I think is important. Otherwise they can pick the wrong version, old versions, vulnerable versions. One of the small tools we worked on, in the case of Python, and we're expanding it to other ecosystems, is to be able to do what we call non-vulnerable dependency resolution. So the idea is, I think, do we have a, yeah, there's at least a blog article I made, or two that Tushar made, a couple of years ago, but the idea is very simple: you take the dependency range that you support. You're using Django on Python, so you know you want any version superior to 5, and you have a range of possible versions that you support functionally. And then you know that versions 5.1.7 to 5.2.3 are vulnerable to this CVE, which is a low severity, and version 6.2.2 is vulnerable to this high-severity vulnerability, and the other versions are fine.
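The non-vulnerable dependency resolution idea can be sketched like this. The version numbers and CVE ranges below are illustrative, not real Django advisories, and this is a toy resolver, not the actual AboutCode tool:

```python
# Hedged sketch of "non-vulnerable dependency resolution": among the
# versions that satisfy your functional range, prefer the newest one
# not affected by any known vulnerability.
def resolve(available, wanted, vulnerable):
    """available: list of (major, minor, patch) tuples;
    wanted(v) -> bool is the functional constraint;
    vulnerable: set of versions affected by any CVE."""
    candidates = [v for v in available if wanted(v) and v not in vulnerable]
    return max(candidates) if candidates else None

available = [(5, 1, 7), (5, 2, 0), (5, 2, 3), (5, 2, 4), (6, 2, 2)]
vulnerable = {(5, 1, 7), (5, 2, 0), (5, 2, 3),  # e.g. a low-severity CVE
              (6, 2, 2)}                         # e.g. a high-severity CVE

print(resolve(available, lambda v: v >= (5, 0, 0), vulnerable))
# -> (5, 2, 4): newest in-range version with no known CVE
```

A real resolver also has to intersect this with the constraints of every other dependency, which is what makes doing it at the ecosystem level hard.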
Then what you want is to resolve the versions that match your dependency choices functionally and that are not vulnerable. How do you think, in those cases, about organizational built-in rules of what to include and not include, and how that may be instructed at the coding-agent level? And then how does that influence the dependencies here that you're building to identify, fix, find, and remediate anything that may have a vulnerability, or to surface that into the agent itself? I think that's probably a great way to go at it. The thing is, rules, when you have them in an organization, are meant to be broken at times, and you need an escape hatch for exceptions, right? Because it could be that there's this great tool you want to use internally, that's not in the product you ship, and it's perfectly okay to use it, and you don't want a block that's so binary that it can only be this or that. The second thing is, unfortunately for agentic coding, agents are stochastic animals, and they're not always very good at respecting rules all the time. Which means that in all cases you want this enforcement at the tool level, whether agents are involved or not; you always want to be able to verify afterwards. I always like to think about this tool from years ago, a JavaScript library, actually a JavaScript interpreter written in Java, which is probably still out there somewhere, called Rhino. Made by, yeah, Mozilla: Rhino. Actually, let's look, maybe we have it here. Several years ago there was something in the build that basically fetched a zip attached to a blog post with sample code, which was injected into the build and then used afterwards.
That was, I mean, of course you laugh, but there was no malice in that; it was just convenient, and it sat there for 10-plus years; the blog was still up and the archive was still there. This is the kind of stuff you want to be able to check, and you can only check it if you do after-the-fact binary build verification. Today you may be able to catch it, but that means you would have to restrict fetching anything during the build; otherwise it's super hard. Yeah, literally, if I recall correctly, we're talking about code that's 20 years old, right? I recall vaguely, just off the top of my head, it's probably not there anymore, but anyway, there was somewhere in the build script a piece of code that was fetching this blog code. You need this verification in all cases, because we make mistakes, and agents make mistakes too, quite often. Absolutely, no, that makes sense. Maybe shifting, as we think about where agentic coding is going: how are you approaching ScanCode going forward? Are you embedding agents in your own coding flow? Are you using them to validate and verify some of your own packages? What does that look like for you? Yeah, so eight months ago we had collectively decided among the maintainers not to use any, because it was getting in the way of the flow of our development at large, especially when doing code reviews; more often than not the tools were not helping but getting in the way. The latest frontier models, starting with Opus 4.6, and now GPT 5.5, 5.4 Pro and others, are a bit of a head-scratcher, because the things you're able to achieve with them are surprising in many ways. And so we're considering using that.
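The after-the-fact build verification mentioned above boils down to rebuilding from source in a network-isolated environment and comparing digests with the published artifact. This is a schematic sketch of the idea, not a real reproducible-builds pipeline; the byte strings stand in for actual build outputs:

```python
# Sketch of after-the-fact build verification: rebuild offline from
# the source tree, then compare the artifact's digest with the
# published binary. A mismatch flags content (like a zip fetched from
# a blog post) that entered the build from outside the source tree.
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

published = b"compiled output + code fetched from a blog post"
rebuilt   = b"compiled output"  # produced by a network-isolated rebuild

if digest(published) != digest(rebuilt):
    print("mismatch: investigate what the build pulled in")
```

The design point is the one made in the conversation: because both humans and agents make mistakes, verification has to happen after the fact, independent of whatever rules governed the build.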
What we don't want to do at this stage is include agents in our tools themselves, because again, the difficulty is that one character matters. You cannot, let me actually show you, maybe bookmarks, through the bookmarks bar. I have my small stash of FUBAR licenses. There you go, one random one. That was an example like that. You know, LLMs are very good at ignoring that, because statistically it looks really like a very good MIT license. It even says it's an MIT license. And it could be a mistake, it could be a prank, it could be made on purpose, but it still says what it says, right? And you see the problem here. I don't want to put GitHub on the spot, but this is using a tool called Licensee, which is based on statistics also. It's not using agents, it's using statistics, and statistically, if you have 99.99% of the content matching MIT, even a "not granted" in the text can still be mistaken for MIT. This is the difficulty. The way we're using machine learning on our side is more to help point to these kinds of oddities, not so much for the detection itself, but running on the results of the detection. So we looked briefly at ClearlyDefined: 55 million scans. We need to run a truckload of analysis and stats on that. We're also working right now with another project called Software Heritage, which is a small project whose goal is to archive all the source code of all the open source projects, forever. They have about two petabytes of source code in something that resembles a giant Git tree. It's an open source project, it's open data, and it's supported by many organizations, probably even GitHub, actually; I wouldn't be surprised. The point here is that we have 30 billion files, and we want to review the scans for package manifests, licenses, and copyrights on these 30 billion files, with potentially 10 billion bugs. And at that scale, you need to apply some machine learning. Yeah.
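The "one character matters" problem can be demonstrated with plain string similarity. This illustrates the pitfall only; it is not how ScanCode or Licensee actually match licenses, and the license excerpt is abbreviated:

```python
# Why purely statistical license matching can be fooled: a text that is
# nearly identical to MIT, character for character, can invert its
# meaning, yet still score above any reasonable similarity threshold.
from difflib import SequenceMatcher

MIT = ("Permission is hereby granted, free of charge, to any person "
       "obtaining a copy of this software, to deal in the Software "
       "without restriction...")
# One word changed: "granted" -> "NOT granted"
PRANK = MIT.replace("hereby granted", "hereby NOT granted")

score = SequenceMatcher(None, MIT, PRANK).ratio()
print(round(score, 3))  # well above a typical 0.9 match threshold
```

This is why the approach described above keeps the deterministic detector authoritative and uses machine learning only to flag oddities in its output for human review.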
And statistics, to determine where I will get the biggest benefits. Out of the 10 billion bugs, what is the million I should worry about first? The million you should worry about first, yes. Well, talking about slop, by the way, we're getting our fair share of slop. Interesting, it'd be interesting to hear what you're seeing. Well, there you go. I have about 172 PRs open here on ScanCode Toolkit, one of our repos. What you can see here: first-time contributor, first-time contributor, first-time contributor. If I look at the first few pages, so about 75 PRs, I know that 80% of these, if not 90%, are vibe-coded. Some of them are bona fide and good ones; some of them are... The problem is I don't know which ones. And the irony of it is that I could spend four or five days of my life reviewing this, and do that again next week, and the week after. But that's not fair. The thing is, I'm better at running a prompt than any of the vibe coders who'd call it their first-time contribution, and it doesn't help me to get these contributions in any case. But, you know, this one may be a very good one, so I need to review it. But why would I review that one over another? It does become hard. We've launched some tools, which I'll share with you after, for managing some of this. But I do think it's an interesting question: how do you scale up across PRs like these? So, yeah. Yeah, so right now we probably have, across the org, I'd say about 200-ish pull requests that are likely junk. And the difficulty is that it's a deterrent for new contributors to see a lot of stale pull requests, and it's a pain in the ass for the maintainers to see a lot of requests knowing there are gems in there. And the difficulty is even worse than that: many of these folks don't mean badly.
Some of them may be agents, but I don't think that's the case. A lot of them are new, bona fide contributors, people who want to help, but who are using AI to do some of the work, right? They want to help, they want to do the right thing, and they don't understand that they're not helping. Yeah. Some of the things that I've encouraged teams to do is to create, within the project, areas of focus for the people who want to contribute, so that they're not adding burden and toil; it's more about focus and direction. But that still doesn't change the fact that people are learning, they're trying to be helpful, and now they're accelerating that help. Yeah, and the thing is, some of these may be aspiring contributors whom we could groom into awesome maintainers, and I don't know how to sort which is which. That's really a bit of a challenge we're dealing with. We were discussing last week whether to use the nuclear option and turn off pull requests entirely, keeping them only for project maintainers, and then potentially make it super easy to become a contributor with read-only access to the repo, where you're granted a privilege that enables you not to commit to the repo but at least to create pull requests. It tapered off a bit this week, so maybe that's something we'll deal with differently, but we didn't pull the trigger yet on the nuclear option. No, that's fair, that's fair. I know we're running up against time. How can folks engage with the project, engage with you, think about license compliance and dependency management? How can they get started? So, there are three ways you can engage. If you're interested in Package URL, you can go to packageurl.org. If you're doing an SBOM, if you're looking in the vulnerability database, using Microsoft's SBOM tools, GitHub dependencies, and the ability to extract an SBOM from your repo, you'll deal with purl.
You'll see that. So you can contribute there; you can help; we need a lot of help there. The second one is ClearlyDefined. I mentioned it before; we have weekly meetings on Wednesdays, and monthlies, so that's another area where you can contribute. And the last one is AboutCode proper, where we have all these repos on aboutcode.org, specifically the low-level tools that are used to power ClearlyDefined. Really, join: anything you do is going to be contributed to the commons as open code, open data, and open standards. So there are no strings attached; there's nothing hidden anywhere. We need help to make the open source world a better place, and to make it easier to reuse more code, faster and more efficiently. Do I have a QR code somewhere? Pop, pop, pop. QR code? No. No, no, no QR code. We can also share links in the chat, too. So if I can send them in chat, that might help. Yes, let me, how do I do that? Silly of me. So: ClearlyDefined at clearlydefined.io, aboutcode.org, aboutcode-org on GitHub, and Package URL. If some of you are working in a corporation that uses our tools, I mean, feel free to send some cash our way; that would be much welcomed. Again, we're a public-benefit nonprofit association, and everything we do is free and open source code and data, so it helps make sure you have healthy and safe software supply chains. Contributions in kind and contributions in money are much welcomed. We were blessed, by the way, to be part of the GitHub Secure Open Source Fund, just a plug and a shout-out, which has been super helpful. We found tons of stuff about security we didn't know about before. It's been an interesting learning experience. It's a unique program. I'll drop the link for folks as well if they're interested in applying.
But I agree, and I think the lens of security keeps evolving as well, so I'm excited to see how the program evolves for these next few cohorts. So, yeah. Awesome. Well, thank you for joining, everyone. Excited to jump in, and if you are interested in Package URL, AboutCode, or ClearlyDefined, check those out and reach out to Phipe. Thank you all for joining us today. Have a wonderful day. Thank you very much. Have a great weekend.
