From xz to ibus: more questionable tarballs (openwall.com)
170 points by nateb2022 on April 2, 2024 | 161 comments


1.5.29-rc2 was tagged on 9 Nov 2023 [1] and, as an example, did not contain "N_("CJK Unified Ideographs Extension I")," in src/ibusunicodegen.h [2].

Commit 228f0a77b2047ade54e132bab69c0c03f0f41aae from 28 Feb 2023 introduced this change instead [3]. It's the same person who tagged 1.5.29-rc2 and committed 228f0a77b2047ade54e132bab69c0c03f0f41aae, which is typically an indication that the maintainer tar'd their checked-out git folder and accidentally included changes not yet committed.

The question raised is whether anyone is auditing these differences before the checksummed tarballs are added to package repositories.

[1] https://github.com/ibus/ibus/releases/tag/1.5.29-rc2

[2] https://github.com/ibus/ibus/blob/0ad8e77bd36545974ad8acd0a5...

[3] https://github.com/ibus/ibus/commit/228f0a77b2047ade54e132ba...


^ typo: 28 Feb 2023 is meant to be 28 Feb 2024, or almost 4 months later.


GitHub has a feature for downloading the tree as a zip; why is this not used?


There are two main reasons, one bad, one good(-ish):

(1) Traditional autoconf assumed you only had a shell, cc, and make, so the ./configure script, a huge ball of obscure shell commands, is shipped in the tarball. I think most distros will now delete and recreate these files, which probably should have happened a lot earlier. (Debian has been mostly doing this right for a long time already.)

(2) Some programs generate a lot of code (eg. in libnbd we generate thousands of lines of boilerplate C from API descriptions*). To avoid people needing to install the specific tools that we use to generate that code, we distribute the generated files in the tarball, but it's not present in git. You can still build from git directly, and you can also verify the generated code exactly matches the tarball, but both cases mean extra build dependencies for end users and packagers.

* Generating boilerplate code is a good thing in general as it reduces systematic errors, which are a vastly more common source of bugs compared to highly targeted supply chain attacks.


I advocate for checking in the auto-generated code. You can see the differences between the tool runs, can see how changes in tooling affect the generated code, can see what might have caused a regression (hey it happens).

Sometimes tooling can generate unstable files; I recall there was a time when Eclipse was notorious for this, for example when saving XML files it liked to reorder all the attributes. But these are bugs that need to be fixed. Tooling should generate perfectly reproducible files.


We started off doing this, but you end up with enormous diffs which are themselves confusing. Example, only about 5% of this change is non-generated:

https://github.com/libguestfs/libguestfs/commit/5186251f8f68...

Probably depends on the project as to whether this is feasible, but for us we intentionally want to generate everything we can in order to reduce systematic errors.


In GitHub, you can mark a file as generated [1], which hides it in the PR view by default.

[1] https://docs.github.com/en/repositories/working-with-files/m...
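For reference, the marking is done with the `linguist-generated` attribute in `.gitattributes` (the path pattern here is illustrative):

```
# .gitattributes: collapse these files in GitHub's diff view by default
src/generated/** linguist-generated=true
```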


Wouldn’t an attacker like JiaT75 do that to increase the odds of someone skimming it?


They might try - that's why it's important if you're generating + committing generated code that you also have a CI step that runs before merging anything which ensures that the generated code is up-to-date and rejects any change request where generated code is out of date.

Mostly this helps with people simply forgetting to re-run the generator in their PR but it's a useful defence against people trying to smuggle things into the generated files, too!
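A minimal sketch of the check itself, assuming CI has already re-run the project's generator; it relies on `git diff --exit-code` to catch any drift between the fresh output and what was committed:

```shell
# check_generated PATH: fail when files under PATH have uncommitted changes,
# i.e. the committed generated code no longer matches a fresh generator run.
# In CI you would run your generator first, then call this before merging.
check_generated() {
    if ! git diff --exit-code -- "$1" > /dev/null; then
        echo "Generated code under $1 is out of date; regenerate and commit." >&2
        return 1
    fi
}
```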


Yeah, I guess my general thought is that anything which encourages hiding files is actively risky unless you have some kind of robust validation process. As an example, I was wondering how many people would notice an extra property in a typically gigantic NPM lock file as long as it didn’t break any of the NPM functions.


The same feature was recently added to GitLab.


I disagree - you should ensure your dependencies are clearly listed. Docker excels at this - it's a host-platform-independent way of giving you a text-based representation of an environment.


Docker is a Linux thing, and very much not host-platform independent. It's just "chroot on steroids", and you're essentially just shipping a bunch of Linux binaries in a .tar.gz.

It works on other systems because they emulate or virtualize enough of a Linux system to make it work. That's all fine, but comes with serious trade-offs in terms of performance, system integration, and things like that. A fair trade-off, but absolutely not host-platform independent.


Sort of. I have about 15 containers running on my dev laptop as I type. Which versions of xz are on each of them, and how do I make sure of that?


Downloading the tarball/zip with just a shell and regular utils is possible. See https://github.com/efrecon/ungit


A minor note: technically there is only a weak guarantee that checksums won't break after a server update and recompression with a different version or an alternative implementation of gzip.

https://github.com/orgs/community/discussions/45830


It's not a minor note, it's a major reason that the github auto-generated tarballs are useless as-is, since they are not stable.


This was not GitHub’s fault, but Git itself combined with cache pruning. Specifically, GitHub updating to Git 2.38 which changed the algorithm. Non-cached tarballs were regenerated on demand, and all hell broke loose: https://github.blog/2023-02-21-update-on-the-future-stabilit...


It was not the first instance of this happening; other times I'm not certain it was git's fault.


GitHub has (for the moment) backed down, and currently the auto-generated tarballs are stable, but they have changed this in the past and may do so again in the future.


Thank you for highlighting this. I've started a new discussion https://github.com/orgs/community/discussions/116557 to provide strong guarantees for checksum stability for autogenerated tarballs attached to releases.


For many projects, the release tarballs only contain the files necessary to build the software, and not the following items that may be present in the same repository:

- scripts used by project developers to import translations from another system and commit those translation changes to the repository

- scripts to build release tarballs and sign them

- continuous integration scripts

- configuration and scripts used to setup a developer IDE environment


You can use .gitattributes export-ignore to influence what gets into the tarballs and what stays in the repository! It's super powerful but not often used
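A small `.gitattributes` sketch of `export-ignore` (the paths are examples); matching files are left out of anything `git archive` produces:

```
# .gitattributes: keep developer-only files out of `git archive` output
.github         export-ignore
ci              export-ignore
.gitattributes  export-ignore
```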


And export-subst to insert the current tag or git revision into the archive too.

In fact export-subst is powerful enough that there is probably some way to create an exploit triggered by a particular payload inside a commit or tag message? :)

Maybe not triggered, but it could be part of the chain.
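A minimal `export-subst` sketch, with an illustrative `version.txt`:

```
# .gitattributes: expand $Format:...$ placeholders at `git archive` time
version.txt export-subst
```

With `version.txt` containing, say, `commit $Format:%H$`, the archived copy carries the actual commit hash while the checked-in file keeps the literal placeholder.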


I smell a new backdooring opportunity. Modifying .gitattributes to surreptitiously sneak some binary files into the GitHub release tarballs. Few people would take a look at .gitattributes.


What would be the problem of downloading a few MB more, aren't these source tarballs just used to build the distro binary and then they are deleted?


As this backdoor has shown, extra unnecessary files in the source files can make it easier to hide malicious code. If you take Gentoo as an example, when a software package is built, Gentoo creates a sandboxed environment first, disallowing the build process from impacting the rest of the operating system.[1] Removing superfluous files from the source tarballs minimises the ability for an attacker to get malicious code inside the sandboxed build environment.

Sandboxes for building software are commonly used throughout Linux distributions, but I am unsure how strict those sandboxes are in general e.g. whether they use seccomp and really tighten what a build script can get up to. At least on Gentoo, there is a subset of packages (such as GNU coreutils) that are always just assumed to be needed to build software and they're always present in the sandbox. Build dependencies aren't as granular as "this build needs to use awk but not sed".

[1] https://wiki.gentoo.org/wiki/Sandbox_(Portage)


I've enabled "trusted publishing" as it is called for python packages (publishing to cheeseshop/PyPI).

However, what they call trusted publishing is just a configuration where PyPI tells github that publishing from a particular workflow name on a particular repository is ok, without further tokens. So PyPI "trusts" github actions and the maintainer is out of the loop.

All good? Well, if you trust Github!

It would be a lot better to me if both the maintainer and github were involved, something like - the maintainer signs off on some artifacts, and the github action verifies they are reproducible exactly by a workflow, and /then/ it's published.
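For context, a trusted-publishing workflow per PyPI's documentation looks roughly like this; the tag pattern, environment name, and build steps are illustrative:

```yaml
name: release
on:
  push:
    tags: ["v*"]

jobs:
  pypi-publish:
    runs-on: ubuntu-latest
    environment: release        # optional gating via a GitHub environment
    permissions:
      id-token: write           # lets the job mint a short-lived OIDC token
    steps:
      - uses: actions/checkout@v4
      - run: python -m pip install build && python -m build
      - uses: pypa/gh-action-pypi-publish@release/v1
```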


> However, what they call trusted publishing is just a configuration where PyPI tells github that publishing from a particular workflow name on a particular repository is ok, without further tokens. So PyPI "trusts" github actions and the maintainer is out of the loop.

That isn’t quite how trusted publishing works: it’s morally equivalent to trusting GitHub with your manually configured API token, except GitHub mints the token using its own publicly verifiable key material.

In other words: it’s no more trusting of GitHub than manually configuring a secret would be, but is significantly more misuse and compromise resistant (since the minted API token is auto-expiring and minimally scoped to whatever project the repo is linked to).

You’re right that this does not itself involve reproducibility or maintainer signoff. But signoff is something you can configure for yourself on GitHub releases (which PyPI cannot meaningfully enforce), and reproducibility is a distant ecosystem-sized goal at this point. Lashing trusting publishers to either would have meant making the perfect the enemy of the good.

(It’s also something we’re working on, by exposing cryptographic publish attestations tied to the trusted publisher. That is still being designed, but I hope to have more to share publicly soon.)

Source: I implemented PyPI’s trusted publishing.


Sounds exciting with the coming features!

About trust.

I would say, the firmer the structure of configurations is, the more trust I'm giving over to Github vs handling it myself.

Is Github actions a neutral VM executor that I can configure as I want to? Or does it have slot-ins for lots of different services? You are probably right that it is morally equivalent, but there is some material difference - Github now has a database of all their python-publishing repositories.


> but is significantly more misuse and compromise resistant

I don't think it is at all.

Without it, I have a supersecret token.txt file on my machine.

With it, I have a supersecret .ssh/ directory, which grants access to do a git tag, which will publish without needing to know the pypi secrets at all.

You just moved the secret into another file.

Am I wrong? Can you explain how am I wrong if you think so?


> With it, I have a supersecret .ssh/ directory, which grants access to do a git tag, which will publish without needing to know the pypi secrets at all.

That isn't how trusted publishing works. It uses OpenID Connect, which both requires a privileged workflow (in the case of GitHub) and is itself self-expiring.

There's a lot of information available in PyPI's documentation about how it works[1]. But the basic version is that short-lived tokens always reduce attacker exposure, since they prevent a temporary compromise from becoming a permanent one (i.e., via theft of a long-lived credential). Similarly, auto-scoped credentials always reduce credential misuse, since they prevent users from over-scoping their API tokens to get things to "just work." Both were observed issues on PyPI prior to trusted publishing, and were a huge part of both the original API token and 2FA rollouts.

[1]: https://docs.pypi.org/trusted-publishers/


Once you configure all the CI and everything, following the documentation. Isn't the documentation giving the advice to make it publish automatically on a git tag?

And isn't it true that to do a git tag you only need to steal the private ssh keys of the account, which are commonly stored in the ~/.ssh directory?

If I am the attacker, and a project has trusted publishing configured, my target shifts from the token.txt file on the machine of a maintainer, to the .ssh directory on the very same machine.

You can do all the complicated authentication schemes between github and pypi, but do I care to attack there when there's a much easier point to attack?

I have set up trusted publishers for one of my projects. At least from the point of view of a user I am familiar with it. A git tag followed by a push did indeed trigger a new build+release workflow in my CI.


These are different layers: trusted publishing doesn't know how you store your git credentials, and PyPI isn't able to be prescriptive there. The recommendation that you publish from a tag is just normal release hygiene stuff; PyPI cannot make you use the other parts of your computer safely.

The scenario you're talking about ("attacker has your SSH private key") is already a "game over" scenario, and is tantamount to the attacker assuming your identity. Trusted publishing is not aimed at that scenario; it's aimed at eliminating other scenarios (such as people using permanent user-scoped tokens everywhere, making it easier for an attacker to move laterally after compromising a single token).

Or in other words: not every attacker is empowered to assume my identity by stealing my SSH key (which lives in a security module for me, not in `~/.ssh`). That's a very strong attacker model; there are other, weaker attacker models that are still worth defending against (such as someone who opportunistically compromises a single CI run, or takes advantage of a one-time accidental API token leak).

Edit: and note: the documentation for trusted publishing encourages you to use a release environment on PyPI, which can be configured to require multi-user signoff. So you can use it to prevent a stolen SSH key from compromising your uploaded project. But again, that is the strongest possible adversary, and is not the most common one that users need protecting against.


I think you are only considering people who use a CI to do an upload. Do you have any data of how many pypi projects are set up this way in total?

I upload from my machine, where a token.txt is stored on the same disk as the .ssh file. So your game over scenario is identical in both cases for me.

> people using permanent user-scoped tokens everywhere

I'd say that's a major design flaw of pypi, that it is impossible to create per-package scoped tokens directly, unless one is configuring trusted publishers (which translates to github).

> which lives in a security module for me

I am very confident that this is extremely uncommon among authors of projects hosted by pypi. Is this a mandate from your employer? (I don't know who that might be and if it is related at all, just asking.)


No, I don’t have those numbers on hand. If you upload from your local machine, none of this applies to you. It only applies to people using CI to upload, who want to reduce the overall scope of their shared access by not relying on permanent credentials. That’s the whole point of trusted publishing; it has nothing to do with locally initiated uploads.

> I'd say that's a major design flaw of pypi, that it is impossible to create per-package scoped tokens directly, unless one is configuring trusted publishers (which translates to github).

I think you’ve misunderstood. PyPI allows you to create project scoped tokens completely independent of trusted publishing. Trusted publishing is only distinguished in that the scopes are automatic, meaning that the user cannot overscope their token.

The point about security modules is a red herring; it was supposed to emphasize that this is outside of PyPI’s domain of interest. But no, I do it for myself, not because my employer requires it.


Well it would be interesting to have the numbers, to know how important it is to focus on defending uploads from CI vs uploads from developer machines.

> PyPI allows you to create project scoped tokens

It requires the creation of a non-scoped token first, to do that. And since people are lazy I'd bet many have not created a scoped token. Again, this is supposition from my part that could be verified or proven wrong by you.


Reproducible builds could help here, perhaps.

If GitHub's CI runners and GitLab's CI runners (for example) are producing the same tarballs, that's probably a good sign.


I do trust github with anything where any dishonesty on their part would be discoverable from the end result.

Just a single piece of strong evidence that github the company had injected malware into something (rather than a github user) would end their business.


"oops, our bad, we'll improve security"

It won't end their business and it's currently a single point of failure.

Having and checking pgp signatures would mean the developer and Microsoft/GitHub would both have to be compromised, which is a risk I'd much more readily take than betting on GitHub never ever getting compromised, for security-critical software like ssh or its dependencies

Edit: realised this isn't true unless github starts doing code reviews. The developer can still always push anything. But by having digital signatures, you could require both devs to OK a change or release. Or, at least not everything is compromised if github is compromised: every developer would have to be compromised separately to forge all signatures. Using github as the single point of trust seems like an exceedingly large problem waiting to happen at least once in our lifetimes


Since the code, the build process, and the final signed tarball would all be made available to the public in the case of these 'verified builds', it would only take one person to rebuild and diff the results to uncover any malpractice by github.

Obviously in most cases, any differences will be bits of the build process that are nondeterministic like including timestamps in the final binary.


One of the key observations from PyPI’s removal of PGP support was that virtually nobody actually verified the signatures uploaded to the index (when they were actual signatures, and not just garbage or error files), and that most of their backing keys were expired, weak, or otherwise impossible to publicly establish.

Independent signatures do mean that the author could stymie a compromise of their source host. But on net, users are worse maintainers of cryptographic key materials (and worse users of obtuse ecosystems like PGP) than big companies are. That doesn’t mean we shouldn’t enable user signing; only that we need better, more misuse-resistant tools for it.


In response to the last sentence (the copy function broke here, I don't know why, so I can't easily quote): agreed, though it seems odd to throw out the existing solution before having a new one, preferring no solution over a partial one


I think we should have learned this lesson by now. The revelation that GitHub deliberately injects malware into things would be unlikely to "end their business": it would contentiously divide their userbase between purists who believe repo hosts should not be injecting malware into people's FOSS projects, and pragmatists who defend the trust-and-safety reasons for injecting malware into FOSS (say, to assist US intelligence preventing terror attacks).

We've had this recurring discussion in many forms, and the typical result is that peoples' capacity for outrage wears out quickly. Who still remembers, for instance, that the GitHub entity sabotaged its users by giving the NSA backdoors to its private infrastructure [0]? How much business is Microsoft missing out on today, in 2024, because of that betrayal of trust? Essentially none, right? They're bigger than ever.

GitHub sneaking NSA malware onto their platform—maybe selectively serving it to certain targets—would be perfectly in character with their past behavior. It would be smart business sense (they're a major DoD vendor). It'd be totally irrational to have a faith-based trust that they wouldn't do this.

(There's a precedent: the FOSS host SourceForge destroyed itself by making the error of judgement of injecting code into people's repos [1]. But that was a decade ago: the Overton Window's shifted a lot. What SourceForge lost their business for doing in 2013—injecting ads—is something Microsoft comfortably gets away with today, on Windows desktops).

[0] https://news.ycombinator.com/item?id=6027779 ("How Microsoft handed the NSA access to encrypted messages" (2013))

[1] https://en.wikipedia.org/wiki/SourceForge#Adware_controversy


It may seem inconceivable that GitHub would infect downloads with malware, but 20 years ago SourceForge doing it would have seemed inconceivable too. They did, though. You're saying that it happened before so it won't happen again, but that's silly. Anybody with sense back then would have already known that SourceForge's reputation would be ruined by it, just as everybody knows GitHub's reputation would be ruined. But you can't count on businesses being run on common sense.


You misread my comment: I'm saying it absolutely could happen and there's no sound reason to assume it won't.


Whether or not you draw a distinction, it seems plausible that the world as a whole would draw a distinction between sharing encrypted messages with the NSA and deliberately injecting malware into widely distributed software. Corporations can look at the former and think it is unlikely to affect them.

So regardless of how bad its prior behavior is, it makes sense that Microsoft has more of a reason to try to keep software distributed through GitHub as secure as it reasonably can.


Still, if the NSA wants to target a "rogue state" with malware through GitHub, they are much much more likely than not to comply. Even if it were ever discovered that they did so, and if it were ever widely publicized, it would not affect the majority of GitHub paying customers (which are US or EU companies, which by and large don't care if their geopolitical rivals are served malware).


How about they just put a checkbox in the settings (checked by default of course) to install the malware. If you have a problem with it you can navigate to the settings and uncheck the install malware box. Now of course you may need to check from time to time if the box has reverted back to its natural checked state because of some backend update.


> All good? Well, if you trust Github!

And if you trust that your secret ssh keys don't get hacked away from your machine, giving someone else the rights to push a tag and automatically make a release on pypi.


For Python releases, this is quite a big problem. Python packages consist of many different downloadable files - for example, the Datadog APM library has over 60 different files[1] depending on the platform and Python version.

There's nothing to enforce that each of these files contains the same code - and indeed, they shouldn't: each platform-specific .whl file contains platform-specific binaries.

But for the python and other platform-agnostic code, it can and does differ if the publishing process is done manually.

One example of this is an Intel project called "devcloud" - a specific file was included only in the .whl release (and not the source distribution) that contained an AWS access key[2]. The Amazon Pay SDK also fell victim to this - the developer accidentally included an integration test that contained a key[3]. Someone else included a markdown file containing ~400 OpenAI keys in a similar manner, only present in a single .whl file out of several.

There are _lots_ of examples of this, so many that I created a project to attempt to quantify this, prevent accidental credential leaks and bring greater visibility to the contents of PyPI by publishing all the contents of PyPI releases to Github and indexing them: https://py-code.org/

There's a hosted clickhouse dataset available here: https://py-code.org/datasets. On my to-do list is to write a query to find differences in files within the same package version, where they should be the same. This is almost always an indication of something malicious or accidental.

"Package integrity" is a really interesting + complex area when you add a platform-specific dimension to it. For Python specifically, I wonder if the "separate whl per platform" approach is a good one. Maybe a layered approach would be better - the "source" is the source of truth, and pre-built binaries _only_ contain pre-built binaries and not the duplicated "source"?

1. https://pypi.org/project/ddtrace/#files

2. https://inspector.pypi.io/project/devcloud/0.0.2/packages/9f...

3. https://inspector.pypi.io/project/amazon-pay/2.5.1/packages/...
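A sketch of the kind of check described above, using only the standard library: compare the .py files shipped in a project's sdist (.tar.gz) against those in one of its wheels (a zip), flagging files present in only one of them. File layouts are the conventional ones; real packages may need more path normalization.

```python
import posixpath
import tarfile
import zipfile

def sdist_py_files(path):
    """Return .py member paths from an sdist, minus the top-level pkg-x.y/ dir."""
    with tarfile.open(path, "r:gz") as tar:
        return {
            posixpath.join(*m.name.split("/")[1:])
            for m in tar.getmembers()
            if m.isfile() and m.name.endswith(".py") and "/" in m.name
        }

def wheel_py_files(path):
    """Return .py member paths from a wheel (wheels are plain zip files)."""
    with zipfile.ZipFile(path) as whl:
        return {n for n in whl.namelist() if n.endswith(".py")}

def mismatches(sdist_path, wheel_path):
    """Files that appear in only one of the two artifacts."""
    s, w = sdist_py_files(sdist_path), wheel_py_files(wheel_path)
    return {"only_in_wheel": w - s, "only_in_sdist": s - w}
```

Anything in `only_in_wheel` is invisible to someone auditing the source distribution, which is exactly the gap the leaked-credential examples fell into.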


Why is this a thing? Can't packages use specific tags from the git repo? It seems so incredibly stupid to allow this, throwing out all of the "oh but it's open source you can review it" arguments in one go if the source displayed on GitHub is not what ends up used...


Historically, it was done for end-user convenience. Traditionally, each project has two separate sources, one is the actual development repository, strictly for use by developers. The next is the source tarball for end-user installation, pre-generated by developers by preprocessing the source repo - such as autotools script generation, gettext translation file generation, etc. The idea was that the end users were running on many flavors of incompatible Unix systems, so installing the full development tools such as autotools or CVS/SVN could be inconvenient. To avoid those troubles, developers pre-generate tarballs with necessary installation scripts such as ./configure for user-friendliness - so source tarballs are installers in a sense, at the middle way between actual source and pre-compiled binary. On the other hand, because these scripts are machine-generated and are extremely sensitive to small changes to Makefiles, they should not pollute the canonical source tree, so they're not included in the source repository. Occasionally, under the strict Cathedral style of software development, the canonical source tree may even be private under exclusive access by the core team, and end users only have access to release tarballs.

Nowadays, the repo-tarball split is largely unnecessary, but the practice remains. The xz incident became a wake-up call that the Reproducible Build movement focused on binary reproducibility, but have so far ignored tarball reproducibility. Hopefully, this problem will be addressed by the community in the future.


> Nowadays, the repo-tarball split is largely unnecessary

I think locking the source distribution mechanism to only git would be detrimental in the grand scheme. A source tarball is universal, independent of preferred tooling, easy to hash, sign and verify, and archive. Even git has loopholes in that a tag does not necessarily have to be a part of the master branch, and not everyone uses (or wants to use) github, or a similar online interface for git...

It's also worth considering that source generation tools are often keen to change their API frequently, while their output is more resilient against the passage of time. Autotools is significantly better about that these days, but many other tools aren't...


The Git protocol is also substantially more heavyweight than just downloading a tarball.

'Dumb' HTTP involves requesting `info/refs` which lists all the references (branches and tags) available. The client has to find the tag you want, then request the corresponding commit object. That lists the hash of the tree object, which has to be requested, and recursively the tree object references other trees and finally the file blobs.

Any object request could fail because the object isn't stored 'loose' on disk, but is instead in a pack. So the client first downloads `objects/info/http-alternates` to see if it's in a different location, then if that doesn't list anything, it asks for the list of packs `objects/info/packs`. Then it asks for the index of each pack to determine which pack contains the object, then finally downloads that pack.

Because that's very slow, Git also supports 'Smart' HTTP. That moves all the lookups server-side, with a back-and-forth between client and server about what the client has and what it wants, ultimately building a custom pack containing what you asked for. This obviously can't be cached.

In contrast a tarball is one request for one file on disk which is fully cacheable, both by the server hosting the file and any proxies on the path.

This is why things like `bower` and `DefinitelyTyped` got deprecated. Some of them were even hosting the registry itself as a Git repository on GitHub.


GitHub/GitLab can generate tarballs from specific git commits, and those are cacheable and don’t change (when GitHub deployed some changes to the format, many places noticed and GitHub had to undo the changes).


That's just a shortcut to the conventional tarball distribution method.


I think most packagers that can pull from Git can also pull from Mercurial, SVN or whatever else.

Pulling from Git doesn't really require using GitHub other than connecting to their IP. You don't need an account or any tool other than Git to clone a public repo.

It seems to me that pulling directly from the primary source of truth is a good idea these days where possible.


"The xz incident became a wake-up call that the Reproducible Build movement focused on binary reproducibility, but have so far ignored tarball reproducibility."

The reproducibility must stretch over all the outputs.

A related idea is the "Hermetic Build", where the sameness of all the inputs is ensured, even including the build tools.


In this case, inputs, or more like the whole chain toward "real inputs".

A project can be reproducible when built from the release tarball, without being able to reproduce the release tarball from a git tag.


> Nowadays, the repo-tarball split is largely unnecessary, but the practice remains.

I keep seeing this argument, but I don't understand why CI runners can't generate the configure script rather than offloading that onto the end user, requiring them to have autotools/automake/autoconf/m4/etc.

If you've ever tried to fight with that on an decade old enterprise system, those tarballs with configure scripts in them are useful. They get generated with new tooling on modern distros, but run on ancient systems. Getting the right versions installed can be an absolute pain in the ass.


[flagged]


We don’t understand how AI works. I would not trust an AI to not hallucinate unsafely in this context.


You wouldn’t want it in the CI pipeline, because any model clever enough to find real issues is also going to find plenty of false positives. That seems like too much friction for most open source projects.

I’m not one of the downvoters, but you’ve linked to a list of forty or fifty different projects, many of which don’t seem relevant to this use-case. It’s not too surprising people have nothing to say besides “ugh, more AI hype.”


You are wrong but there is no reason to downvote you. The idea might work in the future.


It might work if AI stops producing random output and bullshit (which proponents call hallucinations to make it sound nice), and produces correct responses deterministically. Which may take forever.


For Xen, it's historical reasons funneled into "don't break things for your users". In the olden times, Xen had our own fork of QEMU, as well as our own fork of Linux, and some other useful tools like pvgrub. These were developed in a separate repository, but it was imported into the main Xen release tarball, so that you could just download the main tarball, do "./configure && make && make install" and have a reasonably complete Xen system.

These days most Linux kernels can do everything we need, and our release tag of QEMU might only be one or two patches not yet in the upstream branch; and in any case, you can always use the most upstream release. So this isn't necessary anymore (and indeed we only include QEMU in the release tarball, not Linux or pvgrub). And you can still "./configure && make && make install" to get a fairly complete system, it will just clone other repositories on your behalf.

On the other hand, I actually checked just last month, and the tarball for Xen 4.18.0, released back in November, was getting 700 downloads a week. Who are all these people downloading the release tarball, rather than using the distro version of Xen? Do they want and need these extra bits inside? We don't know and were hesitant to make any breaking changes.

I think with the xz fiasco, we now have justification to switch entirely to a `git archive` tarball of a specific tag, whether it's inconvenient for people or not.


It's been probably 2 years since I've messed with it, but I was still using the tarball download for Xen in OpenEmbedded builds.


Software projects will outlive git. Mine has gone CVS -> Monotone -> Mercurial -> Git over 20 years, it'll probably move again. But you can still download the tarballs from any of the releases made with those various systems, and they still have the same sha1sum (or now sha256sum).

Making reproducible tarballs from VCS isn't hard - with the advice from https://reproducible-builds.org/docs/archives/ (and requiring gnu tar) it's possible to get byte exact output from both Mercurial or git mirrors, on both Linux and MacOS. Put that in the github CI and it's reasonably difficult to subvert (compare against a local build too at release time).

It would be nice if "git archive" had guarantees about archive format, and if the github "download a tar.gz" matched that, but it doesn't seem to be the case at present.
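Concretely, the reproducible-builds.org advice amounts to pinning every source of nondeterminism tar would otherwise record: file order, mtimes, and ownership. A minimal sketch, assuming GNU tar:

```shell
# Create a tree and archive it twice with all nondeterminism pinned:
# fixed sort order, fixed mtime, no ownership info (GNU tar options).
mkdir -p demo/src
printf 'int main(void){return 0;}\n' > demo/src/main.c

for out in demo1.tar demo2.tar; do
  tar --sort=name \
      --mtime='2024-01-01 00:00:00Z' \
      --owner=0 --group=0 --numeric-owner \
      -cf "$out" demo
done

sha256sum demo1.tar demo2.tar   # identical digests: byte-for-byte reproducible
```

If you then compress, use `gzip -n` so no timestamp gets embedded in the .gz header, otherwise the compressed artifact stops being reproducible even though the tar inside is.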


Tags can be changed. To actually pin the source code revision, they should pin to a specific commit hash.
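A throwaway-repo sketch of why the tag name alone isn't enough (repo and tag names are made up for illustration):

```shell
# A tag is just a mutable ref; nothing stops a maintainer from moving it.
git init -q tagdemo
git -C tagdemo config user.email you@example.com
git -C tagdemo config user.name you

echo one > tagdemo/file
git -C tagdemo add file
git -C tagdemo commit -qm 'release 1.0'
git -C tagdemo tag v1.0
git -C tagdemo rev-parse 'v1.0^{commit}'   # the commit the tag points to today

echo two > tagdemo/file
git -C tagdemo commit -qam 'retagged release'
git -C tagdemo tag -f v1.0                 # silently move the tag
git -C tagdemo rev-parse 'v1.0^{commit}'   # different hash: same name, new content
```

Pinning the resolved hash (what `git rev-parse 'TAG^{commit}'` prints) instead of the tag name makes the move visible: the pinned hash either still resolves to the same tree or the build fails outright.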


Sometimes they get changed for a silly reason (someone just didn't think of the consequences). Sometimes they get changed because they have to - if I remember correctly it was Asterisk that needed to drop some copyrighted music samples and retagged old versions.


That shouldn't matter, then; it should just result in a new release.


You can't distribute the old versions with copyright violation. The problem doesn't go away just because you released a new version.


But then you have two different things (with and without copyrighted data) both with the same version. Wouldn't it be better to just remove the bad release, and publish a new one? They are, after all, not the same things, and shouldn't have the same release.


Good point, that would make sense. But same difference. Why is that not how it works?


The path from source code to distributed binary file is known to be a blind spot.

Removing that blind spot is either Harder Than You'd Think or Easier Than You'd Think, depending on your perspective and expectations. You can find some issues listed here:

https://reproducible-builds.org/docs/

Or the homepage of reproducible-builds.org for a general take on the subject. (I am not associated with that website.)


Definitely Harder Than You'd Think and we've known this for a very long time.

https://research.swtch.com/nih


My understanding is that, for speed reasons, most distros don't build from the source; they build from tarballs.

If they built from the source, there wouldn't be this issue.


If by "source" you mean the project's version controlled repo, it is for a variety of reasons.

One primary one is sustainability. The build tarballs are kept forever. Try that with a remote repo that could vanish in a mere 10 or 20 years... or tomorrow!

And tar is the most stable archive format in existence, with at least half a century of use.

Note: companies that care should ask themselves: what do you do if your build system relies on externals, and a part of that build goes down?

And you have an urgent fix to PROD required?

Hope you can find all the bits, unvarnished, scattered on dev boxes? Cobble them together and hope they build, while your PROD is currently borked?

Or do you hotpatch PROD?

If your build process breaks due to an external repo going MIA, then you're doing it wrong.


The tarballs also contain source. Just not the source control/revision history.


Do you mean if they built from a git hash? The tarball contains "Source code" which is what they are building.


This has really been bugging me about npm.

Anyone can publish an open source repo and add it to npmjs. Users going to the page on npmjs will see that the repo with the code is github.com/myrepo.

But when I do `npm i myrepo` there is no guarantee that what is being pulled is in any way similar to what is in the linked repo, creating a false impression that the code can be reviewed.

At the very least, GitHub should not allow this for code on their platform (i.e., if npmjs lists a GitHub repo for a project, GitHub should verify that the repo actually builds to the published content, or notify npm to have it flagged). Or npmjs should run the same check regularly, verifying that the code you would get by building the repo matches the code being offered.

Bear in mind that if a bad actor can get in at the level of the npm user, even when that user is not running with elevated privileges (which it often is, since you need superuser rights to listen on port 80 or read SSL certificates, and setting up PM2 with SSL handling is beyond many devs, sigh), they can scan for vulnerabilities and perhaps open a very big hole for themselves.


Same for Rust. In the short term we're trying to solve it on the user's side with https://crates.io/crates/cargo-goggles, but in the long term the registry should probably do it.


Are crate maintainers going to accept having to do this? I've seen crates where the code in the crate was generated by a Python script not in the crate but only in git.


FYI: `npm publish` reports the current commit hash, if any, and this is publicly viewable from the npm registry.

This should help audit/verify if publishers have published what they say and if builds are reproducible.

E.g.

    $ curl -sL https://registry.npmjs.com/colors | jq -r '.versions|map([.version, .gitHead]|@tsv)|.[]'


Ah, I was wrong assuming that npm is cloning every git repo of the dependencies.


What makes maintainers of major distros still rely on questionable tarballs to build packages? It's not as if these essential programs don't have a public, authoritative git repository.

Is it because of inertia, because we've been using tarballs since before VCS was a thing? Is it to reduce the burden of package maintainership, by letting upstream do most of the transpiling and autotools plumbing work? Is it because some assets and test data are only included in the tarballs? Why are they not committed, either to the same repo or (if upstream wishes to keep the main repo small) some other repo?

People have been calling for reproducible builds for years now, but reproducible builds don't mean anything if they are not built from authoritative sources with immutable commit IDs.


The tarball, with its signed hash, is the authoritative source with immutable ID.

Not every project uses git. Not every project uses a public VCS. Git history can be changed with a simple command and is unreliable.

The point of a tarball is that it is the distribution medium of a source release. The developer generates the generated code (where required), runs all tests (eg. `make distcheck`), then when satisfied it's good, signs it and releases it. If you can't trust the developer's signed tarball you can't trust their git repo either, since it's the developer you can't trust.

A downstream, such as a Linux distro, uses the release tarball because it's the single, verifiable, published source of truth. They should be able to do reproducible builds from the tarball (not being able to do so should be filed upstream as a bug), and for licensing reasons (eg. GPL) need to keep the tarball and make it available to anyone to whom the software is distributed.


> Git history can be changed with a simple command and is unreliable.

Not if other people have cloned your repo. That's the whole point of having a distributed VCS. Your repo is no longer the single source of truth because other repos remember what truth looked like 3 days, 3 months, 3 years ago. You renounce the right to dictate truth as you wish, and gain trust in exchange.
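You can see this mechanically with a throwaway repo: after an upstream history rewrite, the next fetch in any pre-existing clone is flagged as a forced update (directory names here are made up):

```shell
# Upstream repo plus one clone.
git init -q upstream
git -C upstream config user.email you@example.com
git -C upstream config user.name you
echo v1 > upstream/file
git -C upstream add file
git -C upstream commit -qm 'original history'
git clone -q upstream downstream

# Upstream rewrites history after the clone exists.
git -C upstream commit --amend -qm 'rewritten history'

# The clone's next fetch flags the rewrite as a non-fast-forward.
git -C downstream fetch origin 2>&1 | tee fetch.log
grep 'forced update' fetch.log
```

The remote-tracking ref still gets updated (the default refspec forces it), but anything watching fetch output, or comparing old and new hashes, sees the rewrite immediately.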

I don't understand why this idea comes up again and again: "You trust the developer, you might as well trust the code they distribute." That makes no sense. Why do you trust them in the first place? Do you know them? Have you met them in person, had a few beers with them? Or is your trust merely based on their past performance as an open-source developer? If it's the latter, whether they have been willing to work in a way that facilitates independent inspection and verification of their output should be a very important factor in whether you should trust them at all. Hardware vendors who dump opaque binary blobs at the kernel's doorstep tend not to receive much respect, after all.


> "I trust the developer, so I trust the code they distribute."

I said "If you can't trust the developer's signed tarball you can't trust their git repo". That's absolutely not the same thing. In fact it's sort of the opposite.

Distributed cloned repos are not a single source of truth. If they differ, which one is the actual source of truth? Was the post-release changed history propagated?

A signed tarball means the signing key you use to verify the tarball has been verified by a web of trust. The web of trust requires the physical confirmation of identity of the private key holder. Yes, it's possible to subvert the web of trust by playing the long game, complete with falsified (or even real) national ID documents, active participation in social networks, and other life activities. This, though, is true of all espionage and no amount of technical measures or randos ranting on social media on the internet will prevent it.


> A signed tarball means the signing key you use to verify the tarball has been verified by a web of trust. The web of trust requires the physical confirmation of identity of the private key holder.

Hasn't the web of trust collapsed due to an unfixable hole in keyservers?

PGP keys in practice have almost always amounted to checking a hardcoded list of acceptable keys that is periodically updated, not unlike a list of trusted CAs.

(In any case, the attack we're talking about here--a rogue maintainer creating a backdoor in their project--won't be caught by any amount of advanced signing infrastructure. Somebody was elevated to a position of trust and any reason not to elevate them wasn't discovered until after the damage was done.)


> Distributed cloned repos are not a single source of truth.

That's the point. No person should be trusted to be a single source of truth. If there are multiple sources of truth and they don't agree, it's a red flag. A highly useful flag, in fact. The situation must be resolved before truth can be established again, and all that commotion in the bazaar will make it more likely that a maintainer gone rogue will be caught. A signed tarball by the same rogue maintainer, on the other hand, will probably pass unnoticed until it's too late.

At the end of the day, it's neither people nor any specific chunk of code that I'd like to trust. It's the system -- a system that incentivizes accountable behavior, and blocks releases if there are any red flags, so that you don't really need to trust any single person.


> If you can't trust the developer's signed tarball you can't trust their git repo either

The public git repo has a lot more eyes on it. You don't necessarily have to trust the developer if 1) you trust at least one of the people who reviewed the code, and 2) you have a way to verify that the code you're running is the same as the code that person reviewed

Trusted builds based on the public git repo rather than an unrelated tarball solves part 2 of that problem.


The parent referred to "questionable" (i.e., unsigned) tarballs.

Your answer, while excellent, is answering a question that wasn't asked.


No distribution I know of uses "unsigned" tarballs. Most of them hardcode the hash of the expected tarball anyway.


Hashes and cryptographic signatures are not the same thing.


There are still nowadays projects without a proper source control.

For example, xpdfreader, source of xpdf, is distributed just as source tarballs. The maintainer just publishes them every time there is a new version.

https://www.xpdfreader.com/download.html

https://www.xpdfreader.com/old-versions.html

https://packages.debian.org/search?keywords=xpdf

There is nothing in Debian or FOSS in general that mandates having source control.

(I remember that was a point of contention with WebKit/KHTML... in the olden days of 2000-something, Apple forked KHTML to make WebKit, and in order to comply with GPL, they just published source tarballs, which were basically impossible to merge back into KHTML.)

edit:

ahh I see that the xpdf that is in Debian is actually different from this one; someone forked it some time ago and it is in git here (no GitHub or other forge, just a bare .git)

https://offog.org/git/xpopple.git/

well ok, maybe I was wrong. But still, a .git folder on someone's webpage is not that much more reliable than a tarball on someone's webpage.


Certainly there's nothing in FOSS that mandates source control, but source control is one of the most important technologies discovered for developing software, and I'd be hard pressed to imagine a justification for relying on software that didn't use it.


I'm showing my age here, but making version control public is a relatively recent thing; I can't think of a project before OpenBSD that did it, and it wasn't until SourceForge that projects having publicly visible source repos became a popular idea, and that was like '99 or 2000. Most projects would just release tarballs and have a CVS repo limited to "approved" developers. And that was well after all the big distro build systems developed.


It is likely that the `xpdf` project is developed under source control, but they don’t make the repository public. SQLite makes their repository public, but they do not accept contributions.


No need to look that far, ncurses is also tarball/patch-based.


Interesting.

Well, xpdf I knew :) but yeah xpdf can be nuked, ncurses cannot.


A popular project with release-only distribution is Lua. I don't know whether they have proper source control, but even if they do, AFAIK it isn't publicly available.


A list of archive-only projects would be interesting. 7-Zip is another. Regarding that .git web directory, is there an existing tool/script that does something like `git clone` for it?


yes there is, it's called git clone.

:)


Well, unexpected. Git is pretty cool.


There's a lot of inertia here, but also the tarball is or was the one point of commonality where you could say "this is the standard thing every upstream provides". Today almost everything has a git repo, but it still isn't 100% universal, and in the past even less so -- some projects had git, some svn, some cvs, some mercurial, some didn't use source control at all(!), but everybody provided their releases as source tarballs. So when you're designing a distro packaging workflow 20 years ago, "treat the source tarball as the authoritative upstream output" made sense, because it was the authoritative upstream release output.


Why are tarballs any less authoritative than a Git repository? Both come from the same authors. If you can't trust the tarball released by some author why would you trust what the repository contains?


> Why are tarballs any less authoritative than a Git repository? Both come from the same authors.

Traceability. There's a difference in how easily someone can insert nefarious code without it going through processes that might reveal or stop it.

Compare: "Why is a paper sack of pills any different than a factory-sealed anti-tamper bottle? You're buying them from the same grocery store..."


> Traceability. There's a difference in how easily someone can insert nefarious code without it going through processes that might reveal or stop it.

But both come from the same authors. If you do not trust the tarball because it may have nefarious code, how would you trust Git to not have nefarious code? The authors have complete control over what goes in Git and what processes might be there.

> Compare...

The grocery store comparison doesn't fit here because grocery stores do not make the things they sell. The equivalent to a grocery store would be the download site (BTW let's not continue with metaphors, because IMO instead of making things clearer you are introducing TWO parallel channels to argue about).


> But both come from the same authors.

The Git repository comes with a history containing each and every commit, along with checksums and sometimes cryptographic signatures for it all. The tarball is just a bunch of random files.

If somebody wants to rewrite the Git history, that will get instantly noticed by anybody that had an older clone of it. If somebody adds a hack to it, it will be at the top of the history and easy to notice as well.

If somebody adds something to a tarball, there is nothing to tell you that something changed, you'll have to audit the whole thing from top to bottom (or compare it to the Git repository it came from).

Also the author isn't in control here, the tarball can get changed after it was uploaded to the server and there would be nothing in the tarball itself to tell you.

People are using PGP signatures and checksums to secure tarballs some more, but that's always an extra step and doesn't happen automatically. And even that still doesn't give you the history.


> The tarball is just a bunch of random files.

But that is EXACTLY WHAT GIT IS! The only difference is that you also get the previous versions. That doesn't help you at all when the authors themselves are not trustworthy. It MAY help, after finding a compromise, to figure out when it happened, but it won't help anyone prevent it.

What you write is about someone trying to compromise the Git server itself. This is NOT what I refer to (and even that sort of compromise would be sidestepped with signed tarballs).

What I refer to is more fundamental: the author of the project controls what is put in the Git repository itself, just like the tarballs.

The Git repository is NOT any more trustworthy than the tarballs: if you do not trust the latter then you should not trust the former either and if you do trust the former then there is no reason to not trust the latter.


> the author of the project controls what is put in the Git repository itself

The author doesn't control the Git repository. The Git repository is a shared data structure, not just bits on a server. Neither the author nor anybody else can just change the Git repository arbitrarily. They can only change it in ways allowed by Git or the alarm bells will be going off. A tarball has no such restrictions.


I do not refer to the files or bits or whatever that make up the git repository in your `.git` directory. Please take a step back and read again what I wrote; it has nothing to do with what you responded with here.


> But [a bunch of random files] is EXACTLY WHAT GIT IS!

Well, yeah, but only in the same limp way that "a bunch of random bytes" describes both (A) some text file versus (B) an e-mail with valid DKIM headers with a PGP signature.

It's possible that both could arrive containing your CEO's name and an urgent demand to wire all company funds to some random city in Myanmar... but one approach is fundamentally more secure.

> the only difference is that you also get the previous versions too.

No, that's not the only difference nor the most-important.

The important difference is that the files represented by a particular git-revision all exist with cryptographically-hashed relationships on their content, and that structure prevents/deters multiple kinds of secret meddling or impersonation.


> The important difference is that the files represented by a particular git-revision all exist with cryptographically-hashed relationships on their content, and that structure prevents/deters multiple kinds of secret meddling or impersonation.

...which a developer with proper write access to the repository (i.e. the same developer that'd make the tarballs) won't need to do, since they can decide what goes in there in the first place, thus making this "important difference" not at all important for the case I argue in this entire thread.


This backdoor resided in git for a year without anyone noticing.

It wasn't until it was activated via the tarball that someone found out. One could easily argue that signed tarballs are more vetted than git commits.


The git repository has been scanned by many eyes. People other than the authors do read the source code on GitHub, or clone/fork the repo and try to build or modify it when they want to contribute. By contrast, almost no one reads through the tarball contents. They just assume that it is essentially the same as the git repo.


That doesn't make the Git repository inherently more authoritative or trustworthy, not to mention that if you do not trust the tarballs you shouldn't trust the Git repository either - or the developers who control what goes in both.

The issue here isn't about trusting Git or tarballs, it's about trusting developers. If you do not trust the developers then getting the code via Git or tarball wouldn't make anything more trustworthy, and if you DO trust the developers then there is no reason to trust the Git more than the tarballs, since they are both provided by the same people who you already trust not to insert malicious code into the project.


That's the conventional thinking, but it has been proven false by this xz incident. The maintainer of xz did not inject any malicious code into the git repo, but only into the tarball, exactly because the latter is subject to far fewer eyes, and he took far fewer risks by polluting only the tarball.


The project was already compromised, the Git repository isn't any more trustworthy than the tarball.

And if there is any relevant lesson from the xz case isn't to trust Git more than tarballs but - as someone else mentioned already - tarballs should be fully reproducible from Git.


> The project was already compromised, the Git repository isn't any more trustworthy than the tarball.

You are talking about this from hindsight. For other projects, we do not know yet if anything similar is happening. So for them the Git repo is definitely more trustworthy.

> tarballs should be fully reproducible from Git.

That's exactly the same as trusting Git repo more than the tarball.


As I understand it, the tooling would "see" some of the questionable configs that were in the tarballs.

Because they're packaged into tarballs they can't be checked. Or, more to the point, they use the tarballs because it's much faster...


What tooling?

Git is just a filesystem with file versions; anything you can check in a tarball can also be checked in Git. The only thing Git adds is a change history for the contents in the filesystem (a history that can be forged as well; Git even has a bunch of commands for that).

If you cannot trust the authors' tarballs how can you trust their Git repository?


Every commit in a git repository is effectively sealed by its commit hash. You can't sneak in any unreviewed changes.
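The hash isn't a signature in the cryptographic sense, but it does chain over the full tree and the parent commit id, so tampering with any earlier content changes every descendant hash. The chained fields are visible directly (throwaway repo for illustration):

```shell
# Two commits in a scratch repo.
git init -q chain
git -C chain config user.email you@example.com
git -C chain config user.name you
echo a > chain/file
git -C chain add file
git -C chain commit -qm first
echo b > chain/file
git -C chain commit -qam second

# A commit object records the tree hash and the parent commit hash;
# both are inputs to this commit's own id.
git -C chain cat-file -p HEAD
```

The output includes `tree <hash>` and `parent <hash>` lines. Change either and the HEAD id no longer matches, which is what makes silent history edits detectable, though only relative to some trusted copy of the hash; without signed tags or commits there is still nothing binding the hash to the author.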


But you can sneak in reviewed (by you) changes, if you have control over the repository.

Again: both the Git and the tarballs are released by the same developers.


You never have full control over a repository that can be cloned by others. Sure, you can rewrite the history, but anyone who has been pulling from your repo will notice the conflict.

Whether you can trust the developer is beside the point. The question is how easily we can tell if someone is (still) trustworthy. Source control brings a measure of accountability into the game, making it more difficult for unusual activity to go undetected.


> but anyone who has been pulling from your repo will notice the conflict.

Not will but can notice a conflict. Repos edit history all the time for a variety of reasons. Well written but nefarious changes to history won't immediately be recognized as nefarious.

Besides once there's multiple copies of a repo available with different histories, who is to say which is the canonical repo for a project?


It is not beside the point, it is the ENTIRE point. If you do not trust the developer then Git won't help you; the developer can put their malicious code in the repository itself.

Git won't help you prevent anything; it is just a bunch of random files (as someone mentioned above), with the only difference being that there is a change history attached. But that won't prevent anyone with proper legitimate access (i.e. not trying to compromise it) to the repository from adding malicious code. All it may help with is figuring out, after the fact (of finding the compromise), when it was added.

As i wrote above, the Git repository is not any more trustworthy than the tarballs: if you do not trust the latter then you should not trust the former either and if you do trust the former then there is no reason to not trust the latter.


> Git wont help you prevent anything

Of course. Nothing does.

> All it may help with is figuring out, after the fact (of finding the compromise) when it was added.

Exactly.

And that's what I'm trying to say.

Ease of detection discourages bad behavior, just as regular police patrols help reduce crime in a neighborhood. With a public git repo, every commit increases the chance of detection long before a tarball is ready for release.

I never said that I trust the git repo. I trust the eyeballs more, and the pitchforks too.


Ease of detection is not a thing here since nothing is detected. The only thing you can have with Git is figuring out when a change was made after it has been detected - by different means.

And again, when someone can commit to a repository as a developer and also make tarball releases, it makes no sense to differentiate between the two if you are worrying about malicious code.

Your original comment asked why "maintainers of major distros still rely on questionable tarballs to build packages" instead of using Git. And IMO the answer is really simple: if they trust a project's developers not to put malicious code in their project, it makes no difference whether the code came from Git or an explicitly provided release tarball. If such trust weren't there, chances are the project wouldn't be part of a distribution in the first place.


Yes, and that didn't save us from the xz backdoor did it?


Because it's often easier, and because they're not "questionable tarballs".

People seem to have forgotten that all of this has been working well without all that many incidents for 35 years. And these archives are published by the same people you trust to run code on your system. It's not like you can't hide this kind of stuff in the autotools soup committed to git. I doubt this would have been caught either way; the "sneak it in the .tar.gz only" move was just a bonus defence against detection.

I'm not saying using git isn't better, but we also need to retain some perspective on all of this.

This ibus thing for example seems entirely innocent.

The core of the issue is "who do we trust to write code for us?" Maintainers protect us from randos submitting malicious code, but who maintains the maintainers?


It's absolutely because of inertia. Cutting a tarball is a very public indicator that the software is ready to be used. A git tag or branch doesn't have the clarity of "use this" as a release tarball does.

That said, this is absolutely going to be changing now. We obviously can't keep relying on tarballs anymore. We'll find a new normal that will work for a very long time until some other critical issue arises and the cycle repeats.


We can use tarballs - they are useful as signed artifacts, but only if we verify that they are reproducible.


If you're going to download both the tarball and the git repo and verify the reproducibility of the tarball, then why bother with the tarball at all? You already have the git repo.


A git repo can vanish overnight. Git is often used to snag the source, but tarballs are still crafted from it even then:

https://news.ycombinator.com/item?id=39903813


Git repos only vanish if they have no clones. For the purpose of accountability, the "official" repo is not more privileged than any other clone that retains the same history, and can be verifiably recreated from any clone if it is ever lost or tampered with. (Assuming SHA-1 isn't too broken, that is.)

For archival purposes, nothing prevents people from creating a tarball that contains the .git directory as well, which would preserve not only the current state of the project but its entire history.
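git even has a purpose-built format for this: `git bundle` packs refs and full history into a single archivable file that `git clone` accepts directly. A sketch with a throwaway repo (names made up):

```shell
# Build a small repo, bundle its entire history, restore from the bundle.
git init -q project
git -C project config user.email you@example.com
git -C project config user.name you
echo v1 > project/file
git -C project add file
git -C project commit -qm 'first release'

# One plain file containing every ref and the full history.
git -C project bundle create ../project.bundle --all

# The bundle clones like any remote, history included.
git clone -q project.bundle restored
git -C restored log --oneline
```

Unlike a tarball of a working tree, the bundle preserves every commit and hash, so the archived copy can later serve as evidence of what the history looked like.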


> Git repos only vanish if they have no clones.

And? You're speaking probabilities, not certainties. It's not relevant in terms of archiving. You don't guess, you don't hope, you simply 100% ensure that 10 days, 19 years, or a century from now you can build the same thing.

I agree that adding "extra stuff" to a tarball isn't a bad idea, and in fact, many already do!


If you can archive the tarball, you can archive the git repos. If for some reason you can't, you can cut your own tarballs from the git repo and then you don't have to worry about them because you made them yourself.


Archiving an entire git repo is serious overkill, and realistically untenable. Debian has 30k packages, if not more, and some of them are the Linux kernel.

My responses in these threads have been to the "why not build from git" logic championed here.

You want to build from a reliable, repeatable source. And as I mentioned, that can be git clone -> tarball, sure.


I've noted similar vulnerabilities in every common package management system, including NuGet, Cargo, and NPM. There are giant gaps in security a truck could be driven through.

Collectively, we must start taking the "bill of materials" that goes into our software much more seriously, including the chain of custody from source to binary.

Start with linking each package directly to the Git commit hash it came from.

Then a good start would be enforced reproducible builds. Don't trust the uploader to package their code. Run this step in the package management system from the provided source hash.

Pure functional build systems would be a heck of a lot better, with no side effects of any kind permitted. No arbitrary scripts or arbitrary code during build. No file reads, network access, or API calls of any kind outside of the compiler toolkit itself.

Then, I would like to see packages go through specially instrumented compilers that generate an "ingredient label" for each package. Transitive dependencies. Uses of unsafe constructs in otherwise safe languages. Lists of system calls used. Etc...

You're about to say that clever hackers can work around that by using dynamic calling, or call-by-name, or some other clever trick.

Sure, that's possible, but then the system can report that. If some random image codec or compression library uses any platform API at all, then it's instantly suspect. These things should be as pure as the driven snow: Bytes in, decoded data out.
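The commit-pinning idea above could be sketched roughly like this (the manifest format and function name are invented for illustration, not from any real package manager):

```python
import hashlib

def verify_artifact(tarball_path, manifest_entry):
    """Check a downloaded package against a pinned manifest entry.

    manifest_entry is assumed to look like:
        {"git_commit": "<40-hex sha1>", "sha256": "<64-hex digest>"}
    The commit ties the package to the exact source it claims to come
    from; the sha256 ties the bytes you received to the bytes that were
    audited when the entry was recorded.
    """
    h = hashlib.sha256()
    with open(tarball_path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest() == manifest_entry["sha256"]
```

Any artifact whose bytes drift from the recorded entry then fails verification, regardless of who uploaded it.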


> outside of the compiler toolkit itself

Which toolchain includes, unfortunately, xz


You can avoid that by bootstrapping the compiler and everything else from scratch. See bootstrappable.org and GNU Guix, which starts at a few hundred bytes of machine code and goes up to a full distro.

https://guix.gnu.org/en/blog/2023/the-full-source-bootstrap-...


IIRC that first Guix build is, at least theoretically, the first Unix-like OS that was not rooted by Ken Thompson.


What clickbait. I don't see anything actually "questionable" here, other than someone trying to ride on the xz hype train.


Not "questionable" in the sense that it might be malicious; the author acknowledges the changes are benign. But it's still questionable that the tarball made available to distros differs significantly from what's in the tagged release in git and nobody noticed that until now. As the xz incident has demonstrated, there are plenty of good security-related reasons why that shouldn't be possible.


We could create and hash the tarball from git and compare it to the released tarball, even if it wasn't originally signed. We could even use Merkle trees (perhaps combined with find) to ensure that individual files were unchanged.

At least then we could verify that the tarball was derived from that exact git commit.
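A minimal per-file version of that comparison might look like this sketch in Python (directory layout and function names are made up for illustration; a real check would also compare file modes and symlinks):

```python
import hashlib
import os

def tree_digests(root):
    """Map each file path (relative to root) to its SHA-256 digest."""
    digests = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            with open(path, "rb") as f:
                digests[rel] = hashlib.sha256(f.read()).hexdigest()
    return digests

def diff_trees(tarball_dir, git_dir):
    """Compare an unpacked release tarball against a git checkout."""
    a, b = tree_digests(tarball_dir), tree_digests(git_dir)
    return {
        "only_in_tarball": sorted(set(a) - set(b)),
        "only_in_git": sorted(set(b) - set(a)),
        "changed": sorted(p for p in set(a) & set(b) if a[p] != b[p]),
    }
```

Run against the ibus case, the extra generated header lines would show up under "changed", and tarball-only files like a shipped ./configure would show up under "only_in_tarball".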


So glad PyPI actively removed support for package signatures, and even sends you an annoying email if your scripts dare to actually upload a signature anyway.


Does Nix/Guix solve this?

I have been skeptical of "rewrite everything into rust", but... maybe we should at least rewrite everything into Nix?


No. Nix pulls in tarballs/sources like any other package build system.


In this specific case, Nix uses fetchFromGitHub to download the source archive that GitHub generates for the specified revision[1]. Arch seems to just download the tarball from the releases page[2].

[1]: https://github.com/NixOS/nixpkgs/blob/3c2fdd0a4e6396fc310a6e...

[2]: https://gitlab.archlinux.org/archlinux/packaging/packages/ib...
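For reference, a fetchFromGitHub source pin looks roughly like this (the rev and hash values here are illustrative placeholders, not the real ones from nixpkgs):

```nix
src = fetchFromGitHub {
  owner = "ibus";
  repo = "ibus";
  rev = "1.5.29";           # tag or commit to fetch (illustrative)
  hash = "sha256-AAAA...";  # fixed-output hash pinned in the package expression
};
```

Because the hash is pinned in the package expression itself, anyone swapping the archive after the fact breaks the build instead of silently shipping different bytes.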


Computers have to run software that has to come from somewhere.

I’ve been expecting a supply chain apocalypse for some time now given that the Internet has become a dark forest.


Guix does provide a guix challenge command to verify that the binaries correspond to the source code [0].

[0] https://guix.gnu.org/manual/en/html_node/Invoking-guix-chall...


No. It makes sure its inputs are unchanged, but you can still do anything impure within a buildscript called by nix.

It would fix the tarball differing from upstream, but it would still allow the patch to be applied during the build.


Please don't. Nix UX sucks big time. I don't know about Guix, but I'm pretty darn sure that anything else must be better than Nix cuz Nix is the rock bottom.


Nix UX may suck (it doesn't for me, though it could use some improvements), but Nix as a concept/system is still good.


Please elaborate. If I have more substantial facts about it, then I might be more successful in convincing my CTO that it's a bad idea.


Nix might even make it worse. xz made it into unstable and it is part of stdenv. This means almost every package needs to be rebuilt, which takes forever and limits the speed at which it can be reverted. They still have 5.6.1 in unstable and, to be honest, I'm not sure why. I don't know if they are still waiting for CI to chew through the tens of thousands of package rebuilds or if there is something else.


Afaict they are in fact waiting for CI to do the big rebuild on staging, in part because the Nixpkgs builds of 5.6.x never pulled down the malicious m4 scripts that inject the backdoor into the output binary (as they never used the release tarball directly from upstream but built from GitHub sources).

See: https://github.com/NixOS/nixpkgs/issues/300055

and: https://github.com/NixOS/nixpkgs/pull/300028

It's also worth noting that Guix is different here, as the grafts mechanism is well-established, so they can get a security patch in for xz without waiting for the mass rebuild, even if it's also in their stdenv or equivalent.


Rust has the same problem as autotools: crates can be plain git archives, but they often contain things not in git and miss things that are in git.


What a bullshit story



