Wednesday, May 25, 2005

A "dirt simple" download-and-install CPAN clone

It seems that CPAN clones are in the air of late. Yesterday, Ian Bicking posted a comment here that he was working on an automated download facility, and today, there was a Planet Python post about Uraga, a CPAN clone some other folks are working on.

So it got me to thinking, how simple could this actually be? Bob Ippolito and I had talked about making something like that using eggs, where there would be some sort of manifest file that told you where to download things from. But it occurred to me today that there are already "manifest files" out there on the web that work just fine: web pages and directory listings. For example, SourceForge download pages contain a wealth of links to URLs containing "SomePackage-1.0.zip" and the like. Since Python egg filenames also include Python version and platform information, everything you need to find a file is right there in the links - just parse the HTML page and go.
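
As a rough illustration of "just parse the HTML page and go", here is a minimal sketch using only the Python standard library. The names `parse_filename`, `LinkCollector` and `find_distributions` are made up for this example, and the filename pattern (name-version, optionally followed by -pyX.Y and a platform tag) is an assumption about how such files tend to be named, not a specification.

```python
import posixpath
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Extensions treated as downloadable distributions (an assumed list).
ARCHIVE_EXTS = (".tar.gz", ".tar.bz2", ".tgz", ".zip", ".egg")

# Assumed pattern: name-version[-pyX.Y][-platform]
FILENAME_RE = re.compile(
    r"^(?P<name>[A-Za-z0-9_.-]+?)-(?P<version>\d[A-Za-z0-9_.]*)"
    r"(?:-py(?P<pyver>\d+\.\d+))?(?:-(?P<platform>.+))?$"
)


def parse_filename(filename):
    """Return a dict of name/version/pyver/platform/ext, or None."""
    for ext in ARCHIVE_EXTS:
        if filename.lower().endswith(ext):
            stem = filename[: -len(ext)]
            break
    else:
        return None  # not an archive or egg
    match = FILENAME_RE.match(stem)
    if match is None:
        return None  # e.g. an unversioned "latest" archive
    info = match.groupdict()
    info["ext"] = ext
    return info


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.hrefs.append(value)


def find_distributions(html, base_url):
    """Return parsed info for every link that looks like a distribution."""
    collector = LinkCollector()
    collector.feed(html)
    found = []
    for href in collector.hrefs:
        url = urljoin(base_url, href)
        info = parse_filename(posixpath.basename(urlparse(url).path))
        if info is not None:
            info["url"] = url
            found.append(info)
    return found
```

Feeding it the text of a directory listing plus the listing's own URL gives back one dict per recognizable file; links such as "about.html" or an unversioned "latest.zip" are simply skipped.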

To test the usefulness of this theory, I went to PyPI and randomly selected 43 packages, investigating their download links. Here are the results:
  • 15 packages had no download URL at all
  • 2 had a download URL that went to their homepage, with no direct links to downloadable files
  • 10 had a download URL that pointed directly to a .zip, .tgz, .tar.gz, or .tar.bz2 file with a specific version number in the filename
  • 10 had a download URL that went to an HTML directory listing with links to versioned files
  • 6 had a download URL that went to a "latest version" (i.e. no explicit version number) archive or .py file
So, about half of the packages could have been processed by a spider hunting for specific versions. And about half of those could easily add .egg files to their download listings. Very interesting.
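
For anyone who wants to check the "about half" arithmetic, the tally can be redone in a few lines. The category labels are shorthand for the bullets above; the counts are the ones from the survey.

```python
# Survey counts from the list above (labels are shorthand).
counts = {
    "no download URL": 15,
    "homepage only": 2,
    "direct versioned archive": 10,
    "directory listing of versioned files": 10,
    "unversioned 'latest' file": 6,
}

total = sum(counts.values())

# Packages a version-hunting spider could have handled.
spiderable = (
    counts["direct versioned archive"]
    + counts["directory listing of versioned files"]
)

# Of those, the ones that already publish a listing where .egg files
# could simply be dropped in alongside the existing downloads.
listing_share = counts["directory listing of versioned files"] / spiderable

print(total, spiderable, round(spiderable / total, 2), listing_share)
```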

Of course, not every distributor of a package is going to want to mess with making eggs for different platforms, so a really useful tool is probably going to have to be able to download a source archive and build an egg from it.

Now you may be wondering, why build an egg? If you're going to have to build from source anyway, why not just install the package directly? Because eggs -- even unpacked ones -- let you keep multiple versions of a package on your system, and activate them at runtime.
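
To make the "multiple versions, activated at runtime" idea concrete, here is a toy sketch. It is not the egg runtime's API: `make_fake_egg` and `activate` are hypothetical helpers, the "eggs" are just directories with a one-line module in them, and activation is modeled as nothing more than putting the chosen directory on sys.path.

```python
import importlib
import os
import sys
import tempfile


def make_fake_egg(root, name, version):
    """Create a directory standing in for an unpacked egg of one version."""
    egg_dir = os.path.join(root, "%s-%s.egg" % (name, version))
    os.makedirs(egg_dir)
    with open(os.path.join(egg_dir, name + ".py"), "w") as f:
        f.write("__version__ = %r\n" % version)
    return egg_dir


def activate(egg_dir):
    """Make one version importable by putting its directory on sys.path."""
    if egg_dir not in sys.path:
        sys.path.insert(0, egg_dir)
    importlib.invalidate_caches()


# Two versions of the same (made-up) package sit side by side on disk...
root = tempfile.mkdtemp()
eggs = {v: make_fake_egg(root, "demo_pkg_sketch", v) for v in ("1.0", "2.0")}

# ...and the program picks which one it wants when it runs.
activate(eggs["1.0"])
demo = importlib.import_module("demo_pkg_sketch")
print(demo.__version__)
```

A plain install into site-packages would have overwritten 1.0 with 2.0; here both stay on disk and the choice is deferred until the program starts.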

So now I'm thinking, maybe there should also be an install_egg command for the distutils, that basically builds an egg and then installs it in site-packages (or wherever) for you. Then, we could use that with our hypothetical PyPI spider, to make a complete fetch-and-install utility.

Now, once we have that, let's say that somebody wanted to make a bunch of packages available as eggs for their platform. All they'd need to do is run that fetch-and-install such that it installs to a web-accessible directory, whose contents are visible as a directory listing. Now, somebody who adds the URL of that directory to their spider's search URLs would be able to find and download pre-built eggs for whatever they needed, without needing to do any building.

This is starting to sound an awful lot like what everybody's trying to make, isn't it? So what are the architectural components we need?
  • The "reader": An HTML reader that scans a web page for links to eggs and/or source archives with names that match distutils-standard naming conventions
  • The "finder": A tool that takes a list of candidate start URLs and invokes the reader on them to search for specified package(s), caching the resulting index data
  • The "source catalog": A tool that, given a package name, finds download URLs from PyPI and determines whether they are archives or links that should be passed to the reader
  • The "fetcher": given a desired package, it consults the finder and the source catalog, trying to download a platform-suitable egg, falling back to finding a source archive and building an egg.
  • The "builder": given a source archive URL, download it, extract it, find the setup.py, and build/install an egg
  • Some way to decide what version of a package to build/install, if more than one version is available. (e.g. a way to select only stable versions, or whatever)
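
The last item, choosing a version, can be sketched quite simply. This is only one possible policy: `choose_version` is a made-up name, and "stable" is assumed here to mean a version made of nothing but digits and dots. Real version schemes are messier than this.

```python
import re


def is_stable(version):
    """Assumed policy: only digits-and-dots versions count as stable."""
    return re.fullmatch(r"\d+(?:\.\d+)*", version) is not None


def version_key(version):
    """Sort key: the leading numeric part as a tuple of ints, with a
    stable release ranking above a pre-release of the same number."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    numbers = tuple(int(n) for n in match.group().split(".")) if match else ()
    return (numbers, is_stable(version))


def choose_version(candidates, stable_only=True):
    """Return the highest acceptable version string, or None."""
    if stable_only:
        candidates = [v for v in candidates if is_stable(v)]
    return max(candidates, key=version_key, default=None)
```

Comparing tuples of integers rather than raw strings matters: as strings, "1.9" sorts above "1.10", which is exactly the wrong answer.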
Interestingly, it might be possible to just repurpose an existing Python web spider to do a lot of this, just by spidering from PyPI with a reasonable external link depth, to build an index of package+version to download URLs. In fact, you could use that spider to simply create an HTML page with all the download links.

Given the existing capabilities of the egg runtime, and the assumption that an existing HTML-parsing spider (or browser-emulator) could be made to do the fetching and parsing, the biggest parts remaining are the "builder", and managing the whole thing's configuration. It seems to me there are lots of policy issues ranging from the trivial (where to put the eggs) to the critical (what versions to allow? code signing? checksums? what download sites do you trust?)
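
Here is a hedged sketch of the front half of such a "builder", with one of those policy questions (checksums) folded in. The function names are invented for the example, SHA-256 is just an assumed choice of checksum, and the step that would actually run setup.py to produce an egg is deliberately left out.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path):
    """Hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unpack_and_find_setup(archive_path, expected_sha256=None, work_dir=None):
    """Verify (optionally), unpack, and locate setup.py.

    Returns the path of the first setup.py found in a top-down walk of
    the unpacked tree, or None if the archive does not contain one.
    """
    if expected_sha256 is not None:
        if sha256_of(archive_path) != expected_sha256:
            raise ValueError("checksum mismatch for %s" % archive_path)
    if work_dir is None:
        work_dir = tempfile.mkdtemp(prefix="egg-build-")
    # Handles .zip, .tar.gz/.tgz and .tar.bz2 based on the file extension.
    shutil.unpack_archive(archive_path, work_dir)
    for dirpath, dirnames, filenames in os.walk(work_dir):
        if "setup.py" in filenames:
            return os.path.join(dirpath, "setup.py")
    return None
```

Unpacking an archive from an untrusted site is itself a trust decision, so a real tool would want the checksum (or signature) check to be mandatory rather than optional.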

But the interesting thing about all this, I think, is that in a sense we already do have a CPAN: it's called the web. Now all we need is a smart enough client to use it. :)

In the meantime, I've actually managed to squeeze in a few more hours' work on Python Eggs and their documentation. Directory scanning, dependency management, and even namespace packages are implemented now, although some of these features have received rather minimal testing. The documentation has also undergone a significant overhaul to explain many more of the implemented features, although there's still a lot more to write, just to explain what can be done with the current version of the runtime.

Follow-up: I built eggs for the "mechanize" package and its dependencies, and found it only takes a few lines of code to retrieve and analyze the links of a download directory, such as the PEAK projects download directory, or a SourceForge file listing. Of course, actually downloading the listing and parsing it can be slow, so an "end-user quality" download tool might need a fair amount of tweaking to make the process friendlier for impatient people like me.
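
One obvious tweak for the slowness is to cache each listing's links on disk, so a page is only re-fetched once the cached copy gets old. The sketch below is an assumption about how that might look, not code from the follow-up experiment: `cached_index` is a made-up name, the cache is a plain JSON file, and `fetch_links` stands for whatever function actually downloads and parses a page.

```python
import json
import os
import time


def cached_index(url, fetch_links, cache_path, max_age=3600, now=time.time):
    """Return the links for url, re-fetching only if the cache is stale.

    fetch_links(url) must return an iterable of link strings; max_age is
    the number of seconds a cached entry stays fresh.
    """
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)

    entry = cache.get(url)
    if entry is not None and now() - entry["fetched_at"] < max_age:
        return entry["links"]  # fresh enough: no network access

    links = list(fetch_links(url))
    cache[url] = {"fetched_at": now(), "links": links}
    with open(cache_path, "w") as f:
        json.dump(cache, f)
    return links
```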