Michael Stenner and I have been working on an advanced URL grabbing package for Python, appropriately named “urlgrabber”. Michael started the project as part of yum and later decided that it should be split out into its own package. I was lucky enough to be in the right place at the right time: Michael was proposing some serious redesign and enhancements to the project, and I wanted to write as much Python as possible.

I have always been extremely interested in network protocols for some reason. One of the first apps I cobbled together was a little download utility called “PowerDownload”. Yea! This thing was basically a replacement for the stock downloaders you got with the browsers of the time (Navigator 3.0 and IE 2; various flavors of Mosaic were still fairly popular too). Get this: PowerDownload was written entirely in VB3 (and then ported to VB4). Word to the wise: DO NOT ATTEMPT TO WRITE SOCKET BASED APPLICATIONS IN EARLY VERSIONS OF VISUAL BASIC. Anyway, the reason I wrote the app in the first place was that I wanted pause/resume support on downloads. We were coping with anything from 2400bps to 14.4kbps modems, and there are only so many times you can wake up to find that a 50MB download has to be restarted before you either kill yourself or start hacking together a solution. So the nostalgia took over when Michael said, “We need support for byte ranges in HTTP and FTP. Do you know anything about that?”

Back to the point of this entry... URLGrabber is really starting to come together and we hope to announce an early test release in the near future. Michael set up a nice project page on Duke's Linux site that has a significant amount of information, viewcvs, and all the other goodies.

http://linux.duke.edu/projects/urlgrabber/

Here's the list of features ripped from the project page for your convenience.

  • identical behavior for http://, ftp://, and file:// urls
  • http keepalive - faster downloads of many files by using only a single connection
  • byte ranges - fetch only a portion of the file
  • reget - for a urlgrab, resume a partial download
  • progress meters - the ability to report download progress automatically, even when using urlopen!
  • throttling - restrict bandwidth usage
  • batched downloads using threads - download multiple files simultaneously (feature still in progress)
  • retries - automatically retry a download if it fails. The number of retries and failure types are configurable.
  • authenticated server access for http and ftp
  • proxy support - support for authenticated http and ftp proxies
  • mirror groups - treat a list of mirrors as a single source, automatically switching mirrors if there is a failure
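
To give a rough idea of what the byte range and reget features involve at the HTTP level, here is a minimal sketch that uses only Python's standard library, not urlgrabber itself. The helper names `range_header` and `resume_request` are made up for illustration and are not part of urlgrabber's API. The sketch assumes the server honors the `Range` header, which not every server does.

```python
import os
import urllib.request


def range_header(start, end=None):
    """Build an HTTP Range header value for bytes start..end (inclusive).

    Leaving end as None asks for everything from start to the end of the file.
    """
    if start < 0 or (end is not None and end < start):
        raise ValueError("invalid byte range")
    return "bytes=%d-%s" % (start, "" if end is None else end)


def resume_request(url, partial_path):
    """Return (request, bytes_already_on_disk) for resuming a download.

    Hypothetical helper: if partial_path already holds some data, the
    request asks only for the bytes that are still missing.
    """
    have = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    req = urllib.request.Request(url)
    if have:
        req.add_header("Range", range_header(have))
    return req, have
```

A caller would pass the request to `urllib.request.urlopen` and append the response body to the partial file, but only after checking for a `206 Partial Content` status. A server that ignores the header replies with `200` and the whole file, in which case the partial file has to be overwritten instead.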

Not too shabby. I plan on blogging a bit about some of the cooler things you can do with urlgrabber once we have a clean release out there.

