Wonderful wget

Note: Please do not hit the server I mention in this article with wget unless you really want to download all those Jack Benny shows. I'm using the Jack Benny server as an example only. Please don't use it for practice.

The other night, I was presented with a challenge. I'd found a wonderful Web site called 'The Jack Benny Radio Archives'. The main draw of the Web site was the large number of shows available in mp3 format for downloading—362 mp3s, to be exact. I wanted those mp3s! However, the prospect of having to right-click on every mp3 hyperlink, choose 'Save Link Target As …', and then click 'OK' to start the download did not appeal to me.

The mp3s were organized in a directory structure like this:

  season_1/
  season_10/
  season_11/
  season_12/
  ...
  season_2/
  season_20/
  ...
  season_9/
Note how the directories were sorted not in numerical order, as a human would list them, but in alphabetical order: the character '1' does come before '3', after all, so 'season_10' lands before 'season_2'.
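You can see this lexicographic ordering for yourself with a quick shell one-liner (the season numbers here are just stand-ins for the ones on the server):

```shell
# Feed some season_# names to sort(1); it compares them character by
# character, so season_10 lands before season_2:
printf 'season_%s\n' 1 2 3 10 20 | sort
# season_1
# season_10
# season_2
# season_20
# season_3
```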

Inside each directory were the mp3s. Some directories had just a few files in them, and some had close to 20. The listing of files in each directory looked like this:

 [BACK]  Parent Directory  19-May-2002  01:03   --
 [SND] 1944-12-24_532.mp3  06-Jul-2002  13:54 6.0M
 [SND] 1944-12-31_533.mp3  06-Jul-2002  14:28 6.5M
 [SND] 1945-01-07_534.mp3  06-Jul-2002  20:05 6.8M
 [SND] 1945-01-14_535.mp3  06-Jul-2002  19:53 6.9M

The [SND] you see was actually a GIF image of musical notes that appeared in front of every file listing.

So the question was: how do I download all of these mp3s, which have different file names and live in different directories? Fortunately, wget was available.

wget is free, open-source software available from the GNU Project. It runs on Linux, Unix, Mac OS X, and Windows. After downloading it, reading the manual, and installing it on my Windows 2000 machine, I opened the Command Prompt, as Windows 2000 calls it, and navigated to the directory in which I wanted my mp3s to reside.

d:
cd Music
mkdir Jack_Benny
cd Jack_Benny

In order, I went to my D drive, which is where I keep all my music, changed into the Music directory, made a new directory named 'Jack_Benny', and then changed into that new directory.

At this point, it was time to run wget. Here's the command I used to grab all those mp3s:

wget -r -l2 --no-parent -w 5 -A.mp3 -R.html,.gif http://www.crispy.com/benny/mp3/

Let's decipher this command. [See footnote]

wget is the command I'm running, of course. At the far end is the URL that I want wget to use in its task, http://www.crispy.com/benny/mp3/. The important stuff, though, lies in between the command and the URL.

-r stands for 'recursive'. A recursive command is one that follows links and goes down through directories in search of files. By telling wget that it is to act recursively, I can ensure that wget will go through every season's directory, grabbing all the mp3s it finds.

-l2 (that's a lowercase letter L, by the way) tells wget how deep I want it to go in retrieving files recursively. The 'l' stands for 'level', and the number is the depth. If I had specified '-l1', for 'level 1', then wget would look in the /mp3 directory only. That would result in no mp3s on my computer. Remember, the /mp3 directory contains other subdirectories: season_10, season_11, and so on, and those are the directories that contain what I want. By specifying '-l2', I'm telling wget to grab everything in /mp3, which turns up all the season_# subdirectories, and then to go into each season_# directory in turn and grab everything in it. You need to be very careful with the level you specify; if you aren't, you can easily fill your hard drive in very little time!

--no-parent means just what it says: do not recurse up into the parent directory. If you look back at the listing of files I demonstrated above, you'll note that the very first link is the Parent Directory. In other words, when in /season_10, the parent is /mp3. The same is true for /season_11, /season_12, and so on. I don't want wget to go *up*, I want it to go *down*. And I certainly don't need to waste time by going back up into the same directory, /mp3, every time I'm in a season's directory.

-w, followed by a number of seconds (as in '-w 5'), introduces a wait between each file download. This helps prevent overloading the server as you hammer it continuously for files.

-A.mp3 tells wget to download only mp3 files and nothing else. 'A' stands for 'accept', and it is followed by the list of file suffixes that I want, separated by commas. I want only one file type, mp3, so that is all I specify.

-R.html,.gif tells wget what I don't want: HTML files and GIFs. This way, I don't get those little musical notes represented by [SND] above. 'R' stands for 'reject'. Notice that, as with '-A', I separated my list of suffixes with commas.
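If you're curious how those accept and reject lists play out, here's a little sketch of the idea. Note that decide() is my own hypothetical helper that re-creates the suffix matching in plain shell; it is not wget's actual code:

```shell
# A toy re-creation of wget's -A/-R suffix filtering. decide() is a
# hypothetical stand-in, not part of wget itself:
decide() {
  case "$1" in
    *.mp3)        echo "accept: $1" ;;  # matches the -A.mp3 list
    *.html|*.gif) echo "reject: $1" ;;  # matches the -R.html,.gif list
    *)            echo "skip:   $1" ;;  # everything else
  esac
}

decide 1944-12-24_532.mp3   # accept: 1944-12-24_532.mp3
decide index.html           # reject: index.html
decide sound.gif            # reject: sound.gif
```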

After entering the command, I hit Enter and wget started its work. Results like the following flashed by in the command prompt window:

  --05:33:32--  http://www.crispy.com/benny/mp3/season_8/1937-04-11_253.mp3
  Reusing connection to www.crispy.com:80.
  HTTP request sent, awaiting response… 200 OK
  Length: 7,154,990 [audio/mpeg]
    100%[====================================>] 7,154,990    60.37K/s    ETA 00:00

  05:35:28 (60.37 KB/s) - 'www.crispy.com/benny/mp3/season_8/1937-04-11_253.mp3' saved

About 5 1/2 hours later, wget was done. It had downloaded 22 folders containing 362 files, for a grand total of 2.19 gigabytes. Yes, gigabytes. You've got to admit, this is a lot easier than manually downloading all those mp3s!

And now, if you'll excuse me, I've got to burn some mp3s onto CDs, in anticipation of a very long, but now very entertaining, road trip I've got coming up. Enjoy!

Other wget resources

Smart (Script-Aided) Browsing ~ http://www.linuxjournal.com/article.php?sid=5905 ~ "Using the wget script and its mirroring option to download web pages."


By the way, you may be wondering why I didn't just use something like this:

wget -r -l2 http://www.crispy.com/benny/mp3/*.mp3

Unfortunately, that doesn't work. These mp3s were made available over HTTP, and HTTP has no notion of wildcard matching: there is no server-side file listing for wget to match '*.mp3' against, so wget would simply ask the server for a file literally named '*.mp3' and fail. (wget does support wildcards in URLs, but only for FTP.) The only way is through a recursive command like the one I parsed above.
