I've been using wget for a while, and I really like it, but sometimes I need a different tool for grabbing Internet content. After searching, I found Pavuk (which means 'spider' in Slovak).
Here are some of the features Pavuk offers:
- Recursive HTTP, HTTP over SSL, FTP, FTP over SSL, and Gopher document retrieval
- Can automatically fill in forms from HTML documents and make POST or GET requests based on user input and form content
- Synchronizing retrieved local copies of documents with the remote originals
- Supports HTTP authentication (user, Basic, Digest, NTLM)
- Supports HTTP cookies
- Optionally generates statistical reports on downloads, useful for checking a Web site's links
- Licensed under the GPL, so it's completely open source
You can get to the Pavuk home page at http://www.pavuk.org (Pavuk runs on Linux and Windows; since the source code is available, you could probably compile it to run on Mac OS X as well).
I wanted to download Web pages from a password-protected Web site, but I did not want any of the pages from the 'discuss' subdirectory. Since the Web site contained a lot of links to other Web sites, and I didn't want those links to be followed, I had to be careful to restrict the grabbed Web pages to the targeted domain. The site also has a 'robots.txt' file which tells search engines not to index it; since Pavuk respects robots.txt files by default, I had to turn that off.
Here's the command I gave, which worked perfectly:
pavuk http://[username]:[password]@[domain] --progress --noRobots --adomain [domain] --auth_name [username] --auth_passwd [password] --skip_url_pattern \*discuss\*
Let's look at each of the components of this command.
pavuk is the actual command. Gotta have it!
http://[username]:[password]@[domain] is the URL I want to target for download. Instead of [username] you would put in your username, like 'rsmith' or 'stevem', without the brackets. The same applies to your password & domain. If this were not a password-protected Web site, then I could leave out the username & password and simply use http://[domain]. If I wanted to copy all of Yahoo onto my computer (not a good idea), then I would use http://www.yahoo.com.
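As a quick sketch, here's how that credentialed URL is put together. All of the values below are made up ('rsmith', 'secret', and 'www.widgets.com' are just examples); substitute your own, without the brackets:

```shell
# Hypothetical values -- substitute your own, with no brackets.
USERNAME="rsmith"
PASSWORD="secret"
DOMAIN="www.widgets.com"

# The credentials are embedded right in the URL, separated by ':' and '@':
URL="http://${USERNAME}:${PASSWORD}@${DOMAIN}"
echo "$URL"   # prints http://rsmith:secret@www.widgets.com
```

One caveat: a password typed on the command line can show up in your shell history and in the output of 'ps', so be careful on shared machines.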
--progress gives you an indication of your progress as you download your files. It's actually quite informative, showing you the file name, its size, the rate of download, the estimated time it will take to download, and the real time it took to download. All in all, very useful, and a good idea.
--noRobots ignores the robots.txt file on a server, which prevents access to certain areas of an Internet site. If you don't include this option, then Pavuk respects the robots.txt file, which isn't always what you want.
--adomain [domain] limits the download to the listed domain(s). If you wanted to download only things on the www.widgets.com domain, for instance, and not follow any links to other Web sites, then you would use '--adomain www.widgets.com'. If you wanted to stick to two domains, then you could use '--adomain www.widgets.com,www.foo.com'.
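To make the effect concrete, here's a rough shell sketch of the kind of filtering --adomain performs: only follow a link if its host appears in the allowed, comma-separated list. The domains and URLs are hypothetical, and Pavuk's real internal logic may well differ from this:

```shell
# Hypothetical allowed-domain list, same comma-separated format as --adomain.
ALLOWED="www.widgets.com,www.foo.com"

# Decide whether a link would be followed or skipped.
allowed_domain() {
    host="${1#*://}"      # strip the scheme (http://, ftp://, ...)
    host="${host%%/*}"    # strip any path after the host
    host="${host##*@}"    # strip any user:pass@ prefix
    case ",$ALLOWED," in
        *",$host,"*) echo "follow" ;;
        *)           echo "skip" ;;
    esac
}

allowed_domain "http://www.widgets.com/about.html"   # follow
allowed_domain "http://www.example.org/page.html"    # skip
```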
--auth_name [username] sends your username every single time Pavuk attempts to connect to one of the resources you wish to download. Normally the username & password that you specified in your URL are enough, but some sites require the use of this option.
--auth_passwd [password] is just like the above option, but it resends your password every time instead of your username.
--skip_url_pattern \*discuss\* allows you to specify which paths you want to skip during your download. You can use wildcards like the asterisk (*), the question mark (?), and the square brackets ([a-z]). For instance, let's say I wanted to download everything from http://www.widgets.com except stuff found under http://www.widgets.com/discuss. To get the files I wanted while skipping the files I didn't want, I would use this pattern: \*discuss\*.
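Since these are ordinary shell-style wildcards, you can dry-run a pattern against a few URLs with a shell 'case' statement before committing to a long download. The URLs below are made up:

```shell
# Try the same *discuss* wildcard that --skip_url_pattern uses.
# (In the pavuk command the asterisks are backslash-escaped only to keep
# the shell from expanding them; the pattern itself is just *discuss*.)
matches_skip_pattern() {
    case "$1" in
        *discuss*) echo "skipped" ;;
        *)         echo "downloaded" ;;
    esac
}

matches_skip_pattern "http://www.widgets.com/index.html"          # downloaded
matches_skip_pattern "http://www.widgets.com/discuss/topic1.html" # skipped
```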