Robotech_Master ([info]robotech_master) wrote in [info]linux,
@ 2004-02-21 02:55:00
Current mood: frustrated

need help with wget
Help! I'm at my wit's end here.

I need to snag a complete mirror of this directory and everything beneath it (2 layers of recursion should be sufficient):

http://web.archive.org/web/20030413185309/www.gizmology.net/lovecraft/

Ideally, I'd like this all in a single directory, with all images, relative links, and "lovecraft" being the topmost directory in the tree...something suitable for burning to a CDROM and keeping handy.

Unfortunately, I'm having a great deal of difficulty making wget do it. No matter how many levels of recursion I try, no matter what I do, it only mirrors the site's robots.txt and the index file with none of the links.

Can someone please help me out?






[info]warmedjets
2004-02-21 01:04 am UTC (link)
web.archive.org's robots.txt disallows /web/

wget follows the robots.txt rules when it recurses, which is why it stops at the index page


Re:
[info]warmedjets
2004-02-21 01:08 am UTC (link)
P.S. you can add the line

robots = off

in your .wgetrc to ignore the robots.txt
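If you'd rather not touch .wgetrc at all, wget's -e option takes the same kind of setting for a single run. A minimal sketch, assuming a POSIX shell; the function name fetch_ignoring_robots is made up for illustration and is not part of wget:

```shell
# One-off alternative to editing ~/.wgetrc: -e executes a .wgetrc-style
# command for this invocation only.
# fetch_ignoring_robots is a hypothetical helper name, not a wget feature.
fetch_ignoring_robots() {
    wget -e robots=off "$@"
}

# Usage (not run here):
#   fetch_ignoring_robots -r -l 2 http://web.archive.org/web/20030413185309/www.gizmology.net/lovecraft/
```

Any other wget options can be passed through the helper unchanged, since it forwards all of its arguments.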


Re:
[info]robotech_master
2004-02-21 01:15 am UTC (link)
Okay, I added that to my .wgetrc (or, rather, created a .wgetrc with "robots = off" in it in my homedir)...but it still didn't work. I just got the directory tree itself with an .index file in it.

BTW, this is the command I used:

wget -nH -r -l 2 -k -p http://web.archive.org/web/20030413185309/www.gizmology.net/lovecraft/

What else am I doing wrong?


Re:
[info]warmedjets
2004-02-21 01:52 am UTC (link)
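ok, it's complicated because of the way archive.org handles the links, and the fact that the html files have a <base href=gizmology.net/lovecraft>, so relative links in them resolve against the original gizmology.net site instead of the archived copy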

what worked for me was:
wget -rH -l 1 --exclude-domains gizmology.net http://web.archive.org/web/20030413185309/www.gizmology.net/lovecraft/

then, once it has downloaded a page, you have to edit that file, remove the base href line, and run:

wget -rH -l 1 --exclude-domains gizmology.net --base=http://web.archive.org/web/20030413185309/www.gizmology.net/lovecraft/ --input-file=web.archive.org/web/20030413185309/www.gizmology.net/lovecraft/index.html

It's a huge pain in the ass, yes.

If there are only a few pages whose second-level links you need, it's not that huge of a hassle... but if there are a lot, it might suck.
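For the "a lot of pages" case, the hand-editing step could probably be scripted. A rough sketch, assuming a POSIX shell and that each base tag sits on a line of its own (anything else on that line gets deleted with it); strip_base_tags is a made-up name:

```shell
# strip_base_tags FILE...: delete every line that contains a <base ...> tag,
# rewriting each file in place.
# Assumption: the tag is on its own line, as "remove the base href line" implies.
strip_base_tags() {
    for f in "$@"; do
        # Portable in-place edit: write to a temp file, then replace the original.
        tmp="$f.tmp.$$"
        sed '/<[Bb][Aa][Ss][Ee][[:space:]]/d' "$f" > "$tmp" && mv "$tmp" "$f"
    done
}

# Usage (not run here):
#   strip_base_tags web.archive.org/web/20030413185309/www.gizmology.net/lovecraft/index.html
```

The pattern requires whitespace right after "base", so a tag like <basefont ...> is left alone.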

Anyone else have an easier way to get around this?


Re:
[info]robotech_master
2004-02-21 09:52 am UTC (link)
The first command worked fine...but when I edited the downloaded page and then ran the second, I got a whole lot of "cannot write" errors, such as:

Cannot write to `web.archive.org/web/20040221094924/http:/www.gizmology.net/lovecraft/<html>' (No such file or directory).

and ended up with

FINISHED --11:49:25--
Downloaded: 0 bytes in 0 files


Any further advice?
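One guess at what's going on, not a confirmed diagnosis: the "<html>" at the end of that path looks like wget read index.html as a plain list of URLs, one per line, instead of parsing it as HTML. wget has a --force-html (-F) option for reading an HTML file via --input-file. A sketch of the second command with that option added; the function name and its two arguments are made up for illustration:

```shell
# Assumption: the "<html>" in the failing path means the saved page was read
# as a plain URL list. --force-html (-F) tells wget to parse it as HTML.
# refetch_from_saved_page is a hypothetical helper name.
refetch_from_saved_page() {
    base_url=$1    # URL that relative links in the saved page resolve against
    saved_page=$2  # local copy of the page, with its <base href> line removed
    wget -rH -l 1 --exclude-domains gizmology.net \
         --force-html --base="$base_url" --input-file="$saved_page"
}

# Usage (not run here):
#   refetch_from_saved_page \
#     http://web.archive.org/web/20030413185309/www.gizmology.net/lovecraft/ \
#     web.archive.org/web/20030413185309/www.gizmology.net/lovecraft/index.html
```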


(Reply from suspended user)
Re:
[info]robotech_master
2004-02-21 12:11 pm UTC (link)
Oh, excellent! You rule! You really rule! It seems to be chugging right along.

This, by itself, would be more than enough to satisfy me...but there is one minor thingie I'd appreciate if you could help me with. I notice that the index.htm pages in those directories, even though they have relative links, still lead to the archive.org pages when you click on them. I'm guessing there's some script in the page itself that does that. Can you advise me how to make it stop doing that before I burn it to CDROM?

Thanks again, ever so much!


(Reply from suspended user)

[info]dcedilotte
2004-02-21 06:34 am UTC (link)
Check whether rsync could do what you're looking for. I'm not sure, but rsync is what's usually used for mirroring like that.


