FlatFilesFromZope 2006-08-24
From LaurasWiki
I've been stymied -- stymied! -- pondering the complexities of organizing my Zope ProjectsArchive, with text and design intact, embedded as they are in various arcane and outdated versions of the constantly evolving Zope ecology. Nightmare!
Finally, it hit me: Just spider the sites and snarf them down using some utility. Doh! I can't believe this didn't occur to me before.
Top of my list, my ZWikis. Checking zwiki.org, sure enough: discussion: UserDiscussion: structure and size of a zwiki (http://zwiki.org/UserDiscussion200404)
> - how do I export the complete zwiki to flat text files? > The best way right now is probably to use something like wget to spider your wiki, visiting either the rendered pages to get html (possibly with the ?bare=1 parameter to turn off the skin) or the PAGE/src urls to get source text. We need a good import and export script to open up more possibilities here.
Indeed.
DeepVacuum
DeepVacuum (http://www.hexcat.com/deepvacuum/index.html) is basically a graphical application interface to wget (http://www.gnu.org/software/wget/) and comes from VersionTracker et al highly recommended.
DV Try 1
The first pass looks promising, but has weirdness.
What it's using: WGET SETTINGS
Arguments: --cache=off --cookies=on --glob=off --tries=3 --proxy=off -e robots=off -x -r -p --follow-ftp -k -np --quota=100m http://myzopesite.com
- Crawled all over the site -- I wanted 1 directory and its subs, but got everything, with lots of missing bits and blank spots: next time, select subdirectories only
- Results are inconsistent: often, but not always, missing the main index page -- including in ZWikis, a course subdirectory using the ZopeZen custom product, maybe elsewhere
- Even with the "-p" flag set, it's not downloading the style sheets (not a biggie, but still)
Maybe it timed out...?
DV Try 2
To modify parameters, click the small arrow on the lower right of the DeepVacuum application panel.
Try 2: On a very tiny test zwiki:
Arguments: --cache=off --cookies=on --glob=off --tries=3 --proxy=off -e robots=off -x -r -p -k -np --quota=100m --http-user=laura --http-passwd=* http://myzopesite.com
OK, here it consistently:
- created folders for zwiki pages
- filled them with backlinks, diff, editform, sometimes more
- NEVER an index.html
- BUT: dupes of the pages named "FrontPage.1" etc (3 of each plus the folder): later realize: w/renaming (.html), that would work!
Don't see a way in DeepVacuum to set:
- "-E" for converting those pages to .html files
- "-nd" to force no subdirectories: note: wouldn't work anyway on a zwiki (all backlinks pages named "backlinks")
- and also not to clobber, if that's what would help (don't think it would): won't work with mirror?
Using WGET Directly
Try 1
Using this:
$ wget -r -p -k -E --no-parent http://mysite.com/laura/mywiki/
... only getting the index file (1) in the folder -- nothing else. Why? Needed "-e robots=off":
$ wget -e robots=off -r -p -k -E --no-parent http://mysite.com/laura/mywiki/
That's working pretty well! But it's hanging -- and I wanted to see what was happening. What I didn't get:
- style sheets
- backlinks pages (or the edits etc which is fine, but I need the backlinks)
Try 2
Hit return by accident and it restarted, picking up where it left off. Smart. And now it's getting round to the backlinks etc pages. So I think I'll need to:
- Add "-x" to force directories (as does DV)
- I think that means: page names > folders : for holding backlinks + index
- BUT: Many folders created by DeepVacuum using this are missing the index.html file. Why? I don't see the pattern, can't figure it
- So maybe: use "-nd" instead to force it all flat?
- Also: Optional: eliminate editform, map, & subscribe -- and disable those links (or point to single default pages re "this is an archive")
- Maybe try the --no-clobber flag (-nc) to eliminate duplicates (check how this works)
Can't figure out how to use it for a password protected area in Zope, the code from DV didn't do it.
Try 3
Try:
$ wget -e robots=off -r -p -np -nd -k -E http://mysite.com/laura/mywiki/
No:
- "-nd" won't work: backlinks pages are all named "backlinks" and lose the name ref to the parent page
Try:
$ wget -e robots=off -r -p -np -x -k -E http://mysite.com/laura/mywiki/
Strange, from here (netvironments vs tonka), it's getting the stylesheets right away -- either I've coded the call differently or it's to do with the different version of Zope?
Try 4
After more perusing of the manual, try:
$ wget -e robots=off -m -w 3 -p -np -x -nH -k -E http://user:password@mysite.com/mywiki
Meaning:
- execute robots=off: ignore robots.txt
- mirror (= recursive, w/timestamp, infinite recursion depth (do I want that?))
- wait 3 seconds to spare server
- get all required pages
- don't go above parent directory
- force create subdirectories where needed (in zwikis)
- for the local copy, don't create a "www.whatever.tld" host folder/directory
- convert links for local viewing
- add HTML extensions to HTML docs, where missing
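The same command is easier to check against the manual when every short flag is spelled with its long name. This is only a sketch: the function name try4_long_form is made up for illustration, the URL is a placeholder, and the function prints the command instead of running it.

```shell
# Print the Try 4 command with long option names, one flag per line of code.
# Nothing is downloaded; copy the printed line into a terminal to run it.
try4_long_form() {
  url=$1
  flags="-e robots=off"                    # ignore robots.txt
  flags="$flags --mirror"                  # -m
  flags="$flags --wait=3"                  # -w 3
  flags="$flags --page-requisites"         # -p
  flags="$flags --no-parent"               # -np
  flags="$flags --force-directories"       # -x
  flags="$flags --no-host-directories"     # -nH
  flags="$flags --convert-links"           # -k
  flags="$flags --html-extension"          # -E
  printf 'wget %s %s\n' "$flags" "$url"
}

try4_long_form "http://mysite.com/mywiki/"
```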
Bingo! This took a long time and it's very bulky, but it seems to work. In a ZWiki, instead of index files in folders, it gives me pages with the same name as the folder holding its backlinks, edits, etc pages (which makes a certain WikiName sense).
Cleanup:
- Make a backup copy, just in case
- This approach created duplicates for each wikipage
- ThisPage.html and ThisPage.1.html, plus ThisPage/backlinks.html (etc)
- Relative links in the wiki have been set to the ---Page.1.html versions so
- In TextMate:
- Global search/replace link text, from *.1.html to *.html
- Global search/replace to correct other site links (check and elaborate)
- What else?
- Check links in browser!
- In Terminal:
- Navigate to folder
- rm *.1.html
- Check everything again in browser: what else?
- Add disclaimer/note: "This is a static (flat file) archive" or whatever: Where?
- Upload to server: Works?!
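The search/replace and rm steps above could also be scripted. What follows is a minimal sketch, not a tested recipe for a real mirror: the function name cleanup_mirror is hypothetical, it assumes every link to a duplicate ends in ".1.html", and it should only ever be pointed at the backup copy first.

```shell
# Repoint links from Page.1.html to Page.html, then delete the duplicates.
cleanup_mirror() {
  dir=$1
  # Step 1: rewrite ".1.html" to ".html" inside every HTML file.
  # A temp file is used instead of "sed -i", whose syntax differs between systems.
  find "$dir" -type f -name '*.html' | while IFS= read -r f; do
    sed 's/\.1\.html/.html/g' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
  done
  # Step 2: remove Page.1.html, but only when Page.html exists next to it.
  find "$dir" -type f -name '*.1.html' | while IFS= read -r dup; do
    plain="${dup%.1.html}.html"
    if [ -f "$plain" ]; then
      rm "$dup"
    fi
  done
}
```

Calling cleanup_mirror with the path of the copied mirror folder does both steps in one go. A ".1.html" file with no plain twin is left alone, so nothing is lost if wget only saved the numbered version of a page.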
Try 5
Now, this is strange: Flush from my success with mirroring a zwiki wiki, before starting in on the cleanup, I thought I'd snarf the old netvironments blog archives (generated in Blogger, published into Zope), so that I have them in flat file format (yes!). The idea was to set it running and leave it running in the background while I turned my attention to the process of cleaning up the laurazWiki archives. But no. Instead I spent another however long messing with wget.
I used this (a couple sections were password protected so I included that):
$ wget -e robots=off -m -w 3 -p -np -x -nH -k -E http://username:password@www.netvironments.org/blog/
Same thing, right? Different results, though: it won't stay inside the /blog directory. I tried limiting it using the exclude flag, in addition to "-np" (no parent):
-X laurazWiki,/draftzwiki
Meaning, don't go to the parallel directory, laurazWiki or the subdirectory draftzwiki. There are other parallel directories I didn't want it to copy but "-np" should have addressed that, yes? No. It's crawled all over, copying every directory on netvironments.org. After a few tries to restrict it, to keep it within the blog archive boundaries, I gave up and restarted it to mirror the whole site from root, excluding subdomains. 12 hrs later, it's still running. Well, at least now I'll have a flat file copy of the whole thing, in case -- whatever! I needed to get myself a massive external backup drive, anyway, didn't I?
But I don't understand what's going on. I know that Zope doesn't make actual "directories," but this is crawling the URL tree, not the file system. Maybe ZWiki "directories" are somehow served and marked differently in Zope, allowing wget to recognize a boundary? Or, is there something strange, different in the way that I've made the call? I don't see it.
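One thing worth ruling out is the trailing slash. As I read the wget manual's note on -np, HTTP has no real directories, so wget takes everything up to the last slash of the start URL as the "directory" to stay inside. The sketch below only illustrates that reading; it is not wget's code, and the function names are invented.

```shell
# Which "directory" does a start path imply? Everything up to the last slash.
start_dir() {
  path=$1
  case $path in
    */) printf '%s\n' "$path" ;;            # already ends in a slash
    *)  printf '%s/\n' "${path%/*}" ;;      # drop the last path segment
  esac
}

# Succeeds when the linked path ($2) lies inside the start directory ($1).
inside_start_dir() {
  case $2 in
    "$1"*) return 0 ;;
    *)     return 1 ;;
  esac
}

start_dir "/mywiki"    # no trailing slash
start_dir "/blog/"     # trailing slash
```

Under this reading a start URL ending in /mywiki leaves the whole site in bounds, while one ending in /blog/ should not. The Try 5 URL does end in a slash, so this alone does not explain the crawl; it only removes one suspect.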
wget on Plone sites 1
After 19 hrs running, I couldn't stand it any more and quit. I hadn't meant to mirror my Plone sites yet, so hadn't restricted any page types or even thought about that yet. It was creating enormous bloat in the way of, say, this:
enabling_cookies?month/int=1&year/int=2006.html
enabling_cookies?month/int=1&year/int=2007.html
enabling_cookies?month/int=2&year/int=2006.html
enabling_cookies?month/int=2&year/int=2007.html
and so on. I wasn't convinced it would ever stop. Meanwhile, I found several refs on using wget with Plone; most used fewer parameters than I had and so were, on the one hand, reassuring and, on the other, not much actual help beyond that (though that was something!). But this very helpful apparent cross-post from a list also came up:
http://www.redcor.ch/intranet_zope_plone/tutorial/faq/UsingWgetToCreateOfflinePloneSite
Here is my local wgetrc file:
input = urls-to-get
convert_links = on
recursive = on
reclevel = inf
page_requisites = on
wait = 0
quota = 1000m
no_parent = on
dir_prefix = some-directory-name
reject = author,copyright,sendto_form,folder_listing,topic*
restrict_file_names = windows
robots = off
The urls-to-get file that is mentioned in the wgetrc contains the URL pointing to the top directory that you want to mirror:
http://your-domain/some-web-directory/some-sub-directory
I pointed the WGETRC environment variable at it:
setenv WGETRC /some-local-directory/wgetrc
Then I did:
wget --no-host-directories
That's it.
OK, then.
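The setenv line in that recipe is csh syntax. In an sh-style shell such as bash, the same setup might look like the sketch below. The working directory is a made-up location, the wgetrc itself still has to be copied into it by hand, and the assumption that wget resolves the relative "input = urls-to-get" path against the directory it is started from is mine, not the recipe's.

```shell
# Hypothetical working directory for the mirror job.
workdir="${TMPDIR:-/tmp}/plone-mirror-demo"
mkdir -p "$workdir"

# The start URL goes into the file named by "input = urls-to-get".
printf '%s\n' 'http://your-domain/some-web-directory/some-sub-directory' \
  > "$workdir/urls-to-get"

# sh/bash equivalent of: setenv WGETRC /some-local-directory/wgetrc
# (save the quoted wgetrc as $workdir/wgetrc before running wget)
export WGETRC="$workdir/wgetrc"
echo "WGETRC is now $WGETRC"

# Then, from inside the working directory:
#   cd "$workdir" && wget --no-host-directories
```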
wgetrc on a Mac?
So, where is wgetrc on a Mac? Well. First, I looked for wget itself, which was installed by DeepVacuum. I have no idea where it is and Spotlight won't show it to me and neither will QuickSilver. I found many things, but I'm not sure I found wget:
hopelandia:~ latrippi$ whereis wget
hopelandia:~ latrippi$ locate wget
/System/Library/Perl/5.8.6/newgetopt.pl
/usr/share/man/man3/mvwgetch.3x
/usr/share/man/man3/mvwgetnstr.3x
/usr/share/man/man3/mvwgetn_wstr.3x
/usr/share/man/man3/mvwgetstr.3x
/usr/share/man/man3/mvwget_wch.3x
/usr/share/man/man3/mvwget_wstr.3x
/usr/share/man/man3/wgetbkgrnd.3x
/usr/share/man/man3/wgetch.3x
/usr/share/man/man3/wgetnstr.3x
/usr/share/man/man3/wgetn_wstr.3x
/usr/share/man/man3/wgetstr.3x
/usr/share/man/man3/wget_wch.3x
/usr/share/man/man3/wget_wstr.3x
/usr/share/vim/vim62/syntax/wget.vim
/usr/share/zsh/4.2.3/functions/_wget
Then I looked for wgetrc:
hopelandia:/ latrippi$ locate wgetrc
hopelandia:/ latrippi$
Came up with nothing. I did list in the directory it should be in, and there it was:
hopelandia:/ latrippi$ ls usr/local/etc
wgetrc
I double checked:
hopelandia:/ latrippi$ cd usr/local/etc
hopelandia:/usr/local/etc latrippi$ ls
wgetrc
So, what's with "locate"? Do I need to bind wgetrc to an environment variable for locate to find it? Anyway, OK, I don't need to use wgetrc or an input file (where should I put the input file, for example?).
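A likely explanation, though not verified on this machine: locate searches a prebuilt database rather than the live file system, so a file installed after the last rebuild stays invisible until the database is refreshed. Looking directly avoids the question. In this sketch the helper name find_in_dirs and the list of candidate directories are guesses, not a statement of where DeepVacuum puts its copy of wget.

```shell
# Look for a file by name in a list of directories, without using locate.
find_in_dirs() {
  name=$1
  shift
  for d in "$@"; do
    if [ -e "$d/$name" ]; then
      printf '%s\n' "$d/$name"
    fi
  done
}

# Ask the shell first, then check a few common install locations (guesses).
command -v wget || echo "wget is not on PATH"
find_in_dirs wget /usr/local/bin /opt/local/bin /sw/bin /usr/bin
find_in_dirs wgetrc /usr/local/etc /etc
```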
wget on plone sites 2
I can translate the wgetrc commands into command line flags or executables.
input = urls-to-get : the URL to the top directory to be mirrored
convert_links = on : -k
recursive = on : comes with -m
reclevel = inf : comes with -m
page_requisites = on : -p
wait = 0 : alright then, -w 0
quota = 1000m : good idea! -Q 1000m
no_parent = on : -np
dir_prefix = some-directory-name : -P the top of the retrieval tree (name of directory or folder to be created)
reject = author,copyright,sendto_form,folder_listing,topic* : -R rejlist (--reject rejlist)
where rejlist is a comma-separated list, no spaces
restrict_file_names = windows : --restrict-file-names=mode (where mode is unix or windows; don't think I need it)
robots = off : -e robots=off
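Put together, the translation above might come out as the single command this sketch prints. It is an approximation rather than a verified equivalent: it uses -r -l inf for the two recursion settings instead of -m, because -m also switches on time-stamping, which the wgetrc does not ask for. The function name is invented and nothing is downloaded.

```shell
# Print a command line that mirrors the quoted wgetrc, setting by setting.
wgetrc_as_flags() {
  flags="-i urls-to-get"                     # input = urls-to-get
  flags="$flags -k"                          # convert_links = on
  flags="$flags -r -l inf"                   # recursive = on, reclevel = inf
  flags="$flags -p"                          # page_requisites = on
  flags="$flags -w 0"                        # wait = 0
  flags="$flags -Q 1000m"                    # quota = 1000m
  flags="$flags -np"                         # no_parent = on
  flags="$flags -P some-directory-name"      # dir_prefix = some-directory-name
  flags="$flags -R 'author,copyright,sendto_form,folder_listing,topic*'"  # reject
  flags="$flags --restrict-file-names=windows"
  flags="$flags -e robots=off"               # robots = off
  flags="$flags --no-host-directories"       # from the quoted command line
  printf 'wget %s\n' "$flags"
}

wgetrc_as_flags
```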
OK, so:
- I've edited the wgetrc to give a download quota and turn robots off, that's all.
- I still don't see why my Try 4 command isn't staying inside the /blog directory!!!
- For Plone sites, restricting file types -- very useful.
- But, I wonder if that list's complete? I'm getting lots of other krufty type stuff: enablingcookies, join_form, login_form, mail_password_form, document_view, discussionitem_view, event_view, newsitem_view, forum_listing, search_form -- all incrementing up in multiple copies "month/int=12&year/int=2006" and in wiki: backlinks, subscribeform, wikipage_view, ditto!
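A longer reject list is easy to build once the unwanted names are known. The sketch below joins names into the comma-separated, no-spaces form that -R expects. Which names belong on the list is a guess based on the files seen so far (backlinks is left off on purpose, since those pages are wanted), and whether rejecting a name also stops its month/year variants is untested.

```shell
# Join names into the comma-separated, no-spaces list that -R expects.
make_reject_list() {
  list=""
  for name in "$@"; do
    if [ -z "$list" ]; then
      list=$name
    else
      list="$list,$name"
    fi
  done
  printf '%s\n' "$list"
}

# The recipe's original names first, then forms noticed in this mirror.
REJECT=$(make_reject_list \
  author copyright sendto_form folder_listing 'topic*' \
  enabling_cookies join_form login_form mail_password_form \
  search_form editform subscribeform)

# Print, rather than run, a command that uses the list (placeholder URL).
printf "wget -R '%s' %s\n" "$REJECT" "http://mysite.com/"
```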
On Wget
- GNU Wget Manual (http://www.gnu.org/software/wget/manual/html_node/index.html)
- Wget (http://en.wikipedia.org/wiki/Wget) at Wikipedia
Directory Options
Directory-Options (http://www.gnu.org/software/wget/manual/html_node/Directory-Options.html#Directory-Options)
-nH --no-host-directories Disable generation of host-prefixed directories. By default, invoking Wget with -r http://fly.srk.fer.hr/ will create a structure of directories beginning with fly.srk.fer.hr/. This option disables such behavior.
-x --force-directories The opposite of -nd—create a hierarchy of directories, even if one would not have been created otherwise. E.g. wget -x http://fly.srk.fer.hr/robots.txt will save the downloaded file to fly.srk.fer.hr/robots.txt.
Download Options
Download-Options (http://www.gnu.org/software/wget/manual/html_node/Download-Options.html#Download-Options)
-c --continue Continue getting a partially-downloaded file.
-w seconds --wait=seconds Wait the specified number of seconds between the retrievals. Use of this option is recommended, as it lightens the server load by making the requests less frequent.
HTTP Options
HTTP-Options (http://www.gnu.org/software/wget/manual/html_node/HTTP-Options.html#HTTP-Options)
-E --html-extension If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename. This is useful, for instance, when you're mirroring a remote site that uses .asp pages, but you want the mirrored pages to be viewable on your stock Apache server. Another good use for this is when you're downloading CGI-generated materials. A URL like http://site.com/article.cgi?25 will be saved as article.cgi?25.html.
Recursive Retrieval Options
Recursive-Retrieval-Options (http://www.gnu.org/software/wget/manual/html_node/Recursive-Retrieval-Options.html#Recursive-Retrieval-Options)
-k --convert-links After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-html content, etc.
-K --backup-converted When converting a file, back up the original version with a .orig suffix. Affects the behavior of -N (see HTTP Time-Stamping Internals).
-l depth --level=depth Specify recursion maximum depth level depth (see Recursive Download). The default maximum depth is 5.
-m --mirror Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps ftp directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing.
-p --page-requisites This option causes Wget to download all the files that are necessary to properly display a given html page. This includes such things as inlined images, sounds, and referenced stylesheets.
-r --recursive Turn on recursive retrieving. See Recursive Download, for more details.
Directory Based-Limits
Directory Based-Limits (http://www.gnu.org/software/wget/manual/html_node/Directory_002dBased-Limits.html#Directory_002dBased-Limits)
-np --no-parent no_parent = on The simplest, and often very useful way of limiting directories is disallowing retrieval of the links that refer to the hierarchy above than the beginning directory, i.e. disallowing ascent to the parent directory/directories.
References
- wget: Download entire websites easy (http://linuxreviews.org/quicktips/wget/), Linux Reviews
- Mirroring Websites with wget (http://www.jim.roberts.net/articles/wget.html)
Also here:
- ZopeZen - Getting data out of Zope (http://www.zopezen.org/Members/genghis/news_item.2004-07-18.8739319985): zopezen (Andy!): Same issue, pretty much, solutions not spelled out
- Making wget Work With Plone (http://zopelabs.com/cookbook/1103609775): Python script and DTML method: maybe might help?
- http://www.zope.org/Members/hdw/Tip/wget/ : OK, so I'm not crazy (19 hours now and counting).... More important, though redundant, it should be going into endless recursions anywhere, which I was beginning to suspect!

