Saving the WordPress.com Export File and The Linked Media Files (and wget’s strictness)
So I’ve been wanting a way to automatically backup my wordpress.com export file. I decided to go for a bash and wget mix to do this work. But I soon had a problem wget won’t save cookies that have a path different to the file you are downloading. This is a problem because, well here is what I basically do to get the export file.
Grab wp-login.php. This will issue a cookie that WP looks for as proof that I can indeed store cookies.
Next I post login credentials to wp-login.php. This will issue a bunch of authentication cookies. Specifically,
Set-Cookie: wordpress_test_cookie=WP+Cookie+check; path=/; domain=.wordpress.com Set-Cookie: wordpress=some_string; path=/wp-content/plugins; domain=.wordpress.com; httponly Set-Cookie: wordpress=some_string path=/wp-admin; domain=.wordpress.com; httponly Set-Cookie: wordpress_logged_in=some_string; path=/; domain=.wordpress.com; httponly Set-Cookie: wordpress_sec=some_string; path=/wp-content/plugins; domain=.wordpress.com; secure; httponly Set-Cookie: wordpress_sec=some_string path=/wp-admin; domain=.wordpress.com; secure; httponly
The problem is Wget will refuse to save number 2,3,5 and 6 (only saving wordpress_test_cookie and wordpress_logged_in). It refuses the rest because it requires the cookie path to be the same as the path of the file you are requesting. Using –debug wget says,
cdm: 1 2 3 4 5 6 7 8Attempt to fake the path: /wp-content/plugins, /wp-login.php cdm: 1 2 3 4 5 6 7 8Attempt to fake the path: /wp-admin, /wp-login.php cdm: 1 2 3 4 5 6 7 8Attempt to fake the path: /wp-content/plugins, /wp-login.php cdm: 1 2 3 4 5 6 7 8Attempt to fake the path: /wp-admin, /wp-login.phpSpecifically to get the export file I need the wordpress_sec cookie for the path /wp-admin. I can’t just request /wp-admin and try to get the cookie from there because only wp-login.php will let me post credentials.
Possible solutions are A) write a hacky solution that just grabs the cookie value using grep/sed and manually add this to the cookies file, B) recompile wget to accept some other argument that will accept these cookies, or C) don’t use wget.
I took a look at the source for wget, and it was easy to identify the problem area, I could just simply remove this segment,
/* The cookie sets its own path; verify that it is legal. */ if (!check_path_match (cookie->path, path)) { DEBUGP (("Attempt to fake the path: %s, %s\n", cookie->path, path)); goto out; }But then my download script wouldn’t be as portable and I’ll have to make sure I use and have the patched wget available.
I ended up using curl for some parts, but I probably could have done option A.
Anyhow, the script is here. It should grab the export xml file as well as any media files that it references and were uploaded to that wordpress.com blog.
I also had trouble with wget, but I solved the cookie issue a bit differently, by forcing it to every request.
My article about archiving WPMU can be found here: http://mig.hyper.fi/article/355
Thanks for the script, It saved me a lot of time 🙂