Joel H. Simplex ([info]njyoder) wrote in [info]lj_dev,
@ 2008-04-03 06:06:00
Several LJ backup programs have a syncitems bug.
Many LJ backup programs, including the official jbackup.pl client, have a bug when doing syncitems where the first batch of items is all comments (i.e. no posts).  LiveJournal Backup / Search Tool and ljArchive also appear to be susceptible to this bug.

I noticed it because my journal is like this.  After the LJ-Sec developer reviewed it, he found that the first batch of syncitems for me was all comments.  I then experimented with various backup programs and found that the bug shows up in many of them.

It seems that most software operates under the assumption that there's at least one post in the first batch and therefore gets stuck trying to do syncitems or otherwise doesn't do the backup properly.

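In jbackup.pl, if you haven't created a GDBM database before, $lastsync won't be set.  Normally this isn't a problem, because a single post ("L" type) in the first batch of syncitems is enough to update $lastsync.  When the first batch is all comments, though, $lastsync is never updated, which appears to be why the script gets stuck.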

This fix shouldn't cause any problems, unless I'm mistaken.  You only need to modify do_sync's main while loop as follows:

foreach my $item (@{$hash->{syncitems} || []}) {
    # Update $lastsync regardless of item type
    $lastsync = $item->{'time'}
        if $item->{'time'} gt $lastsync;
    next unless $item->{item} =~ /L-(\d+)/;
    $synccount++;
    $sync{$1} = [ $item->{action}, $item->{'time'} ];
    # Old per-post update, redundant now that $lastsync is updated above:
#  $lastsync = $item->{'time'}
#  if $item->{'time'} gt $lastsync;
#  $bak{"event:realtime:$1"} = $item->{'time'};
}
$bak{'event:lastsync'} = $lastsync;
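
To illustrate why the position of the $lastsync update matters, here is a minimal Python sketch of the sync loop run against a simulated server. Everything in it is hypothetical: fake_syncitems, SERVER_ITEMS, BATCH_SIZE and run_sync are made-up names, not part of jbackup.pl or the LJ protocol, and the sketch assumes the server returns only items strictly newer than lastsync, in time order. It is not a port of jbackup.pl.

```python
from typing import Dict, List, Tuple

Item = Dict[str, str]

# Hypothetical server-side item log: three comments, then two posts.
SERVER_ITEMS: List[Item] = [
    {"item": "C-1", "action": "create", "time": "2008-01-01 00:00:01"},
    {"item": "C-2", "action": "create", "time": "2008-01-01 00:00:02"},
    {"item": "C-3", "action": "create", "time": "2008-01-01 00:00:03"},
    {"item": "L-1", "action": "create", "time": "2008-01-01 00:00:04"},
    {"item": "L-2", "action": "create", "time": "2008-01-01 00:00:05"},
]
BATCH_SIZE = 3  # kept small so the first batch contains only comments


def fake_syncitems(lastsync: str) -> List[Item]:
    """Simulated server call: up to BATCH_SIZE items newer than lastsync."""
    newer = [it for it in SERVER_ITEMS if it["time"] > lastsync]
    return newer[:BATCH_SIZE]


def run_sync(update_on_every_item: bool,
             max_calls: int = 10) -> Tuple[Dict[str, Tuple[str, str]], int, bool]:
    """Return (posts seen, number of server calls, whether the sync finished)."""
    lastsync = ""  # nothing synced yet, like a fresh database
    posts: Dict[str, Tuple[str, str]] = {}
    for calls in range(1, max_calls + 1):
        batch = fake_syncitems(lastsync)
        if not batch:
            return posts, calls, True  # server has nothing newer: done
        for item in batch:
            # Fixed behaviour: advance lastsync for comments as well as posts.
            if update_on_every_item and item["time"] > lastsync:
                lastsync = item["time"]
            if not item["item"].startswith("L-"):
                continue  # skip everything that is not a post
            posts[item["item"]] = (item["action"], item["time"])
            # Old behaviour: lastsync only advances when a post is seen.
            if item["time"] > lastsync:
                lastsync = item["time"]
    return posts, max_calls, False  # gave up: the loop never made progress
```

With update_on_every_item=False the simulated client requests the same all-comment batch until max_calls is reached and never sees a post. With update_on_every_item=True it moves past the comments and picks up both posts. The max_calls cap exists only so the stuck case terminates in the sketch.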


Side note/question: jbackup.pl, when run under ActiveState's Perl for Windows, seems to be really slow (running for minutes on a Pentium 4 system) at processing comments AFTER they've been downloaded, but is fast when comments aren't processed at all.  This holds true even when it checks for new comments and there are none: it appears to take about as long as when there are many new comments.

There's not a lot of CPU usage, but there is a lot of hard drive access.  I'm not sure why it would do this, especially when there are no new comments, so the database doesn't need to be updated and nothing new needs to be downloaded.



[info]ghewgill
2008-04-06 09:53 am UTC (link)
I had a look at my ljdump backup program and it appears as though it should handle this case properly. In particular, the lastsync state variable is updated for each item in the list that syncitems returns, instead of just for journal entries.

I wonder whether you could give ljdump a try on your journal and see whether it works for you.


[info]njyoder
2008-04-07 06:29 pm UTC (link)
I'll check it out. How well developed is its library for general use (in other programs)? I was thinking I'd probably use ljsm because it can back up any journal, but I might be inclined to use another one in Python instead of Perl if it wouldn't be too time-consuming for me to add that.

Oddly enough, I couldn't log in with ljsm and it kept returning some relatively useless error message. I found out that the cause was "BROKEN_CLIENT_INTERFACE" being set to 1, so I disabled it (set it to 0). I just found that really funny, even though I know it's actually set to compensate for sessiongenerate not returning cookies. Did LJ's sessiongenerate not work properly at some point?


[info]ghewgill
2008-04-08 08:11 am UTC (link)
ljdump is pretty specifically tailored for its single purpose, that is, it's not designed as a library. But the code isn't terribly long or complex so it could certainly be used in another context.


[info]njyoder
2008-04-14 10:42 am UTC (link)
That's good to hear. I'm sorry to bother you again, but after I tried it, for some reason it seems quite slow compared to other backup utilities.

Are you downloading posts one by one? It seems that way, and if so, it's inefficient when you can fetch posts month by month, just in case you didn't already know (I don't mean to patronize if you already did). I don't know if this is normal, but it did go over the posts/syncitems more than once even though it was a first run. Does it create and update the same post more than once under any circumstances?

It takes over 30 minutes, probably closer to 60, to download the roughly 1,800 posts and 2,000 comments in my journal. It takes only about 10-20 minutes with other utilities.

It did download everything correctly; it was just slow at doing it.


[info]ghewgill
2008-04-15 09:16 am UTC (link)
I don't remember the details now, but there's a comment in the ljdump source indicating that I had to fetch journal entries the slow way due to the server rejecting repeated calls to getevents with selecttype=syncitems. The LJ API documentation is pretty poor and I couldn't figure out what they were trying to tell me.

Anyway, it sounds like the slower approach I used in ljdump avoids the syncitems problem you mentioned in your original post. At least it's only slow the first time you run it with thousands of entries to download. :)
