None of Them Knew They Were Robots ([info]phyxeld) wrote in [info]perl,
@ 2003-09-04 06:59:00
LJ Scraping Exercise
Inspired by [info]ciphergoth's TrustFlow application, I recently set out to write a little LJ scraper of my own. I don't have any cool trust algorithms to offer, and what I'm doing isn't exactly rocket science... but it does provide useful functionality that I don't think LiveJournal has on its own (at least, I haven't found it yet). And I had some fun writing it :)

My script tells you which LJ users you share the most interests with.

#!/usr/bin/perl -w
$L=10; die " usage: ./lj_interesting.pl ljusername
This script will show you the $L LiveJournal users whom you share the most 
interests with, and tell you what those interests are. It doesn't work as
well for really popular interests, as livejournal only shows the first 500 
users with a given interest, but the less common interests are probably 
more interesting anyway. This script is released in the public domain by 
the author, phyxeld. (notcopyright) 2003\n" unless $ARGV[0]; crawl(@ARGV);
sub crawl {
    my $user=shift;
    warn "Searching for users who share interests with $user ...\n";
    my @int=scrape($user,0); my $t=$#int+1;
    for $int (@int) {
        printf STDERR (qq"\x0d\x1b[K %3d%% [%-30s] request %d ".
        q[of %d "%s"], (++$i/$t*100),('*'x($i/$t*30)),$i,$t,$int);
        push @{$d{$_}},$int for (scrape($int,1));
    }
    print "\nUser '$_' has ",$#{$d{$_}}+1," common interests:\n   ",
        join(', ',@{$d{$_}}),"\n" for (grep { $_ ne $user && 0<$L--}
        sort {scalar @{$d{$b}}<=>scalar @{$d{$a}}} keys %d)
}
sub scrape { # scrape can do two things: @interests = scrape(user,0)
    my ($q,$w)=@_; my @r=(); #           @users = scrape(interest,1)
    $re=$w?qr[rinfo\.bml\?user=([\w+]+)']:qr[sts\.bml\?int=([\w+]+)'];
    my $u=$w?'interests.bml?int=':'userinfo.bml?user='; push @r, (m/$re/g) 
    for qx[curl "http://www.livejournal.com/$u$q" 2>/dev/null]; return @r;
}


Naturally, the more interests you have listed, the better your results will be... and the longer the script will take to run (so a progress bar is displayed while the script works). The added load on LJ could get heavy if a lot of people ran this: the script sends one HTTP request for each of your interests, plus one for your userinfo page. If this becomes a problem for LJ (I doubt it will), then I'll take it down. Due to my extreme laziness, I'm fetching the pages with curl, so you'll need that installed if you want to try this out. (Or replace the single occurrence of the word curl in the source with wget -O -)
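
For anyone who finds the compact script hard to follow, here is a more readable sketch of just the counting-and-ranking step, run on made-up in-memory data instead of scraped pages. The helper name rank_users and the sample users are hypothetical, not part of the original script. It compares usernames exactly with eq and breaks ties alphabetically, which is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: given interest => [users listing it], return up to
# $limit pairs of [user, [shared interests]], most shared interests first.
sub rank_users {
    my ($self_user, $limit, %users_by_interest) = @_;
    my %shared;
    for my $interest (sort keys %users_by_interest) {
        for my $user (@{ $users_by_interest{$interest} }) {
            next if $user eq $self_user;    # skip yourself (exact match)
            push @{ $shared{$user} }, $interest;
        }
    }
    # Sort by number of shared interests, descending; ties by username.
    my @ranked = sort {
        @{ $shared{$b} } <=> @{ $shared{$a} } or $a cmp $b
    } keys %shared;
    splice @ranked, $limit if @ranked > $limit;
    return map { [ $_, $shared{$_} ] } @ranked;
}

# Made-up sample data standing in for the scraped interest pages.
my %sample = (
    emacs => [qw(alice bob carol)],
    sushi => [qw(alice bob)],
    unix  => [qw(alice dave)],
);

for my $row (rank_users('bob', 2, %sample)) {
    my ($user, $interests) = @$row;
    printf "User '%s' has %d common interests:\n   %s\n",
        $user, scalar @$interests, join(', ', @$interests);
}
```

In the real script the hash would be filled by one scrape(interest,1) call per interest, which is where all the HTTP requests come from.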

If anyone in [info]perl has any feedback about this script, I'd love to hear it. I'm cross posting this there and in my own journal.


$ lj_interesting.pl jwz       
Searching for users who share interests with jwz ...
 100% [******************************] request 69 of 69 "xemacs"
User 'jw_izz' has 28 common interests:
   brassy, cabaret+voltaire, cop+shoot+cop, cyber+fashion, cypherpunk,
die+warzau, dna+lounge, emacs, emergency+broadcast+network,
frank+miller, hanzel+und+gretyl, harlan+ellison, internet+radio,
john+varley, jwz, killing+the+riaa, low+pop+suicide, monkey+butter,
psytrance, retrocomputing, schadenfreude, screen+savers, shriekback,
surveilance, the+singularity, vernor+vinge, waxtrax, webcasting

User 'blackavar' has 17 common interests:
   cabaret+voltaire, cyber+fashion, cypherpunk, die+warzau, dna+lounge,
emacs, frank+miller, internet+radio, jwz, killing+the+riaa,
retrocomputing, schadenfreude, screen+savers, shriekback, vernor+vinge,
waxtrax, xemacs

User 'confuseme' has 14 common interests:
   autechre, blade+runner, cabaret+voltaire, cyberpunk, cypherpunk,
drum+and+bass, front+242, hacking, killing+the+riaa, lain, psytrance,
shriekback, transmetropolitan, william+gibson

User 'ivorjawa' has 11 common interests:
   culture+jamming, cypherpunk, internet+radio, john+varley, jwz,
killing+the+riaa, schadenfreude, security, shriekback, vernor+vinge,
waxtrax

User 'dnalounge' has 10 common interests:
   cypherpunk, dna+lounge, internet+radio, jwz, killing+the+riaa,
monkey+butter, psytrance, screen+savers, surveilance, webcasting

User 'spot' has 9 common interests:
   buffy+the+vampire+slayer, comics, fight+club, hacking,
neal+stephenson, nine+inch+nails, science+fiction, sushi, william+gibson

User 'machinegirl' has 9 common interests:
   blade+runner, cyberpunk, fight+club, front+242, ghost+in+the+shell,
lain, psytrance, sushi, william+gibson

User 'slithead' has 9 common interests:
   autechre, cyberpunk, front+242, harlan+ellison, neal+stephenson,
schadenfreude, unix, warren+ellis, william+gibson

User 'azurecobalt' has 9 common interests:
   24, cyberpunk, farscape, lain, neal+stephenson, the+matrix,
transmetropolitan, warren+ellis, william+gibson

User 'rasp_utin' has 9 common interests:
   aeon+flux, autechre, blade+runner, cyberpunk, electro,
hanzel+und+gretyl, pop+will+eat+itself, waxtrax, william+gibson


This could obviously be cleaned up a bit :)
It would be relatively easy to make the script take input from a CGI, and htmlize the output to make all the names and interests LJ links... but then I think enough people would use it that LJ might take issue with the extra bandwidth consumption. So it's probably better to (a) keep it as a script that stays in the terminal, or (b) implement this the right way (as part of LiveJournal).
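
A minimal sketch of the "htmlize the output" idea, assuming the same userinfo.bml and interests.bml URLs the script already fetches. The function names (escape_html, user_link, interest_link, html_row) are hypothetical, and this only builds the HTML for one result; it is not a CGI.

```perl
use strict;
use warnings;

my $BASE = 'http://www.livejournal.com';

# Escape the few characters that matter inside HTML text and attributes.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

sub user_link {
    my ($user) = @_;
    return sprintf '<a href="%s/userinfo.bml?user=%s">%s</a>',
        $BASE, escape_html($user), escape_html($user);
}

sub interest_link {
    my ($interest) = @_;                  # as scraped, e.g. "dna+lounge"
    (my $label = $interest) =~ tr/+/ /;   # show spaces to the reader
    return sprintf '<a href="%s/interests.bml?int=%s">%s</a>',
        $BASE, escape_html($interest), escape_html($label);
}

# One paragraph per matching user, mirroring the terminal output.
sub html_row {
    my ($user, @interests) = @_;
    return sprintf "<p>%s has %d common interests: %s</p>\n",
        user_link($user), scalar @interests,
        join(', ', map { interest_link($_) } @interests);
}

print html_row('dnalounge', qw(cypherpunk dna+lounge));
```

Since the scraped interest names are already in URL form (with + for spaces), they can go straight back into the query string.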





[info]queue
2003-09-04 07:24 am UTC (link)
LJ used to provide an interest-matching thing, but it doesn't appear to be around any more. It weighted less common interests more heavily, though, so that people could still appear high on your list if you only shared a few interests with them.
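
A sketch of what weighting rare interests more heavily could look like. The formula here, 1 divided by the number of users listing the interest, is purely an assumption for illustration; nothing in this thread says what LJ's old matcher actually used. The name weighted_scores and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Each shared interest contributes 1 / (number of users listing it),
# so an interest shared with few people counts for more.
sub weighted_scores {
    my ($self_user, %users_by_interest) = @_;
    my %score;
    for my $interest (keys %users_by_interest) {
        my @users = @{ $users_by_interest{$interest} };
        next unless @users;
        my $weight = 1 / @users;
        $score{$_} += $weight for grep { $_ ne $self_user } @users;
    }
    return %score;
}

# Made-up data: alice shares one rare interest, bob shares two common ones.
my %sample = (
    shriekback => [qw(me alice)],
    sushi      => [qw(me bob carol dave erin)],
    unix       => [qw(me bob carol dave erin)],
);

my %score = weighted_scores('me', %sample);
printf "%-6s %.2f\n", $_, $score{$_}
    for sort { $score{$b} <=> $score{$a} or $a cmp $b } keys %score;
```

With these numbers alice outranks bob despite sharing fewer interests, which is the effect described above.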



[info]timwi
2003-09-04 08:27 am UTC (link)
Yeah. That was taken down because it was too database-intensive. It was also rather inaccurate; people who have had their account for longer had a greater chance of appearing higher up in your list than newer users.



[info]momentsmusicaux
2003-09-04 07:27 am UTC (link)
LJ had this functionality up until a few weeks ago. I don't know why they've removed it -- maybe it was slowing the server down.



[info]whitaker
2003-09-04 08:20 am UTC (link)
Is the crawler "polite", i.e. rate-limited in some way? I skimmed the Perl and didn't see anything slowing it down.



[info]phyxeld
2003-09-04 11:59 am UTC (link)
No. It pulls all the URLs it needs in a for loop, with no rate-limiting at all. I'm sort of figuring that LiveJournal already gets such an astronomical amount of traffic that a few people running this is just a few more drops in the bucket.

Like I said, if I'm wrong and it's a problem for you guys, I'll be glad to stop / take down the script.
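
For what it's worth, a minimal sketch of how the loop could be made polite, assuming a fixed minimum gap between requests. The helper make_throttle and the 0.2 second gap are hypothetical choices, not anything LJ has asked for. It uses the core Time::HiRes module for sub-second sleeps.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Returns a closure that blocks until at least $min_gap seconds have
# passed since the previous call. The first call returns immediately.
sub make_throttle {
    my ($min_gap) = @_;
    my $last;
    return sub {
        if (defined $last) {
            my $wait = $min_gap - (time() - $last);
            sleep($wait) if $wait > 0;
        }
        $last = time();
        return;
    };
}

my $throttle = make_throttle(0.2);
for my $interest (qw(emacs sushi unix)) {
    $throttle->();
    # The real script would call scrape($interest, 1) at this point.
    print "would fetch interests.bml?int=$interest\n";
}
```

Calling the throttle at the top of the per-interest loop in crawl would be enough, since that loop is where nearly all the requests happen.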

Is there a better way to do this? I'm a fan of having RSS feeds everywhere; RSS of my friends page is something I've wished I had for a while, but RSS output from interests.bml would sure be cool too (and would make this use slightly less bandwidth).

Btw, my next scraper project will probably be a friends-page RSS feed, which I was originally going to do from the HTML but am now thinking about doing by merging each friend's actual RSS feed. I don't know how I'd go about hosting that, as the script needs the reader's LJ credentials to see friends-only posts (the whole point), so I'd probably just give that away as a script too. Something that could be run from cron that would (a) check my userinfo page, (b) check the RSS feeds of everybody on my friends list, and (c) merge them together and write out a static RSS file with all the latest entries. Again, this is something that could be done much better server-side of course... :)
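
A rough sketch of step (c) only, the merge, assuming each friend's feed has already been fetched and parsed into a list of entries. The entry fields (time, author, title), the function merge_feeds, and the sample data are all hypothetical; fetching, authentication, and writing the RSS file are left out.

```perl
use strict;
use warnings;

# Each feed is an array ref of entries: { time => epoch seconds,
# author => ..., title => ... }. Returns the newest $limit entries
# across all feeds, newest first.
sub merge_feeds {
    my ($limit, @feeds) = @_;
    my @all = sort { $b->{time} <=> $a->{time} } map { @$_ } @feeds;
    splice @all, $limit if @all > $limit;
    return @all;
}

# Made-up entries standing in for two parsed friend feeds.
my @feed_a = (
    { time => 300, author => 'alice', title => 'third' },
    { time => 100, author => 'alice', title => 'first' },
);
my @feed_b = (
    { time => 200, author => 'bob', title => 'second' },
);

printf "%s: %s\n", $_->{author}, $_->{title}
    for merge_feeds(2, \@feed_a, \@feed_b);
```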



[info]timwi
2003-09-04 08:29 am UTC (link)
You definitely need to work on the readability of your programming style ;-)


