| None of Them Knew They Were Robots ( @ 2003-09-04 06:59:00 |
LJ Scraping Exercise
Inspired by ciphergoth's TrustFlow application, I recently set out to write a little LJ scraper of my own. I don't have any cool trust algorithms to offer, and what I'm doing isn't exactly rocket science... but it does provide useful functionality that I don't think LiveJournal has on its own (at least, I haven't found it yet). And I had some fun writing it :)
My script tells you which LJ users you share the most interests with.
Naturally, the more interests you have listed, the better your results will be... and the longer the script will take to run (so a visual progress bar is displayed while the script works). The added load on LJ could potentially get heavy if a lot of people ran this; there is one HTTP request sent out for each of your interests, plus one for your userinfo page. If this becomes a problem for LJ (I doubt it will), then I'll take it down. Due to my extreme laziness, I'm fetching the pages with curl, so you'll need that installed if you want to try this out. (Or replace the single occurrence of the word curl in the source with wget -O -)
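The ranking step described above is a tally over an inverted index: for each of your interests, collect everyone who lists it, then sort people by how many of your interests they turned up under. Here is a minimal offline sketch of that step. The %listing hash is made-up data standing in for the scraped pages, and rank_shared is a hypothetical helper name, not something from the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the scraped interest pages:
# interest => users who list it. Made-up data, no network access.
my %listing = (
    'emacs'      => [qw(alice bob me)],
    'cypherpunk' => [qw(alice carol me)],
    'psytrance'  => [qw(alice me)],
);

# Return [user, [shared interests]] pairs, most shared interests first.
sub rank_shared {
    my ($me, $listing, $limit) = @_;
    my %shared;
    for my $interest (sort keys %$listing) {
        for my $user (@{ $listing->{$interest} }) {
            next if $user eq $me;              # skip yourself
            push @{ $shared{$user} }, $interest;
        }
    }
    # Sort by number of shared interests, ties broken alphabetically.
    my @ranked = sort { @{ $shared{$b} } <=> @{ $shared{$a} } or $a cmp $b }
                 keys %shared;
    splice(@ranked, $limit) if @ranked > $limit;
    return map { [ $_, $shared{$_} ] } @ranked;
}

# One request per interest, plus one for the userinfo page.
my $requests = scalar(keys %listing) + 1;
print "would need $requests HTTP requests\n";

for my $row (rank_shared('me', \%listing, 10)) {
    my ($user, $ints) = @$row;
    printf "User '%s' has %d common interests: %s\n",
        $user, scalar @$ints, join(', ', @$ints);
}
```

The cost is dominated by the fetches rather than the tally, which is why the request count grows with the number of interests you list.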
If anyone in perl has any feedback about this script, I'd love to hear it. I'm cross-posting this there and in my own journal.
This could obviously be cleaned up a bit :)
It would be relatively easy to make the script take input from a CGI, and htmlize the output to make all the names and interests LJ links... but then I think enough people would use it that LJ might take issue with the extra bandwidth consumption. So it's probably better to (a) keep it as a script that stays in the terminal, or (b) implement this the right way (as part of LiveJournal).
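The htmlizing part, at least, is small. A rough sketch follows, assuming the same userinfo.bml and interests.bml URLs the script fetches are also the right link targets. The helper names esc, user_link, interest_link and html_row are hypothetical, and the CGI input side is left out entirely.

```perl
use strict;
use warnings;

my $base = 'http://www.livejournal.com';

# Escape the few characters that matter inside HTML text and attributes.
sub esc {
    my $s = shift;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

sub user_link {
    my $user = shift;
    return sprintf '<a href="%s/userinfo.bml?user=%s">%s</a>',
        $base, esc($user), esc($user);
}

# Interests arrive in the URL form the script prints (e.g. "dna+lounge"),
# so the link text swaps the pluses back to spaces.
sub interest_link {
    my $int = shift;
    (my $label = $int) =~ tr/+/ /;
    return sprintf '<a href="%s/interests.bml?int=%s">%s</a>',
        $base, esc($int), esc($label);
}

# One paragraph per matching user: linked name, count, linked interests.
sub html_row {
    my ($user, @ints) = @_;
    return sprintf "<p>%s shares %d interests: %s</p>\n",
        user_link($user), scalar @ints,
        join(', ', map { interest_link($_) } @ints);
}

print html_row('dnalounge', 'dna+lounge', 'jwz');
```

This only changes how results are presented; it does nothing about the bandwidth concern, since every run would still cost LJ the same number of requests.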
#!/usr/bin/perl -w
$L=10; die " usage: ./lj_interesting.pl ljusername
This script will show you the $L LiveJournal users whom you share the most
interests with, and tell you what those interests are. It doesn't work as
well for really popular interests, as livejournal only shows the first 500
users with a given interest, but the less common interests are probably
more interesting anyway. This script is released in the public domain by
the author, phyxeld. (notcopyright) 2003\n" unless $ARGV[0]; crawl(@ARGV);
sub crawl {
    my $user=shift;
    warn "Searching for users who share interests with $user ...\n";
    my @int=scrape($user,0); my $t=$#int+1;
    for $int (@int) {   # one request per interest, with a progress bar on STDERR
        printf STDERR (qq"\x0d\x1b[K %3d%% [%-30s] request %d ".
            q[of %d "%s"], (++$i/$t*100),('*'x($i/$t*30)),$i,$t,$int);
        push @{$d{$_}},$int for (scrape($int,1));
    }
    # skip yourself by exact name, so users whose names merely contain $user are kept
    print "\nUser '$_' has ",$#{$d{$_}}+1," common interests:\n ",
        join(', ',@{$d{$_}}),"\n" for (grep { $_ ne $user && 0<$L--}
        sort {scalar @{$d{$b}}<=>scalar @{$d{$a}}} keys %d)
}
sub scrape {                  # scrape can do two things: @interests = scrape(user,0)
    my ($q,$w)=@_; my @r=();  #                           @users = scrape(interest,1)
    $re=$w?qr[rinfo\.bml\?user=([\w+]+)']:qr[sts\.bml\?int=([\w+]+)'];
    my $u=$w?'interests.bml?int=':'userinfo.bml?user=';
    push @r, (m/$re/g) for qx[curl "http://www.livejournal.com/$u$q" 2>/dev/null];
    return @r;
}
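To see what scrape is matching without hitting LJ, the two patterns can be run against canned input. The HTML below is only a guess at the shape of the markup (single-quoted hrefs, inferred from the trailing ' in the patterns), so treat it as a hypothetical sample rather than what LJ actually serves.

```perl
use strict;
use warnings;

# The two patterns from scrape(), unchanged.
my $user_re     = qr[rinfo\.bml\?user=([\w+]+)'];
my $interest_re = qr[sts\.bml\?int=([\w+]+)'];

# Hypothetical page fragments; the real markup may differ.
my $interest_page = "<a href='/userinfo.bml?user=alice'>alice</a>, "
                  . "<a href='/userinfo.bml?user=bob_2'>bob_2</a>";
my $userinfo_page = "<a href='/interests.bml?int=dna+lounge'>dna lounge</a>, "
                  . "<a href='/interests.bml?int=emacs'>emacs</a>";

# In list context, m//g returns every captured group at once.
my @users     = $interest_page =~ /$user_re/g;
my @interests = $userinfo_page =~ /$interest_re/g;

print "users: @users\n";
print "interests: @interests\n";
```

Because the character class is [\w+], multi-word interests come back in their URL form with pluses in place of spaces, which is why the script can feed them straight into the next request.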
$ lj_interesting.pl jwz
Searching for users who share interests with jwz ...
 100% [******************************] request 69 of 69 "xemacs"

User 'jw_izz' has 28 common interests:
 brassy, cabaret+voltaire, cop+shoot+cop, cyber+fashion, cypherpunk, die+warzau, dna+lounge, emacs, emergency+broadcast+network, frank+miller, hanzel+und+gretyl, harlan+ellison, internet+radio, john+varley, jwz, killing+the+riaa, low+pop+suicide, monkey+butter, psytrance, retrocomputing, schadenfreude, screen+savers, shriekback, surveilance, the+singularity, vernor+vinge, waxtrax, webcasting

User 'blackavar' has 17 common interests:
 cabaret+voltaire, cyber+fashion, cypherpunk, die+warzau, dna+lounge, emacs, frank+miller, internet+radio, jwz, killing+the+riaa, retrocomputing, schadenfreude, screen+savers, shriekback, vernor+vinge, waxtrax, xemacs

User 'confuseme' has 14 common interests:
 autechre, blade+runner, cabaret+voltaire, cyberpunk, cypherpunk, drum+and+bass, front+242, hacking, killing+the+riaa, lain, psytrance, shriekback, transmetropolitan, william+gibson

User 'ivorjawa' has 11 common interests:
 culture+jamming, cypherpunk, internet+radio, john+varley, jwz, killing+the+riaa, schadenfreude, security, shriekback, vernor+vinge, waxtrax

User 'dnalounge' has 10 common interests:
 cypherpunk, dna+lounge, internet+radio, jwz, killing+the+riaa, monkey+butter, psytrance, screen+savers, surveilance, webcasting

User 'spot' has 9 common interests:
 buffy+the+vampire+slayer, comics, fight+club, hacking, neal+stephenson, nine+inch+nails, science+fiction, sushi, william+gibson

User 'machinegirl' has 9 common interests:
 blade+runner, cyberpunk, fight+club, front+242, ghost+in+the+shell, lain, psytrance, sushi, william+gibson

User 'slithead' has 9 common interests:
 autechre, cyberpunk, front+242, harlan+ellison, neal+stephenson, schadenfreude, unix, warren+ellis, william+gibson

User 'azurecobalt' has 9 common interests:
 24, cyberpunk, farscape, lain, neal+stephenson, the+matrix, transmetropolitan, warren+ellis, william+gibson

User 'rasp_utin' has 9 common interests:
 aeon+flux, autechre, blade+runner, cyberpunk, electro, hanzel+und+gretyl, pop+will+eat+itself, waxtrax, william+gibson