Brad Fitzpatrick ([info]bradfitz) wrote in [info]lj_dev,
@ 2005-09-15 21:13:00
Google Blog Search -- relax, yo
I got this email from Google:
Hey Brad,

EvanM passed a note our way from LJ Tech Support, regarding Blog Search and its accidental indexing of "noindex" LJ content. Just wondering if you guys could let your users know that this was entirely unintentional, and a fix should go live within the next day or two? (hopefully tomorrow)

Thanks,
E
So y'all can relax.

(I'm also talking to them about RSS/Atom specs for indicating noindex so they don't have to hit up HTML to learn about it.)

And please, people, stop spreading paranoia: they're not using RSS as a "workaround" to not obey robots.txt and noindex... that's just silly on so many levels.

Remember the golden rule on the Internets:
Never attribute to malice what can be adequately explained by stupidity.
... or in this case, an accident.




[info]halfawake
2005-09-16 04:20 am UTC (link)
Neat, thanks for letting us know about this. Might want to have someone mention it in [info]news as well.

(Reply to this)


[info]zach
2005-09-16 04:20 am UTC (link)
Sweet deal.
Not that I complained in the first place. I welcome all robots to my journal. =)

(Reply to this) (Thread)


[info]idigital
2005-09-16 08:22 am UTC (link)
I amz az robotz. Myz Prime Directivez are to Indexz your LivezJournalz.

(Reply to this) (Parent)(Thread)(Expand)


[info]burr86
2005-09-16 04:26 am UTC (link)
Oh, wow, thanks! *goes back to the support board to reassure a bunch of people* :)

(Reply to this)


[info]azurelunatic
2005-09-16 04:46 am UTC (link)
Good to hear.

(Reply to this)

neat
[info]jay
2005-09-16 04:53 am UTC (link)
It's nice to know that a press conference and involvement from grass roots organizations were not required.

(Reply to this)


[info]zach
2005-09-16 05:56 am UTC (link)
Is it just me, or have some of the comments on this entry kept disappearing and reappearing?

(Reply to this) (Thread)


[info]azurelunatic
2005-09-16 08:47 am UTC (link)
Is the content inane, off-topic, blatantly offensive and/or obscene?

(Reply to this) (Parent)(Thread)(Expand)


[info]mart
2005-09-16 06:15 am UTC (link)

This seems like a good opportunity to solve this properly with HTTP headers:

X-Robot-Prefs: noindex, nofollow
X-Robot-See: /users/mart/robots.txt

Aside from the obvious benefit that it can then apply to any media type including images, having it “out of band” means that the code to handle it can be centralised to LJ::make_journal rather than duplicating it in S1, S2 and talkread.bml. Still needs to go in a few awkward BML pages, but it's still a win. (Of course, the old robots blocking will no doubt have to stay where it is for the benefit of those mythical “other search engines” I've heard about.)

If you just come up with some half-baked solution specific to RSS and Atom we'll be doing this dance again soon enough. For the people who are using stunted webservers and can't set such things, the problem for the Atom/RSS folks then becomes a way to do http-equiv like HTML does, allowing these header fields to be embedded into the document. That doesn't have to be LJ's problem, though.
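
To make the proposal above concrete, here is a minimal sketch of how a crawler might read the suggested header before indexing a response. The header name X-Robot-Prefs comes from the comment above and is only a proposal, not an existing standard; the function names are hypothetical, and the sketch assumes the headers arrive as a plain dict.

```python
def parse_robot_prefs(headers):
    """Return the set of lowercase directives found in the proposed
    X-Robot-Prefs header (hypothetical header from the comment above)."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    raw = normalised.get("x-robot-prefs", "")
    # Directives are assumed to be comma-separated, as in the example.
    return {token.strip().lower() for token in raw.split(",") if token.strip()}


def may_index(headers):
    """True unless the response asks not to be indexed."""
    return "noindex" not in parse_robot_prefs(headers)


def may_follow(headers):
    """True unless the response asks that its links not be followed."""
    return "nofollow" not in parse_robot_prefs(headers)
```

Because the check only looks at response headers, it would work the same for HTML, feeds and images, which is the "any media type" benefit described above.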

(Reply to this) (Thread)


[info]nikolasco
2005-09-16 06:33 am UTC (link)
Out of curiosity, why the robots.txt + X-robot approach? For specific user agents or wildcards? I was thinking of handling such things at the server level (e.g. mod_xrobot reads robots.txt).

My other thought of the day is the need for something more specific than whole document inclusion/exclusion, in light of aggregations like atom-stream.xml. I like the idea of an XML attribute. For example:
<feed xmlns='http://www.w3.org/2005/Atom' r:index="no" xmlns:r="http://namespace/robot/">
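
As a rough sketch of how a consumer could honour such an attribute, the snippet below reads a feed-level r:index value and lets individual entries override it. The namespace URI is the placeholder from the example above; putting the attribute on each entry, the "yes" default, and the function name are assumptions made for illustration, not part of any Atom specification.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ROBOT = "http://namespace/robot/"  # placeholder namespace from the example above


def indexable_entries(feed_xml):
    """Return the Atom ids of entries a robot would be allowed to index."""
    root = ET.fromstring(feed_xml)
    # Feed-level setting acts as the default; assume "yes" when absent.
    feed_default = root.get(f"{{{ROBOT}}}index", "yes")
    allowed = []
    for entry in root.findall(f"{{{ATOM}}}entry"):
        # Assumption: an entry may carry its own r:index to override the feed.
        setting = entry.get(f"{{{ROBOT}}}index", feed_default)
        if setting != "no":
            allowed.append(entry.findtext(f"{{{ATOM}}}id"))
    return allowed
```

A per-entry override like this is one way to cover aggregate feeds such as atom-stream.xml, where a single document mixes entries with different preferences.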

(Reply to this) (Parent)(Thread)(Expand)


[info]jamesd
2005-09-16 10:41 pm UTC (link)
HTTP headers are useful but not enough. At least four cases where they fail:

1. The unending stream of posts Brad's working on, where the whole stream would get one header but individual posts should have a way to block the whole document tree below a certain depth.

2. Friends pages, where posts have individual properties.

3. Comments like this, where each comment has individual properties.

4. New posts lists, where some items shouldn't be indexed but are there for humans to see.

So, needs to be post-specific or still more fine-grained to really do the job well.

Given the option, I'd do it within posts by hand as well as any higher level items. Less chance of a new protocol or presentation circumventing whatever protections are already in place that way.

I fully agree with your half-baked solution reservation.

(Reply to this) (Parent)


[info]mart
2005-09-16 07:07 am UTC (link)
Robots are user-agents too

(Reply to this) (Thread)(Expand)


[info]zach
2005-09-16 07:18 am UTC (link)
Aw poor robot...

(Reply to this) (Parent)


[info]davidkevin
2005-09-16 08:47 am UTC (link)

"Your droid! We don't serve their kind!"

-- bartender, Mos Eisley Spaceport Cantina

(Reply to this) (Parent)(Thread)(Expand)


[info]gizmometer
2005-09-16 12:23 pm UTC (link)
Oh, yay. :)

(Reply to this)


[info]chgowiz
2005-09-16 02:38 pm UTC (link)
I'll believe it when I see it - I think there'll be a few growing pains.

(Reply to this)


[info]isidorenabi
2005-09-16 06:16 pm UTC (link)
seems to be OK now (for me at least) - yesterday when i searched on my username i got hits from my own lj, now i only get hits from other people who've used my username in their posts.

(Reply to this) (Thread)


[info]njyoder
2005-09-16 08:40 pm UTC (link)
Yes, I checked too, and all the hits from my journal are gone now. Google cleared up the problem quickly. What I'm wondering is how Google could make such an error in the first place. You'd think their Google blog indexing code would use the same meta-tag checking code as their general search engine.

(Reply to this) (Parent)(Thread)(Expand)


[info]metaphorge
2005-09-16 10:04 pm UTC (link)
IMHO LiveJournal should not block search spiders for public entries at all. Such moves largely defeat the point of the Internet.

If someone doesn't want their entry to be accessible, it should not be public. Period.

(Reply to this) (Thread)(Expand)


[info]njyoder
2005-09-16 10:20 pm UTC (link)
That's totally illogical. From a benefit/drawback standpoint, your proposal has one extra drawback and absolutely no benefits. If people were forced to do that, then they'd just make their entries FO, which would defeat the whole purpose. In that case, no humans other than people they've friended can read it. In the case where they CAN block spiders, then you have the added benefit of allowing any humans to read it.

Your whole "point of the internet" emotive argument is stupid anyway. The internet serves many purposes and one of them is not to have 100% of information accessible via search engines. Do you want your medical records accessible via search engines? No? What's wrong? I thought the point of the internet was to have everything accessible via a search engine.

(Reply to this) (Parent)(Thread)(Expand)


[info]elementa
2005-09-17 04:03 am UTC (link)
as of 10 minutes ago, I have hits on many of my public entries AND one of my friends-protected entries which includes text of the entry.

any idea on when this will change and why it happened?

(Reply to this)


[info]7rin
2005-09-17 05:49 am UTC (link)
Cor... you got a personal email from Google! Wow Brad, you must be like, famous, or something.

(Reply to this)


[info]tuscendi
2005-09-17 11:53 pm UTC (link)
As of right now, two days later, a bunch of my LJ posts (and my LJ journal is Friends Only by default) are fully available on Blog Search.

I know I'm an ignoramus about these things, but I'm still shocked that our journal cannot be protected from the unscrupulosity or incompetence of the people who run the search engines. I had assumed our privacy was fully protected by Live Journal which I trust.

(Reply to this)

It also indexes on the Journal Title
[info]hilltop
2005-09-19 07:55 pm UTC (link)
My journal has the Title of Unquiet Ether.
Do a search on that, and I'm turning up all over the place, although Hilltop isn't.

(Reply to this)


[info]f_l_i_r_t
2005-09-19 10:13 pm UTC (link)
I agree with the other people who say they cannot believe these journals cannot be protected better. You tell us to stop overreacting, but it has been over a week and all our sites are still cached and showing on Google Blog Search. I am not impressed.

Time to ditch Live Journal? A bunch of my friends and I all feel the same. I think this really needs to be addressed in a more serious fashion and not fobbed off as 'hysterical users'.

When friends only posts still show on the search there is an issue with your security code, no?

(Reply to this)


[info]surrealist_post
2005-09-22 09:33 pm UTC (link)
This is unrelated but.. how do I hide the new 'schools' identifier listed in the bio page? I have absolutely no use for it and don't want it there, but I can't seem to find a 'hide' option for it. I assumed that it would be like memories, if you have none, it doesn't show, but it does show, and I'd like it gone. Thanks.

(Reply to this) (Thread)


[info]prissi
2005-09-24 06:41 am UTC (link)
You currently can't hide the 'schools' section of your bio page.

If you want to suggest that this feature be added to LiveJournal, you can offer it up at [info]suggestions. FAQ on suggestions here.

(Reply to this) (Parent)


[info]imfallingup
2005-09-29 02:14 am UTC (link)
any more word on this? i'm still pulling up a friendslocked entry right now...

(Reply to this)

