'bots and phpLD
kickass
08-27-2005, 05:44 PM
I remember Sharr posting about this: Googlebots and other arachnids are going absolutely nuts on phpLD sites and returning MANY more links than actually exist.
Then Jim Westergren offered me a sitemap of my site made by a tool he used. Well, thanks Jim, but a LARGE portion of my site was unspidered and unspiderable by that particular tool, namely my PR6 WordPress blog. He also mentioned that he had a lotta duplication in phpLD when he used this tool (I believe he has the locally installed version, though I tried the web version and it did the same thing on the phpLD area). That sent me on a search for a better tool to accomplish this, one which would grab my WordPress links, and I found one:
johannesmueller.com/gs/
The first time I ran this tool, which sends its own arachnids all over your site, it went totally nuts the same way that Sharr described the Googlebots doing. I'm sure the same problem exists with the Google Python sitemap generator.
This bit seems to be the problem:
Content visible to registered users only.
This is a bit of the url in repeated links. For example:
http://kickasswebdesign.com/webgeekdir/Java/root+'/ASP_NET/
Now, if you look at my directory structure you'll see that ASP_NET is not a child of Java. But every one of the directories and subs were duplicated in this way in multiple urls and all had that same root+' within the url and at different, and sometimes repeated, levels.
Now, when I'm using this program locally I can filter that bit out and none of these mutant and nonexistent urls will show up. But how do I do this for googlebots?
Anyway, I'm posting this in hopes it will spark a lightbulb moment for one of you coding gurus. Is there a bit of code that needs to be changed? An .htaccess rule that can be written? A miracle bullet for robots.txt? There's gotta be something. Otherwise those of us who develop large directories might end up with beaucoup bandwidth and server performance issues, and hosts might make phpLD perscripta non grata.
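For anyone who wants to do the same local filtering by hand, here is a minimal sketch of the idea. It assumes you already have the crawled URLs as a plain list; the `BOGUS_FRAGMENT` constant, the function name, and the sample list are illustrative and not taken from any particular sitemap tool.

```python
# Sketch: drop crawled URLs that contain the bogus "root+" fragment.
# BOGUS_FRAGMENT and the sample URLs are illustrative assumptions.
BOGUS_FRAGMENT = "root+"


def drop_mutant_urls(urls):
    """Return only the URLs that do not contain the bogus fragment."""
    return [u for u in urls if BOGUS_FRAGMENT not in u]


crawled = [
    "http://kickasswebdesign.com/webgeekdir/Java/",
    "http://kickasswebdesign.com/webgeekdir/ASP_NET/",
    "http://kickasswebdesign.com/webgeekdir/Java/root+'/ASP_NET/",
]
clean = drop_mutant_urls(crawled)
print(clean)
```

This only cleans up a list you control. It does nothing about what Googlebot itself requests, which is the open question in this thread.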
minstrel
08-27-2005, 07:46 PM
This increased googlebot activity is being reported all over for the last couple of weeks, kickass -- on all different types of sites, too -- forums, dynamic sites, static sites.
I don't think it's a phpld problem. Something's up with Google or Googlebot.
kickass
08-28-2005, 08:18 PM
Increased activity just indicates one of google's updates is coming. What I described above is NOT that. It's the way the bots end up stuck in some kind of loop and show 10,000 links where only a few hundred actually exist, a big problem that may hurt all of us in the way google perceives our site.
minstrel
08-29-2005, 06:41 AM
I understood that but it's not a problem with phpld, or at least not just with phpld.
The first report I saw about Googlebot doing this was last fall in a DigitalPoint thread regarding a phpBB forum -- I believe the member's name was Sammy or something like that.
More recently, there have been several reports of similar activity at DigitalPoint and elsewhere.
Ap0s7le
08-29-2005, 08:10 AM
This isn't a "fix", as I'm not a javascript person and haven't had time to look into it.
What I did tonight was recreate the problem. The "root+" is coming from the HITS javascript at the bottom. Since I don't really short my LINKS anyhow, I removed that at the bottom, retried, and of course the problem was gone.
This isn't the best solution, but considering I care more about what SE's get right now than sorting by hits it was the best solution for me.
Bedtime
*gone*
Yodaya
08-29-2005, 09:37 AM
Any other ideas on how to fix this problem?
kickass
08-29-2005, 09:48 PM
Casey, are you saying removing the whole javascript gets rid of the problem?
Though it's nice to have the hits thing there, I'd rather have SE arachnids perceive my site in a good way, and not a hurtful one.
What do you mean "short" your links? I hope you're not talking about the mod rewrite thing and nicer URLs . . . which would put us in a quandary: pretty URLs and way too many of them, or crappy URLs but only what's really there.
Ap0s7le
08-30-2005, 01:41 AM
I'm an idiot :) short = sort.
Sometimes one word slips out when I mean another. Blasted multiple personalities.
Yeah, if you remove the JS at the bottom it'll not show up.
It's what I did, because I care more what the SE's get than letting someone SORT (haha) by Hits right now.
later
Jim_Westergren
09-04-2005, 10:24 PM
Thanks! this works now.
Pity that I spent more than an hour today before reading this thread to make a very nice sitemap when I just could have disabled the javascript.
Thanks.
Thanks for the redirection from the other thread, kickass.
How and where might I disable this code Ap0s7le?
Thanks.
Jez.
Jim_Westergren
09-05-2005, 11:01 AM
Content visible to registered users only.
You find it in main.tpl at the bottom. There is a javascript there; I have simply put comment tags around it in case I want it back later. If you do that, you have to know that it won't be able to track the clicks on the links and the sorting by hits won't be accurate. For my directory this doesn't matter, so I have commented out the sort by hits as well.
Thanks Jim, I copied yours. I hope that's ok as I had no idea on how to comment out a tag. I hope it sorts it out.
kickass
09-05-2005, 03:55 PM
For future reference, complete instructions on removal of the problem javascript are in the wiki here:
Wiki Googlebot Issues
Jim_Westergren
09-05-2005, 04:10 PM
Content visible to registered users only.
No problem.
If you need help you can PM me and I will help you.
Jim_Westergren
09-05-2005, 04:51 PM
Thanks, didn't see your post.
The alpha sorting still works; it is only the hits that are no longer tracked, I think.
Sharr76
09-05-2005, 06:42 PM
Hi all,
Well, I did exactly as you said and it's not changed anything... still showing thousands... :shock:
kickass
09-06-2005, 03:24 PM
Geez, a friend just set me up on cron for googlygoodness with an automatically generated sitemap, and it spits out a beautifully clean list. Works fine for me re google spiders. It's gotta be something else with your config, Sharr76.
Content visible to registered users only.
Thanks Jim, that's appreciated.
alain76
09-09-2005, 10:14 AM
Content visible to registered users only.
'root' is a variable pointing to the main directory where your phpLD is located.
So for kickass' site ('http://kickasswebdesign.com/webgeekdir/'... sorry, yours was the first I found :)) 'root' would be equal to '/webgeekdir'.
The '+' is just to combine it with the next element. Like the '.' does in PHP.
While I know this doesn't solve the problem, I just thought I'd point it out :)
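To make that concrete, here is a small sketch of what the script intends versus what a spider might end up with. The second half is only one plausible explanation, not something confirmed in this thread: it assumes the crawler lifts the text `root+'/ASP_NET/` straight out of the script source and resolves it as a relative link against the current page. The variable names are illustrative.

```python
from urllib.parse import urljoin

# What the script intends: plain string concatenation.
root = "/webgeekdir"
intended_path = root + "/ASP_NET/"
print(intended_path)

# Assumption: a crawler reads the script source as text, picks up the
# fragment below, and treats it as a relative URL on the Java page.
page_url = "http://kickasswebdesign.com/webgeekdir/Java/"
scraped_fragment = "root+'/ASP_NET/"
mutant_url = urljoin(page_url, scraped_fragment)
print(mutant_url)
```

Under that assumption the result matches the mutant URL kickass posted, with ASP_NET appearing as a child of Java.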
alain76
09-09-2005, 10:34 AM
I changed the Javascript to:
Content visible to registered users only.
And this seems to work on my site. Can anybody confirm ?
I never checked to see if I had problems before changing this, so I don't know if this'll actually make a difference.
On a related note: since when do SE Spiders follow JavaScripts ?
Ap0s7le
09-09-2005, 05:47 PM
It didn't help one of my test installs.
I haven't had time to look into it, but a pattern seems to emerge.
It'll grab a cat, the sorted cat and then the root+
Like so
/?s=H
/?s=A
/root+
It does this for each category that has links in it.
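A rough way to see how much of a crawl is this kind of duplication is to map every crawled URL back to its category and count. The sketch below assumes the three patterns listed above; the `/Java/` paths and the function name are made up for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def canonical_category(url):
    """Map a crawled URL back to its category path.

    Drops the query string (?s=H, ?s=A) and anything from "root+" onward.
    """
    path = urlsplit(url).path
    cut = path.find("root+")
    if cut != -1:
        path = path[:cut]
    return path


# Illustrative crawl results for one hypothetical category.
crawled = ["/Java/", "/Java/?s=H", "/Java/?s=A", "/Java/root+'/ASP_NET/"]

groups = defaultdict(list)
for url in crawled:
    groups[canonical_category(url)].append(url)

for category, variants in groups.items():
    print(category, len(variants))
```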
I've got to go.
later
offmaster
09-19-2005, 04:10 PM
Casey, can we expect a solution to this issue? I'd really like to use the hit link sorting and have Google mapping my site correctly.
David
09-19-2005, 04:31 PM
I know Casey is out of town for a few days. We need to remember to ask him when he gets back. And if anyone can send him a donation, he does a lot of work for free already.
offmaster
09-19-2005, 04:47 PM
Content visible to registered users only.
Yes, Google simply follows these links and makes 3 copies of the cat.
The solution should be a JS link or a nofollow attribute on these links.
offmaster
09-26-2005, 05:26 PM
Finally, what's the problem? The JS or the sort links?
StockPot
10-04-2005, 09:11 PM
that's too bad... Pagerank isn't working for my server, so that leaves just alphabetical. Doesn't seem fair to the alphabetically challenged yet quality sites.
Dopeman
01-13-2006, 10:41 PM
Is there already a solution to this problem? (get link hits counted and properly mapped by Google)
Because I don't understand the solution with the cats!?? (cats and dogs) :D
joost
01-29-2006, 09:15 PM
bump
Is there a solution for this problem available?
I just tried a linkchecker on my new PHPLD and it found about 30000 pages, which is far too many. :shock:
greetz joost
PS thanks for this wonderful piece of software
melaniejk
02-11-2006, 05:24 PM
I deleted the javascript mentioned from the main.tpl file and uploaded it.
Afterwards, I noticed the alphabetical doesn't work at all.
I thought I would just lose the Hits sort, which I did.
Neither worked and I decided to go back to what I had.
So, I repasted in the javascript into the main.tpl and uploaded it.
Nothing fixed. The Alphabetical and the Hit Sort don't work at all.
Can anyone tell me what went wrong or how I can fix this.
Example page:
http://www.genealogygeek.com/Genealogy/tools/
Thank you.
Best wishes,
Melanie
melaniejk
02-14-2006, 01:40 AM
bump
bigdog
02-14-2006, 05:46 AM
Looks like you have a problem with your template.
When you click on Alphabetical, this command is run, http://www.genealogygeek.com/US/MI/?s=A. The page that loads doesn't change the link for Alphabetical.
To see what I mean, run this command:
http://www.genealogygeek.com/US/MI/?s=D
Good luck,
Bill :)
melaniejk
02-14-2006, 06:05 AM
Hi bigdog.
I'm not sure I understand. When I visit other directories and click on the alphabetical or hit sort the only thing that changes in the url is:
/?s=A or /?s=H
Sorry, I'm a newbie when it comes to php.
Thanks,
Melanie
bigdog
02-15-2006, 06:25 AM
Hi Melanie,
I am still learning how the PHPLD system works. So I am in error on what I told you previously about the template. This is what I came up with through experimentation.
Content visible to registered users only.
When a request is for PAGERANK, the query will pull links that are ordered descending.
When a request is for HITS, the query will pull links that are ordered descending.
When a request is for TITLE, the query will pull links that are ordered ascending.
When a request is for DATE_ADDED, the query will pull links that are ordered descending.
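The four rules above boil down to a lookup table of field and direction. Here is a minimal sketch of that table applied to an in-memory list; the lowercase field names and the sample links are assumptions for illustration, and the real phpLD code builds a database query rather than sorting a list.

```python
# Field and direction per sort, as listed in the four rules above.
# The lowercase field names are hypothetical.
SORT_ORDERS = {
    "PAGERANK": ("pagerank", True),      # descending
    "HITS": ("hits", True),              # descending
    "TITLE": ("title", False),           # ascending
    "DATE_ADDED": ("date_added", True),  # descending
}


def sort_links(links, sort_name):
    """Return the links ordered according to the requested sort."""
    field, descending = SORT_ORDERS[sort_name]
    return sorted(links, key=lambda link: link[field], reverse=descending)


links = [
    {"title": "Beta", "hits": 5, "pagerank": 3, "date_added": "2006-01-10"},
    {"title": "Alpha", "hits": 2, "pagerank": 4, "date_added": "2006-02-01"},
]
by_title = [link["title"] for link in sort_links(links, "TITLE")]
by_hits = [link["title"] for link in sort_links(links, "HITS")]
print(by_title, by_hits)
```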
Content visible to registered users only.
It is getting late here, so I am probably not making any sense. Try modifying your template with the changes above.
BTW, to stop from defaulting to 'P' or PAGERANK, I commented out the following code inside of index.php.
Content visible to registered users only.
Good luck,
Bill
bigdog
02-15-2006, 06:28 AM
On my last comment, it is probably not necessary to comment out the code in the index. I don't have PR set up on my site.
melaniejk
02-15-2006, 03:10 PM
Hi Bigdog.
Well, I pasted that section into main.tpl as suggested.
I'm sad to report that it did not work.
It gave the page an extra Sort by Alphabetical link.
But, it did not work either.
:( :(
Melanie
bigdog
02-15-2006, 05:41 PM
Whew, I am still having nightmares from last night.
Melanie,
From what I can tell by going to this url, http://www.genealogygeek.com/Genealogy/tools/ , is that your sort is by hits.
The code that appears to force a sort by hits is as follows (index.php):
Content visible to registered users only.
If there is no PAGERANK, or SHOW_PAGERANK is off, and $sort is PAGERANK, then sort by hits.
No matter what you pass (?s=A) to index.php, you are going to sort by hits. I think that is why I proposed that you comment out this code. So that you can see what happens with the template code that I posted. The new code just swaps from Date Added to Title ASC.
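Since the actual index.php snippet is hidden, here is only a sketch of the rule as described in words: a PageRank sort falls back to another sort when PageRank is missing or not shown. The function name, the parameter names, and the single-letter codes ('P', 'H', 'A') are assumptions based on the URLs quoted in this thread.

```python
def effective_sort(requested, pagerank_available, show_pagerank, fallback="H"):
    """Return the sort code that will actually be used.

    A request for PageRank ('P') falls back to `fallback` when PageRank
    is unavailable or is not being displayed.
    """
    if requested == "P" and not (pagerank_available and show_pagerank):
        return fallback
    return requested


# No PageRank on the server and the default request is 'P': hits win.
print(effective_sort("P", pagerank_available=False, show_pagerank=False))
# Same situation, but with the fallback switched to alphabetical.
print(effective_sort("P", False, False, fallback="A"))
```

With the rule written this way, an explicit `?s=A` request would pass through untouched, so if every request ends up sorted by hits, the sort value is probably being reset to 'P' somewhere before this check.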
Can you explain again what you want to do? I keep reading your posting, but I must not be getting it.
Thanks,
Bill :)
melaniejk
02-15-2006, 06:41 PM
Hi Bigdog.
Basically, I would like the default sort to be alphabetical.
When a person visits a page, I want the links to be in alphabetical order.
Could I just replace the H with a letter A ?
bigdog
02-15-2006, 07:37 PM
Here is what I would suggest:
:arrow: Inside of Admin>Edit Settings>Directory
Content visible to registered users only.
:arrow: Inside of Index.php, change the following code:
Content visible to registered users only.
:arrow: Remove URL Sort Code from main.tpl:
Content visible to registered users only.
How does this sound?
Bill :D
melaniejk
02-15-2006, 08:44 PM
Hi.
Well, I already had the default sort set to alphabetical.
So, I went and changed the code you mentioned in index.php
I then checked the directory.
The links are now in alphabetical order. Yes!
So, why do I need to do the next part? (remove url sort code from main.tpl )
Can I just leave it now?
Thank you, Big Dog.
Best Wishes,
Melanie
bigdog
02-15-2006, 09:07 PM
Melanie,
I am glad that steps 1 and 2 worked. You mentioned that you wanted the links to be in Alphabetical order. So that is why I suggested that you take out the "link" code in the template.
I just checked your site using the "hits" link. That one works, but you are unable to use the Alphabetical link after hitting the "hits" link. It might make more sense to modify the H/A tpl code so that you can toggle between alphabetical and hits. Just a suggestion for you to try.
Here is the code if you are interested:
Content visible to registered users only.
Good luck to you,
Bill :D
bigdog
02-15-2006, 09:45 PM
The code for tpl didn't include anchors. It might help your users if the page scrolls to the link that they click.
Here is the code if you are interested:
Content visible to registered users only.
clubracer
02-19-2006, 10:37 AM
Back to the indexing topic: the problem was that Google's spiders were indexing too many pages.
Content visible to registered users only.
What do you think about this workaround?
I'm referring to the sort-by-PageRank and sort-by-hits links on the result pages.
Just exclude the spiders from indexing those sort-by links:
1. Use your robots.txt to exclude Google and the other search engines from the sorted cats: PageRank, Hits, etc.
2. Use nofollow and noindex tags in main.tpl for the sort-by links.
Now you can use both features: sorting and good indexing by spiders.
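As a rough illustration of step 2, this is what a sort link with a nofollow attribute would look like when rendered. It is a sketch only: the real links come out of main.tpl, and the function name, the label table, and the `/Java/` path here are made up.

```python
from html import escape

# Hypothetical labels for the sort codes seen in this thread.
SORT_LABELS = {"H": "Hits", "A": "Alphabetical"}


def sort_link(category_path, code):
    """Render a sort link that asks spiders not to follow it."""
    href = f"{category_path}?s={code}"
    label = SORT_LABELS[code]
    return f'<a href="{escape(href, quote=True)}" rel="nofollow">{escape(label)}</a>'


hits_link = sort_link("/Java/", "H")
print(hits_link)
```

Visitors can still click the link and get the sorted view; only well-behaved spiders are asked to skip it.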
joost
03-03-2006, 10:19 PM
Thanks for your help clubracer, but my problem is another.
I disabled everything except alphabetical order and it still didn't work.
I found out, if a page is called with a trailing question mark e.g.:
http://www.domain.com/cat/subcat/?
then one of the paging links (between previous and next)
is corrupted.
The link for the actual page 1 is only partial, a few characters are missing.
The bots then find this link, it gets rewritten incorrectly, and the resulting loop makes the bots go nuts.
I forgot to stop my linkchecker a few days ago, and generated 1 gig of traffic. lol
If anybody has an idea or needs an example please help
joost
joost
03-03-2006, 10:45 PM
I just found another example:
if you visit
h++p://www.allthetopsites.com/Consultants/Databases/
at the bottom of the page the link behind [1] is
h++p://www.allthetopsites.com/Consultants/Databases/
That's OK.
but if you visit:
h++p://www.allthetopsites.com/Consultants/Databases/?
at the bottom of the page the link behind [1] is
h++p://www.allthetopsites.com/Consultants/Databas
this one is only partial and misleads the bots.
greets joost
joost
03-05-2006, 01:36 PM
Maybe it has something to do with this little hack in function.pager.php?
Content visible to registered users only.
Possibly the last characters are cut off with -strlen??
I'm not much of a programmer, so I don't dare to change this.
Thanks in advance
Joost
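The hidden pager code can't be seen here, so what follows is only a guess at the kind of bug joost suspects: removing a query suffix by chopping a fixed number of characters off the end of the URL. The `?p=1` suffix and both function names are hypothetical. Under that guess, a URL that ends in a bare `?` loses real characters, which happens to match the truncated link above.

```python
def strip_suffix_by_length(url, suffix="?p=1"):
    """Naive approach: assume the URL ends with `suffix` and cut that many characters."""
    return url[: -len(suffix)]


def strip_query_safely(url):
    """Safer approach: cut at the first '?' if there is one."""
    return url.split("?", 1)[0]


with_page = "/Consultants/Databases/?p=1"
bare_mark = "/Consultants/Databases/?"

print(strip_suffix_by_length(with_page))  # suffix really is there: fine
print(strip_suffix_by_length(bare_mark))  # suffix is not there: path gets eaten
print(strip_query_safely(bare_mark))
```

Cutting at the `?` instead of counting characters gives the same result whether or not the expected suffix is actually present.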