Google crawler problem

Postby -J- on Mon Jun 02, 2008 10:49 am

Hi,

It looks like I've got a very weird problem with my blog platform :?

It's a blog platform dedicated to expatriates, expat blog. The weird thing is that Google doesn't crawl all the pages of the blogs. For example, take a very old blog: http://hebert.expat-blog.net/

Why does the site: command on Google return only the index page?

I have noticed a problem with the pagination links: they look like http://hebert.expat-blog.net//page/2 (note the double slash), but I don't know how to sort this out.

Can anyone help me?

Thanks in advance,

J
-J-
 
Posts: 116
Joined: Sun Apr 03, 2005 4:02 pm

Re: Google crawler problem

Postby jondaley on Mon Jun 02, 2008 11:36 am

I don't know about why google isn't crawling your other pages.

For the page links, I think the trouble is that your page format starts with a slash (which it needs to, since you might have a link like /archives/2008/page/2), and your blog format also has a trailing slash. I am not sure if that is required - you might try taking out the "/" (so probably it is just a "$") and see if that works. That will likely fix your // issue, but it might break the front page; I am not sure.
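To illustrate how the double slash arises - a rough sketch in Python, not LifeType's actual code, and the function names here are made up:

```python
# Sketch: the page path is appended to the blog link, so a blog
# format ending in "/" plus a page format starting with "/" gives "//".

def make_blog_link(base_url, blog_format):
    # A blog format of "/$" yields a link ending in "/";
    # a format of "$" yields no trailing slash.
    return base_url + blog_format.rstrip("$")

def make_page_link(blog_link, page_format, page):
    return blog_link + page_format.replace("{page}", str(page))

base = "http://hebert.expat-blog.net"

link = make_blog_link(base, "/$")               # ends in "/"
print(make_page_link(link, "/page/{page}", 2))
# -> http://hebert.expat-blog.net//page/2  (the double slash)

link = make_blog_link(base, "$")                # no trailing "/"
print(make_page_link(link, "/page/{page}", 2))
# -> http://hebert.expat-blog.net/page/2
```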
jondaley
Lifetype Expert
 
Posts: 6169
Joined: Thu May 20, 2004 6:19 pm
Location: Pittsburgh, PA, USA
LifeType Version: 1.2.11 devel branch

Re: Google crawler problem

Postby -J- on Mon Jun 02, 2008 4:10 pm

Thanks for your reply. Do you mean the base_url shouldn't end with a /? That is actually already the case...

Only the pagination URLs are wrong, and I don't know how to sort this out... If I delete the / at the beginning of the URLs, it doesn't work :cry:

I may have found a solution. I am testing it on another LT blog (not on my platform). It's a bit late; I will recheck tomorrow... It would be great if our friend Google could finally crawl the pages!
-J-
 
Posts: 116
Joined: Sun Apr 03, 2005 4:02 pm

Re: Google crawler problem

Postby jondaley on Mon Jun 02, 2008 7:59 pm

I tested it out on my blog, and it worked, though there is an extra redirect on links to the main URL.

My base url: http://jon.limedaley.com/plog
My subdomains_url: http://{blog_domain}/plog (I only have one blog, but I have it setup to use the multi-domain stuff for testing)
URLs:
blog_url = /$
page_url = /page/{page}

With these settings, I get the double slash, but Apache is smart enough to do the right thing. This is probably how most people have their URLs set up, or at least with similar results.

I changed the blog_url to simply $

This caused the page URLs to be generated correctly with just one slash, but the links to the main URL became http://jon.limedaley.com/plog, which Apache then redirects to http://jon.limedaley.com/plog/.
I'd rather have the double slash than an extra redirect on every hit to the home page.
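If a particular host's Apache doesn't handle the double slash gracefully, another option would be to normalize the generated URL yourself. A rough sketch of the idea in Python - illustrative only, this is not LifeType code:

```python
import re
from urllib.parse import urlsplit, urlunsplit

def collapse_slashes(url):
    """Collapse runs of '/' in the path, leaving the scheme's '//' alone."""
    parts = urlsplit(url)
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))

print(collapse_slashes("http://hebert.expat-blog.net//page/2"))
# -> http://hebert.expat-blog.net/page/2
```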
jondaley
Lifetype Expert
 
Posts: 6169
Joined: Thu May 20, 2004 6:19 pm
Location: Pittsburgh, PA, USA
LifeType Version: 1.2.11 devel branch

Re: Google crawler problem

Postby -J- on Tue Jun 03, 2008 1:20 am

The great thing with your config is that Google reads all your pages (check here).

Could it be my .htaccess that blocks Googlebot?
-J-
 
Posts: 116
Joined: Sun Apr 03, 2005 4:02 pm

Re: Google crawler problem

Postby jondaley on Tue Jun 03, 2008 12:31 pm

I wonder if Google is getting confused because your robots.txt is invalid?
http://hebert.expat-blog.net/robots.txt
jondaley
Lifetype Expert
 
Posts: 6169
Joined: Thu May 20, 2004 6:19 pm
Location: Pittsburgh, PA, USA
LifeType Version: 1.2.11 devel branch

Re: Google crawler problem

Postby -J- on Tue Jun 03, 2008 4:08 pm

errrr... I don't use robots.txt files :roll:
-J-
 
Posts: 116
Joined: Sun Apr 03, 2005 4:02 pm

Re: Google crawler problem

Postby jondaley on Wed Jun 04, 2008 8:28 am

Right, and that might be the problem, since your server is returning something when robots.txt is requested. I don't know what search engines do when they request a robots.txt and get HTML back. Since you aren't being indexed, it seems reasonable to put a proper robots.txt in place and see what happens. You might also look into Google's webmaster tools and see what they can tell you.
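For reference, the simplest valid robots.txt that allows everything is just two lines of plain text served from the document root (this is a generic example, not taken from the site in question):

```text
User-agent: *
Disallow:
```

An empty Disallow value means nothing is blocked, so all crawlers may fetch all pages.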
jondaley
Lifetype Expert
 
Posts: 6169
Joined: Thu May 20, 2004 6:19 pm
Location: Pittsburgh, PA, USA
LifeType Version: 1.2.11 devel branch

Re: Google crawler problem

Postby Mischa on Sat Dec 22, 2012 9:15 am

jondaley wrote:I tested it out on my blog, and it worked, though there is an extra redirect on links to the main url.

My base url: http://jon.limedaley.com/plog
My subdomains_url: http://{blog_domain}/plog (I only have one blog, but I have it setup to use the multi-domain stuff for testing)
URLs:
blog_url = /$
page_url = /page/{page}

With these settings, I get the double slash, but Apache is smart enough to do the right thing. This is probably how most people have their URLs set up, or at least with similar results.

I changed the blog_url to simply $

This caused the page URLs to be generated correctly with just one slash, but the links to the main URL became http://jon.limedaley.com/plog, which Apache then redirects to http://jon.limedaley.com/plog/.
I'd rather have the double slash than an extra redirect on every hit to the home page.


"apache is smart enough to do the right thing" - this may be the case much of the time, but it was not in my case at my web host.

"I changed the blog_url to simply $" - this worked very well for me. No more double slashes, and clicking the pagination at the bottom of the page works now. Thank you for this tip!
(Changing the "/$" to "$" is done in Admin/Global/URLs, at blog_link_format.)
Using lifetype-1.2.12, pretty standard config.
Mischa
 
Posts: 39
Joined: Fri Jun 20, 2008 9:27 am
Location: Åland Islands
LifeType Version: 1.2.12

