It’s not easy to pull the wool over the eyes of the search engines
The “Boy Who Cried Wolf” (BWCW) Syndrome
I came across an XML sitemap file the other day for a website; it validated correctly and included all of the most topical content the site had produced. I think it was produced by the Content Management System (CMS) utilised by the site. This is a great approach to ensuring your website is easily parsed by the search engines. [Update: It was definitely produced by the CMS, which was Business Catalyst. I have seen this on a number of other sites and they were all Business Catalyst sites. Since this post was published I have worked on a number of Business Catalyst sites and have turned off the sitemap module and manually created a more appropriate sitemap.]
What I found interesting was that the <lastmod> tag had an identical time stamp for every page the file listed. Today (10 Nov 2010) the time stamp is 2010-11-10T00:39:08+00:00. I presume a cron task fires at 12:39AM every day, determines the list of eligible pages and produces the sitemap file. This is terrific, except: how can every page on the site (more than 50 of them) have been “last changed” at exactly the same moment, on every day that I have inspected the site? That doesn’t make sense.
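For readers who haven’t looked inside a sitemap before, the excerpt below is an illustration of the pattern I saw. It is not the actual file, and the example.com URLs are made up; the point is that every <url> entry carries exactly the same <lastmod> value, regardless of when the underlying page really changed.

```xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2010-11-10T00:39:08+00:00</lastmod>
  </url>
  <url>
    <loc>http://www.example.com/blog/a-post-from-last-year</loc>
    <lastmod>2010-11-10T00:39:08+00:00</lastmod>
  </url>
</urlset>
```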
What are the implications?
The site concerned publishes entries on its blog weekly at most, so I’m reasonably certain that on most days there are no changes. Search engine robots are the most frequent users of sitemap files. If the site is known to any search engine, its robots will visit periodically, read the sitemap file, note that the last-change dates are newer than when the site was last verified, and begin to parse the apparently changed pages. On this site it looks as though every page has changed. If the robots compare before-and-after copies of each page they will find no differences for the majority of them; occasionally they will find a new page or two.
This has an impact in a number of ways:
- the search engines won’t appreciate wasting effort parsing pages that have not changed
- the site wastes bandwidth supplying, and the search engines waste bandwidth retrieving, pages that did not need to be parsed at all.
If I were writing the logic for a search engine, I would either stop believing that site when it claims a new last-change date, or ignore the <lastmod> field entirely for all websites.
What can be done about it?
I’m not familiar enough with the CMS used here to know if this “feature” can be configured to better represent reality, but that’s where I would start if I were asked. Other options seem to be:
- do nothing and hope the search engines don’t penalise you (never a good choice)
- stop the CMS generating the sitemap and either produce it manually or find another tool to produce it (one way of doing this yourself is sketched after this list).
- change to a CMS that is less likely to produce fake data. (This is not an easy choice as the implications of this sort of change are large.)
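If the second option appeals and your pages end up as static files, a sitemap with honest <lastmod> values can be hand-rolled quite simply. The sketch below is illustrative only and not specific to any CMS: it assumes the rendered pages sit in a local public/ directory and that the base URL is example.com, and it takes each page’s last-change date from the file’s modification time rather than from the time the script happens to run.

```python
# sitemap_gen.py - minimal sketch: build a sitemap whose <lastmod> values
# reflect when each page actually changed (here, the file's mtime).
# The directory, base URL and file pattern are illustrative assumptions.
from datetime import datetime, timezone
from pathlib import Path
from xml.sax.saxutils import escape

SITE_ROOT = Path("public")            # assumed: rendered pages live here
BASE_URL = "https://www.example.com"  # assumed base URL

entries = []
for page in sorted(SITE_ROOT.rglob("*.html")):
    # Use the file's real modification time, not "now".
    mtime = datetime.fromtimestamp(page.stat().st_mtime, tz=timezone.utc)
    loc = f"{BASE_URL}/{page.relative_to(SITE_ROOT).as_posix()}"
    entries.append(
        "  <url>\n"
        f"    <loc>{escape(loc)}</loc>\n"
        f"    <lastmod>{mtime.strftime('%Y-%m-%dT%H:%M:%S+00:00')}</lastmod>\n"
        "  </url>"
    )

sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    + "\n".join(entries)
    + "\n</urlset>\n"
)
Path("sitemap.xml").write_text(sitemap, encoding="utf-8")
```

For a database-driven CMS the idea is the same, except the date would come from each content record’s “last modified” column rather than from the filesystem.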
I’m not prepared to name the errant CMS in this post* but if you would like to contact me I will give you the name.
[* Update: I’ve changed my mind. The CMS is Adobe’s Business Catalyst. It is generally reliable and has good features, but in this area, at the time I encountered it, I believe it was deficient.]
If you have any questions on this or any other topic raised on this website, or anything to do with Search Engine Optimisation, please use the MidBoh Contact Form and we’ll do our best to answer your questions.
The moral of the story
“Nobody believes a liar, even when he is telling the truth.”
If you need a refresher on the “Boy Who Cried Wolf” there are plenty of versions available via Google. Here is one I found:
http://www.storyarts.org/library/aesops/stories/boy.html
Photo Credit: Tambako the Jaguar via Compfight cc