Digging in deeper to identify what the true culprits may be for our index cache bloat.
This month we are going to show the next steps in identifying some of the interesting findings from last month’s SEO Tool Tip post using Google as our SEO tool for high-level observations. This is meant to provide a focused approach to doing a quick SEO review when you just “need to quickly get to know a new website”.
However, in this installment, it is time to take what we observed in the Google index and look more closely at the website itself, to identify the causes of what we saw and the opportunities available. We will also compare notes with what Bing sees, in order to test any hypotheses we begin to develop. So, let’s dive right in.
We left off looking for the culprit behind a huge drop in Google listings for VizionInteractive.com: the count shown on the first results page kept decreasing as we paged through Google’s records. We wondered whether this was tied to default WordPress functionality, plug-ins, or custom modifications that might exist on our blog platform.
We’ll start with the largest suspected culprit, the blog categories, and specifically with the category called “blog”. It sounds like the most generic category and probably has the most tagged articles, since a simple glance at the blog homepage seems to show that every post gets this tag:
When we review this category page and begin looking at the source code, it becomes a bit clearer why these may be getting into the Google cache, but not getting displayed as valid search results simply by looking at some of the tags included in the <head> section:
The particular tag that catches my eye is the robots tag providing the directions “noindex, follow” at the top of the screenshot above. While this provides Google (and other search engines) directions around not indexing these pages, it does tell them to continue to spider links through it. Additionally, we see in the tags that there are a variety of “alternate” URLs that will share the same (or similar) content, specifically 3 RSS feed options that would include content tagged in the “blog” category. This is where curiosity starts to take over and other questions start to arise. For instance:
- Despite this category being flagged as “noindex”, are we providing any contrary signals to this?
- How does another search engine like Bing see these pages and does it respect the “robots” tag directions?
- Does this “blog” category also have approximately 48 pages of post summaries in it?
- Should we also be calling out a “nocache” command for robots on these types of pages?
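For anyone who would rather script the <head> check than read the source by eye, here is a minimal sketch in Python. It pulls out the robots directives and the “alternate” links discussed above. The sample markup and the example.com URL are made up for illustration and are not our actual page source.

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Collects robots meta directives and rel="alternate" links from a page."""

    def __init__(self):
        super().__init__()
        self.robots = []      # e.g. ["noindex", "follow"]
        self.alternates = []  # (type, href) pairs

    def handle_starttag(self, tag, attrs):
        a = {name: (value or "") for name, value in attrs}
        if tag == "meta" and a.get("name", "").lower() == "robots":
            # Split "noindex, follow" into individual directives.
            self.robots += [
                d.strip().lower() for d in a.get("content", "").split(",") if d.strip()
            ]
        elif tag == "link" and "alternate" in a.get("rel", "").lower().split():
            self.alternates.append((a.get("type", ""), a.get("href", "")))


def head_signals(html):
    """Return (robots_directives, alternate_links) found in the HTML text."""
    parser = HeadSignals()
    parser.feed(html)
    return parser.robots, parser.alternates


# Hypothetical category page markup, for illustration only.
SAMPLE = """<html><head>
<meta name="robots" content="noindex, follow" />
<link rel="alternate" type="application/rss+xml" href="https://example.com/category/blog/feed/" />
</head><body></body></html>"""

robots, alternates = head_signals(SAMPLE)
print(robots)
print(alternates)
```

Running the same function over each category page would show quickly which ones carry “noindex” and which feeds they advertise as alternates.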
We’ll go ahead and address these in order. So, for #1, let’s check to see if we are feeding this to the search engines directly by checking the sitemap.xml file(s).
Sure enough, our sitemap file does include these category URLs (all 33 of them) despite them being listed as “noindex” in their page code. A quick check of our Google Search Console shows us that there are definitely some URLs in our sitemap being blocked:
However, the number reported there is definitely more than 33, though we did identify some other page types that could account for the rest. So, out of 93 URLs, we have identified 33.
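The sitemap check can also be scripted. The sketch below, again in Python, lists the category URLs in a sitemap so they can be compared against the pages flagged “noindex”. It assumes the standard sitemaps.org namespace and that category URLs contain “/category/” (the WordPress default, which may differ on a customized install). The sample sitemap is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def category_urls(urls, marker="/category/"):
    """Filter URLs that look like category archives (marker is an assumption)."""
    return [u for u in urls if marker in u]


# Hypothetical sitemap contents, for illustration only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/category/blog/</loc></url>
  <url><loc>https://example.com/category/seo/</loc></url>
</urlset>"""

urls = sitemap_urls(SAMPLE)
categories = category_urls(urls)
print(len(urls), "URLs in sitemap,", len(categories), "category URLs")
```

Any URL that shows up both in this list and in the “noindex” list is a contradictory signal: the sitemap asks for it to be indexed while the page itself asks for the opposite.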
So, now let’s take a peek at what shows up in the Bing index to address curiosity question #2. The first query is to see what category URLs are showing up using yet another “site:” search:
Looks like Bing is respecting the “noindex” command and has decided to show the specific “alternate” URLs listed, in this case, specifically the RSS feed pages for each listed category on the website, although it looks like they don’t have them all here. While we’re on Bing, why don’t we just see what it sees differently on the blog than Google:
Well, Bing is definitely seeing more of the blog than Google. In fact, Bing has 841 additional records. Interestingly though, when Bing drops its cached URLs once you hit page 33 of the listings, the count drops dramatically to just 444 results. Google drops down to 200 once you get to page 20 of its results. A gap of 244 pages is still quite the difference when it comes to getting your content to work for you, or in this case, for us. At a rough count, with the blog homepage paginated to page 47 or 48 and 10 post listings per page, we should have approximately 470 articles. In fact, a quick check shows 472 articles as of this writing.
Our third curiosity question is whether or not every post has received the category assignment of ‘blog’. Going through the pagination for that category, we find that it ends at page 37 with only 4 posts listed on that final page, putting the count of articles at 374 out of the estimated total of 472 articles. Well, at least it wouldn’t be a 100% duplicate of the main blog homepage, but it would represent quite the duplication risk for sure.
Now we would repeat these steps for some of the other sections identified in the “Tool Tip Tuesday – Your Browser and the Google Index (Part 1)” and, as they say, “see what we see”. Come back next month when we discuss possible solutions to these cache bloat issues and what we will recommend our webmaster implement.