
How Do Search Engines Find, Crawl, and Index Your Content?

I don’t often recommend that clients build out their own ecommerce or content management systems because of all of the unseen extensibility options that are needed nowadays – primarily focused around search and social optimization. I wrote an article on how to select a CMS and I still show it to the companies that I work with that are tempted just to build their own content management system.

However, there are absolutely situations where a custom platform is a necessity. When that’s the optimal solution, I still push my clients to build out the features needed to optimize their sites for search and social media. There are three key features that are a necessity:

  • Robots.txt
  • XML Sitemap
  • Metadata

What’s a Robots.txt File?

The robots.txt file is a plain text file in the root directory of your site that tells search engine crawlers which paths they may and may not crawl. Search engines also ask that you include the path to your XML sitemap within the file. Here’s an example of mine, which allows all bots to crawl my site and also directs them to my XML sitemap:

User-agent: *
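As a rough illustration of how a crawler reads such a file, here is a minimal Python sketch using the standard library's urllib.robotparser. The file contents and the example.com URLs are assumptions made up for this sketch, not the author's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every bot may crawl everything, and the
# Sitemap line points crawlers at the XML sitemap (assumed URL).
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# An empty Disallow rule means nothing is blocked for any user agent.
can_crawl = parser.can_fetch("*", "https://example.com/products/widget")

# site_maps() (Python 3.8+) returns the Sitemap URLs listed in the file.
sitemaps = parser.site_maps()

print(can_crawl, sitemaps)
```

A custom platform could run a check like this against its own generated file before publishing it, to confirm that important pages are not blocked by accident.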

What’s an XML Sitemap?

Just as HTML is for viewing in a browser, XML is written to be digested programmatically. An XML sitemap is basically a table of every page on your site and when it was last updated. XML sitemaps can also be daisy-chained… that is, one XML sitemap (a sitemap index) can refer to other sitemaps. That’s great if you want to organize and break down the elements of your site logically (FAQs, pages, products, etc.) into their own sitemaps.
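Here is a small Python sketch of what a platform might generate: one sitemap listing pages with their last-modified dates, and a sitemap index that chains several sitemaps together. The element names follow the sitemaps.org protocol; the function names, URLs, and dates are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XML_HEADER = '<?xml version="1.0" encoding="UTF-8"?>\n'


def build_sitemap(pages):
    """Build a <urlset> from (url, last_modified_iso_date) tuples."""
    urlset = ET.Element("urlset", xmlns=NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return XML_HEADER + ET.tostring(urlset, encoding="unicode")


def build_sitemap_index(sitemap_urls):
    """Build a <sitemapindex> that refers to other sitemap files."""
    index = ET.Element("sitemapindex", xmlns=NS)
    for loc in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
    return XML_HEADER + ET.tostring(index, encoding="unicode")


# Hypothetical product sitemap and an index that chains two sitemaps.
products_xml = build_sitemap(
    [("https://example.com/products/widget", "2024-01-15")]
)
index_xml = build_sitemap_index(
    [
        "https://example.com/sitemap-products.xml",
        "https://example.com/sitemap-faqs.xml",
    ]
)
print(products_xml)
print(index_xml)
```

Splitting by content type like this keeps each file small and lets the platform regenerate only the sitemap whose content changed.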

Sitemaps are essential so that you can effectively let the search engines know what content you’ve created and when it was last edited. Without a sitemap and snippets, the process a search engine follows when it visits your site is far less effective.

Without an XML sitemap, you risk your pages never being discovered. What if you have a new product landing page that isn’t linked internally or externally? How does Google discover it? Simply put, until a link to it is found, it won’t be. Thankfully, search engines let content management systems and ecommerce platforms roll out a red carpet for them. Without that red carpet, discovery depends on two steps:

  1. Google discovers an external or internal link to your page.
  2. Google indexes the page and ranks it according to its content and the content and quality of the site that links to it.

With an XML sitemap, you’re not leaving the discovery or the updating of your content to chance! Too many developers take shortcuts that hurt them, though. They publish the same rich snippet across the site, providing information that isn’t relevant to the page. They publish a sitemap with the same date on every page (or update every date when a single page changes), giving cues to the search engines that they’re gaming the system or are unreliable. Or they don’t ping the search engines at all… so the search engines don’t realize that new information has been published.
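One way to avoid the "same date on every page" shortcut is to change a page's last-modified date only when its content actually changes. The Python sketch below is one possible approach, assuming the platform keeps a small record per URL; the function name, record layout, and URLs are hypothetical.

```python
import hashlib
from datetime import date


def refresh_lastmod(records, url, content, today):
    """Update a page's lastmod only if its content actually changed.

    records maps url -> {"hash": str, "lastmod": "YYYY-MM-DD"}.
    Returns the lastmod value to publish in the sitemap.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    entry = records.get(url)
    if entry is None or entry["hash"] != digest:
        # New page or edited page: record the new hash and date.
        records[url] = {"hash": digest, "lastmod": today.isoformat()}
    return records[url]["lastmod"]


# Two pages are published in January; only page "a" is edited in February.
records = {}
refresh_lastmod(records, "https://example.com/a", "version 1", date(2024, 1, 1))
refresh_lastmod(records, "https://example.com/b", "version 1", date(2024, 1, 1))
lastmod_a = refresh_lastmod(records, "https://example.com/a", "version 2", date(2024, 2, 1))
lastmod_b = refresh_lastmod(records, "https://example.com/b", "version 1", date(2024, 2, 1))
print(lastmod_a, lastmod_b)
```

Hashing the content is only one option; a platform that already stores an edit timestamp per page could simply publish that value instead.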

What Is Metadata? Microdata? Rich Snippets?

Rich snippets are built from carefully tagged microdata that is hidden from the viewer but present in the page for search engines and social media sites to use. This kind of information is known as metadata. Google conforms to Schema.org as a standard for including things like images, titles, and descriptions… as well as a plethora of other informative snippets like price, quantity, location information, ratings, etc. Schema markup can significantly enhance your search engine visibility and the likelihood that a user will click through.

Facebook uses the Open Graph protocol (of course they couldn’t be the same), and Twitter even has a snippet to specify your Twitter profile. More and more platforms are using this metadata to preview embedded links and other information when they publish.
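As a sketch of what a custom platform might emit in each page's head, the Python function below renders a handful of Open Graph and Twitter card meta tags. The property names (og:title, og:description, og:image, og:url, twitter:card, twitter:site) are the commonly used ones; the function name, the sample values, and the choice of card type are assumptions for illustration.

```python
from html import escape


def social_meta_tags(title, description, image_url, page_url, twitter_handle):
    """Render Open Graph and Twitter card <meta> tags for one page."""
    open_graph = {
        "og:title": title,
        "og:description": description,
        "og:image": image_url,
        "og:url": page_url,
    }
    twitter = {
        "twitter:card": "summary_large_image",
        "twitter:site": twitter_handle,  # the site's Twitter profile
    }
    # Open Graph tags use the property attribute; Twitter tags use name.
    lines = [
        f'<meta property="{key}" content="{escape(value)}">'
        for key, value in open_graph.items()
    ]
    lines += [
        f'<meta name="{key}" content="{escape(value)}">'
        for key, value in twitter.items()
    ]
    return "\n".join(lines)


tags = social_meta_tags(
    title="Blue & Green Widget",
    description="A hypothetical product page.",
    image_url="https://example.com/images/widget.jpg",
    page_url="https://example.com/products/widget",
    twitter_handle="@example",
)
print(tags)
```

Escaping each value matters here because titles and descriptions come from editors and may contain quotes or ampersands.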

Your web pages have an underlying meaning that people understand when they read the web pages. But search engines have a limited understanding of what is being discussed on those pages. By adding additional tags to the HTML of your web pages (tags that say, “Hey search engine, this information describes this specific movie, or place, or person, or video”) you can help search engines and other applications better understand your content and display it in a useful, relevant way. Microdata is a set of tags, introduced with HTML5, that allows you to do this. (Source: What is MicroData?)
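To make the idea of microdata tags concrete, here is a minimal Python sketch that renders a product block annotated with schema.org microdata attributes (itemscope, itemtype, itemprop). The Product and Offer types and their properties come from schema.org; the function name and the sample product are assumptions for illustration.

```python
from html import escape


def product_microdata(name, description, price, currency):
    """Render a product as HTML annotated with schema.org microdata."""
    return (
        '<div itemscope itemtype="https://schema.org/Product">\n'
        f'  <h1 itemprop="name">{escape(name)}</h1>\n'
        f'  <p itemprop="description">{escape(description)}</p>\n'
        '  <div itemprop="offers" itemscope itemtype="https://schema.org/Offer">\n'
        f'    <meta itemprop="priceCurrency" content="{escape(currency)}">\n'
        f'    <span itemprop="price" content="{escape(price)}">{escape(price)}</span>\n'
        "  </div>\n"
        "</div>"
    )


snippet = product_microdata(
    name="Blue Widget",
    description="A hypothetical widget used for illustration.",
    price="19.99",
    currency="USD",
)
print(snippet)
```

Because the values are pulled from the same record that renders the visible page, each page gets markup that describes that page rather than one snippet repeated across the site.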

Of course, none of these are required… but I highly recommend them. When you share a link on Facebook, for example, and no image, title, or description comes up… few people will be interested enough to click through. And if your Schema snippets aren’t on each page, you can of course still appear in search results… but competitors may beat you out when they have additional information displayed.

Register Your XML Sitemaps with Search Console

If you’ve built your own content or ecommerce platform, it’s imperative that you have a subsystem that pings the search engines, publishes microdata, and provides a valid XML sitemap so that your content or product information can be found!

Once your robots.txt file, XML sitemaps, and rich snippets are customized and optimized throughout your site, don’t forget to register for each search engine’s Search Console (also known as Webmaster Tools), where you can monitor the health and visibility of your site on search engines. You can even specify your sitemap path if none is listed and see how the search engine is consuming it, whether there are any issues with it, and how to correct them.

Roll out the red carpet to search engines and social media and you’ll find your site ranking better, your entries on search engine result pages clicked through more, and your pages shared more on social media. It all adds up!

How Robots.txt, Sitemaps, and MetaData Work Together

Combining all these elements is a lot like rolling out the red carpet for your site. Here’s the crawl process that a bot takes along with how the search engine indexes your content.

  1. Your site has a robots.txt file that also references your XML Sitemap location.
  2. Your CMS or ecommerce system updates the XML Sitemap with any page and publish date or edit date information.
  3. Your CMS or ecommerce system pings the search engines to let them know that your site has been updated. You can ping them directly or use XML-RPC and a service like Ping-o-matic to push to all of the key search engines.
  4. The search engine comes back, respects the robots.txt file, finds new or updated pages via the sitemap, and then indexes them.
  5. When it indexes your page, it utilizes rich snippet microdata to enhance the search engine results page.
  6. As other relevant sites link to your content, your content ranks better.
  7. As your content is shared on social media, the rich snippet information you specified helps preview your content properly and directs readers to your social profile.
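The ping in step 3 can be sketched in Python with the standard library's xmlrpc.client. This is a sketch under assumptions: the weblogUpdates.ping method name and the Ping-o-matic endpoint below reflect the conventional weblog ping interface and should be verified against the service's current documentation, and the site name and URL are hypothetical. The example only builds the request body, so it runs without touching the network.

```python
import xmlrpc.client

# Assumed endpoint for Ping-o-matic's XML-RPC interface; verify before use.
PING_ENDPOINT = "http://rpc.pingomatic.com/"


def build_ping_request(site_name, site_url):
    """Build the XML-RPC request body for a weblogUpdates.ping call."""
    return xmlrpc.client.dumps(
        (site_name, site_url), methodname="weblogUpdates.ping"
    )


def send_ping(site_name, site_url, endpoint=PING_ENDPOINT):
    """Send the ping over the network (not called in this example)."""
    server = xmlrpc.client.ServerProxy(endpoint)
    return server.weblogUpdates.ping(site_name, site_url)


# Build the payload a platform would send after updating its sitemap.
payload = build_ping_request("Example Store", "https://example.com/")
print(payload)
```

A platform would typically call something like send_ping right after step 2, once the sitemap has been rewritten, and log the response rather than blocking publication on it.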

Douglas Karr

Douglas Karr is the founder of the Martech Zone and a recognized expert on digital transformation. Douglas has helped start several successful MarTech startups, has assisted in the due diligence of over $5 billion in MarTech acquisitions and investments, and continues to launch his own platforms and services. He's a co-founder of Highbridge, a digital transformation consulting firm. Douglas is also a published author of a Dummies guide and a business leadership book.
