How To Crawl A Large Site And Extract Data Using Screaming Frog’s SEO Spider

We’re assisting several clients right now with Marketo migrations. When large companies adopt an enterprise solution like this, it weaves itself into processes and platforms like a spider web over the years, until the company is no longer aware of every touchpoint.

With an enterprise marketing automation platform like Marketo, forms are the entry point for data across sites and landing pages. Companies often have thousands of pages and hundreds of forms throughout their sites that must be identified for updating.

A great tool for this is Screaming Frog’s SEO Spider, perhaps the most popular platform in the SEO market for crawling, auditing, and extracting data from a site. The platform offers hundreds of configuration options, and its features extend well beyond search optimization: one especially helpful feature extracts data from your site as it’s being crawled.

Screaming Frog SEO Spider: Crawl And Extract

A key feature of Screaming Frog SEO Spider is that you can perform custom extractions based on Regex, XPath, or CSSPath rules. This is exactly what we need to crawl the client’s sites and capture the MunchkinID and FormId values from each page.

In the tool, open Configuration > Custom > Extraction to define the elements you wish to extract.

[Screenshot: Screaming Frog custom extraction]

The extraction screen allows for virtually unlimited data collection:

[Screenshot: Screaming Frog SEO Spider extraction rules]

Regex, XPath, and CSSPath Extraction

For the MunchkinID, the identifier is located within the form script that’s on the page:

<script type='text/javascript' id='marketo-fat-js-extra'>
    /* <![CDATA[ */
    var marketoFat = {
        "id": "123-ABC-456",
        "prepopulate": "",
        "ajaxurl": "https:\/\/yoursite.com\/wp-admin\/admin-ajax.php",
        "popout": {
            "enabled": false
        }
    };
    /* ]]> */
</script>

We then apply a Regex rule to capture the id from within the script tag that’s inserted in the page:

Regex: ["']id["']: *["'](.*?)["']
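To sanity-check the pattern outside the crawler, here is a minimal Python sketch that runs the same Regex against the sample script block above. The function name `extract_munchkin_ids` is made up for illustration, and Python's `re` module stands in for the Spider's own regex engine, so treat it as an approximation rather than how Screaming Frog works internally.

```python
import re

# Sample markup copied from the script block shown above
# (with the closing </script> tag added).
html = r"""<script type='text/javascript' id='marketo-fat-js-extra'>
    /* <![CDATA[ */
    var marketoFat = {
        "id": "123-ABC-456",
        "prepopulate": "",
        "ajaxurl": "https:\/\/yoursite.com\/wp-admin\/admin-ajax.php",
        "popout": {
            "enabled": false
        }
    };
    /* ]]> */
</script>"""

# The same pattern that is entered in the extraction rule.
MUNCHKIN_RE = re.compile(r"""["']id["']: *["'](.*?)["']""")


def extract_munchkin_ids(page_source: str) -> list[str]:
    """Return every value captured by the pattern's first group."""
    return MUNCHKIN_RE.findall(page_source)


munchkin_ids = extract_munchkin_ids(html)
print(munchkin_ids)  # ['123-ABC-456']
```

Note that the pattern matches any quoted `id` key followed by a quoted value, so a page that embeds other JSON with an `"id"` field could return extra matches that need to be filtered out afterward.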

For the Form ID, the data is in an input tag within the Marketo form:

<input type="hidden" name="formid" class="mktoField mktoFieldDescriptor" value="1234">

We apply an XPath rule to capture the form ID from the form inserted in the page. The XPath query looks for an input named formid that is a direct child of a form element, and the extraction saves its value attribute:

XPath: //form/input[@name="formid"]/@value
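The same lookup can be approximated in Python as a rough check of the rule. This sketch uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, so it selects the input element and then reads the attribute separately instead of using the trailing `/@value` step. The sample page and the function name `extract_form_ids` are assumptions for illustration; the input is self-closed so the XML parser accepts it, which real-world HTML often would not be.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page containing the hidden Marketo input.
page = """<html><body>
  <form>
    <input type="hidden" name="formid" class="mktoField mktoFieldDescriptor" value="1234" />
  </form>
</body></html>"""


def extract_form_ids(markup: str) -> list[str]:
    """Return the value attribute of each formid input directly under a form."""
    root = ET.fromstring(markup)
    # ElementTree cannot evaluate /@value, so select the element first
    # and read the attribute with .get().
    inputs = root.findall(".//form/input[@name='formid']")
    return [el.get("value") for el in inputs if el.get("value") is not None]


form_ids = extract_form_ids(page)
print(form_ids)  # ['1234']
```

Because the rule uses `form/input`, an input nested deeper inside the form (for example, wrapped in a div) would not be matched by this query.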

Extract Inline Style Tags

We’re helping a client clean up a site where inline styles were applied through the Elementor plugin to customize virtually every element within a page. To identify where inline styles were used, we scraped the site with several Regex rules for custom extraction:

  • Span Inline Style:
<span\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"
  • Anchor Tag Inline Style:
<a\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"
  • Div Tag Inline Style:
<div\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"
  • Heading Tag Inline Style:
<h[1-6]\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"
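The rules in the list above can be tried against a snippet of markup before running a full crawl. The following is a minimal Python sketch under a few assumptions: the sample HTML and the names `INLINE_STYLE_RULES` and `find_inline_styles` are invented for illustration, and the heading rule is written with `h[1-6]` so that it matches only `h1` through `h6` and not other tags that start with the letter h.

```python
import re

# One pattern per tag type. The heading rule uses h[1-6] (an assumption
# about the intent) so tags such as <html> or <header> are not matched.
INLINE_STYLE_RULES = {
    "span": r'<span\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"',
    "a": r'<a\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"',
    "div": r'<div\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"',
    "heading": r'<h[1-6]\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"',
}

# Hypothetical markup with one styled element of each kind.
sample = (
    '<div class="box" style="margin:0 auto">'
    '<h2 style="color:red">Title</h2>'
    '<span style="font-weight:700">Bold</span>'
    '<a href="/x" style="text-decoration:none">Link</a>'
    '<p>No inline style</p>'
    '</div>'
)


def find_inline_styles(html: str) -> dict[str, list[str]]:
    """Map each rule name to the style values it captures in the markup."""
    return {name: re.findall(rule, html) for name, rule in INLINE_STYLE_RULES.items()}


styles = find_inline_styles(sample)
for name, values in styles.items():
    print(name, values)
```

These patterns only capture style attributes wrapped in double quotes; single-quoted attributes would need an additional rule.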

Exclusions

At Martech Zone, we serve the site in multiple languages at different subdomains. Crawling these translations isn’t necessary since all the assets and information are based on the core site. Because of this, we enabled the Exclude List Configuration and added the following rule:

.*\.martech.zone

You can also use this to skip crawling unnecessary paths like tags by adding:

martech.zone/tag/.*

We also don’t want to crawl our AMP pages, which end in ?amp=1, so in the Configuration > Exclude section, we’ve also added:

https?://[^\s]+?\?amp=1

The platform even has a nice method to test some URLs against the rules to ensure they work properly before you crawl your site.
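If you want to check rules like these in bulk outside the tool, a small Python sketch can do something similar. It assumes a rule excludes a URL when the pattern is found anywhere in it (`re.search`); whether the Spider matches part of the URL or the whole URL isn't covered here, so confirm the behavior with the built-in test. The sample URLs and the name `is_excluded` are hypothetical.

```python
import re

# The three exclusion rules from this section, unchanged.
EXCLUDE_RULES = [
    r".*\.martech.zone",         # translated subdomains
    r"martech.zone/tag/.*",      # tag archives
    r"https?://[^\s]+?\?amp=1",  # AMP versions of pages
]


def is_excluded(url: str) -> bool:
    """Return True if any rule is found in the URL (partial match assumed)."""
    return any(re.search(rule, url) for rule in EXCLUDE_RULES)


# Hypothetical URLs used only to exercise the rules.
test_urls = [
    "https://martech.zone/some-article/",
    "https://es.martech.zone/some-article/",
    "https://martech.zone/tag/seo/",
    "https://martech.zone/some-article/?amp=1",
]

for url in test_urls:
    print(url, "->", "excluded" if is_excluded(url) else "crawled")
```

In the first two rules, the dot between `martech` and `zone` is unescaped, so it matches any character. Writing it as `\.` would make the rules stricter.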

[Screenshot: Screaming Frog > Configuration > Exclude]

Screaming Frog SEO Spider JavaScript Rendering

Another great feature of Screaming Frog is that you aren’t limited to the HTML in the page; you can also render any JavaScript that inserts forms within your site. Within Configuration > Spider, go to the Rendering tab to enable this.

[Screenshot: Screaming Frog SEO Spider JavaScript rendering settings]

This does take a little longer to crawl the site, of course, but you’ll get forms that are rendered client-side by JavaScript as well as forms that are inserted server-side.

While this is a very specific application, it’s an incredibly useful one when you’re working with large sites. You’ll absolutely want to audit where your forms are embedded throughout the site.
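Once the extracted values are exported, the audit itself can be a short script. The sketch below groups page URLs by form ID from a CSV. Everything about the data is an assumption for illustration: the column names `Address`, `MunchkinID 1`, and `FormId 1`, the sample rows, and the function name `pages_by_form` are hypothetical, so adjust them to match whatever your export actually contains.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export: column names and rows are made up for illustration.
export_csv = """Address,MunchkinID 1,FormId 1
https://yoursite.com/contact/,123-ABC-456,1234
https://yoursite.com/demo/,123-ABC-456,1234
https://yoursite.com/webinar/,123-ABC-456,5678
https://yoursite.com/about/,,
"""


def pages_by_form(csv_text, url_col="Address", form_col="FormId 1"):
    """Group page URLs by the form ID extracted from each page."""
    inventory = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        form_id = (row.get(form_col) or "").strip()
        if form_id:  # skip pages where no form was found
            inventory[form_id].append(row[url_col])
    return dict(inventory)


inventory = pages_by_form(export_csv)
for form_id, urls in sorted(inventory.items()):
    print(form_id, len(urls), urls)
```

A grouping like this gives a per-form list of pages to update during the migration.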

Download Screaming Frog SEO Spider

Disclosure: Martech Zone is using its affiliate links in this article.

Douglas Karr

Douglas Karr is CMO of OpenINSIGHTS and the founder of the Martech Zone. Douglas has helped dozens of successful MarTech startups, has assisted in the due diligence of over $5 billion in MarTech acquisitions and investments, and continues to assist companies in implementing and automating their sales and marketing strategies. Douglas is an internationally recognized digital transformation and MarTech expert and speaker. Douglas is also a published author of a Dummies guide and a business leadership book.