When running a website, SEO experts and site owners must be aware of all the web pages that are indexed by search engines. But this information alone is not enough. It’s also critical to know all the pages that aren’t visible.
Getting a list of all the web pages of a single website enables you to get a complete overview of that website, and empowers you to clean it up for improved SEO success.
In this blog post, we’ll look into why you need to be able to find all the web pages of a website, how exactly you can do that, and what to do once you have a list of all your web pages.
Why do I need to find every single page?
Search engines are constantly introducing new algorithms and applying manual penalties to pages and sites. So if you don’t have a thorough understanding of all your website’s pages — you’re tiptoeing through an SEO minefield.
In order to avert a serious setback, you must keep a close eye on all the pages that make up your website. Doing so will not only confirm the pages you already know about, but will also surface forgotten pages: pages you had no idea existed and would otherwise never be able to view.
There are several scenarios in which you need to know how to find all the web pages of a site, such as:
- Changing website architecture;
- Finding and removing duplicate/redundant pages;
- Switching the site to a new permalink structure and 301-redirecting the pages to new URLs;
- Checking the validity of hreflang attributes, canonical and noindex tags;
- Setting up internal linking;
- Creating an XML sitemap or robots.txt file, to name a few.
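As a small illustration of that last scenario, here is a minimal Python sketch that builds an XML sitemap from a list of URLs. The domain and URLs below are placeholders, not real pages:

```python
# Minimal sketch: generate an XML sitemap from a list of page URLs.
# The URLs here are placeholders for illustration only.
from xml.sax.saxutils import escape


def build_sitemap(urls):
    """Return an XML sitemap string for the given list of absolute URLs."""
    entries = "\n".join(
        f"  <url><loc>{escape(u)}</loc></url>" for u in urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>"
    )


sitemap = build_sitemap([
    "https://example.com/",
    "https://example.com/blog/",
])
print(sitemap)
```

Once you have a complete URL list from the steps below, feeding it into a function like this is all it takes to produce a sitemap you can submit to search engines.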
Now, while getting a list of all crawlable web pages isn’t that difficult of a task, obtaining a list of your lost, forgotten and orphaned pages is another story that we will focus on in more depth.
An orphan page is a web page with no internal links pointing to it. In other words, such pages have no parent page. And without a parent, they receive no authority and are left without context, so search engines cannot properly evaluate them.
For example, let’s say you were redesigning your website and accidentally removed the only link to a page without deleting the page itself. Consequently, you’ll have a page that is not connected to the website and its SEO performance will be greatly jeopardized.
However, we’re not only looking to find pages without any internal linking. We’re also tracking down other pages, such as duplicates, that may have slipped from your attention in some other way.
Common causes of abandoned pages
Let’s look into some of the most common reasons why orphaned, lost and forgotten pages may occur on your site:
- Campaign-specific dedicated landing pages;
- Test pages created for split tests;
- Pages that were removed from the internal link structure, but were not deleted;
- Pages left over from a previous CMS;
- Pages generated as a result of the incorrect use of a CMS;
- Pages lost during website migration;
- Deleted shop category pages.
On top of that, if you don’t use HTTP or HTTPS, www or non-www, and trailing slashes consistently on every public page of your site, new abandoned pages may appear.
To see if everything’s set up as it should be on your site, enter all the different variations of your home page into the browser:
As long as each option redirects to the same URL, you’re good.
But just to be on the safe side, you ought to try the same tactic on several other pages of the same site. Plus, make sure that your website’s redirects are properly set up in the .htaccess file.
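If you’d like to automate that spot check, here is a minimal Python sketch that fetches each common variant of a domain and collects the final post-redirect URLs. The domain is a placeholder, and the script assumes the pages are publicly reachable:

```python
# Sketch: check that all common URL variants redirect to one canonical URL.
# "example.com" is a placeholder domain; urlopen() follows redirects, and
# geturl() reports the final URL after any 301/302 hops.
from urllib.request import urlopen


def variants(domain):
    """Return the four common URL variants for a bare domain."""
    return [
        f"http://{domain}/",
        f"http://www.{domain}/",
        f"https://{domain}/",
        f"https://www.{domain}/",
    ]


def final_urls(domain):
    """Fetch each variant and return the set of final (post-redirect) URLs."""
    return {urlopen(u).geturl() for u in variants(domain)}


# If every variant redirects to the same canonical URL, the set has one element:
# assert len(final_urls("example.com")) == 1
```

Running the same function against a handful of inner pages, not just the home page, covers the “several other pages” check mentioned above.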
Note: If you’ve spotted issues early on, setting up the appropriate redirect rules in your .htaccess file will fix them.
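As an illustration, here is a common .htaccess pattern that forces HTTPS and the www host in a single 301 hop. This is a sketch, assuming an Apache server with mod_rewrite enabled and example.com as a placeholder domain; adapt it to your setup before use:

```apache
# Hypothetical .htaccess rules: force HTTPS and www in one 301 redirect.
# Replace example.com with your own domain; requires mod_rewrite.
RewriteEngine On
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [L,R=301]
```

Redirecting in a single hop, rather than chaining http → https → www, keeps crawlers from wasting crawl budget on intermediate redirects.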
To recap, if you designed a web page with the ultimate goal of getting it ranked high organically — triple-check that it’s properly linked to your site so that it receives authority and has a chance of being discovered.
Using tools to find all pages of a website
Now, when it comes down to finding all the web pages that belong to a single website, we’re going to be using three tools:
- SE Ranking’s Website Audit to find all crawlable web pages;
- Google Analytics to discover all pages that have ever been visited;
- Google Search Console to discover pages only visible to Google.
Then, we’ll compare the data sets from these tools to find mismatches, and identify all the pages of your site, including those that aren’t linked to the website, and, therefore, aren’t discoverable via organic search.
Finding crawlable pages via SE Ranking’s Website Audit
Let’s start by collecting all the URLs that both people and search engine crawlers can reach by following your site’s internal links. Analyzing these pages should be your top priority, as they receive the most attention.
To do this, we’re going to first need to access SE Ranking, add a website or select an existing one, and then go to Website Audit.
Note: The 14-day free trial gives you access to all of SE Ranking’s available tools and features, including Website Audit.
Next, let’s configure the settings to make sure we’re telling the crawler to go through the right pages.
Go to Settings → Source of pages for audit, and set the system to scan Site pages, Subdomains, and XML sitemap. This ensures we only scan what’s been explicitly specified, while including the site’s subdomains along with all their pages:
Then, go to Rules for scanning pages, and set the Take into account the robots.txt directives option to ‘Yes’ to tell the system to follow the instructions specified in the robots.txt file. Click ‘Save’ when you’re done:
Now, go to Report and launch the audit with the new settings applied by hitting ‘Restart audit’:
Once the audit is complete, go to Crawled pages to view the full list of all crawlable pages:
But since we only want to see 200-status-code pages, as in those that are working correctly, we need to add a filter like so:
Now it’s time to export the results:
The last thing we need to do here is remove all URLs from the list that have the ‘Yes’ value under the Meta noindex column in Excel. Select the corresponding column and sort its data:
Finally, considering the fact that we will have to compare data sets later on, we need to export the results to a place where we can easily perform such tasks. So, copy all the remaining URLs — those with the ‘No’ value under Meta noindex — into a spreadsheet.
(Note that you can use Excel as well, but I prefer Google Sheets.)
Finding all pages with pageviews via Google Analytics
Since crawlers are designed to discover pages by following internal links or sitemap entries, they are inherently incapable of finding orphan pages.
For this reason, you should track down such pages by carefully studying the data in your Google Analytics account. There is only one condition: your website must be linked to your Google Analytics account from the get-go, so that it can collect data behind the scenes.
The logic here is simple: if someone has ever visited any page of your website, Google Analytics will have the data to prove it. And since these visits are made by people, we should ensure such pages serve a distinct SEO or marketing purpose.
Start by going to Behavior → Site Content → All Pages. Now, we are looking for pages that are difficult (close to impossible) to find by navigating through the site. As a result, they won’t have many pageviews. Close to none, as a matter of fact.
Next, click on ‘Pageviews’ to get the arrow pointing up and sort the page URLs from least to most pageviews. Ultimately, the least visited pages will be seen at the top of the list:
If your site’s been up for some time, it’s a good idea to set the time range to cover the entire period since you first connected your site to Google Analytics, but mind the data sampling issue.
Now scroll down until you start seeing pages with far more visits than your orphan pages, and stop there. Note that since we sorted the pages from least to most pageviews, all orphan pages should appear near the top of the list.
Once that’s done, export the data into a .csv file.
Singling out orphan pages
The next step is putting the data from SE Ranking and Google Analytics next to each other and comparing it to learn what web pages weren’t crawled.
Since we already have the data from SE Ranking in a spreadsheet, copy the data from Google Analytics’ .csv file and insert it into column C, and here’s why.
The data we collected from Google Analytics isn’t in a URL format, so we need to fix this. To do so, start by inserting the URL of the home page into column B, as shown below:
Then, use the CONCATENATE() function to combine the values of columns B and C in column D, for example =CONCATENATE(B2,C2), and drag cell D2 down to generate the complete list of URLs:
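If you prefer scripting, the same concatenation step can be sketched in a few lines of Python. The domain and paths below are placeholders; Google Analytics exports paths like “/pricing/”, so we simply prepend the site root:

```python
# The spreadsheet concatenation step, sketched in Python: GA exports
# paths (e.g. "/pricing/"), so prepend the home page URL to rebuild
# full URLs. The domain and paths below are placeholders.
HOME = "https://example.com"


def to_full_urls(paths):
    """Prepend the site root to each GA path, mirroring CONCATENATE(B2,C2)."""
    return [HOME + p for p in paths]


ga_paths = ["/", "/pricing/", "/old-landing/"]
print(to_full_urls(ga_paths))
```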
This is the exciting part: now we need to compare the “SE Ranking” column with the “GA URLs” column to find those lost, forgotten pages.
Obviously, the example above is just an example. In reality, you’ll have way more pages to go through and performing this task manually will take forever.
Luckily, the MATCH function can check whether every value in the “GA URLs” column is also present in the “SE Ranking” column. Click on cell E2, enter a formula such as =MATCH(D2,A:A,0), and drag the cell all the way down to your last value.
Here’s what you should get:
As you can see, when a match is found, the cell returns the value’s position in the range. But that’s not what we came for: we’re looking for cases where no match (#N/A) is found, as in cell E12.
In the example, the URL in row 12 has no counterpart in the SE Ranking column, so E12 returns an error. This means we’ve found our lucky winner: an orphan page.
Assuming that your list is much longer and not necessarily sorted in any logical order, sort the data in column E as shown in the screenshot below to collect all the errors:
Finally, take the list of all errors, which are actually orphan pages, and insert them into a new spreadsheet. Now you can go through each page and figure out how to handle it.
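If your lists are long, the same comparison can be done without a spreadsheet at all. Here is a Python sketch with placeholder URLs: a set difference between the GA URL list and the crawled URL list yields the orphan candidates in one step.

```python
# The MATCH comparison, sketched as a set difference in Python:
# any URL seen in Google Analytics but absent from the crawl list
# is a candidate orphan page. The URLs below are placeholders.
crawled = {
    "https://example.com/",
    "https://example.com/pricing/",
}
ga_urls = {
    "https://example.com/",
    "https://example.com/pricing/",
    "https://example.com/old-landing/",
}

orphans = ga_urls - crawled
print(sorted(orphans))  # the URLs MATCH would flag as #N/A
```

In practice you would load `crawled` from the SE Ranking export and `ga_urls` from the Google Analytics .csv instead of hard-coding them.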
What to do with orphan pages
Before you do anything else, you must look at each orphan page and understand the big picture — its role in your website and in your marketing efforts. That way, you’ll be able to decide what to do with it.
You have three ways of going about this situation:
- Keep the page by adding internal linking to it and finding the right place for it on your website;
- Leave it untouched if it’s a campaign-specific page, but add a noindex tag;
- Delete the page and set up a 301 redirect from its old URL to a relevant page.
To make sure you’ve got all your bases covered, you can run the process anew afterward using updated data.
Finding all other pages via Google Search Console
Now that we know how to find and manage all the pages of your site that have ever had any human visitors, let’s look at the pages that weren’t covered in the previous steps — those that are only accessible to Google.
We’re going to be using the data provided in your Google Search Console account to achieve this.
Start by opening up your account and going to Coverage. Then, make sure to select ‘All known pages’ instead of ‘All submitted pages’ and filter to view only ‘Valid’ pages:
By doing so, you’ll get two lists of pages that have been successfully indexed by the search giant — Indexed, not submitted in sitemap and Submitted and indexed.
Click on a list to expand it and get the full list of the pages that fall under one of these two categories:
Take your time to closely study all the pages listed there to see if you can find any that weren’t collected in the two previous steps. If there are any, make sure they are set up properly within the framework of your site.
Now, let’s select ‘Excluded’ to view only those pages that were intentionally not indexed and won’t appear in Google. Unfortunately, this is where you’ll have to roll up your sleeves and do a lot of manual work:
As you scroll down, you’ll see several lists of excluded pages:
You can view pages with redirects, pages excluded by the ‘noindex’ tag, those blocked by robots.txt, and so on.
Going through each one of them will give you unfiltered access to every single page of your site. Then, by comparing the orphan page data with the data in these lists, you’ll get a comprehensive overview of all of your site’s pages.
I recommend repeating this process once or twice a year to find new pages that may have gotten away from you.
In order for a search engine bot to fully crawl a website, it needs to follow the internal links one by one. But if a web page is not linked to the site in any way, be it accidentally or on purpose, then neither search engines nor humans will be able to reach the page. And this is not great for the site’s SEO performance.
As a site owner or SEO specialist responsible for a site, seeing all the pages of a particular site can help you discover valuable pages you might have forgotten about.
By regularly making sure you’re aware of all the web pages of your site, including orphan pages, you will be able to stay on top of your SEO and marketing efforts.