In-Depth Guide to Search Engine Indexing
Ever wonder how websites get listed on search engines and how Google, Bing, and others provide us with tons of information in a matter of seconds?
The secret of this lightning-fast performance lies in search indexing. It can be compared to a huge and perfectly ordered catalog archive of all pages. Getting into the index means that the search engine has seen your page, evaluated it, and stored it. As a result, it can show this page in search results.
Let’s dig into the process of indexing from scratch in order to understand:
- How the search engines collect and store the information from billions of websites, including yours
- How you can manage this process
- What you need to know about indexing site resources with the help of different technologies
What is search engine indexing?
To participate in the race for the first position in SERP, your website has to go through a selection process:
Step 1. Web spiders (or bots) scan all the website’s known URLs. This is called crawling.
Step 2. The bots collect and store data from the web pages, which is called indexing.
Step 3. And finally, the website and its pages can compete in the game trying to rank for a specific query.
In short, if you want users to find your website on Google, it needs to be indexed: information about the page should be added to the search engine database.
The search engine scans your website to find out what it is about and which type of content is on its pages. If the search engine likes what it sees, it can then store copies of the pages in the search index. For each page, the search engine stores the URL and content information. Here is what Google says:
“When crawlers find a web page, our systems render the content of the page, just as a browser does. We take note of key signals—from keywords to website freshness—and we keep track of it all in the Search index.”
Web crawlers index pages and their content, including text, internal links, images, audio, and video files. If the content is considered to be valuable and competitive, the search engine will add the page to the index, and it’ll be in the “game” to compete for a place in the search results for relevant user search queries.
When users enter a search query on the Internet, the search engine quickly scans its list of saved (=indexed) websites and shows only the relevant pages in the SERP. Think of a librarian looking for books in a catalog based on alphabetical order, subject matter, and exact title.
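To make the catalog analogy concrete, here is a minimal sketch of an inverted index, a common data structure for this kind of lookup. It is a toy illustration only: the page URLs and texts are hypothetical, and real search engines use far more elaborate structures and ranking signals.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query, sorted."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)


# Hypothetical pages, used only for illustration.
pages = {
    "https://example.com/tea": "green tea brewing guide",
    "https://example.com/coffee": "coffee brewing guide",
}
index = build_index(pages)
print(search(index, "brewing guide"))  # both pages
print(search(index, "green tea"))      # only the tea page
```

The point of the sketch is that the query never touches the pages themselves. It only consults the prebuilt index, which is why a lookup can be fast.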
Keep in mind: pages are only added to the index if they contain quality content and don’t trigger any alarms by doing shady things like keyword stuffing or building a bunch of links from disreputable sources. At the end of this post, we’ll discuss the most common indexing errors.
Note that Google algorithm updates, such as core updates, can impact indexing. If Google doesn’t consider significant portions of a site valuable enough to display in search results, the search engine may conclude that it is not worthwhile to invest time crawling and indexing the entirety of the site.
What helps crawlers find your site?
If you want a search engine to find out about your website or its new pages, you have to show it to the search engine. The most popular and effective ways include: submitting a sitemap to Google, getting backlinks, engaging social media, and using special tools.
Let’s dive into these ways to speed up the indexing process:
1. Submitting your sitemap to Google
To make sure we are on the same page, let’s first refresh our memories. The XML sitemap is a list of all the pages on your website (an XML file) crawlers need to be aware of. It serves as a navigation guide for bots. The sitemap does help your website get indexed faster with a more efficient crawl rate.
Furthermore, it can be especially helpful if your content is not easily discoverable by a crawler. It is not, however, a guarantee that those URLs will be crawled or indexed.
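If you are curious what such a file looks like, the sketch below builds a bare-bones XML sitemap with Python’s standard library. The URLs are placeholders and the helper name build_sitemap is made up for this example. Real sitemaps often carry optional fields such as lastmod, which are left out here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Placeholder URLs, used only for illustration.
sitemap_xml = build_sitemap([
    "https://yourwebsite.com/",
    "https://yourwebsite.com/blog/",
])
print(sitemap_xml)
```

You would save the output as sitemap.xml at the root of your site before submitting it.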
If you still don’t have a sitemap, take a look at our guide to successful SEO mapping.
Once you have your sitemap ready, go to your Google Search Console and:
Open the Sitemaps report ▶️ Click Add a new sitemap ▶️ Enter your sitemap URL (normally, it is located at yourwebsite.com/sitemap.xml) ▶️ Hit the Submit button.
Soon, you’ll see if Google was able to properly process your sitemap. If everything went well, the status will be Success.
In the same table of your Sitemap report, you’ll see the number of discovered URLs. By clicking the icon next to the number of discovered URLs, you’ll get to the Index Coverage Report. Below, I will tell you point by point how to use this report to check your website indexing.
2. Using Google’s Indexing API
With the Indexing API, you can notify Google of new URLs that need to be crawled.
According to Google, this method serves as an excellent alternative to using a sitemap. By leveraging the Indexing API, Googlebot can promptly crawl your pages without waiting for sitemap updates or pinging Google. However, Google still recommends submitting a sitemap to cover your entire website.
To use the Indexing API, create a project for your client and service account, verify ownership in Search Console, and get an access token. This documentation provides a step-by-step guide on how to do it.
Once set up, you can send requests with the relevant URLs to notify Google of new pages, and then patiently wait until your website’s pages and content are crawled.
Note: The Indexing API is especially useful for websites that frequently host short-lived pages, such as job postings or livestream videos. By enabling individual updates to be pushed, the Indexing API ensures that the content remains fresh and up-to-date in search results.
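As a rough sketch of what such a request carries, the snippet below builds the JSON body for a single URL notification. The endpoint and the URL_UPDATED / URL_DELETED types follow Google’s Indexing API documentation. The helper name and the example URL are made up, and the snippet deliberately stops short of sending anything, since a real call also needs the access token mentioned above.

```python
import json

# Publish endpoint from Google's Indexing API documentation.
ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"


def build_notification(url, deleted=False):
    """Return the JSON body for one URL notification."""
    return json.dumps({
        "url": url,
        "type": "URL_DELETED" if deleted else "URL_UPDATED",
    })


# Placeholder URL, used only for illustration.
body = build_notification("https://yourwebsite.com/jobs/123")
print("POST", ENDPOINT)
print(body)
```

In practice, you would send this body in an authenticated POST request, one notification per URL.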
3. Getting backlinks
Backlinks are a cornerstone of how search engines understand the importance of a page. They give a signal to Google that the resource is useful and that it’s worth getting on top of the SERP.
Recently, John Mueller said, “Backlinks are the best way to get Google to index content.” According to him, submitting a sitemap with URLs to GSC is considered good practice. Particularly for new websites with no existing signals or information available to Google, providing the search engine with URLs via a sitemap is a good way to get that initial foot in the door. Still, it’s important to note that this does not guarantee that Google will pick up the included URLs.
John Mueller advises webmasters to cooperate with different blogs and resources and get links pointing to their websites. That probably would do more than just going to Search Console and saying I want this URL indexed immediately.
Here are a few ways to get quality backlinks:
- Guest posting: Reach out to reputable blogs and websites, such as Forbes, Entrepreneur, Business Insider, and TechCrunch, to publish your high-quality posts with necessary links.
- Creating press releases: Inform the audience about your brand by publishing noteworthy news about your company, product updates, and important events on different websites.
- Writing testimonials: Find companies that are relevant to your industry and submit a testimonial in exchange for a backlink.
- Utilize other popular strategies to get backlinks, as described in this article.
4. Improving social signals
Search engines want to provide users with high-quality content that meets their search intent. To do so, Google takes into account social signals: likes, shares, and views of social media posts. All of them inform search engines that the content is meeting the needs of their users, and is relevant and authoritative. If users actively share your page, like it, and recommend it for reading, search bots will not overlook such content. That’s why it’s very important to be active in social media.
Mind that Google says that social signals are not a direct ranking factor. Still, they can indirectly help with SEO. Google’s partnership with Twitter, which added tweets to SERP, is further evidence of the growing significance of social media in search rankings.
Social signals include all activity on Facebook, Twitter, Pinterest, LinkedIn, Instagram, YouTube, etc. Instagram lets you use the Swipe Up feature to link to your landing pages. With Facebook, you can create a post for each important link. On YouTube, you can add a link to the video description. LinkedIn allows you to raise your website and company credibility. Understanding the individual platforms you’re targeting lets you better tailor your approach to maximize your website effectiveness.
There are a few things to remember:
- Post relevant content: Your content should be about your company, industry, and brand, which is what your followers are following you for.
- Create shareable content: Memes, infographics, and diverse research content always receive a lot of likes and reposts.
- Optimize your social profile: Make sure to add a link to your website into the account info section.
As a rule of thumb, the more social buzz you create around your website, the faster you will get your website indexed.
5. Using add URL tools
Another way to signal a new website page and try to speed up its indexing is to use add URL tools. They allow you to request the indexing of URLs. This option is available in GSC and other special services. Let’s take a look at different add URL tools.
Google Search Console
At the beginning of this chapter, I described how to add a sitemap with lots of website links. But if you need to add one or more links for indexing, you can use another GSC option. With the URL Inspection tool, you can request a crawl of individual URLs.
Go to your Google Search Console dashboard, click on the URL inspection section, and enter the desired page address in the line:
If a page has been created recently or is experiencing technical issues, it may not be indexed. When this happens, you will receive a message indicating the issue, and you can request indexing of the URL. Simply press the button to start the indexing process:
All URLs with new or updated content can be requested for indexing this way through GSC.
How to check your website indexing?
You have submitted your website pages for indexing. How do you know that the necessary pages have actually been indexed? Let’s look at some methods you can use to check this.
Analyze the Pages report in GSC
Google Search Console allows you to monitor which of your website pages are indexed, which are not, and why. We’ll show you how to check this.
Begin by clicking on the Indexing section and going to the Pages report.
On the Indexed tab, you’ll find information about all pages on the site that have been indexed. Click on the View data about indexed pages button.
Under the All submitted pages row, you’ll see all indexed pages that were submitted in the sitemap.
Scroll down to see the list of all indexed pages. From here, you can even find out when Google last crawled the page.
Next, choose the Unsubmitted pages only option from the drop-down menu. You’ll see indexed pages that were not submitted in the sitemap. You may want to add them to your sitemap because Google considers them to be high-quality pages.
Now, let’s move on to the next stage.
The Not Indexed tab shows pages that could not be indexed due to various reasons, such as indexing errors.
In the Why pages aren’t indexed table, you can find specific details about each issue and try to fix it.
Look through all these pages carefully because you may find URLs that can be fixed. This will ensure that Google indexes them, leading to improved rankings. Use the Google website rank checker to see if your efforts worked and if your rankings improved.
Scroll down to the section showing pages that have been indexed but have issues, some of which may be intentional on your part. Click on the warning row in the table to see details about the issue and then try to fix it using this new info. This will help you rank better.
The same type of indexing data can also be obtained for videos. Simply go to the Video pages report within the Indexing section.
Use special tools
Many specialists use the site: operator to determine the exact number of a website’s indexed pages. Unfortunately, this is not a reliable and accurate method due to personalized search results, search engine limitations, and delayed SERP updates.
Use other tools in addition to GSC instead. In the next section, we’ll go over some of the simplest and most effective ones, and you can also check out an extended hand-picked list of the best rank tracker software.
Using SE Ranking, you can run a website SEO audit and find information about indexing.
Go to the Overview and scroll to the Page Indexation block.
Here, you’ll see a graph of indexed and not indexed pages, their percentage ratio, and number. This dashboard also shows issues that won’t let search engines index pages of the website. You can view a detailed report by clicking on the graph.
By clicking on the green line, you’ll see the list of indexed pages and their parameters: status code, blocked by robots.txt, referring pages, x-robots-tag, title, description, etc.
Here, you can filter pages based on the parameters Blocked by noindex and Blocked by X-Robots-Tag. This allows you to see which pages shouldn’t be indexed at all.
The same info can be found in the Crawling section within the Issue Report.
This extensive information will help you find and fix the issues so that you can be sure all important website pages are indexed.
You can also check page indexing with SE Ranking’s Index Status Checker. Just choose the search engine and enter a URL list.
Once you’ve resolved any indexing issues, you can use a rank checker to monitor your website’s performance and track the improvements.
Check out our guide on tracking search engine rankings to learn effective techniques for analyzing your website’s performance on search engines and optimizing your SEO strategy accordingly.
Prepostseo is another tool that helps you check website indexing.
Just paste the website URL or a list of URLs that you want to check, and click on the Check pages button. You’ll get a results table, displaying two values for each URL:
- By clicking the View Full Website Status link, you will be redirected to a Google SERP. You will then find a full list of indexed pages for that specific domain.
- By clicking the View Current Page Status link, you will be redirected to a results page, allowing you to verify whether the exact URL is listed in Google’s index or not.
With this website index checker, you can check 1,000 pages at once.
What are the specifics of websites indexing with different technologies?
We’ve puzzled out how Google indexes websites, how to submit pages for indexing, and how to check whether they appear in SERP. Now, let’s talk about an equally important thing: how web development technology affects the indexing of website content.
The more you know about indexing aspects of websites with different technologies, the higher your chances of having all your pages successfully indexed. So, let’s get down to different technologies and their indexing.
Flash started as a simple piece of animation software, but in the years that followed, it shaped the web as we know it today. Flash was used to make games and indeed entire websites, but today, Flash is quite dead.
Over the 20 years of its development, the technology has had a lot of shortcomings, including a high CPU load, Flash Player errors, and indexing issues. Flash is cumbersome, consumes a huge amount of system resources, and has a devastating impact on mobile device battery life.
In 2019, Google stopped indexing Flash content, making a statement about the end of an era.
Not surprisingly, search engines recommend not using Flash on websites. But if your site is designed using this technology, create a text version of the site. It will be useful for users who have never installed Flash or have an outdated version, as well as for mobile device users (such devices do not display Flash content).
There are a lot of JS-based technologies. Below, we’ll dive into the most popular ones.
AJAX allows pages to update asynchronously by exchanging a small amount of data with the server. One of the signature features of websites using AJAX is that content is loaded by one continuous script, without division into pages with separate URLs. As a result, the website has pages with a hash (#) in the URL.
Historically, such pages were not indexed by search engines. Instead of scanning the https://mywebsite.com/#example URL, the crawler went to https://mywebsite.com/ and didn’t scan the URL with #. As a result, crawlers simply couldn’t scan all the website content.
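The reason is that the part of a URL after # (the fragment) is handled by the browser and is not sent to the server as part of the request. A quick way to see this split is Python’s standard URL parser, used here on the example URL from above:

```python
from urllib.parse import urldefrag

# Split the example URL into the part a server sees and the fragment.
page, fragment = urldefrag("https://mywebsite.com/#example")
print(page)      # https://mywebsite.com/
print(fragment)  # example
```

A crawler that simply requests the URL gets the same base page no matter what follows the #.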
From 2019, websites with AJAX are rendered, crawled, and indexed directly by Google, which means that bots scan and process the #! URLs, mimicking user behavior.
This means that webmasters no longer need to create the HTML version of every page. Still, it’s important to check if your robots.txt allows for the scanning of AJAX scripts. If they are disallowed, ensure that you open them for search indexing.
While scanning, crawlers don’t get enough page content; they don’t understand that the content is being loaded dynamically. As a result, search engines see an empty page yet to be filled.
Moreover, with SPA, you also lose the traditional logic behind the 404 error page and other non-200 server status codes. As content is rendered by the browser, the server returns a 200 HTTP status code to every request, and search engines can’t tell if some pages are not valid for indexing.
If you want to learn how to optimize single-page applications to improve their indexing, take a look at our comprehensive blog post about SPA.
- Crawlers can’t actually see what’s on the page. Search engines find it difficult to index content that requires clicking to load.
- Speed is one of the biggest hurdles. Google crawls pages un-cached, so those cumbersome first loads can be problematic.
- Client-side code adds increased complexity to the finalized DOM, which means more CPU resources will be required from both search engine crawlers and client devices. This is one of the most significant reasons why a complex JS framework would not be preferred.
How to restrict site indexing
There may be certain pages that you don’t want search engines to index. It is not necessary for all pages to rank and appear in search results.
What content is most often restricted?
- Internal and service files: those that should be seen only by the site administrator or webmaster, for example, a folder with user data specified during registration: /wp-login.php; /wp-register.php.
- Pages that are not suitable for display in search results or for the first acquaintance of the user with the resource: thank you pages, registration forms, etc.
- Pages with personal information: contact information that visitors leave during orders and registration, as well as payment card numbers.
- Files of a certain type, such as pdf documents.
- Duplicate content: for example, a page you’re doing an A/B test for.
So, you can block information that has no value to the user and does not affect the site’s ranking, as well as confidential data from being indexed.
Restricting indexing solves two problems:
- Reduce the likelihood of certain pages being crawled, indexed, and shown in search results.
- Save crawling budget—a limited number of URLs per site that a robot can crawl.
Let’s see how you can restrict website content.
Robots meta tag
Meta robots is a tag where commands for search bots are added. They affect the indexing of the page and the display of its elements in search results. The tag is placed in the <head> of the web document to instruct the robot before it starts crawling the page.
The robots meta tag is a more reliable way to manage indexing than robots.txt, which works only as a recommendation for the crawler. With the robots meta tag, you can specify commands (directives) for the robot directly in the page code. It should be added to every page that should not be indexed.
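To check which directives a page actually carries, you can parse its HTML. The sketch below is one way to do that with Python’s standard library. The class and function names are made up for this example, and it only looks at the meta tag, not at HTTP headers such as X-Robots-Tag.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def is_indexable(html):
    """Return False if the page carries a noindex (or none) directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return not ({"noindex", "none"} & parser.directives)


# Hypothetical pages, used only for illustration.
blocked_page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
open_page = "<html><head><title>Hello</title></head></html>"
print(is_indexable(blocked_page))  # False
print(is_indexable(open_page))     # True
```

The blocked_page string also shows what the tag itself looks like when placed in the head of a document.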
Read our ultimate guide to find out how to add meta tag robots to your website.
You can also restrict the indexing of website content server-side. To do this, find the .htaccess file in the root directory of your website and add the necessary code to restrict access for specific search engines.
This rule allows you to block unwanted User Agents that may pose a potential threat or simply overload the server with excessive requests.
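The .htaccess rule itself is not reproduced here. As a rough illustration of the same idea in Python rather than Apache configuration, the sketch below wraps a WSGI application and answers blocked user agents with a 403. The bot names and function names are hypothetical.

```python
BLOCKED_AGENTS = ("badbot", "scrapybot")  # hypothetical bot names


def block_agents(app, blocked=BLOCKED_AGENTS):
    """Wrap a WSGI app so requests from blocked user agents get a 403."""
    def wrapper(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(name in agent for name in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return wrapper


def site(environ, start_response):
    """Stand-in for the real website."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Welcome"]


def status_for(agent):
    """Call the wrapped app with a fake request and return its status line."""
    seen = []
    protected = block_agents(site)
    protected({"HTTP_USER_AGENT": agent}, lambda status, headers: seen.append(status))
    return seen[0]


print(status_for("Mozilla/5.0 (compatible; BadBot/1.0)"))     # 403 Forbidden
print(status_for("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # 200 OK
```

Keep in mind that the User-Agent header can be spoofed, so a filter like this only stops bots that identify themselves honestly.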
Set up a website access password
Another method to prevent site indexing is by setting up a website access password through the .htaccess file. Set a password and add the code to the .htaccess file.
The password must be set by the website owner, so you will need to identify yourself by adding a username. This means you will need to include the user in the password file.
As a result, the bot will no longer be able to crawl and index the website.
Common indexing errors
Sometimes, Google cannot index a page, not only because you have restricted content indexing but also because of technical issues on the website.
Here are the five most common issues preventing search engines from indexing your pages.
Having the same content on different pages of your website can negatively affect optimization efforts because your content isn’t unique. Since Google doesn’t know which URL to list higher in SERP, it might rank both URLs lower and give preference to other pages. Plus, suppose Google decides that your content is deliberately duplicated across domains in an attempt to manipulate search engine rankings. In that case, the website may not only lose position but also can be dropped from Google’s index. So, you’ll have to get rid of duplicate content on your site.
Let’s look at some steps you can take to avoid duplicate content issues.
- Set up redirects. Use 301 redirects to merge identical or highly similar pages.
- Work on the website structure. Make sure that the content does not overlap (common with blogs and forums). For example, a blog post may appear on the main page of a website, and an archive page.
- Minimize similar content. If your website has two or more pages with nearly identical text, this is a problem, and you’ll need to fix it. Either merge all pages into one page or create unique content for each. Note that using poor boilerplate content can lead to soft 404 errors. For instance, if a page contains partial content from other pages on the site, it may be flagged as a soft 404 error and be removed from SERPs. Your best bet is to eliminate these redundant pages because they’re a waste of your valuable crawl budget.
- Use the canonical tag. If you want to keep duplicate content on your website, Google recommends using the rel="canonical" link element. It points search engines to the main version of the page.
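Before applying any of the steps above, you need to know which pages duplicate each other. A simple first pass, sketched below, is to hash the normalized text of each page and group matching hashes. The URLs and texts are hypothetical, and this approach only catches exact duplicates (after case and whitespace normalization), not near-duplicates.

```python
import hashlib
from collections import defaultdict


def find_duplicates(pages):
    """Group URLs whose whitespace- and case-normalized text is identical."""
    groups = defaultdict(list)
    for url, text in pages.items():
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical pages, used only for illustration.
pages = {
    "https://example.com/post": "How to brew  green tea",
    "https://example.com/archive/post": "how to brew green tea",
    "https://example.com/about": "About our tea shop",
}
duplicates = find_duplicates(pages)
print(duplicates)
```

Each group in the result is a set of URLs that would be candidates for a redirect, a merge, or a canonical tag.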
HTTP status code issues
Another problem that might prevent any website page from being crawled and indexed is an HTTP status issue. Website pages, files, or links are supposed to return the 200 status code. If they return other HTTP status codes, your website can experience indexing and ranking issues. Let’s look at the main types of response codes that can hurt your website indexing:
- 3XX response code indicates that a redirect to another page exists. If you set a redirect from page A to page B, then page A will no longer be indexed. This is why you should review all redirects to ensure they are correctly set up.
- 4XX response code indicates that the requested page cannot be accessed. When clicking on a non-existing URL, users will see that the page is missing. This harms the website indexing, which can lead to ranking drops. Plus, internal links to 4XX pages drain your crawl budget. To get rid of unnecessary 4XX URLs, review the list of such broken links and replace them with accessible pages, or just remove some of the links. Besides, to avoid 4XX errors, you can set up 301 redirects. Keep in mind: 404 pages can exist on your website, but if a user ends up seeing them, you have to make sure that they are well-thought-out. Thus, you can minimize reputational and usability damages.
- 5XX response code means the problem was caused by the server. Pages that return such codes are inaccessible both to website visitors and to search engines. As a result, the crawler cannot crawl and index such a broken page. To fix this issue, try reproducing the server error for these URLs through the browser and check the server’s error logs. What’s more, you can consult your hosting provider or web developer since your server may be misconfigured.
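When you review a crawl export, it helps to sort URLs by the families of codes described above. The small helper below is one way to do that. The function name and the wording of the labels are made up for this example and simply restate the list above.

```python
def indexing_impact(status):
    """Describe how a response code tends to affect indexing, per the list above."""
    if 200 <= status < 300:
        return "ok: page can be crawled and indexed"
    if 300 <= status < 400:
        return "redirect: the target is indexed instead of this URL"
    if 400 <= status < 500:
        return "client error: page is inaccessible, fix or remove links to it"
    if 500 <= status < 600:
        return "server error: check server logs and hosting configuration"
    return "unexpected status code"


for code in (200, 301, 404, 503):
    print(code, indexing_impact(code))
```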
Internal linking issues
Internal links help crawlers scan websites and discover new pages. They even expedite the indexing process. Still, some issues arise when certain pages on a website lack internal links pointing to them. In these cases, search engines are unlikely to find and index these orphaned pages. While you can address this by indicating them in the XML sitemap or getting external links, internal linking should not be ignored.
Make sure that your website’s most important pages have at least a couple of internal links pointing to them.
Keep in mind: all internal links should pass link juice, meaning they should not be tagged with the rel="nofollow" attribute. After all, using internal links in a smart way can even boost your rankings.
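If you have a list of your pages and the internal links each one contains (for example, from a crawl export), finding orphaned pages is a short exercise. The sketch below assumes a hypothetical site described as a dictionary. The page paths and the function name are made up for this example.

```python
def find_orphans(links, homepage):
    """Return pages that no other page links to, excluding the homepage."""
    linked = {
        target
        for source, targets in links.items()
        for target in targets
        if target != source  # a page linking to itself does not count
    }
    return sorted(page for page in links if page not in linked and page != homepage)


# Hypothetical site: each key is a page, each value lists the pages it links to.
links = {
    "/": ["/blog/", "/pricing/"],
    "/blog/": ["/", "/blog/post-1/"],
    "/pricing/": ["/"],
    "/blog/post-1/": ["/blog/"],
    "/old-landing/": ["/"],
}
print(find_orphans(links, "/"))  # ['/old-landing/']
```

Here /old-landing/ links out to the homepage, but nothing links to it, so a crawler following internal links would never reach it.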
- Check and test your robots.txt. Make sure that all the directives are properly set up by using the robots.txt tester.
- To see if Google detects your website’s mobile pages as compatible for visitors, use the Mobile-Friendly Test.
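For the robots.txt check in particular, Python’s standard library includes a parser you can use to test rules offline. The sketch below feeds it a hypothetical robots.txt and asks whether two placeholder URLs may be fetched. Its matching rules are simpler than Google’s, so treat the result as a sanity check rather than a definitive answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
robots_txt = """\
User-agent: *
Disallow: /admin/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("Googlebot", "https://yourwebsite.com/blog/"))        # True
print(parser.can_fetch("Googlebot", "https://yourwebsite.com/admin/users"))  # False
```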
It’s important to make sure your website loads quickly. Google doesn’t like slow-loading sites, and as a result, they take longer to get indexed. The reasons can vary: for example, outdated servers with limited resources, or pages too heavy for the user’s browser to process.
The best practice is to get your website to load in less than 2 to 3 seconds. Keep in mind that Core Web Vitals metrics, which measure and evaluate the speed, responsiveness, and visual stability of websites, are Google ranking factors.
To learn more about how to improve your site’s speed, read our blog post.
You can monitor all the issues by using special SEO tools, for example, SE Ranking’s Website Audit. To find errors on the website, go to the Issue Report, which will provide you with a complete list of errors and recommendations for fixing them.
The report includes insights on issues related to:
- Website Security
- Duplicate Content
- HTTP Status Code
- Title & Description
- Website Speed
- Internal & External Links etc.
By fixing all the issues, you can improve the website indexing and increase its ranking in search results.
Getting your site crawled and indexed is essential, but it can take a while for your web pages to appear in the SERP. By having a thorough understanding of the subtleties of search engine indexing, you can avoid making detrimental mistakes that harm your website’s SEO.
If you set up and optimize your sitemap correctly, take into account technical search engine requirements, and make sure you have high-quality and useful content, Google won’t leave your website unattended.
To recap, we have covered the following aspects of search engine indexing:
- Notifying the search engine of a new website or page by means of creating a sitemap, special URL adding tools, and external links.
- Restricting site indexing with the help of the robots meta tag, .htaccess rules, and an access password.
- Common indexing errors: internal linking issues, duplicate content, slow loading pages, etc.
Keep in mind that a high indexing rate isn’t equal to high Google rankings. But it’s the basis for your further website optimization. So, before doing anything else, check your pages’ indexing status to verify that they can be indexed.