SEO Tip #94: How Does Google Find Pages That Don’t Have Any Links To Them?
Matt Cutts: You almost threw me for a loop with this question. I read all the way through and was ready to give my answer, but then the last sentence completely changed it, so I want to answer it both ways.
Let me start with the beginning of your question: how Google can index things even if there aren’t any links pointing to a particular page. People can always submit a URL or something like that, but a lot of people don’t realize how many links there are just sort of floating around on the web.
It could be that somebody is linking to a page on your site and you just don’t realize it.
We can follow a link from a very obscure, esoteric page to a deep page on your own site. Because we only return a sub-sample of all the links we know about when you do a link: search on a particular URL, we might know about a link that you don’t know about.
That’s how I started to answer your question, and then you said, “The pages are generated by the search field of my website.” That completely changes the nature of the question. In April of 2008, Jayant Madhavan and Alon Halevy did a blog post where they talked about crawling through HTML forms. They later got it published as a paper.
The basic idea is that in some cases, whenever we see a search form, Google can try to fill out that form as long as it is simple enough. Suppose, for example, you have the main root page of your website and you can’t get to any other page on your site except through a drop-down. Googlebot can enumerate the values in that drop-down; maybe it’s the 50 states in the United States. We can try submitting the form with the state set to Kentucky, or the state set to California. That will open up things for us to discover and crawl. That is how we can crawl through a search form.
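To make the drop-down example concrete, here is a minimal Python sketch of the idea: read a form, enumerate the values in its single drop-down, and build one URL per value. This is only an illustration and not Googlebot’s actual logic. The page markup, the example.com address, the /search action, the names SelectOptionParser and candidate_urls, and the rule of only handling GET forms with exactly one drop-down are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class SelectOptionParser(HTMLParser):
    """Collects the form action/method and the option values of each <select>."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"
        self.selects = {}      # select name -> list of non-empty option values
        self._current = None   # name of the <select> we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "select" and attrs.get("name"):
            self._current = attrs["name"]
            self.selects[self._current] = []
        elif tag == "option" and self._current and attrs.get("value"):
            # Empty values such as a "Choose a state" placeholder are skipped.
            self.selects[self._current].append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._current = None


def candidate_urls(page_url, html):
    """Return one URL per drop-down value, or [] if the form is not simple enough."""
    parser = SelectOptionParser()
    parser.feed(html)
    # Assumption for this sketch: only GET forms with exactly one drop-down qualify.
    if parser.method != "get" or len(parser.selects) != 1:
        return []
    name, values = next(iter(parser.selects.items()))
    base = urljoin(page_url, parser.action)
    return [f"{base}?{urlencode({name: value})}" for value in values]


# Hypothetical root page whose only way in is a state drop-down.
PAGE = """
<form action="/search" method="get">
  <select name="state">
    <option value="">Choose a state</option>
    <option value="Kentucky">Kentucky</option>
    <option value="California">California</option>
  </select>
</form>
"""

urls = candidate_urls("https://example.com/", PAGE)
for url in urls:
    print(url)
```

Each URL produced this way is an ordinary link that a crawler could fetch like any other page, which is why a simple form can expose content that nothing links to directly.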
Now, in general we don’t crawl through a ton of search forms because they can be very complex. Sometimes they want credit card numbers, and Googlebot is very broke and doesn’t have a credit card number. But in some situations, where there might be only one or two input elements, we do have the ability to try submitting that form to find new content.
Now, if that’s something that you’re not interested in, maybe you don’t want those pages crawled, you can always use robots.txt and add a Disallow: /search rule, or whatever the path is that you land on when you submit the search form. We try to be polite.
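As a rough illustration of that robots.txt rule, the Python sketch below parses a two-line robots.txt with the standard library’s urllib.robotparser and checks which URLs a crawler identifying as Googlebot would be allowed to fetch. The /search path, the example.com URLs, and the choice to apply the rule to every user agent are assumptions for the example; substitute whatever path your own search form submits to.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all crawlers out of the search results area.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# can_fetch returns True when the URL is allowed for the given user agent.
search_allowed = rules.can_fetch("Googlebot", "https://example.com/search?state=Kentucky")
about_allowed = rules.can_fetch("Googlebot", "https://example.com/about")

print("search results page allowed:", search_allowed)
print("about page allowed:", about_allowed)
```

Because the rule is a path prefix, it covers every URL the form can generate under /search while leaving the rest of the site open to crawling.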
You can read more about it if you search for Googlebot crawl through HTML forms or something like that. You can read about the forms we will and won’t crawl through, but it’s all part of the process where we try to find as much of the web as possible and crawl it as comprehensively as we can, so that we can return it to you in half a second.