How To Scrape Google Cache With A Python Script

I was curious how one could scrape Google's cache to recover a website that was recently taken down. Say, for instance, you're a real estate agent and your website was terminated by your previous hosting company.

Guy Rutenberg wrote a great script in his blog post titled “Retrieving Google’s Cache for a Whole Website” back in 2008, and it has since been revised by curious Python programmers.

The latest revision was done by Thang Pham. Let's look over the code real quick.

I fired up an Amazon EC2 instance, placed the Python script in ~/python, and let it run for about an hour. I am not sure whether Amazon or Google will object, but eventually Google will block the IP and you'll get a 503 error, so keep an eye on the output. You can always run the script again after the IP block is lifted, and it will resume where it left off.
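To make the block-then-retry idea concrete, here is a minimal sketch of pausing when a 503 comes back. It is written for Python 3 (the script in this post is Python 2), and the function name `fetch_with_backoff`, the retry count, and the growing delay are my own assumptions, not part of the original script.

```python
import random
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, opener=urllib.request.urlopen, max_tries=3,
                       sleep=time.sleep):
    """Fetch url, pausing and retrying whenever the server answers 503.

    `opener` and `sleep` are injectable so the logic can be tried offline.
    Returns the response body, or None if every attempt was blocked.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return opener(url).read()
        except urllib.error.HTTPError as err:
            if err.code != 503:
                raise  # some other failure: let the caller see it
            # Assumed policy: wait longer after each blocked attempt.
            delay = random.randrange(15, 20) * attempt
            print("got 503, sleeping for %d seconds" % delay)
            sleep(delay)
    return None
```

If this returns `None`, the IP is probably still blocked, and it is better to stop and rerun later than to keep hammering the server.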

TL;DR: Change search_site (line 19 of the original script) to your target site. Then, in the path assignment (line 48), change the backslash separator to whatever suits your destination directory. I used '/'.
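If you would rather not hand-edit the separator, one option is to let `os.path.join` pick it. This is only a sketch: `destination_dir` is a hypothetical helper and `example.com` is a placeholder, neither comes from the original script.

```python
import os

search_site = "example.com"  # placeholder, not a value from the original script


def destination_dir(base_dir, site):
    """Join the base directory and the site name with the OS's own separator."""
    return os.path.join(base_dir, site)


# Save next to wherever the script is run from.
path = destination_dir(os.getcwd(), search_site)
print(path)
```

The same code then works on both Windows and Linux without touching the separator.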

[py]#Retrieve an old website from Google Cache. Optimized with sleep time to avoid the 504 error (Google blocks an IP that sends many requests).
#Programmer: Kien Nguyen – QTPros
#change search_site and search_term to match your requirement

import urllib, urllib2
import re
import socket
import os, errno, os.path
import time
import random, math
#import MySQLdb
import imp

#adjust the site here
search_site = ""  #set this to your target site
search_term = "site:" + search_site

#mysql = imp.load_source("MySQLConnector", "").MySQLConnector()

def mkdir_p(path):
    #create the directory, ignoring the error if it already exists
    try:
        os.makedirs(path)
    except OSError as exc: # Python >2.5
        if exc.errno == errno.EEXIST:
            pass
        else: raise

def main():
    headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv: Gecko/20070515 Firefox/'}
    #put the Google search URL prefix inside the quotes
    url = "" + search_term

    regex_cache = re.compile(r'<a href="([^"]*)"[^>]*>Cached</a>')
    regex_next = re.compile(r'<a href="([^"]*)"[^>]*><span[^>]*>[^<]*</span><span[^>]*>Next</span></a>')
    regex_url = re.compile(r'search\?q=cache:[\d\w-]+:([^%]*)')
    # regex_title = re.compile(r'<title>([\w\W]+)</title>')
    # regex_time = re.compile(r'page as it appeared on ([\d\w\s:]+)')
    regex_pagenum = re.compile(r'<a href="([^"]*)"[^>]*><span[^>]*>[^<]*</span>([\d]+)')

    #this is the directory we will save files to
    path = os.path.dirname(os.path.abspath(__file__)) + '\\' + search_site
    # path = os.path.dirname(os.path.abspath(__file__))
    mkdir_p(path)
    counter = 0
    #resume: 10 results per page, so skip the pages already saved
    pagenum = int(math.floor(len([name for name in os.listdir(path)]) / 10) + 1)
    max_goto = 0
    more = True
    if (pagenum > 1):
        while (max_goto < pagenum):
            req = urllib2.Request(url, None, headers)
            page = urllib2.urlopen(req).read()
            goto = regex_pagenum.findall(page)
            # print goto
            for goto_url, goto_pagenum in goto:
                goto_pagenum = int(goto_pagenum)
                if (goto_pagenum == pagenum):
                    url = "" + goto_url.replace('&amp;', '&')
                    max_goto = pagenum
                elif (goto_pagenum < pagenum and max_goto < goto_pagenum):
                    max_goto = goto_pagenum
                    url = "" + goto_url.replace('&amp;', '&')
            random_interval = random.randrange(5, 20, 1)
            print "sleeping for: " + str(random_interval) + " seconds"
            print "going to page: " + str(max_goto)
            print url
            time.sleep(random_interval)

    while (more):
        #Send search request to google with pre-defined headers
        req = urllib2.Request(url, None, headers)
        #open the response page
        page = urllib2.urlopen(req).read()
        #find all cache in the page
        matches = regex_cache.findall(page)
        #loop through the matches
        for match in matches:
            counter += 1
            #find the url of the page cached by google
            the_url = regex_url.findall(match)
            the_url = the_url[0]
            the_url = the_url.replace('http://', '')
            the_url = the_url.strip('/')
            the_url = the_url.replace('/', '-')
            #if href doesn't start with http insert http before
            if not match.startswith("http"):
                match = "http:" + match
            if (not the_url.endswith('html')):
                the_url = the_url + ".html"
            #if filename "$url"[.html] does not exists
            if not os.path.exists(search_site + "/" + the_url):
                tmp_req = urllib2.Request(match.replace('&amp;', '&'), None, headers)
                try:
                    tmp_page = urllib2.urlopen(tmp_req).read()
                    f = open(search_site + "/" + the_url, 'w')
                    f.write(tmp_page)
                    f.close()
                    print counter, ": " + the_url
                    #comment out the code below if you expect to crawl less than 50 pages
                    random_interval = random.randrange(15, 20, 1)
                    print "sleeping for: " + str(random_interval) + " seconds"
                    time.sleep(random_interval)
                except urllib2.HTTPError, e:
                    print 'Error code: ', e.code
        #now check if there is more pages
        match = regex_next.search(page)
        if match == None:
            more = False
        else:
            url = "" + match.group(1).replace('&amp;', '&')

if __name__ == "__main__":
    main()
[/py]

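Two steps in the script are easier to follow in isolation: how a cached page's URL becomes a file name, and how the script decides which results page to resume from. The sketch below restates them as small Python 3 functions. The names `cache_filename` and `resume_page` are mine, and the 10-results-per-page figure is taken from the script's division by 10.

```python
import math


def cache_filename(page_url):
    """Mirror the script's naming steps for one cached page URL."""
    name = page_url.replace("http://", "")
    name = name.strip("/")
    name = name.replace("/", "-")
    if not name.endswith("html"):
        name += ".html"
    return name


def resume_page(saved_files, results_per_page=10):
    """Results page to resume from, given how many files are already saved."""
    return int(math.floor(saved_files / results_per_page) + 1)


print(cache_filename("http://example.com/about/team/"))  # example.com-about-team.html
print(resume_page(25))  # 3
```

Because the resume point is derived only from the number of files in the destination directory, any unrelated files in that directory will throw the page count off.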
Thanks to Guy Rutenberg and Thang Pham for this great Python script! You're both lifesavers!
