In addition to getting page lengths and status codes with the requests module, as described in:
Python: Using requests to get web page lengths and status codes
you can also use requests to return the source code of web pages. For example:
import requests

sites = [
    'http://www.python.org',
    'http://www.jython.org',
    'http://www.pypy.org',
    'http://www.drudgereport.com',
    'http://www.phys.org',
    'http://www.bluegalaxy.info',
    'http://www.bluegalaxy.info/codewalk'
]

for url in sites:
    r = requests.get(url)
    page_source = r.text
    page_source = page_source.split('\n')
    print("\nURL:", url)
    print("--------------------------------------")
    # print the first five lines of the page source
    for row in page_source[:5]:
        print(row)
    print("--------------------------------------")
This is done by reading the .text attribute of the response object returned by requests.get(), which holds the full page source decoded as a string. The above code yields:
URL: http://www.python.org
--------------------------------------
<!doctype html>
<!--[if lt IE 7]> <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9"> <![endif]-->
<!--[if IE 7]> <html class="no-js ie7 lt-ie8 lt-ie9"> <![endif]-->
<!--[if IE 8]> <html class="no-js ie8 lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr"> <!--<![endif]-->
--------------------------------------

URL: http://www.jython.org
--------------------------------------
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
--------------------------------------

URL: http://www.pypy.org
--------------------------------------
<!DOCTYPE html>
<html>
<head>
<title>PyPy - Welcome to PyPy</title>
<meta http-equiv="content-language" content="en" />
--------------------------------------

URL: http://www.drudgereport.com
--------------------------------------
<title>DRUDGE REPORT 2017®</title>
<!-- Start Quantcast tag -->
<script type="text/javascript" src="http://edge.quantserve.com/quant.js"></script>
<script type="text/javascript">_qacct="p-e2qh6t-Out2Ug";quantserve();</script>
<noscript>
--------------------------------------

URL: http://www.phys.org
--------------------------------------
<html><body><h1>400 Bad request</h1>
Your browser sent an invalid request.
</body></html>
--------------------------------------

URL: http://www.bluegalaxy.info
--------------------------------------
<!-- //CSS -->
<style type="text/css">
body { margin:0; padding:0; }
--------------------------------------

URL: http://www.bluegalaxy.info/codewalk
--------------------------------------
<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
--------------------------------------
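Note that the phys.org entry in the output is an error page: .text holds whatever body the server sent back, even when the status code is 400. Here is a minimal sketch of one way to skip such responses before printing the source. The helper names first_lines and fetch_preview are hypothetical (they are not part of requests or of the code above), and the 10 second timeout is an arbitrary assumption:

```python
def first_lines(page_source, n=5):
    """Return the first n lines of a page source string.

    str.splitlines() handles both Unix and Windows line endings.
    """
    return page_source.splitlines()[:n]


def fetch_preview(url, n=5, timeout=10):
    """Return the first n lines of a page, or None if the request fails.

    Hypothetical helper for illustration; the timeout value is an assumption.
    """
    # Imported here so first_lines() can be used without requests installed.
    import requests

    try:
        r = requests.get(url, timeout=timeout)
        # Raises requests.HTTPError for 4xx and 5xx status codes.
        r.raise_for_status()
    except requests.RequestException as err:
        print("Skipping", url, "-", err)
        return None
    return first_lines(r.text, n)


# Offline demonstration of the line-splitting helper on a made-up string.
sample = "<!DOCTYPE html>\r\n<html>\n<head>\n<title>Example</title>\n"
preview = first_lines(sample, 3)
print(preview)
```

In the loop above, calling fetch_preview(url) in place of requests.get(url) would print a short message for a site that returns an error status and move on, rather than printing the error page as if it were the site's source.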
For more information about the requests module see:
http://docs.python-requests.org/en/master/