In addition to getting page lengths and status codes with the requests module, as covered in:
Python: Using requests to get web page lengths and status codes
you can also use requests to retrieve the source code of web pages. For example:
import requests

sites = [
    'http://www.python.org',
    'http://www.jython.org',
    'http://www.pypy.org',
    'http://www.drudgereport.com',
    'http://www.phys.org',
    'http://www.bluegalaxy.info',
    'http://www.bluegalaxy.info/codewalk'
]

for url in sites:
    # fetch the page; r is a Response object
    r = requests.get(url)
    # r.text holds the decoded page source as one string
    page_source = r.text
    page_source = page_source.split('\n')
    print("\nURL:", url)
    print("--------------------------------------")
    # print the first five lines of the page source
    for row in page_source[:5]:
        print(row)
    print("--------------------------------------")
This is done by reading the .text attribute of the response object returned by requests.get(), which holds the entire page source decoded into a single string. The above code yields:
URL: http://www.python.org
--------------------------------------
<!doctype html>
<!--[if lt IE 7]> <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9"> <![endif]-->
<!--[if IE 7]> <html class="no-js ie7 lt-ie8 lt-ie9"> <![endif]-->
<!--[if IE 8]> <html class="no-js ie8 lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr"> <!--<![endif]-->
--------------------------------------
URL: http://www.jython.org
--------------------------------------
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
--------------------------------------
URL: http://www.pypy.org
--------------------------------------
<!DOCTYPE html>
<html>
<head>
<title>PyPy - Welcome to PyPy</title>
<meta http-equiv="content-language" content="en" />
--------------------------------------
URL: http://www.drudgereport.com
--------------------------------------
<title>DRUDGE REPORT 2017®</title>
<!-- Start Quantcast tag -->
<script type="text/javascript" src="http://edge.quantserve.com/quant.js"></script>
<script type="text/javascript">_qacct="p-e2qh6t-Out2Ug";quantserve();</script>
<noscript>
--------------------------------------
URL: http://www.phys.org
--------------------------------------
<html><body><h1>400 Bad request</h1>
Your browser sent an invalid request.
</body></html>
--------------------------------------
URL: http://www.bluegalaxy.info
--------------------------------------
<!-- //CSS -->
<style type="text/css">
body {
margin:0; padding:0;
}
--------------------------------------
URL: http://www.bluegalaxy.info/codewalk
--------------------------------------
<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
--------------------------------------
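Note that the phys.org entry above is an error page: requests.get() returns normally even when the server answers with a 400 status, so the source you print may not be the page you wanted. The following is a minimal sketch of one way to check for that before printing. The helper name preview_source and the SimpleNamespace stand-in are illustrative assumptions, not part of the requests module; the stand-in only mimics the .status_code and .text attributes so the example runs without network access. It also uses splitlines() rather than split('\n'), which avoids leaving a stray '\r' at the end of each line when a server sends '\r\n' line endings.

```python
from types import SimpleNamespace


def preview_source(response, max_lines=5):
    """Return (ok, lines) for a response-like object.

    `response` is anything with `.status_code` and `.text` attributes,
    such as the object returned by requests.get(). `ok` is True for a
    2xx status; `lines` holds at most `max_lines` lines of the body.
    """
    ok = 200 <= response.status_code < 300
    # splitlines() splits on "\n", "\r\n" and "\r" alike
    lines = response.text.splitlines()[:max_lines]
    return ok, lines


# Hypothetical stand-in for a real response (no network needed)
fake = SimpleNamespace(
    status_code=400,
    text=(
        "<html><body><h1>400 Bad request</h1>\r\n"
        "Your browser sent an invalid request.\r\n"
        "</body></html>\r\n"
    ),
)

ok, lines = preview_source(fake)
print("OK" if ok else "Error", fake.status_code)
for row in lines:
    print(row)
```

In the loop shown earlier, the same helper could be called as preview_source(requests.get(url)), skipping or flagging any URL for which ok is False.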
For more information about the requests module, see:
http://docs.python-requests.org/en/master/