Python: Using requests to get web page source text

In addition to getting page lengths and status codes with the requests module:

Python: Using requests to get web page lengths and status codes

You can also use requests to return the source code of web pages. For example:

import requests

sites = [
    'http://www.python.org',
    'http://www.jython.org',
    'http://www.pypy.org',
    'http://www.drudgereport.com',
    'http://www.phys.org',
    'http://www.bluegalaxy.info',
    'http://www.bluegalaxy.info/codewalk'
]

for url in sites:
    # requests.get() returns a Response object
    r = requests.get(url)
    # .text is the response body decoded to a string
    page_source = r.text
    page_source = page_source.split('\n')

    print("\nURL:", url)
    print("--------------------------------------")
    # print the first five lines of the page source
    for row in page_source[:5]:
        print(row)
    print("--------------------------------------")

This is done by reading the .text attribute of the response object returned by requests.get(), which holds the entire page source decoded into a string. The above code yields:

URL: http://www.python.org
--------------------------------------
<!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->
--------------------------------------

URL: http://www.jython.org
--------------------------------------
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
--------------------------------------

URL: http://www.pypy.org
--------------------------------------
<!DOCTYPE html>
<html>
<head>
        <title>PyPy - Welcome to PyPy</title>
        <meta http-equiv="content-language" content="en" />
--------------------------------------

URL: http://www.drudgereport.com
--------------------------------------
<title>DRUDGE REPORT 2017&#174;</title>
<!-- Start Quantcast tag -->
<script type="text/javascript" src="http://edge.quantserve.com/quant.js"></script>
<script type="text/javascript">_qacct="p-e2qh6t-Out2Ug";quantserve();</script>
<noscript>
--------------------------------------

URL: http://www.phys.org
--------------------------------------
<html><body><h1>400 Bad request</h1>
Your browser sent an invalid request.
</body></html>


--------------------------------------

URL: http://www.bluegalaxy.info
--------------------------------------
<!-- //CSS -->
<style type="text/css">
   body {
       margin:0; padding:0;
   }
--------------------------------------

URL: http://www.bluegalaxy.info/codewalk
--------------------------------------
<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
--------------------------------------
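
Note that the phys.org request above came back as a "400 Bad request" error page, and its source was printed just like any other page. If you would rather skip pages that fail, one option is to check the response before reading .text. The following is only a rough sketch: the helper names fetch_source and first_lines are made up for this example, and the 10 second timeout is an arbitrary choice.

```python
import requests


def fetch_source(url, timeout=10):
    """Return the decoded page source for url, or None if the request fails.

    Hypothetical helper for illustration; the timeout value is an assumption.
    """
    try:
        r = requests.get(url, timeout=timeout)
        # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
        r.raise_for_status()
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, and the HTTPError raised above
        print("Could not fetch", url, "-", exc)
        return None
    return r.text


def first_lines(page_source, count=5):
    """Return the first count lines of page_source as a list of strings."""
    # splitlines() handles both \n and \r\n line endings
    return page_source.splitlines()[:count]
```

Inside the loop over sites, you could then call fetch_source(url) and, when the result is not None, print each row returned by first_lines(). With this approach a page that returns an error status is reported and skipped rather than printed.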

For more information about the requests module see:
http://docs.python-requests.org/en/master/
