{"id":452,"date":"2017-09-21T13:49:51","date_gmt":"2017-09-21T18:49:51","guid":{"rendered":"http:\/\/bluegalaxy.info\/codewalk\/?p=452"},"modified":"2018-07-28T16:33:09","modified_gmt":"2018-07-28T21:33:09","slug":"python-using-requests-to-get-web-page-source-text","status":"publish","type":"post","link":"https:\/\/bluegalaxy.info\/codewalk\/2017\/09\/21\/python-using-requests-to-get-web-page-source-text\/","title":{"rendered":"Python: Using requests to get web page source text"},"content":{"rendered":"<p>In addition to getting page lengths and status codes using the requests module:<\/p>\n<blockquote class=\"wp-embedded-content\" data-secret=\"7acq5bjo19\"><p><a href=\"http:\/\/bluegalaxy.info\/codewalk\/2017\/09\/21\/python-using-requests-to-get-web-page-lengths-and-status-codes\/\">Python: Using requests to get web page lengths and status codes<\/a><\/p><\/blockquote>\n<p><iframe loading=\"lazy\" class=\"wp-embedded-content\" sandbox=\"allow-scripts\" security=\"restricted\" style=\"position: absolute; clip: rect(1px, 1px, 1px, 1px);\" src=\"http:\/\/bluegalaxy.info\/codewalk\/2017\/09\/21\/python-using-requests-to-get-web-page-lengths-and-status-codes\/embed\/#?secret=7acq5bjo19\" data-secret=\"7acq5bjo19\" width=\"600\" height=\"338\" title=\"&#8220;Python: Using requests to get web page lengths and status codes&#8221; &#8212; Chris Nielsen Code Walk\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\"><\/iframe><\/p>\n<p>You can also use <code class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">requests<\/code> to return the source code of web pages. 
For example:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">import requests\r\n\r\nsites = [\r\n    'http:\/\/www.python.org',\r\n    'http:\/\/www.jython.org',\r\n    'http:\/\/www.pypy.org',\r\n    'http:\/\/www.drudgereport.com',\r\n    'http:\/\/www.phys.org',\r\n    'http:\/\/www.bluegalaxy.info',\r\n    'http:\/\/www.bluegalaxy.info\/codewalk'\r\n]\r\n\r\nfor url in sites:\r\n    r = requests.get(url)\r\n    page_source = r.text\r\n    page_source = page_source.split('\\n')\r\n\r\n    print(\"\\nURL:\", url) \r\n    print(\"--------------------------------------\")\r\n    # print the first five lines of the page source\r\n    for row in page_source[:5]:\r\n        print(row)\r\n    print(\"--------------------------------------\")<\/pre>\n<p>This works by reading the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">.text<\/code> attribute of the response object, which holds the entire decoded page source as a single string. The code above splits that string on newlines and prints the first five lines of each page:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">URL: http:\/\/www.python.org\r\n--------------------------------------\r\n&lt;!doctype html&gt;\r\n&lt;!--[if lt IE 7]&gt;   &lt;html class=\"no-js ie6 lt-ie7 lt-ie8 lt-ie9\"&gt;   &lt;![endif]--&gt;\r\n&lt;!--[if IE 7]&gt;      &lt;html class=\"no-js ie7 lt-ie8 lt-ie9\"&gt;          &lt;![endif]--&gt;\r\n&lt;!--[if IE 8]&gt;      &lt;html class=\"no-js ie8 lt-ie9\"&gt;                 &lt;![endif]--&gt;\r\n&lt;!--[if gt IE 8]&gt;&lt;!--&gt;&lt;html class=\"no-js\" lang=\"en\" dir=\"ltr\"&gt;  &lt;!--&lt;![endif]--&gt;\r\n--------------------------------------\r\n\r\nURL: http:\/\/www.jython.org\r\n--------------------------------------\r\n&lt;?xml version=\"1.0\" encoding=\"utf-8\" ?&gt;\r\n&lt;!DOCTYPE html PUBLIC \"-\/\/W3C\/\/DTD XHTML 1.0 Transitional\/\/EN\" \"http:\/\/www.w3.org\/TR\/xhtml1\/DTD\/xhtml1-transitional.dtd\"&gt;\r\n&lt;html xmlns=\"http:\/\/www.w3.org\/1999\/xhtml\" xml:lang=\"en\" 
lang=\"en\"&gt;\r\n&lt;head&gt;\r\n&lt;meta http-equiv=\"Content-Type\" content=\"text\/html; charset=utf-8\" \/&gt;\r\n--------------------------------------\r\n\r\nURL: http:\/\/www.pypy.org\r\n--------------------------------------\r\n&lt;!DOCTYPE html&gt;\r\n&lt;html&gt;\r\n&lt;head&gt;\r\n        &lt;title&gt;PyPy - Welcome to PyPy&lt;\/title&gt;\r\n        &lt;meta http-equiv=\"content-language\" content=\"en\" \/&gt;\r\n--------------------------------------\r\n\r\nURL: http:\/\/www.drudgereport.com\r\n--------------------------------------\r\n&lt;title&gt;DRUDGE REPORT 2017&amp;#174;&lt;\/title&gt;\r\n&lt;!-- Start Quantcast tag --&gt;\r\n&lt;script type=\"text\/javascript\" src=\"http:\/\/edge.quantserve.com\/quant.js\"&gt;&lt;\/script&gt;\r\n&lt;script type=\"text\/javascript\"&gt;_qacct=\"p-e2qh6t-Out2Ug\";quantserve();&lt;\/script&gt;\r\n&lt;noscript&gt;\r\n--------------------------------------\r\n\r\nURL: http:\/\/www.phys.org\r\n--------------------------------------\r\n&lt;html&gt;&lt;body&gt;&lt;h1&gt;400 Bad request&lt;\/h1&gt;\r\nYour browser sent an invalid request.\r\n&lt;\/body&gt;&lt;\/html&gt;\r\n\r\n\r\n--------------------------------------\r\n\r\nURL: http:\/\/www.bluegalaxy.info\r\n--------------------------------------\r\n&lt;!-- \/\/CSS --&gt;\r\n&lt;style type=\"text\/css\"&gt;\r\n   body {\r\n       margin:0; padding:0;\r\n   }\r\n--------------------------------------\r\n\r\nURL: http:\/\/www.bluegalaxy.info\/codewalk\r\n--------------------------------------\r\n&lt;!DOCTYPE html&gt;\r\n&lt;html lang=\"en-US\"&gt;\r\n&lt;head&gt;\r\n&lt;meta charset=\"UTF-8\"&gt;\r\n&lt;meta name=\"viewport\" content=\"width=device-width, initial-scale=1\"&gt;\r\n--------------------------------------<\/pre>\n<p>For more information about the requests module see:<br \/>\n<a href=\"http:\/\/docs.python-requests.org\/en\/master\/\">http:\/\/docs.python-requests.org\/en\/master\/<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In addition to 
getting page lengths and status codes using the request method: Python: Using requests to get web page lengths and status codes You can also use requests to return the source code of web pages. For example: import requests sites = [ &#8216;http:\/\/www.python.org&#8217;, &#8216;http:\/\/www.jython.org&#8217;, &#8216;http:\/\/www.pypy.org&#8217;, &#8216;http:\/\/www.drudgereport.com&#8217;, &#8216;http:\/\/www.phys.org&#8217;, &#8216;http:\/\/www.bluegalaxy.info&#8217;, &#8216;http:\/\/www.bluegalaxy.info\/codewalk&#8217; ] for url in &hellip; <a href=\"https:\/\/bluegalaxy.info\/codewalk\/2017\/09\/21\/python-using-requests-to-get-web-page-source-text\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Python: Using requests to get web page source text<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[22,33],"tags":[4,34],"class_list":["post-452","post","type-post","status-publish","format-standard","hentry","category-python-language","category-python-web-scraping","tag-python","tag-requests"],"_links":{"self":[{"href":"https:\/\/bluegalaxy.info\/codewalk\/wp-json\/wp\/v2\/posts\/452","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/bluegalaxy.info\/codewalk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/bluegalaxy.info\/codewalk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/bluegalaxy.info\/codewalk\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/bluegalaxy.info\/codewalk\/wp-json\/wp\/v2\/comments?post=452"}],"version-history":[{"count":3,"href":"https:\/\/bluegalaxy.info\/codewalk\/wp-json\/wp\/v2\/posts\/452\/revisions"}],"predecessor-version":[{"id":455,"href":"https:\/\/bluegalaxy.info\/codewalk\/wp-json\/wp\/v2\/posts\/452\/revisions\/455"}],"wp:attachment":[{"href":"https:\/\/bluegalaxy.info\/code
walk\/wp-json\/wp\/v2\/media?parent=452"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/bluegalaxy.info\/codewalk\/wp-json\/wp\/v2\/categories?post=452"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/bluegalaxy.info\/codewalk\/wp-json\/wp\/v2\/tags?post=452"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}