{"id":2243,"date":"2018-07-28T19:51:57","date_gmt":"2018-07-29T00:51:57","guid":{"rendered":"http:\/\/bluegalaxy.info\/codewalk\/?p=2243"},"modified":"2018-07-28T20:00:45","modified_gmt":"2018-07-29T01:00:45","slug":"python-open-read-docx-document","status":"publish","type":"post","link":"https:\/\/bluegalaxy.info\/codewalk\/2018\/07\/28\/python-open-read-docx-document\/","title":{"rendered":"Python: How to open and read a .docx document"},"content":{"rendered":"<p>To open and read a Word (.docx) file in Python, we need to import the docx module. If you don&#8217;t have the docx module, it can be installed with pip: <code>pip install docx<\/code>. The examples in this post are written for Python 2. Specifically, we will use the opendocx and getdocumenttext functions of the docx module:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">from docx import opendocx, getdocumenttext<\/pre>\n<p>The opendocx() function will be used to open the Word document, and the getdocumenttext() function will be used to extract plain text from the document.<\/p>\n<p>Note: This will only extract plain text. It will not extract images or any other Word features or styles.<\/p>\n<p>Example of use:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">wordDoc = r\"C:\\Users\\Chris\\Desktop\\python-blog\\quantum-tunneling.docx\"\r\n\r\ndocument = opendocx(wordDoc)\r\nextractedText = getdocumenttext(document)\r\n<\/pre>\n<p>Now that we have extracted the text from the document, we can do whatever we want with it. The extracted text is captured in a list, so we can access text by index or iterate through all of it.<\/p>\n<p>Note: The extracted text will not necessarily be divided into individual sentences. 
It can be blocks of sentences, depending on how the paragraphs are set up in the Word document.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">print \"Data type:\", type(extractedText)\r\nprint \"Length of data:\", len(extractedText)\r\n\r\n# Print the first line of the Word document\r\nprint \"First line:\\n\", extractedText[0]<\/pre>\n<p>which yields:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"raw\">Data type: &lt;type 'list'&gt;\r\nLength of data: 4\r\nFirst line:\r\nThe quantum tunnelling effect is, as the name suggests, a quantum phenomenon which occurs when particles move through a barrier that should be impossible to move through according to classical physics. The barrier can be a physically impassable medium, like an insulator or a vacuum, or it can be a region of high potential energy.<\/pre>\n<p>This shows that there are 4 blocks of text extracted from the document. Here is what the Word document looks like:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-2248\" title=\"Text from: https:\/\/www.azoquantum.com\/article.aspx?ArticleId=12\" src=\"http:\/\/bluegalaxy.info\/codewalk\/wp-content\/uploads\/2018\/07\/word-doc.png\" alt=\"\" width=\"600\" height=\"782\" srcset=\"https:\/\/bluegalaxy.info\/codewalk\/wp-content\/uploads\/2018\/07\/word-doc.png 1128w, https:\/\/bluegalaxy.info\/codewalk\/wp-content\/uploads\/2018\/07\/word-doc-230x300.png 230w, https:\/\/bluegalaxy.info\/codewalk\/wp-content\/uploads\/2018\/07\/word-doc-768x1002.png 768w, https:\/\/bluegalaxy.info\/codewalk\/wp-content\/uploads\/2018\/07\/word-doc-785x1024.png 785w, https:\/\/bluegalaxy.info\/codewalk\/wp-content\/uploads\/2018\/07\/word-doc-676x882.png 676w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/p>\n<p>Now that we have extracted all of the text from the Word document, we can do some analysis. 
For example, let&#8217;s say we want to get a count of unique words and capture all of the unique words in the document. Here are a couple of functions I wrote for this:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">from docx import opendocx, getdocumenttext\r\n\r\nwordDoc = r\"C:\\Users\\Chris\\Desktop\\python-blog\\quantum-tunneling.docx\"\r\n\r\ndocument = opendocx(wordDoc)\r\nextractedText = getdocumenttext(document)\r\ndictionary = []\r\n\r\ndef createWordList():\r\n    \"\"\"Count the number of unique words and list every unique word in the document.\"\"\"\r\n    count = 0\r\n    for string in extractedText:\r\n        # Split each block of text into whitespace-separated words\r\n        s = string.split()\r\n        for p in s:\r\n            p = stripPunctuation(p)\r\n            # Record the word only the first time it is seen (case-sensitive)\r\n            if p not in dictionary:\r\n                dictionary.append(p)\r\n                count += 1\r\n    dictionary.sort()\r\n    print \"The unique word count is\", count\r\n    print dictionary\r\n\r\n\r\ndef stripPunctuation(s):\r\n    \"\"\"Remove one trailing period and then one trailing comma, if present.\"\"\"\r\n    if s.endswith('.'):\r\n        s = s[:-1]\r\n    if s.endswith(','):\r\n        s = s[:-1]\r\n    return s\r\n\r\ncreateWordList()<\/pre>\n<p>This produces the output below. The comparison is case-sensitive, so words such as <code>The<\/code> and <code>the<\/code> are counted separately:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"raw\">The unique word count is 117\r\n[u'-', u'Credit:', u'If', u'Image', u'In', u'On', u'The', u'This', u'Wikipedia', u'a', u'abruptly', u'according', u'across', u'amplitude', u'an', u'as', u'barrier', u'be', u'behave', u'by', u'can', u'classical', u'coefficient', u'corresponds', u'current', u'decays', u'decrease', u'defined', u'density', u'divided', u'drop', u'effect', u'emerging', u'encountering', u'end', u'energy', u'enough', u'exponentially', u'finding', u'finite', u'from', u'further', u'has', u'have', u'high', u'higher', u'however', u'if', u'impassable', u'impossible', u'in', u'incident', u'insufficient', u'insulator', u'into', u'is', u'it', u'its', u'like', 
u'likelihood', u'look', u'may', u'mechanics', u'medium', u'move', u'name', u'narrow', u'non-zero', u'not', u'occurs', u'of', u'often', u'on', u'or', u'other', u'overcome', u'particle', u'particles', u'phenomenon', u'physically', u'physics', u'potential', u'probability', u'quantum', u'ratio', u'region', u'regions', u'should', u'side', u'simply', u'so', u'some', u'suggests', u'than', u'that', u'the', u'there', u'thin', u'this', u'through', u'to', u'transmission', u'tunnel', u'tunneling', u'tunnelling', u'vacuum', u'value', u'wave', u\"wave's\", u'waves', u'when', u'where', u'which', u'will', u\"won't\", u'world', u'you']<\/pre>\n<p>&nbsp;<\/p>\n<p>For more information about the docx module, see:<br \/>\n<a href=\"https:\/\/pypi.org\/project\/docx\/\">https:\/\/pypi.org\/project\/docx\/<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>To open and read a Word (.docx) file in Python, we need to import the docx module. If you don&#8217;t have the docx module, it can be downloaded with pip. 
Specifically, we will use the opendocx and getdocumenttext functions of the docx module: from docx import opendocx, getdocumenttext The opendocx() method will be used to &hellip; <a href=\"https:\/\/bluegalaxy.info\/codewalk\/2018\/07\/28\/python-open-read-docx-document\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Python: How to open and read a .docx document<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[22],"tags":[164,4],"class_list":["post-2243","post","type-post","status-publish","format-standard","hentry","category-python-language","tag-docx","tag-python"],"_links":{"self":[{"href":"https:\/\/bluegalaxy.info\/codewalk\/wp-json\/wp\/v2\/posts\/2243","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/bluegalaxy.info\/codewalk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/bluegalaxy.info\/codewalk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/bluegalaxy.info\/codewalk\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/bluegalaxy.info\/codewalk\/wp-json\/wp\/v2\/comments?post=2243"}],"version-history":[{"count":9,"href":"https:\/\/bluegalaxy.info\/codewalk\/wp-json\/wp\/v2\/posts\/2243\/revisions"}],"predecessor-version":[{"id":2254,"href":"https:\/\/bluegalaxy.info\/codewalk\/wp-json\/wp\/v2\/posts\/2243\/revisions\/2254"}],"wp:attachment":[{"href":"https:\/\/bluegalaxy.info\/codewalk\/wp-json\/wp\/v2\/media?parent=2243"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/bluegalaxy.info\/codewalk\/wp-json\/wp\/v2\/categories?post=2243"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/bluegalaxy.info\/codewalk\/wp-json\/wp\/v2\/tags?post=2243"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}