Python: How to open and read a .docx document

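To open and read a Word (.docx) file in Python, we need to import the docx module. If you don’t have the docx module, it can be installed with pip. Specifically, we will use the opendocx and getdocumenttext functions of the docx module. These functions come from the older docx package; newer python-docx releases use a different API. The examples in this post use Python 2 syntax.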

from docx import opendocx, getdocumenttext

The opendocx() function opens the Word document, and the getdocumenttext() function extracts the plain text from it.

Note: This will only extract plain text. It will not extract images or any other Word features or styles.
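
To make the note above concrete: a .docx file is a ZIP archive whose main text lives in word/document.xml, with paragraphs stored as w:p elements and text runs as w:t elements. Images and style definitions sit in other parts of the archive, which is why a plain-text extraction never sees them. The sketch below uses only the standard library to pull out paragraph text. It illustrates the idea and is not the docx module's actual implementation; extract_paragraphs and build_sample_docx are hypothetical helpers introduced here, and the sample archive is just enough for the parser, not a file Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as w:p and w:t
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_paragraphs(docx_file):
    """Return the text of each non-empty paragraph in a .docx file.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # A paragraph's text is the concatenation of its w:t runs
        text = "".join(node.text or "" for node in para.iter("{%s}t" % W_NS))
        if text:
            paragraphs.append(text)
    return paragraphs


def build_sample_docx(paragraph_texts):
    """Hypothetical demo helper: build a minimal in-memory archive
    containing only word/document.xml (not a complete Word file)."""
    body = "".join(
        "<w:p><w:r><w:t>%s</w:t></w:r></w:p>" % t for t in paragraph_texts
    )
    xml = '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W_NS, body)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buf.seek(0)
    return buf


sample = build_sample_docx(["Hello world.", "", "Second block"])
blocks = extract_paragraphs(sample)  # the empty paragraph is skipped
```

Like getdocumenttext(), this returns one string per paragraph rather than one per sentence.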

Example of use:

wordDoc = r"C:\Users\Chris\Desktop\python-blog\quantum-tunneling.docx"

document = opendocx(wordDoc)
extractedText = getdocumenttext(document)

Now that we have extracted the text from the document, we can do whatever we want with it. The extracted text is returned as a list of strings, so we can access a block of text by index or iterate over all of the blocks.

Note: The extracted text will not necessarily be divided into individual sentences. It can be blocks of sentences, depending on how the paragraphs are set up in the Word document.
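
If individual sentences are needed, each block can be split further. The following is a rough sketch under a simple assumption: a sentence ends with '.', '!' or '?' followed by whitespace. The function name split_into_sentences and the sample text are made up for illustration, and the list extracted_text stands in for the result of getdocumenttext().

```python
import re


def split_into_sentences(block):
    """Naively split a block of text into sentences.

    Splits after '.', '!' or '?' when followed by whitespace. Abbreviations
    such as "e.g. this" will be split incorrectly; this is only a sketch.
    """
    parts = re.split(r"(?<=[.!?])\s+", block.strip())
    return [p for p in parts if p]


# Stand-in for the list returned by getdocumenttext(); the text is made up.
extracted_text = [
    "Tunnelling is a quantum effect. It has no classical analogue.",
    "A single-sentence block.",
]

# Flatten every block into one list of sentences
sentences = [s for block in extracted_text for s in split_into_sentences(block)]
```

For anything beyond a quick look at the text, a proper sentence tokenizer would be more reliable than this regular expression.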

print "Data type:", type(extractedText)
print "Length of data:", len(extractedText)

# Print the first block of text in the Word document
print "First line:\n", extractedText[0]

which yields:

Data type: <type 'list'>
Length of data: 4
First line:
The quantum tunnelling effect is, as the name suggests, a quantum phenomenon which occurs when particles move through a barrier that should be impossible to move through according to classical physics. The barrier can be a physically impassable medium, like an insulator or a vacuum, or it can be a region of high potential energy.

This shows that there are 4 blocks of text extracted from the document. Here is what the Word document looks like:

Now that we have extracted all of the text from the Word document, we can do some analysis. For example, let’s say we want to get a count of unique words and capture all of the unique words in the document. Here are a couple of functions I wrote that can facilitate this:

from docx import opendocx, getdocumenttext

wordDoc = r"C:\Users\Chris\Desktop\python-blog\quantum-tunneling.docx"

document = opendocx(wordDoc)
extractedText = getdocumenttext(document)
dictionary = []  # every unique word found in the document

def createWordList():
    """Count the number of unique words and list every unique word in the document."""
    count = 0
    for string in extractedText:
        # Split each block of text into whitespace-separated words
        s = string.split()
        for p in s:
            p = stripPunctuation(p)
            if p not in dictionary:
                # Record the word so it is only counted once
                dictionary.append(p)
                count += 1
    dictionary.sort()
    print "The unique word count is", count
    print dictionary

def stripPunctuation(s):
    """Strip one trailing period, then one trailing comma, from a word."""
    if s.endswith('.'):
        s = s[:-1]
    if s.endswith(','):
        s = s[:-1]
    return s

createWordList()


This produces:

The unique word count is 117
[u'-', u'Credit:', u'If', u'Image', u'In', u'On', u'The', u'This', u'Wikipedia', u'a', u'abruptly', u'according', u'across', u'amplitude', u'an', u'as', u'barrier', u'be', u'behave', u'by', u'can', u'classical', u'coefficient', u'corresponds', u'current', u'decays', u'decrease', u'defined', u'density', u'divided', u'drop', u'effect', u'emerging', u'encountering', u'end', u'energy', u'enough', u'exponentially', u'finding', u'finite', u'from', u'further', u'has', u'have', u'high', u'higher', u'however', u'if', u'impassable', u'impossible', u'in', u'incident', u'insufficient', u'insulator', u'into', u'is', u'it', u'its', u'like', u'likelihood', u'look', u'may', u'mechanics', u'medium', u'move', u'name', u'narrow', u'non-zero', u'not', u'occurs', u'of', u'often', u'on', u'or', u'other', u'overcome', u'particle', u'particles', u'phenomenon', u'physically', u'physics', u'potential', u'probability', u'quantum', u'ratio', u'region', u'regions', u'should', u'side', u'simply', u'so', u'some', u'suggests', u'than', u'that', u'the', u'there', u'thin', u'this', u'through', u'to', u'transmission', u'tunnel', u'tunneling', u'tunnelling', u'vacuum', u'value', u'wave', u"wave's", u'waves', u'when', u'where', u'which', u'will', u"won't", u'world', u'you']
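
As a possible variation on the functions above, the standard library's collections.Counter can track how often each word occurs as well as which words are unique. This is a sketch, not a drop-in replacement: count_words is a hypothetical name, the sample blocks are made up, and it strips all leading and trailing punctuation (so a token like 'Credit:' becomes 'Credit' and a lone '-' is dropped), which differs from stripPunctuation(). Like the original, it is case-sensitive, so 'The' and 'the' count as different words.

```python
import string
from collections import Counter


def count_words(blocks):
    """Return a Counter mapping each word to the number of times it occurs."""
    counts = Counter()
    for block in blocks:
        for word in block.split():
            # Remove punctuation from both ends; inner characters
            # (as in "wave's" or "non-zero") are left untouched
            word = word.strip(string.punctuation)
            if word:
                counts[word] += 1
    return counts


# Stand-in for the list returned by getdocumenttext(); the text is made up.
sample_blocks = ["The wave decays, and the wave is thin.", "The end."]

word_counts = count_words(sample_blocks)
unique_words = sorted(word_counts)  # sorted list of the unique words
unique_count = len(word_counts)     # number of unique words
```

Lowercasing each word before counting would merge entries such as 'The' and 'the' if case should not matter.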


For more information about the docx module, see: