Let’s say you have an API feed or a form on a web page that passes in a block of text that could have HTML and UTF-8 or other specially encoded characters. Now let’s say we need to clean this feed upon arrival such that there are no HTML tags left in it, and all characters are in plain text ASCII.
In this article I will describe the steps necessary to do this with Python. This example will use Python 3.7.
A good way to get rid of the HTML is to use the BeautifulSoup module. To use it, we must add this import at the top of the script:
from bs4 import BeautifulSoup
This assumes you already have the module installed for your version of Python. If not, you can install it with pip (pip install beautifulsoup4).
Now let’s set up a test block of input text that includes HTML and special characters:
html_str = '''
<div>
<p>FARGO — Eastbound traffic on 32nd Avenue South will continue to be reduced to one lane between 27th Street South to 32nd Street South.</p>
<p>The closure is Phase 1 of a street resurfacing project on 32nd Avenue South. The road is expected to fully reopen at the end of the day Friday, June 14. The second phase of the project will impact the lanes east of 25th Street South and will occur about two to three weeks following the completion of Phase 1.</p>
<p>For a complete list of road closures, go to
<a href="http://www.fargostreets.com/" target="_blank">www.FargoStreets.com</a>
, or ♥O◘♦♥O◘♦
<a href="https://twitter.com/fargostreets" target="_blank">@FargoStreets</a>
on Facebook and Twitter.</p>
</div>
'''
Now let’s strip out the HTML so that only the text remains. This is done using BeautifulSoup’s get_text() method:
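# This strips out the HTML, leaving only text behind.
# Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning.
cleanedText = BeautifulSoup(html_str, 'html.parser').get_text()
print(cleanedText)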
This yields:
FARGO — Eastbound traffic on 32nd Avenue South will continue to be reduced to one lane between 27th Street South to 32nd Street South.
The closure is Phase 1 of a street resurfacing project on 32nd Avenue South. The road is expected to fully reopen at the end of the day Friday, June 14. The second phase of the project will impact the lanes east of 25th Street South and will occur about two to three weeks following the completion of Phase 1.
For a complete list of road closures, go to
www.FargoStreets.com
, or ♥O◘♦♥O◘♦
@FargoStreets
on Facebook and Twitter.
Notice that all of the HTML tags have been stripped out already. Next we need to convert this text to ASCII. Python 3.7 added the str.isascii() method, which we can use to check whether the current text is already ASCII. For example:
# Is this pure ASCII?
print(cleanedText.isascii())
This yields False, because the text still contains characters outside the ASCII range.
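When the check fails, it can help to see exactly which characters are responsible. The following is a minimal sketch of one way to list them; the helper name non_ascii_chars and the short sample string are made up for illustration and are not part of the original script.

```python
def non_ascii_chars(text):
    """Return the distinct non-ASCII characters in text, in order of first appearance."""
    seen = []
    for ch in text:
        # ASCII covers code points 0 through 127
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    return seen

# Hypothetical sample: \u2014 is the dash after FARGO, the rest are the symbols from the test text
sample = "FARGO \u2014 or \u2665O\u25d8\u2666\u2665O\u25d8\u2666"

print(sample.isascii())
print(non_ascii_chars(sample))
```

Each character in the returned list is one that encode() will have to replace or drop in the next step.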
Since we know that the text is not currently ASCII, we can convert it to ASCII using Python’s built-in encode() method. The 'replace' argument replaces each non-ASCII character with a question mark (?).
Note: Another option that can be passed to encode() instead of 'replace' is 'ignore'. This simply drops each special character rather than replacing it with a question mark.
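To make the difference between the two options concrete, here is a small sketch run on a shortened, made-up version of the test text. It also shows what happens when no error handler is given at all, in which case encode() defaults to strict mode and raises an exception.

```python
# Hypothetical short sample; \u2014 is the dash that follows FARGO in the test text
sample = "FARGO \u2014 Eastbound"

# 'replace' substitutes a question mark for each non-ASCII character
print(sample.encode('ascii', 'replace'))  # b'FARGO ? Eastbound'

# 'ignore' drops the character, leaving the two surrounding spaces side by side
print(sample.encode('ascii', 'ignore'))   # b'FARGO  Eastbound'

# With no error handler, encode() is strict and raises UnicodeEncodeError
try:
    sample.encode('ascii')
except UnicodeEncodeError as exc:
    print('strict mode failed:', exc.reason)
```

Note that 'ignore' can leave doubled spaces behind, which may or may not matter depending on where the cleaned text ends up.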
# Convert cleanedText to an ASCII bytes object, replacing non-ASCII characters with '?'
cleanedText = cleanedText.encode('ascii', 'replace')
The result of this step is a bytes object, which is why the value is printed with a leading b and why newline characters appear as \n. Also, every special character has been replaced with a question mark. For example:
b'\nFARGO ? Eastbound traffic on 32nd Avenue South will continue to be reduced to one lane between 27th Street South to 32nd Street South.\nThe closure is Phase 1 of a street resurfacing project on 32nd Avenue South. The road is expected to fully reopen at the end of the day Friday, June 14. The second phase of the project will impact the lanes east of 25th Street South and will occur about two to three weeks following the completion of Phase 1.\nFor a complete list of road closures, go to\nwww.FargoStreets.com\n, or ?O???O??\n@FargoStreets\non Facebook and Twitter.\n\n'
Since I want to keep the line breaks and don’t want to see \n in the text, I need to convert the bytes object back into a string:
# Convert the bytes object back to a string to maintain line breaks
cleanedText = cleanedText.decode('ascii')
print(cleanedText)
Which yields:
FARGO ? Eastbound traffic on 32nd Avenue South will continue to be reduced to one lane between 27th Street South to 32nd Street South.
The closure is Phase 1 of a street resurfacing project on 32nd Avenue South. The road is expected to fully reopen at the end of the day Friday, June 14. The second phase of the project will impact the lanes east of 25th Street South and will occur about two to three weeks following the completion of Phase 1.
For a complete list of road closures, go to
www.FargoStreets.com
, or ?O???O??
@FargoStreets
on Facebook and Twitter.
This is the desired output. The original text has been stripped of all HTML tags and special characters, while maintaining the line breaks.
Let’s run the ASCII check again:
print(cleanedText.isascii())
This time it yields True.
Finally, I will place these steps into a re-usable function called “cleanText”:
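def cleanText(rawInput):
    # Strip the HTML tags, leaving only the text
    cleanedText = BeautifulSoup(rawInput, 'html.parser').get_text()
    # Replace non-ASCII characters with '?', then convert the bytes back to a string
    cleanedText = cleanedText.encode('ascii', 'replace')
    cleanedText = cleanedText.decode('ascii')
    return cleanedText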
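If you want to choose between 'replace' and 'ignore' at call time, one possible variation is sketched below. The name clean_text and its errors parameter are my own assumptions for illustration, not part of the function above, and the sample input is made up. It assumes the beautifulsoup4 package is installed.

```python
from bs4 import BeautifulSoup

def clean_text(raw_input, errors='replace'):
    """Strip HTML tags, then force the result to ASCII.

    errors is passed straight to str.encode(): 'replace' substitutes
    a question mark, 'ignore' drops the character.
    """
    # Naming the parser explicitly avoids bs4's "no parser specified" warning
    text = BeautifulSoup(raw_input, 'html.parser').get_text()
    # Encode to ASCII bytes, then decode back to an ordinary string
    return text.encode('ascii', errors).decode('ascii')

# Hypothetical input; \u2014 is the dash that follows FARGO in the test text
sample_html = '<p>FARGO \u2014 <b>Eastbound</b> traffic</p>'

print(clean_text(sample_html))                   # FARGO ? Eastbound traffic
print(clean_text(sample_html, errors='ignore'))  # FARGO  Eastbound traffic
```

Keeping 'replace' as the default means the variation behaves like the original function unless the caller asks otherwise.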
Now this can be used in other contexts where we need some input data cleaned. For example, in a data object:
"description": sys.argv[5],
can be changed to:
"description": cleanText(sys.argv[5]),
For more information about the BeautifulSoup module, see:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
For more information about the .isascii() method (which is new in Python 3.7), see:
https://docs.python.org/3/library/stdtypes.html?highlight=isascii#str.isascii
For more information about the .encode() and .decode() methods, see:
https://docs.python.org/3/library/stdtypes.html?highlight=str%20encode#str.encode
https://www.tutorialspoint.com/python/string_decode.htm