I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but it seems like I'm being left with a lot of \xa0 Unicode characters representing spaces. Is there an efficient way to replace all of them with spaces in Python 2.7? I guess the more generalized question would be, is there a way to remove Unicode formatting?
I tried using: line = line.replace(u'\xa0',' '), as suggested by another thread, but that changed the \xa0's to u's, so now I have "u"s everywhere instead. ):
EDIT: The problem seems to be resolved by str.replace(u'\xa0', ' ').encode('utf-8'), but just doing .encode('utf-8') without replace() seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?
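For reference, here is a minimal sketch of what seems to be going on, assuming the text returned by get_text() is a unicode string. The sample string is hypothetical and stands in for the real parser output. \xa0 is the non-breaking space (U+00A0); in UTF-8 it is encoded as the two bytes 0xC2 0xA0, which would explain the \xc2 that appears when encoding without replacing first.

```python
import unicodedata

# Hypothetical sample standing in for the output of get_text().
text = u'Hello\xa0world'

# Replace the non-breaking space (U+00A0) while the value is still unicode.
cleaned = text.replace(u'\xa0', u' ')

# Encoding U+00A0 as UTF-8 produces two bytes, 0xC2 0xA0.
# This is where the "\xc2" comes from when replace() is skipped.
encoded_raw = text.encode('utf-8')

# After the replacement only ASCII remains, so the encoded bytes look normal.
encoded_clean = cleaned.encode('utf-8')

# A broader option: NFKC normalization maps U+00A0 to a plain space.
# Note that it also rewrites other compatibility characters (ligatures,
# full-width forms, and so on), which may or may not be wanted.
normalized = unicodedata.normalize('NFKC', text)
```

Doing the replacement on the unicode value before encoding is the key design choice here: once the string has been encoded to UTF-8 bytes, the non-breaking space is no longer a single character but the byte pair 0xC2 0xA0.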