regex python with unicode (japanese) character issue

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I want to remove part of the string below: the first 加護亜依, which precedes "Kago Ai". The string is stored in oldString.

[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY

I'm using the following regex in Python:

p=re.compile(ur"( [\W]+) (?=[A-Za-z ]+–)", re.UNICODE)
newString=p.sub("", oldString)

When I output newString, nothing has been removed.

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:30:49+0000

You can use the following snippet to solve the issue:

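#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
text = u'[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY'
# Japanese punctuation, hiragana, katakana, full-width forms and kanji,
# followed by a space, matched only when a Latin name and an en dash come next.
regex = u'[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]+ (?=[A-Za-z ]+–)'
p = re.compile(regex, re.U)
match = p.sub("", text)
# Encode explicitly so Python 2 prints UTF-8 bytes.
print match.encode("UTF-8")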

See IDEONE demo

Besides the # -*- coding: utf-8 -*- declaration, I have added @nhahtdh's character class to match Japanese characters.
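
To illustrate why an explicit character class helps, here is a minimal Python 3 sketch. It assumes the question's pattern was meant to use `\W`, and the variable names (`JAPANESE`, `name`, and so on) are made up for this example. Kanji count as word characters in Unicode mode, so `\W` cannot match them, while the explicit ranges can:

```python
import re

# Character class from the answer: CJK punctuation, hiragana, katakana,
# full-width/half-width forms, and CJK ideograph ranges.
JAPANESE = "[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]"

name = "加護亜依"

# Kanji are word characters, so \W (non-word) matches none of them.
non_word_hits = re.findall(r"\W", name)

# The explicit ranges match every character of the name.
class_hits = re.findall(JAPANESE, name)

print(non_word_hits)  # []
print(class_hits)     # ['加', '護', '亜', '依']
```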

Note that the match needs to be encoded as a UTF-8 string "manually", since Python 2 needs to be "reminded" that we are working with Unicode all the time.
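
For comparison, a rough Python 3 equivalent of the snippet is sketched below. It is not part of the original answer; it assumes Python 3, where `str` is already Unicode, so the `u` prefixes, the `re.U` flag and the manual `encode` call are no longer needed. The names `old_string`, `pattern` and `new_string` are chosen for illustration:

```python
import re

old_string = "[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY"

# Same character class and lookahead as the Python 2 version:
# a run of Japanese characters plus one space, only when followed
# by a Latin-letter name and an en dash.
pattern = re.compile(
    "[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]+ "
    "(?=[A-Za-z ]+–)"
)

new_string = pattern.sub("", old_string)
print(new_string)
```

Only the first 加護亜依 is removed here; the second one is followed by "vs." rather than a name and an en dash, so the lookahead fails and it is left in place.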
