Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

javascript - Remove all spaces between Chinese words with regex

I would like to remove all spaces among Chinese text only.

My text: "請 把 這 裡 的 10 多 個 字 合 併. Can you help me?"

Ideal output: "請把這裡的 10 多個字合併. Can you help me?"

var str = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
str = str.replace("/ /", "");

I have studied a similar question for Python but it seems not to work in my situation so I brought my question here for some help.

question from:https://stackoverflow.com/questions/54179179/remove-all-spaces-between-chinese-words-with-regex


1 Answer


Getting to the Chinese char matching pattern

Using the Unicode Tools, the \p{Han} Unicode property class, which matches any Chinese character, can be expanded into

[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9\U00020000-\U0002A6D6\U0002A700-\U0002B734\U0002B740-\U0002B81D\U0002B820-\U0002CEA1\U0002CEB0-\U0002EBE0\U0002F800-\U0002FA1D]

In ES6, to match a single Chinese char, it can be used as

/[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9\u{20000}-\u{2A6D6}\u{2A700}-\u{2B734}\u{2B740}-\u{2B81D}\u{2B820}-\u{2CEA1}\u{2CEB0}-\u{2EBE0}\u{2F800}-\u{2FA1D}]/u

Transpiling it to ES5 with the ES2015 Unicode regular expression transpiler gives

(?:[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D])

This pattern matches any Chinese character with a JS RegExp that does not use the u flag.
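
The transpiled pattern is longer because, without the u flag, a JS RegExp works on UTF-16 code units, so every code point above U+FFFF has to be written as a high/low surrogate pair. The following is a minimal sketch of that arithmetic, not the transpiler's actual code; the helper names toSurrogatePair and hex are hypothetical and used only for illustration.

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
// Hypothetical helper, for illustration only.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);   // top 10 bits of the offset
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits of the offset
  return [high, low];
}

// Format a number as uppercase hexadecimal.
function hex(n) {
  return n.toString(16).toUpperCase();
}

// Both ends of the first supplementary range, U+20000-U+2A6D6.
console.log(toSurrogatePair(0x20000).map(hex).join(' ')); // D840 DC00
console.log(toSurrogatePair(0x2A6D6).map(hex).join(' ')); // D869 DED6
```

The two results line up with the transpiled alternatives: the range starts at \uD840 followed by \uDC00, and its last high surrogate, \uD869, only allows low surrogates up to \uDED6 before the next range begins at \uDF00.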

So, you may use

s.replace(/([\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D])\s+(?=(?:[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D]))/g, '$1')

See the regex demo.

If your JS environment is ECMAScript 2018 compliant, you may use the shorter

s.replace(/(\p{Script=Hani})\s+(?=\p{Script=Hani})/gu, '$1')

Pattern details

  • (CHINESE_CHAR_PATTERN) - Capturing group 1 ($1 in the replacement pattern): any Chinese character
  • \s+ - one or more whitespace characters (any Unicode whitespace)
  • (?=CHINESE_CHAR_PATTERN) - a positive lookahead: there must be a Chinese character immediately to the right of the current location. It does not consume that character, so it can start the next match.
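
The ES2018 variant can be sketched as a small reusable function. This assumes an engine that supports Unicode property escapes (\p{...} with the u flag); the function name removeHanSpaces and the variable sample are hypothetical and not part of the original answer.

```javascript
// Remove whitespace that sits between two Han (Chinese) characters.
// Assumes ES2018 Unicode property escapes; hypothetical helper name.
function removeHanSpaces(text) {
  // Group 1 keeps the left Han character; the lookahead checks the right
  // one without consuming it, so runs like "A B C" collapse completely.
  return text.replace(/(\p{Script=Hani})\s+(?=\p{Script=Hani})/gu, '$1');
}

const sample = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
console.log(removeHanSpaces(sample));
// 請把這裡的 10 多個字合併. Can you help me?
```

The spaces around "10" and inside the English sentence survive because at least one neighbour of each of those spaces is not a Han character.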

JS demo:

var s = "請 把 這 裡 的 10 多 個 字 合 併. Can you help me?";
var HanChr = "[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D]"; 
// The backslash is doubled because '\s' inside a string literal is just 's'.
console.log(s.replace(new RegExp('(' + HanChr + ')\\s+(?=(?:' + HanChr + '))', 'g'), '$1'));
