Java's regex engine supports supplementary characters (code points above U+FFFF), so you can specify those high ranges as UTF-16 surrogate pairs, i.e. two chars per code point.
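As a quick sanity check of that claim, here is a minimal sketch. The class name `SupplementaryDemo` and its method are made up for illustration; only standard `String` methods are used.

```java
// Hypothetical demo class; the names are illustrative only.
class SupplementaryDemo {
    // True if the whole string is exactly one code point in U+10000..U+10FFFF.
    // The range endpoints are written as surrogate pairs, which the regex
    // engine reads as single code points.
    static boolean isSingleSupplementary(String s) {
        return s.matches("[\ud800\udc00-\udbff\udfff]");
    }

    public static void main(String[] args) {
        String smiley = "\uD83D\uDE00"; // U+1F600 encoded as a surrogate pair
        System.out.println(smiley.length());                            // 2 chars
        System.out.println(smiley.codePointCount(0, smiley.length()));  // 1 code point
        System.out.println(isSingleSupplementary(smiley));              // true
        System.out.println(isSingleSupplementary("A"));                 // false
    }
}
```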
Here is the pattern for removing characters that are illegal in XML 1.0:
// XML 1.0
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml10pattern = "[^"
                    + "\u0009\r\n"
                    + "\u0020-\uD7FF"
                    + "\uE000-\uFFFD"
                    + "\ud800\udc00-\udbff\udfff"
                    + "]";
Most people will want the XML 1.0 version.
Here is the pattern for removing characters that are illegal in XML 1.1:
// XML 1.1
// [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml11pattern = "[^"
                    + "\u0001-\uD7FF"
                    + "\uE000-\uFFFD"
                    + "\ud800\udc00-\udbff\udfff"
                    + "]+";
You will need to use String.replaceAll(...) and not String.replace(...), because replace treats its first argument as a literal string rather than as a regular expression.
String illegal = "Hello, World!\0"; // \0 (NUL) is not a legal XML character
String legal = illegal.replaceAll(xml10pattern, "");
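If you sanitize many strings, it may be worth compiling the patterns once instead of calling replaceAll each time. The following is only a sketch under some assumptions: the class `XmlSanitizer` and its method names are hypothetical, and the patterns use regex-level escapes (double backslashes, plus `\x{...}` for the supplementary range, which needs Java 7 or later) rather than the surrogate-pair form shown above. The character classes are meant to be equivalent.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
class XmlSanitizer {
    // XML 1.0: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern INVALID_XML10 = Pattern.compile(
            "[^\\u0009\\r\\n"
            + "\\u0020-\\uD7FF"
            + "\\uE000-\\uFFFD"
            + "\\x{10000}-\\x{10FFFF}"
            + "]+");

    // XML 1.1: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern INVALID_XML11 = Pattern.compile(
            "[^\\u0001-\\uD7FF"
            + "\\uE000-\\uFFFD"
            + "\\x{10000}-\\x{10FFFF}"
            + "]+");

    // Removes every character that XML 1.0 does not allow.
    static String stripInvalidXml10(String input) {
        return INVALID_XML10.matcher(input).replaceAll("");
    }

    // Removes every character that XML 1.1 does not allow.
    static String stripInvalidXml11(String input) {
        return INVALID_XML11.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Contains U+0001 (legal only in XML 1.1) and NUL (legal in neither).
        String illegal = "Hello,\u0001 World!\0";
        System.out.println(stripInvalidXml10(illegal)); // prints "Hello, World!"
        // The 1.1 version keeps U+0001 and drops only the NUL.
        System.out.println(stripInvalidXml11(illegal).length());
    }
}
```

The two versions differ only on control characters: U+0001 survives the XML 1.1 filter but is removed by the XML 1.0 filter, while NUL is removed by both.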