First, note that it is not possible to detect whether text belongs to a specific undesired encoding. You can only check whether a string is valid in a given encoding.
You can make use of the UTF-8 validity check that is available in preg_match
[PHP Manual] since PHP 4.3.5. It will return 0
(with no additional information) if an invalid string is given:
$isUTF8 = preg_match('//u', $string);
Another possibility is mb_check_encoding
[PHP Manual]:
$validUTF8 = mb_check_encoding($string, 'UTF-8');
Another function you can use is mb_detect_encoding
[PHP Manual]:
$validUTF8 = ! (false === mb_detect_encoding($string, 'UTF-8', true));
It's important to set the strict
parameter to true
.
Additionally, iconv
[PHP Manual] allows you to change/drop invalid sequences on the fly. (However, if iconv
encounters such a sequence, it generates a notification; this behavior cannot be changed.)
echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string), PHP_EOL;
echo 'IGNORE : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $string), PHP_EOL;
You can use @
and check the length of the return string:
strlen($string) === strlen(@iconv('UTF-8', 'UTF-8//IGNORE', $string));
Check the examples on the iconv
manual page as well.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…