preg_match and UTF-8 in PHP

I'm trying to search a UTF8-encoded string using preg_match.

preg_match('/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
echo $a_matches[0][1];

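This should print 1, since "H" is at index 1 in the string "¡Hola!". But it prints 2. So it seems the subject is not being treated as a UTF8-encoded string, even though I'm passing the "u" modifier in the regular expression.

I have the following settings in my php.ini, and other UTF8 functions are working: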

mbstring.func_overload = 7
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.http_input = pass
mbstring.http_output = pass
mbstring.encoding_translation = Off

Any ideas?

Solutions

==============================
1. This looks like a "feature"; see http://bugs.php.net/bug.php?id=37391

The 'u' switch only has meaning for PCRE; PHP itself is unaware of it.

From PHP's point of view, strings are byte sequences, so returning a byte offset seems logical (I am not saying "correct").
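
To see where the 2 comes from, it can help to look at the raw bytes of the subject. The following is only a minimal sketch, assuming the mbstring extension is loaded and that mbstring.func_overload is not active (with overloading enabled, strlen and substr would no longer count bytes):

```php
<?php
// "¡" (U+00A1) is encoded in UTF-8 as the two bytes 0xC2 0xA1.
$str = "\xC2\xA1Hola!";

echo strlen($str), "\n";                // 7: length in bytes
echo mb_strlen($str, 'UTF-8'), "\n";    // 6: length in characters
echo bin2hex(substr($str, 0, 2)), "\n"; // c2a1: the two bytes before "H"

preg_match('/H/u', $str, $m, PREG_OFFSET_CAPTURE);
echo $m[0][1], "\n";                    // 2: byte offset of "H"
```

The offset reported by preg_match lines up with the byte count, not the character count.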
==============================
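2. The u modifier makes both the pattern and the subject be interpreted as UTF-8, but the captured offsets are still counted in bytes.

You can apply mb_strlen to the part of the subject that precedes the match to get the offset in UTF-8 characters rather than bytes: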
```
$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
echo mb_strlen(substr($str, 0, $a_matches[0][1]));
```
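
The same conversion can be wrapped in a small function. This is only a sketch: byteOffsetToCharOffset is a hypothetical helper name, not part of PHP, and it assumes the subject is valid UTF-8 and that mbstring.func_overload is not active.

```php
<?php
/**
 * Hypothetical helper (not part of PHP): converts a byte offset reported
 * by PREG_OFFSET_CAPTURE into a character offset.
 */
function byteOffsetToCharOffset(string $subject, int $byteOffset): int
{
    // substr() works on bytes, so it cuts exactly at the match start;
    // mb_strlen() then counts the UTF-8 characters in that prefix.
    return mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
}

$str = "\xC2\xA1Hola!";
if (preg_match('/H/u', $str, $m, PREG_OFFSET_CAPTURE) === 1) {
    echo byteOffsetToCharOffset($str, $m[0][1]), "\n"; // 1
}
```

Passing the encoding explicitly keeps the result independent of the mbstring.internal_encoding setting.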
==============================
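3. Try adding (*UTF8) at the start of the regex, inside the delimiters:
```
preg_match('/(*UTF8)H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
```
Credit goes to the comment at http://www.php.net/manual/es/function.preg-match.php#95828. Note that (*UTF8) puts PCRE into the same UTF-8 mode that the u modifier selects, so the offset captured here should still be the byte offset 2.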

==============================

4. Excuse the necroposting, but someone may find this useful. The code below can replace both preg_match and preg_match_all, and returns correct matches with correct offsets for UTF8-encoded strings.

    mb_internal_encoding('UTF-8');

    /**
     * Returns array of matches in same format as preg_match or preg_match_all
     * @param bool   $matchAll If true, execute preg_match_all, otherwise preg_match
     * @param string $pattern  The pattern to search for, as a string.
     * @param string $subject  The input string.
     * @param int    $offset   The place from which to start the search (in bytes).
     * @return array
     */
    function pregMatchCapture($matchAll, $pattern, $subject, $offset = 0)
    {
        $matchInfo = array();
        $method    = 'preg_match';
        $flag      = PREG_OFFSET_CAPTURE;
        if ($matchAll) {
            $method .= '_all';
        }
        $n = $method($pattern, $subject, $matchInfo, $flag, $offset);
        $result = array();
        if ($n !== 0 && !empty($matchInfo)) {
            if (!$matchAll) {
                $matchInfo = array($matchInfo);
            }
            foreach ($matchInfo as $matches) {
                $positions = array();
                foreach ($matches as $match) {
                    $matchedText = $match[0];
                    $byteOffset  = $match[1]; // offset in bytes, as reported by PCRE
                    $positions[] = array(
                        $matchedText,
                        // Count the characters that precede the match.
                        mb_strlen(mb_strcut($subject, 0, $byteOffset))
                    );
                }
                $result[] = $positions;
            }
            if (!$matchAll) {
                $result = $result[0];
            }
        }
        return $result;
    }

    $s1 = 'Попробуем русскую строку для теста';
    $s2 = 'Try english string for test';

    var_dump(pregMatchCapture(true, '/обу/', $s1));
    var_dump(pregMatchCapture(false, '/обу/', $s1));

    var_dump(pregMatchCapture(true, '/lish/', $s2));
    var_dump(pregMatchCapture(false, '/lish/', $s2));

Output of my example:

    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(6) "обу"
          [1]=>
          int(4)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(6) "обу"
        [1]=>
        int(4)
      }
    }
    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(4) "lish"
          [1]=>
          int(7)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(4) "lish"
        [1]=>
        int(7)
      }
    }

==============================

5. I wrote a small class that converts the offsets returned by preg_match to proper UTF offsets:

final class NonUtfToUtfOffset
{
    /** @var int[] */
    private $utfMap = [];

    public function __construct(string $content)
    {
        $contentLength = mb_strlen($content);

        for ($offset = 0; $offset < $contentLength; $offset ++) {
            $char = mb_substr($content, $offset, 1);
            $nonUtfLength = strlen($char);

            for ($charOffset = 0; $charOffset < $nonUtfLength; $charOffset ++) {
                $this->utfMap[] = $offset;
            }
        }
    }

    public function convertOffset(int $nonUtfOffset): int
    {
        return $this->utfMap[$nonUtfOffset];
    }
}

You can use it like this:

$content = 'aą bać d';
$offsetConverter = new NonUtfToUtfOffset($content);

preg_match_all('#(bać)#ui', $content, $m, PREG_OFFSET_CAPTURE);

foreach ($m[1] as [$word, $offset]) {
    echo "bad: " . mb_substr($content, $offset, mb_strlen($word))."\n";
    echo "good: " . mb_substr($content, $offsetConverter->convertOffset($offset), mb_strlen($word))."\n";
}

https://3v4l.org/8Y32J
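
If the offset is only needed to slice the subject, a conversion may not be necessary at all: byte offsets pair correctly with byte-oriented functions such as substr and strlen. A small sketch of that alternative, assuming the source file is saved as UTF-8 and mbstring.func_overload is not active:

```php
<?php
$content = 'aą bać d';

preg_match_all('#(bać)#ui', $content, $m, PREG_OFFSET_CAPTURE);

foreach ($m[1] as [$word, $offset]) {
    // Both $offset and strlen($word) are measured in bytes,
    // so substr() returns the matched word unchanged.
    $slice = substr($content, $offset, strlen($word));
    echo $slice, "\n"; // bać
}
```

The "bad" result only appears when a byte offset is handed to a character-oriented function such as mb_substr.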

==============================
6. If all you want is to find the multibyte-safe position of H, try mb_strpos():
```
mb_internal_encoding('UTF-8');
$str = "\xC2\xA1Hola!";
$pos = mb_strpos($str, 'H');
echo $str."\n";
echo $pos."\n";
echo mb_substr($str,$pos,1)."\n";
```
Output:
```
¡Hola!
1
H
```

from https://stackoverflow.com/questions/1725227/preg-match-and-utf-8-in-php by cc-by-sa and MIT license
