Python and BeautifulSoup encoding issues

I am writing a crawler in Python using BeautifulSoup, and everything was going swimmingly until I ran into this site:

http://www.elnorte.ec/

I am fetching the contents with the requests library:

r = requests.get('http://www.elnorte.ec/')
content = r.content

If I print the content variable at this point, all the Spanish special characters seem to be fine. However, once I feed the content variable to BeautifulSoup, everything gets messed up:

soup = BeautifulSoup(content)
print(soup)
...
<a class="blogCalendarToday" href="/component/blog_calendar/?year=2011&amp;month=08&amp;day=27&amp;modid=203" title="1009 artÃculos en este dÃa">
...

It is apparently garbling all the Spanish special characters (accents and whatnot). I have tried content.decode('utf-8') and content.decode('latin-1'), and I have also tried the fromEncoding parameter of BeautifulSoup, setting it to fromEncoding='utf-8' and fromEncoding='latin-1', but still no dice.

Any pointers would be much appreciated.

Solutions

==============================
1. Could you try this?
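```
# Python 2 with BeautifulSoup 3
import urllib
import BeautifulSoup

r = urllib.urlopen('http://www.elnorte.ec/')
x = BeautifulSoup.BeautifulSoup(r.read())  # call read() so the page bytes are passed
r.close()

print x.prettify('latin-1')
```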
I get the correct output. Oh, in this special case you could also use x.__str__(encoding='latin1').

I guess this is because the content is in ISO-8859-1(5) while the meta http-equiv content-type incorrectly says "UTF-8".

Could you confirm?
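
A mismatch between the actual bytes and the encoding a decoder assumes is what produces output like the `artÃculos` shown in the question. The following is a minimal Python 3 sketch of that effect; the sample string is made up for illustration and is not taken from the real page. It assumes the bytes are UTF-8 and are wrongly read as Latin-1.

```python
# Python 3 sketch: UTF-8 bytes mis-read as Latin-1 (sample text is invented)
text = "1009 artículos en este día"
raw = text.encode("utf-8")        # the bytes a UTF-8 server would send

wrong = raw.decode("latin-1")     # decoder assumes the wrong encoding
right = raw.decode("utf-8")       # decoder assumes the right encoding

print(repr(wrong))                # each accented letter becomes two characters
print(right)

# The damage is reversible as long as no bytes were dropped along the way
repaired = wrong.encode("latin-1").decode("utf-8")
print(repaired == text)
```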
==============================
2. In your case the page contains invalid UTF-8 data, which confuses BeautifulSoup and makes it think the page uses windows-1252. You can do this trick:
```
soup = BeautifulSoup.BeautifulSoup(content.decode('utf-8','ignore'))
```
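By doing this you discard any invalid symbols from the page source, and BeautifulSoup then guesses the encoding correctly.

You can replace 'ignore' with 'replace' and check the text for '?' symbols to see what was discarded. (Python marks each undecodable byte with U+FFFD, the replacement character, which may be displayed as '?'.)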
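
To make the difference between the two error handlers concrete, here is a small Python 3 sketch. The byte string is invented for the example (a stray `\xff`, which is never valid in UTF-8, is appended to valid UTF-8 text); it is not the actual page content.

```python
# Python 3 sketch: 'ignore' vs 'replace' on invalid UTF-8 (bytes are invented)
raw = "día".encode("utf-8") + b"\xff" + b" ok"   # \xff is never valid in UTF-8

ignored = raw.decode("utf-8", "ignore")     # silently drops the bad byte
replaced = raw.decode("utf-8", "replace")   # marks it with U+FFFD instead

# Counting the markers shows how many undecodable bytes were encountered
dropped = replaced.count("\ufffd")

print(repr(ignored))
print(repr(replaced))
print(dropped)
```

Using 'replace' during debugging and 'ignore' in production is one possible split, since the markers show where data was lost.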

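In fact, writing a crawler that guesses the page encoding correctly 100% of the time is very hard (browsers are very good at this nowadays). You can use a module like 'chardet', but in your case, for example, it guesses the encoding as ISO-8859-2, which is not correct either.

If you really need to handle the encoding of any page a user might supply, you should either build a multi-level detection function (try utf-8, try latin1, and so on), as we did in our project, or use the detection code from Firefox or Chromium as a C module.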
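
The multi-level idea can be sketched as follows. This is only an illustration in Python 3, not the answerer's project code: the function name `decode_with_fallback` and the candidate order (utf-8, then cp1252, then latin-1) are assumptions made for the example.

```python
# Python 3 sketch of a multi-level decoder; name and candidate order are hypothetical
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Return (text, encoding_used) for the first encoding that decodes cleanly."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this candidate rejected the bytes, try the next one
    # latin-1 maps every byte to a code point, so this last step cannot fail
    return raw.decode("latin-1"), "latin-1"


print(decode_with_fallback("día".encode("utf-8")))
print(decode_with_fallback(b"d\xeda"))
print(decode_with_fallback(b"\x81"))
```

Strict encodings go first because a permissive one such as latin-1 accepts any byte sequence and would hide the fact that the guess was wrong.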

==============================

3. The first answer is right. These functions are sometimes useful:

    def __if_number_get_string(number):
        converted_str = number
        if isinstance(number, int) or \
            isinstance(number, float):
                converted_str = str(number)
        return converted_str


    def get_unicode(strOrUnicode, encoding='utf-8'):
        strOrUnicode = __if_number_get_string(strOrUnicode)
        if isinstance(strOrUnicode, unicode):
            return strOrUnicode
        return unicode(strOrUnicode, encoding, errors='ignore')

    def get_string(strOrUnicode, encoding='utf-8'):
        strOrUnicode = __if_number_get_string(strOrUnicode)
        if isinstance(strOrUnicode, unicode):
            return strOrUnicode.encode(encoding)
        return strOrUnicode

==============================

4. I'd suggest a more methodical, foolproof approach.

    # Python 2 with BeautifulSoup 3 and chardet
    import urllib
    import chardet
    from BeautifulSoup import BeautifulSoup

    # 1. get the raw data
    raw = urllib.urlopen('http://www.elnorte.ec/').read()

    # 2. detect the encoding and convert to unicode
    content = toUnicode(raw)    # toUnicode is shown below; define it before running these steps

    # 3. pass unicode to beautiful soup.
    soup = BeautifulSoup(content)


    def toUnicode(s):
        if type(s) is unicode:
            return s
        elif type(s) is str:
            d = chardet.detect(s)
            (cs, conf) = (d['encoding'], d['confidence'])
            # only trust the guess when chardet is reasonably confident
            if conf > 0.80:
                try:
                    return s.decode(cs, errors='replace')
                except Exception:
                    pass
        # force and return only ascii subset
        return unicode(''.join([i if ord(i) < 128 else ' ' for i in s]))

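You can reason that no matter what you throw at this, it will always send valid Unicode to BeautifulSoup.

As a result, your parsed tree will behave much better and will not fail in new and more interesting ways every time you get new data.

Trial and error does not work in code; there are just too many combinations :-)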
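
The snippet in this answer is Python 2. A rough Python 3 equivalent might look like the sketch below. The name `to_unicode` and the `min_confidence` parameter are choices made for this example, and chardet is treated as optional so the function still returns text when the package is not installed.

```python
# Python 3 sketch of the same idea; to_unicode and min_confidence are hypothetical names
try:
    import chardet  # third-party package; optional in this sketch
except ImportError:
    chardet = None


def to_unicode(data, min_confidence=0.80):
    """Always return str, whatever bytes or text comes in."""
    if isinstance(data, str):
        return data
    if chardet is not None:
        guess = chardet.detect(data)
        encoding, confidence = guess["encoding"], guess["confidence"]
        # only trust the guess when chardet is reasonably confident
        if encoding and confidence > min_confidence:
            try:
                return data.decode(encoding, errors="replace")
            except LookupError:
                pass  # chardet named a codec Python does not know
    # Last resort: keep the ASCII subset and blank out everything else
    return "".join(chr(b) if b < 128 else " " for b in data)


print(to_unicode("día"))
print(to_unicode(b"hello"))
```

The ASCII fallback loses information, so it is best treated as a way to keep the parser running rather than as a correct decoding.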

==============================

5. You can try this, which works for every encoding:

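    import requests
    from bs4 import BeautifulSoup
    from bs4.dammit import EncodingDetector

    # USERAGENT and url are not defined in this answer; supply your own values
    headers = {"User-Agent": USERAGENT}
    resp = requests.get(url, headers=headers)
    # trust the HTTP encoding only if the server actually sent a charset
    http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
    # encoding declared inside the document itself (for example in a meta tag)
    html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
    encoding = html_encoding or http_encoding
    soup = BeautifulSoup(resp.content, 'lxml', from_encoding=encoding)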
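
The precedence used above (the encoding declared in the document first, then the HTTP header) can be illustrated without any network access. The sketch below is a standard-library stand-in written for this note, not how bs4's EncodingDetector works internally: the regular expression is a deliberate simplification, and the function names are hypothetical.

```python
# Python 3 stdlib sketch of "declared encoding first, HTTP header second".
# The regex is a simplification and the function names are hypothetical.
import re
from email.message import Message

META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)


def http_charset(content_type):
    """Charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def declared_html_encoding(raw, search_bytes=2048):
    """Charset named in a <meta> tag near the top of the document, or None."""
    match = META_CHARSET.search(raw[:search_bytes])
    return match.group(1).decode("ascii").lower() if match else None


def pick_encoding(raw, content_type):
    """Prefer the in-document declaration, then fall back to the HTTP header."""
    return declared_html_encoding(raw) or http_charset(content_type)


html = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head></html>'
print(pick_encoding(html, "text/html; charset=ISO-8859-1"))
print(pick_encoding(b"<p>hi</p>", "text/html"))
```

When both sources are missing the result is None, which leaves the final guess to the parser, as in the answer's snippet.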

from https://stackoverflow.com/questions/7219361/python-and-beautifulsoup-encoding-issues by cc-by-sa and MIT license
