How do I perform HTML decoding/encoding using Python/Django?

I have an HTML-encoded string:

&lt;img class=&quot;size-medium wp-image-113&quot; 
  style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; 
  src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; 
  alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;

I want to change it to:

<img class="size-medium wp-image-113" style="margin-left: 15px;" 
  title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" 
  alt="" width="300" height="194" />

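I want this to register as HTML so that the browser renders it as an image instead of displaying it as text.

I found how to do this in C#, but not in Python. Can someone help me out?

Thanks.

Edit: Someone asked why my strings are stored like that. It is because I am using a web-scraping tool that "scans" a web page and pulls certain content from it. The tool (BeautifulSoup) returns the string in that format.

Solutions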

==============================
1. Given the Django use case, there are two answers to this. Here is its django.utils.html.escape function, for reference:
```
def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;')
                     .replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))
```
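To reverse this, the Cheetah function described in Jake's answer should work, but it is missing the single quote. This version includes an updated tuple, with the order of replacement reversed to avoid symmetry problems: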
```
def html_decode(s):
    """
    Returns the ASCII decoded version of the given HTML string. This does
    NOT remove normal HTML tags like <p>.
    """
    htmlCodes = (
            ("'", '&#39;'),
            ('"', '&quot;'),
            ('>', '&gt;'),
            ('<', '&lt;'),
            ('&', '&amp;')
        )
    for code in htmlCodes:
        s = s.replace(code[1], code[0])
    return s

unescaped = html_decode(my_string)
```
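The reason the `&amp;` pair goes last can be seen with input that contains an escaped ampersand. The following is only a minimal sketch: `decode_with` and the two orderings are illustrative names, not part of Django or Cheetah.

```python
# Illustrative only: shows why '&amp;' must be replaced last when decoding.
AMP_LAST = (('<', '&lt;'), ('>', '&gt;'), ('&', '&amp;'))
AMP_FIRST = (('&', '&amp;'), ('<', '&lt;'), ('>', '&gt;'))

def decode_with(pairs, s):
    """Apply each (char, entity) replacement in the given order."""
    for char, entity in pairs:
        s = s.replace(entity, char)
    return s

# '&amp;lt;p&amp;gt;' is the escaped form of the literal text '&lt;p&gt;'.
escaped = '&amp;lt;p&amp;gt;'
print(decode_with(AMP_LAST, escaped))   # &lt;p&gt;  (one level removed)
print(decode_with(AMP_FIRST, escaped))  # <p>        (decoded twice)
```

With `&amp;` handled first, the freshly produced `&lt;` is decoded a second time, so text that was meant to stay literal turns into markup.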
This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape. More generally, it is a good idea to stick with the standard library:
```
# Python 2.x:
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(my_string)

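# Python 3.0-3.8 (HTMLParser.unescape was deprecated in 3.4 and removed in 3.9):
import html.parser
html_parser = html.parser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.4+:
import html
unescaped = html.unescape(my_string)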
```
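To illustrate why the hand-written table is not general: on Python 3.4+, the standard library's `html.unescape` also resolves named and numeric references that a five-entry table does not know. A small sketch, in which `html_decode_limited` is a stand-in for the `html_decode` function above and the sample string is made up:

```python
import html

def html_decode_limited(s):
    """Stand-in for the five-entity decoder shown earlier."""
    for char, entity in (("'", '&#39;'), ('"', '&quot;'), ('>', '&gt;'),
                         ('<', '&lt;'), ('&', '&amp;')):
        s = s.replace(entity, char)
    return s

sample = 'Caf&eacute; &#233; &#xe9; &lt;b&gt;'
print(html_decode_limited(sample))  # Caf&eacute; &#233; &#xe9; <b>
print(html.unescape(sample))        # Café é é <b>
```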
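As a suggestion: it may make more sense to store the HTML unescaped in your database. It would be worth looking into getting unescaped results back from BeautifulSoup if possible, and avoiding this process altogether.

With Django, escaping only occurs during template rendering, so to prevent escaping you just tell the templating engine not to escape your string. To do that, use one of these options in your template: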
```
{{ context_var|safe }}
{% autoescape off %}
    {{ context_var }}
{% endautoescape %}
```
==============================
2. With the standard library:

==============================

3. For HTML encoding, there is cgi.escape from the standard library:

```
>>> help(cgi.escape)
cgi.escape = escape(s, quote=None)
    Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
    is also translated.
```
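
Note that `cgi.escape` was deprecated in Python 3.2 and removed in Python 3.8. On Python 3, `html.escape` is the closest standard-library equivalent. A brief sketch, with a made-up sample string:

```python
import html

text = 'a < b & "c"'

# quote=False leaves quotation marks alone, like cgi.escape's default.
print(html.escape(text, quote=False))  # a &lt; b &amp; "c"

# The default (quote=True) also escapes double and single quotes.
print(html.escape(text))               # a &lt; b &amp; &quot;c&quot;
```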

For HTML decoding, I use the following:

```
import re
from htmlentitydefs import name2codepoint
# for some reason, python 2.5.2 doesn't have this one (apostrophe)
name2codepoint['#39'] = 39

def unescape(s):
    "unescape HTML code refs; c.f. http://wiki.python.org/moin/EscapingHtml"
    return re.sub('&(%s);' % '|'.join(name2codepoint),
              lambda m: unichr(name2codepoint[m.group(1)]), s)
```

For anything more complicated, I use BeautifulSoup.
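
The snippet above targets Python 2. A rough Python 3 port of the same regex approach might look like the sketch below, assuming `html.entities.name2codepoint` as the replacement for `htmlentitydefs` and `chr` for `unichr`. The name `unescape_named` is made up here. Numeric references such as `&#39;` are deliberately left untouched by this version.

```python
import re
from html.entities import name2codepoint

# Build one alternation of all known entity names, e.g. 'quot|amp|lt|...'.
_ENTITY_RE = re.compile('&(%s);' % '|'.join(map(re.escape, name2codepoint)))

def unescape_named(s):
    """Replace named references such as '&lt;' with their characters."""
    return _ENTITY_RE.sub(lambda m: chr(name2codepoint[m.group(1)]), s)

print(unescape_named('&lt;b&gt; caf&eacute; &#39;'))  # <b> café &#39;
```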

==============================
4. If the set of encoded characters is relatively restricted, use Daniel's solution. Otherwise, use one of the numerous HTML-parsing libraries.

I like BeautifulSoup because it can handle malformed XML/HTML:

http://www.crummy.com/software/BeautifulSoup/

For your question, there is an example in their documentation:
```
from BeautifulSoup import BeautifulStoneSoup
BeautifulStoneSoup("Sacr&eacute; bl&#101;u!", 
                   convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
# u'Sacr\xe9 bleu!'
```
==============================
5. At the bottom of this page on the Python wiki there are at least two options for "unescaping" HTML.
==============================
6. In Python 3.4+:
```
import html

html.unescape(your_string)
```
==============================
7. Daniel's comment, as an answer:

"Escaping only occurs in Django during template rendering. Therefore, there is no need for an unescape; you just tell the templating engine not to escape, with either {{ context_var|safe }} or {% autoescape off %}{{ context_var }}{% endautoescape %}."

==============================

8. I found a fine function at http://snippets.dzone.com/posts/show/4569:

```
def decodeHtmlentities(string):
    import re
    entity_re = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")

    def substitute_entity(match):
        from htmlentitydefs import name2codepoint as n2cp
        ent = match.group(2)
        if match.group(1) == "#":
            return unichr(int(ent))
        else:
            cp = n2cp.get(ent)

            if cp:
                return unichr(cp)
            else:
                return match.group()

    return entity_re.subn(substitute_entity, string)[0]
```

==============================
9. If anyone is looking for a simple way to do this via the Django templates, you can always use a filter like this:
```
<html>
{{ node.description|safe }}
</html>
```
I had some data coming from a vendor, and everything I posted had the HTML tags written out on the rendered page, as if you were looking at the source. The code above helped me greatly. I hope it helps others.

Cheers!!

==============================

10. Even though this is a really old question, this may work.

Django 1.5.5

```
In [1]: from django.utils.text import unescape_entities
In [2]: unescape_entities('&lt;img class=&quot;size-medium wp-image-113&quot; style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;')
Out[2]: u'<img class="size-medium wp-image-113" style="margin-left: 15px;" title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" alt="" width="300" height="194" />'
```

==============================

11. I found this in the Cheetah source code:

```
htmlCodes = [
    ['&', '&amp;'],
    ['<', '&lt;'],
    ['>', '&gt;'],
    ['"', '&quot;'],
]
htmlCodesReversed = htmlCodes[:]
htmlCodesReversed.reverse()
def htmlDecode(s, codes=htmlCodesReversed):
    """ Returns the ASCII decoded version of the given HTML string. This does
        NOT remove normal HTML tags like <p>. It is the inverse of htmlEncode()."""
    for code in codes:
        s = s.replace(code[1], code[0])
    return s
```

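I am not sure why they reverse the list. I think it has to do with the way they encode, so in your case it may not need to be reversed. Also, if I were you, I would change htmlCodes to be a list of tuples rather than a list of lists... this is going in my library, though :)

I noticed your title asked for encode too, so here is Cheetah's encode function: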

```
def htmlEncode(s, codes=htmlCodes):
    """ Returns the HTML encoded version of the given string. This is useful to
        display a plain ASCII text string on a web page."""
    for code in codes:
        s = s.replace(code[0], code[1])
    return s
```
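
On the encoding side the order matters as well, which is probably why the list starts with `&`. The sketch below assumes a simplified `encode` helper that mirrors `htmlEncode`; the names `AMP_FIRST` and `AMP_LAST` are made up for illustration.

```python
def encode(s, pairs):
    """Apply each (char, entity) replacement in the given order."""
    for char, entity in pairs:
        s = s.replace(char, entity)
    return s

# Same pairs as the Cheetah list, once with '&' first and once with '&' last.
AMP_FIRST = [('&', '&amp;'), ('<', '&lt;'), ('>', '&gt;'), ('"', '&quot;')]
AMP_LAST = AMP_FIRST[1:] + AMP_FIRST[:1]

print(encode('<b>', AMP_FIRST))  # &lt;b&gt;
print(encode('<b>', AMP_LAST))   # &amp;lt;b&amp;gt;
```

Escaping `&` last re-escapes the ampersands that the earlier replacements just produced. Decoding is the mirror image, so it handles `&amp;` last.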

==============================
12. You can also use django.utils.html.escape:
```
from django.utils.html import escape

something_nice = escape(request.POST['something_naughty'])
```

==============================

13. Below is a Python function that uses the htmlentitydefs module. It is not perfect. The version of htmlentitydefs that I have is incomplete, and the function assumes that every entity decodes to a single code point, which is wrong for entities like &NotEqualTilde;:

http://www.w3.org/TR/html5/named-character-references.html

With those caveats, here is the code.

```
import re
import htmlentitydefs  # Python 2; this module is html.entities in Python 3

def decodeHtmlText(html):
    """
    Given a string of HTML that would parse to a single text node,
    return the text value of that node.
    """
    # Fast path for common case.
    if html.find("&") < 0: return html
    return re.sub(
        '&(?:#(?:x([0-9A-Fa-f]+)|([0-9]+))|([a-zA-Z0-9]+));',
        _decode_html_entity,
        html)

def _decode_html_entity(match):
    """
    Regex replacer that expects hex digits in group 1, or
    decimal digits in group 2, or a named entity in group 3.
    """
    hex_digits = match.group(1)  # '&#x10;' -> unichr(0x10)
    if hex_digits: return unichr(int(hex_digits, 16))
    decimal_digits = match.group(2)  # '&#10;' -> unichr(10)
    if decimal_digits: return unichr(int(decimal_digits, 10))
    name = match.group(3)  # name is 'lt' when '&lt;' was matched.
    if name:
        decoding = (htmlentitydefs.name2codepoint.get(name)
            # Treat &GT; like &gt;.
            # This is wrong for &Gt; and &Lt; which HTML5 adopted from MathML.
            # If htmlentitydefs included mappings for those entities,
            # then this code will magically work.
            or htmlentitydefs.name2codepoint.get(name.lower()))
        if decoding is not None: return unichr(decoding)
    return match.group(0)  # Treat "&noSuchEntity;" as "&noSuchEntity;"
```
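
Both caveats can be checked against the Python 3 standard library, whose `html.entities.html5` table covers the HTML5 named references. This is only a quick sketch for comparison, not a replacement for the function above:

```python
import html
from html.entities import html5

# One named reference can expand to more than one code point.
decoded = html.unescape('&NotEqualTilde;')
print(len(decoded))                      # 2
print([hex(ord(ch)) for ch in decoded])  # ['0x2242', '0x338']

# Case matters: '&Gt;' and '&gt;' are different HTML5 references.
print(html5['gt;'] == html5['Gt;'])      # False
```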

==============================
14. This is the easiest solution to this problem:
```
{% autoescape off %}
   {{ body }}
{% endautoescape %}
```
From this page.
==============================
15. While searching for the simplest solution to this question in Django and Python, I found that you can use their built-in functions to escape and unescape HTML code.

I saved your HTML code in scraped_html and clean_html:
```
scraped_html = (
    '&lt;img class=&quot;size-medium wp-image-113&quot; '
    'style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; '
    'src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; '
    'alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;'
)
clean_html = (
    '<img class="size-medium wp-image-113" style="margin-left: 15px;" '
    'title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" '
    'alt="" width="300" height="194" />'
)
```
You need Django >= 1.0.

To unescape the scraped HTML code, you can use django.utils.text.unescape_entities:
```
>>> from django.utils.text import unescape_entities
>>> clean_html == unescape_entities(scraped_html)
True
```
To escape the clean HTML code, you can use django.utils.html.escape:
```
>>> from django.utils.html import escape
>>> scraped_html == escape(clean_html)
True
```
You need Python >= 3.4.

To unescape the scraped HTML code, you can use html.unescape:
```
>>> from html import unescape
>>> clean_html == unescape(scraped_html)
True
```
To escape the clean HTML code, you can use html.escape:
```
>>> from html import escape
>>> scraped_html == escape(clean_html)
True
```
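One detail worth checking when mixing these functions: Python's `html.escape` writes the apostrophe as `&#x27;`, whereas the `escape` listing quoted in the first answer writes `&#39;`. `html.unescape` accepts both spellings, so decoding still works, but a literal string comparison against output from the other escaper may fail. A short sketch with a made-up sample:

```python
import html

print(html.escape("it's"))         # it&#x27;s
print(html.unescape('it&#39;s'))   # it's
print(html.unescape('it&#x27;s'))  # it's
```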

from https://stackoverflow.com/questions/275174/how-do-i-perform-html-decoding-encoding-using-python-django by cc-by-sa and MIT license
