How do I perform HTML decoding/encoding using Python/Django?

Question:

I have an HTML-encoded string:

'''&lt;img class="size-medium wp-image-113"\n style="margin-left: 15px;" title="su1"\n src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg"\n alt="" width="300" height="194" /&gt;'''

I want to change it to:

<img class="size-medium wp-image-113" style="margin-left: 15px;" 
  title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" 
  alt="" width="300" height="194" />

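I want it to register as HTML so that the browser renders it as an image instead of displaying it as text.

The string is stored like that because I am using a web-scraping tool called BeautifulSoup. It "scans" a web page, extracts certain content from it, and returns the string in that format.

I've found how to do this in C# but not in Python. Can someone help me out?

Related

Convert XML/HTML Entities into Unicode String in Python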

Solution 1:

Using the standard library:

HTML escape

try:
    from html import escape  # python 3.x
except ImportError:
    from cgi import escape  # python 2.x

print(escape("<"))

HTML unescape

try:
    from html import unescape  # python 3.4+
except ImportError:
    try:
        from html.parser import HTMLParser  # python 3.x (<3.4)
    except ImportError:
        from HTMLParser import HTMLParser  # python 2.x
    unescape = HTMLParser().unescape

print(unescape("&gt;"))

Solution 2:

Given the Django use case, there are two answers to this. Here is its django.utils.html.escape function, for reference:

def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))

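To reverse this, the Cheetah function described in Jake's answer should work, but it is missing the single quote. This version includes an updated tuple, with the order of replacement reversed so that "&amp;" is decoded last and no entity is decoded twice: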

def html_decode(s):
    """
    Returns the ASCII decoded version of the given HTML string. This does
    NOT remove normal HTML tags like <p>.
    """
    htmlCodes = (
            ("'", '&#39;'),
            ('"', '&quot;'),
            ('>', '&gt;'),
            ('<', '&lt;'),
            ('&', '&amp;')
        )
    for code in htmlCodes:
        s = s.replace(code[1], code[0])
    return s

unescaped = html_decode(my_string)
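The ordering point can be checked with a small sketch. It assumes Python 3 and uses a hypothetical helper, decode_with, that applies the same replacement pairs in whatever order the caller passes them; the sample input is made up for illustration.

```python
# Replacement pairs from the answer above: (character, entity).
PAIRS = (
    ("'", "&#39;"),
    ('"', "&quot;"),
    (">", "&gt;"),
    ("<", "&lt;"),
    ("&", "&amp;"),
)


def decode_with(pairs, s):
    """Hypothetical helper: apply each (char, entity) replacement in order."""
    for char, entity in pairs:
        s = s.replace(entity, char)
    return s


# The literal text "&lt;" (as typed in an HTML tutorial) is escaped to this:
encoded = "&amp;lt;"

# Ampersand last: "&amp;" is decoded only after the "&lt;" pass has run,
# so the original literal text survives.
ampersand_last = decode_with(PAIRS, encoded)

# Ampersand first: "&amp;lt;" becomes "&lt;", which the later "&lt;" pass
# then decodes a second time.
ampersand_first = decode_with(tuple(reversed(PAIRS)), encoded)

print(repr(ampersand_last))
print(repr(ampersand_first))
```

With the ampersand handled last the input decodes once; with the ampersand handled first it is decoded twice and the literal text is lost.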

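This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape. More generally, it is a good idea to stick with the standard library: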

# Python 2.x:
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.0 to 3.8 (HTMLParser.unescape was removed in 3.9):
import html.parser
html_parser = html.parser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.4+:
from html import unescape
unescaped = unescape(my_string)
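As a rough illustration of why the hand-written table is not general, the sketch below (Python 3.4+ assumed, sample string invented) compares it with html.unescape on input that contains a named entity and a hexadecimal reference the table does not list.

```python
from html import unescape


def html_decode(s):
    """Table-based decoder from the answer above (five entities only)."""
    html_codes = (
        ("'", "&#39;"),
        ('"', "&quot;"),
        (">", "&gt;"),
        ("<", "&lt;"),
        ("&", "&amp;"),
    )
    for char, entity in html_codes:
        s = s.replace(entity, char)
    return s


sample = "caf&eacute; &#x27;menu&#x27; &lt;b&gt;"

table_result = html_decode(sample)  # leaves &eacute; and &#x27; untouched
stdlib_result = unescape(sample)    # decodes every reference

print(table_result)
print(stdlib_result)
```

The table-based version only knows the five entities Django's escape emits, so anything else passes through unchanged.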

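As a suggestion: it may make more sense to store the HTML unescaped in your database. It would be worth looking into getting unescaped results back from BeautifulSoup if possible, and avoiding this process altogether.

With Django, escaping only occurs during template rendering, so to prevent escaping you just tell the templating engine not to escape your string. To do that, use one of these options in your template: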

{{ context_var|safe }}
{% autoescape off %}
    {{ context_var }}
{% endautoescape %}

Solution 3:

For HTML encoding, there's cgi.escape from the standard library:

>>> help(cgi.escape)
cgi.escape = escape(s, quote=None)
    Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
    is also translated.
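cgi.escape was deprecated in Python 3.2 and removed in 3.8, so on current Python the equivalent call is html.escape. A minimal sketch of the difference the quote flag makes (the sample text is made up); note that html.escape defaults to quote=True and then escapes single quotes as well:

```python
from html import escape

text = '<a href="x">Tom & Jerry\'s</a>'

# quote=False: only &, < and > are replaced.
angle_only = escape(text, quote=False)

# Default (quote=True): " and ' are replaced as well.
with_quotes = escape(text)

print(angle_only)
print(with_quotes)
```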

For HTML decoding, I use the following:

import re
from htmlentitydefs import name2codepoint
# for some reason, python 2.5.2 doesn't have this one (apostrophe)
name2codepoint['#39'] = 39

def unescape(s):
    "unescape HTML code refs; c.f. http://wiki.python.org/moin/EscapingHtml"
    return re.sub('&(%s);' % '|'.join(name2codepoint),
              lambda m: unichr(name2codepoint[m.group(1)]), s)

For anything more complicated, I use BeautifulSoup.
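The snippet above is Python 2 only (htmlentitydefs, unichr). A possible Python 3 port is sketched below, assuming the same approach is wanted: htmlentitydefs becomes html.entities and unichr becomes chr. The names ENTITIES, ENTITY_RE and unescape_named are my own, and like the original it handles named entities plus &#39; only, not numeric references in general.

```python
import re
from html.entities import name2codepoint

# Work on a copy so the module-level table is not mutated.
# '#39' is added, as in the answer, so that &#39; (apostrophe) is decoded too.
ENTITIES = dict(name2codepoint)
ENTITIES["#39"] = 39

ENTITY_RE = re.compile("&(%s);" % "|".join(map(re.escape, ENTITIES)))


def unescape_named(s):
    """Replace known named entities (and &#39;) with their characters."""
    return ENTITY_RE.sub(lambda m: chr(ENTITIES[m.group(1)]), s)


print(unescape_named("&lt;p&gt;Sacr&eacute; bleu &amp; &#39;co&#39;&lt;/p&gt;"))
```

Unknown entities do not match the pattern and are left as they are.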

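Solution 4:

Use daniel's solution if the set of encoded characters is relatively restricted. Otherwise, use one of the numerous HTML-parsing libraries.

I like BeautifulSoup because it can handle malformed XML/HTML:

http://www.crummy.com/software/BeautifulSoup/

For your question, there's an example in their documentation: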

from BeautifulSoup import BeautifulStoneSoup
BeautifulStoneSoup("Sacr&eacute; bl&#101;u!", 
                   convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
# u'Sacr\xe9 bleu!'

Solution 5:

In Python 3.4+:

import html

html.unescape(your_string)
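For a feel of what that call does, here is a short sketch with an invented input that mixes named, decimal and hexadecimal references:

```python
import html

encoded = (
    '&lt;img class=&quot;size-medium&quot; alt=&quot;&quot; '
    'width=&quot;300&quot; /&gt; &#169; &#xa9; &copy;'
)

# Named (&lt;, &copy;), decimal (&#169;) and hex (&#xa9;) references
# are all converted in one pass.
decoded = html.unescape(encoded)
print(decoded)
```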

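Solution 6:

See the bottom of this page on the Python wiki; there are at least two options for "unescaping" HTML.

Solution 7:

If anyone is looking for a simple way to do this via the Django templates, you can always use a filter like this: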

<html>
{{ node.description|safe }}
</html>

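I had some data coming from a vendor, and everything I posted had the HTML tags actually written on the rendered page, as if you were looking at the source.

Solution 8:

Daniel's comment as an answer:

"Escaping only occurs in Django during template rendering. Therefore, there's no need for an unescape: you just tell the templating engine not to escape. Either {{ context_var|safe }} or {% autoescape off %}{{ context_var }}{% endautoescape %}"

Solution 9:

I found a fine function at: http://snippets.dzone.com/posts/show/4569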

def decodeHtmlentities(string):
    import re
    entity_re = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")

    def substitute_entity(match):
        from htmlentitydefs import name2codepoint as n2cp
        ent = match.group(2)
        if match.group(1) == "#":
            return unichr(int(ent))
        else:
            cp = n2cp.get(ent)

            if cp:
                return unichr(cp)
            else:
                return match.group()

    return entity_re.subn(substitute_entity, string)[0]

Solution 10:

Even though this is a really old question, this may work.

Django 1.5.5

In [1]: from django.utils.text import unescape_entities
In [2]: unescape_entities('&lt;img class="size-medium wp-image-113" style="margin-left: 15px;" title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" alt="" width="300" height="194" /&gt;')
Out[2]: u'<img class="size-medium wp-image-113" style="margin-left: 15px;" title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" alt="" width="300" height="194" />'

Solution 11:

I found this in the Cheetah source code:

htmlCodes = [
    ['&', '&amp;'],
    ['<', '&lt;'],
    ['>', '&gt;'],
    ['"', '&quot;'],
]
htmlCodesReversed = htmlCodes[:]
htmlCodesReversed.reverse()
def htmlDecode(s, codes=htmlCodesReversed):
    """ Returns the ASCII decoded version of the given HTML string. This does
        NOT remove normal HTML tags like <p>. It is the inverse of htmlEncode()."""
    for code in codes:
        s = s.replace(code[1], code[0])
    return s

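I'm not sure why they reverse the list. I think it has to do with the way they encode, so in your case it may not need to be reversed. Also, if I were you I would change htmlCodes to be a list of tuples rather than a list of lists... this is going in my library though :)

I noticed your title asked for encoding too, so here is Cheetah's encode function.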

def htmlEncode(s, codes=htmlCodes):
    """ Returns the HTML encoded version of the given string. This is useful to
        display a plain ASCII text string on a web page."""
    for code in codes:
        s = s.replace(code[0], code[1])
    return s
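On the question of order: a plausible reading is that "&" has to be encoded first (otherwise the "&" inside freshly produced entities gets escaped again), and decoding then runs the same table backwards. The sketch below tries both orders on an invented sample; apply_pairs is a hypothetical helper, not part of Cheetah.

```python
HTML_CODES = [
    ("&", "&amp;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ('"', "&quot;"),
]


def apply_pairs(s, pairs, decode=False):
    """Hypothetical helper: run the replacements in the given order."""
    for char, entity in pairs:
        s = s.replace(entity, char) if decode else s.replace(char, entity)
    return s


text = 'if a < b & b > c: print("ok")'

# "&" first: each special character is escaped exactly once.
encoded = apply_pairs(text, HTML_CODES)

# "&" last: the "&" of "&lt;", "&gt;" and "&quot;" is escaped a second time.
double_escaped = apply_pairs(text, list(reversed(HTML_CODES)))

# Decoding with the reversed table undoes the encoding.
round_trip = apply_pairs(encoded, list(reversed(HTML_CODES)), decode=True)

print(encoded)
print(double_escaped)
print(round_trip == text)
```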

Solution 12:

You can also use django.utils.html.escape

from django.utils.html import escape

something_nice = escape(request.POST['something_naughty'])

Solution 13:

This is the easiest solution to this problem:

{% autoescape off %}
   {{ body }}
{% endautoescape %}

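From this page.

Solution 14:

Below is a Python function that uses the module htmlentitydefs. It is not perfect. The version of htmlentitydefs that I have is incomplete, and it assumes that all entities decode to one code point, which is wrong for entities like &NotEqualTilde;: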

http://www.w3.org/TR/html5/named-character-references.html

NotEqualTilde;     U+02242 U+00338    ≂̸

Aside from those caveats, though, here's the code.

import re
import htmlentitydefs  # Python 2 module; renamed html.entities in Python 3

def decodeHtmlText(html):
    """
    Given a string of HTML that would parse to a single text node,
    return the text value of that node.
    """
    # Fast path for common case.
    if html.find("&") < 0: return html
    return re.sub(
        '&(?:#(?:x([0-9A-Fa-f]+)|([0-9]+))|([a-zA-Z0-9]+));',
        _decode_html_entity,
        html)

def _decode_html_entity(match):
    """
    Regex replacer that expects hex digits in group 1, or
    decimal digits in group 2, or a named entity in group 3.
    """
    hex_digits = match.group(1)  # '&#x10;' -> unichr(0x10)
    if hex_digits: return unichr(int(hex_digits, 16))
    decimal_digits = match.group(2)  # '&#10;' -> unichr(10)
    if decimal_digits: return unichr(int(decimal_digits, 10))
    name = match.group(3)  # name is 'lt' when '&lt;' was matched.
    if name:
        decoding = (htmlentitydefs.name2codepoint.get(name)
            # Treat &GT; like &gt;.
            # This is wrong for &Gt; and &Lt; which HTML5 adopted from MathML.
            # If htmlentitydefs included mappings for those entities,
            # then this code will magically work.
            or htmlentitydefs.name2codepoint.get(name.lower()))
        if decoding is not None: return unichr(decoding)
    return match.group(0)  # Treat "&noSuchEntity;" as "&noSuchEntity;"
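For comparison, on Python 3.4+ the standard library covers the multi-code-point case mentioned above. A small sketch, assuming the html.entities.html5 table (which maps entity names to strings rather than to single code points):

```python
import html
from html.entities import html5

# html5 maps entity names (here with the trailing ";") to strings,
# so an entity may expand to more than one code point.
expansion = html5["NotEqualTilde;"]
print([hex(ord(ch)) for ch in expansion])

# html.unescape uses the same table, so the two code points come through.
decoded = html.unescape("a &NotEqualTilde; b")
print(len(decoded))
```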

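Solution 15:

Searching for the simplest solution to this question in Django and Python, I found that you can use their built-in functions to escape/unescape HTML code.

Example

I saved your HTML code in scraped_html and clean_html: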

scraped_html = (
    '&lt;img class=&quot;size-medium wp-image-113&quot; '
    'style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; '
    'src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; '
    'alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;'
)
clean_html = (
    '<img class="size-medium wp-image-113" style="margin-left: 15px;" '
    'title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" '
    'alt="" width="300" height="194" />'
)

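Django

You need Django >= 1.0

Unescape

To unescape your scraped HTML code, you can use django.utils.text.unescape_entities, which:

Converts all named and numeric character references to the corresponding Unicode characters.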

>>> from django.utils.text import unescape_entities
>>> clean_html == unescape_entities(scraped_html)
True

Escape

To escape your clean HTML code, you can use django.utils.html.escape, which:

Returns the given text with ampersands, quotes and angle brackets encoded for use in HTML.

>>> from django.utils.html import escape
>>> scraped_html == escape(clean_html)
True

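Python

You need Python >= 3.4

Unescape

To unescape your scraped HTML code, you can use html.unescape, which:

Converts all named and numeric character references (e.g. &gt;, &#62;, &#x3e;) in the string s to the corresponding Unicode characters.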

>>> from html import unescape
>>> clean_html == unescape(scraped_html)
True

Escape

To escape your clean HTML code, you can use html.escape, which:

Converts the characters &, < and > in string s to HTML-safe sequences.

>>> from html import escape
>>> scraped_html == escape(clean_html)
True
