The content is truncated after '< script >' #80

ZiheLiu · 2021-02-25T11:28:35Z

When I convert The Go Programming Language to PDF, the output file is truncated after section 5.2.

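The cause is the call to html.unescape(), which converts HTML escape sequences into the corresponding Unicode characters.
However, the original HTML of exercise 5.3 (in English: "Exercise 5.3: Write a function that prints the contents of all text nodes. Take care not to visit <script> and <style> elements, since they are not visible to the reader.") contains <code>&lt;script&gt;</code> and <code>&lt;style&gt;</code>.
As a result, unescaping turns &lt;script&gt; into a literal <script> tag, which opens a script element, and the content after it is truncated.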
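
The effect can be reproduced without the PDF toolchain. The sketch below is only an illustration: it uses the standard library's html.parser as a stand-in for whatever HTML parser the PDF renderer uses (an assumption), and the fragment and the helper names StartTagCollector and start_tags are made up for the example.

```python
import html
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Record the name of every start tag the parser recognizes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def start_tags(markup):
    """Return the start tags seen while parsing markup."""
    parser = StartTagCollector()
    parser.feed(markup)
    parser.close()
    return parser.tags


# Hypothetical serialized fragment: the entities stand for literal text.
serialized = (
    "<p>Do not visit <code>&lt;script&gt;</code> elements.</p>"
    "<h2>5.3</h2><p>Later content</p>"
)

# html.unescape() turns the entities into a real <script> start tag.
unescaped = html.unescape(serialized)

print(start_tags(serialized))
print(start_tags(unescaped))
```

With the escaped input the parser sees p, code, h2 and p. After unescaping it sees p, code and script, and then stops recognizing tags, because everything after an unclosed <script> is treated as script content.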

When I remove the html.unescape() call as shown below, the output PDF contains the whole content.

    def parser(self):
        tree = ET.HTML(self.original)
        if tree.xpath('//section[@class="normal markdown-section"]'):
            context = tree.xpath('//section[@class="normal markdown-section"]')[0]
        else:
            context = tree.xpath('//section[@class="normal"]')[0]
        if context.find('footer'):
            context.remove(context.find('footer'))
        context = self.parsehead(context)
-       return html.unescape(ET.tostring(context).decode())
+       return ET.tostring(context).decode()
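
If html.unescape() was there to turn numeric character references back into readable Chinese text (this is a guess, the issue does not say), serializing directly to a Unicode string may be an alternative that keeps &lt; and &gt; escaped. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra packages. lxml's etree.tostring() also accepts encoding='unicode', but this has not been tried against the project's code.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with non-ASCII text and an escaped tag name.
elem = ET.fromstring("<p>练习 5.3 <code>&lt;script&gt;</code></p>")

# Default serialization: ASCII bytes, non-ASCII characters become
# numeric character references, markup characters stay escaped.
ascii_out = ET.tostring(elem).decode()

# encoding='unicode' returns a str: non-ASCII characters are kept
# as-is, markup characters still stay escaped.
unicode_out = ET.tostring(elem, encoding="unicode")

print(ascii_out)
print(unicode_out)
```

Both outputs keep &lt;script&gt; as text, so neither introduces a real <script> element. Only the second keeps the Chinese characters literal.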
