This repository has been archived by the owner on Nov 17, 2022. It is now read-only.

The content is truncated after '< script >' #80

Open
ZiheLiu opened this issue Feb 25, 2021 · 0 comments

ZiheLiu commented Feb 25, 2021

When I convert The Go Programming Language to PDF, the output PDF file is truncated after section 5.2.

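The cause is that the parser calls html.unescape() to convert HTML character references into the corresponding Unicode characters.
However, the original HTML of "练习 5.3: 编写函数输出所有text结点的内容。注意不要访问<script>和<style>元素,因为这些元素对浏览者是不可见的。" ("Exercise 5.3: Write a function that prints the contents of all text nodes. Take care not to visit <script> and <style> elements, because they are not visible to the reader.") contains <code>&lt;script&gt;</code>&#x548C;<code>&lt;style&gt;</code>.
As a result, once &lt;script&gt; is unescaped into <script>, it is no longer literal text but an actual opening tag, and the content after it is truncated.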
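
A minimal sketch of the effect described above, using only the standard library. The fragment is the one quoted in this issue; the variable names are illustrative and not taken from the repository.

```python
import html

# Fragment quoted in the issue: the tag names are written as character
# references, so they are literal text inside <code>, not markup.
escaped = "<code>&lt;script&gt;</code>&#x548C;<code>&lt;style&gt;</code>"

# html.unescape() resolves every character reference, including &lt; and &gt;.
unescaped = html.unescape(escaped)

print(escaped)
print(unescaped)  # the literal text has become real <script> and <style> tags
```

After unescaping, the string contains an unclosed `<script>` tag, which is consistent with the truncation reported here.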

When I remove the html.unescape() call as follows, the output PDF contains the whole content.

    def parser(self):
        tree = ET.HTML(self.original)
        if tree.xpath('//section[@class="normal markdown-section"]'):
            context = tree.xpath('//section[@class="normal markdown-section"]')[0]
        else:
            context = tree.xpath('//section[@class="normal"]')[0]
        if context.find('footer'):
            context.remove(context.find('footer'))
        context = self.parsehead(context)
-       return html.unescape(ET.tostring(context).decode())
+       return ET.tostring(context).decode()
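
The following sketch illustrates why dropping the call may be enough: a serializer already writes literal `<` and `>` in text as `&lt;` and `&gt;`, so the output stays well-formed without further processing. It assumes the standard-library `xml.etree.ElementTree` as a stand-in for the `ET` module in the diff (which is probably lxml, given `ET.HTML`), and the sample markup is hypothetical.

```python
import html
import xml.etree.ElementTree as ET  # assumption: stand-in for the issue's ET module

# Hypothetical sample: a paragraph that mentions a tag name as literal text.
source = "<p><code>&lt;script&gt;</code> and more text</p>"
elem = ET.fromstring(source)

# In the parsed tree the text is already the plain string "<script>".
code_text = elem.find("code").text

# Serializing escapes it again, so the markup stays well-formed.
serialized = ET.tostring(elem).decode()

# Unescaping the serialized markup reintroduces a real <script> tag.
broken = html.unescape(serialized)

print(code_text)
print(serialized)
print(broken)
```

Here `serialized` keeps the tag name as escaped text, while `broken` contains an opening `<script>` tag with no matching close.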