Pandoc 1.19 adds useless HTML elements to Org output #21

Open
tshu-w opened this issue Jun 27, 2017 · 9 comments

Comments

@tshu-w

tshu-w commented Jun 27, 2017

The output looks like this, and I wonder why.
[Screenshot: screen shot 2017-06-27 at 11 58 18 am]

@alphapapa
Owner

Please provide more information, e.g.:

  • OS/platform
  • Emacs version
  • Pandoc version
  • Browser version
  • Method of capture (bookmarklet, shell script, etc)

@tshu-w
Author

tshu-w commented Jun 27, 2017

Sorry, here is the information:

  • macOS 10.12.5
  • GNU Emacs 25.2.1 (x86_64-apple-darwin16.6.0, Carbon Version 157 AppKit 1504.83)
    of 2017-06-18
  • pandoc 1.19.2.1 (installed via Homebrew)
  • uses eww
  • method of capture: bookmarklet (in Safari or Chrome)

If I run curl http://kitchingroup.cheme.cmu.edu/blog/2014/07/17/Pandoc-does-org-mode-now/ | pandoc -f html -t org, I get an Org file just like the picture above, but everything looks fine in eww.

@alphapapa
Owner

eww is unrelated to this issue. Pandoc does the conversion from HTML to Org.

> If I run curl http://kitchingroup.cheme.cmu.edu/blog/2014/07/17/Pandoc-does-org-mode-now/ | pandoc -f html -t org, I get an Org file just like the picture above

That would indicate that it's an issue with Pandoc. I don't have that version of Pandoc, and it's not practical for me to build it manually right now.

I'd suggest checking the Pandoc documentation to see if there have been any features added that would add those HTML DIV elements to the output; maybe there's an option to turn it off. If so, we could add that option to this package.

Otherwise, it would seem to be a bug in Pandoc, adding useless HTML elements to the Org output; if so, I guess you should report it to the Pandoc bug tracker.

Please let me know what you find out. Thanks.

@alphapapa alphapapa changed the title from "The effect is not as good as the picture" to "Pandoc 1.19 adds useless HTML elements to Org output" Jun 28, 2017
@xuchunyang
Contributor

xuchunyang commented Jun 28, 2017

I can reproduce with Pandoc 1.18. Maybe all Pandoc versions have the same problem? @alphapapa Can you reproduce?

@tarleb

tarleb commented Jun 28, 2017

We are discussing the problem in jgm/pandoc#3771. Any input would be appreciated.

@alphapapa
Owner

alphapapa commented Jun 29, 2017

@xuchunyang Not all versions, but apparently some recent ones do. Maybe you can check their changelogs and figure out which one started doing it.

@tarleb Thanks, I posted a comment there. Hoping these will be disabled for normal org output.

@tshu-w
Author

tshu-w commented Jun 29, 2017

@alphapapa I'm closing this since it's an upstream issue.

@tshu-w tshu-w closed this as completed Jun 29, 2017
@alphapapa
Owner

@Voleking Thanks. Let's keep an eye on that issue, though; depending on their solution, we might need to add options or additional processing steps to remove extraneous HTML stuff.

@alphapapa alphapapa reopened this Jun 30, 2017
@alphapapa
Owner

For future reference, here's a function from my init file that I improved to handle this. I should use it to handle this problem.

(cl-defun ap/org-capture-web-page-with-eww-readable
      (&optional url
                 (filter-fns '(remove-dos-crlf  ; This function should always be first and should always be included
                               ap/org-remove-html-blocks-from-string)))
    (let* ((url (or url (ap/get-first-url-in-kill-ring)))
           (html (ap/url-html url))
           (result (ap/eww-readable html))
           (title (cdr result))
           (title-linked (org-make-link-string url title))
           (content (with-temp-buffer
                      (insert (car result))
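                      ;; Convert to Org with Pandoc.
                      ;; NOTE: "--no-wrap" is deprecated as of Pandoc 1.16 in favor of
                      ;; "--wrap=none"; on newer versions, use the latter instead.
                      (unless (= 0 (call-process-region (point-min) (point-max)
                                                        "pandoc" t t nil "--no-wrap"
                                                        "-f" "html" "-t" "org"))
                        (error "Pandoc failed."))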
                      ;; Demote page headings in capture buffer to below the
                      ;; top-level Org heading and "Article" 2nd-level heading
                      (save-excursion
                        (goto-char (point-min))
                        (while (re-search-forward (rx bol (1+ "*") (1+ space)) nil t)
                          (beginning-of-line)
                          (insert "**")
                          (end-of-line)))
                      (buffer-string)))
           (timestamp (format-time-string (concat "[" (substring (cdr org-time-stamp-formats) 1 -1) "]"))))
      (when filter-fns
        (dolist (fn filter-fns)
          (setq content (funcall fn content))))
      (concat title-linked " :website:\n\n" timestamp "\n\n** Article\n\n" content)))

(defun remove-dos-crlf (&optional s)
  "Remove all DOS CRLF (^M) in buffer or in string S."
  (interactive)
  (if s
      (replace-regexp-in-string (string ?\C-m) "" s
                                'fixedcase 'literal)
    (save-excursion
      (goto-char (point-min))
      (while (search-forward (string ?\C-m) nil t)
        (replace-match "")))))

(defun ap/org-remove-html-blocks-from-string (s)
    "Remove \"#+BEGIN_HTML...#+END_HTML\" blocks from Org-formatted string S."
    (replace-regexp-in-string (rx (optional "\n") "#+BEGIN_HTML" (minimal-match (1+ anything)) "#+END_HTML" (optional "\n"))
                              "" s 'fixedcase 'literal))
