Pandoc 1.19 adds useless HTML elements to Org output #21
Comments
Please provide more information, e.g.:
|
Sorry, here is the information:
If I use |
That would indicate that it's an issue with Pandoc. I don't have that version of Pandoc, and it's not practical for me to build it manually right now. I'd suggest checking the Pandoc documentation to see if there have been any features added that would add those HTML DIV elements to the output; maybe there's an option to turn it off. If so, we could add that option to this package. Otherwise, it would seem to be a bug in Pandoc, adding useless HTML elements to the Org output; if so, I guess you should report it to the Pandoc bug tracker. Please let me know what you find out. Thanks. |
I can reproduce with Pandoc 1.18. Maybe all Pandoc versions have the same problem? @alphapapa Can you reproduce? |
We are discussing the problem in jgm/pandoc#3771. Any input would be appreciated. |
@xuchunyang Not all versions, but apparently some recent ones do. Maybe you can check their changelogs and figure out which one started doing it. @tarleb Thanks, I posted a comment there. Hoping these will be disabled for normal |
@alphapapa I'm closing this since it's an upstream issue. |
@Voleking Thanks. Let's keep an eye on that issue, though; depending on their solution, we might need to add options or additional processing steps to remove extraneous HTML stuff. |
For future reference, here's a function from my init file that I improved to handle this. I should use it to handle this problem.

(cl-defun ap/org-capture-web-page-with-eww-readable
    (&optional url
               (filter-fns '(remove-dos-crlf ; This function should always be first and should always be included
                             ap/org-remove-html-blocks-from-string)))
  (let* ((url (or url (ap/get-first-url-in-kill-ring)))
         (html (ap/url-html url))
         (result (ap/eww-readable html))
         (title (cdr result))
         (title-linked (org-make-link-string url title))
         (content (with-temp-buffer
                    (insert (car result))
                    ;; Convert to Org with Pandoc
                    (unless (= 0 (call-process-region (point-min) (point-max)
                                                      "pandoc" t t nil "--no-wrap"
                                                      "-f" "html" "-t" "org"))
                      (error "Pandoc failed."))
                    ;; Demote page headings in capture buffer to below the
                    ;; top-level Org heading and "Article" 2nd-level heading
                    (save-excursion
                      (goto-char (point-min))
                      (while (re-search-forward (rx bol (1+ "*") (1+ space)) nil t)
                        (beginning-of-line)
                        (insert "**")
                        (end-of-line)))
                    (buffer-string)))
         (timestamp (format-time-string (concat "[" (substring (cdr org-time-stamp-formats) 1 -1) "]"))))
    (when filter-fns
      (dolist (fn filter-fns)
        (setq content (funcall fn content))))
    (concat title-linked " :website:\n\n" timestamp "\n\n** Article\n\n" content)))

(defun remove-dos-crlf (&optional s)
  "Remove all DOS CRLF (^M) in buffer or in string S."
  (interactive)
  (if s
      (replace-regexp-in-string (string ?\C-m) "" s
                                'fixedcase 'literal)
    (save-excursion
      (goto-char (point-min))
      (while (search-forward (string ?\C-m) nil t)
        (replace-match "")))))

(defun ap/org-remove-html-blocks-from-string (s)
  "Remove \"#+BEGIN_HTML...#+END_HTML\" blocks from Org-formatted string S."
  (replace-regexp-in-string (rx (optional "\n") "#+BEGIN_HTML" (minimal-match (1+ anything)) "#+END_HTML" (optional "\n"))
                            "" s 'fixedcase 'literal)) |
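To see what the `#+BEGIN_HTML` filter in the comment above does, it can be tried on a small string without running Pandoc. The following is only a sketch: `my/strip-org-html-blocks` is a hypothetical stand-alone copy of the same regexp, and the sample text is hand-written to resemble the reported output, not captured from a real Pandoc run.

```elisp
(require 'rx)

(defun my/strip-org-html-blocks (s)
  "Return S with \"#+BEGIN_HTML ... #+END_HTML\" blocks removed.
Hypothetical helper that mirrors the regexp used in the thread."
  (replace-regexp-in-string
   (rx (optional "\n") "#+BEGIN_HTML"
       (minimal-match (1+ anything))   ; non-greedy, so each block is matched separately
       "#+END_HTML" (optional "\n"))
   "" s 'fixedcase 'literal))

;; Hand-written sample that imitates an Org document with a stray HTML block.
(let ((sample (concat "* Heading\n\n"
                      "#+BEGIN_HTML\n"
                      "  <div class=\"content\">\n"
                      "#+END_HTML\n\n"
                      "Body text.\n")))
  (my/strip-org-html-blocks sample))
;; => "* Heading\n\nBody text.\n"
```

Because the regexp also consumes one newline on each side of the block, the surrounding paragraphs stay separated by a single blank line. It only matches the `#+BEGIN_HTML` spelling shown in this thread; if another Pandoc or Org version marks such blocks differently, the pattern would need adjusting.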
look like this, and I wonder why.