2022-08-31
How to Convert HTML to Clean Markdown With Pandoc
I have collected source material in form of HTML pages that I would like to keep in one place as knowledge base (technically: create Obsidian vault for these pages). But first I needed to convert them all to Markdown. First tool to use that came into my mind was using Pandoc
I started with basic syntax to check what is my baseline:
pandoc -i index.html -o index.md
The text were converted to markdown but had some additional elements that wish not to see in the output markdown document.
- remaining after div elements, iframes
- references in curly brackets
- raw HTML comments as codefences blocks
Here is an example of the clutter in my Markdown document.
::: iframe
:::
::: site-container
::: site-header
::: wrap
::: title-area
[Page Title](../../../index.html)
:::
::: {.widget-area .header-widget-area}
::: {#nav_menu-27 .section .widget .widget_nav_menu}
::: widget-wrap
- [[Articles](../../../blog/index.html)]{#menu-item-2360}
- [[Books ](../../index.html)]{#menu-item-8729}
:::
:::
:::
:::
:::
::: site-inner
::: content-sidebar-wrap
::: {.content role="main"}
::: entry-header
- "Happiness doesn't just flow from success, it actually causes it".
<!-- -->
Using --strict, -s mode adds YAML frontmatter with metadata - which contains few fields that I want to keep.
My final solution
To remove parts that remained in Markdown document you can use grep and sed
pandoc -s -i index.html -t markdown |\
grep -v "^:" |\
grep -v '^```' |\
grep -v '<!-- -->' |\
sed -e ':again' -e N -e '$!b again' -e 's/{[^}]*}//g' \
>! index.md
The sed is used to remove content in curly brackets spanning multiple lines:
# Linux
sed ':again;$!N;$!b again; s/{[^}]*}//g'
# macOS
sed -e ':again' -e N -e '$!b again' -e 's/{[^}]*}//g' file
solution by John1024 from: Linux Stack Exchange
Note
You can further experiment with Markdown variants supported by pandoc.
In addition to pandoc’s extended Markdown, the following Markdown variants are supported:
markdown_phpextra(PHP Markdown Extra)markdown_github(deprecated GitHub-Flavored Markdown)markdown_mmd(MultiMarkdown)markdown_strict(Markdown.pl)commonmark(CommonMark)gfm(Github-Flavored Markdown)commonmark_x(CommonMark with many pandoc extensions)
Beyond pandoc
You can give a try to a dedicated python package for converting HTML to markdown: markdownify · PyPI - it has command line interface and support many options for the conversion.