How to Convert HTML to Clean Markdown With Pandoc

I have collected source material in the form of HTML pages that I would like to keep in one place as a knowledge base (technically, an Obsidian vault for these pages). First, though, I needed to convert them all to Markdown. The first tool that came to mind was Pandoc.

I started with the basic syntax to establish a baseline:

pandoc index.html -o index.md

The text was converted to Markdown, but the output contained some extra elements that I did not want to see in the document:

leftovers of div and iframe elements
attributes in curly brackets
raw HTML comments rendered as fenced code blocks

Here is an example of the clutter in my Markdown document.

::: iframe
:::

::: site-container
::: site-header
::: wrap
::: title-area
[Page Title](../../../index.html)
:::

::: {.widget-area .header-widget-area}
::: {#nav_menu-27 .section .widget .widget_nav_menu}
::: widget-wrap
- [[Articles](../../../blog/index.html)]{#menu-item-2360}
- [[Books ](../../index.html)]{#menu-item-8729}
:::
:::
:::
:::
:::

::: site-inner
::: content-sidebar-wrap
::: {.content role="main"}
::: entry-header

- "Happiness doesn't just flow from success, it actually causes it".


<!-- -->

Using -s (--standalone) adds YAML front matter with metadata, which contains a few fields that I want to keep.

My final solution

To remove the parts that remained in the Markdown document, you can use grep and sed:

pandoc -s index.html -t markdown |\
grep -v "^:" |\
grep -v '^```' |\
grep -v '<!-- -->' |\
sed -e ':again' -e N -e '$!b again' -e 's/{[^}]*}//g' \
> index.md
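
Since the goal is to convert many pages, the same filters can be wrapped in a small shell function and applied in a loop. This is only a sketch under a few assumptions: the names clean_md and convert_all are made up for illustration, the pages are assumed to sit in one directory with an .html extension, and the fence pattern is written as '^`\{3\}', which should match the same lines as the three-backtick pattern above.

```shell
# Hypothetical helper: filter pandoc's Markdown output on stdin.
clean_md() {
  # Drop fenced-div lines, code fence lines and empty HTML comments,
  # then strip {...} attribute blocks, even when they span several lines.
  grep -v '^:' |
    grep -v '^`\{3\}' |
    grep -v '<!-- -->' |
    sed -e ':again' -e N -e '$!b again' -e 's/{[^}]*}//g'
}

# Hypothetical helper: convert every .html file in a directory.
convert_all() {
  for f in "$1"/*.html; do
    [ -e "$f" ] || continue   # skip when the glob matched nothing
    pandoc -s -f html -t markdown "$f" | clean_md > "${f%.html}.md"
  done
}
```

Calling convert_all . would then write an .md file next to each page in the current directory. Keep in mind that the fence filter also removes the fences of legitimate code blocks, so it is worth checking a few converted pages by hand.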

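The sed command removes content in curly brackets, including content that spans multiple lines. It first reads the whole input into the pattern space, so the substitution can match across line breaks: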

# Linux
sed ':again;$!N;$!b again; s/{[^}]*}//g'

# macOS
sed -e ':again' -e N -e '$!b again' -e 's/{[^}]*}//g' file

Solution by John1024, from Linux Stack Exchange.
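
To see why the loop matters, here is a small sketch that runs the command on a made-up three-line input where the curly brackets open on one line and close on the next. The function name strip_braces and the sample text are assumptions for illustration only.

```shell
# Hypothetical wrapper around the sed command shown above.
strip_braces() {
  sed -e ':again' -e N -e '$!b again' -e 's/{[^}]*}//g'
}

# Made-up sample: the attribute block spans two lines.
sample() {
  printf '%s\n' 'Heading {#intro' '.section}' 'Body text'
}

# Whole input is joined first, so the block is removed.
sample | strip_braces

# Line-by-line substitution never sees both brackets on one line,
# so the input passes through unchanged.
sample | sed 's/{[^}]*}//g'
```

The first pipeline should print the heading and the body text without the attribute block, while the second leaves the brackets in place.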

Note

You can further experiment with Markdown variants supported by pandoc.

In addition to pandoc’s extended Markdown, the following Markdown variants are supported:

markdown_phpextra (PHP Markdown Extra)
markdown_github (deprecated GitHub-Flavored Markdown)
markdown_mmd (MultiMarkdown)
markdown_strict (Markdown.pl)
commonmark (CommonMark)
gfm (GitHub-Flavored Markdown)
commonmark_x (CommonMark with many pandoc extensions)
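
One way to experiment is to convert the same page with a few of these variants and compare the results side by side. The sketch below assumes a file called index.html and uses made-up helper names (variant_output, compare_variants); the variant names are taken from the list above and passed to pandoc with -t.

```shell
# Hypothetical helper: index.html + gfm -> index.gfm.md
variant_output() {
  printf '%s.%s.md\n' "${1%.html}" "$2"
}

# Hypothetical helper: write one output file per Markdown variant.
compare_variants() {
  for variant in gfm commonmark markdown_strict; do
    pandoc -f html -t "$variant" "$1" -o "$(variant_output "$1" "$variant")"
  done
}
```

After running compare_variants index.html, a tool such as diff can show how much clutter each variant leaves behind.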

Beyond pandoc

You can also try a dedicated Python package for converting HTML to Markdown: markdownify (on PyPI). It has a command-line interface and supports many conversion options.
