Ask HN: How should I convert Microsoft Word documents to Markdown?

I took over a project that was built by an overseas team. They set up a data ingestion pipeline in which one step uses LibreOffice (in headless mode) to convert Microsoft Word documents to PDFs. Later we convert all PDFs to Markdown. Their reasoning was that it is best to convert everything to PDF first and then convert all of the PDFs to Markdown.

What I notice is that LibreOffice can create very complex PDFs when the Microsoft Word document has:

1. tables

2. multiple columns

3. strikethrough text

I am thinking we should go straight from Microsoft Word to Markdown.

What is the right software for that?

5 points | by lkrubner 1 day ago

7 comments

  • mackatsol 19 hours ago
    Pandoc is awesome for this: `pandoc input.docx -o output.md`. There's more you can do, with style sheets and so on, which you will likely have to dig into for the tables and multiple columns to come out the way you want. You can also extract media files from inside a docx file: `pandoc --extract-media=. input.docx -o output.md`
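
For an ingestion pipeline, the pandoc commands above can be wrapped in a small batch script. This is only a rough sketch, not a tested pipeline: the function names `pandoc_command` and `convert_dir` and the `media/<name>` output layout are made up for illustration, and `--to=gfm` and `--wrap=none` are choices of this sketch rather than anything the thread prescribes. It assumes pandoc is installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(src: Path, out_dir: Path) -> list[str]:
    """Build the pandoc argument list for one .docx file (runs nothing)."""
    dest = out_dir / (src.stem + ".md")
    media_dir = out_dir / "media" / src.stem  # hypothetical layout for extracted images
    return [
        "pandoc",
        "--from=docx",
        "--to=gfm",      # GitHub-Flavored Markdown, which has ~~strikethrough~~
        "--wrap=none",   # keep each paragraph on one line
        f"--extract-media={media_dir}",
        str(src),
        "-o",
        str(dest),
    ]


def convert_dir(in_dir: Path, out_dir: Path) -> list[Path]:
    """Convert every .docx directly inside in_dir; return the files that failed."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(in_dir.glob("*.docx")):
        if src.name.startswith("~$"):
            continue  # skip the lock files Word leaves next to open documents
        result = subprocess.run(
            pandoc_command(src, out_dir), capture_output=True, text=True
        )
        if result.returncode != 0:
            failed.append(src)
    return failed


if __name__ == "__main__":
    for path in convert_dir(Path("docs"), Path("markdown")):
        print("failed:", path)
```

Collecting the failures instead of stopping at the first one makes it easier to see which Word files need manual attention. Complex tables may still not come out as simple Markdown tables, so it is worth spot-checking a few of the worst documents by hand.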
  • qup 20 hours ago
    I'd give an LLM a shot before I ruled it out.

    I had it generating .docx the other day and it did pretty well, so I assume it understands the format just fine.

    And they're excellent at markdown.

  • kha1n3vol3 1 day ago
    Start with pandoc before reinventing the wheel.
  • ramoz 1 day ago
    Pandoc might be able to do this. I found this:

    https://gist.github.com/plembo/409a8d7b1bae66622dbcd26337bbb...

  • dhruvyads 15 hours ago
    Claude Code can do these types of conversions really well, unless you're trying to convert in bulk.
  • snailshare 1 day ago
    Pandoc can do this, I think.
  • verdverm 21 hours ago
    Native support: https://techcommunity.microsoft.com/blog/onedriveblog/introd...

    Microsoft OSS python: https://github.com/microsoft/markitdown

    There seem to be many addons that enable this, and pandoc as others have suggested
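
For the markitdown route, a minimal sketch might look like the following. The helper name `docx_to_markdown` and its `converter` parameter are hypothetical and exist only so the function can be tried without the package installed; the `MarkItDown().convert(...)` call and the `text_content` attribute follow the project's README. How well it handles tables, multiple columns, and strikethrough text is not something this sketch verifies.

```python
from pathlib import Path


def docx_to_markdown(src: Path, converter=None) -> str:
    """Return Markdown text for one document.

    `converter` is any object with a .convert(path) method whose result has
    a .text_content attribute. If omitted, a markitdown.MarkItDown instance
    is created (requires: pip install 'markitdown[all]').
    """
    if converter is None:
        # Imported lazily so the helper can be exercised without markitdown.
        from markitdown import MarkItDown
        converter = MarkItDown()
    result = converter.convert(str(src))
    return result.text_content


if __name__ == "__main__":
    # "input.docx" is a placeholder file name.
    Path("output.md").write_text(
        docx_to_markdown(Path("input.docx")), encoding="utf-8"
    )
```

Running the same handful of difficult Word files through both this and pandoc, then comparing the Markdown side by side, is probably the quickest way to pick one.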