from Hacker News

MarkItDown: Python tool for converting files and office documents to Markdown

by Handy-Man on 12/13/24, 6:02 PM with 81 comments

by simonw on 12/13/24, 7:03 PM
If you have uv installed you can run this against a file without first installing anything like this:
```
    uvx markitdown path-to-file.pdf
```
(This will cache the necessary packages the first time you run it, then reuse those cached packages on future invocations.)
I've tried it against HTML and PDFs so far and it seems pretty decent.
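For anyone who wants to call this from a script rather than a shell, here is a minimal sketch that shells out to the same command. It assumes `uvx` is on your PATH and that `markitdown` writes the converted Markdown to stdout; the function names `build_command` and `convert_to_markdown` are made up for illustration.

```python
import subprocess


def build_command(path: str) -> list[str]:
    # Mirrors the shell invocation: uvx markitdown <file>
    return ["uvx", "markitdown", path]


def convert_to_markdown(path: str) -> str:
    """Run markitdown through uvx and return whatever it prints to stdout."""
    result = subprocess.run(
        build_command(path),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Calling `convert_to_markdown("path-to-file.pdf")` should then return the Markdown as a string, with the same caching behaviour as the command-line form.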
by irskep on 12/13/24, 6:39 PM
I worked on an in-house version of this feature for my employer (turning files into LLM-friendly text). After reading the source code, I can say this is a pretty reasonable implementation of this type of thing. But I would avoid using it for images, since the LLM providers let you just pass images directly, and I would also avoid using it for spreadsheets, since LLMs are very bad at interpreting Markdown tables.
There are a lot of random startups and open source projects who try to make this space sound fancy, but I really hope the end state is a simple project like this, easy to understand and easy to deploy.
I do wish it had a knob to turn for "how much processing do you want me to do." For PDF specifically, you either have to get a crappy version of the plain text using heuristics in a way that is very sensitive to how the PDF is exported, or you have to go full OCR, and it's annoying when a project locks you into one or the other. I'm also not sure I'd want to use the speech-to-text features here since they might have very different performance characteristics than the text-to-text stuff.
by btown on 12/13/24, 7:39 PM
For PDFs it's entirely a wrapper around https://pdfminersix.readthedocs.io/en/latest/tutorial/highle... - https://github.com/microsoft/markitdown/blob/main/src/markit...
So if that's your use case, PDFMiner might be better to integrate with directly!
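A rough sketch of what integrating directly could look like, assuming pdfminer.six is installed. `extract_text` from `pdfminer.high_level` is that library's high-level entry point; the `pdf_to_text` wrapper and its `extract` parameter are hypothetical, added only so the function can be exercised without the library present.

```python
from typing import Callable, Optional


def pdf_to_text(path: str, extract: Optional[Callable[[str], str]] = None) -> str:
    """Return the plain text of a PDF, trimmed of leading/trailing whitespace."""
    if extract is None:
        # Imported lazily so this module loads even without pdfminer.six.
        from pdfminer.high_level import extract_text
        extract = extract_text
    return extract(path).strip()
```

Note that this gives you plain text, not Markdown: any headings, lists, or tables would still need to be reconstructed separately.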
by figomore on 12/13/24, 7:08 PM
Pandoc (https://pandoc.org) can be used to convert a .docx file to markdown and other file formats like djot and typst. I don't think pandoc can convert powerpoint and excel files.
by starkparker on 12/13/24, 8:36 PM
I index a lot of tabletop RPG books in PDF format, which often have complex visual layouts and many tables that parsers typically have difficulty with. If this is just a wrapper around PDFMiner, as noted in another comment, I don't see any value added by this tool.
This handles them... fine. It either doesn't recognize or never attempts to handle tables, which makes it fundamentally a non-starter for my typical usage, but to its credit it seems to have at least some sense of table cells; it organizes columns in a manner that isn't fully readable but isn't as broken as some other solutions, either.
It otherwise handles text that's in variable-width columns or wrapped in complex ways around art work rather well. It inserts extraneous spaces on fully justified text, which is frustrating but not unusual, and sometimes adds extraneous line breaks on mid-sentence column breaks.
The biggest miss, though, is how it completely misses headings! This seems fundamental for any use case, including grooming sources for LLM training. It doesn't identify a single heading in any PDF I've thrown at it so far.
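The extraneous spaces and mid-sentence line breaks can be partly cleaned up after the fact. Below is a minimal post-processing sketch, not anything the tool itself does: the function name `tidy_extracted_text` and the "no terminal punctuation, next line starts lowercase" rule are assumptions of mine, and the rule will misfire on some inputs (lists, code, tables). It does nothing for the missing headings.

```python
import re


def tidy_extracted_text(text: str) -> str:
    """Collapse padded spaces and rejoin lines that look split mid-sentence."""
    # Justified text often comes out with runs of spaces between words.
    lines = [re.sub(r"[ \t]{2,}", " ", line).strip() for line in text.splitlines()]

    merged: list[str] = []
    for line in lines:
        prev = merged[-1] if merged else ""
        # Heuristic: the previous line has no terminal punctuation and this
        # line starts with a lowercase letter, so treat it as a continuation.
        if prev and line and not prev.endswith((".", "!", "?", ":")) and line[0].islower():
            merged[-1] = f"{prev} {line}"
        else:
            merged.append(line)
    return "\n".join(merged)
```

Blank lines are left alone, so paragraph boundaries survive the pass.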
by benatkin on 12/13/24, 6:33 PM
Nary a mention of LLMs in the readme. That was an unexpected but pleasant surprise, when the idea of converting something to markdown for LLMs is floated as if it's new and the greatest thing since sliced bread. https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
It's interesting to read the code. It's mostly glue code, and most of it is in a single 1101-line file. But it does indeed do what the README says it does. Here is the special handling for Wikipedia: https://github.com/microsoft/markitdown/blob/main/src/markit...
Edit: good to see the one from yesterday flagged. I tried to assume good intent, but also wondered if it was a place to draw a line in the sand. https://news.ycombinator.com/item?id=42405758
Edit 2: ah, it came down to simple violation of the Show HN rules. I didn't notice, but yeah, that's definitely the case.
by markhneedham on 12/13/24, 7:01 PM
Quite curious how this compares to docling - https://github.com/DS4SD/docling
docling uses an LLM IIRC, so that's already a difference in approach
by hks0 on 12/13/24, 9:24 PM
This is amazing and really useful, love the idea; but let me tell you a story, it's a bit of a tangent but relevant enough:
In an online language class we were sending the assignments to our teacher via slack, the teacher would then mark our mistakes and send it back.
I, as a true hater of all the heavy weight text formats for everyday communications, autonomously fired up the terminal, wrote my assignment in my_name.md and happily sent it without giving it any thought. This is what I hear the next session:
"... and everybody did a great job! Although someone just sent me their assignment in a stupid format. I don't know what it was! I could neither highlight it or make the text bold or anything. Don't do that to me again please".
Before that I never dreamed of meeting someone who preferred a word document _after_ opening a .md file, and I also learned if I had chosen product design as a career, everyone would've suffered immensely (or maybe not, I would've just ended up jobless).
by LittleTimothy on 12/13/24, 7:18 PM
This is... interesting. From my understanding - and people can correct me if I'm wrong, but didn't Microsoft spend an extremely large amount of effort essentially trying to screw people who made things like this in the 2000s? Interoperability and the Open Office movement were pretty hard fought. It's kind of crazy to see MSFT do this today. Did I just misunderstand and the underlying formats (docx etc) were actually pretty friendly, or have the formats evolved a lot since then? Or is it more a case of "It doesn't matter if it looks terrible because we're feeding it to the AI beast anyway"?
A cynic might say it became suddenly easy when MSFT had a reason to allow you to generate markdown to feed into its AI?
by konfekt on 12/14/24, 9:23 AM
Though it promises to convert everything to Markdown, it seems to be a worse version of what the already existing tools such as PDFtotext, docx2txt, pptx2md, ... collected [here] do without even pretending to export to Markdown. Looking at its [source], it indeed seems to be a wrapper to python variants of those. Making the pool smaller can hardly improve the output.
[here] https://github.com/Konfekt/vim-office [source] https://github.com/microsoft/markitdown/blob/main/src/markitdown/_markitdown.py
by theanonymousone on 12/13/24, 6:53 PM
Why is the repository 95% "HTML" code?
by kepano on 12/13/24, 7:50 PM
Never thought I'd see the day. Yet... not surprising because plain text is the ideal format for analysis, LLM training, etc.
The question businesses will start to ask is why are we putting our data into .docx files in the first place?
by lbrunson on 12/14/24, 1:43 AM
Are there any good libraries for the opposite, going from markdown to pdf or docx? Pandoc gets most of the way there but struggles with certain things like tables.
by ezxs on 12/13/24, 6:32 PM
it would be cool if Word just had that implemented inside the product like Google Docs does.
by constantinum on 12/13/24, 8:13 PM
I will try it with some complex layout PDFs or documents with tables. These documents have real business use cases for automation — insurance, banking, etc.
Anyone here who wants to convert PDF documents or scanned images as-is, preserving the layout, do try LLMWhisperer - https://unstract.com/llmwhisperer/
by toastal on 12/14/24, 7:46 AM
So we convert from rich formats with metadata & advanced features to a format without the former & severely lacking at the latter.
by zelphirkalt on 12/14/24, 9:16 PM
If the source document is anything half decent, this would serve to lose information, as markdown is far from flexible and powerful enough to represent all kinds of formatting and layout present in source documents. If all you need is the text information, then that might be just what you want, lossily compressing documents.
by poidos on 12/13/24, 8:12 PM
Very timely, thanks!
Was just yesterday working on chaining together `xlsx` and `tablemark` to accomplish this. I found `uvx markitdown my-excel.XLSX | sed 's/ NaN/ /g' > my-markdown.md` to be just what I needed to get my spreadsheet into a reasonably-legible markdown table when rendered by GitLab.
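If you'd rather do the NaN cleanup in Python, a rough equivalent is sketched below. The function name `blank_nan_cells` is made up, and it is deliberately stricter than the `sed` one-liner: it only blanks table cells whose entire content is `NaN`, and it leaves non-table lines untouched.

```python
def blank_nan_cells(markdown: str) -> str:
    """Replace table cells that contain only 'NaN' with an empty cell."""
    cleaned = []
    for line in markdown.splitlines():
        # Only touch lines that look like Markdown table rows.
        if line.lstrip().startswith("|"):
            cells = line.split("|")
            cells = [" " if cell.strip() == "NaN" else cell for cell in cells]
            line = "|".join(cells)
        cleaned.append(line)
    return "\n".join(cleaned)
```

Empty cells render as blanks in GitLab, which matches what the `sed` substitution was after.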
by ccbikai on 12/17/24, 6:04 AM
I made a version that can run entirely within the browser
https://www.html.zone/markitdown/
by roamerz on 12/14/24, 1:57 AM
Since it’s Microsoft maybe it will do a half decent job on Outlook HTML and .docx. I have evaluated most of them out there, paid included and haven’t found one that I thought was good enough to run in production. Definitely will be giving this a try.
by sneak on 12/13/24, 10:25 PM
I wish we had a markdown equivalent for spreadsheets. Markdown tables ain’t it.
by SuperHeavy256 on 12/14/24, 9:35 AM
I don't think it works if you try installing it using pip. Can anyone confirm? I ended up downloading it manually, making a venv, and then running it.
by ulrischa on 12/13/24, 8:07 PM
I wonder how a powerpoint can be converted to markdown
by be_erik on 12/14/24, 2:47 AM
Oh thank god. I can finally retire my docx to pandoc to markdown tool chain. I can’t believe M$ was the big one to go first. Good on ya.
by fritzo on 12/13/24, 6:41 PM
Converters like this are much more useful if they are bi-directional, even if the two directions aren't exactly inverses.
by throwaway81523 on 12/13/24, 7:51 PM
Why not Pandoc?
by einpoklum on 12/14/24, 12:36 AM
This is BS, it doesn't support Office documents, it supports only Microsoft's broken office documents which don't obey their own custom specs. Why doesn't this work on ODF files?
by yawnxyz on 12/13/24, 9:49 PM
anyone get the Bing search DocumentConverter working? It keeps getting me null results
by ekianjo on 12/14/24, 9:50 AM
any idea how it compares to Docling?