by Handy-Man on 12/13/24, 6:02 PM with 81 comments
by simonw on 12/13/24, 7:03 PM
uvx markitdown path-to-file.pdf
(This will cache the necessary packages the first time you run it, then reuse those cached packages on future invocations.)
I've tried it against HTML and PDFs so far and it seems pretty decent.
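For anyone who wants to call that same command from a script, here is a minimal sketch. It assumes `uvx` is on your PATH and that the `markitdown` CLI prints the converted Markdown to stdout; the helper names `build_command` and `convert_to_markdown` are made up for illustration and are not part of markitdown.

```python
import subprocess


def build_command(path: str) -> list[str]:
    # Same invocation as the one-liner above; uvx resolves and caches the package.
    return ["uvx", "markitdown", path]


def convert_to_markdown(path: str) -> str:
    # Assumption: the CLI writes the Markdown to stdout.
    # check=True raises CalledProcessError if the conversion fails.
    result = subprocess.run(
        build_command(path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `convert_to_markdown("path-to-file.pdf")` would then return the Markdown as a string, so you can post-process it before writing it anywhere.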
by irskep on 12/13/24, 6:39 PM
There are a lot of random startups and open source projects who try to make this space sound fancy, but I really hope the end state is a simple project like this, easy to understand and easy to deploy.
I do wish it had a knob to turn for "how much processing do you want me to do." For PDF specifically, you either have to get a crappy version of the plain text using heuristics in a way that is very sensitive to how the PDF is exported, or you have to go full OCR, and it's annoying when a project locks you into one or the other. I'm also not sure I'd want to use the speech-to-text features here since they might have very different performance characteristics than the text-to-text stuff.
by btown on 12/13/24, 7:39 PM
So if that's your use case, PDFMiner might be better to integrate with directly!
by figomore on 12/13/24, 7:08 PM
by starkparker on 12/13/24, 8:36 PM
This handles them... fine. It either doesn't recognize or never attempts to handle tables, which makes it fundamentally a non-starter for my typical usage, but to its credit it seems to have at least some sense of table cells; it organizes columns in a manner that isn't fully readable but isn't as broken as some other solutions, either.
It otherwise handles text that's in variable-width columns or wrapped in complex ways around artwork rather well. It inserts extraneous spaces on fully justified text, which is frustrating but not unusual, and sometimes adds extraneous line breaks on mid-sentence column breaks.
The biggest miss, though, is how it completely misses headings! This seems fundamental for any use case, including grooming sources for LLM training. It doesn't identify a single heading in any PDF I've thrown at it so far.
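The two smaller annoyances (extra spaces from justified text, stray line breaks at column breaks) can be papered over after the fact. A rough sketch, assuming plain-text output and a crude "next line starts lowercase" heuristic for mid-sentence breaks; `tidy_extracted_text` is a hypothetical helper, not something markitdown provides, and it does nothing for the missing headings.

```python
import re


def tidy_extracted_text(text: str) -> str:
    # Collapse runs of spaces/tabs left behind by fully justified text.
    lines = [re.sub(r"[ \t]{2,}", " ", line).strip() for line in text.splitlines()]

    out: list[str] = []
    for line in lines:
        prev = out[-1] if out else ""
        # Heuristic: if the previous line does not end a sentence and this
        # line starts with a lowercase letter, treat the break as spurious.
        if prev and line and not prev.endswith((".", "!", "?", ":")) and line[0].islower():
            out[-1] = prev + " " + line
        else:
            out.append(line)
    return "\n".join(out)
```

This will misfire on lists and on sentences that legitimately continue with a capitalized word, so treat it as a starting point rather than a fix.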
by benatkin on 12/13/24, 6:33 PM
It's interesting to read the code. It's mostly glue code, and most of it is in a single 1101-line file. But it does indeed do what the README says it does. Here is the special handling for Wikipedia: https://github.com/microsoft/markitdown/blob/main/src/markit...
Edit: good to see the one from yesterday flagged. I tried to assume good intent, but also wondered if it was a place to draw a line in the sand. https://news.ycombinator.com/item?id=42405758
Edit 2: ah, it came down to simple violation of the Show HN rules. I didn't notice, but yeah, that's definitely the case.
by markhneedham on 12/13/24, 7:01 PM
docling uses an LLM IIRC, so that's already a difference in approach
by hks0 on 12/13/24, 9:24 PM
In an online language class we were sending our assignments to our teacher via Slack; the teacher would then mark our mistakes and send them back.
I, as a true hater of all the heavyweight text formats for everyday communications, automatically fired up the terminal, wrote my assignment in my_name.md and happily sent it without giving it any thought. This is what I heard in the next session:
"... and everybody did a great job! Although someone just sent me their assignment in a stupid format. I don't know what it was! I could neither highlight it nor make the text bold or anything. Don't do that to me again please".
Before that I never dreamed of meeting someone who preferred a Word document _after_ opening a .md file, and I also learned that if I had chosen product design as a career, everyone would've suffered immensely (or maybe not, I would've just ended up jobless).
by LittleTimothy on 12/13/24, 7:18 PM
A cynic might say it suddenly became easy when MSFT had a reason to let you generate markdown to feed into its AI?
by konfekt on 12/14/24, 9:23 AM
[here] https://github.com/Konfekt/vim-office [source] https://github.com/microsoft/markitdown/blob/main/src/markitdown/_markitdown.py
by theanonymousone on 12/13/24, 6:53 PM
by kepano on 12/13/24, 7:50 PM
The question businesses will start to ask is why are we putting our data into .docx files in the first place?
by lbrunson on 12/14/24, 1:43 AM
by ezxs on 12/13/24, 6:32 PM
by constantinum on 12/13/24, 8:13 PM
Anyone here who wants to convert PDF documents or scanned images as-is, preserving the layout, should try LLMWhisperer - https://unstract.com/llmwhisperer/
by toastal on 12/14/24, 7:46 AM
by zelphirkalt on 12/14/24, 9:16 PM
by poidos on 12/13/24, 8:12 PM
Was just yesterday working on chaining together `xlsx` and `tablemark` to accomplish this. I found `uvx markitdown my-excel.XLSX | sed 's/ NaN/ /g' > my-markdown.md` to be just what I needed to get my spreadsheet into a reasonably legible markdown table when rendered by GitLab.
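If you'd rather not reach for sed, the same cleanup can be sketched in Python. This version is a bit stricter than the one-liner: it only blanks cells whose entire content is `NaN`, and only on lines that look like table rows. `blank_nan_cells` is a hypothetical helper name, and the sketch assumes empty spreadsheet cells show up as the literal text `NaN` in the generated table.

```python
import re


def blank_nan_cells(markdown: str) -> str:
    cleaned = []
    for line in markdown.splitlines():
        # Only touch table rows, i.e. lines that start with a pipe.
        if line.lstrip().startswith("|"):
            # Replace a cell that contains nothing but NaN with an empty cell.
            line = re.sub(r"(?<=\|)\s*NaN\s*(?=\|)", " ", line)
        cleaned.append(line)
    return "\n".join(cleaned)
```

Feed it the text that markitdown printed and write the result to `my-markdown.md`; prose that happens to mention NaN outside a table is left alone.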
by ccbikai on 12/17/24, 6:04 AM
by roamerz on 12/14/24, 1:57 AM
by sneak on 12/13/24, 10:25 PM
by SuperHeavy256 on 12/14/24, 9:35 AM
by ulrischa on 12/13/24, 8:07 PM
by be_erik on 12/14/24, 2:47 AM
by fritzo on 12/13/24, 6:41 PM
by throwaway81523 on 12/13/24, 7:51 PM
by einpoklum on 12/14/24, 12:36 AM
by yawnxyz on 12/13/24, 9:49 PM
by ekianjo on 12/14/24, 9:50 AM