Show HN: I Made an Open Source Platform for Structuring Any Unstructured Data
Hey HN,
I'm Adithya, a 20-year-old dev from India. I've been working with GenAI for the past year, and I've found it really painful to deal with the many different forms of data out there and to get a good representation of them for my AI applications.
That's why I built OmniParse—an open-source platform designed to handle any unstructured data and transform it into optimized, structured representations.
Key Features:
- Completely local processing, no external APIs
- Supports ~20 file types
- Converts documents, multimedia, and web pages to high-quality structured markdown
- Table extraction, image extraction/captioning, audio/video transcription, and web page crawling
- Fits on a single T4 GPU
- Easily deployable with Docker and SkyPilot
- Colab-friendly, with an interactive UI powered by Gradio
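To make "structured markdown" concrete, here is a minimal, hypothetical sketch of consuming one of the extracted tables downstream. This is not OmniParse's actual API, just an illustration of why markdown tables are a useful intermediate:

```python
def markdown_table_to_rows(md: str) -> list[dict]:
    """Parse a simple pipe-delimited markdown table into a list of dicts.
    Illustrative only: real extracted tables can be messier than this."""
    lines = [l.strip() for l in md.strip().splitlines() if l.strip()]
    split = lambda line: [c.strip() for c in line.strip("|").split("|")]
    header = split(lines[0])
    rows = []
    for line in lines[2:]:  # skip the |---|---| separator row
        rows.append(dict(zip(header, split(line))))
    return rows

table = """
| Name       | Pages |
|------------|-------|
| paper.pdf  | 12    |
| slides.pdf | 30    |
"""
print(markdown_table_to_rows(table))
```

Once the table survives parsing as clean markdown, turning it into rows, JSON, or a dataframe is trivial; the hard part is the extraction step itself.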
Why OmniParse? I wanted a platform that could take any kind of data—documents, images, videos, audio files, web pages, and more—and make it clean and structured, ready for AI applications.
Check it out on GitHub: https://git.new/omniparse
I'm not sure that I understand what we're parsing to. Like on the website, I see supported types, but that looks like the parsable types, no? What kind of structured representation is outputted? And can we guide what that structure looks like?
Yes, the current implementation of the repository converts any supported data primarily into structured markdown text.
The next stage will add prompt-guided or schema-guided structure extraction.
Let's say you're processing a lot of research PDFs and want to convert them into clean markdown that best represents the content. Now suppose you also want to extract the authors, abstracts, and captions, and store the images.
The extraction engine we are currently working on will help you with that.
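A rough sketch of what schema-guided extraction over the markdown output could look like; the schema fields and the regex heuristics here are my illustrative assumptions, not the actual extraction engine:

```python
import re
from dataclasses import dataclass

@dataclass
class PaperMetadata:
    # Hypothetical schema: the fields a user might ask the engine to fill.
    title: str
    abstract: str

def extract_metadata(markdown: str) -> PaperMetadata:
    # Assume the first level-1 heading is the paper title.
    title_m = re.search(r"^# (.+)$", markdown, re.MULTILINE)
    # Assume the abstract sits under an "## Abstract" heading,
    # ending at the next heading or end of document.
    abstract_m = re.search(
        r"^## Abstract\s*(.+?)(?=\n#|\Z)",
        markdown, re.MULTILINE | re.DOTALL,
    )
    return PaperMetadata(
        title=title_m.group(1).strip() if title_m else "",
        abstract=abstract_m.group(1).strip() if abstract_m else "",
    )

md = """# A Study of Parsing

## Abstract
We evaluate document parsers on messy PDFs.

## Introduction
..."""
meta = extract_metadata(md)
```

In practice a schema-guided engine would be far more robust than these regexes, but the input/output shape (markdown in, typed fields out) would be similar.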
“structured Markdown” sounds like an oxymoron.
I haven't run it myself, but the example provided looks kinda broken. It looks WAY better than the PyPDF results, but is it good enough?
The table name was parsed as part of a column name, and half of the column names were not parsed at all.
Original: https://github.com/adithya-s-k/marker-api/blob/master/data/i...
Parsed: https://github.com/adithya-s-k/marker-api/blob/master/data/i...
Yep, the accuracy it currently offers is 80% to 90%. We are actively working on improving the underlying models, and some major improvements are coming soon.
1. How does this differ from LlamaParse, which can be used with and without LlamaIndex?
2. Is there an option for a more permissive license that isn't GNU for commercial enterprise use?
Thanks!
LlamaParse currently only parses PDF documents, as far as I know. OmniParse aims to process any data type, from documents and images to videos and websites, and provide the best representation for AI applications.
We have a few dependencies that are GPL-licensed, which is why we use that license. However, I am currently training models to release under the MIT license and plan to replace the GPL-licensed dependencies to eliminate this limitation.
LlamaParse supports 80+ file types, just FYI.
https://docs.cloud.llamaindex.ai/llamaparse/features/support...
But this is not open source? It is some cloud stuff.
That's fantastic, the MIT license will allow commercial usage as well, right?
Will you be launching a commercial SaaS offering of it as well?
Any ETA?
Oh, I will do some more research on LlamaParse.
Yep, planning to release it under a commercially permissive license.
We have an active API which we are using for our internal clients, and we are planning to release it soon.
Regarding the ETA of the new model, I don't have a fixed deadline as we are training and testing for a lot of edge cases. Currently, we are doing research and trying to build/train in public on X/Twitter.
What are the limitations of running the server on Windows?
Some software, like LibreOffice, is used to convert files from one format to another. On Windows, this will require a different approach that hasn't been implemented yet.
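For context, document pipelines like this commonly shell out to LibreOffice in headless mode on Linux/macOS; a small sketch of building that command, where locating the binary is the platform-dependent part (on Windows, `soffice` typically lives under Program Files and is not on PATH, which is roughly the unimplemented piece):

```python
import shutil

def libreoffice_convert_cmd(path: str, outdir: str = ".") -> list[str]:
    """Build a headless LibreOffice conversion command (e.g. DOCX -> PDF).
    Falls back to a bare `soffice` if the binary isn't found on PATH."""
    soffice = shutil.which("soffice") or shutil.which("libreoffice") or "soffice"
    return [soffice, "--headless", "--convert-to", "pdf", "--outdir", outdir, path]

print(libreoffice_convert_cmd("report.docx"))
```

The `--headless` and `--convert-to` flags are LibreOffice's documented CLI; only the binary-discovery fallback here is my assumption about how a cross-platform wrapper might handle it.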