
Thanks. I added docx and pptx inputs to the todo list. I'll have to look into how to parse them into HTML.


You're in for a world of pain when you start parsing docx and pptx. On the bright side, if you can figure out a good solution, you'll likely have a solid business model. I'd imagine there is significant demand for converting docx and pptx files into HTML or Markdown as a service. If you do come up with a nice, well-documented API for all of this, I'd certainly recommend your service. If you come up with an outstanding docx parser, then I'd use your service myself (I'm using my own somewhat primitive solution for a current project that involves converting docx files).

Here are a few projects to look at, if you haven't already:

1. http://www.docx4java.org/trac/docx4j
2. https://github.com/mikemaccana/python-docx

There are some interesting, more active forks, but this is the original python-docx.
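
For a rough sense of the starting point: a .docx file is a zip archive whose main body lives in word/document.xml, so the easy part can be sketched with nothing but the Python standard library. This is only a minimal sketch, not how either project above does it. The names docx_to_html and make_sample_docx are made up for illustration, and the sample archive is a stripped-down stand-in with only the one XML part (a real file written by Word has many more). It pulls out plain paragraph text and ignores styles, lists, tables, images and everything else that makes the job painful.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from html import escape

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_html(data: bytes) -> str:
    """Turn each paragraph of a .docx (given as bytes) into a <p> element.

    Hypothetical helper: handles plain text only, no formatting.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs (w:r), each holding w:t nodes.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        paragraphs.append(f"<p>{escape(text)}</p>")
    return "\n".join(paragraphs)


def make_sample_docx() -> bytes:
    """Build a tiny stand-in archive containing only word/document.xml."""
    document_xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>a &lt; b</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
    return buf.getvalue()


if __name__ == "__main__":
    print(docx_to_html(make_sample_docx()))
```

Even in this toy case the text of one paragraph is split across two runs, which hints at why anything beyond plain text gets complicated quickly.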


Thanks for the links! I'll dig into them later.

Having never looked at it myself, I'm not really sure why it would be so painful. Are the formats just super wacky?



