I love that PDFs are so difficult to transform into HTML, too
FYI, if that’s relevant to your field, every new article published on arxiv.org now has an HTML render as well.
And for many older publications, changing “arxiv.org” to “ar5iv.org” in the URL leads to an HTML rendering from a best-effort experiment they ran for a while.
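In case it helps, here is a minimal sketch of that URL swap in Python. It assumes the link points at the plain arxiv.org host and simply replaces the host, leaving the path untouched; the function name `to_ar5iv` and the example paper ID are only illustrative, and there is no guarantee that a rendering exists for any given paper.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ar5iv(url: str) -> str:
    """Rewrite an arxiv.org URL so that it points at ar5iv.org instead.

    Hypothetical helper: it only swaps the host name and keeps the
    scheme, path, query, and fragment exactly as they were.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Assumption: only the bare and "www." hosts are accepted.
    if host not in ("arxiv.org", "www.arxiv.org"):
        raise ValueError(f"not an arxiv.org URL: {url!r}")
    return urlunsplit(parts._replace(netloc="ar5iv.org"))


if __name__ == "__main__":
    # Example paper ID chosen purely for illustration.
    print(to_ar5iv("https://arxiv.org/abs/1706.03762"))
```

Parsing the URL instead of doing a blind string replace avoids touching "arxiv.org" if it happens to appear somewhere other than the host.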
That’s really cool! What I’d really like is a tool that converts PDFs to semantic HTML files. I took a peek there and it seems easier for them because they have the original LaTeX source.
I think for arbitrary PDF files the information just isn’t there. I’ve looked into it a bit and it’s sort of all over the place. A tool called pdf2htmlEX is pretty good, but it makes the HTML look exactly like the PDF.
Yes, PDFs are much more permissive and may not have any semantic information at all. Hell, some old publications are just scanned images!
PDF -> semantic HTML seems to be a hard problem that basically requires OCR, like these people are doing
Oh nice, thanks for sharing that project. I haven’t heard of it before!