Wednesday, June 17, 2009

No, PDFs really do suck!

In a typical blog post, The ICE-man: Scholarly HTML not PDF, Peter MR made the point (again) that PDF is to data what a hamburger is to a cow, in reply to a blog post by Peter SF, Scholarly HTML.

This led to a discussion on FriendFeed. A couple of misconceptions:

"But how are we going to cite without paaaaaaaaaaaage nuuuuuuuuuuumbers?"
We don't. Many online-only journals can do without; there is DOI. And if that is not enough, the legal business has means of identifying paragraphs, etc, which should provide us with all the methods we could possibly need in science.

Typesetting of PDFs, in most journals, is superior to HTML, which is why I prefer to read a PDF version if it is available. It is nicer to the eyes.
Ummm... this is supposed to be Science, not a California Glossy. It seems that pretty looks are causing a major body count in the States. Otherwise, HTML+CSS can likely beat any pretty looks of PDF, or at least match it.

As I seem to be the only physicist/mathematician who comments on this sort of thing, I feel like a broken record, but math support in browsers currently sucks extremely badly and this is a primary reason why we will continue to use PDF for quite some time.
HTML+MathML is well established, and default Firefox browsers have no problem showing mathematical equations. The Blue Obelisk QSAR descriptor ontology has been using such a setup for years. If you use TeX to author your equations, you can convert them to HTML too.

We can mine the data from the PDF text.
Theoretically, yes. Practically, it is money down the drain. PDF is particularly nasty here, as it breaks words at the end of a line, and can even make words consist of unlinked series of characters positioned at (x,y). PDF can, however, contain a lot of metadata, but that is merely a hack and an unneeded workaround. Worse, it is hardly used for chemistry. PDF can contain PNG images which can contain CML; the tools are there, but not used, and there are more efficient technologies anyway.
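
To make the two problems concrete, here is a minimal sketch in Python of what a text miner has to do just to get words back. It assumes the characters have already been pulled out of the PDF as hypothetical (x, y, character) tuples; the names glyphs_to_lines, join_hyphenated and place, the fixed character width, and the tolerances are all made up for illustration and are not the API of any PDF library.

```python
from typing import List, Tuple

# Hypothetical input: one (x, y, character) tuple per glyph, in arbitrary order.
Glyph = Tuple[float, float, str]


def glyphs_to_lines(glyphs: List[Glyph], char_width: float = 6.0,
                    line_tol: float = 2.0) -> List[str]:
    """Group positioned glyphs into text lines, guessing spaces from wide gaps."""
    rows = []  # list of (reference y, glyphs on that line)
    # Top of the page first (larger y), then left to right.
    for g in sorted(glyphs, key=lambda g: (-g[1], g[0])):
        if rows and abs(rows[-1][0] - g[1]) <= line_tol:
            rows[-1][1].append(g)
        else:
            rows.append((g[1], [g]))
    lines = []
    for _, row in rows:
        row.sort(key=lambda g: g[0])
        text = row[0][2]
        for prev, cur in zip(row, row[1:]):
            # A gap much wider than one character is assumed to be a space.
            if cur[0] - prev[0] > 1.5 * char_width:
                text += " "
            text += cur[2]
        lines.append(text)
    return lines


def join_hyphenated(lines: List[str]) -> str:
    """Join lines into running text, gluing words split by a trailing hyphen."""
    out = ""
    for line in lines:
        if out.endswith("-"):
            out = out[:-1] + line.lstrip()  # heuristic: drop hyphen, no space
        elif out:
            out += " " + line
        else:
            out = line
    return out


def place(text: str, y: float, char_width: float = 6.0) -> List[Glyph]:
    """Helper for the demo: lay out text as glyphs; spaces leave only a gap."""
    return [(i * char_width, y, ch) for i, ch in enumerate(text) if ch != " "]


if __name__ == "__main__":
    glyphs = place("mole-", 100.0) + place("cule data", 88.0)
    glyphs.reverse()  # glyph order in the file need not match reading order
    lines = glyphs_to_lines(glyphs)
    print(lines)                   # ['mole-', 'cule data']
    print(join_hyphenated(lines))  # molecule data
```

Note that the same heuristic turns a genuine hyphen at a line end ("well-" followed by "known") into "wellknown", and the space detection depends on guessing a character width. That guesswork is exactly why mining PDF text is so expensive in practice.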

I, for one, agree with Peter on PDF: it really sucks as a scientific communication medium.


  1. Just wanted to say that I disagree with the premise that researchers are wrong to care about the guarantees that PDF gives about layout and presentation. The choice is not as clear-cut as the post suggests.

    I managed to express most of my sentiments in a discussion on #bioclipse.

  2. I think you are wrong in saying that "HTML+CSS can likely beat any pretty looks of PDF, or at least match it".

    Even if you theoretically could match the looks, it would probably not be easily done. I would be very surprised if HTML+CSS managed to produce something even comparable to what LaTeX can produce. There is also the fact that LaTeX produces awesome looking things almost by default, something that can hardly be said about HTML+CSS when it comes to printing on paper.

    But please by all means, surprise me!

  3. Stephan Michels, 6:05 PM, June 17, 2009

    "HTML+MathML is well established"
    You must be joking, right?! Even Firefox is not well established.
    And with HTML you have no gain over PDF, because you can do the same ugly stuff you do in PDF, like CSS hacks, layers, invisible characters.
    I prefer PDF for reading articles. And if I were to mine articles, I would prefer something other than HTML.

  4. As commented on FriendFeed too: "It is interesting to note that all comments *are* about the looks, where I tried to make the point that looks must not matter. The suggestion of HTML+CSS may be bad, but was meant illustratively: there are semantic formats that allow making readable papers! BTW, no one here seems to read papers of 30 years ago! Have you seen those?! Then think again about your position that HTML+CSS is not more than enough for scientific publication :)".

  5. Stephan, when a technology has been in production for several years and has been working as such all that time, I do think that technology is mature and established.

    I understand that 'established' seemed to imply a big market share, which HTML+MathML likely does not have yet. I should write a tutorial about how to embed MathML in CDK's JavaDoc :) Just to increase the market share a bit.

  6. Interesting. What would a better format look like? What aspects of PDF and of HTML/CSS are advantageous and disadvantageous?

    I personally detest the input mechanisms for math. There was briefly a tool called ChiWriter which was much closer to what I want. It seems to me that the ideal is a domain specific keyboard, specialized input software as the UI.

    For the data standard there are clear advantages to reviving MathML, so the whole thing pretty much has to have an XML flavor.

    The situation is sufficiently unsatisfactory that a well-designed alternative (call it WDA) really could win out. Crucial are a WDA2LaTeX filter and a BibTeX equivalent.

    We could also start thinking about XMLish infographics standards. It seems to me that the graph and the table should be parts of the same data structure.

  7. I think that there is more to 'appearance' of work than just self-promotion. Saying looks 'must not matter' is not just idealistic, but actually wrong.

    Clarity and consistency of presentation are essential for communication, especially when the message is very complex, as much science can be.

    I would expect mathematicians need to have their notation reproduced exactly. After all, the difference between a union operator and a capital U might be important.