Saturday, August 07, 2021

From a Journal of Cheminformatics knowledge base to automated downloading of JATS

Yesterday I started a blog series about the Journal Article Tag Suite (JATS) as used by the Journal of Cheminformatics. JATS is the XML standard used to share articles from the journal. It can be seen as the source code of the article. The source code can be compiled to HTML, PDF, and ReadCube. Each output can be customized, which is why we have seen multiple HTML flavors for articles in the journal. On the main website alone we have had the old BMC website HTML, the new BMC website HTML, and the modern Springer Nature website HTML. The latter is the one that imposes a ReadCube button, doesn't have the DOI at the top, etc.

But the JATS source code is in XML, and much easier to process. And since the articles are published as CC-BY, we can. XML is not really a complete language: it has syntax but not words. The words (i.e. XML elements and attributes) are specified in an XML Schema or in a Document Type Definition (DTD) file (that brings me back to 1997/1998 when I first played with SGML and XML...). JATS uses the latter: http://jats.nlm.nih.gov/archiving/1.2/JATS-archivearticle1.dtd. It is this format that needs extending if we want full support for CiTO, but one step at a time.
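To make the "syntax but not words" point concrete, here is a minimal Python sketch. The three snippets are made up for this illustration (only the element name `article-title` is borrowed from JATS), and `is_well_formed` is a hypothetical helper name. A plain XML parser only checks the syntax, so it accepts any element names as long as the tags are properly nested.

```python
import xml.etree.ElementTree as ET

# All three snippets are invented for this example; only the syntax matters here.
jats_like = "<article><front><article-title>Example</article-title></front></article>"
nonsense = "<foo><bar><baz>Example</baz></bar></foo>"
broken = "<article><front></article>"  # mismatched tags: a syntax error


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, whatever element names it uses."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


for label, text in [("jats_like", jats_like), ("nonsense", nonsense), ("broken", broken)]:
    print(label, is_well_formed(text))
```

Checking that a document only uses the words JATS defines means validating it against the DTD, which this standard-library parser does not do.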

To not overload the Springer API too much, I have made the first 10 JATS XML files for the Journal of Cheminformatics available in this repository: https://github.com/egonw/jats. Here's the content of one of the XML files:

Screenshot of the GitHub page showing the XML in this file: https://github.com/egonw/jats/blob/main/s13321-020-00448-1.xml
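As a rough idea of how easy such a file is to process, here is a small Python sketch that pulls the DOI and the title out of JATS XML. The element and attribute names (`article-id`, `pub-id-type`, `article-title`) come from the JATS tag set; the sample document, its DOI, and the function name `read_metadata` are made up for this example and are not copied from the repository.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written stand-in for a JATS file (hypothetical content).
SAMPLE = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.0000/example</article-id>
      <title-group>
        <article-title>An example title</article-title>
      </title-group>
    </article-meta>
  </front>
</article>"""


def read_metadata(xml_text: str) -> dict:
    """Extract the DOI and article title from a JATS XML string."""
    root = ET.fromstring(xml_text)
    doi = root.find(".//article-id[@pub-id-type='doi']")
    title = root.find(".//article-title")
    return {
        "doi": doi.text if doi is not None else None,
        # itertext() also collects text inside inline markup such as <italic>
        "title": "".join(title.itertext()).strip() if title is not None else None,
    }


print(read_metadata(SAMPLE))
```

With a local clone of the repository, the same function should work on the text of one of the downloaded files, assuming the file parses without needing the external DTD.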

For full disclosure, I have no clue if the JATS file is the actual source code used by Springer Nature. Maybe they have a more detailed XML internally and the JATS is autogenerated on demand. That would explain why I got different XML the second time I made the API call.
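One way to check whether two downloads differ in content or only in formatting is to compare their canonical forms. The sketch below uses `xml.etree.ElementTree.canonicalize` (Python 3.8 or later). The three strings are invented stand-ins for API responses, and `same_content` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented stand-ins for two API responses that differ only in whitespace,
# plus a third one where the text itself has changed.
first = "<article><front><article-title>Example</article-title></front></article>"
second = """<article>
  <front>
    <article-title>Example</article-title>
  </front>
</article>"""
third = "<article><front><article-title>Changed</article-title></front></article>"


def same_content(xml_a: str, xml_b: str) -> bool:
    """Compare two XML strings after canonicalization, ignoring surrounding whitespace."""
    canon_a = ET.canonicalize(xml_a, strip_text=True)
    canon_b = ET.canonicalize(xml_b, strip_text=True)
    return canon_a == canon_b


print(same_content(first, second))
print(same_content(first, third))
```

This only tells you whether the two documents differ, not where; a line-by-line diff of the canonical strings would be the next step.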
