
Sunday, July 31, 2022

Extracting triples from HTML+RDFa pages

Figure 4 from this editorial, which shows how data embedded in webpages can be extracted and visualized automatically with the right tools, as developed by Jankowski (figure license: CC-BY).
The period 2005-2010 was when the chemistry world explored (and solved) data sharing on the internet, particularly on the web. The reason was simple: humans like to read a story around data (perhaps related to how we are used to learning), instead of being presented with it in an unsorted box. We still actively use both, but if you think about how much each is used, I'm fairly sure the first wins hands down.

An example: the ChEBI and ChEMBL databases from the EMBL-EBI provide both. They have a human-oriented website with webpages for all the data. The data is sorted, for both at least by chemical compound. But they also have boxes: their FTP sites. There you can download all the data in a box, and they leave it to you to unbox it. Of course, many cheminformaticians just love to unbox the data, sort it, and put it both in other boxes and onto other websites.

Diversion #1: The mainstream publishers actually like boxes a lot. Fifteen years ago, with all the Open Access vibes around, I was hopeful we would jointly make data and facts readily available, also to machines. Sadly, last year, after trying to work with one mainstream publisher, I accepted my defeat. They are mostly interested in boxes. Worse, taped boxes that you cannot open. That's what they proudly presented (ReadCube).

So, many people, including me, were interested in making the human-oriented displays of the data and facts a machine-readable box themselves. For example, we wrote a chapter, Beautifying Data in the Real World (use that link for the OA version), for the book Beautiful Data.

My personal interest went to HTML+RDFa. I probably first blogged about it in 2008, because of browser plugins that could extract it. Yes, indeed: a website as a machine-readable data source. For example, a long time ago I played with the idea that one day all research dissemination would be interactive figures and tables (this theme returns in my research many times, as is clear from this blog). Only very few publishers want to make this a reality.
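To give an idea of what that looks like: below is a minimal, hypothetical HTML+RDFa snippet (not taken from any real website) that annotates a page with schema.org terms. An RDFa-aware tool can extract machine-readable name and url triples straight from the markup:

<!-- hypothetical example: vocab, typeof, and property are the core RDFa attributes -->
<div vocab="https://schema.org/" typeof="Dataset">
  <h2 property="name">Some chemistry dataset</h2>
  <a property="url" href="https://example.org/data">Download the data</a>
</div>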

So, why am I blogging about this? Because these technologies have long been solved, are here, are used, and are ready to be taken up. And they are: from the big search engines that use schema.org for SEO, to ELIXIR, which uses it to make their projects interoperable.

Diversion #2: You may wonder where this leaves SPARQL endpoints. Aren't they boxes too? If you ask me, yes, they are. But they sit somewhere between the story-around-data and the box-with-data: they are self-documenting and interactive. Many FTP sites are documented, like those from the EMBL-EBI, but they are not interactive. Better still, SPARQL can easily be wrapped in stories, as done here for SARS-CoV-2 and as worked out by Finn with Scholia (doi:10.1007/978-3-319-70407-4_36).
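To illustrate the self-documenting part, here is a small SPARQL query in the style of the WikiPathways example queries (a hypothetical query; the wp: vocabulary is the one that returns later in this post). The query text itself already tells the reader what is being asked:

PREFIX wp: <http://vocabularies.wikipathways.org/wp#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

# list ten pathways with their titles
SELECT ?pathway ?title WHERE {
  ?pathway a wp:Pathway ;
           dc:title ?title .
}
LIMIT 10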

And here are some examples where I still use HTML+RDFa (there are probably some more):

The latter follows the same approach I wrote up in the blog post Coding an OWL ontology in HTML5 and RDFa.
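As a sketch of that approach (a hypothetical fragment, not the actual WikiPathways markup), an OWL class can be coded directly into the HTML like this; extracting the RDF from such a page gives exactly the kind of triples shown at the end of this post:

<!-- hypothetical markup: the prefix attribute declares the CURIE prefixes used below -->
<div prefix="wp: http://vocabularies.wikipathways.org/wp#
             owl: http://www.w3.org/2002/07/owl#
             rdfs: http://www.w3.org/2000/01/rdf-schema#"
     about="wp:Complex" typeof="owl:Class">
  <h3 property="rdfs:label">Complex</h3>
  <p property="rdfs:comment">A physically bound combination of
    two or more biological entities.</p>
</div>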

Now, this is where the problem is. The online W3C tool to extract RDF from an HTML+RDFa page has been discontinued, so unboxing the RDF from an HTML website has become a bit trickier. I'm hoping some new online API will show up. There are several offline tools; one that I was pointed to last week was Apache Any23 (thanks to Ammar and Alasdair).

Here's the Groovy code I ended up with (license: MIT):

@Grab(group='org.apache.any23', module='apache-any23-core', version='2.7')
@Grab(group='io.github.egonw.bacting', module='managers-rdf', version='0.0.42')

import org.apache.any23.Any23
import org.apache.any23.source.HTTPDocumentSource
import org.apache.any23.writer.NTriplesWriter

workspaceRoot = "../ws"
rdf = new net.bioclipse.managers.RDFManager(workspaceRoot);

if (args.length != 1) { println "groovy extractRDFa.groovy [url]"; System.exit(0) }

url = args[0]

Any23 runner = new Any23();
runner.setHTTPUserAgent("test-user-agent");
httpClient = runner.getHTTPClient()
source = new HTTPDocumentSource(runner.getHTTPClient(), url)

out = new ByteArrayOutputStream();
handler = new NTriplesWriter(out);
try { runner.extract(source, handler);
} finally { handler.close(); }

n3Stream = new ByteArrayInputStream(out.toByteArray())

kb = rdf.createInMemoryStore()
rdf.addPrefix(kb, "wp", "http://vocabularies.wikipathways.org/wp#")
rdf.addPrefix(kb, "gpml", "http://vocabularies.wikipathways.org/gpml#")
rdf.addPrefix(kb, "biopax", "http://www.biopax.org/release/biopax-level3.owl#")
rdf.importFromStream(kb, n3Stream, "N3")

println rdf.asRDFN3(kb)

The extraction bit starts with the new Any23() line and runs up to the handler.close() line. After that, I use Bacting (doi:10.21105/joss.02558) to make the output Notation3 easier to read. I may be able to do that with Any23 directly, but that's not the important part. I would run it like this:

groovy extractRDFa.groovy \
  https://vocabularies.wikipathways.org/wp \
  > wp.owl

The output looks something like this:

wp:Complex  rdf:type     owl:Class ;
        rdfs:comment     "A physically bound combination of two or more biological entities."@en ;
        rdfs:label       "Complex"@en ;
        rdfs:subClassOf  wp:DataNode ;
        skos:inScheme    wp: .

Of course, it's still offline, so the next step is to figure out if I can easily dockerize this and make it part of some cloud, or if I can embed this in a GitHub Action.
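For the Docker route, a first sketch could be as small as the following (hypothetical: it assumes the official groovy image, and the @Grab lines will still download Any23 and Bacting when the container runs):

# minimal sketch: official Groovy image with the script copied in
FROM groovy:4.0
COPY extractRDFa.groovy /home/groovy/
# dependencies are fetched by @Grab at run time
ENTRYPOINT ["groovy", "/home/groovy/extractRDFa.groovy"]

After a docker build -t rdfa-extract . (image name hypothetical), running it would look like docker run --rm rdfa-extract https://vocabularies.wikipathways.org/wp > wp.owl.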
