I had to dig deep to find posts on QSAR modeling on this blog. There are quite a few on QSAR in Bioclipse, but those focus on the descriptor calculation. In a quick scan, I could only spot two modeling posts.
Given the prominent place QSAR has in my thesis, this is somewhat surprising. Anyway, here is some more QSAR modeling talk.

Gilleain implemented the signature descriptors developed by Faulon et al. (see doi:10.1021/ci020345w; I mentioned the paper in 2006), and the CDK patch is currently being reviewed. With some post-processing, the atomic signatures for a molecule can be turned into a fixed-length numerical representation: [70:1, 54:1, 23:1, 22:1, 9:9, 45:2]. This string means that atomic signature 70 occurs once in this molecule, and that signature 9 occurs nine times. At this moment, I am not yet concerned about the actual signatures, but just checking how well these signatures can be used in QSPR modeling.

Rajarshi's fingerprint code provides a good template to parse this into an X matrix in R:
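A minimal sketch of what that parsing could look like (the function name, file name, and exact input format here are mine, for illustration only):

# Turn strings like "[70:1, 54:1, 9:9]" into rows of an X matrix, with
# column j holding the count of atomic signature j (this assumes the
# signature numbering is 1-based; adjust the indexing if it is not).
parseSignatures <- function(strings, nSignatures) {
  X <- matrix(0, nrow=length(strings), ncol=nSignatures)
  for (i in seq_along(strings)) {
    pairs <- strsplit(gsub("\\[|\\]| ", "", strings[i]), ",")[[1]]
    for (pair in pairs) {
      kv <- as.numeric(strsplit(pair, ":")[[1]])
      X[i, kv[1]] <- kv[2]  # signature id -> occurrence count
    }
  }
  X
}
# For example: X <- parseSignatures(readLines("signatures.txt"), 100)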

For my test case, I have used the boiling point data from my thesis paper On the Use of 1H and 13C 1D NMR Spectra as QSPR Descriptors (see doi:10.1021/ci050282s). Some of this data is actually available from ChemSpider, but I do not think I ever uploaded the boiling point data itself. This constitutes a data set with 277 molecules, and my paper provides reference model quality statistics, so I have something to compare against. Moreover, I can reuse my previous scripts to do the PLS modeling (there are many tutorials online, but you can always buy an expensive book if you really have to), (10-fold) cross-validation (CV), and 5 repeats of random sampling.
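The core of such a script with the pls package looks roughly like this; a sketch, not my actual script, and the 4:1 train/test split and variable names are just for illustration:

library(pls)

# Five repeats of random sampling: a fresh 80/20 train/test split each
# time, PLS with 10-fold cross-validation on the training set, and the
# test set RMSEP as an external check.
for (i in 1:5) {
  ndx <- sample(nrow(X), round(0.8 * nrow(X)))
  train <- data.frame(bp=y[ndx], sigs=I(X[ndx,]))
  test <- data.frame(bp=y[-ndx], sigs=I(X[-ndx,]))
  model <- plsr(bp ~ sigs, data=train, ncomp=42,
                validation="CV", segments=10)
  print(RMSEP(model, newdata=test))
}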

I strongly suggest that people interested in statistical modeling read this interesting post from Noel: whatever test set sampling method you use, you must do some repeats to learn about the sensitivity of your modeling approach to changes in the data set. Depending on the actual sampling approach, you might see different amounts of variance, but until you measure it, you will not know. For my application, these are the numbers:

# source("pls.R")
Read 277 items
   rank=66     LV=42     R2=0.987     Q2=0.921  RMSEP=31.37
   rank=66     LV=42     R2=0.983     Q2=0.924  RMSEP=12.405
   rank=65     LV=42     R2=0.985     Q2=0.949  RMSEP=38.503
   rank=63     LV=42     R2=0.983     Q2=0.948  RMSEP=36.981
   rank=65     LV=42     R2=0.986     Q2=0.923  RMSEP=21.49
   rank=64     LV=42     R2=0.983     Q2=0.91  RMSEP=17.759
   rank=64     LV=42     R2=0.983     Q2=0.921  RMSEP=17.062
   rank=66     LV=42     R2=0.986     Q2=0.94  RMSEP=40.311
   rank=66     LV=42     R2=0.982     Q2=0.927  RMSEP=13
   rank=68     LV=42     R2=0.986     Q2=0.929  RMSEP=16.23

I know 42 is the answer to life, the universe, and everything, but 42 latent variables (LVs)?!? Well, it's just a start. A more realistic number of LVs seems to be around 15, but my script had to make the transition from the old pls.pcr package to the newer pls package, and I have yet to discover how I can get the new package to return the lowest number of LVs for which the CV statistic is no longer significantly different from the best (see my paper for how that works). Actually, I have set the maximum number of LVs to consider to 1/5th of the number of objects (which is about the accepted ratio in the QSAR community); otherwise, it would happily have gone even higher.
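(A later version of the pls package did gain a helper for exactly this; if I read its documentation correctly, something like the line below should work. Untested on this data set, so consider it a sketch:)

# selectNcomp() picks the smallest number of LVs that is not
# significantly worse than the CV optimum: "onesigma" applies the
# one-standard-error rule, "randomization" a permutation test.
nLV <- selectNcomp(model, method="randomization", plot=TRUE)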

However, the five repeats nicely show the variance in the quality statistics: R2, Q2, and the root mean square error of prediction (RMSEP). From these numbers, a model with Q2 = 0.94 is not better than one with Q2 = 0.93 (and I have seen considerably larger variance than this). Bottom line: just measure that variability, and put it in the publication, will you?
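To put a number on it, the spread of the ten Q2 values in the output above is easily quantified:

q2 <- c(0.921, 0.924, 0.949, 0.948, 0.923,
        0.910, 0.921, 0.940, 0.927, 0.929)
mean(q2)  # about 0.93
sd(q2)    # about 0.013, so 0.93 versus 0.94 is well within the noise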

Anyway, here is what we have all been waiting for: the prediction results visualized (in black the CV predictions; in red the test set predictions):
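The plot itself is plain R graphics; continuing the sketch from above (variable names still illustrative), it could be produced like this:

# CV predictions (black) and test set predictions (red) against the
# observed boiling points, with the identity line for reference.
cvPred <- model$validation$pred[, 1, nLV]
testPred <- predict(model, newdata=test, ncomp=nLV)[, 1, 1]
plot(train$bp, cvPred, xlab="Observed bp", ylab="Predicted bp")
points(test$bp, testPred, col="red")
abline(0, 1)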


Well, there is still much work to do, and you can expect the results to get better. Part of statistical modeling is to find the sources of variance, and I have yet to explore a few of them. For example, what are the effects of:
  • creating the signatures from the hydrogen-depleted graph
  • tautomerism (see this special issue)
  • the height of the signatures
And there are so many other things I would like to do. But this will do for now.

Hi all, as posted about a year ago, I moved this blog to a different domain and a different platform. I note that I still have many followers on this domain who have not found the new one, including over 300 on Feedly.com alone.

This is my last post on blogger.com. At least, that is the plan. It has been a great 18 years. I would like to thank the owners of blogger.com, and later Google, for providing this service. I am continuing chem-bla-ics on a new domain: https://chem-bla-ics.linkedchemistry.info/

I, like so many others, struggle with choosing between open infrastructure and the freebie model. Of course, we know these free services come and go: Google Reader, FriendFeed, Twitter/X (see doi:10.1038/d41586-023-02554-0).

Some days ago, I started adding boiling points to Wikidata, referenced from Basic Laboratory and Industrial Chemicals (wikidata:Q22236188), David R. Lide's 'CRC quick reference handbook' from 1993 (well, the edition I have). But Wikidata wants info on the pressure (wikidata:P2077) at which the boiling point (wikidata:P2102) was measured. Rightfully so. But I had not added those yet, because it slows me down, and it can be automated with QuickStatements.
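For those who do not know QuickStatements: a single line can add the boiling point with the pressure as qualifier and the book as reference. Something roughly like this (the benzene example and the unit items are mine, and you should double-check the syntax against the current QuickStatements documentation before running anything):

Q2270|P2102|80.1U25267|P2077|101325U44395|S248|Q22236188

That line would state that benzene (Q2270) has a boiling point (P2102) of 80.1 °C (Q25267 being degree Celsius) at a pressure (P2077) of 101325 Pa (Q44395 being pascal), stated in (S248) the handbook (Q22236188).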

Just a quick note: I just love the level of detail Wikidata allows us to use. One of the marvels is the practice of 'named as', which can be used in statements for both subjects and objects. The notion, and its importance, is that things are referred to in different ways in different sources, and these properties allow us to link the interpretation to the source.

I am still an avid user of RSS/Atom feeds. I use Feedly daily, partly because of their easy-to-use app. My blog is part of Planet RDF, a blog planet. Blog planets aggregate blogs from many people around a certain topic. It's like a forum, but open, free, and community-driven. It's exactly what the web should be.

This blog is almost 18 years old now. I have long wanted to migrate it to a version control system and, at the same time, get more control over things. Markdown would be awesome. In the past year, I learned a lot about the power of Jekyll, and I needed to get more experience with it anyway, to use it for more databases, like we now do for WikiPathways.

So, time to migrate this blog :) This is probably a multiyear project, so feel free to continue reading it here.

The role of a university is manifold. Being a place where people can find knowledge, and the track record of how that knowledge was reached, is often seen as part of that. Over the past decades, universities outsourced this role, for example to publishers. This is now seeing a lot of discussion, and I am happy to see that the Dutch universities are quickly taking back control.

I am pleased to learn that the Dutch universities are starting to look at rankings in a more scientific way. It is long overdue that we take scientific peer review of the indicators used in those rankings seriously, instead of hiding behind FUD about the decline of the quality of research.

So, what defines the quality of a journal? Or better, of any scholarly dissemination channel? After all, some databases do better peer review than some journals.

A bit over a year ago, I got introduced to Qeios when I was asked to review an article by Michie, West, and Hastings: "Creating ontological definitions for use in science" (doi:10.32388/YGIF9B.2). I wrote up my thoughts after reading the paper, and the review was posted openly online and got a DOI. It is not the first platform to do this (think F1000), but it is always nice to see publishers taking publishing seriously. Since then, I have reviewed two more papers.
This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry, and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc., is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!