Can Data Be Used as a Measure of Quality?

Originally published in LinkedIn Pulse, 13 Dec 2016

Recently, I read an interesting post by Tomasz Tunguz entitled: ‘The Limitations of Data and Benchmarks’. I could not agree more with the idea that data only provides ‘metrics with limits’ and is not a source of ‘ideation’.

This is a topic I am very interested in because of my work on measuring and assessing the quality of membrane protein model structures. My work provides a good test case for the use of homology models, and illustrates the ‘limitations’ of using data to predict the ‘best’ working model structure. It also shows how data can be used to distinguish and ‘filter’ out models that are out of the question: the ‘decoys’.

Yes, via negativa! The approach that Michelangelo used:

“It is easy. You just chip away the stone that doesn’t look like David.”

Since we do not ‘intuitively’ think via negativa, this approach may not be as easy as it sounds. More importantly for me, it reminds me of my own ‘limitations’ and of how humbly the search for ‘perfection’ has to begin.

In comparative or homology modelling, empirical limits are set by constraints derived from alignment with template structures, molecular force fields, statistical and non-bonded parameters, and other related data, e.g. NEM accessibility, and perhaps even assumptions drawn from specific known characteristics.

Similarly, methods for structure quality assessment are subject to the same kinds of limits.

Even simulation techniques, in which sets of structural trajectories are generated from a protein model structure and analysed for stability and behaviour as a measure of structure quality, are still based on restricted force fields that themselves represent limits.

The simulation approach used in my work, however, was limited to classical molecular dynamics calculations. Even so, I imagine that even the most advanced quantum mechanics calculations performed nowadays are still subject to many limitations and approximations, as it would not be possible to meet the computational costs required for real dynamics situations, at least not at the moment.

Let me put it this way — from looking at a static structure model to running a movie of it, there seems to be no definite idea of how a model structure should behave.

How can we predict ‘absence’ in the context of ‘presence’?

Even artificial intelligence or machine learning cannot extend beyond what it is able to learn. We can make a program or machine perform tasks more effectively, or make decisions more economically or smartly, but we cannot expect it to perform beyond what we have created it to be, or to answer questions that we ourselves have yet to explore: the ‘known unknowns’ and the ‘unknown unknowns’.

It is just like the Deep Blue supercomputer, which has yet to ‘know’ all the possible ambiguous chess moves, as they are too numerous to predict.

So, what is the significance of all this?

The book ‘The Black Swan’ by Nassim Nicholas Taleb eloquently describes the ‘black swan phenomenon’. This is the failure to predict the occurrence of events that are rare, extreme, and explainable only in retrospect.

“It illustrates a severe limitation to our learning from observations or experience, and the fragility of our knowledge”.

An example that is relevant to the discussion here is seen in the amino acid residues of membrane proteins that reside in the outlier regions of structural stereochemistry. These residues provide additional flexibility for helical (membrane) proteins to break into disordered loops. Additionally, some of these outlier residues (with an acidic or basic nature) are also capable of carrying charges important for their interaction with specific ions or water molecules. These are all part of the vital processes of proteins’ functional mechanisms.

Similarly, disruption or perturbation of the lipid bilayer during simulation, caused by the membrane proteins embedded in it, does not necessarily imply that the protein structure in question is of poor quality; rather, it serves to illustrate the complexity involved in cellular mechanisms.

As ambiguity pushes us beyond a limited set of possible categories, how can we distinguish a ‘correct’ from an ‘incorrect’ model structure?

When data is visualised in the form of ‘patterns’, it can, qualitatively, give some ‘ideas’. For example, although a difference in the number of interactions cannot identify a better model structure, the frequency of a specific interaction during simulation, its locality, the type of amino acid residues involved, or even the percentage of stereochemistry outliers can be used to validate a model structure.
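As a rough illustration of the ‘frequency of a specific interaction during simulation’, here is a minimal Python sketch. It assumes the interactions have already been extracted from each trajectory frame by some other tool; the function name `interaction_occupancy` and the residue labels are hypothetical, made up purely for this example.

```python
from collections import Counter


def interaction_occupancy(frames):
    """Return the fraction of frames in which each interaction appears.

    `frames` is a list of sets. Each set holds the interactions
    (here, residue-pair labels) observed in one trajectory frame.
    """
    if not frames:
        return {}
    counts = Counter()
    for frame in frames:
        # A set guarantees each interaction is counted once per frame.
        counts.update(set(frame))
    return {pair: n / len(frames) for pair, n in counts.items()}


# Hypothetical toy data: four frames of a simulation.
frames = [
    {("ASP45", "LYS120"), ("GLU88", "ARG201")},
    {("ASP45", "LYS120")},
    {("ASP45", "LYS120"), ("GLU88", "ARG201")},
    {("ASP45", "LYS120")},
]

occupancy = interaction_occupancy(frames)
for pair, fraction in sorted(occupancy.items()):
    print(pair, f"{fraction:.0%}")
```

A persistent interaction and a transient one can both be compatible with a reasonable model, so a table like this is something to inspect when validating a model rather than a score of its correctness.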

‘Validation’ is what data can be used for — not prediction or ‘measuring’ how ‘correct’ a model structure is.

In analysis, we look for ways to condense everything into a summary, and this can involve quantifying qualitative data from the observation of complex situations, and then summarising these quantitative representations.

I find it extremely difficult (if not impossible) to calculate a ‘global’ scale from these discrete representations as a measure of the overall quality of model structures, especially in a case study involving test models with more variation, such as those constructed using different templates and alignments. A comparison within a set of generated models in the same conformation, or of very similar ones differing at perhaps 1 or 2 mutational sites, might nevertheless be possible!

Apparently, data does not unambiguously enable the prediction of the ‘best’ model. Thus a statement such as: ‘…can be as accurate as producing a model that is less than 1.0 Å deviation from its known structure…’ must be thought of more carefully. It does not signify any level of accuracy other than merely saying the model or the approach is NOT wrong!
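For readers unfamiliar with how such a ‘deviation’ is obtained, the sketch below assumes it refers to a root-mean-square deviation (RMSD) over matched atoms, which the quoted statement does not actually specify. It also assumes the two structures are already superimposed and that the atoms are listed in the same order; the function name `rmsd` and the coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched coordinate lists.

    Each list holds (x, y, z) tuples in angstroms. No superposition is
    performed here: the structures are assumed to be aligned already.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical two-atom example: every atom is displaced by 1 angstrom in z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
deviation = rmsd(model, reference)
print(f"{deviation:.2f}")
```

The value is a single average over all atoms compared against one known structure, which is why it says only that a model is not wrong by that particular yardstick.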

Although data provides a degree of certainty, we must realise that data does not determine the certainty of ‘existence’ and ‘creativity’, which are powered by an unknown force: a randomness that cannot possibly be measured quantitatively. There is no doubt, though, that ideas surface from known facts and knowledge, and from combinations of the many facets of this knowledge. Knowledge can, therefore, give rise to something new, complex and unique.

p.s. This is my first LinkedIn article — all likes, shares and comments are most welcome.
