Unraveling the machinery of life: computational challenges in structural bioinformatics

Course Description

The ‘structural’ branch of bioinformatics is a rapidly developing field of science that studies the machinery of life: what do the parts look like? How do they fit together? How do they move? Ultimately, what do those movements mean in the greater scheme of things? Assume that our end goal is to cure any and all diseases (that’s why this research is funded anyway). In order to fix a mechanism, we first need to know how it normally operates. Next, we have to figure out the root cause of a disorder – which cog is slipping and why. Only then can we design a drug molecule that will repair this (and only this!) cog.

In most cases, the instructions (also known as DNA) used to manufacture our molecular parts (also known as proteins) are easily obtainable and studied on their own merits by sequence analysis (a sister branch, often referred to simply as “bioinformatics” for its ubiquity). Getting the structural data is far more difficult. If we could simply look at the molecules, that would be great; unfortunately, they are much smaller than the wavelength of visible light. So we run experiments that provide this information indirectly, akin to recording the shadows cast by an object throughout the day and then building a model that casts identical shadows from the recorded angles. The main sources of experimental data are X-ray crystallography, nuclear magnetic resonance spectroscopy, and cryogenic electron microscopy. The data from each is incomplete in its own unique way and requires solving a puzzle by (in order of historical importance): 1) an informed and lucky guess, such as the Watson & Crick 3D model of DNA; 2) trial and error based on first principles, such as the CCP4 project built by chemists and mathematicians (yay computers!); 3) trial and error based on knowledge from the structures we have already solved, such as RosettaCommons or the more recent AlphaFold systems (yay machine learning!).

The main concepts needed to jump into the field (that will hopefully be at least superficially covered in the lectures):

1. Primary structure: protein sequence, sequence alignment and measures of similarity, sequence profiles (a minimal alignment sketch follows this list).
2. Secondary structure: standard elements (α-helix, β-sheet, loop), super-secondary elements (e.g., coiled-coil), functional motifs.
3. Tertiary structure: domain classification, core vs. interface, energy landscape, representing a protein (e.g., point clouds, surface, density distributions), similarity measures, ligand docking.
4. Quaternary structure: protein docking, homo-/hetero-oligomers, symmetry groups, contact graphs.
5. Open problems and available databases for these levels.
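
To make item 1 above concrete, here is a minimal sketch of pairwise sequence alignment, assuming Biopython (the library that also provides the Bio.PDB package listed under course tools) is installed. The two sequences are made-up toy examples, and BLOSUM62 with these gap penalties is just one common scoring choice, not the only one.

  # Pairwise sequence alignment sketch (primary structure).
  # Assumes Biopython is installed; the sequences are hypothetical toys.
  from Bio import Align
  from Bio.Align import substitution_matrices

  aligner = Align.PairwiseAligner()
  aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")  # amino-acid similarity scores
  aligner.open_gap_score = -10     # cost of opening a gap
  aligner.extend_gap_score = -0.5  # cost of extending a gap

  seq_a = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # hypothetical sequence
  seq_b = "MKTAYIAKQRQISFVKSHFSRQIEERLGLIEAQ"  # hypothetical homologue

  best = aligner.align(seq_a, seq_b)[0]
  print(best)                  # the two sequences with gaps inserted
  print("score:", best.score)  # one simple measure of similarity

Running this prints the aligned pair and a raw similarity score; sequence profiles extend the same idea from a single pair to a whole family of related sequences.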

Course tools

  • Python
  • Bio.PDB package (a short usage sketch follows below)
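
The sketch below, assuming Biopython and NumPy are installed, shows what working with Bio.PDB typically looks like: parsing a structure file, collecting the C-alpha atoms into a point cloud, and deriving a residue contact map from it. The file name example.pdb is a hypothetical placeholder, and the 8-angstrom C-alpha cutoff is a common convention rather than a fixed rule.

  # Load a structure with Bio.PDB and build a C-alpha point cloud
  # plus a residue contact map (tertiary/quaternary structure ideas).
  # Assumes a local file "example.pdb" (hypothetical placeholder name).
  import numpy as np
  from Bio.PDB import PDBParser
  from Bio.PDB.Polypeptide import is_aa

  parser = PDBParser(QUIET=True)
  structure = parser.get_structure("example", "example.pdb")
  model = structure[0]  # first model (NMR entries may contain several)

  # Point-cloud representation: one C-alpha coordinate per amino-acid residue.
  ca_coords = []
  for chain in model:
      for residue in chain:
          if is_aa(residue, standard=True) and "CA" in residue:
              ca_coords.append(residue["CA"].get_coord())
  ca_coords = np.array(ca_coords)  # shape: (n_residues, 3)

  # Contact graph: residue pairs whose C-alpha atoms lie within 8 angstroms.
  dists = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
  contact_map = dists < 8.0
  np.fill_diagonal(contact_map, False)  # ignore self-contacts

  print("residues:", len(ca_coords), "contacts:", int(contact_map.sum()) // 2)

The resulting boolean matrix can be read as the adjacency matrix of the contact graph mentioned under quaternary structure.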

Prerequisites

  • Basic knowledge of ML concepts
Level of complexity of course

Advanced

Lecturer

Dr. Dmytro Guzenko

Scientific consultant, industry

Dmytro is a computer scientist by trade who once stumbled upon a job at the Biocrystallography lab at Katholieke Universiteit Leuven (Leuven, Belgium), which needed a ‘computer guy’, and, after taking their data hostage, received a Ph.D. in Biomedical Sciences, supposedly for crystallographic studies of intermediate filament proteins. His most significant hands-on achievement in either biology or medicine remains learning how to use a pipette. He continued his bold ruse of a career in biology as a postdoc at the RCSB Protein Data Bank (San Diego, USA) by using expressions like ‘we need faster structure alignment’ and ‘I am solving 3D Zernike polynomials’ at carefully timed intervals, and by going to the beach to ‘think’. Dmytro was nearly exposed at the CASP competition for protein structure prediction methods, but luckily he participated as an ‘expert’ assessor, which is easier to fake. After one imprudent trip to Tijuana, Dmytro fled to Ukraine, where he now works as a scientific consultant in industry.

Fields of interest: Structural bioinformatics, computer vision, signal processing

Contacts: [email protected]