Understanding COVID-19 Through High-Performance Computing
COVID-19 has changed daily life as we know it. States are beginning to reopen even as case counts in many of them continue to trend upward, and even the most informed seem to have more questions than answers. Many of the answers we do have, though, are the result of models and simulations run on high-performance computing systems. Today's supercomputers can process and analyze all of this data, but it will take exascale machines to do so quickly enough to fully harness artificial intelligence (AI).
Modeling complex scenarios, from drug docking to genome sequencing, requires scaling compute capabilities out instead of up, an approach that is more efficient and more cost-effective. That approach, known as high-performance computing, is the workhorse driving our understanding of COVID-19 today.
High-performance computing is helping universities and governments work together to crunch vast amounts of data in a short time – and that data is crucial to both understanding and curbing the current crisis. Let's take a closer look.
Genomics: While researchers have traced the origins of the novel coronavirus to a seafood market in Wuhan, China, the outbreak in New York specifically appears to have European roots. That outbreak, in turn, fueled others across the country, including in Louisiana, Arizona, and even California. These links have been established by sequencing the genome of SARS-CoV-2 to track its mutations, as seen on the Nextstrain website and reported in The New York Times. Thus far, the virus has accumulated an average of two new mutations per month.
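To make that concrete, here is a minimal Python sketch of the core idea behind mutation tracking: compare a sequenced sample against a reference genome and report each point substitution. The short sequences below are toy stand-ins for the roughly 30,000-base SARS-CoV-2 genome that real pipelines align.

```python
# Toy mutation tracker: report substitutions between a reference genome
# and a newly sequenced sample. Sequences are illustrative stand-ins.
reference = "ATGGCTTACCGTTAGCAAGT"
sample    = "ATGGCTTACCATTAGCAAGC"

mutations = [
    f"{ref}{pos + 1}{alt}"  # standard shorthand, e.g. G11A: G at position 11 became A
    for pos, (ref, alt) in enumerate(zip(reference, sample))
    if ref != alt
]
print(mutations)  # ['G11A', 'T20C']
```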
Understanding how the virus has mutated is a prerequisite for developing a successful vaccine. Such research, however, demands tremendous compute power. A typical genomics file runs to hundreds of gigabytes, so computations require access to a high-performance parallel file system such as Lustre or BeeGFS. And because each genome can be analyzed independently, running multiple genomes on each node maximizes throughput.
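As a rough illustration of that many-genomes-per-node pattern, the Python sketch below fans independent genome analyses out across a node's cores. The k-mer counting is a stand-in for real analysis steps such as alignment or variant calling, and the sequences are toy data rather than staged genome files.

```python
from collections import Counter
from multiprocessing import Pool

def kmer_profile(sequence, k=5):
    """Count all overlapping k-mers in one genome sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

if __name__ == "__main__":
    # Stand-ins for genome files staged on a parallel file system
    # such as Lustre or BeeGFS.
    genomes = [
        "ATGCGTACGTTAGC" * 1000,
        "TTGACGATCGATCG" * 1000,
        "GGCATCGTAGCTAA" * 1000,
        "CCTAGGATCCGATA" * 1000,
    ]

    # One worker per core; each genome is an independent task, so
    # throughput scales with the number of cores on the node.
    with Pool() as pool:
        profiles = pool.map(kmer_profile, genomes)

    for i, profile in enumerate(profiles):
        print(f"genome {i}: {len(profile)} distinct 5-mers")
```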
Molecular dynamics: Thus far, researchers have found 69 promising sites on the coronavirus's proteins that could serve as drug targets. The Frontera supercomputer is also working to complete an all-atom model of the virus's exterior, encompassing approximately 200 million atoms, which will enable simulations of how potential treatments interact with it.
Additionally, some scientists are constructing 3D models of coronavirus proteins in an attempt to identify places on the surface where drug molecules might bind. So far, the spike protein appears to be the main target for antibodies that could provide immunity. Researchers use molecular docking, which is underpinned by high-performance computing, to predict interactions between proteins and other molecules.
To model a protein, a cryo-electron microscope must capture hundreds of thousands of molecular images. Without high-performance computing, turning those images into a model and simulating drug interactions would take years. By spreading the problem across nodes, though, it can be done quickly. The Summit supercomputer, which can complete 200,000 trillion calculations per second, has already screened 8,000 chemical compounds to see how they might attach to the spike protein, identifying 77 that could effectively fight the virus.
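The screening step itself is an embarrassingly parallel pattern: each compound can be scored against the target independently. Here is a hedged Python sketch of that pattern; the dock_score function is a placeholder for a real docking engine such as AutoDock Vina, and the scores are mock values, not chemistry.

```python
import random
from concurrent.futures import ProcessPoolExecutor

def dock_score(compound_id):
    """Placeholder docking run: returns (compound, mock binding score).
    Lower means stronger predicted binding, as with real kcal/mol scores."""
    rng = random.Random(compound_id)  # deterministic stand-in for the physics
    return compound_id, rng.uniform(-12.0, 0.0)

if __name__ == "__main__":
    compounds = range(8000)  # mirrors the 8,000 compounds screened on Summit

    # Every docking run is independent, so the screen parallelizes cleanly
    # across cores on one node, or across nodes via a batch scheduler.
    with ProcessPoolExecutor() as pool:
        scores = list(pool.map(dock_score, compounds))

    # Keep the strongest predicted binders for follow-up simulation.
    hits = sorted(scores, key=lambda s: s[1])[:77]
    print(f"top hit: compound {hits[0][0]} at {hits[0][1]:.2f}")
```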
Other applications: The potential for high-performance computing and AI to simulate the effects of COVID-19 extends far beyond the genetic or molecular level. Already, neural networks are being trained to identify signs of the virus in chest X-rays, for instance. And when large-scale AI and high-performance computing run on the same system, those massive amounts of data can be fed back into the AI models to make them smarter.
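As a sketch of what that X-ray work looks like in code, here is a minimal convolutional classifier in PyTorch. The architecture, image size, and random tensors are illustrative assumptions; real systems train on large labeled radiology datasets across many GPUs.

```python
import torch
import torch.nn as nn

# Tiny stand-in architecture: two conv layers, then a single logit
# indicating the probability of COVID-19 findings in the image.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # grayscale X-ray input
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 1),                            # one logit: COVID vs. not
)

loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in batch: four fake 128x128 X-rays with random labels.
images = torch.randn(4, 1, 128, 128)
labels = torch.randint(0, 2, (4, 1)).float()

# One training step: predict, measure error, update the weights.
logits = model(images)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
print(f"training loss: {loss.item():.3f}")
```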
The possibilities are nearly endless. We could model the fluid dynamics of a forcefully exhaled cloud of particles, looking at their size, volume, speed, and spread, as in the sketch below. We could model how the virus moves through ventilation systems and air ducts, particularly in assisted living facilities and nursing homes with extremely vulnerable populations. We could simulate the supply chain of a particular product and the impact of removing a single supplier from the equation, or the spread of the virus under different levels of social distancing.
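To make the droplet example slightly more concrete, here is a heavily simplified Python sketch that tracks a cloud of exhaled particles under gravity and a crude drag term. Every number in it, from droplet sizes to exhalation speed to the drag model, is an illustrative assumption, not validated aerosol physics.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                    # droplets in the cloud
radius = rng.uniform(1e-6, 1e-4, n)         # droplet radii in meters
pos = np.zeros((n, 2))                      # x (forward), y (height)
pos[:, 1] = 1.6                             # released at mouth height
vel = np.tile([5.0, 0.0], (n, 1))           # ~5 m/s forceful exhale
vel += rng.normal(0, 0.5, (n, 2))           # turbulent spread

dt, g = 1e-3, 9.81
drag = 1.0 / radius**0.5                    # smaller droplets slow faster (toy)
for _ in range(5000):                       # simulate five seconds
    vel[:, 1] -= g * dt                     # gravity pulls droplets down
    vel -= drag[:, None] * vel * dt * 1e-3  # crude air-resistance term
    pos += vel * dt
    vel[pos[:, 1] <= 0] = 0.0               # droplets that reach the floor stop
    pos[:, 1] = np.maximum(pos[:, 1], 0.0)

print(f"median travel distance: {np.median(pos[:, 0]):.2f} m")
```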
The bottom line: The current crisis is wildly complex and rapidly evolving. Getting a grasp on the situation requires not just collecting a tremendous amount of data on the novel coronavirus, but running a variety of models and simulations on it. That can only happen with sophisticated, distributed compute capabilities: research problems must be broken into grids and spread across hundreds of nodes that can talk to one another if they are to be solved as rapidly as the moment demands.
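The sketch below shows that grid pattern in miniature, using mpi4py (an assumption; any MPI binding works): a one-dimensional diffusion grid is split across ranks, and each rank swaps boundary "halo" cells with its neighbors every step so the distributed pieces stay consistent.

```python
# Run with, e.g., `mpiexec -n 4 python diffusion.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local_n = 100                             # grid cells owned by this rank
u = np.zeros(local_n + 2)                 # +2 ghost cells for neighbor values
if rank == 0:
    u[1] = 100.0                          # heat source at the left edge

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for _ in range(500):
    # Swap boundary values with neighboring ranks: the "talking" between
    # nodes that keeps the distributed grid coherent.
    comm.Sendrecv(sendbuf=u[1:2], dest=left, recvbuf=u[-1:], source=right)
    comm.Sendrecv(sendbuf=u[-2:-1], dest=right, recvbuf=u[0:1], source=left)
    # Explicit diffusion update on the interior cells.
    u[1:-1] += 0.25 * (u[:-2] - 2 * u[1:-1] + u[2:])

print(f"rank {rank}: mean temperature {u[1:-1].mean():.3f}")
```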
High-performance computing is what's under the hood of current coronavirus research, from complex maps of the virus's mutations and spread to the identification of possible drug therapies and vaccines. As it powers ever-faster calculations and feeds data to ever more capable AI, our understanding of the novel coronavirus should continue to evolve, in turn improving our ability to fight it.