
Arcus Enables COVID-19 Variant Research: Q&A with Scott Haag, PhD

Published on June 15, 2022 in Cornerstone Blog · Last Updated 6 months 4 weeks ago




Computational science experts on our Arcus team are helping researchers in the Planet Lab study the many variants of the SARS-CoV-2 virus more efficiently.

Many of our data-driven discoveries would not be possible without the savvy skills of our computational science experts at Children’s Hospital of Philadelphia’s Research Institute. In this Q&A, we sit down with one of those experts, Scott Haag, PhD, a computer scientist on the Arcus team. Made up of analysts, programmers, digital archivists, and educators, Arcus provides CHOP researchers with a suite of collaborative tools and services, from data management to secure computing environments, to help solve computational challenges and much more. Currently, Dr. Haag is collaborating with Paul Planet, MD, PhD, attending physician in the Division of Infectious Diseases, and Ahmed Moustafa, PhD, CHOP Microbiome Sequencing Core Director in the Division of Gastroenterology, Hepatology, and Nutrition, on a project to trace and track variants of the SARS-CoV-2 virus around the world. In a previous Cornerstone story, we described the work, which involves taxonomically naming mutants of the virus, then tracking those pathogens back to their closest relatives around the world. Now, Dr. Haag provides an alternate perspective on the research, describing how computer scientists play key roles in solving algorithmic challenges.

Hi Scott! Tell us about your role and how you came to work on this project.

My background is in computer science, and my area of expertise relates to issues involving computer algorithms, sort of a consultant for computational problems. Drs. Planet and Moustafa have been doing this research for a while and have several published papers, and they came to Arcus with a specific computational problem. We sat down and talked about what they were trying to accomplish, and I worked out some algorithms to support their research goals. I think the unique aspect for me is the ability to work within Arcus with these scientists and bring a computational perspective to the medical research perspective.

Can you describe the research project at hand?

Drs. Planet and Moustafa are investigating the phylogenetic tree of the 10 million genetic variants of SARS-CoV-2 available from GISAID, but phylogenetic techniques available today cannot handle the enormous amount of data. Arcus enabled them to extend their research beyond what was previously possible, by designing and implementing more efficient algorithms to represent the problem and providing them a large ephemeral compute environment in a secure Arcus computational lab (where this research has taken place). This work will lead to methods to better understand how SARS-CoV-2 strains are evolving and more generally to techniques to classify any virus more efficiently.

What are the objectives of your involvement in this project, and why is it important?

The objective of the research is to develop tools that can identify the relationships between SARS-CoV-2 variants. This research is important because it helps researchers better understand how viruses mutate and replicate, which is particularly challenging in the case of SARS-CoV-2 because of the huge number of genetic samples that have been collected in a short period of time. The build-up of sequences is unprecedented in modern science, and the number of SARS-CoV-2 genomic sequences far outweighs all of the genomes ever sequenced to date of all other organisms combined. Existing methods used by the CHOP/Penn researchers, and researchers in general, could not be run on such a large number of discrete samples.

How did you solve the challenge?

The variants of SARS-CoV-2 were loaded into an Arcus computational environment (Arcus Lab). Arcus staff wrote custom code that used a graph data structure to represent the problem, where nodes are SARS-CoV-2 variants and edges connect variants that differ by 1, 2, or 3 alleles. This code reduced the complexity of the problem from growing in proportion to the square of the number of SARS-CoV-2 variants (N^2, where N is the number of variants) to growing roughly in proportion to the number of variants itself (N), and it enabled the researchers to run their analysis in about one hour, compared to previous software that failed to complete after running for over 10 days. In addition, Arcus was able to provision a large (2TB RAM with 80 vCPU) virtual machine for a short period of time. This allowed the researchers to use a large machine for the computationally expensive portions of their research and a smaller machine to view results, reducing the total cost of the computational resources for this work and ultimately the cost to CHOP.
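To make the graph idea concrete, here is a minimal sketch in Python. It is not the Arcus team's code, which the interview does not show. It assumes each variant can be represented as a set of allele labels relative to a reference genome, and it uses one well-known way to avoid comparing every pair: index each variant under every set obtained by removing up to three of its alleles, then compare only variants that share an index key. The function names and the allele labels are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

MAX_DIFF = 3  # connect variants that differ by 1, 2, or 3 alleles


def deletion_keys(alleles, max_removed=MAX_DIFF):
    """Yield the allele set with every choice of 0..max_removed alleles removed."""
    full = frozenset(alleles)
    items = sorted(full)
    for r in range(min(max_removed, len(items)) + 1):
        for removed in combinations(items, r):
            yield full - frozenset(removed)


def build_variant_graph(variants, max_diff=MAX_DIFF):
    """Return {(name_a, name_b): distance} for variants within max_diff alleles.

    `variants` maps a variant name to a set of allele labels (an assumed
    representation). Two variants within max_diff of each other always share
    at least one key (their common alleles), so only variants that land in
    the same bucket are ever compared.
    """
    buckets = defaultdict(list)
    for name, alleles in variants.items():
        for key in deletion_keys(alleles, max_diff):
            buckets[key].append(name)

    edges = {}
    for names in buckets.values():
        for a, b in combinations(sorted(names), 2):
            # Sharing a key is necessary but not sufficient, so verify.
            distance = len(variants[a] ^ variants[b])
            if 1 <= distance <= max_diff:
                edges[(a, b)] = distance
    return edges


# Hypothetical variants described by made-up allele labels.
example_variants = {
    "v1": frozenset({"A23403G"}),
    "v2": frozenset({"A23403G", "C241T"}),
    "v3": frozenset({"A23403G", "C241T", "C3037T", "C14408T"}),
    "v4": frozenset({"G28881A", "G28882A", "G28883C", "T19839C", "G25563T"}),
}
example_edges = build_variant_graph(example_variants)
```

In this sketch the indexing work per variant depends on how many alleles it carries, not on how many other variants exist, which is what keeps the total work close to proportional to N as long as the buckets stay small.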

Do you have an analogy to describe how the algorithm works, in layman’s terms?

Here is a way to think about it: There’s a room of a million people, and everybody in that room is trying to find people that are similar to them. That’s basically what Drs. Planet and Moustafa are trying to do with variants: They want to group genetic strains of SARS-CoV-2 into clusters of similar RNA sequences so they can discover phylogenetic relationships. If you were to imagine a room where there’s a million people, every person would be talking to every other person trying to figure out how similar they were. But that would take a long time. Every time you added a new person, they would have to talk to a million people, and in this case, we are adding thousands of new people a day to this room!

If we continue with the room analogy, the strategy I devised is to break it down into several rooms where for example, one of the rooms may be for people who are over six feet tall, and you would go into that room and only meet the people there who were over six feet tall, so you wouldn’t have to meet all 1 million people. The strategy of meeting at a single place versus talking to everybody is the way that this runs faster. And it turns out that these groups are relatively small. Instead of meeting a million people, you’re now meeting a hundred people who are like you. And now we’re able to cluster the virus together into groups of similar viruses. And it’s much more efficient because we don’t have to compare every person to every other person.
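The savings in the room analogy can be checked with a little arithmetic. The sketch below counts conversations (pairwise comparisons) for one big room versus many small rooms. The equal room size of 100 is an assumption taken from the "hundred people" figure in the analogy; the interview does not give real group sizes, and the function names are hypothetical.

```python
from math import comb


def all_pairs(n):
    """Comparisons needed when everyone talks to everyone else."""
    return comb(n, 2)


def bucketed_pairs(room_sizes):
    """Comparisons needed when people only talk within their own room."""
    return sum(comb(size, 2) for size in room_sizes)


people = 1_000_000
rooms = [100] * (people // 100)  # assumed: 10,000 rooms of 100 people each

one_big_room = all_pairs(people)        # 499,999,500,000 comparisons
many_small_rooms = bucketed_pairs(rooms)  # 49,500,000 comparisons
```

Under these assumed sizes, splitting into rooms cuts the number of comparisons by a factor of roughly 10,000.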

Can you describe the ways in which Arcus can help researchers?

Arcus is much more than just a digital lab environment; it is a team of software engineers, digital archivists, educators, and computer scientists working to build a safe, secure digital environment that provides the data and tools to support research at CHOP. Researchers use Arcus to explore available clinical, research, and genomic data, see overlaps among datasets, build new cohorts, run advanced geo-spatial analysis, and determine if there are data or samples available for additional projects. In the Planet Lab’s case, we’re providing a consult: reviewing the analysis and determining why the algorithm is not working. An Arcus computer scientist then assesses the issue and refines the algorithm, and the analysis can then be done. If you have a particular problem with the computational side of your research, consult with someone who understands algorithms and their complexity. Minimizing the data storage and the running times could make something happen that wasn’t possible before.

What are some other projects that you or other Arcus members are collaborating on, now or in the future?

We’re working with CHOP researchers on some exciting projects. The team that I am on (Arcus Data Science) focuses on machine learning and natural language processing (NLP). Jeff Miller, who supervises the team, has been working with NLP and clinical notes, extracting phenotypes from free text notes on patients. So what that means generally is that we describe a patient based on their set of phenotypes extracted from their notes. And I’m working on algorithms to describe how similar patients are to each other based on their phenotypical expression. Say you’re a clinician with 100 patients and you want to find whether there are other patients with similar phenotypes so you can compare patients. This can be a challenge because there are more than 2 million patients in the Arcus data repository. I’m working on creating algorithms and tools to enable researchers to query sets of patients more efficiently against all other patients.
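As a rough illustration of what comparing patients by phenotype sets can look like, here is a minimal Python sketch. The interview does not say which similarity measure the team uses; Jaccard similarity (shared phenotypes divided by all phenotypes in either set) is an assumption chosen because it is simple, and the patient IDs, phenotype terms, and function names are all hypothetical. This version scans every patient, so it shows the scoring idea only, not the more efficient querying described above.

```python
def jaccard(a, b):
    """Jaccard similarity of two phenotype sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(query, patients, top_k=3):
    """Return up to top_k (patient_id, score) pairs, best match first."""
    scored = [(jaccard(query, phenotypes), pid) for pid, phenotypes in patients.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))  # score desc, then id
    return [(pid, score) for score, pid in scored[:top_k]]


# Hypothetical patients described by made-up phenotype terms.
example_patients = {
    "p1": {"seizure", "hypotonia"},
    "p2": {"seizure"},
    "p3": {"asthma"},
}
example_matches = most_similar({"seizure", "hypotonia"}, example_patients)
```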

What excites you the most about this project and future collaborations with researchers?

Enabling researchers and their work. I always get a kick out of talking to people about what they’re working on and then thinking about it through my lens to see whether I or other folks within Arcus can be helpful. That challenge is why I like computer science so much: being good at a certain skill that is useful to other folks and being able to apply that skill to help them. And it’s fun to be able to help solve a problem. It’s fun to take a problem conceptually, to think about how to solve it and then to write the code to make it work. That’s what I enjoy about the job.