Published on December 5, 2024 6:20 PM GMT
Cross-posted from my NAONotebook.
This is an edited transcript of a talk I just gave at CBD S&T, a chem-bio defence conference. I needed to submit the slides several months in advance, so I tried out a new-to-me approach where the slides are visual support only and I finalized the text of the talk later on. This does mean there are places where it would be great to have additional slides to illustrate some concepts
Additionally, this was the first time I gave a talk where I wrote out everything I wanted to say in advance. I think this made for a much less engaging talk, and for future ones I'm planning to go back to speaking from bullets.
I'm Jeff Kaufman, and I co-lead the Nucleic Acid Observatory atSecureBio, where we work to develop biosurveillance methods capable ofdetecting biological threats released by governments or non-stateactors. Today I'm going to be talking about the computational methods we use in our pathogen-agnostic early warning system.
Before I get started, however, this is all the work of a team:
I'm honored to work with this group, and Will, Lenni, and Simon are here today.
As I go through this talk, please leave questions in the app. If I've managed my timing correctly, always a risk, we should have a good bit of time for them at the end.
So: what are we up against?
A typical pandemic starts with an infected individual.
They start having symptoms and become infectious.
Someone else gets infected.
Then a few more.
Those people start having symptoms and become infectious.
More people get infected.
There is now a cluster of unusual cases.
Perhaps at this point doctorsnotice something unusual is happening and raise the alarm.
But what if there's a long pre-symptomatic period? Imagine somethinglike a faster-spreading HIV, maybe airborne.
You'd still start with one infected individual.
Then another.
Still without the symptoms that would tell us something was wrong.
No one notices.
Then the first person starts showing symptoms
But that's unlikely tobe enough for someone to notice something is wrong: there areindividuals who report unusual symptoms all the time.
Another person starts showing the same unsual symptoms.
But they'refar enough away that no one puts the pieces together yet.
By the time you have symptom clusters large enough to be noticed...
Avery large fraction of the population has been infected with somethingserious.
We call this scenario a "stealth" pandemic.
I've talked about this as presymptomatic spread, but note you'd alsoget the same pattern if the initial symptoms were very generic and itjust looked like a cold going around.
Now, barriers to engineering and synthesis are falling.
There are dozens of companies offering synthesis, many with inadequatescreening protocols. And benchtop synthesizers are becoming more practical.
Transfection, to get synthesized DNA or RNA into an organism, is also getting more accessible as more students learn howto do it and the knowledge spreads.
And frontier AI systems can provide surprisingly useful advice,helping an attacker bypass pitfalls.
What if a stealth pandemic were spreading now?
Could our militarypersonnel and general public already be infected? How would we know?
We probably wouldn't. We are not prepared for this threat. How can webecome prepared?
We need an early warning system forstealth pandemics. We're building one.
We're taking a layered approach, starting withtwo sample types: wastewater
and pooled nasal swabs.
We extract the nucleic acids, physically enrich for viruses, and rundeep untargeted metagenomic sequencing.
Then we analyze that data, looking for evidence of genetic engineering.
When this flags a cluster of sequencing reads indicative of a potential threat,analysts evauate the decision and the data the decision was based on.
Lets go over each of these steps in more detail.
Where and what should we sample?
We've done the most work with municipal wastewater. It has someserious advantages, but also serious disadvantages.
The best thing about it is the very low cost per covered individual.
A whole city, in a single sample.
On the other hand, only a tiny fraction
of the nucleic acids inwastewater come from human-infecting pathogens, and even fewer comefrom non-gastrointestinal viruses.
We looked into this in a paper I link at the end. We were trying toestimate the
fraction of sequencing reads that might come from covidwhen 1% of people had been infected in the last week.
While our distribution is still wide, we estimate about one in tenmillion sequencing reads would come from covid.
With influenza, however...
It's one to two orders of magnitude lower.
You need a lot of sequencing to see something that's only one in 100M reads!
On the other hand, bulk sequencing keeps getting cheaper.
The list price of aNovaSeq 25B flow cell is $16k for about 18B read pairs, or about $900 perbillion. This is low enough to make an untargeted sequencing approach practical.
The largest cost at scale, however, is still sequencing.
You still need a lot of sequencing.
Very deep sequencing.
We sequencing a typical sample to between oneand two billion reads.
For an approach with the opposite tradeoffs we're also exploringpooled nasals swabs.
We do anterior nasal, like a covid test.
With swabs, the fraction of sequencing reads that are useful is far higher.
We looked into this in our swab sampling report, which I'll also linkat the end.
With covid, a swab from a single sick person might have one in a thousand sequencing reads match the virus.
Instead of sampling a single person, though...
We sample pools. This is a lot cheaper, since you only need process asingle combined sample.
We estimate that with a pool of 200 people
, if 1% ofthe population got covid in the last week approximately one in 5,000sequencing reads would come from the virus.
Because relative abundance is higher, cost-per-read isn't a drivingfactor. Instead of sending libraries out to run a big Illuminasequencer...
We can sequence them on a Nanopore, right on the bench.
The big downside of pooled nasal swab sampling, though, is you need toget swabs from a lot of people.
A lot of people.
A lot a lot.
We go out to busy public places and ask people for samples.
We getabout 30 swabs per hour and are running batches of 25-30 swabs. Atthis scale this isn't yet cost-competitive withwastewater sequencing as a way to see non-GI viruses, but as we keepiterating on our sample collection and processing we're optimistic that we'll get there.
So you collect your samples and sequence them to learn what nucleicacids they contain. What do you do with that data?
How do you figureout what pathogens are present?
The traditional way is to match sequencing reads against knownthings.
It's great that sequencing lets us check for allknown viruses that infect humans, or that could evolve or be modifiedto infect humans. But this isn't pathogen agnostic because it doesn'tdetect novel pathogens.
Our primary production approach is a kind of genetic engineering detection.
If someone were trying to cause harm today, they'd very likely start with something known. Many ways of modifying a pathogen create a detectable signature:
A junction between modified and unmodified portions of a genome. Wecan look for reads that cover these junctions.
Here's a real sequencing read we saw on February 27th
The first part is a solid match for HIV
But the second part is not.
This is a common lentiviral vector, likely lab contamination. But it'sgenetically engineered, and it's the kind of thing the system should beflagging. It has detected three of these so far, all derived from HIV.
To validate this approach of looking for surprising junctions westarted in silico
We picked a range of viruses, simulated engineered insertions,simulated reads from these genomes and verified the system flags reads.This came in at 71% of optimal performance.
Then we validated real world performance with spike-ins. We took rawmunicipal sewage...
And added viral particles with engineered genomes.
With threedifferent HIV-derived genomes we flagged 79-89% of optimal performance.
In addition to detecting pathogens via...
direct match to known things and by flagging suspicious junctions, wecan flag based on growth patterns.
Initially we're taking a reference-based approach, detecting known thingsbecoming more common. This can be more sensitive than junction detectionbecause you don't have to see the junction to know something's wrong. Thismeans a larger fraction of sequencing reads are informative.
Because this method is based on known things becoming more common,it's far more sensitive in cases where the base genome is initiallyrare or absent in your sample type. But this combines with otherdetection methods to further constrain the adversary's options.
We've also explored a reference-free approach, looking forexponentially-growing k-mers. This requires more data than we have sofar, but it's the hardest one for an adversary to bypass: it tracksthe fundamental signal of a pathogen spreading through the population.
The final method we're exploring is novelty detection.
If we can understand these samples well enough, we can flag somethingintroduced just for being unexpected in the sample. This is a majorresearch problem, and we're excited to share our data and collaboratewith others here. For example, Willie Neiswanger and Oliver Liu atUSC have trained a metagenomic foundation model on this data, whichthey're hoping to apply to detecting engineered sequences. We're alsocollaborating with Ryan Teo in Nicole Wheeler's lab at the Universityof Birmingham on better forms of metagenomic assembly.
How big would a system like this need to be? It depends heavily onour desired sensitivity.
To detect when 1:100 people are infected
We need a medium amount of wastewater sequencing
Or swab collection.
To detect
when 1:1,000 people have been infected
We'd need 10x the sequencing
Or 10x the swabbing
We've made a simulator for estimating the sensitivity of different system designs.
Given a system configuration, it estimates how likely it is to flag anovel pathogen before some fraction of the population has ever been infected
For something in the range of $10M/y we think you can operate a system capableof detecting a novel pathogen before 1:1000 people have been infected. Butthere is still a lot of uncertainty here, and we'll learn more as we continueour pilot.
Today I've described an approach to pathogen-agnostic detection, ...
Starting with wastewater and nasal swabs, extracting and sequencingthe nucleic acids, and then applying a range of computational methodsto identify engineered pathogens.
This is a system that we need todefend against stealth pathogens, and we're building it. Some partsare ready to go today, such as junction-based genetic engineeringdetection. Other parts we're close to, and are a matter ofengineering and tuning, such as reference-based growth detection. Andothers need serious research, such as understanding complex samples sowell that we can flag things for being new. We're excited to beworking on this, and we're interested in collaborating with others whoare thinking along similar lines!
Thank you.
I've prepared a webpage with links to the resources I discussed in this talk:
We have some time for live questions
And I'm also happy to take questions over email at jeff@securebio.org
Discuss