Machine learning is powering cutting-edge software tools, Floodlight™ and Searchlight™, helping chemists sift through an overflow of data. Chemists look for chemical signals in samples of food, drugs, personal care products, carpeting, upholstery and other items we use every day. Identifying these signals and studying these chemical fingerprints could improve health and safety by creating better products and better public policies. Floodlight and Searchlight are speeding up this process, churning out weeks of work in a fraction of the time and saving chemists from data overload.
Below is a transcript of the episode, modified for clarity.
Lisa Peña (LP): Chemistry and computer science combine in innovative software systems, Floodlight™ and Searchlight™. They are machine learning tools that can analyze food, drugs, materials, even the air, to detect chemicals. What is the software's potential impact on health, safety, the environment, and more? Hear from the creators next on this episode of Technology Today.
We live with technology, science, engineering, and the results of innovative research every day. Now let's understand it better. You're listening to the Technology Today podcast presented by Southwest Research Institute.
Hello, and welcome to Technology Today. I'm Lisa Peña. In machine learning, a computer learns to recognize patterns in data using algorithms. Floodlight and Searchlight software, which we are discussing today, illuminate patterns in complex chemical data, identifying chemicals present in a sample. This type of analysis used to take chemists weeks. Now, with this incredible technology speeding up the process, it takes just a fraction of the time.
Our guests today are Dr. Kristin Favela, an SwRI analytical chemist, and SwRI research computer scientist Michael Hartnett. They combined their expertise to create Floodlight and Searchlight. And this is really a great example of collaboration at SwRI, merging different areas of research and development to create a world-changing solution. Thank you for joining us, Kristin and Michael.
Dr. Kristin Favela (KF): Thank you for having us.
LP: So we had a brief segment with you, that was back in Episode 16, where you described your process creating Floodlight. But I'm excited to have more time with you today to talk about your newest software Searchlight, and really get a chance to delve into this technology. So I've given a brief introduction of Floodlight and Searchlight, but what are these software tools? How do you describe them?
Michael Hartnett (MH): So I would first start with Floodlight, because it really is sort of the first pass at this complex chemical data. The data that is received, or picked up, from the GC×GC mass spec instrument is three-dimensional. It has those two chromatographic dimensions, as well as a mass spec dimension, each of which is trying to tease apart individual chemical compounds in a chemical sample.
An unfortunate reality of this kind of instrumentation is that there's noise. There are signals picked up by the instrument software that, in reality, aren't helpful or useful. They're just kind of noise in the background. So we developed Floodlight to sift through the signal and noise and differentiate between the two, picking up the useful, real data that can help a chemist make determinations on the composition of a sample and throwing away all the interferent information, or disinformation.
LP: So these software systems take a sample and sift through the sample for chemical fingerprints, if you will. And through that, the software is able to identify what chemicals you're looking at? Am I understanding that correctly?
MH: Yes. I think the ultimate goal in these types of analyses is to tease apart the important chemical information of samples. So Floodlight cleans up a sample and then passes that cleaned up information to Searchlight, which can analyze the patterns between chemical samples to pick up on the important similarities and differences amongst these chemical samples. So we can think of that in terms of anomaly detection, you can find samples that are out of the normal range that we would expect in terms of their individual chemical components, and see concentration trends over time or with other parameters.
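The anomaly detection idea described above can be illustrated with a minimal sketch. It is not Searchlight's actual method, which the conversation does not specify. It assumes a single compound measured across several samples and flags any sample whose concentration sits far from the median, using a median-based (modified z-score) rule. The function name `flag_anomalies`, the cutoff of 3.5, and the sample values are all hypothetical.

```python
from statistics import median


def flag_anomalies(concentrations, cutoff=3.5):
    """Return the names of samples whose concentration is far from the median.

    concentrations: dict mapping sample name -> concentration of one compound.
    Uses a modified z-score based on the median absolute deviation (MAD),
    which is less distorted by the outliers themselves than mean/stdev.
    """
    values = list(concentrations.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # All (or most) values are identical: nothing can be ranked as unusual.
        return []
    return [
        name
        for name, value in concentrations.items()
        if 0.6745 * abs(value - center) / mad > cutoff
    ]


# Hypothetical concentrations of one compound (arbitrary units) in seven samples.
samples = {
    "s1": 1.0, "s2": 1.1, "s3": 0.9, "s4": 1.0,
    "s5": 1.05, "s6": 0.95, "s7": 5.0,
}
outliers = flag_anomalies(samples)
print(outliers)
```

In this made-up data set only sample `s7` falls outside the normal range. A real workflow would repeat the check for each of the hundreds or thousands of compounds in a sample set.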
LP: How are the software systems different? It sounds like you start with one and it moves into the other. But what are the differences?
MH: That's a good question. So in the background, the algorithmic underpinnings of these softwares, Floodlight is powered by a supervised machine learning algorithm. That means that we took a lot of labeled data, in terms of the signal quality, and trained a machine learning algorithm to make the differentiation between high quality and poor quality signals. Whereas in Searchlight, much of the machine learning methods there are unsupervised. So instead of relying on human-labeled data to learn the relationships, unsupervised machine learning algorithms just look at the inherent structure of the data to pull out the relationships and group things into categories. So I would say that's a pretty major differentiation between the two technologies.
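The supervised versus unsupervised distinction can be sketched in a few lines. This is only an illustration with stand-in techniques: a nearest-centroid classifier for the supervised case and a basic k-means loop for the unsupervised case. The interview does not say which algorithms Floodlight or Searchlight actually use, and the feature values (for example, a signal-to-noise ratio and a peak-shape score) are hypothetical.

```python
from math import dist


def centroid(points):
    """Average a list of equal-length feature tuples, dimension by dimension."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


# Supervised: learn from examples that a human has already labeled.
def train_nearest_centroid(examples):
    """examples: list of (features, label). Returns {label: centroid}."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(pts) for label, pts in by_label.items()}


def predict(model, features):
    """Assign the label whose centroid is closest to the new signal."""
    return min(model, key=lambda label: dist(model[label], features))


# Unsupervised: no labels, group points by the structure of the data alone.
def kmeans(points, k, iterations=10):
    centers = list(points[:k])  # deterministic start: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(centers[i], p))
            clusters[nearest].append(p)
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters


# Hypothetical labeled signals: (signal-to-noise, peak-shape score) -> quality.
labeled = [
    ((9.0, 0.9), "good"), ((8.0, 0.8), "good"),
    ((1.0, 0.2), "poor"), ((2.0, 0.1), "poor"),
]
model = train_nearest_centroid(labeled)
print(predict(model, (7.5, 0.7)))   # classified using the human labels

# Hypothetical unlabeled samples described by two measurements each.
unlabeled = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
groups = kmeans(unlabeled, k=2)
print(groups)                       # two groups found without any labels
```

The supervised half cannot work without the labeled examples, while the unsupervised half never sees a label and simply groups samples that sit close together.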
LP: All right, a lot of great information there from our computer scientist. And I wanted to bring in our chemist on this. Kristin, thank you for joining us. Why is it necessary and important to analyze chemicals? How can these software systems be used in real world applications?
KF: Well, it's necessary and important to analyze chemicals to understand what we are exposed to in our external world. So traditionally, the field of exposomics, for example, has focused on targeted analysis. So that is having a predefined list of chemicals that you are interested in, and then specifically targeting those chemicals in a sample. However, with recent advances in both instrumentation and computer science, we are capable of analyzing our samples in a more holistic manner.
So this is called nontargeted analysis, where we attempt to characterize a specific sample to the greatest extent that we can, so processing all the signals that the instrument provides. And what we were discovering is a very large amount of previously uncharacterized chemicals, and chemicals that maybe are known, but not known to be in, for example, specific consumer products. So in many cases, there is very little known about the effect of these chemicals on human health. And in order to triage which chemicals are most important to be studied, it's important to know what is the frequency and the concentration of these chemicals in samples of interest?
LP: So when you're saying it's a matter of human health, and you're looking for chemicals in products, I think we're all under the impression that if we're using a product, then it's safe. But are you saying that sometimes it takes a little bit more analysis after it's for sale?
KF: Well, so, a sample consists of many, many chemicals. So it's not just the ingredients that go into the sample. All of those ingredients have impurities and they also have byproducts during manufacturing. And those types of signatures have not previously been evaluated in such a detailed and comprehensive manner. Now, it doesn't mean that they're harmful to human health. They may be perfectly benign. But the point is that these chemicals, they're in samples. Or, excuse me, they're in consumer products, and things like that. But they're not listed as being in those products. So in order to understand if these are safe, you have to first characterize what's present.
LP: So are you talking about lotions and shampoos, or I don't know, food containers, or what type of products are you analyzing?
KF: So pretty much anything you come into contact with on a daily basis could be a candidate for this type of analysis. So anything from personal care products to clothing, there's carpeting and upholstery, all of these materials are made of chemicals. And in many cases they're a very complex mix of chemicals.
So now, with these advances in the technology, we're able to characterize these fully. And this is important, not only for exposure science, so knowing what people are exposed to, but also for other industries, such as forensics. So for example, a particular manufacturer may have a process, something that they produce, and then suddenly it's out of spec. And they want to know why. So these techniques can be really helpful for the full characterization of samples for that reason.
You may also think of pretty much any time you want to know what is in a sample. So if you think of the pharmaceutical industry, they are very interested in knowing, obviously, what impurities might be present in pharmaceuticals. So it's not just exposomics in consumer products. Really, the sky's the limit here. Any time you have a situation where you have an interest in knowing holistically what it is made of, this nontargeted analysis technique can provide a good first screening method, and then set you down a path for further study.
LP: So your software is already in use, plucking things out that hadn't been recognized before. What type of clients do you have that are interested in learning more about their products?
KF: We serve a variety of clients across both government and commercial interests. So this ranges from customers that are interested in understanding consumer products to customers who are interested in just understanding the differences in their particular samples, whatever industry that may be. We also support the fuels industry for understanding the composition of different fuels and oils, and what effect might that have on those products and in how they function.
LP: Have you had any really surprising findings?
KF: I think the biggest surprise is just how much more efficient it makes the process. So we've done a number of studies now where we have analyzed samples, both using our quote unquote "old-fashioned manual curation method," which is quite tedious and takes a lot of time. And then to take that same data set and use the software, and see how quickly the results are. And it's very satisfying to see, in most cases, how the results are almost identical. And I think that's what's been the most surprising, in a very good way, thing to me.
LP: So after you analyze a sample for your clients, what's the next step for them, once they know what they're dealing with?
KF: Well, there could be a number of next steps. So non-targeted analysis is a screening technique. So by being able to rapidly see these patterns in their samples, it's really, really critical. Because by speeding up this process, it frees both us and our clients to have more resources to then act on those findings. So follow-up tasks, maybe, to confirm what we found. So as this is a screening technique, the identifications, while they are pretty highly confident identifications in most cases, they can't be said to be absolutely true until they are compared against an authentic reference material. So being able to screen a large number of chemicals quickly allows us to more rapidly hone in on, what are the most important chemicals to be able to confirm, and then study further.
LP: So we have mentioned several times that Floodlight and Searchlight really speed up this process. And there was really a need to do that. We touched a little bit on that in episode 16, when you guys joined us for a segment then. But if you could remind us, how did you develop Floodlight and Searchlight? How did it all come to be?
MH: So Kristin and her team really brought the need, the whole idea being that this first pass, this screening process is severely bottlenecked by the manual processing effort. That effort that you mentioned takes weeks. That's just not feasible at scale to perform the kind of services that we need to with non-targeted analysis. So that led to the sacrifice of that technique in a lot of cases, where quick turnaround was essential.
So once that need was identified, my team was brought in to provide the automation support. And so we've been working closely together, Kristin and I, for a number of years now, building out these tool sets and collaborating on other data-driven approaches to things, and analysis efforts just to make efficient what we can and free up the chemist, the human expert, to take that information and make informed decisions and interpret the results.
So I would call Floodlight and Searchlight really decision support tools. We're not taking anyone's job with this automation. Really what we're taking away is the more tedious aspects, the mundane things that a machine is perfectly capable of doing, and offsetting that burden so that the things that aren't easily transferable to a machine, like interpretation and drawing conclusions from these data driven results, we leave that to the expert.
LP: So this was really a feat achieved by machine learning. I did want to touch on that, as well. For our listeners, what is machine learning? How do you define it, Michael?
MH: Machine learning is a very hot topic. So it is very important to have a clear definition of what it means, and what it can do. Machine learning is a pretty broad field. A lot of people associate it most strongly with deep learning, which is neural-network-based machine learning. But it really spans a lot more techniques than that.
It could be something as simple as linear regression, or something not so simple, like Bayesian updating. But really, the essence of machine learning is really not explicitly telling a computer how to do something. So we can think of all the sorts of physical relationships that we understand about the world. Maybe take velocity, for example. That's easily defined as distance over time. So we know that there's a mathematical operation there. We can take velocity data. We can take distance and time data. And we know how to combine those things together to get our answers.
But there are a lot of scenarios where those relationships aren't as easily defined. We can't just divide something, or multiply something. So that's really where machine learning comes into play. If we feed a machine learning algorithm lots of data, lots of examples, then the machine learning algorithm can define those relationships for us. It can learn the complexities between the characteristics of the data and tease apart, how to come to the right answer without us explicitly telling it what to do.
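The velocity example can be made concrete with a small sketch. The first function encodes the known rule directly. The second is never told the rule and instead estimates the relationship from example measurements using a least-squares line through the origin, which is the simple linear regression case mentioned above. The measurements are invented for illustration.

```python
def velocity(distance, time):
    """Explicit rule: we tell the computer exactly how to combine the inputs."""
    return distance / time


def learn_slope(times, distances):
    """Learn the distance-per-unit-time relationship from examples.

    Least-squares fit of distance = slope * time (a line through the origin):
    slope = sum(t * d) / sum(t * t).
    """
    numerator = sum(t * d for t, d in zip(times, distances))
    denominator = sum(t * t for t in times)
    return numerator / denominator


# Hypothetical noisy measurements of an object moving at roughly 3 units/second.
times = [1.0, 2.0, 3.0, 4.0]
distances = [3.1, 5.9, 9.2, 11.8]

learned = learn_slope(times, distances)
print(velocity(6.0, 2.0))   # rule given by a person
print(learned)              # relationship recovered from the data alone
```

Both routes arrive at a value near 3, but only the second would still work if nobody knew the formula in advance, which is the situation described for the chemical data.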
LP: And if you could just recap, and we've already talked about this, but how was machine learning specifically used to develop this software?
MH: So this data, it's one of the most complex sets that I've worked with in my career. I don't know how Kristin does it every day, deals with this data. It's highly complex. And so really, we were drawing upon that human, that process in Kristin's brain, where she makes these kinds of decisions about the chemical data.
And those weren't easily defined. It's not something that simple addition or multiplication could take care of. She was looking at a wide variety of information and digesting it herself. And so to take those relationships that Kristin has learned with her expertise and experience and apply that to a computer program was infeasible, to say the least. So there is no way we could program how to decide whether a chemical signal is good or bad, explicitly. So instead, we used machine learning to take that information in, make that determination. So that's for Floodlight.
And similarly for Searchlight, grouping these chemical samples that can contain hundreds or thousands of chemical compounds, making a determination on what is related to each other, what is unrelated, that was a similarly complex problem to deal with. And not one that we could easily define, mathematically. So again, we pulled from machine learning to automatically define those relationships to learn what is important about the data that can help us tease apart, draw some conclusions at the end of the day.
LP: I think one figure that I read was that your software is speeding up the process 120 times. So 120 times faster, now. Is that about where you put this, as far as speed?
MH: So that was a special test case. I can't guarantee that that would always be the case across every sample set. But yes, that was the benchmark that we came up with on a specific sample set.
KF: I might add that one of the unique things about Floodlight and Searchlight is, there are a number of user controls. So there may be situations when we are trying to churn through a lot of samples very quickly. And we would be happy with a baseline accuracy of 90% to 95%, which is pretty great. And then, we can really speed up the process. There may be other situations where the samples are such that we require a very, very high level of accuracy, 99% or above. And the software programs are designed to work with the chemist, so not replace the chemists, but work with the chemist in order to be able to tune those accuracy levels. So even having a 99% accuracy level still does significantly speed up the process. But that's one of the reasons why the speedup factor, it will be dependent on the analysis goals.
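One way to picture this speed-versus-accuracy tuning is a confidence threshold. The sketch below is an assumption, not a description of the actual user controls: it supposes a classifier attaches a confidence score to each signal, lets the software handle signals at or above a chosen threshold automatically, and sends the rest to the chemist. The function name `triage`, the scores, and the thresholds are hypothetical.

```python
def triage(scored_signals, threshold):
    """Split signals into those handled automatically and those sent for review.

    scored_signals: list of (signal name, model confidence between 0 and 1).
    A higher threshold means fewer automatic decisions and more manual review.
    """
    automatic, manual_review = [], []
    for name, confidence in scored_signals:
        if confidence >= threshold:
            automatic.append(name)
        else:
            manual_review.append(name)
    return automatic, manual_review


# Hypothetical confidence scores for five signals.
scored = [("sig1", 0.99), ("sig2", 0.97), ("sig3", 0.93), ("sig4", 0.80), ("sig5", 0.999)]

fast_auto, fast_review = triage(scored, threshold=0.90)      # high-throughput setting
strict_auto, strict_review = triage(scored, threshold=0.99)  # high-accuracy setting

print(len(fast_review), "signal(s) to review at the fast setting")
print(len(strict_review), "signal(s) to review at the strict setting")
```

Under these assumptions the stricter setting leaves more signals for the chemist, which is consistent with the point that the speedup factor depends on the analysis goals.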
LP: So the speed varies. But one thing is for sure. It is definitely faster than how you used to conduct the process. So I did want to talk to you about that, Kristin. How was the work of analyzing and identifying chemicals conducted before introducing machine learning into the process? And what were the limitations with that process?
KF: So previously, we looked at each and every signal and made a decision about its quality. And that's what we did. It was very tedious, very manual. And we looked at millions of signals over time, over several years. And that was really the impetus towards developing a program that would apply these computer science tools to help assist with that very manual, tedious process, freeing us up to do the more interesting and important work of, what does this all mean.
LP: When you say signals, can you talk to us about what you're meaning when you say a chemical signal?
KF: Oh, sure. So the instruments that we use, they are essentially a transducer. So they will interact with the sample to produce some kind of electronic signal. And that's what we call data. So the first step to using that data is to identify features. So a feature is essentially maybe a signal that you can see, a peak rising and falling at a given place and time in the analysis. And so that would be a feature. But those features are not helpful without identifications. So then the next step is to take all those features and try to identify them. And so the instrument software programs do all of this for the chemist. They identify where the features are. And then they try to assign identifications to those features. But the problem is that the computer is not perfect at it.
So there are lots of situations where you can get distortions in signals, and things of that nature. And then, that's where assessing all the information that you have available to you becomes important to decide, is this a high quality signal that I can stand behind as being a real signal, a real chemical in the sample? Or is this a low quality signal that we don't have confidence in, so we really shouldn't report?
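The idea of a feature as a peak rising and falling can be sketched with a toy peak finder. This is a deliberately simplified, one-dimensional illustration with invented numbers. Real instrument software works on far richer data, and nothing here reflects how any vendor's feature detection actually works.

```python
def find_peaks(trace, min_height):
    """Return the indices where the trace rises, tops out, and falls again.

    trace: list of detector intensities sampled over time.
    min_height: peaks below this intensity are ignored as background.
    """
    peaks = []
    for i in range(1, len(trace) - 1):
        rises_then_falls = trace[i - 1] < trace[i] > trace[i + 1]
        if rises_then_falls and trace[i] >= min_height:
            peaks.append(i)
    return peaks


# Hypothetical detector trace: two clear peaks plus small background wiggles.
trace = [0.1, 0.2, 0.1, 0.3, 2.5, 6.0, 2.4, 0.2, 0.1, 0.4, 3.0, 0.3, 0.1]

print(find_peaks(trace, min_height=1.0))    # only the two clear peaks
print(find_peaks(trace, min_height=0.15))   # also picks up a background wiggle
```

Lowering the height cutoff makes the finder report a small background bump as a feature. That is a toy version of the problem described here: deciding whether a detected feature is a real chemical signal or one that should not be reported.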
LP: So having the tools that you have now, Floodlight and Searchlight, do you ever look back and wonder, how did I ever do it that way? [LAUGHS]
KF: Absolutely, absolutely. It's really been a timesaver. And it's been really fun, because I've had more time to do the interesting parts of sample analysis.
LP: With this new capability to deeply analyze chemical components, are you discovering new information, maybe things you did not see or understand fully before?
KF: Absolutely. So whereas previously, when we had to process everything manually, we knew how complex the data was, because we could see that very plainly. But now, with these computer science tools, what we're discovering is new patterns in relationships between samples which were previously difficult for a single human to discern, just looking at the samples manually.
LP: And this is across a broad range of fields, as we discussed. Everything from products to the environment. What are some of the areas that you're applying the software to?
KF: So I can talk about some general classes of compounds that many researchers, including us, in this field are interested in. And that ranges anywhere from phthalates, which are present in a lot of plastics and whose impact on human health is not completely understood. So there are many, many different kinds of phthalates. And when one kind of phthalate is phased out, it's generally replaced by another. So we're seeing all kinds of those kinds of chemicals, as well. Other chemicals that people in the field are interested in include polyaromatic hydrocarbons, dyes, flame retardants, and polyfluorinated chemicals, or PFAS. The list just goes on and on and on.
LP: So with all these areas, with you being able to pull samples from all these areas and analyze with Floodlight and Searchlight, what is the benefit to humankind?
KF: I think it's just knowledge. We can't fix a problem if we don't know what the problem is. So just the knowledge of knowing what is present, what are we being exposed to. That's really in some ways half the battle. So we hope we're providing a deeper level of knowledge to the scientific community to tackle these problems.
LP: And do you see this leading to better initiatives in health and safety?
KF: Absolutely. I think the biggest impact is going to be the sheer amount of samples we're able to analyze. And as I think Michael could probably better explain, definitely the more data, the better.
MH: I wholeheartedly agree. The more information that we have, the more informed our decisions can be as a society. So I think the benefit of increasing the amount of information is just having better information, high quality, fast information to drive policy, drive decision making.
LP: So we've covered a lot of ground today with Floodlight and Searchlight. What is the takeaway today for our listeners? What do you consider the most compelling aspect of this technology?
MH: I would say that it's the amount of information that is brought to the surface. Because of this technology, it will enable future studies. It will enable research that ideally will inform people and keep us all healthy and safe, as well as make things more efficient in the world.
LP: Floodlight and Searchlight, a really fast way to identify chemicals, software that's beneficial in so many applications, truly a breakthrough and example of strong collaboration. So thank you so much for joining us today, Kristin and Michael.
MH: Thank you, Lisa.
KF: Yes, thank you.
And that wraps up this episode of Technology Today. You can hear all of our episodes and see photos and complete transcripts at podcast.swri.org. Remember to share our podcast and subscribe on your favorite podcast platform.
Want to see what else we're up to? Connect with Southwest Research Institute on Facebook, Instagram, Twitter, LinkedIn, and YouTube. And now is a great time to become an SwRI problem solver. Visit our career page at swri.jobs.
Ian McKinney and Bryan Ortiz are the podcast audio engineers and editors. I am producer and host, Lisa Peña.
Thanks for listening.
Recent advances in mass spectrometry coupled with large data collection in cheminformatics are benefiting many fields, including non-targeted analysis (NTA) and exposomics, by providing the potential to perform deep analysis and gain unprecedented insights into chemical properties. Floodlight™ provides high-throughput screening (HTS) capabilities for high-quality signal interpretation. Floodlight leverages machine learning algorithms for rapid pattern matching, allowing scientists to make faster data-informed decisions with non-targeted analysis and other complex datasets.