Deep-Learning-Specialization-Coursera

This repo contains the updated versions of all the assignments/labs (done by me) of the Deep Learning Specialization on Coursera by Andrew Ng. It includes building various deep learning models from scratch and implementing them for object detection, facial recognition, autonomous driving, neural machine translation, trigger word detection, etc. (updated version, 2021).

GitHub Repo

Announcement

[!IMPORTANT] Check our latest paper (accepted in ICDAR’23) on Urdu OCR

UTRNet

This repo contains all of the solved assignments of Coursera’s most famous Deep Learning Specialization of 5 courses offered by deeplearning.ai

Instructor: Prof. Andrew Ng

This Specialization was updated in April 2021 to include developments in deep learning and programming frameworks. One of the biggest changes was the shift from TensorFlow 1 to TensorFlow 2. New materials were also added. However, most of the older online repositories still don’t have the updated code. This repo contains updated versions of the assignments. Happy Learning :)

Programming Assignments

Course 1: Neural Networks and Deep Learning

  • W2A1 - Logistic Regression with a Neural Network mindset
  • W2A2 - Python Basics with Numpy
  • W3A1 - Planar data classification with one hidden layer
  • W4A1 - Building your Deep Neural Network: Step by Step
  • W4A2 - Deep Neural Network for Image Classification: Application

Course 2: Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

  • W1A1 - Initialization
  • W1A2 - Regularization
  • W1A3 - Gradient Checking
  • W2A1 - Optimization Methods
  • W3A1 - Introduction to TensorFlow

Course 3: Structuring Machine Learning Projects

  • There were no programming assignments in this course. It was completely theoretical.
  • Here is a link to the course

Course 4: Convolutional Neural Networks

  • W1A1 - Convolutional Model: step by step
  • W1A2 - Convolutional Model: application
  • W2A1 - Residual Networks
  • W2A2 - Transfer Learning with MobileNet
  • W3A1 - Autonomous Driving - Car Detection
  • W3A2 - Image Segmentation - U-net
  • W4A1 - Face Recognition
  • W4A2 - Neural Style transfer

Course 5: Sequence Models

  • W1A1 - Building a Recurrent Neural Network - Step by Step
  • W1A2 - Character level language model - Dinosaurus land
  • W1A3 - Improvise A Jazz Solo with an LSTM Network
  • W2A1 - Operations on word vectors
  • W2A2 - Emojify
  • W3A1 - Neural Machine Translation With Attention
  • W3A2 - Trigger Word Detection
  • W4A1 - Transformer Network
  • W4A2 - Named Entity Recognition - Transformer Application
  • W4A3 - Extractive Question Answering - Transformer Application

I’ve uploaded these solutions only as a help for those who get stuck somewhere; they may save you some time. I strongly recommend that you do not directly copy any part of the code (from here or anywhere else) while doing the assignments of this specialization. The assignments are fairly easy, and you learn a great deal by doing them. Thanks to the deeplearning.ai team for giving this treasure to us.

Connect with me

Name: Abdur Rahman

Institution: Indian Institute of Technology Delhi

Find me on:

LinkedIn

DeepLearning.ai

These are my assignments for Andrew Ng’s “Deep Learning Specialization”. The specialization consists of five courses: Neural Networks and Deep Learning; Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization; Structuring Machine Learning Projects; Convolutional Neural Networks; and Sequence Models.

Course Contents

Neural Networks and Deep Learning

Week1 Introduction to deep learning

Week2 Neural Networks Basics

Week3 Shallow Neural networks

Week4 Deep Neural Networks

Improving Deep Neural Networks

Week1 Practical aspects of Deep Learning (Initialization, Regularization, Gradient Checking)

Week2 Optimization algorithms

Week3 Hyperparameter tuning, Batch Normalization and Programming Frameworks

Convolutional Neural Network

Week1 Foundations of Convolutional Neural Networks

Week2 Deep convolutional models: case studies

Week3 Object detection

Week4 Special applications: Face recognition & Neural style transfer

Sequence Models

Week1 Recurrent Neural Networks

Week2 Natural Language Processing & Word Embeddings

Week3 Sequence models & Attention mechanism

Assignment 5: Text Classification with RNNs (Part 1)

Deadline: November 22nd, 9am

In this assignment and the next, we are switching to a different modality of data: Text. Namely, we will see how to assign a single label to input sequences of arbitrary length. This has many applications, such as detecting hate speech on social media or detecting spam emails. Here, we will look at sentiment analysis, which is supposed to tell what kind of emotion is associated with a piece of text.

In part 1, we are mainly concerned with implementing RNNs at the low level so that we understand how they work in detail. The models themselves will be rather rudimentary. We will also see the kinds of problems that arise when working with sequence data, specifically text. Next week, we will build better models and deal with some of these issues.

The notebook associated with the practical exercise can be found here.

We will be using the IMDB movie review dataset. This dataset comes with Keras and consists of 50,000 movie reviews with binary labels (positive or negative), divided into training and testing sets of 25,000 sequences each.

A first look

The data can be loaded the same way as MNIST or CIFAR – tf.keras.datasets.imdb.load_data(). If you print the sequences, however, you will see that they are numbers, not text. Recall that deep learning is essentially a pile of linear algebra. As such, neural networks cannot take text as input, which is why the text needs to be converted to numbers. This has already been done for us – each word has been replaced by a number, and thus a movie review is a sequence of numbers (punctuation has been removed).

If you want to restore the text, tf.keras.datasets.imdb.get_word_index() has the mapping – see the notebook for how you can use this, as well as some additional steps you need to actually get correct outputs.

Representing words

Our sequences are numbers, so they can be put into a neural network. But does this make sense? Recall the kind of transformation a layer implements: a linear map followed by an (optional) non-linearity. But that would mean, for example, that the word represented by index 10 would be “10 times as much” as the word represented by index 1. And if we simply swapped the mapping (which we can do, as it is completely arbitrary), the roles would be reversed! Clearly, this does not make sense.

A simple fix is to use one-hot vectors: Replace a word index by a vector with as many entries as there are words in the vocabulary, where all entries are 0 except the one corresponding to the respective word, which is 1 – see the notebook.

Thus, each word gets its own “feature dimension” and can be transformed separately. With this transformation, our data points are now sequences of one-hot vectors, with shape (sequence_length, vocabulary_size).
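As a rough illustration of the one-hot transformation described here, the following NumPy sketch encodes a single toy review. The helper name `one_hot_sequence`, the toy indices, and the vocabulary size of 6 are made up for this example; the notebook may do this differently.

```python
import numpy as np

def one_hot_sequence(indices, vocabulary_size):
    """Turn a sequence of word indices into an array of shape
    (sequence_length, vocabulary_size) with a single 1 per row."""
    indices = np.asarray(indices)
    encoded = np.zeros((len(indices), vocabulary_size), dtype=np.float32)
    # Row i gets a 1 in the column given by the i-th word index.
    encoded[np.arange(len(indices)), indices] = 1.0
    return encoded

review = [1, 4, 2, 4]  # hypothetical toy review, not real IMDB data
encoded_review = one_hot_sequence(review, vocabulary_size=6)
print(encoded_review.shape)
```

Note that the memory cost grows with the vocabulary size, since every word becomes a full row of mostly zeros.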

Variable sequence lengths

Of course, not all movie reviews have the same length. This actually represents a huge problem for us: We would like to process inputs in batches, but tensors generally have to be “rectangular”, i.e. we cannot have different sequence lengths in the same batch! The standard way to deal with this is padding: Appending additional elements to shorter sequences such that all sequences have the same length.

In the notebook, this is done in a rather crude way: All sequences are padded to the length of the longest sequence in the dataset.
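To make the crude scheme concrete, here is a minimal NumPy sketch that pads every sequence to the length of the longest one by appending zeros. The function name `pad_to_longest`, the pad value 0, and the toy sequences are assumptions for illustration; the notebook itself may rely on pad_sequences, which pads at the front by default.

```python
import numpy as np

def pad_to_longest(sequences, pad_value=0):
    """Append pad_value to shorter sequences so that all rows share
    the length of the longest sequence ("post" padding)."""
    max_length = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_length), pad_value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq
    return padded

toy_reviews = [[1, 5, 3], [1, 7], [1, 2, 9, 4, 6]]  # hypothetical data
padded_reviews = pad_to_longest(toy_reviews)
print(padded_reviews)
```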

Food for thought #1: Why is this wasteful? Can you think of a smarter padding scheme that is more efficient? Consider the fact that RNNs can work on arbitrary sequence lengths, and that training minibatches are pretty much independent of each other.

Dealing with extremes

Once we define the model, we will run into two issues with our data.

First, some sequences are very long. There are two ways to limit the sequence length:

  • Truncate sequences by cutting off all words beyond a limit. Both load_data and pad_sequences have arguments to do this. We recommend the latter as you can choose between “pre” or “post” truncation.
  • Remove all sequences that are longer than a limit from the dataset. Radical!

Second, the vocabulary is large and contains many rare words. This is problematic for two reasons:

  • The one-hot vectors are huge, slowing down the program and eating memory.
  • It’s difficult for the network to learn useful features for the rare words.

load_data has an argument to keep only the n most common words and replace less frequent ones by a special “unknown word” token (index 2 by default). As a start, try keeping only the 20,000 most common words or so.
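The effect of such a vocabulary limit can be sketched in plain Python. This sketch assumes that lower indices correspond to more frequent words, and the helper `cap_vocabulary` is hypothetical; it only mimics the idea and is not the Keras implementation.

```python
UNKNOWN_INDEX = 2  # the "unknown word" index mentioned above

def cap_vocabulary(sequences, num_words, unknown_index=UNKNOWN_INDEX):
    """Replace every index >= num_words by the unknown-word index.
    Assumes indices are ordered by word frequency (most frequent first)."""
    return [
        [idx if idx < num_words else unknown_index for idx in seq]
        for seq in sequences
    ]

toy_reviews = [[1, 14, 250, 7], [1, 9000, 3, 42]]  # hypothetical data
capped_reviews = cap_vocabulary(toy_reviews, num_words=100)
print(capped_reviews)
```

After capping, the one-hot vectors only need num_words entries instead of one entry per word in the full vocabulary.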

Food for thought #2: Between truncating long sequences and removing them, which option do you think is better? Why?

Food for thought #3: Can you think of a way to avoid the one-hot vectors completely? Even if you cannot implement it, a conceptual idea is fine.

With these issues taken care of, we should be ready to build an RNN!

Building The Model

A Tensorflow RNN “layer” can be confusing due to its black box character: All computations over a full sequence of inputs are done internally. To make sure you understand how an RNN “works”, you are asked to implement one from the ground up, defining variables yourself and using basic operations such as tf.matmul to define the computations at each time step and over a full input sequence. There are some related tutorials available on the TF website, but all of these use Keras.

For this assignment, you are asked not to use the RNNCell classes nor any related Keras functionality. Instead, you should study the basic RNN equations and “just” translate these into code. You can still use Keras optimizers, losses etc. You can also use Dense layers instead of low-level ops, but make sure you know what you are doing. You might want to proceed as follows:

  • On a high level, nothing about the training loop changes! The RNN gets an input and computes an output. The loss is computed based on the difference between outputs and targets, and gradients are computed and applied to the RNN weights, with the loss being backpropagated through time.
  • Loop over the input, at each time step taking the respective slice. Your per-step input should be batch x features just like with an MLP!
  • At each time step, compute the new state based on the previous state as well as the current input.
  • Compute the per-step output based on the new state.
  • What about comparing outputs to targets? Our targets are simple binary labels. On the other hand, we have one output per time step. The usual approach is to discard all outputs except the one for the very last step. Thus, this is a “many-to-one” RNN (compare figure 10.5 in the book). There are two options for the output layer:
  • You could have an output layer with 2 units, and use sparse categorical cross-entropy as before (i.e. softmax activation). Here, whichever output is higher “wins”.
  • You can have a single output unit and use binary cross-entropy (i.e. sigmoid activation). Here, the output is usually thresholded at 0.5.
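The forward pass outlined in the steps above can be sketched as follows. This is a NumPy illustration only, not the TensorFlow solution you are asked to write: the tanh state update is the standard simple-RNN equation, while the names (`rnn_forward`, `W_xh`, `W_hh`, `W_hy`), the sizes, and the random toy inputs are assumptions. It uses a zero initial state and the single-output (sigmoid) variant.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h, W_hy, b_y):
    """Many-to-one simple RNN.
    inputs: (batch, time, features). Returns the logits of the last step."""
    batch_size, num_steps, _ = inputs.shape
    state = np.zeros((batch_size, W_hh.shape[0]))  # zero initial state
    outputs = []
    for t in range(num_steps):
        x_t = inputs[:, t, :]  # per-step input: batch x features
        # New state from the previous state and the current input.
        state = np.tanh(x_t @ W_xh + state @ W_hh + b_h)
        # Per-step output computed from the new state.
        outputs.append(state @ W_hy + b_y)
    return outputs[-1]  # discard everything except the final step

rng = np.random.default_rng(0)
batch_size, num_steps, num_features, num_hidden = 4, 7, 10, 16
inputs = rng.normal(size=(batch_size, num_steps, num_features))
W_xh = rng.normal(scale=0.1, size=(num_features, num_hidden))
W_hh = rng.normal(scale=0.1, size=(num_hidden, num_hidden))
b_h = np.zeros(num_hidden)
W_hy = rng.normal(scale=0.1, size=(num_hidden, 1))
b_y = np.zeros(1)

last_logits = rnn_forward(inputs, W_xh, W_hh, b_h, W_hy, b_y)
probabilities = 1.0 / (1.0 + np.exp(-last_logits))  # sigmoid
predictions = (probabilities > 0.5).astype(int)     # threshold at 0.5
print(last_logits.shape)
```

In the actual assignment, the same structure would be written with TensorFlow variables and tf.matmul so that gradients can flow back through the loop.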

Food for thought #4: How can it be that we can choose how many outputs we have, i.e. how can both be correct? Are there differences between both choices as well as (dis)advantages relative to each other?

Open Problems

Initial state.

To compute the state at the first time step, you would need a “previous state”, but there is none. To fix this, you can define an “initial state” for the network. A common solution is to simply use a tensor filled with zeros. You could also add a trainable variable and learn an initial state instead!

Food for thought #5: All sequences start with the same special “beginning of sequence” token (coded by index 1). Given this fact, is there a point in learning an initial state? Why (not)?

Computations on padded time steps

Recall that we padded all sequences to be the same length. Unfortunately, the RNN is not aware that we did this. This can be an issue, as we are basically computing new states (thus computing outputs as well as influencing future states) based on “garbage” inputs.

Food for thought #6: pad_sequences allows for pre or post padding. Try both to see the difference. Which option do you think is better? Recall that we use the final time step output from our model.

Food for thought #7: Can you think of a way to prevent the RNN from computing new states on padded time steps? One idea might be to “pass through” the previous state in case the current time step is padding. Note that, within a batch, some sequences might be padded for a given time step while others are not.
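One possible shape of the “pass through” idea is sketched below in NumPy. It assumes that padding is coded by index 0 and that the word indices of the current time step are still available next to the candidate state; both are assumptions, and `masked_state_update` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def masked_state_update(previous_state, candidate_state, step_indices, pad_index=0):
    """Keep the previous state for sequences whose current step is padding,
    and take the freshly computed state everywhere else.
    previous_state, candidate_state: (batch, hidden); step_indices: (batch,)."""
    is_real = (step_indices != pad_index)[:, None]  # (batch, 1), broadcasts
    return np.where(is_real, candidate_state, previous_state)

previous_state = np.zeros((3, 2))
candidate_state = np.ones((3, 2))
step_indices = np.array([5, 0, 8])  # second sequence is padded at this step
new_state = masked_state_update(previous_state, candidate_state, step_indices)
print(new_state)
```

Because the mask is applied per row, padded and non-padded sequences can share the same batch.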

Slow learning

Be aware that it might take several thousand steps for the loss to start moving at all, so don’t stop training too early if nothing is happening. Experiment with weight initializations and learning rates. For fast learning, the goal is usually to set them as large as possible without the model “exploding”.

A major issue with our “last output summarizes the sequence” approach is that the information from the end has to backpropagate all the way to the early time steps, which leads to extreme vanishing gradient issues. You could try to use the RNN output more effectively. Here are some ideas:

  • Instead of only using the final output, average (or sum?) the logits (pre-sigmoid) of all time steps and use this as the output instead.
  • Instead of the logits, average the states at all time steps and compute the output based on this average state. Is this different from the above option?
  • Compute logits and sigmoids for each output, and average the per-step probabilities.
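To see that averaging logits and averaging probabilities are not the same operation, consider this small NumPy sketch. The three per-step logits are made-up numbers for a single review.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

per_step_logits = np.array([-4.0, 0.0, 1.0])  # hypothetical values

# Option 1: average the logits, then apply the sigmoid once.
prob_from_mean_logit = sigmoid(per_step_logits.mean())

# Option 3: apply the sigmoid per step, then average the probabilities.
mean_of_probs = sigmoid(per_step_logits).mean()

print(prob_from_mean_logit, mean_of_probs)
```

With these numbers, the single very negative logit pulls the averaged logit down much more strongly than it pulls the averaged probability, because the sigmoid saturates.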

Food for thought #8: What could be the advantage of using methods like the above? What are disadvantages? Can you think of other methods to incorporate the full output sequence instead of just the final step?

What to hand in

  • A low-level RNN implementation for sentiment classification. If you can get it to move away from 50% accuracy on the training set, that’s a success. Be wary of overfitting, however, as this doesn’t mean that the model is generalizing! If the test (or validation) loss isn’t moving, try using a smaller network. Also note that you may sometimes get a higher test accuracy, while the test loss is also increasing (how can this be?)!
  • Consider the various questions posed throughout the assignment and try to answer them! You can use text cells to leave short answers in your notebook.

Momoumar / JupyterNotebookDownloader.sh

# To download all your programming assignments, including all files and notebooks, follow these steps:
# 1 - Go to the root tree folder, for instance: https://hub.coursera-notebooks.org/user/${user_token}/tree/
# 2 - Open the terminal by clicking the + button in the right-hand corner
# 3 - Enter the following command in the terminal:
tar cvfz allassignments.tar.gz *
# 4 - The previous command creates a gzip-compressed tar archive named allassignments.tar.gz containing all your programming assignments
# 5 - Select allassignments.tar.gz and download it
# 6 - Enjoy, and don't forget to delete it afterward ;-)

subTropic commented Sep 3, 2018

If you add the 'h' option tar will resolve the symlinks to images, (otherwise downloaded image files may be blank where they are symlinks) e.g:

tar cvfzh allassignments.tar.gz *

OsamuB commented Feb 13, 2019

How do I select the file and download it?

ronykroy commented Jul 9, 2019

This is neat. Should have been higher up in the relevant Google search :)

siebenbrunner commented May 17, 2020

Great! If the download fails, try splitting it up into smaller chunks or exclude the files that are too large.

Johan-Liebert1 commented Jun 13, 2020

Can you tell me how to do that. It won't download my file because "the server took too long to respond".

I want to download everything, so how do I split up the final zip file?

kaushikanshul commented Sep 2, 2020

Thanks for this solution. It is supremely useful. There is an npy file which is not getting downloaded even after tar-gz, most likely because of its size. Is there another better compression that will help?

PierreGabioud commented Oct 18, 2020

Split into multiple 50MB files:

split -b 50m allassignments.tar.gz allassignments.tar.gz.

sergii1989 commented Nov 23, 2021

3 major steps:

Compress the directory with assignments and split the archive into batches:

  • tar cvfzh allassignments.tar.gz * && split -b 100M allassignments.tar.gz "allassignments.tar.gz.part"

Download all parts using Jupyter Notebook

Join all batches back to one file (on your local machine):

  • cat allassignments.tar.gz.part* > allassignments.tar.gz

Deep-Learning-Specialization

Coursera deep learning specialization, sequence models.

This course will teach you how to build models for natural language, audio, and other sequence data. Thanks to deep learning, sequence algorithms are working far better than just two years ago, and this is enabling numerous exciting applications in speech recognition, music synthesis, chatbots, machine translation, natural language understanding, and many others.

  • Understand how to build and train Recurrent Neural Networks (RNNs), and commonly-used variants such as GRUs and LSTMs.
  • Be able to apply sequence models to natural language problems, including text synthesis.
  • Be able to apply sequence models to audio applications, including speech recognition and music synthesis.

Week 1: Sequence Models

Learn about recurrent neural networks. This type of model has been proven to perform extremely well on temporal data. It has several variants including LSTMs, GRUs and Bidirectional RNNs, which you are going to learn about in this section.

Assignment of Week 1

  • Quiz 1: Recurrent Neural Networks
  • Programming Assignment: Building a recurrent neural network - step by step
  • Programming Assignment: Dinosaur Island - Character-Level Language Modeling
  • Programming Assignment: Jazz improvisation with LSTM

Week 2: Natural Language Processing & Word Embeddings

Natural language processing with deep learning is an important combination. Using word vector representations and embedding layers you can train recurrent neural networks with outstanding performances in a wide variety of industries. Examples of applications are sentiment analysis, named entity recognition and machine translation.

Assignment of Week 2

  • Quiz 2: Natural Language Processing & Word Embeddings
  • Programming Assignment: Operations on word vectors - Debiasing
  • Programming Assignment: Emojify

Week 3: Sequence models & Attention mechanism

Sequence models can be augmented using an attention mechanism. This algorithm will help your model understand where it should focus its attention given a sequence of inputs. This week, you will also learn about speech recognition and how to deal with audio data.

Assignment of Week 3

  • Quiz 3: Sequence models & Attention mechanism
  • Programming Assignment: Neural Machine Translation with Attention
  • Programming Assignment: Trigger word detection

Course Certificate

Certificate

  • Open access
  • Published: 20 August 2024

An end-to-end deep learning method for mass spectrometry data analysis to reveal disease-specific metabolic profiles

  • Yongjie Deng (ORCID: 0000-0002-4221-7027)
  • Yao Yao (ORCID: 0000-0002-2225-0903)
  • Yanni Wang
  • Tiantian Yu (ORCID: 0009-0005-1848-0748)
  • Wenhao Cai (ORCID: 0009-0002-1218-3935)
  • Dingli Zhou
  • Feng Yin
  • Wanli Liu (ORCID: 0000-0002-7467-760X)
  • Yuying Liu
  • Chuanbo Xie
  • Jian Guan
  • Yumin Hu (ORCID: 0000-0003-3089-9945)
  • Peng Huang (ORCID: 0000-0002-2152-9152)
  • Weizhong Li (ORCID: 0000-0002-9003-7733)

Nature Communications volume 15, Article number: 7136 (2024)

  • Cancer metabolism
  • Machine learning
  • Metabolomics

Untargeted metabolomic analysis using mass spectrometry provides comprehensive metabolic profiling, but its medical application faces challenges of complex data processing, high inter-batch variability, and unidentified metabolites. Here, we present DeepMSProfiler, an explainable deep-learning-based method, enabling end-to-end analysis on raw metabolic signals with output of high accuracy and reliability. Using cross-hospital 859 human serum samples from lung adenocarcinoma, benign lung nodules, and healthy individuals, DeepMSProfiler successfully differentiates the metabolomic profiles of different groups (AUC 0.99) and detects early-stage lung adenocarcinoma (accuracy 0.961). Model flow and ablation experiments demonstrate that DeepMSProfiler overcomes inter-hospital variability and effects of unknown metabolites signals. Our ensemble strategy removes background-category phenomena in multi-classification deep-learning models, and the novel interpretability enables direct access to disease-related metabolite-protein networks. Further applying to lipid metabolomic data unveils correlations of important metabolites and proteins. Overall, DeepMSProfiler offers a straightforward and reliable method for disease diagnosis and mechanism discovery, enhancing its broad applicability.

Introduction

Metabolomics offers a comprehensive view of small molecule concentrations within a biological system and plays a pivotal role in the discovery of disease biomarkers for diagnostic purpose 1 . Liquid chromatography mass spectrometry (LC-MS) is a widely practiced experimental technique in global metabolomic studies 2 , 3 . High sensitivity, stability, reproducibility, and detection throughput are the unique advantages of untargeted LC-MS 4 . Despite its capacity to measure thousands of ion peaks, the conventional metabolomic study by LC-MS remains a challenging task due to laborious data processing such as peak picking and alignment, metabolite annotation by comparing to authenticated databases, and data normalisation to control unwanted variability in a large-scale study 5 , 6 . The broader application of metabolomics in precision medicine may be impeded by obstacles such as complex data processing, high inter-batch variability, and burdensome metabolite identification 7 .

Untargeted metabolomics has been conducted on various human biological fluids, including serum and plasma, for the discovery of biomarkers in cancers such as hepatocellular carcinoma 8 , pancreatic 9 , prostate 10 , and lung cancers 11 , 12 , 13 . However, such biomarker discovery studies utilising metabolomics face significant challenges regarding reproducibility, likely due to signal drifts in cross-batch or cross-platform analysis 14 and the limited integration of data from different laboratory samples 15 . Furthermore, unknown metabolites are excluded when comparing detected features to authenticated databases 16 , which may hinder our ability in discovering new biomarkers associated with diseases. Several previous studies have combined machine learning with LC-MS for in vitro disease diagnosis and improved the efficiency of LC-MS data analysis. For example, Huang et al. conducted machine learning to extract serum metabolic patterns from laser desorption/ionisation mass spectrometry to detect early-stage lung adenocarcinoma 11 ; Chen et al. adopted machine learning models to conduct targeted metabolomic data analysis to identify non-invasive biomarker for gastric cancer diagnosis 17 ; Shen et al. developed a deep-learning-based Pseudo-Mass Spectrometry Imaging method and applied it in the prediction of gestational age of pregnant women, as well as the diagnosis of endometrial cancer and colon cancer 18 . However, these studies still face challenges such as batch effects and unknown metabolites in metabolomics 7 . Consequently, a new analytical approach is urgently needed to overcome the experimental bottlenecks and reveal disease-associated profiles comprising both identified and unknown components derived from LC-MS peaks.

Deep learning has been widely applied in various omics data analyses, holding promise for addressing the complexities of metabolomic data 19 . The encoding and modelling capabilities of deep learning offer a potential solution to overcome the aforementioned bottlenecks in handling intricate and high-dimensional data, mitigating bias in machine learning algorithms 20 , 21 . However, deep learning necessitates high-quality data and a sufficient quantity of samples, otherwise leading to issues like the curse of dimensionality and the overfitting of predictive models 22 . Moreover, integrating large dataset collected from multiple hospitals may introduce significant variations. Furthermore, as deep learning methods are usually perceived as “black-box” processes, the importance of model interpretability for prediction in the context of biomedical research is increasingly recognised 23 , 24 , 25 . Therefore, a deep learning model with both interpretability for biological soundness and capability to mitigate batch effects is highly desirable to enhance the reliability of large-scale metabolomic analyses for diagnostic purposes.

In this study, we develop an ensemble end-to-end deep learning method named as deep learning-based mass spectrum profiler (DeepMSProfiler) for untargeted metabolomic data analysis. We firstly apply this method to differentiate healthy individuals and patients with benign lung nodules or lung adenocarcinoma using 859 serum samples from three distinct hospitals, followed by its extended analysis on lipid metabolomic data derived from 928 cell lines to reveal metabolites and proteins associated with multiple cancer types. Without the process of peak extraction and identification as well as potential errors by conventional machine learning approaches, our method directly converts raw LC-MS data into outputs such as predicted classification, heatmaps illustrating key metabolite signals specific to each class, and metabolic networks that influence the predicted classes. Importantly, DeepMSProfiler effectively removes undesirable batch effects and variations across different hospitals and infers the unannotated metabolites associated with specific classifications. Furthermore, it leverages an ensemble-model strategy that optimises feature attribution from multiple individual models. DeepMSProfiler achieved an area under the receiver operating characteristic curve (AUC) score of 0.99 in an independent testing dataset, along with an accuracy of 96.1% in detecting early-stage lung adenocarcinoma. The results are explainable through locating relevant biological components as contribution factors to prediction. Our method provides a straightforward and reliable approach for metabolomic applications in disease diagnosis and mechanism discovery.

The overview of the ensemble end-to-end deep-learning model

The DeepMSProfiler method includes three main components: the serum-based mass spectrometry, the ensemble end-to-end model, and the disease-related LC-MS profiles (Fig.  1a ). In the first component, the raw LC-MS-based metabolomic data was generated using 859 human serum samples (Fig.  1a left) collected from 210 healthy individuals, 323 benign lung nodules, and 326 lung adenocarcinomas. The space of the LC-MS raw data contains three dimensions: retention time (RT), mass-to-charge ratio (m/z), and intensity. Using the RT and m/z dimensions, the data can be mapped from three-dimensional space into the frequency and time domains, respectively (Fig.  1b left to middle). Ion current maps and primary mass spectra can then be generated and used for metabolite identification (Fig.  1b middle to right). Conventional step-by-step methods of metabolomic analysis 5 , 22 (Supplementary Fig.  1 top) may lead to a large number of lost metabolic signals. To address these issues, DeepMSProfiler directly takes untargeted LC-MS raw data as model input, and builds an end-to-end deep learning model to profile disease-related metabolic signals (Supplementary Fig.  1 bottom).

Fig. 1

a The overview of DeepMSProfiler. Serum samples of different populations (top left) were collected and sent to the instrument (bottom left) for liquid chromatography-mass spectrometry (LC-MS) analysis. The raw LC-MS data, containing information on retention time (RT), mass-to-charge ratio (m/z), and intensity, is used as input to the ensemble model (middle). Multiple single convolutional neural networks form the ensemble model (centre) to predict the true label of the input data and generate three outputs (right), including the predicted sample classes, the contribution heatmaps of classification-specific metabolic signals, and the classification-specific metabolic networks. b The data structure of raw data. The mass spectra of different colours (centre) represent the corresponding m/z and ion intensity of ion signal groups recorded at different RT frames. All sample points are distributed in a three-dimensional space (left) which can be mapped along three axes to obtain chromatograms, mass spectra, and two-dimensional matrix data. Chromatograms and mass spectra are used for conventional qualitative and quantitative analysis (right), while the two-dimensional matrix serves as input data for convolutional neural networks. c The structure of a single end-to-end model. The input data undergoes the pre-pooling processing to reduce dimensionality and become three-channel data. As the model passes through each convolutional layer (conv) in the feature extractor module, the weights associated with the original signals change continuously. The sizes of different frames in the enlarged layers (top) represent different receptive fields, with DenseNet allowing the model to generate more flexible receptive field sizes. After the last fully connected layer (FC), the classifications are resulted.

The main model adopts an ensemble strategy and consists of multiple sub-models (Fig.  1a middle). The ensemble strategy is considered to be able to provide better generalisation 26 according to the complexity of the hypothesised space 27 , the local optimal search 27 and the parameter diversity 28 , 29 by the random training process, as well as the bias-variance theory 30 , 31 , 32 . The structure of each sub-model consists of three parts: a pre-pooling module, a feature extraction module, and a classification module (Fig.  1c ). The pre-pooling module transforms three-dimensional data into two-dimensional space through a max-pool layer, effectively reducing dimensionality and redundancy while preserving global signals (Fig.  1c left). The feature extraction module is based on a convolutional neural network to perform classification tasks by extracting category-related features (Fig.  1c middle). DenseNet121 is chosen as the backbone of the feature extraction module due to its highest accuracy and the least number of parameters (Supplementary Data  1 ). Additionally, the design of densely connected convolutional networks in DenseNet121 33 makes the model more flexible in terms of receptive field size, allowing adaptation to different RT intervals of metabolic signal peaks. After the non-linear transformation by the feature extraction layer, different weights could be assigned to the input original signal peaks for subsequent classification. We evaluated the effect of the number of sub-models on improving performance and consequently chose to use 18 sub-models for DeepMSProfiler (Supplementary Fig.  2 ). The classification module implements a simple dense neural network to compute the probabilities of different classes (Fig.  1c right).

In the third component, each input sample results in three outputs from the model, including the predicted classification of the sample, the heatmap for the locations of the key metabolic signals, and the metabolic network that influences the predicted category (Fig.  1a right). In the predicted classification, the category with the highest probability is assigned as the predicted label of the model. In the heatmap presentation, the key metabolic signals associated with different classifications are inferred by the perturbation-based method, and the m/z and RT of the key metabolic signals can be located. Finally, the model infers the underlying metabolite-protein networks and metabolic pathways associated with these key metabolic signals directly from m/z (see Methods).
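The prediction step can be sketched as follows. This is a minimal illustration in Python, assuming that each sub-model outputs a vector of class probabilities and that the ensemble averages them with equal weights before taking the most probable class. The function name `ensemble_predict` and the example probabilities are hypothetical and are not taken from the DeepMSProfiler code.

```python
from typing import List, Sequence, Tuple


def ensemble_predict(sub_model_probs: Sequence[Sequence[float]]) -> Tuple[List[float], int]:
    """Average class probabilities over sub-models (equal weights) and pick the top class.

    sub_model_probs[k][c] is the probability that sub-model k assigns to class c.
    Assumption: probabilities are averaged; the source does not say whether
    averaging or voting is used.
    """
    n_models = len(sub_model_probs)
    n_classes = len(sub_model_probs[0])
    mean_probs = [
        sum(probs[c] for probs in sub_model_probs) / n_models
        for c in range(n_classes)
    ]
    # The class with the highest averaged probability becomes the predicted label.
    predicted = max(range(n_classes), key=lambda c: mean_probs[c])
    return mean_probs, predicted


# Hypothetical outputs of three sub-models for classes (Healthy, Benign, Malignant)
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.7, 0.2],
]
mean_probs, label = ensemble_predict(probs)
print(label)
```

With equal weights, a sub-model that is confidently wrong (the first row here) can be outvoted by the others, which is one intuition for why the ensemble is more stable than a single model.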

Model performance metrics

To test DeepMSProfiler’s ability to classify disease states and discover disease-related profiles, we performed a global metabolomic analysis of serum samples from healthy individuals and patients with benign lung nodules or lung cancer. We collected serum samples from three different hospitals to construct and validate the model. Benign nodule cases were followed for up to 4 years and lung adenocarcinoma samples were pathologically examined (see Methods). We built the model using 859 untargeted LC-MS samples, of which 686 formed the discovery dataset and 173 the independent testing dataset (Fig. 2a). The samples were generated from 10 batches, as shown in Supplementary Table 1. Statistical analysis shows a significant correlation between lesion size and disease type, but the correlations between other clinical factors are not significant (Supplementary Table 2). To avoid confounding effects by clinical features, we performed a further distribution analysis, which shows no significant difference in the distribution of lesion diameter and patient age in both the discovery dataset and the independent testing dataset (Supplementary Fig. 3). The samples in the discovery dataset were randomly divided into two subsets: a training dataset (80%) for parameter optimisation and a validation dataset (20%) to cross-validate the performance of different models. Remarkably, the accuracies of DeepMSProfiler in the training dataset, validation dataset, and independent testing dataset are 1.0, 0.92, and 0.91, respectively (Supplementary Table 3).

figure 2

a The sample allocation chart. The outer ring indicates the types of diseases and the inner ring indicates the sex distribution. Healthy: healthy individuals; Benign: benign lung nodules; Malignant: lung adenocarcinoma. b Predicted receiver operating characteristic (ROC) curves of different methods. Random: performance baseline in a random state. Comparison of performance metrics of different methods ( n  = 50): accuracy ( c ), precision ( d ), recall ( e ), and F1 score ( f ). The blue areas show the different conventional analysis processes using machine learning methods, and the red areas display different end-to-end analysis processes using deep learning methods. The boxplot shows the minimum, first quartile, median, third quartile and maximum values, with outliers as outliers. g Model accuracy rates for different age groups. The sample sizes for different groups are 52, 69, 40, and 12, respectively. h Model accuracy rates for different lesion diameter groups. The sample sizes for different groups are 27, 37, 18, 13, and 34, respectively. The boxplot shows the minimum, first quartile, median, third quartile and maximum values. i Prediction accuracy and parameter scale of different model architectures. j The confusion matrix of the DeepMSProfiler model. The numbers inside the boxes are the number of matched samples between the true label and the predicted label. The ratio in parentheses is the number of matched samples divided by the number of all samples of the true label.

In the independent testing dataset, DeepMSProfiler significantly outperforms traditional methods and single deep learning models (Fig.  2b–h ). Compared with Support Vector Machine (SVM), Random Forest (RF), Deep Learning Neural Network (DNN), Adaptive Boosting (AdaBoost), and Extreme Gradient Boosting (XGBoost) based on traditional methods, and Densely Connected Convolutional Networks (DenseNet121) using raw data, DeepMSProfiler presents the highest areas under the curve (AUC) of 0.99 (Fig.  2b ). Notably, DeepMSProfiler exhibits higher specificity than other models while maintaining high sensitivity, indicating its ability to accurately identify true negatives (Supplementary Table  4 ).

Regarding overall performance against other models, our model achieves the best performance in multiple evaluation metrics: accuracy of 95% (95% CI, 94%–97%) (Fig.  2c ), precision of 96% (95% CI, 94%–97%) (Fig.  2d ), recall of 95% (95% CI, 94%–96%) (Fig.  2e ), and F1 of 98% (95% CI, 97%–98%) (Fig.  2f ). Compared to XGBoost, our model performs better in different groups of lesion sizes and ages (Fig.  2g, h ), except for samples from patients over 70 years old. DeepMSProfiler is also superior to commonly used single deep learning models, such as DenseNet121 (Fig.  2i ). When using the ensemble strategy, we did not set different weights for each sub-model, so each sub-model is equally involved in the final prediction. The confusion matrices of prediction performance for each sub-model (Supplementary Fig.  4 ) show their contributions to the overall results. The robust performance, coupled with the efficiency in terms of computational resources, makes DeepMSProfiler a promising choice for the classification tasks (Fig.  2i ).
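Each metric above is reported as a mean with a 95% confidence interval over repeated runs (n = 50 in Fig. 2). The text does not state how the intervals were computed; one common choice, shown here purely as an assumed sketch, is the normal approximation mean ± 1.96·s/√n. The helper name `mean_ci95` and the example values are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_ci95(values: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) of a normal-approximation 95% CI.

    Assumption: runs are independent and n is large enough for the
    normal approximation; needs at least two values.
    """
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)          # sample standard deviation
    half_width = 1.96 * sd / math.sqrt(n)  # 1.96 is the 97.5% normal quantile
    return mean, mean - half_width, mean + half_width


# Hypothetical accuracies from repeated runs (not the paper's data)
runs = [0.94, 0.96, 0.95, 0.97, 0.93]
mean, low, high = mean_ci95(runs)
print(f"{mean:.3f} (95% CI, {low:.3f}-{high:.3f})")
```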

Furthermore, DeepMSProfiler demonstrates consistent performance across different categories. The AUCs for lung adenocarcinoma, benign lung nodules, and healthy individuals all reach 0.99 (Supplementary Fig. 5), and their respective classification accuracies are 85.7%, 90.8%, and 97.0% (Fig. 2j). Most importantly, our model performs well in detecting stage I lung adenocarcinoma, with an accuracy of 96.1%, indicating its potential as an effective method for early lung cancer screening.
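Per-class accuracies of this kind can be read off a confusion matrix by dividing each diagonal entry by its row total, which is the ratio the Fig. 2j caption describes. A minimal sketch, with rows as true labels and columns as predicted labels; the counts are made-up toy values, not the paper's data.

```python
from typing import List, Sequence


def per_class_accuracy(confusion: Sequence[Sequence[int]]) -> List[float]:
    """Matched samples divided by all samples of each true label (rows = true labels)."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


# Toy 3-class matrix (hypothetical counts): Healthy, Benign, Malignant
toy = [
    [18, 2, 0],
    [1, 17, 2],
    [0, 4, 16],
]
print(per_class_accuracy(toy))  # [0.9, 0.85, 0.8]
```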

Insensitivity to batch effects

Batch effects are one of the most common sources of error in the analysis of metabolomic data. To evaluate the impact of batch effects on the non-targeted LC-MS data, we first generated 3 biological replicates as reference samples for each of the 10 batches. These reference replicates were taken from a mixture of 100 healthy human serum samples, and each of them contains equal amounts of isotopes including ¹³C-lactate, ¹³C₃-pyruvate, ¹³C-methionine, and ¹³C₆-isoleucine (see Methods). The differences in the data structure of reference replicates from different batches are visualised in 3D and 2D illustrations (Fig. 3a and Supplementary Fig. 6), which indicate changes in peak shape and area as well as RT shifts among different batches. Comparison of individual isotopic peaks in samples from different batches also shows that the batch effects mainly take the form of RT shifts and differences in peak shapes and areas for the same metabolites (Fig. 3b).

figure 3

a Batch effects in 3D point array and 2D mapped heatmap of reference samples. RT: retention time; m/z: mass-to-charge ratio. b Isotope peaks of the same concentration in different samples. Different colours represent the batches to which the samples belong. c The visualisation of dimensionality reduction of normalisation by the Reference Material method. Below: different colours represent different classes; Above: different colours indicate different batches. Healthy: healthy individuals; Benign: benign lung nodules; Malignant: lung adenocarcinoma. d The visualisation of dimensionality reduction for the output data of the hidden layers in DeepMSProfiler. Conv1 to Conv5 are the outputs of the first to the fifth pooling layer in the feature extraction module. Block4 and Block5 are the outputs of the fourth and fifth conv layers in the fifth feature extraction module. Upper: different colours indicate different sample batches; Lower: different colours represent different population classes. e Correlation of the output data of the hidden layer with the batch and class information in DeepMSProfiler. The horizontal axis represents the layer names. Conv1 to Conv5 are the outputs of the first to the fifth pooling layer in the feature extraction module. Block10 and Block16 are the outputs of the tenth and sixteenth conv layers in the fifth feature extraction module. The blue line represents the batch-related correlations, and the orange line illustrates the classification-related correlations. f The accuracy rates of traditional methods (blue), corrected methods based on reference samples (purple), and DeepMSProfiler (red) in independent testing dataset ( n  = 50). The boxplot shows the minimum, first quartile, median, third quartile and maximum values, with outliers as outliers.

We then investigated the batch effect corrections and compared the performance between DeepMSProfiler and conventional correction methods (see Methods). As shown in Fig. 3c, after correction by the Reference Material (Ref-M) method 34, we still observed 3 clusters in the principal component analysis (PCA) profiles, which represent the sample dots from three different hospitals. While Ref-M effectively addresses the batch effect within samples from the same hospital, the residual variation across hospitals remains (Fig. 3c). Samples of batches 1–6 and 9–10 were obtained from the Sun Yat-Sen University Cancer Centre, and samples of batch 8 came from the First Affiliated Hospital of Sun Yat-Sen University. Samples of batch 7 were a mixture from three different hospitals. Among them, lung cancer samples came from the Affiliated Cancer Hospital of Zhengzhou University, lung nodule samples came from the First Affiliated Hospital of Sun Yat-Sen University, and healthy samples came from the Sun Yat-Sen University Cancer Centre. The 100 healthy human serum samples used as reference during conventional procedures were all from the Sun Yat-Sen University Cancer Centre, which might be the main reason why it is difficult to correct the batch effect for batches 7–9. To illustrate DeepMSProfiler’s end-to-end process in the automatic removal of batch effects, we extracted the output of the hidden layer and visualised the flow of data during the forward propagation of the network (see Methods). From the input layer to the output layer, the similarity between different batches becomes progressively higher, while the similarity between different types becomes progressively lower (Fig. 3d and Supplementary Fig. 7). DenseNet121 is a deep neural network with 431 layers, of which 120 are convolutional layers (Supplementary Data 1). The fourth and fifth layers refer to the output of the fourth densely connected module and the output of the fifth densely connected module. There are 112 layers between them. Figure 3d and Supplementary Fig. 7 illustrate the intermediate change process from the output of the fourth densely connected module to the output of the fifth densely connected module. When the batch effects are removed, the classification becomes clearer. We further quantified this process of change using different metrics that measure the correlation between the PCA clusters and the given labels, i.e., K-nearest neighbour batch effect test score 35, local inverse Simpson’s index 36, adjusted rand index (ARI), normalised mutual information (NMI), and average silhouette coefficient (ASC) (see Methods). The closer to the output layer, the less relevant the data is to batch labels and the more relevant to class labels (Fig. 3e). This explains how the batch effect removal is achieved via progress through hidden layers (Fig. 3d). This capability might be gained via supervised learning. Our findings suggest that in the forward propagation process, the DeepMSProfiler model excludes batch-related information from the input data layer by layer, while retaining class-related information.
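Of the metrics listed, the adjusted Rand index is simple enough to write out in full. The sketch below is a plain-Python version of the standard pair-counting formula and is not the authors' implementation; the function name and example labels are illustrative. Comparing cluster assignments with batch labels, and separately with class labels, gives one number per hidden layer for each.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Adjusted Rand index between two labelings of the same samples (n >= 2)."""
    n = len(labels_a)
    # Pairs of samples that share a label in both labelings
    sum_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs that share a label within each labeling on its own
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:               # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_joint - expected) / (max_index - expected)


clusters = [0, 0, 1, 1]
print(adjusted_rand_index(clusters, ["x", "x", "y", "y"]))  # 1.0: clusters follow these labels
print(adjusted_rand_index(clusters, ["x", "y", "x", "y"]))  # -0.5: worse than chance agreement
```

A value near 1 means the clusters follow the given labels, and a value near 0 means agreement no better than chance.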

Further, we compared the performance of our deep learning method against machine learning methods with and without batch effect correction. The correction of batch effects could improve accuracies when using machine learning classifiers such as SVM, RF, AdaBoost, and XGBoost. However, DeepMSProfiler, without any additional manipulation, surpassed the machine learning methods with or without batch effect correction in terms of prediction accuracy (Fig.  3f ).

The impact of unknown mass spectrometric signals

To investigate the impact of unknown metabolite signals on classification prediction, we first performed the conventional analysis with peak extraction and metabolite annotation using existing databases such as Human Metabolome Database (HMDB) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (see Methods). We found that 83.5% of all detected features remain as unknown metabolites (Fig.  4a ). The absence of these unknown metabolites undermines the prediction accuracy (Fig.  4b ), indicating this large number of unknown metabolites may impose a significant impact on classification performance. One of the advantages of our approach over traditional methods is the ability to retain complete metabolomic features including the unknown metabolites.

figure 4

a Statistics of annotated metabolite peaks. Blue colour represents all peaks, while orange, purple and yellow colours indicate metabolites annotated in HMDB, KEGG, and all databases, respectively. The overlap between orange and purple includes 414 metabolites annotated in both HMDB and KEGG. HMDB: Human Metabolome Database; KEGG: Kyoto Encyclopedia of Genes and Genomes. b The feature selection plot illustrates the effect of different contribution score thresholds when removing versus not removing unknown metabolites. The horizontal axis represents the change in threshold, while the vertical axis shows the accuracy of the model using the remaining features. The shadings of solid lines (mean) represent error bars (standard deviation). c Collection criteria for published lung cancer serum metabolic biomarkers. SCLC: Small Cell Lung Cancer; LUSC: Lung Squamous Cell Carcinoma; LUAD: Lung Adenocarcinoma; NMR: Nuclear Magnetic Resonance; MS: Mass Spectrometry. d The counts of known biomarkers published in the current literature. e Molecular weight distribution plot of known biomarkers. f Accuracy comparison between the ablation experiment and DeepMSProfiler (n = 50). The boxplot shows the minimum, first quartile, median, third quartile and maximum values. In the ablation experiment, we investigated the effect of varying the publication count (PC) of known biomarkers in the literature. Specifically, based on the m/z of known biomarkers, we eliminated from the original data the metabolic signals that had not been reported. We retained only the metabolic signals with publication counts greater than 1, greater than 3, and greater than 8 for model development. All ablated data were analysed using the same DeepMSProfiler architecture as the original unprocessed data. The vertical axis shows the accuracy of models built on the datasets of different publication counts and of our DeepMSProfiler.

We then tested the limitations of lung cancer prediction using biomarkers identified from annotated metabolites. We collected 826 biomarkers for lung cancer based on serum mass-spectrometry analysis from 49 publications (Fig. 4c). We deduced the molecular weight of the biomarkers from the HMDB database and the information in these publications (see Methods). Only 42.7% of the biomarkers discovered based on traditional methods appear in more than two articles, and their reproducibility is suboptimal (Fig. 4d). In addition, the molecular weights of these metabolites are mainly distributed in the range of 200–400 Da (Fig. 4e). We found that the prediction of DeepMSProfiler using complete raw data is more accurate than that using only the corresponding m/z signals of the reported biomarkers (Fig. 4f). This indicates that there are still unknown metabolomic signals in the serum samples related to lung cancer that have not been unveiled in current research. In contrast, DeepMSProfiler derives the classifications directly from the complete signals in the raw data.

Explainability to uncover the black box

After the model construction based on deep learning, we sought to explain the classification basis of the black box model and identify the key signals for specific classifications. We adopted a perturbation-based method, Randomised Input Sampling for Explanation (RISE) 37 , to count feature contributions. We slightly modified RISE to improve its operational speed and efficiency, and developed a method to evaluate the importance of RISE scores in different classifications (see Methods).
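The core idea of RISE can be sketched in a few lines: mask the input at random, record the model's score for the class of interest, and average the masks weighted by those scores. The version below is a simplified illustration in plain Python. It assumes independent per-cell binary masks, without the mask upsampling used in the original RISE and without the authors' speed modifications; `rise_saliency` and the toy model are hypothetical.

```python
import random
from typing import Callable, List


def rise_saliency(
    model: Callable[[List[List[float]]], float],
    x: List[List[float]],
    n_masks: int = 2000,
    keep_prob: float = 0.5,
    seed: int = 0,
) -> List[List[float]]:
    """Estimate per-cell importance of a 2D input (e.g. m/z by RT) for one class score."""
    rng = random.Random(seed)
    height, width = len(x), len(x[0])
    saliency = [[0.0] * width for _ in range(height)]
    for _ in range(n_masks):
        # Keep each cell with probability keep_prob, zero it otherwise
        mask = [[1 if rng.random() < keep_prob else 0 for _ in range(width)]
                for _ in range(height)]
        masked = [[x[i][j] * mask[i][j] for j in range(width)] for i in range(height)]
        score = model(masked)
        # Cells that were visible when the score was high accumulate more weight
        for i in range(height):
            for j in range(width):
                saliency[i][j] += score * mask[i][j]
    norm = n_masks * keep_prob
    return [[value / norm for value in row] for row in saliency]


# Toy "model": the class score depends on a single cell only
toy_input = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
toy_model = lambda masked: masked[0][1]
scores = rise_saliency(toy_model, toy_input)
print(max((scores[i][j], (i, j)) for i in range(2) for j in range(3))[1])  # (0, 1)
```

Because the method only needs forward passes, it treats the network as a black box, which is why it can be applied to an ensemble as easily as to a single model.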

Interestingly, we found a “background category” phenomenon in some of the single models (Fig.  5a ). For each class in a single tri-classification model of DenseNet121, the classification performance gradually deteriorates as the metabolic signals with higher contribution scores are removed. However, there is one category that is always unaffected by all the features involved in the classification decision, while maintaining a very high number of true positives and false positives. These findings imply that the tri-classification model only predicts the probabilities of two of the categories, and calculates the probability of the third category from the results of the other two. In other words, the classification-related metabolites associated with the “foreground category” negatively contribute to the “background category”. Furthermore, the categories used as “background category” are not consistent across the different models. We were intrigued to test whether this phenomenon occurs exclusively in metabolomic data, so we conducted a seven-classification task using the Photo-Art-Cartoon-Sketch (PACS) image dataset 38 . We observed a similar phenomenon in the resultant feature scoring of different models (Supplementary Fig.  8 ). This suggests that “background category” may generally exist in multi-classification task by single models, although its underlying mechanism is currently unclear and may require future investigation.

figure 5

a Prediction performance and feature scoring by different single models. b Prediction performance and feature scoring by DeepMSProfiler. c Heatmap matrices of classification contribution in healthy individuals (Healthy), benign lung nodules (Benign), and lung adenocarcinoma (Malignant). The horizontal and vertical axes of the matrix are the prediction label and the true label, respectively. The heatmaps of upper left, the middle one, and the bottom right represent the true healthy individuals, the true benign nodules, and the true lung adenocarcinoma, respectively. The horizontal and vertical axes of each heatmap are RT and m/z, respectively. The classification contributions of metabolites corresponding to true healthy individuals ( d ), benign nodules ( e ), and lung adenocarcinoma ( f ). The horizontal axis represents the retention time and the vertical axis represents the m/z of the corresponding metabolites. The colours represent the contribution score of the metabolites. The redder the colour, the greater the contribution to the classification. Metabolites-proteins network for healthy individuals ( g ), benign nodules ( h ), and lung adenocarcinoma ( i ). Pathway enrichment analysis using the signalling networks for healthy individuals ( j ), benign nodules ( k ), and lung adenocarcinoma ( l ) (FDR < 0.05). m/z: mass charge ratio, RT: retention time.

In contrast, the “background category” phenomenon no longer exists when the feature contributions are calculated by our ensemble model. As shown in Fig. 5b, when we progressively eliminated metabolic signals in each category according to their contribution, the performance of the ensemble model decreased accordingly. Each individual model captures different features that contribute to its corresponding classifications, while the ensemble model can combine these features to improve the accuracy of disease prediction and reduce the possibility of overfitting.

Metabolomic profiles in lung adenocarcinoma, benign nodules, and healthy individuals

To analyse the global metabolic differences between lung adenocarcinoma, benign nodules, and healthy individuals, we extracted the heatmaps of feature contributions counted by RISE from DeepMSProfiler (Fig. 5c). As shown in Fig. 5d–f, the horizontal and vertical axes of the heatmaps represent retention time and m/z, respectively. By mapping the label information to the heatmaps, we are able to locate the metabolites corresponding to different m/z and retention times to obtain their feature contribution scores. In true-positive healthy and benign nodule samples, the metabolic signals with the most significant contribution are uniformly located between 200 and 400 m/z and in 1–3 min (Fig. 5d, e). In comparison, the metabolic signals located between 200 and 600 m/z and in 1–4 min contribute most in lung adenocarcinoma samples, but signals in other regions also have relatively high scores (Fig. 5f).

As higher contribution scores in the heatmaps represent more important correlations, we screened signals with scores above 0.70 and attempted to identify the corresponding metabolic profiles in each classification (Supplementary Data 2). As observed in Fig. 5b, when only metabolic signals with a contribution score above 0.70 are retained, the overall accuracy is around 0.8, so classification remains effective. Considering the RT shift among different batches, we matched metabolic peaks only by m/z. We then fed these m/z signals, together with metabolites identified by tandem mass spectrometry (MS2), into the analysis tool PIUMet based on protein–protein and protein–metabolite interactions 39 to build disease-associated feature networks (Fig. 5g–i). As shown in the network in Fig. 5i, 82 proteins and 121 metabolites are matched in the lung adenocarcinoma samples, including 9 already identified by MS2 and 111 hidden metabolites found by the correlation between key metabolic peaks. As such, the current analysis based on protein–protein and protein–metabolite interactions allows the discovery of unknown metabolic signals associated with diseased states, although the resolution of the current model might be relatively low in distinguishing all individual peaks contributing to the disease classification. To explore the biological explainability, among the features extracted by PIUMet, we also selected 11 metabolites (Supplementary Table 5) with available authentic standards in our laboratory to verify their presence in the lung cancer serum samples. Indeed, these metabolites could be identified in the lung cancer serum as described in our previous study 40. We further analysed the metabolic networks to explore the biological relevance associated with each classification. The heatmaps (Fig. 5d, e) and pathway analysis (Fig. 5j, k) consistently show that healthy individuals and benign nodules share similar metabolic profiles. In contrast, the cancer group presents a distinct profile with specific pathways and increasing counts of metabolites or proteins in the pathways shared with healthy individuals or benign nodules (Fig. 5f, l). The detailed metabolite and protein candidates are shown in Supplementary Data 3. Taken together, our network and pathway analyses demonstrate the interpretability of DeepMSProfiler based on deep learning.
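The screening and matching steps described above (keep signals with a contribution score above 0.70, then match by m/z only) can be sketched as below. This is an illustration under stated assumptions: the 10 ppm tolerance, the tuple layout, the function names and the example values are all hypothetical, and PIUMet applies its own matching internally.

```python
from typing import List, Sequence, Tuple

Signal = Tuple[float, float, float]  # (m/z, retention time in min, contribution score)


def select_key_mz(signals: Sequence[Signal], threshold: float = 0.70) -> List[float]:
    """Keep the m/z of signals scoring above the threshold; RT is ignored on purpose."""
    return sorted({round(mz, 4) for mz, _rt, score in signals if score > threshold})


def match_by_mz(query_mz: float, reference: Sequence[Tuple[str, float]], ppm: float = 10.0) -> List[str]:
    """Return reference names whose m/z lies within a ppm tolerance of the query."""
    return [
        name for name, ref_mz in reference
        if abs(query_mz - ref_mz) / ref_mz * 1e6 <= ppm
    ]


# Hypothetical signals: the first two share an m/z but differ in RT (batch shift)
signals = [(205.0972, 1.5, 0.91), (205.0972, 1.9, 0.85), (300.1000, 2.0, 0.40)]
reference = [("ref_A", 205.0975), ("ref_B", 205.2000)]

key_mz = select_key_mz(signals)
print(key_mz)                             # [205.0972]
print(match_by_mz(key_mz[0], reference))  # ['ref_A']
```

Dropping RT from the match makes the lookup robust to the RT shifts between batches, at the cost of not separating isomers that share an m/z.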

Application of model in colon cancer

Considering the transferability of DeepMSProfiler, we obtained a public colon cancer LC-MS dataset which contained 236 samples from MetaboLights (ID is MTBLS1129). There are 197 colon cancer samples and 39 healthy human samples in the dataset. We randomly divided this dataset into discovery dataset and independent testing dataset at 4:1 ratio. The discovery dataset contained 157 colon cancer samples and 31 healthy control samples. The independent testing dataset contained 40 colon cancer patient samples and 8 healthy control samples.
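The reported split sizes are consistent with a per-class 4:1 split in which the test count is rounded up. The sketch below assumes a stratified random split. The text only states that the split was random, so the stratification, the rounding rule and the function name `stratified_split` are assumptions.

```python
import math
import random
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split sample indices into (discovery, testing), class by class."""
    rng = random.Random(seed)
    by_class: Dict[Hashable, List[int]] = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    discovery: List[int] = []
    testing: List[int] = []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = math.ceil(len(indices) * test_fraction)  # round the test share up
        testing.extend(indices[:n_test])
        discovery.extend(indices[n_test:])
    return discovery, testing


labels = ["cancer"] * 197 + ["healthy"] * 39
discovery, testing = stratified_split(labels)
print(len(discovery), len(testing))                     # 188 48
print(sum(labels[i] == "cancer" for i in testing))      # 40
print(sum(labels[i] == "healthy" for i in discovery))   # 31
```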

Due to the differences in cancer types and mass spectrometry analysis procedures between the colon cancer dataset and the lung adenocarcinoma dataset, we re-trained the DeepMSProfiler model. The colorectal cancer data was randomly divided into a discovery dataset and an independent testing dataset, and the discovery dataset was further randomly divided into a training dataset and a validation dataset multiple times. In the independent testing dataset of the colon cancer dataset, our model achieved an accuracy of 97.9% (95% CI, 97.7%–98.1%), a precision of 98.7% (95% CI, 98.6%–98.8%), a recall of 93.4% (95% CI, 92.9%–94.1%), and an F1 of 95.8% (95% CI, 95.4%–96.2%) (Supplementary Fig. 9). These results suggest an excellent transferability of DeepMSProfiler.

Discovery of metabolic-protein networks in pan-cancer

In a continued effort to investigate the capabilities of DeepMSProfiler in analysing metabolomics data across multiple cancer types, raw lipid metabolomic data of 928 cell lines spanning 23 cancer types were collected from the Cancer Cell Line Encyclopaedia (CCLE) database 2 and then subjected to processing by DeepMSProfiler. Notably, in addition to the raw metabolomic data, these cell lines also contain valuable data on annotated metabolites, methylation, copy number variations, and mutations 2. DeepMSProfiler constructed a model encompassing the 23 distinct categories, followed by feature extraction from the 23-category model to identify the respective crucial metabolic signals of each category. Due to the limited number of samples for many cancer types, particularly for biliary tract, pleura, prostate, and salivary gland cancers, each with fewer than 10 samples, we did not set aside a separate independent testing dataset for performance validation. Twenty sub-models were trained, and in each sub-model training, 80% of all samples were randomly allocated for training to ensure that every sample could contribute to the training process, especially for cancer types with very few samples. The final ensemble model used for explainable analysis achieved 99.3% accuracy, 97.2% sensitivity, and 100% specificity. Next, the prize-collecting Steiner forest optimisation algorithm 39 was employed to unveil the correlation between pivotal metabolic signals and proteins using the databases HMDB 41, Recon2 42 and iRefIndex 43 (see Methods).

As a result, we successfully generated disease-specific metabolite-protein networks (Fig. 6a–c) along with a contribution score heatmap (Fig. 6e), where contribution scores exceeding 0.70 were considered indicative of disease-specific metabolites. Metabolites identified within the metabolite-protein network were directly inferred from the mass-to-charge ratio (m/z) of metabolic signals from the raw data using feature spectra extracted by the DeepMSProfiler model. Notably, we identified 14 metabolites and 3 proteins that exhibited co-occurrence within the 23 cancer-related metabolite-protein networks (Fig. 6d). Finally, we correlated the metabolic data and the methylation information and subsequently verified the associations between the PLA and UGT gene families and the disease-specific metabolites of high contribution (Fig. 6f). Previous studies 44, 45, 46, 47, 48 have reported the important roles of PLA and UGT gene families in a variety of diseases, such as PLA2G7 and PLA2G6 in breast and prostate cancers and neurodegenerative diseases, as well as UGT3A2 in head and neck cancers. This evidence supports our findings by DeepMSProfiler. In summary, our extended analysis spanning pan-cancer scenarios highlights the capability of DeepMSProfiler in the discovery of potential disease-associated metabolites and proteins.

figure 6

Metabolite-protein networks for ( a ) lung cancer, ( b ) gastric cancer, and ( c ) leukaemia. Yellow squares: metabolites. Red circles: proteins. Blue labels: metabolites and proteins shared in 23 cancer metabolite-protein networks. d Metabolites and proteins shared in the metabolite-protein networks of 23 cancer types. e Heatmap of the classification contribution of different lipid metabolites across 23 cancer types. f Correlation of important pan-cancer-related metabolites with methylation of the PLA and UGT gene families.

Discussions

Metabolomics faces challenges in precision medicine due to complex analytical processes, metabolic diversity, and database limitations 5, 6. DeepMSProfiler starts with raw untargeted metabolomic data and retains essential information, enabling more effective global analysis. It offers an alternative approach by directly processing raw data of metabolomic samples, bypassing time-consuming experiments such as quality control or reference sample preparation and subsequent normalisation analysis.

In metabolomic study, systematic variations in the measured metabolite profiles may occur during sample collection, processing, analysis, or even in different batches of reagents or instruments. Batch effects can significantly impact the interpretation of the results, leading to inconsistencies in replicating findings across different studies 15 . While batch effects can manifest as variations in retention time (RT) offset, peak area, and peak shape, conventional quantitative methods often prioritise peak area integration while overlooking peak shape 49 , 50 . Significantly, our results demonstrate that DeepMSProfiler is able to automatically eliminate cross-hospital variations during the end-to-end forward propagation process (Fig.  3d ), effectively revealing classification profiles.

Moreover, DeepMSProfiler can address the challenges of unidentified metabolites. LC-MS metabolomics can reveal tens of thousands of metabolite peaks in a biological sample. A substantial number of these peaks remain unidentified or unannotated in existing databases. In this study, we demonstrated that among all detected peaks, only 16.5% are identified by HMDB and KEGG. However, the presence of a significant proportion of unknown metabolites has a considerable influence on the accuracy of classification (Fig. 4b). Indeed, annotating metabolomic peaks has remained a major study focus in the field 16. A common approach involves comparing the exact mass of detected peaks with authenticated standards, along with either the retention time or the fragmentation spectra obtained through tandem mass spectrometry (MS2). Despite significant development of molecular structural databases and MS2 spectral databases, their current capabilities and coverage remain limited 51. In addition, network analysis, which examines complex peak relationships and clusters, has also been developed to facilitate the comprehensive identification of metabolites 52. In this study, we employed the deep learning method to capture original signals in LC-MS metabolomic analysis without compromising data integrity. We further implemented a direct transition from m/z to pathway annotations by taking advantage of the network-based analysis tool PIUMet 39, effectively identifying 82 proteins and 121 metabolites in the cancer group, compared with 9 metabolites annotated by MS2.

Furthermore, our method is able to cover the metabolites identified by conventional annotation and simultaneously uncover undetected disease-specific features. In traditional metabolomic analysis, biomarkers specific to the disease of interest are usually sought by comparison of metabolite levels between control and case samples. Therefore, peak alignment and metabolite annotation are crucial to the end results. Here, by employing the end-to-end strategy, we unveiled the complete biological outputs that contribute to the distinct metabolomic profiles of each group. For example, tryptophan metabolism was identified among the characteristics of the lung adenocarcinoma profile (Fig. 5l). The result was consistent with our previous discovery by the conventional annotation method that metabolites in the tryptophan pathway were decreased in early-stage lung adenocarcinoma compared with benign nodules and healthy controls 40. Serine and glycine are also important for nucleotide synthesis by mediating one-carbon metabolism, which is relevant to therapeutic strategies targeting non-small cell lung cancer 53, 54, 55, 56. Intriguingly, we also observed the contribution of bile secretion in the lung adenocarcinoma profile (Fig. 5l), which aligns with another report of aberrant bile acid metabolism in invasive lung adenocarcinoma 57. However, it should be noted that the resolution of our model may be too limited to distinguish all individual peaks contributing to the disease classification.

We additionally demonstrated that among deep learning models, ensemble models are more stable and class-balanced than single models. Although we have not fully understood why the “background category” occurs, the ensemble strategy effectively mitigated this phenomenon (Fig. 5a, b). Our investigation on the PACS image dataset suggests that a “background category” may generally exist in multi-classification tasks using single models. Understanding its underlying mechanism requires further investigation with a broader range of datasets.

The high-resolution heatmaps generated by DeepMSProfiler display the feature contributions to the predicted classes and the precise location of specific metabolomic signals (Fig. 5c), providing explainable analysis to assure researchers of the biological soundness of the prediction. With the capability of batch effect removal, comprehensive metabolomic profiling, and ensemble strategy, DeepMSProfiler demonstrates consistent and robust performance across different categories. It achieves AUCs over 0.99 for the predictions of lung adenocarcinoma, benign nodules, and healthy samples, and an accuracy of 96.1% for early-stage (stage-I) lung adenocarcinoma. Moreover, its extended analysis in pan-cancer illustrates its ability to uncover potential disease-associated metabolites and proteins beyond lung cancer. In conclusion, DeepMSProfiler offers a straightforward and reliable method suitable for applications in disease diagnosis and mechanism discovery, potentially advancing the use of metabolomics in precision medicine. Its effective end-to-end strategy applied to raw metabolomic data can benefit a broader population in non-invasive clinical practices for disease screening and diagnosis.

Clinical sample collection

This study was approved by the Ethics Committees of the Sun Yat-Sen University Cancer Centre, the First Affiliated Hospital of Sun Yat-Sen University and the Affiliated Cancer Hospital of Zhengzhou University. A total of 210 healthy individuals, 323 patients with benign nodules and 326 patients with lung adenocarcinoma were enrolled. Samples from patients with lung adenocarcinoma were collected prior to lung resection surgery, and all cases had pathological confirmation. Serum from participants with benign nodules was collected from individuals undergoing annual physical examinations. Participants with benign nodules were defined as those with stable Computed Tomography (CT) scans over 3–5 years of follow-up at the time of analysis. The sample collection period was from January 2018 to October 2022. The sex of the participants was determined by self-report. Informed consent was obtained from all participants. The study design and conduct complied with all relevant regulations regarding the use of human study participants and were in accordance with the criteria set by the Declaration of Helsinki.

In addition, we collected serum samples from 100 healthy blood donors, including 50 males and 50 females, aged between 40 and 55 years, from the Department of Cancer Prevention and Medical Examination, Sun Yat-Sen University Cancer Centre. All these samples were mixed in equal amounts and the resultant mixture was aliquoted and stored. These mixtures were used as reference samples for quality control and data normalisation in the conventional metabolomic analysis as previously described 34 .

Serum metabolite extraction

Fasting blood samples were collected in serum separation tubes without the addition of anticoagulants, allowed to clot for 1 h at room temperature, and then centrifuged at 2851 ×  g for 10 min at 4 °C to collect the serum supernatant. The serum was aliquoted and then frozen at −80 °C until metabolite extraction.

Reference serum and study samples were thawed and a combined extraction method (methyl tert-butyl ether/methanol/water) was used to extract metabolites. Briefly, 50 μL of serum was mixed with 225 μL of ice-cold methanol and 750 μL of ice-cold methyl tert-butyl ether (MTBE). The mixture was vortexed and incubated for 1 h on ice. Then 188 μL of MS-grade water containing internal standards (¹³C-lactate, ¹³C₃-pyruvate, ¹³C-methionine and ¹³C₆-isoleucine, all from Cambridge Isotope Laboratories) was added and the mixture was vortexed. The mixture was centrifuged at 15,000 × g for 10 min at 4 °C, and then the bottom phase was transferred to two tubes (each containing 125 μL) for LC-MS analysis in positive and negative modes. Finally, the samples were dried in a high-speed vacuum concentrator.

Untargeted liquid chromatography-mass spectrometry

The dried metabolites were resuspended in 120 μL of 80% acetonitrile, vortexed for 5 min and centrifuged at 15,000 ×  g for 10 min at 4 °C. The supernatant was transferred to a glass amber vial with a micro insert for metabolomic analysis. Untargeted metabolomic analysis was performed on an ultra-performance liquid chromatography-high resolution mass spectrometry (UPLC-HRMS) platform. The metabolites were separated using the Dionex Ultimate 3000 UPLC system with an ACQUITY BEH Amide column (2.1 × 100 mm, 1.7 μm, Waters). In positive mode, the mobile phase comprised 95% (A) and 50% acetonitrile (B), containing 10 mmol/L ammonium acetate and 0.1% formic acid. In negative mode, the mobile phase was composed of 95% and 50% acetonitrile for phases A and B, respectively, both containing 10 mmol/L ammonium acetate and adjusted to pH 9. Gradient elution was performed as follows: 0–0.5 min, 2% B; 0.5–12 min, 2–50% B; 12–14 min, 50–98% B; 14–16 min, 98% B; 16–16.1 min, 98–2% B; 16.1–20 min, 2% B. The column temperature was maintained at 40 °C, and the autosampler was set at 10 °C. The flow rate was 0.3 mL/min, and the injection volume was 3 μL. A Q-Exactive orbitrap mass spectrometer (Thermo Fisher Scientific) with an electrospray ionisation (ESI) source was operated in full scan mode coupled with ddMS2 monitoring mode for mass data acquisition. The following mass spectrometer settings were used: spray voltage +3.8 kV/−3.2 kV; capillary temperature 320 °C; sheath gas 40 arb; auxiliary gas 10 arb; probe heater temperature 350 °C; scan range 70–1050 m/z; resolution 70000. Xcalibur 4.1 (Thermo Fisher Scientific) was used for data acquisition.

In this study, all serum samples were analysed by LC-MS in 10 batches. To assess data quality, a mixed quality control (QC) sample was generated by pooling 10 μL of supernatant from each sample in the batch. Six QC samples were injected at the beginning of the analytical sequence to assess the stability of the UPLC-MS system, with additional QC samples injected periodically throughout the batch. Serum pooled from 100 healthy donors was used as reference material in each batch to monitor extraction and batch effects. All untargeted metabolomic analysis was performed at the Sun Yat-Sen University Metabolomics Centre.

Public dataset collection

The raw dataset for pan-cancer lipid metabolomics data of CCLE was downloaded from the Metabolomics Workbench database with accession ST001142 ( https://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Study&StudyID=ST001142 ). There are 946 samples in total, including 23 cancer types. The quantitative lipid metabolite matrix and the DNA methylation matrix were downloaded from the appendix of the article 2 .

The LC-MS dataset of colon cancer was downloaded from the MetaboLights database ( https://www.ebi.ac.uk/metabolights/editor/MTBLS1129/descriptors ) with 236 samples in total, including 197 colon cancer cases and 39 healthy controls. Because the disease samples, classification purposes, instruments, and LC-MS parameters differ between the public dataset and the private lung adenocarcinoma dataset, the DeepMSProfiler model needs to be re-trained on the public dataset.

Data format conversion

The raw LC-MS data files were converted to mzML format using the MSConvert software. The data used to train the end-to-end model were sampled directly from the mzML files without any further processing, so the raw data could be used directly as input to the model. In each mzML file, the ion intensity and mass-to-charge ratio of each ion point were recorded for each time interval. Ion points were mapped into a 3D space by their RT, m/z and intensity. A 2D matrix was sampled from this 3D point array using a max-pooling kernel, with RT: 0.5 min and m/z: 50 as the sampling starting point and RT: 0.016 min and m/z: 1 as the sampling interval. Using the sampling interval as a sliding window, the maximum ion intensity in each interval was sampled to obtain a two-dimensional matrix of 1024 × 1024 ion intensities.
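
The sampling step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `bin_max_intensity` and the `(rt, mz, intensity)` tuple layout are assumptions, and only the starting point, the sampling intervals, the grid size, and the max-pooling rule are taken from the text.

```python
def bin_max_intensity(points, rt_start=0.5, mz_start=50.0,
                      rt_step=0.016, mz_step=1.0, n_rt=1024, n_mz=1024):
    """Max-pool (rt, mz, intensity) ion points onto an n_rt x n_mz grid.

    The tuple layout and the handling of out-of-range points are assumptions.
    """
    grid = [[0.0] * n_mz for _ in range(n_rt)]
    for rt, mz, intensity in points:
        i = int((rt - rt_start) // rt_step)   # retention-time bin
        j = int((mz - mz_start) // mz_step)   # m/z bin
        if 0 <= i < n_rt and 0 <= j < n_mz:   # drop points outside the window
            grid[i][j] = max(grid[i][j], intensity)
    return grid
```

With these defaults the grid spans 1024 × 0.016 min in RT and 1024 m/z units; points falling outside that window are simply discarded in this sketch.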

Extraction and annotation of metabolic peaks

We used Compound Discoverer v3.1 and TraceFinder v4.0 (Thermo Fisher Scientific) for peak alignment and extraction. These steps resulted in a matrix containing retention time, mass-to-charge ratio and peak area information for each metabolite. To eliminate potential batch-to-batch variation, we used the Ref-M method to correct peak areas. This involved dividing the peak area of each feature in the study sample by the peak area of the reference compound from the same batch, yielding relative abundances. We then used several data annotation tools such as MetID, MetDNA and NetID in MZmine to annotate the metabolite features and combined the annotation results 52 , 58 , 59 , 60 . These analysis tools include mass spectral information from databases such as the HMDB, MassBank and MassBank of North America (MoNA) 41 , 61 . In addition, we performed data annotation using MS2 data in Compound Discoverer v3.1 and finally selected, as the precise annotation results, bio-compounds with MS2 matches to certified standards or compounds with inferred annotations based on complete matches in mzCloud (score > 85) or ChemSpider 62 .
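
The Ref-M correction described above amounts to a per-feature division within each batch. The sketch below illustrates this under stated assumptions: the function name `ref_m_correct`, the dictionary layout, and the treatment of features missing from the reference (returned as `None`) are hypothetical choices, not part of the published method.

```python
def ref_m_correct(sample_areas, reference_areas):
    """Divide each feature's peak area by the area of the same feature in the
    reference sample of the same batch, giving a relative abundance."""
    corrected = {}
    for feature, area in sample_areas.items():
        ref = reference_areas.get(feature)
        # Assumption: features absent from (or zero in) the reference
        # cannot be scaled and are marked as missing.
        corrected[feature] = area / ref if ref else None
    return corrected

# Hypothetical peak areas for one study sample and its batch reference.
relative = ref_m_correct({"f1": 200.0, "f2": 50.0}, {"f1": 100.0, "f2": 200.0})
```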

Raw data visualisation

The mzML files were read using the Pyteomics package in Python, and records were traversed for all time points in the sampling interval 63 . Each time-indexed record in an mzML file contained the preset scan configuration, the scan window, the ion injection time, the intensity array, and the m/z array. The intensity and m/z arrays were selected to form an array of data points, with retention time, mass-to-charge ratio, and intensity as the row names. The intensity values were log2-transformed. The 3D point cloud data were then visualised using the Matplotlib Toolkits package in Python 64 . The 2D matrices were obtained by down-sampling the 3D point cloud and pooling the 3D data using median and maximum convolution kernels. Convolution spans were RT: 0.1 min and m/z: 0.001. Heatmaps and contours were plotted using Matplotlib. Retention time-intensity curves were also plotted using Matplotlib with an m/z span of 0.003.

Dataset division and assignment

The dataset of each batch was randomly divided into a training dataset and an independent testing dataset in a ratio of 4:1. The data from the first to the seventh batch contained 90 samples each, including 30 healthy individuals, 30 lung nodules, and 30 lung adenocarcinoma samples. The data for the eighth and ninth batches did not contain healthy samples. The data for the tenth batch only contained nodule samples. To avoid the effect of classification imbalance, we constrained the same sample type and sex ratio in the training and independent testing dataset. Because the samples came from patients of different ages and sexes, the lesion sizes of lung nodules and lung adenocarcinoma patients also varied. In order to avoid these attributes affecting the authenticity of the model, sex, age, and lesion size were also used as constraints for dataset division. The difference in the distribution of sample attributes between the training dataset and the independent testing dataset was verified by an unpaired t-test.
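
A stratified 4:1 split of this kind can be sketched as below. This is an illustrative simplification: it balances only categorical attributes (here class label and sex), whereas the study also constrained age and lesion size and checked the resulting distributions with an unpaired t-test. The function name `stratified_split`, the record layout, and the seed are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(samples, key, test_fraction=0.2, seed=0):
    """Split samples so that every stratum keeps roughly a 4:1 train:test ratio."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for s in samples:
        strata[key(s)].append(s)          # group samples by stratum
    train, test = [], []
    for group in strata.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Hypothetical batch of 90 samples: 3 classes x 2 sexes x 15 samples each.
batch = [{"label": lab, "sex": sex}
         for lab in ("healthy", "nodule", "cancer")
         for sex in ("F", "M") for _ in range(15)]
train, test = stratified_split(batch, key=lambda s: (s["label"], s["sex"]))
```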

Deep learning model construction in detail

In this step, we aimed to construct a model to predict the class label of each metabolomic sample. For this, we first set X and Y as the input and label spaces, respectively. A single end-to-end model consisted of three parts: a dimension converter based on pooling, a feature extractor based on convolutional neural networks, and a multi-layer perceptron (MLP) classifier. The input taken directly from the raw data was extremely large and contained a lot of redundant information, so a pooling process was required to reduce the resolution for downstream convolution operations. The input data of the model was reduced by the maximum pooling layer to obtain D( X ). Next, D( X ) was passed to the feature extractor, which is dominated by convolutional layers, to obtain F(D( X )). The convolutional neural network had local displacement invariance and was well adapted to the RT offset problem in metabolomic data. Due to the relatively large size of the data, more than 5 layers of convolutional operations were required to reduce the dimensionality of the data to a level that the available computing power could handle. Different architectures were compared by their performance on the tuning set. The architectures used in different models included VGGNet (VGG16, VGG19), Inception (InceptionV3), ResNet (ResNet50), DenseNet (DenseNet121), and EfficientNet (EfficientNetB0-B5) 33 , 65 , 66 , 67 . In addition, two optimised models based on DenseNet121 were created to simplify the DenseNet network. The direct connection route replaced the last dense layer of DenseNet121 with a convolutional layer. The optimisation route replaced the last dense layer of DenseNet121 with a convolutional layer that retained a one-hop connection. The pre-training parameters in pre-trained models were derived from ImageNet. Each architecture was tested on both the TensorFlow + Keras platform and the PyTorch platform. To reduce overfitting, we used only one linear layer for our MLP layer. In the TensorFlow + Keras model, there was a SoftMax activation layer before the output layer. The output of the model was C(F(D( X ))).

The positive and negative spectral data used different convolutional layers for feature extraction. Their features were combined before inputting the fully connected layer. Their pre-training parameters were shared. For a model trained on both positive and negative spectral data, a cross entropy loss was used.

Model training

The discovery dataset was divided so that 20% formed the tuning set, which was used to guide model architecture selection, hyperparameter selection, and model optimisation, and the remaining 80% was used for model training. Sample category balancing was used as a constraint for dataset segmentation. The model architecture was evaluated by both validation performance and operational performance. We counted the number of model parameters to evaluate the complexity of the model. The average running time over 10 runs was used as the runtime. Hyperas was used to preliminarily select the optimal model hyperparameters and initialisation weights 68 . The optimal initialisation method was he_normal, but we opted for pretraining with the ImageNet dataset because of its comparable performance and faster execution. After reducing the size of the parameter search space, we used the grid search method for hyperparameter tuning.

Ensemble strategy

DeepMSProfiler consists of several trained end-to-end sub-models as an ensemble model, where the average of the classification prediction probabilities of the samples from all sub-models was used as the final prediction probability for classification. The ensemble model calculated a score vector of length 3 in each of the three classifications, and the category with the maximum score was selected as the predicted classification result.
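
The averaging rule described above can be written in a few lines. This is a minimal sketch with hypothetical probabilities for three sub-models and three classes; the function name `ensemble_predict` is an assumption.

```python
def ensemble_predict(sub_model_probs):
    """Average class probabilities over sub-models (equal weights) and
    return the mean score vector together with the index of the top class."""
    n_models = len(sub_model_probs)
    n_classes = len(sub_model_probs[0])
    mean = [sum(p[c] for p in sub_model_probs) / n_models
            for c in range(n_classes)]
    return mean, max(range(n_classes), key=mean.__getitem__)

# Hypothetical outputs of three sub-models for one sample.
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.3, 0.6, 0.1]]
mean_scores, predicted_class = ensemble_predict(probs)
```

Because every sub-model contributes with the same weight, a single confident but wrong sub-model (the first one here) is outvoted by the others.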

Each end-to-end sub-model was trained on the discovery dataset. The architecture of each sub-model is the same, but some hyperparameters are different. Two different learning rates of 1e-3 and 1e-4 were used. The optimiser used is ‘adam’ with parameter settings of beta_1 as 0.9, beta_2 as 0.999, epsilon as 0.001, decay as 0.0, and amsgrad as False. The batch size was set as 8 and the training was run for 200 epochs. A model typically took about 2 h to complete training on a GP100GL (16GB) GPU server. Each sub-model participated fairly in the final prediction result without setting a specific weight. The independent testing dataset was not used in model training and hyperparameter selection.

Machine learning models for comparison

To compare DeepMSProfiler with other existing tools, we selected several common traditional machine learning methods to build tri-classification models based on the peak area data obtained from the previous steps. These methods included Extreme Gradient Boosting (XGBoost), RF, Adaptive Boosting (AdaBoost), SVM, and DNN. The training dataset and independent testing dataset were divided in the same way as for the deep learning algorithm, and the numbers of estimators for the AdaBoost and XGBoost algorithms were the same as those of DeepMSProfiler. XGBoost was implemented with the XGBClassifier function in the xgboost library. The other machine learning methods were implemented using the scikit-learn library. SVM was implemented using the svm function with a ‘linear’ kernel. RF was implemented with the RandomForestClassifier function, AdaBoost with the AdaBoostClassifier function, and DNN with the MLPClassifier function. The optimal hyperparameters were obtained by the grid search method.

Performance metrics

We evaluated the performance of the model on the independent testing dataset. The evaluation metrics included accuracy, precision, sensitivity and F1 score. Micro was chosen as the computational method for the multiclassification model. Confidence intervals were estimated using 1000 bootstrap iterations. During the bootstrapping procedure, our model was estimated by an ensemble strategy combining 20 models trained on the discovery dataset. In addition, we calculated a confusion matrix and an AUC curve to demonstrate the performance of the model in the three classifications of lung adenocarcinoma, benign nodules and healthy individuals. When the sensitivity was 0.7 or 0.9, the specificity was calculated using the sensitivity-specificity curve. The sensitivity-specificity curve was interpolated using the NEAST method.
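
The micro-averaged metric and the bootstrap confidence interval can be sketched as follows. The text specifies 1000 bootstrap iterations but not how the interval is derived, so the percentile method used here is an assumption, as are the function names, the seed, and the toy labels.

```python
import random

def micro_precision(y_true, y_pred):
    """Micro-averaged precision: pooled TP / (TP + FP) over all classes.
    For single-label multiclass predictions this equals accuracy."""
    tp = sum(t == p for t, p in zip(y_true, y_pred))
    return tp / len(y_true)

def bootstrap_ci(y_true, y_pred, metric, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a metric (assumed method)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        scores.append(metric([y_true[i] for i in idx],
                             [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_iter)]
    upper = scores[int((1 - alpha / 2) * n_iter) - 1]
    return lower, upper

# Hypothetical labels for ten test samples (three classes).
y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
y_pred = [0, 1, 2, 0, 1, 1, 0, 2, 2, 0]
point_estimate = micro_precision(y_true, y_pred)
ci = bootstrap_ci(y_true, y_pred, micro_precision)
```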

Visualisation of “black-box”

In the end-to-end neural network prediction, the data flowed in a chain of \(\boldsymbol{X}\to D(\boldsymbol{X})\to F(D(\boldsymbol{X}))\to C(F(D(\boldsymbol{X})))\) from the input layer through the hidden layers to the output layer. In the feature extraction layer, which is dominated by convolutional layers, information was passed in the same chain manner. After inputting X, we obtained the corresponding output L at different hidden layers to open up the black-box process. To observe the space of intermediate features, PCA was used to reduce the dimensionality of these outputs to principal components. The PCA result was visualised using the Matplotlib package in Python.

To evaluate the correlation of hidden layer output with batch label and type label, respectively, we calculated NMI, ARI, and ASC using the following formulas. L was the layer output and C was the cluster labels used for the cluster evaluation.
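
For NMI, the standard formulation consistent with the symbols explained in the next paragraph is given here. Normalising by the arithmetic mean of the two entropies is an assumption, since other normalisations are also in common use.

\[
\mathrm{MI}(\boldsymbol{L},\boldsymbol{C})=\sum_{i}\sum_{j}P_{i,j}\log\frac{P_{i,j}}{P_{i}\,P_{j}},
\qquad
H(\boldsymbol{L})=-\sum_{i}P_{i}\log P_{i},
\qquad
\mathrm{NMI}(\boldsymbol{L},\boldsymbol{C})=\frac{2\,\mathrm{MI}(\boldsymbol{L},\boldsymbol{C})}{H(\boldsymbol{L})+H(\boldsymbol{C})}
\]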

In the above equations, MI is the mutual information computed between the layer outputs L and the cluster labels C. \(P_{i,j}\) represents the joint distribution probability between i and j, \(P_{i}\) refers to the distribution probability of i, and \(P_{j}\) refers to the distribution probability of j. \(H(\boldsymbol{L})\) and \(H(\boldsymbol{C})\) represent the entropy values of L and C, respectively. The clusters of the output layer are clustered by the K-nearest neighbour algorithm.
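
For ARI, the conventional definition builds on the Rand index (RI) computed over pairs of points. The form below is the standard one and is given on the assumption that it matches the definition used in this study; \(E[\mathrm{RI}]\) denotes the expected RI under random labelling.

\[
\mathrm{RI}=\frac{TP+TN}{TP+FP+FN+TN},
\qquad
\mathrm{ARI}=\frac{\mathrm{RI}-E[\mathrm{RI}]}{\max(\mathrm{RI})-E[\mathrm{RI}]}
\]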

In the above equation, TP represents the number of point pairs belonging to the same cluster in both real and experimental cases, and FN represents the number of point pairs belonging to the same cluster in the real case but not in the same cluster in the experimental case. FP represents the number of point pairs not belonging to the same cluster in the real case but in the same cluster in the experimental case, and TN represents the number of point pairs not belonging to the same cluster in both real and experimental cases. The range of ARI is [−1, 1]; the larger the value, the more consistent the clustering is with the real result, that is, the better the clustering.
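
For ASC, taken here to be the silhouette coefficient averaged over all samples, the standard definition is as follows. The symbol \(N\) for the number of samples is introduced for illustration only.

\[
s_{i}=\frac{b_{i}-a_{i}}{\max(a_{i},\,b_{i})},
\qquad
\mathrm{ASC}=\frac{1}{N}\sum_{i=1}^{N}s_{i}
\]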

The output layer was first dimensionally reduced by PCA, and the clusters were specified by the real labels. In the above equation, \(a_{i}\) represents the average of the dissimilarity of the vector i to other points in the same cluster, and \(b_{i}\) represents the minimum of the dissimilarity of the vector i to points in other clusters.

Feature contributions

In previous feature contribution studies, different branches used different methods to compute feature contributions to final classifications. These methods can help to better understand features and their impacts on model predictions. Gradient-based methods, such as GradCAM, calculate the gradients of the last convolutional layer by backpropagation of the category with the highest confidence 69 . Due to its convenience, this method is widely used in computer vision tasks. But it has a significant problem, that is, the resolution of the feature contribution heatmap is extremely low and cannot reach the requirements for distinguishing most signal peaks. The size of the feature contribution heatmap corresponds to the last convolutional layer of the model. The weight of the feature contribution is the average of the gradients of all features. On the other hand, perturbation-based methods, such as RISE and Local Interpretable Model-Agnostic Explanations, measure the importance of features by obscuring some pixels in raw data 37 , 70 . The predictive ability of the black box is then observed to show how much this affects the prediction. Perturbation-based methods can lead to higher resolution and more accurate contribution estimates, but their runtimes are longer. To improve the computing speed in this study, we made some improvements based on RISE, using boost sampling for the mask process.

Using RISE, we can determine the characteristic contributions of RT and m/z for each sample according to its true category. The feature contribution heatmap uses RT as the horizontal axis and m/z as the vertical axis to show the feature contribution of different positions of each sample. The average feature contribution of all samples correctly predicted to be of their true category is taken as the feature contribution of the category. At the same time, by performing peak extraction in the previous steps, we determined the RT value range and the m/z median value for each signal peak. The characteristic contribution associated with the RT and median m/z coordinates is then identified as the distinctive contribution of the signal peak.
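
The core idea of a RISE-style perturbation estimate can be sketched as follows. This is a simplified illustration, not the modified procedure used in this study: it draws independent per-cell masks instead of upsampled low-resolution masks, it omits the boost sampling mentioned above, and the function name `rise_saliency` and the toy scoring function are assumptions.

```python
import random

def rise_saliency(x, predict, n_masks=2000, keep_prob=0.5, seed=0):
    """Estimate per-cell importance of a 2D input by random masking.

    x: 2D list of intensities (rows = RT bins, columns = m/z bins).
    predict: function mapping a masked 2D list to a scalar class score.
    """
    rng = random.Random(seed)
    rows, cols = len(x), len(x[0])
    saliency = [[0.0] * cols for _ in range(rows)]
    for _ in range(n_masks):
        mask = [[1 if rng.random() < keep_prob else 0 for _ in range(cols)]
                for _ in range(rows)]
        masked = [[x[r][c] * mask[r][c] for c in range(cols)]
                  for r in range(rows)]
        score = predict(masked)
        # Accumulate each mask weighted by the score it produced.
        for r in range(rows):
            for c in range(cols):
                saliency[r][c] += score * mask[r][c]
    norm = n_masks * keep_prob
    return [[v / norm for v in row] for row in saliency]

def toy_predict(masked):
    """Toy 'model' whose score depends only on the cell at row 1, column 2."""
    return 1.0 if masked[1][2] > 0 else 0.0

heat = rise_saliency([[1.0] * 4 for _ in range(3)], toy_predict)
```

In this toy case the cell that drives the score receives a contribution close to 1, while all other cells settle near the keep probability of 0.5.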

Network analysis and pathway enrichment

The extracted metabolic peaks with a contribution score greater than 0.70 to the lung cancer classification were filtered. Mass-to-charge ratio and some substance identification information of these metabolites and their classification contribution scores were used as input data. For some of the metabolic signal peaks, we have accurately identified their molecular formulae and substance names by secondary mass spectrometry as substance identification information. Due to the limitation of existing databases, many unknown signals cannot be identified through secondary mass spectrometry. Therefore, PIUMet was also adopted to search for hidden metabolites and related proteins.

PIUMet built disease-related metabolite-protein networks based on the prize-collecting Steiner Forest algorithm. First, PIUMet integrated iRefIndex (v13), HMDB (v3) and Recon2 databases to obtain the relationship between m/z, metabolites and proteins, and generated an overall metabolite-protein network. The edges were weighted by the confidence level of the correlation reported in these databases. iRefIndex provides details on the physical interactions between proteins, which are detected through different experimental methods. The protein-metabolite relationships in Recon2 are based on their involvement in the same reactions. HMDB includes proteins that are involved in metabolic pathways or enzymatic reactions, as well as metabolites that play a role in protein pathways, based on known biochemical and molecular biological data. The disease-related metabolite peaks obtained by DeepMSProfiler were matched to the metabolite nodes of the overall network by their m/z, and directly to the terminal metabolite nodes of the overall network after annotation. The corresponding feature contributions obtained by DeepMSProfiler served as prizes for these metabolite nodes. The network structure was then optimised using the prize-collecting Steiner Forest algorithm to minimise the total network cost and connect all terminal nodes, thereby removing low-confidence relationships and obtaining disease-related metabolite sub-networks.

Metabolite identification is an important issue in metabolomics research and there are different levels of confidence in identification. Referring to the highest level considered 71 , we analysed authentic chemical standards and validated 11 of the metabolites discovered by PIUMet with only m/z (Supplementary Table  5 ). Then, disease-related metabolites and proteins were used to analyse their pathways 39 . These hidden metabolites and proteins from PIUMet were then processed for KEGG pathway enrichment analysis using MetaboAnalyst (v6.0). We used joint pathway analysis in MetaboAnalyst and chose hypergeometric test for enrichment analysis and degree centrality for topology measure. The integrated version of KEGG pathways (year 2019) was adopted by MetaboAnalyst. Pathways were filtered out using 1e-5 as a p value cut-off 72 . The corresponding SYMBOL IDs of the proteins were converted to KEGG IDs by the ClusterProfiler package in R 73 .
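
The hypergeometric test used for enrichment can be illustrated with a short sketch. It is not MetaboAnalyst's implementation; the function name and the example counts are hypothetical, and only the test itself and the 1e-5 cut-off come from the text.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_pathway, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected items are drawn without replacement
    from n_universe items, of which n_pathway belong to the pathway."""
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

# Hypothetical counts: 2000 background compounds, 40 in the pathway,
# 120 selected by the model, 15 of which fall in the pathway.
p_value = hypergeom_enrichment_p(2000, 40, 120, 15)
is_enriched = p_value < 1e-5
```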

Ablation experiment

We searched the PubMed database for articles published from 2010 to 2022 using the terms “serum”, “lung cancer” and “metabolism”, retrieving a total of 5088 articles. By reading their titles and abstracts, we excluded publications that used non-serum samples such as urine and tissue, as well as publications that used non-mass spectrometry methods such as chromatography, nuclear magnetic resonance, and infrared spectroscopy. We then further screened the selected literature to exclude studies that did not result in the discovery of metabolic biomarkers. Finally, 49 publications remained, reporting 811 serum metabolic biomarkers for lung cancer. Some of the publications provide the retention time and mass-to-charge ratio of the biomarkers, whereas others give only the name of the identified biomarker. Therefore, we searched the molecular weights of these metabolites in the HMDB database based on the literature information to match the corresponding m/z. The use of metabolite molecular weights to match the m/z took full account of the effect of adducts. Based on the number of publications reporting each biomarker, we determined the range of retained signals to be the m/z values corresponding to biomarkers that exceeded the threshold number of publications. We filtered the signals in the raw data to exclude signals that did not fall into the 3 ppm intervals around these m/z values. The filtered raw data were used as input to the model.
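
The 3 ppm filtering step can be sketched as follows. The function names and the `(rt, mz, intensity)` tuple layout are assumptions, and the target m/z values are taken to be already adjusted for adducts.

```python
def within_ppm(mz, target_mz, ppm=3.0):
    """True if mz lies within +/- ppm parts per million of target_mz."""
    return abs(mz - target_mz) <= target_mz * ppm * 1e-6

def filter_signals(points, target_mzs, ppm=3.0):
    """Keep (rt, mz, intensity) points whose m/z matches any target within ppm."""
    return [p for p in points
            if any(within_ppm(p[1], t, ppm) for t in target_mzs)]

# Hypothetical signals around a target m/z of 200.0 (tolerance 0.0006 at 3 ppm).
signals = [(1.0, 200.0005, 10.0), (1.0, 200.0010, 5.0), (2.0, 199.9995, 7.0)]
kept = filter_signals(signals, target_mzs=[200.0])
```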

Statistical analysis

All statistical analysis calculations were performed using the stat package in Python. The distribution of data was plotted using the Seaborn package in Python. The correlation between patient information and labels was calculated using Pearson’s, Spearman’s and Kendall’s correlation coefficients. Pearson’s correlation coefficient was preferred to find linear correlations. Spearman’s and Kendall’s rank correlation coefficients were used to discover non-linear correlations. P -values below 0.05 were considered significant.

Figure preparation

The main figures in this paper were assembled in Adobe Illustrator. The photo of the mass spectrometry instruments was taken of the actual equipment. The data structure diagrams were obtained by fitting simulated functions in Python. Some cartoon components were drawn through FigDraw ( www.figdraw.com ) with a license code (TAIRAff614) for free use.

AI-assisted technologies in the writing process

At the end of the preparation of this work, the authors used ChatGPT to proofread the manuscript. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Data availability

All of the raw LC-MS data generated in this study have been deposited in a Code Ocean capsule under accession code 2328223 with a citable DOI number https://doi.org/10.24433/CO.2328223.v1 . Source data for users to reproduce our research results can be downloaded from the Source Data file. The source data for the network and pathway results in Fig.  5 can be found in Supplementary Data  2 and 3 in the Supplementary Information.  Source data are provided with this paper.

Code availability

The source code and the pretrained model for DeepMSProfiler are available in the GitHub repository ( https://github.com/yjdeng9/DeepMSProfiler ) for academic use 74 . The code on GitHub serves as an easy-to-use tool for running DeepMSProfiler.

References

1. Schmidt, D. R. et al. Metabolomics in cancer research and emerging applications in clinical oncology. CA Cancer J. Clin. 71, 333–358 (2021).

2. Li, H. et al. The landscape of cancer cell line metabolism. Nat. Med. 25, 850–860 (2019).

3. Buergel, T. et al. Metabolomic profiles predict individual multidisease outcomes. Nat. Med. 28, 2309–2320 (2022).

4. Yang, J., Huang, L. & Qian, K. Nanomaterials‐assisted metabolic analysis toward in vitro diagnostics. Exploration 2, 20210222 (2022).

5. Marx, V. Boost that metabolomic confidence. Nat. Methods 17, 33–36 (2020).

6. Singh, A. Tools for metabolomics. Nat. Methods 17, 24 (2020).

7. Chen, X., Shu, W., Zhao, L. & Wan, J. Advanced mass spectrometric and spectroscopic methods coupled with machine learning for in vitro diagnosis. View 4, 20220038 (2023).

8. Liu, J. et al. Integrative metabolomic characterisation identifies altered portal vein serum metabolome contributing to human hepatocellular carcinoma. Gut 71, 1203–1213 (2022).

9. Caba, O. et al. 1542P untargeted metabolomics to identify novel biomarkers of pancreatic cancer. Ann. Oncol. 31, S946 (2020).

10. Wang, Y., Jacobs, E. J., Carter, B. D., Gapstur, S. M. & Stevens, V. L. Plasma metabolomic profiles and risk of advanced and fatal prostate cancer. Eur. Urol. Oncol. 4, 56–65 (2021).

11. Huang, L. et al. Machine learning of serum metabolic patterns encodes early-stage lung adenocarcinoma. Nat. Commun. 11, 3556 (2020).

12. Lee, K. B., Ang, L., Yau, W. P. & Seow, W. J. Association between metabolites and the risk of lung cancer: a systematic literature review and meta-analysis of observational studies. Metabolites 10, 362 (2020).

13. Kannampuzha, S. et al. A systematic role of metabolomics, metabolic pathways, and chemical metabolism in lung cancer. Vaccines 11, 381 (2023).

14. Kim, T. et al. A hierarchical approach to removal of unwanted variation for large-scale metabolomics data. Nat. Commun. 12, 4992 (2021).

15. Abram, K. J. & McCloskey, D. A comprehensive evaluation of metabolomics data preprocessing methods for deep learning. Metabolites 12, 202 (2022).

16. Singh, A. Annotating unknown metabolites. Nat. Methods 20, 33 (2023).

17. Chen, Y. et al. Metabolomic machine learning predictor for diagnosis and prognosis of gastric cancer. Nat. Commun. 15, 1657 (2024).

18. Shen, X. et al. Deep learning-based pseudo-mass spectrometry imaging analysis for precision medicine. Brief. Bioinform. 23, bbac331 (2022).

19. Sen, P. et al. Deep learning meets metabolomics: a methodological perspective. Brief. Bioinform. 22, 1531–1542 (2021).

20. Pomyen, Y. et al. Deep metabolome: applications of deep learning in metabolomics. Comput. Struct. Biotechnol. J. 18, 2818–2825 (2020).

21. Alakwaa, F. M., Chaudhary, K. & Garmire, L. X. Deep learning accurately predicts estrogen receptor status in breast cancer metabolomics data. J. Proteome Res. 17, 337–347 (2018).

22. Editorial. Why the metabolism field risks missing out on the AI revolution. Nat. Metab. 1, 929–930 (2019).

23. Roscher, R., Bohn, B., Duarte, M. F. & Garcke, J. Explainable machine learning for scientific insights and discoveries. IEEE Access 8, 42200–42216 (2020).

24. Janizek, J. D. et al. Uncovering expression signatures of synergistic drug responses via ensembles of explainable machine-learning models. Nat. Biomed. Eng. 7, 811–829 (2023).

25. Binder, A. et al. Morphological and molecular breast cancer profiling through explainable machine learning. Nat. Mach. Intell. 3, 355–366 (2021).

26. Ganaie, M. A., Hu, M., Malik, A. K., Tanveer, M. & Suganthan, P. N. Ensemble deep learning: a review. Eng. Appl. Artif. Intell. 115, 105151 (2022).

27. Dietterich, T. G. Ensemble methods in machine learning. In Multiple Classifier Systems (Springer Berlin Heidelberg, 2000).

28. Brown, G., Wyatt, J., Harris, R. & Yao, X. Diversity creation methods: a survey and categorisation. Inf. Fusion 6, 5–20 (2005).

29. Tang, E. K., Suganthan, P. N. & Yao, X. An analysis of diversity measures. Mach. Learn. 65, 247–271 (2006).

Domingos, P. A unified bias-variance decomposition and its applications. In Proc. Seventeenth International Conference on Machine Learning 231–238 (Morgan Kaufmann Publishers Inc, 2000).

Kohavi, R. & Wolpert, D. H. Bias plus variance decomposition for zero-one loss functions. In Proc. 13th International Conference on Machine Learning (ICML96). 275–283 (1996).

Pisetta, V., Jouve, P. E. & Zighed, D. A. Learning with ensembles of randomized trees: new insights. Lect. Notes Computer Sci. 6323 , 67–82 (2010).

Huang, G., Liu, Z., van der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. Proc. IEEE Conf. Computer Vis. Pattern Recognit. 154 , 4700–4708 (2017).

Yao, Y. et al. Normalization approach by a reference material to improve LC-MS-based metabolomic data comparability of multibatch samples. Anal. Chem. 95 , 1309–1317 (2023).

Büttner, M., Miao, Z., Wolf, F. A., Teichmann, S. A. & Theis, F. J. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16 , 43–49 (2019).

Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with harmony. Nat. Methods 16 , 1289–1296 (2019).

Petsiuk, V., Das, A. & Saenko, K. RISE: Randomized input sampling for explanation of black-box models. Preprint at https://arxiv.org/abs/1806.07421 (2018).

Li, D., Yang, Y., Song, Y. Z. & Hospedales, T. M. Deeper, broader and artier domain generalization. Proc. IEEE Int. Conf. Computer Vis. 2017 , 5542–5550 (2017).

Pirhaji, L. et al. Revealing disease-associated pathways by network integration of untargeted metabolomics. Nat. Methods 13 , 770–776 (2016).

Yao, Y. et al. Metabolomic differentiation of benign vs malignant pulmonary nodules with high specificity via high-resolution mass spectrometry analysis of patient sera. Nat. Commun. 14 , 2339 (2023).

Wishart, D. S. et al. HMDB 5.0: the human metabolome database for 2022. Nucleic Acids Res. 50 , D622–D631 (2022).

Swainston, N. et al. Recon 2.2: from reconstruction to model of human metabolism. Metabolomics 12 , 109 (2016).

Razick, S., Magklaras, G. & Donaldson, I. M. iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinform. 9 , 405 (2008).

Liao, Y. et al. PLA2G7/PAF-AH as potential negative regulator of the Wnt signaling pathway mediates protective effects in BRCA1 mutant breast cancer. Int. J. Mol. Sci. 24 , 882 (2023).

Deng, H. & Li, W. Monoacylglycerol lipase inhibitors: modulators for lipid metabolism in cancer malignancy, neurological and metabolic disorders. Acta Pharm. Sin. B 10 , 582–602 (2020).

Deng, X., Yuan, L., Jankovic, J. & Deng, H. The role of the PLA2G6 gene in neurodegenerative diseases. Ageing Res. Rev. 89 , 101957 (2023).

Liu, D., Yu, Q., Ning, Q., Liu, Z. & Song, J. The relationship between UGT1A1 gene & various diseases and prevention strategies. Drug Metab. Rev. 54 , 1–21 (2022).

Zhang, X. et al. MicroRNA-related genetic variations as predictors for risk of second primary tumor and/or recurrence in patients with early-stage head and neck cancer. Carcinogenesis 31 , 2118–2123 (2010).

Tsugawa, H. et al. A lipidome atlas in MS-DIAL 4. Nat. Biotechnol. 38 , 1159–1163 (2020).

van der Gugten, J. G. Tandem mass spectrometry in the clinical laboratory: a tutorial overview. Clin. Mass Spectrom. 15 , 14–25 (2020).

Wishart, D. S. Emerging applications of metabolomics in drug discovery and precision medicine. Nat. Rev. Drug Discov. 15 , 473–484 (2016).

Chen, L. et al. Metabolite discovery through global annotation of untargeted metabolomics data. Nat. Methods 18 , 1377–1385 (2021).

Stine, Z. E., Schug, Z. T., Salvino, J. M. & Dang, C. V. Targeting cancer metabolism in the era of precision oncology. Nat. Rev. Drug Discov. 21 , 141–162 (2022).

DeNicola, G. M. et al. NRF2 regulates serine biosynthesis in non–small cell lung cancer. Nat. Genet. 47 , 1475–1481 (2015).

Sánchez-Castillo, A. et al. Targeting serine/glycine metabolism improves radiotherapy response in non-small cell lung cancer. Br. J. Cancer 130 , 568–584 (2024).

Fan, T. W. M. et al. De novo synthesis of serine and glycine fuels purine nucleotide biosynthesis in human lung cancer tissues. J. Biol. Chem. 294 , 13464–13477 (2019).

Nie, M. et al. Evolutionary metabolic landscape from preneoplasia to invasive lung adenocarcinoma. Nat. Commun. 12 , 6479 (2021).

Shen, X. et al. Metabolic reaction network-based recursive metabolite annotation for untargeted metabolomics. Nat. Commun. 10 , 1516 (2019).

Shen, X. et al. MetID: an R package for automatable compound annotation for LC-MS-based data. Bioinformatics 38 , 568–569 (2022).

Schmid, R. et al. Integrative analysis of multimodal mass spectrometry data in MZmine 3. Nat. Biotechnol. 41 , 447–449 (2023).

Horai, H. et al. MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass Spectrom. 45 , 703–714 (2010).

Pence, H. E. & Williams, A. Chemspider: an online chemical information resource. J. Chem. Educ. 87 , 1123–1124 (2010).

Levitsky, L. I., Klein, J. A., Ivanov, M. V. & Gorshkov, M. V. Pyteomics 4.0: five years of development of a python proteomics framework. J. Proteome Res. 18 , 709–714 (2019).

Hunter, J. D. Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 9 , 90–95 (2007).

Tan, M. & Le, Q. V. EfficientNet: rethinking model scaling for convolutional neural networks. In Proc. Machine Learning Research 2019. 6105–6114 (2019).

He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 770–778 (2016).

Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. Preprint at https://arxiv.org/abs/1409.1556 (2014).

Pumperla, M. Keras + Hyperopt: A very simple wrapper for convenient hyperparameter optimization. GitHub https://github.com/maxpumperla/hyperas (2016).

Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV) 618–626. https://doi.org/10.1109/ICCV.2017.74 (2017).

Ribeiro, M. T., Singh, S. & Guestrin, C. ‘Why should I trust you?’ Explaining the predictions of any classifier. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1135–1144 (2016).

Salek, R. M., Steinbeck, C., Viant, M. R., Goodacre, R. & Dunn, W. B. The role of reporting standards for metabolite annotation and identification in metabolomic studies. Gigascience 2 , 2047–217X (2013).

Pang, Z. et al. MetaboAnalyst 5.0: narrowing the gap between raw spectra and functional insights. Nucleic Acids Res. 49 , W388–W396 (2021).

Wu, T. et al. clusterProfiler 4.0: a universal enrichment tool for interpreting omics data. Innovation 2 , 100141 (2021).

Deng, Y. et al. An end-to-end deep learning method for mass spectrometry data analysis to reveal disease-specific metabolic profiles. GitHub https://doi.org/10.5281/zenodo.12740369 (2024).

Acknowledgements

This work was supported by grants from the National Key R&D Program of China (2020YFA0803302 to Peng Huang; 2021YFF1200903, 2016YFC0901604 and 2018YFC0910401 to Weizhong Li), the Major Project of Guangzhou National Laboratory (GZNL2024A01003 to Weizhong Li), and the Guangdong Basic and Applied Basic Research Foundation (2022B1515120077 to Weizhong Li). We thank Prof. Zhi Xie (Zhongshan Ophthalmic Center at Sun Yat-sen University) and Prof. Kai Ye (Xi’an Jiaotong University) for their helpful suggestions on the paper.

Author information

Authors and affiliations.

Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, China

Yongjie Deng, Yanni Wang, Tiantian Yu, Wenhao Cai, Dingli Zhou & Weizhong Li

State Key Laboratory of Oncology in South China, Guangdong Provincial Clinical Research Center for Cancer, Sun Yat-sen University Cancer Center, Guangzhou, China

Yao Yao, Tiantian Yu, Feng Yin, Wanli Liu, Yuying Liu, Chuanbo Xie, Yumin Hu & Peng Huang

Metabolic Innovation Platform, Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, China

Yao Yao, Tiantian Yu, Yumin Hu & Peng Huang

Department of Radiology, The First Affiliated Hospital of Sun Yat-sen University, Guangzhou, China

Sun Yat-Sen University School of Medicine, Sun Yat-Sen University, Shenzhen, China

Weizhong Li

Key Laboratory of Tropical Disease Control of Ministry of Education, Sun Yat-sen University, Guangzhou, China

Contributions

Y. Deng, W. Li and Y. Hu designed the method. Y. Deng and Y. Yao implemented the method. Y. Deng, Y. Yao and Y. Wang conducted data analysis. T. Yu conducted the metabolomic experiments. Y. Deng and W. Cai visualised the results. Y. Deng and D. Zhou collected the public data. F. Yin, W. Liu, Y. Liu, C. Xie and J. Guan collected clinical samples and patient information. Y. Deng, W. Li and Y. Hu wrote the manuscript. P. Huang, W. Li and Y. Hu contributed to conceptualisation, supervision, management, manuscript reviewing and editing.

Corresponding authors

Correspondence to Yumin Hu, Peng Huang or Weizhong Li.

Ethics declarations

Competing interests.

All authors declare the following competing interests. All authors have filed patents for both the technology and the use of the technology to analyse metabolomic data.

Peer review

Peer review information.

Nature Communications thanks Timothy Ebbels, Kun Qian and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

  • Supplementary Information
  • Reporting Summary
  • Supplementary Data 1
  • Supplementary Data 2
  • Supplementary Data 3
  • Source Data
  • Transparent Peer Review file

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .

About this article

Cite this article.

Deng, Y., Yao, Y., Wang, Y. et al. An end-to-end deep learning method for mass spectrometry data analysis to reveal disease-specific metabolic profiles. Nat Commun 15 , 7136 (2024). https://doi.org/10.1038/s41467-024-51433-3

Received: 16 February 2024

Accepted: 07 August 2024

Published: 20 August 2024

DOI: https://doi.org/10.1038/s41467-024-51433-3

What AI Developer Skills Do You Need in 2024?

According to the latest Stack Overflow survey, AI developers are among the top earners in the software industry. Demand for them will continue to rise, as 76% of the respondents are using or planning to use AI tools in their development process.

2024 Stack Overflow Developer Survey indicates average annual salary for AI specialists is $160,000

Source: Stack Overflow 2024 Developer Survey

The past two years have further expanded the possibilities of AI and how it can be used. AI has changed how businesses operate, approach marketing, sell their products, manage customer relationships and conduct research and development. The 2024 McKinsey State of AI Survey reveals that 72% of organizations have already adopted AI in one or more areas of their operations, which further stresses the importance of AI in modern software development.

Newer roles, such as AI engineers, machine learning (ML) engineers and MLOps, are becoming popular. They require traditional developers to expand their skill sets.

In this article, I’ll explore the essential skills you’ll need to become an AI developer. I’ll provide insights from Tim Faulkes, chief developer advocate of Aerospike. I’ll also explore the key trends shaping AI and how they affect your role as a developer, examine the intersection between AI and software development, and look into specialized skills for different developer roles.

Key Trends Shaping AI Development

Before exploring the skills you’ll need to become an AI developer, it’s important to consider some advancements in the AI ecosystem that are expanding the scope of these applications. The most significant trends (which I’ll break down below) include:

  • Real-time data processing
  • Vector databases
  • Generative AI
  • Convergence of AI with traditional development processes

Real-Time Data Processing

In industries such as healthcare, finance, autonomous vehicles and the Internet of Things (IoT) where decisions are made in microseconds, there has been a push to develop advanced AI systems. This need has led these industries to leverage technologies like Apache Kafka and Apache Flink to enable continuous streams of data, which feed pipelines of AI models, allowing instant predictions and updates.

Additionally, latency is a general problem in software systems, and AI systems are not immune to it. The need to bring computation and data storage closer to the data source also drives how AI systems are being developed.

As Aerospike’s Faulkes told SiliconANGLE, “The real-time data is here. Anyone who’s looked at any real-time decisioning, yes, you’re using AI to make the decisions typically. But the more data you can feed the pipelines, the better results you’re going to get.”

He’s seen this in his eight years working with developers at the multi-model database vendor. Real-time data has “come to the forefront recently because of recent innovations,” he said. “As people’s need for data has risen, the data scientists are saying, ‘If I can get 10 times as much data, I can give you a better fraud result,’ for example. So yes, the revolution is here.”

The takeaway: As an AI developer, you must be skilled in the technologies that enable real-time data processing, possess domain knowledge relevant to your projects and consider latency when building and deploying AI solutions.
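As a rough illustration of feeding a continuous stream into a decision pipeline, the sketch below updates a rolling average each time a new event arrives. It is a minimal sketch under stated assumptions: a plain Python list stands in for a real stream (it does not use Apache Kafka or Apache Flink), and the function name `rolling_mean` and the window size are hypothetical choices made for this example.

```python
from collections import deque


def rolling_mean(stream, window=3):
    """Yield the mean of the most recent `window` events as each one arrives."""
    recent = deque(maxlen=window)  # old events fall out automatically
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)


# Hypothetical event values; a real system would read these from a stream.
events = [10.0, 20.0, 30.0, 40.0]
means = list(rolling_mean(events))
print(means)
```

Because the function is a generator, each result is available as soon as its event is consumed, which is the property a low-latency pipeline depends on.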

Vector Databases

The way data is stored in a database significantly affects retrieval performance and efficiency. AI systems that deal with large volumes of data and integrate with machine learning models need vector databases, particularly for retrieval-augmented generation (RAG). These databases index and store data as high-dimensional vectors that AI systems can use to power search, recommendations and other retrieval use cases.

“Vectors are these mystical concepts, but they’re really easy,” Faulkes explained. “It’s just a set of numbers — let’s think of an image, I’ve got a picture, I run a lossy algorithm like JPEG on it. It loses information and turns it into a smaller representation of the same thing.

“That’s exactly what a vector does. It takes whatever input it has, be it an audio file or a question you give it, and it turns it into a series of numbers, and that’s all it is.”

But that series of numbers can be enormous. “We’ve got vectors that can be up to thousands of dimensions, and then you need a vector database. All they do is look at these hundreds or thousands of dimensions of vectors and say, ‘You’ve given me one. Find me the ones that are closest to it.’”

The takeaway: Vector databases are becoming more powerful and will continue to influence how AI applications are built. As an AI developer, you must:

  • Understand the purpose of vector databases.
  • Choose the right one for your use case.
  • Understand how vector databases accept and retrieve data.
  • Know how to integrate vector databases with applications.
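The "find me the ones that are closest" lookup Faulkes describes can be sketched in a few lines. This is only an illustration under stated assumptions: it does a brute-force scan with cosine similarity, whereas production vector databases generally rely on specialized indexes, and the three-dimensional vectors, item names and the `nearest` helper are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest(query, index, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), item_id)
              for item_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:k]]


# Hypothetical embeddings; real ones have hundreds or thousands of dimensions.
index = {
    "cat_photo": [0.9, 0.1, 0.0],
    "dog_photo": [0.8, 0.3, 0.1],
    "invoice_pdf": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]
print(nearest(query, index))
```

The scan compares the query with every stored vector, which is fine for a handful of items but is exactly the cost a dedicated vector database is designed to avoid at scale.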

Generative AI

Generative AI (GenAI) systems can create content such as text, images, audio and video. Over the past two years, GenAI has been widely adopted in product design, entertainment and synthetic data creation. While models such as GPT-4 and DALL·E 3 have met content creators’ needs to an extent, demand from those creators will push the models further. As creators’ needs keep changing, existing models will be improved and new, more advanced ones will be built.

Under the GenAI category, you will also find prompt engineering and large language models (LLMs). LLMs are the underlying models that process and generate content based on their training data, and prompt engineering is the practice of crafting inputs to guide LLMs to produce desired outputs.

Faulkes finds it fascinating that you don’t program LLMs, at least in the traditional sense. Instead, you give them instructions in your native language about what you want them to do. Faulkes provided an example: “You say, ‘Hey, you’re an expert on Aerospike, and I want you to answer my questions in the best manner using this information.’ It’s not writing code, it’s writing English.”

There’s also a complete ecosystem around LLMs, he said. “You’ve got the LLMs, you’ve got the prompt engineering, you’ve got your vector databases so that you can get the right information out of all your inputs and give it the answer.”

The takeaway: As an AI developer, you must:

  • Understand generative AI capabilities.
  • Know how to select appropriate tools and models.
  • Meet the ethical and other responsibilities that come with using generative AI.
  • Know how to integrate generative AI with applications.
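To make the interplay of prompt engineering and retrieval more concrete, here is a minimal sketch of assembling a prompt from a role instruction, passages retrieved from a vector database and the user's question. It is an assumption-laden illustration: `build_prompt` is a hypothetical helper, the passage text is a placeholder, and no model is actually called.

```python
def build_prompt(role, context_passages, question):
    """Assemble a plain-English prompt from a role, retrieved passages and a question."""
    context = "\n".join(f"- {passage}" for passage in context_passages)
    return (
        f"{role}\n\n"
        f"Use only this information:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    role="You are an expert on Aerospike. Answer in the best manner you can.",
    # Placeholder for text that a vector-database lookup would return.
    context_passages=["(passage retrieved from a vector database)"],
    question="How do I model time-series data?",
)
print(prompt)
```

The resulting string is what would be sent to an LLM; the "programming" here is the wording of the role and instructions rather than any control flow.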

AI’s Convergence With Traditional Development Processes

AI is changing the software development process, from AI-powered development tools like GitHub Copilot and Tabnine to MLOps, which integrates machine learning models into the software development life cycle (SDLC).

“Everyone’s used to the ChatGPT sort of things where they say, ‘Yes, I can give it some information and it’ll give me some answers, and it might be right or might be wrong. But how do I embed that into an application?’” Faulkes noted. “Plugging [AI] into your traditional applications and having the two liaised together in the right manner so you get the best application results … that’s really what I find exciting.”

The takeaway: As an AI developer, you must learn the skills that will help you embed AI capabilities into conventional software and build applications that can learn from data, improve over time and provide smarter functionality.

Skills Required To Become an AI Developer Today

AI developers are responsible for designing, developing and maintaining AI systems. These developers have become increasingly popular and are in wide demand because of their role in using AI to transform multiple industries. Below are the main skills you need to become an AI developer.

Programming Skills

Proficiency in programming languages like Python (the most common for AI/ML), R (for statistical computing), and Java, C++ or Julia (for performance-critical applications) is a fundamental skill for an AI developer. You also need a good understanding of AI and machine learning libraries and frameworks like TensorFlow, PyTorch, scikit-learn and Pandas.

Mathematics and Statistics

As an AI engineer, you should be familiar with linear algebra, calculus, probability and statistics, and optimization theory. These are the foundations you need to design, build and maintain AI algorithms and techniques.

Data Handling and Analysis

Data is the bedrock of AI, and AI developers must understand how to properly collect, clean, normalize and transform data. Knowledge of SQL and NoSQL is also important since you’ll deal with structured and unstructured data.

Additionally, you should be familiar with big data tools and frameworks such as Apache Hadoop and Spark for storing, managing and processing big data.
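The cleaning and normalizing steps mentioned above can be illustrated with a small sketch. It assumes one simple policy among many: missing values (represented here as `None`) are dropped and the remaining values are min-max scaled into the range 0 to 1. The function name `clean_and_normalize` and the sample data are hypothetical.

```python
def clean_and_normalize(values):
    """Drop missing entries, then min-max scale the rest into [0, 1]."""
    present = [float(v) for v in values if v is not None]
    if not present:
        return []
    lo, hi = min(present), max(present)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in present]
    return [(v - lo) / (hi - lo) for v in present]


raw = [10, None, 20, 15, None, 30]
print(clean_and_normalize(raw))
```

Dropping rows is the simplest way to handle missing data; imputing a value instead is a common alternative when discarding records would lose too much information.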

Machine Learning and Deep Learning

As an AI developer, you must be skilled in training models and evaluating their performance using metrics like accuracy, precision and F1 score, along with other evaluation techniques. Additionally, you should be familiar with ML algorithms like linear regression, logistic regression and neural networks, and with deep learning (DL) architectures like convolutional neural networks (CNNs) and generative adversarial networks (GANs).
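For readers who have not computed these metrics by hand, the sketch below derives accuracy, precision, recall and F1 score for a binary classifier from true and predicted labels. The labels are made up for illustration, and `classification_metrics` is a hypothetical helper; in practice a library such as scikit-learn provides equivalent functions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 1 is the positive class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy alone can mislead when classes are imbalanced, which is why precision, recall and F1 are usually reported alongside it.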

Cloud Computing and Deployments

AI developers should be familiar with cloud-based ML and AI services offered by providers like Google Cloud Platform, AWS and Microsoft Azure. These platforms provide prebuilt models, APIs, resources, vector databases and notebooks for prototyping and building AI applications.

Additionally, knowledge of tools like Docker and Kubernetes comes in handy when packaging and deploying models on these platforms.

Ethics and Bias in AI

AI developers need to understand the ethical implications of AI. You’ll be responsible for making AI systems fair, accountable and transparent, both in the data used for training and in the models themselves.

Soft Skills and Continuous Learning

Beyond technical knowledge, you must be able to communicate effectively with nontechnical and technical stakeholders, collaborate with technology teams and keep up with the latest trends in AI and ML development.

While the skills above are tailored to AI specialists and AI-related companies, the shift in the AI ecosystem and its application across multiple industries has also affected how traditional software developers and companies build and leverage AI in their operations. Let’s look at that in detail.

How AI Is Shaping Software Development

For most developers, the first glimpse of the possibilities AI offers is through using automated code completion and generation tools like GitHub Copilot and Tabnine in their code editor or integrated development environments (IDEs). These AI tools are used to enforce standards, improve productivity and increase the pace of software development.

Beyond automating code generation, AI has also improved how developers:

  • Use static analysis to fix security vulnerabilities, code quality issues and bugs. Examples include DeepCode and Snyk.
  • Automate testing procedures and predict parts of the application that are most prone to malfunction.
  • Tackle repetitive project management tasks like task assignments, progress tracking and reporting.
  • Use automated code reviews to enforce coding standards and best practices.

The possibilities AI offers to developers are extensive, and most developers have used AI directly or indirectly in their operations.

Beyond developers leveraging AI tools to increase their productivity, companies are integrating AI into one or more areas of their businesses. This means developers can’t just be consumers of AI technology; they must also know how to integrate AI into business requirements.

How Accessible Is AI to Software Developers?

When companies want to build AI products or extend some areas of their business with AI, their go-to people are AI experts: AI engineers, ML engineers, data scientists, research scientists and deep learning engineers. This approach is usually expensive and difficult to scale.

However, AI has become more accessible to software developers, with companies such as OpenAI, Cohere and AWS offering:

  • Pretrained models and APIs
  • Transfer learning
  • Low-code/no-code AI platforms
  • Developer tools
  • Improved documentation and learning resources

Despite the accessibility these companies offer, both beginner and experienced developers need to understand how AI works and how it’s being used to power these tools.

“There are so many moving pieces, and they’re so novel,” Faulkes said. “How do they hang together? It’s almost frustrating for experienced developers. You hear all these terms, [but] how do you put them together? How do you get a generative AI application that’s going to work at scale when you don’t understand what large language models and vectors and things like that are? We’re not used to doing fuzzy things; we’re used to getting an answer out of a computer.”

The possibilities offered by AI companies will continue to shape the roles and responsibilities of developers who want to build the next generation of modern software.

How AI Is Transforming Software Developers’ Roles and Responsibilities

Pretrained models and APIs, transfer learning, low-code/no-code AI platforms and other AI-enabling solutions are blurring the lines between the traditional roles of the developer and the AI specialist. Developers across the board must have at least a basic understanding of AI concepts and how to leverage them in their domain.

Let’s look at how AI is shaping the roles and responsibilities of various developers.

Frontend Developers

As a frontend developer, you can leverage AI when building features such as chatbots, personalized recommendations, voice commands and conversational interfaces. These AI-powered elements enhance interactivity and offer a more personalized user experience.

Additionally, you are expected to be familiar with AI-driven tools that can generate layouts, optimize user interfaces and ensure web accessibility standards are met automatically.

AI skills required by frontend developers include:

  • Proficiency in frameworks and libraries like TensorFlow.js and Brain.js for embedding AI features on the frontend.
  • Familiarity with AI-based APIs and how to integrate them.
  • Knowledge of AI-powered layout generation and prototyping tools.

Backend Developers

As a backend developer, you are expected to design, develop and maintain infrastructure that reduces the latency of AI systems that require real-time processing and analytics. You’ll also ensure that the APIs the system consumes and exposes are secure and protected from malicious actors.

Additionally, you are expected to be familiar with AI-powered tools to predict and prevent potential system failures.

AI skills required by backend developers include:

  • Familiarity with MLOps tools and best practices for managing AI models.
  • Knowledge of ML frameworks and experience integrating ML models into backend systems.
  • Expertise in big data and real-time data processing technologies like Apache Kafka and Apache Flink.
  • Knowledge of extract, transform and load (ETL) processes for integrating data from multiple sources.

Full Stack Developers

As a full stack developer, you are expected to build applications that leverage AI capabilities both on the client side and server side and ensure seamless integration between the components. You’ll use AI-powered tools to test and debug across the full stack.

AI skills required by full stack developers include:

  • Proficiency in integrating AI-powered APIs and services across full stack development.
  • Knowledge of AI frameworks across both the client and server side of the application.
  • Familiarity with DevOps and MLOps for managing and deploying AI applications.
  • Knowledge of data handling and ETL processes.

Advancements in AI engineering and the opportunities they bring will continue to grow. They will shape the roles and responsibilities of AI specialists such as AI developers and also cut across traditional developer roles. Industries such as healthcare, finance and autonomous vehicles that leverage AI solutions will also influence the skills required of developers.

To learn more about the important role databases play in the development and deployment of AI applications, check out “An insider’s guide to AI databases.”

deep learning ai assignment github

PMC11335749

An end-to-end deep learning method for mass spectrometry data analysis to reveal disease-specific metabolic profiles

Yongjie Deng

1 Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, China

2 State Key Laboratory of Oncology in South China, Guangdong Provincial Clinical Research Center for Cancer, Sun Yat-sen University Cancer Center, Guangzhou, China

3 Metabolic Innovation Platform, Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, China

Tiantian Yu

Dingli Zhou, Chuanbo Xie

4 Department of Radiology, The First Affiliated Hospital of Sun Yat-sen University, Guangzhou, China

Weizhong Li

5 Sun Yat-Sen University School of Medicine, Sun Yat-Sen University, Shenzhen, China

6 Key Laboratory of Tropical Disease Control of Ministry of Education, Sun Yat-sen University, Guangzhou, China

Associated Data

All of the raw LC-MS data generated in this study have been deposited in a Code Ocean capsule under accession code 2328223 with a citable DOI number 10.24433/CO.2328223.v1. Source data for users to reproduce our research results can be downloaded from the Source Data file. The source data for the network and pathway results in Fig.  5 can be found in Supplementary Data  2 and 3 in the Supplementary Information.  Source data are provided with this paper.

The source code and the pretrained model for DeepMSProfiler are available in the GitHub repository ( https://github.com/yjdeng9/DeepMSProfiler ) for academic use 74 . The code on GitHub serves as an easy-to-use tool for running DeepMSProfiler.

Untargeted metabolomic analysis using mass spectrometry provides comprehensive metabolic profiling, but its medical application faces challenges of complex data processing, high inter-batch variability, and unidentified metabolites. Here, we present DeepMSProfiler, an explainable deep-learning-based method, enabling end-to-end analysis on raw metabolic signals with output of high accuracy and reliability. Using cross-hospital 859 human serum samples from lung adenocarcinoma, benign lung nodules, and healthy individuals, DeepMSProfiler successfully differentiates the metabolomic profiles of different groups (AUC 0.99) and detects early-stage lung adenocarcinoma (accuracy 0.961). Model flow and ablation experiments demonstrate that DeepMSProfiler overcomes inter-hospital variability and effects of unknown metabolites signals. Our ensemble strategy removes background-category phenomena in multi-classification deep-learning models, and the novel interpretability enables direct access to disease-related metabolite-protein networks. Further applying to lipid metabolomic data unveils correlations of important metabolites and proteins. Overall, DeepMSProfiler offers a straightforward and reliable method for disease diagnosis and mechanism discovery, enhancing its broad applicability.

Untargeted metabolomic analysis provides comprehensive metabolic profiling but faces challenges in medical application. Here, the authors present an explainable deep learning method for end-to-end analysis on raw metabolic signals to differentiate metabolomic profiles of cancers with high accuracy.

Introduction

Metabolomics offers a comprehensive view of small molecule concentrations within a biological system and plays a pivotal role in the discovery of disease biomarkers for diagnostic purpose 1 . Liquid chromatography mass spectrometry (LC-MS) is a widely practiced experimental technique in global metabolomic studies 2 , 3 . High sensitivity, stability, reproducibility, and detection throughput are the unique advantages of untargeted LC-MS 4 . Despite its capacity to measure thousands of ion peaks, the conventional metabolomic study by LC-MS remains a challenging task due to laborious data processing such as peak picking and alignment, metabolite annotation by comparing to authenticated databases, and data normalisation to control unwanted variability in a large-scale study 5 , 6 . The broader application of metabolomics in precision medicine may be impeded by obstacles such as complex data processing, high inter-batch variability, and burdensome metabolite identification 7 .

Untargeted metabolomics has been conducted on various human biological fluids, including serum and plasma, for the discovery of biomarkers in cancers such as hepatocellular carcinoma 8 , pancreatic 9 , prostate 10 , and lung cancers 11 – 13 . However, such biomarker discovery studies utilising metabolomics face significant challenges regarding reproducibility, likely due to signal drifts in cross-batch or cross-platform analysis 14 and the limited integration of data from different laboratory samples 15 . Furthermore, unknown metabolites are excluded when comparing detected features to authenticated databases 16 , which may hinder our ability in discovering new biomarkers associated with diseases. Several previous studies have combined machine learning with LC-MS for in vitro disease diagnosis and improved the efficiency of LC-MS data analysis. For example, Huang et al. conducted machine learning to extract serum metabolic patterns from laser desorption/ionisation mass spectrometry to detect early-stage lung adenocarcinoma 11 ; Chen et al. adopted machine learning models to conduct targeted metabolomic data analysis to identify non-invasive biomarker for gastric cancer diagnosis 17 ; Shen et al. developed a deep-learning-based Pseudo-Mass Spectrometry Imaging method and applied it in the prediction of gestational age of pregnant women, as well as the diagnosis of endometrial cancer and colon cancer 18 . However, these studies still face challenges such as batch effects and unknown metabolites in metabolomics 7 . Consequently, a new analytical approach is urgently needed to overcome the experimental bottlenecks and reveal disease-associated profiles comprising both identified and unknown components derived from LC-MS peaks.

Deep learning has been widely applied in various omics data analyses, holding promise for addressing the complexities of metabolomic data 19 . The encoding and modelling capabilities of deep learning offer a potential solution to overcome the aforementioned bottlenecks in handling intricate and high-dimensional data, mitigating bias in machine learning algorithms 20 , 21 . However, deep learning necessitates high-quality data and a sufficient quantity of samples, otherwise leading to issues like the curse of dimensionality and the overfitting of predictive models 22 . Moreover, integrating large dataset collected from multiple hospitals may introduce significant variations. Furthermore, as deep learning methods are usually perceived as “black-box” processes, the importance of model interpretability for prediction in the context of biomedical research is increasingly recognised 23 – 25 . Therefore, a deep learning model with both interpretability for biological soundness and capability to mitigate batch effects is highly desirable to enhance the reliability of large-scale metabolomic analyses for diagnostic purposes.

In this study, we develop an ensemble end-to-end deep learning method named deep learning-based mass spectrum profiler (DeepMSProfiler) for untargeted metabolomic data analysis. We first apply this method to differentiate healthy individuals from patients with benign lung nodules or lung adenocarcinoma using 859 serum samples from three distinct hospitals, and then extend the analysis to lipid metabolomic data derived from 928 cell lines to reveal metabolites and proteins associated with multiple cancer types. By skipping peak extraction and identification, and the errors these steps can introduce in conventional machine learning approaches, our method directly converts raw LC-MS data into outputs such as predicted classification, heatmaps illustrating key metabolite signals specific to each class, and metabolic networks that influence the predicted classes. Importantly, DeepMSProfiler effectively removes undesirable batch effects and variations across different hospitals and infers the unannotated metabolites associated with specific classifications. Furthermore, it leverages an ensemble-model strategy that optimises feature attribution from multiple individual models. DeepMSProfiler achieved an area under the receiver operating characteristic curve (AUC) score of 0.99 in an independent testing dataset, along with an accuracy of 96.1% in detecting early-stage lung adenocarcinoma. The results are explainable by locating the relevant biological components that contribute to the prediction. Our method provides a straightforward and reliable approach for metabolomic applications in disease diagnosis and mechanism discovery.

The overview of the ensemble end-to-end deep-learning model

The DeepMSProfiler method includes three main components: the serum-based mass spectrometry, the ensemble end-to-end model, and the disease-related LC-MS profiles (Fig.  1a ). In the first component, the raw LC-MS-based metabolomic data was generated using 859 human serum samples (Fig.  1a left) collected from 210 healthy individuals, 323 benign lung nodules, and 326 lung adenocarcinomas. The space of the LC-MS raw data contains three dimensions: retention time (RT), mass-to-charge ratio (m/z), and intensity. Using the RT and m/z dimensions, the data can be mapped from three-dimensional space into the frequency and time domains, respectively (Fig.  1b left to middle). Ion current maps and primary mass spectra can then be generated and used for metabolite identification (Fig.  1b middle to right). Conventional step-by-step methods of metabolomic analysis 5 , 22 (Supplementary Fig.  1 top) may lead to a large number of lost metabolic signals. To address these issues, DeepMSProfiler directly takes untargeted LC-MS raw data as model input, and builds an end-to-end deep learning model to profile disease-related metabolic signals (Supplementary Fig.  1 bottom).
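The mapping from the three-dimensional point cloud to a two-dimensional matrix can be illustrated with a minimal sketch. It is not the authors' preprocessing code: the function name `bin_lcms_points`, the bin counts, the value ranges, and the choice to keep the maximum intensity per cell are all assumptions made for illustration.

```python
def bin_lcms_points(points, rt_range, mz_range, n_rt_bins, n_mz_bins):
    """Map (rt, mz, intensity) triples onto an m/z-by-RT grid.

    Hypothetical helper: each cell keeps the maximum intensity that falls
    into it; points outside the given ranges are ignored.
    """
    rt_min, rt_max = rt_range
    mz_min, mz_max = mz_range
    grid = [[0.0] * n_rt_bins for _ in range(n_mz_bins)]
    for rt, mz, intensity in points:
        if not (rt_min <= rt <= rt_max and mz_min <= mz <= mz_max):
            continue
        # Scale each coordinate to a bin index; clamp the upper edge into the last bin.
        col = min(int((rt - rt_min) / (rt_max - rt_min) * n_rt_bins), n_rt_bins - 1)
        row = min(int((mz - mz_min) / (mz_max - mz_min) * n_mz_bins), n_mz_bins - 1)
        grid[row][col] = max(grid[row][col], intensity)
    return grid


# Toy data: (retention time, m/z, intensity). The last point lies outside the RT range.
points = [(0.5, 100.2, 10.0), (0.6, 100.4, 30.0), (9.9, 499.0, 5.0), (12.0, 300.0, 99.0)]
matrix = bin_lcms_points(points, rt_range=(0.0, 10.0), mz_range=(100.0, 500.0),
                         n_rt_bins=4, n_mz_bins=4)
```

Rows index m/z and columns index RT here; a convolutional network can then treat the grid like an image.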

Fig. 1

a The overview of DeepMSProfiler. Serum samples of different populations (top left) were collected and sent to the instrument (bottom left) for liquid chromatography-mass spectrometry (LC-MS) analysis. The raw LC-MS data, containing information on retention time (RT), mass-to-charge ratio (m/z), and intensity, is used as input to the ensemble model (middle). Multiple single convolutional neural networks form the ensemble model (centre) to predict the true label of the input data and generate three outputs (right), including the predicted sample classes, the contribution heatmaps of classification-specific metabolic signals, and the classification-specific metabolic networks. b The data structure of raw data. The mass spectra of different colours (centre) represent the corresponding m/z and ion intensity of ion signal groups recorded at different RT frames. All sample points are distributed in a three-dimensional space (left) which can be mapped along three axes to obtain chromatograms, mass spectra, and two-dimensional matrix data. Chromatograms and mass spectra are used for conventional qualitative and quantitative analysis (right), while the two-dimensional matrix serves as input data for convolutional neural networks. c The structure of a single end-to-end model. The input data undergoes pre-pooling to reduce dimensionality and become three-channel data. As the data passes through each convolutional layer (conv) in the feature extractor module, the weights associated with the original signals change continuously. The sizes of different frames in the enlarged layers (top) represent different receptive fields, with DenseNet allowing the model to generate more flexible receptive field sizes. After the last fully connected layer (FC), the classification results are produced.

The main model adopts an ensemble strategy and consists of multiple sub-models (Fig.  1a middle). The ensemble strategy is considered to be able to provide better generalisation 26 according to the complexity of the hypothesised space 27 , the local optimal search 27 and the parameter diversity 28 , 29 by the random training process, as well as the bias-variance theory 30 – 32 . The structure of each sub-model consists of three parts: a pre-pooling module, a feature extraction module, and a classification module (Fig.  1c ). The pre-pooling module transforms three-dimensional data into two-dimensional space through a max-pool layer, effectively reducing dimensionality and redundancy while preserving global signals (Fig.  1c left). The feature extraction module is based on a convolutional neural network to perform classification tasks by extracting category-related features (Fig.  1c middle). DenseNet121 is chosen as the backbone of the feature extraction module due to its highest accuracy and the least number of parameters (Supplementary Data  1 ). Additionally, the design of densely connected convolutional networks in DenseNet121 33 makes the model more flexible in terms of receptive field size, allowing adaptation to different RT intervals of metabolic signal peaks. After the non-linear transformation by the feature extraction layer, different weights could be assigned to the input original signal peaks for subsequent classification. We evaluated the effect of the number of sub-models on improving performance and consequently chose to use 18 sub-models for DeepMSProfiler (Supplementary Fig.  2 ). The classification module implements a simple dense neural network to compute the probabilities of different classes (Fig.  1c right).
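The dimensionality reduction performed by a max-pool layer can be sketched as follows. This is a generic non-overlapping 2D max pooling written in plain Python, not the pre-pooling module from the DeepMSProfiler repository; the function name `max_pool_2d` and the 2×2 window are illustrative assumptions.

```python
def max_pool_2d(matrix, pool_rows, pool_cols):
    """Non-overlapping max pooling over a 2D list of numbers.

    Hypothetical helper: trailing rows/columns that do not fill a whole
    window are dropped.
    """
    n_rows = len(matrix) // pool_rows
    n_cols = len(matrix[0]) // pool_cols
    pooled = []
    for i in range(n_rows):
        pooled_row = []
        for j in range(n_cols):
            window = [
                matrix[i * pool_rows + di][j * pool_cols + dj]
                for di in range(pool_rows)
                for dj in range(pool_cols)
            ]
            # Keeping the maximum preserves the strongest signal in each window.
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# Toy intensity matrix with a few isolated peaks.
signal = [
    [0, 1, 0, 0],
    [2, 0, 0, 7],
    [0, 0, 3, 0],
    [0, 5, 0, 0],
]
pooled = max_pool_2d(signal, 2, 2)  # 4x4 input reduced to 2x2
```

Each output cell keeps the largest peak of its window, which is why sparse, sharp signals survive the reduction in size.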

In the third component, each input sample results in three outputs from the model, including the predicted classification of the sample, the heatmap for the locations of the key metabolic signals, and the metabolic network that influences the predicted category (Fig.  1a right). In the predicted classification, the category with the highest probability is assigned as the predicted label of the model. In the heatmap presentation, the key metabolic signals associated with different classifications are inferred by the perturbation-based method, and the m/z and RT of the key metabolic signals can be located. Finally, the model infers the underlying metabolite-protein networks and metabolic pathways associated with these key metabolic signals directly from m/z (see Methods).

Model performance metrics

To test DeepMSProfiler’s ability to classify disease states and discover disease-related profiles, we performed a global metabolomic analysis of serum samples from healthy individuals and patients with benign lung nodules or lung cancer. We collected serum samples from three different hospitals to construct and validate the model. Benign nodule cases were followed for up to 4 years and lung adenocarcinoma samples were pathologically examined (see Methods). We built the model using 859 untargeted LC-MS samples, of which 686 formed the discovery dataset and 173 formed the independent testing dataset (Fig. 2a). The samples were generated from 10 batches, as shown in Supplementary Table 1. Statistical analysis shows a significant correlation between lesion size and disease type, but the correlations between other clinical factors are not significant (Supplementary Table 2). To avoid confounding effects from clinical features, we performed a further analysis of their distributions, which shows no significant difference in the distribution of lesion diameter and patient age between the discovery dataset and the independent testing dataset (Supplementary Fig. 3). The samples in the discovery dataset were randomly divided into two subsets: a training dataset (80%) for parameter optimisation and a validation dataset (20%) to cross-validate the performance of different models. Remarkably, the accuracies of our DeepMSProfiler in the training dataset, validation dataset, and independent testing dataset are 1.0, 0.92, and 0.91, respectively (Supplementary Table 3).
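A random 80/20 division of the discovery dataset can be sketched as below. The helper name `split_discovery`, the fixed seed, and the rounding rule (flooring the training size) are assumptions; the text does not state how the authors rounded or whether they stratified by class.

```python
import random


def split_discovery(sample_ids, train_fraction=0.8, seed=0):
    """Shuffle sample IDs and split them into training and validation subsets.

    Hypothetical helper: uses a simple unstratified shuffle with a fixed seed.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_fraction)  # floor of the training share
    return ids[:n_train], ids[n_train:]


# 686 discovery samples, as reported in the text; IDs are placeholders.
train_ids, val_ids = split_discovery(range(686))
```

In practice a stratified split, which keeps the class proportions equal in both subsets, is often preferred for imbalanced clinical data.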

Fig. 2

a The sample allocation chart. The outer ring indicates the types of diseases and the inner ring indicates the sex distribution. Healthy: healthy individuals; Benign: benign lung nodules; Malignant: lung adenocarcinoma. b Predicted receiver operating characteristic (ROC) curves of different methods. Random: performance baseline in a random state. Comparison of performance metrics of different methods ( n  = 50): accuracy ( c ), precision ( d ), recall ( e ), and F1 score ( f ). The blue areas show the different conventional analysis processes using machine learning methods, and the red areas display different end-to-end analysis processes using deep learning methods. The boxplot shows the minimum, first quartile, median, third quartile and maximum values, with outliers as outliers. g Model accuracy rates for different age groups. The sample sizes for different groups are 52, 69, 40, and 12, respectively. h Model accuracy rates for different lesion diameter groups. The sample sizes for different groups are 27, 37, 18, 13, and 34, respectively. The boxplot shows the minimum, first quartile, median, third quartile and maximum values. i Prediction accuracy and parameter scale of different model architectures. j The confusion matrix of the DeepMSProfiler model. The numbers inside the boxes are the number of matched samples between the true label and the predicted label. The ratio in parentheses is the number of matched samples divided by the number of all samples of the true label.

In the independent testing dataset, DeepMSProfiler significantly outperforms traditional methods and single deep learning models (Fig. 2b–h). Compared with Support Vector Machine (SVM), Random Forest (RF), Deep Learning Neural Network (DNN), Adaptive Boosting (AdaBoost), and Extreme Gradient Boosting (XGBoost) based on traditional methods, and Densely Connected Convolutional Networks (DenseNet121) using raw data, DeepMSProfiler presents the highest area under the curve (AUC) of 0.99 (Fig. 2b). Notably, DeepMSProfiler exhibits higher specificity than other models while maintaining high sensitivity, indicating its ability to accurately identify true negatives (Supplementary Table 4).

Regarding overall performance against other models, our model achieves the best performance in multiple evaluation metrics: accuracy of 95% (95% CI, 94%–97%) (Fig.  2c ), precision of 96% (95% CI, 94%–97%) (Fig.  2d ), recall of 95% (95% CI, 94%–96%) (Fig.  2e ), and F1 of 98% (95% CI, 97%–98%) (Fig.  2f ). Compared to XGBoost, our model performs better in different groups of lesion sizes and ages (Fig.  2g, h ), except for samples from patients over 70 years old. DeepMSProfiler is also superior to commonly used single deep learning models, such as DenseNet121 (Fig.  2i ). When using the ensemble strategy, we did not set different weights for each sub-model, so each sub-model is equally involved in the final prediction. The confusion matrices of prediction performance for each sub-model (Supplementary Fig.  4 ) show their contributions to the overall results. The robust performance, coupled with the efficiency in terms of computational resources, makes DeepMSProfiler a promising choice for the classification tasks (Fig.  2i ).
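Equal participation of the sub-models can be sketched as an unweighted average of their class probabilities, followed by picking the most probable class. Averaging probabilities is an assumption here: the text says only that no sub-model weights were set, so majority voting would be an equally plausible reading. The function name and the toy probabilities are illustrative.

```python
CLASSES = ["Healthy", "Benign", "Malignant"]


def ensemble_predict(sub_model_probs):
    """Average class probabilities over sub-models with equal weights.

    Hypothetical helper: returns the mean probability vector and the index
    of the class with the highest mean probability.
    """
    n_models = len(sub_model_probs)
    n_classes = len(sub_model_probs[0])
    mean_probs = [
        sum(probs[c] for probs in sub_model_probs) / n_models
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda c: mean_probs[c])
    return mean_probs, best


# Toy outputs of three sub-models for one sample (the paper uses 18 sub-models).
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]
mean_probs, label = ensemble_predict(probs)
predicted_class = CLASSES[label]
```

The first sub-model alone would predict "Healthy", but the averaged vector favours "Benign", which shows how disagreement between sub-models is resolved.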

Furthermore, DeepMSProfiler demonstrates consistent performance across different categories. The AUCs for lung adenocarcinoma, benign lung nodules, and healthy individuals all reach 0.99 (Supplementary Fig. 5), and their respective classification accuracies are 85.7%, 90.8%, and 97.0% (Fig. 2j). Most importantly, our model performs well in detecting stage I lung adenocarcinoma, with an accuracy of 96.1%, indicating its potential as an effective method for early lung cancer screening.
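Per-class accuracy, as defined in the Fig. 2j caption (matched samples divided by all samples of the true label), can be read off a confusion matrix together with precision and F1. The sketch below uses a made-up confusion matrix, not the paper's values, and the function name is hypothetical.

```python
def per_class_metrics(confusion):
    """Compute recall, precision and F1 per class.

    Rows are true labels and columns are predicted labels. Recall here
    equals the per-class accuracy used in the confusion-matrix caption.
    """
    n = len(confusion)
    metrics = []
    for k in range(n):
        true_pos = confusion[k][k]
        row_total = sum(confusion[k])                     # all samples of true class k
        col_total = sum(confusion[r][k] for r in range(n))  # all samples predicted as k
        recall = true_pos / row_total if row_total else 0.0
        precision = true_pos / col_total if col_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({"recall": recall, "precision": precision, "f1": f1})
    return metrics


# Illustrative confusion matrix (NOT the paper's data): Healthy, Benign, Malignant.
confusion = [
    [8, 2, 0],
    [1, 8, 1],
    [0, 1, 9],
]
metrics = per_class_metrics(confusion)
overall_accuracy = sum(confusion[k][k] for k in range(3)) / sum(map(sum, confusion))
```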

Insensitivity to batch effects

Batch effects are one of the most common sources of error in the analysis of metabolomic data. To evaluate the impact of batch effects on the non-targeted LC-MS data, we first generated 3 biological replicates as reference samples for each of the 10 batches. These reference replicates were taken from a mixture of 100 healthy human serum samples, and each of them contains equal amounts of isotopes including 13C-lactate, 13C3-pyruvate, 13C-methionine, and 13C6-isoleucine (see Methods). The differences in the data structure of reference replicates from different batches are visualised in 3D and 2D illustrations (Fig. 3a and Supplementary Fig. 6), which show changes in peak shape and area as well as RT shifts among different batches. Comparison of individual isotopic peaks in samples from different batches also shows that the batch effects mainly take the form of RT shifts and differences in peak shapes and areas for the same metabolites (Fig. 3b).

Fig. 3

a Batch effects in 3D point array and 2D mapped heatmap of reference samples. RT: retention time; m/z: mass-to-charge ratio. b Isotope peaks of the same concentration in different samples. Different colours represent the batches to which the samples belong. c The visualisation of dimensionality reduction of normalisation by the Reference Material method. Below: different colours represent different classes; Above: different colours indicate different batches. Healthy: healthy individuals; Benign: benign lung nodules; Malignant: lung adenocarcinoma. d The visualisation of dimensionality reduction for the output data of the hidden layers in DeepMSProfiler. Conv1 to Conv5 are the outputs of the first to the fifth pooling layer in the feature extraction module. Block4 and Block5 are the outputs of the fourth and fifth conv layers in the fifth feature extraction module. Upper: different colours indicate different sample batches; Lower: different colours represent different population classes. e Correlation of the output data of the hidden layer with the batch and class information in DeepMSProfiler. The horizontal axis represents the layer names. Conv1 to Conv5 are the outputs of the first to the fifth pooling layer in the feature extraction module. Block10 and Block16 are the outputs of the tenth and sixteenth conv layers in the fifth feature extraction module. The blue line represents the batch-related correlations, and the orange line illustrates the classification-related correlations. f The accuracy rates of traditional methods (blue), corrected methods based on reference samples (purple), and DeepMSProfiler (red) in independent testing dataset ( n  = 50). The boxplot shows the minimum, first quartile, median, third quartile and maximum values, with outliers as outliers.

We then investigated the batch effect corrections and compared the performance between DeepMSProfiler and conventional correction methods (see Methods). As shown in Fig. 3c, after correction by the Reference Material (Ref-M) method 34, we still observed 3 clusters in the principal component analysis (PCA) profiles, which represent the sample dots from three different hospitals. While Ref-M effectively addresses the batch effect within samples from the same hospital, the residual variation across hospitals remains (Fig. 3c). Samples of batch 1–6 and 9–10 were obtained from the Sun Yat-Sen University Cancer Centre, and samples of batch 8 came from the First Affiliated Hospital of Sun Yat-Sen University. Samples of batch 7 were a mixture from three different hospitals: lung cancer samples came from the Affiliated Cancer Hospital of Zhengzhou University, lung nodule samples came from the First Affiliated Hospital of Sun Yat-Sen University, and healthy samples came from the Sun Yat-Sen University Cancer Centre. The 100 healthy human serum samples used as reference during conventional procedures were all from the Sun Yat-Sen University Cancer Centre, which might be the main reason why it is difficult to correct batch effect for batch 7–9.

To illustrate DeepMSProfiler’s end-to-end process in the automatic removal of batch effects, we extracted the output of the hidden layer and visualised the flow of data during the forward propagation of the network (see Methods). From the input layer to the output layer, the similarity between different batches becomes progressively higher, while the similarity between different types becomes progressively lower (Fig. 3d and Supplementary Fig. 7). DenseNet121 is a deep neural network with 431 layers, of which 120 are convolutional layers (Supplementary Data 1). The fourth and fifth layers refer to the outputs of the fourth and fifth densely connected modules, and there are 112 layers between them. Figure 3d and Supplementary Fig. 7 illustrate the intermediate changes from the output of the fourth densely connected module to the output of the fifth densely connected module. When the batch effects are removed, the classification becomes clearer.

We further quantified this process of change using different metrics that measure the correlation between the PCA clusters and the given labels, i.e., K-nearest neighbour batch effect test score 35, local inverse Simpson’s index 36, adjusted rand index (ARI), normalised mutual information (NMI), and average silhouette coefficient (ASC) (see Methods). The closer to the output layer, the less relevant the data is to batch labels and the more relevant to class labels (Fig. 3e). This explains how the batch effect removal is achieved via progress through hidden layers (Fig. 3d). This capability might be gained via supervised learning. Our findings suggest that in the forward propagation process, the DeepMSProfiler model excludes batch-related information from the input data layer by layer, while retaining class-related information.
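Of the metrics listed above, the adjusted Rand index is compact enough to sketch in plain Python. The example compares one hypothetical cluster assignment against class labels and against batch labels; the data are invented, the function assumes at least two samples, and it is not the authors' evaluation code (the other metrics are not shown).

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same samples.

    Assumes len(labels_a) == len(labels_b) >= 2. Returns 1.0 for identical
    partitions and values near 0 (possibly negative) for unrelated ones.
    """
    n = len(labels_a)
    # Pairs of samples grouped together under both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical cluster assignment of six samples from a late hidden layer.
clusters = [0, 0, 0, 1, 1, 1]
class_labels = ["healthy"] * 3 + ["malignant"] * 3
batch_labels = [1, 2, 3, 1, 2, 3]

ari_class = adjusted_rand_index(clusters, class_labels)
ari_batch = adjusted_rand_index(clusters, batch_labels)
```

A high value against class labels combined with a low value against batch labels is the pattern the text describes for layers close to the output.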

Further, we compared the performance of our deep learning method against machine learning methods with and without batch effect correction. The correction of batch effects could improve accuracies when using machine learning classifiers such as SVM, RF, AdaBoost, and XGBoost. However, DeepMSProfiler, without any additional manipulation, surpassed the machine learning methods with or without batch effect correction in terms of prediction accuracy (Fig.  3f ).

The impact of unknown mass spectrometric signals

To investigate the impact of unknown metabolite signals on classification prediction, we first performed the conventional analysis with peak extraction and metabolite annotation using existing databases such as Human Metabolome Database (HMDB) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (see Methods). We found that 83.5% of all detected features remain as unknown metabolites (Fig.  4a ). The absence of these unknown metabolites undermines the prediction accuracy (Fig.  4b ), indicating this large number of unknown metabolites may impose a significant impact on classification performance. One of the advantages of our approach over traditional methods is the ability to retain complete metabolomic features including the unknown metabolites.

Fig. 4

a Statistics of annotated metabolite peaks. Blue colour represents all peaks, while orange, purple, and yellow colours indicate metabolites annotated in HMDB, KEGG, and all databases, respectively. The overlap between orange and purple includes 414 metabolites annotated in both HMDB and KEGG. HMDB: Human Metabolome Database; KEGG: Kyoto Encyclopedia of Genes and Genomes. b The feature selection plot illustrates the effect of different contribution score thresholds when removing unknown metabolites versus not removing them. The horizontal axis represents the change in threshold, while the vertical axis shows the accuracy of the model using the remaining features. The shadings of solid lines (mean) represent error bars (standard deviation). c Collection standard of published lung cancer serum metabolic biomarkers. SCLC: Small Cell Lung Cancer; LUSC: Lung Squamous Cell Carcinoma; LUAD: Lung Adenocarcinoma; NMR: Nuclear Magnetic Resonance; MS: Mass Spectrometry. d The number counts of known biomarkers published in the current literature. e Molecular weight distribution plot of known biomarkers. f Accuracy comparison between the ablation experiment and DeepMSProfiler (n = 50). The boxplot shows the minimum, first quartile, median, third quartile and maximum values. In the ablation experiment, we investigated the effect of varying the publication count (PC) of known biomarkers in the literature. Specifically, based on the m/z of known biomarkers, we eliminated from the original data the metabolic signals that had not been reported. We retained only the metabolic signals with publication counts greater than 1, greater than 3, and greater than 8 for model development. All ablated data was analysed using the same DeepMSProfiler architecture as the original unprocessed data. The vertical axis shows the accuracy of models built on the dataset of different publication counts and our DeepMSProfiler.

We then tested the limits of lung cancer prediction using biomarkers identified from annotated metabolites. We collected 826 biomarkers for lung cancer based on serum mass-spectrometry analysis from 49 publications (Fig. 4c). We deduced the molecular weight of the biomarkers from the HMDB database and the information in these publications (see Methods). Only 42.7% of the biomarkers discovered based on traditional methods appear in more than two articles, and their reproducibility is suboptimal (Fig. 4d). In addition, the molecular weights of these metabolites are mainly distributed in the range of 200–400 Da (Fig. 4e). We found that the prediction performance of DeepMSProfiler using complete raw data is more accurate than when using only the m/z signals corresponding to the reported biomarkers (Fig. 4f). This indicates that there are still unknown metabolomic signals in the serum samples related to lung cancer that have not been unveiled in the current research. In contrast, DeepMSProfiler derives the classifications directly from the complete signals in the raw data.

Explainability to uncover the black box

After the model construction based on deep learning, we sought to explain the classification basis of the black box model and identify the key signals for specific classifications. We adopted a perturbation-based method, Randomised Input Sampling for Explanation (RISE) 37 , to count feature contributions. We slightly modified RISE to improve its operational speed and efficiency, and developed a method to evaluate the importance of RISE scores in different classifications (see Methods).
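The idea behind RISE can be sketched in a few lines: mask the input at random many times, score each masked input with the model, and credit every kept feature with that score. The version below is a simplified illustration that masks individual features of a flat vector; it omits the smooth upsampled masks of the original RISE and does not reproduce the authors' modifications. `toy_model` is a hypothetical scorer that depends on a single feature.

```python
import random


def rise_saliency(model, x, n_masks=2000, keep_prob=0.5, seed=0):
    """Estimate per-feature importance by random masking (simplified RISE).

    Each feature's saliency is the average model score over masks that
    keep it, normalised by the keep probability.
    """
    rng = random.Random(seed)
    saliency = [0.0] * len(x)
    for _ in range(n_masks):
        mask = [1 if rng.random() < keep_prob else 0 for _ in x]
        masked = [xi * mi for xi, mi in zip(x, mask)]
        score = model(masked)
        for i, kept in enumerate(mask):
            if kept:
                saliency[i] += score  # credit kept features with this score
    return [s / (n_masks * keep_prob) for s in saliency]


def toy_model(features):
    """Hypothetical classifier score: responds only to the feature at index 2."""
    return 1.0 if features[2] > 0 else 0.0


x = [1.0, 1.0, 1.0, 1.0]
scores = rise_saliency(toy_model, x)
most_important = scores.index(max(scores))
```

Because the method only needs model outputs, it treats the network as a black box, which is what makes it applicable to any trained classifier.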

Interestingly, we found a “background category” phenomenon in some of the single models (Fig. 5a). For each class in a single tri-classification model of DenseNet121, the classification performance gradually deteriorates as the metabolic signals with higher contribution scores are removed. However, there is one category that is always unaffected by all the features involved in the classification decision, while maintaining a very high number of true positives and false positives. These findings imply that the tri-classification model only predicts the probabilities of two of the categories, and calculates the probability of the third category from the results of the other two. In other words, the classification-related metabolites associated with the “foreground category” negatively contribute to the “background category”. Furthermore, the categories used as “background category” are not consistent across the different models. To test whether this phenomenon occurs exclusively in metabolomic data, we conducted a seven-classification task using the Photo-Art-Cartoon-Sketch (PACS) image dataset 38. We observed a similar phenomenon in the resultant feature scoring of different models (Supplementary Fig. 8). This suggests that the “background category” phenomenon may generally arise in multi-class classification tasks handled by single models, although its underlying mechanism is currently unclear and may require future investigation.

Fig. 5

a Prediction performance and feature scoring by different single models. b Prediction performance and feature scoring by DeepMSProfiler. c Heatmap matrices of classification contribution in healthy individuals (Healthy), benign lung nodules (Benign), and lung adenocarcinoma (Malignant). The horizontal and vertical axes of the matrix are the prediction label and the true label, respectively. The heatmaps of upper left, the middle one, and the bottom right represent the true healthy individuals, the true benign nodules, and the true lung adenocarcinoma, respectively. The horizontal and vertical axes of each heatmap are RT and m/z, respectively. The classification contributions of metabolites corresponding to true healthy individuals ( d ), benign nodules ( e ), and lung adenocarcinoma ( f ). The horizontal axis represents the retention time and the vertical axis represents the m/z of the corresponding metabolites. The colours represent the contribution score of the metabolites. The redder the colour, the greater the contribution to the classification. Metabolites-proteins network for healthy individuals ( g ), benign nodules ( h ), and lung adenocarcinoma ( i ). Pathway enrichment analysis using the signalling networks for healthy individuals ( j ), benign nodules ( k ), and lung adenocarcinoma ( l ) (FDR < 0.05). m/z: mass charge ratio, RT: retention time.

In contrast, the “background category” phenomenon no longer exists when the feature contributions are calculated by our ensemble model. As shown in Fig. 5b, when we progressively eliminated metabolic signals in each category according to their contribution, the performance of the ensemble model decreased accordingly. Each individual model captures different features that contribute to its corresponding classifications, while the ensemble model combines these features to improve the accuracy of disease prediction and reduce the possibility of overfitting.
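The progressive elimination described above can be sketched as a simple deletion test. This is an illustrative simplification that tracks the predicted probability of one class for one sample, whereas the study reports classification performance over the dataset; `deletion_curve` and its parameters are hypothetical names.

```python
import numpy as np

def deletion_curve(model_fn, x, scores, class_idx, fractions=(0.0, 0.1, 0.2, 0.5)):
    """Zero out the highest-scoring signals and record the class probability.

    model_fn: callable mapping a batch (n, H, W) to probabilities (n, n_classes).
    x: 2D input matrix; scores: contribution scores with the same shape as x.
    """
    # Flat indices sorted from highest to lowest contribution score.
    order = np.argsort(scores, axis=None)[::-1]
    results = []
    for frac in fractions:
        n_remove = int(round(frac * order.size))
        perturbed = x.copy().ravel()
        perturbed[order[:n_remove]] = 0.0  # remove the top-scoring signals
        prob = model_fn(perturbed.reshape(x.shape)[None])[0, class_idx]
        results.append((frac, float(prob)))
    return results
```

If the scores are informative for the chosen class, the probability should fall as the removed fraction grows; a flat curve would correspond to the “background category” behaviour.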

Metabolomic profiles in lung adenocarcinoma, benign nodules, and healthy individuals

To analyse the global metabolic differences between lung adenocarcinoma, benign nodules, and healthy individuals, we extracted the heatmaps of feature contributions computed by RISE from DeepMSProfiler (Fig. 5c). As shown in Fig. 5d–f, the horizontal and vertical labels in the heatmaps represent retention time and m/z, respectively. By mapping the label information to the heatmaps, we are able to locate the metabolites corresponding to different m/z and retention times and obtain their feature contribution scores. In true-positive healthy and benign nodule samples, the metabolic signals with the most significant contribution are uniformly located between 200 and 400 m/z and in 1–3 min (Fig. 5d, e). In comparison, the metabolic signals located between 200 and 600 m/z and in 1–4 min contribute most in lung adenocarcinoma samples, but signals in other regions also have relatively high scores (Fig. 5f).

As higher contribution scores in the heatmaps represent more important correlations, we screened signals with scores above 0.70 and attempted to identify the corresponding metabolic profiles in each classification (Supplementary Data 2). As observed in Fig. 5b, when only metabolic signals with a contribution score above 0.70 are retained, the overall accuracy is around 0.8, so the classification remains effective. Considering the RT shift among different batches, we matched metabolic peaks only by m/z. We then fed these m/z signals, together with metabolites identified by tandem mass spectrometry (MS2), into the analysis tool PIUMet, which is based on protein–protein and protein–metabolite interactions 39, to build disease-associated feature networks (Fig. 5g–i). As shown in the network in Fig. 5i, 82 proteins and 121 metabolites are matched in the lung adenocarcinoma samples, including 9 already identified by MS2 and 111 hidden metabolites found by the correlation between key metabolic peaks. As such, the current analysis based on protein–protein and protein–metabolite interactions allows the discovery of unknown metabolic signals associated with diseased states, although the resolution of the current model might be too low to distinguish all individual peaks contributing to the disease classification. To explore the biological explainability, among the features extracted by PIUMet, we also selected 11 metabolites (Supplementary Table 5) with authentic standards available in our laboratory to verify their presence in the lung cancer serum samples. Indeed, these metabolites could be identified in the lung cancer serum as described in our previous study 40. We further analysed the metabolic networks to explore the biological relevance associated with each classification. The heatmaps (Fig. 5d, e) and pathway analysis (Fig. 5j, k) consistently show that healthy individuals and benign nodules share similar metabolic profiles. In contrast, the cancer group presents a distinct profile with specific pathways and increasing counts of metabolites or proteins in the pathways shared with healthy individuals or benign nodules (Fig. 5f, l). The detailed metabolite and protein candidates are shown in Supplementary Data 3. Taken together, our network and pathway analyses demonstrate the interpretability of DeepMSProfiler based on deep learning.
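The screening step (scores above 0.70, matched by m/z only) might look like the sketch below. The array layout (rows as RT bins, columns as m/z bins), the bin origin and width, and the function name `select_key_mz` are assumptions made for illustration, with scores assumed to lie in [0, 1].

```python
import numpy as np

def select_key_mz(contribution, mz_start=50.0, mz_step=1.0, threshold=0.70, mz_axis=1):
    """Return the m/z bin values whose contribution exceeds the threshold at any RT.

    contribution: 2D array of contribution scores.
    mz_axis: axis of the array that indexes m/z (assumed here to be the columns).
    """
    rt_axis = 1 - mz_axis
    # Collapse the RT axis so that peaks are matched by m/z only,
    # which sidesteps retention-time shifts between batches.
    per_mz = contribution.max(axis=rt_axis)
    idx = np.flatnonzero(per_mz > threshold)
    return mz_start + idx * mz_step
```

The resulting m/z list is the kind of input that could then be passed to a network tool such as PIUMet.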

Application of model in colon cancer

To assess the transferability of DeepMSProfiler, we obtained a public colon cancer LC-MS dataset of 236 samples from MetaboLights (ID: MTBLS1129). The dataset contains 197 colon cancer samples and 39 healthy human samples. We randomly divided this dataset into a discovery dataset and an independent testing dataset at a 4:1 ratio. The discovery dataset contained 157 colon cancer samples and 31 healthy control samples. The independent testing dataset contained 40 colon cancer patient samples and 8 healthy control samples.
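The reported counts are consistent with a per-class 4:1 split in which the testing share is rounded up. The short check below illustrates this; the rounding rule and the helper name `split_counts` are assumptions, not details stated in the text.

```python
def split_counts(n_samples, train_parts=4, test_parts=1):
    """Split a class of n_samples at train_parts:test_parts, rounding the test share up."""
    total = train_parts + test_parts
    n_test = -(-n_samples * test_parts // total)  # integer ceiling division
    return n_samples - n_test, n_test

# Per-class counts for the colon cancer dataset described above.
cancer_train, cancer_test = split_counts(197)
healthy_train, healthy_test = split_counts(39)
```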

Due to the differences in cancer types and mass spectrometry analysis procedures between the colon cancer dataset and the lung adenocarcinoma dataset, we re-trained the DeepMSProfiler model. The colon cancer data was randomly divided into a discovery dataset and an independent testing dataset, and the discovery dataset was further randomly divided, multiple times, into a training dataset and a validation dataset. On the independent testing dataset of the colon cancer dataset, our model achieved an accuracy of 97.9% (95% CI, 97.7%–98.1%), a precision of 98.7% (95% CI, 98.6%–98.8%), a recall of 93.4% (95% CI, 92.9%–94.1%), and an F1 of 95.8% (95% CI, 95.4%–96.2%) (Supplementary Fig. 9). These results suggest excellent transferability of DeepMSProfiler.

Discovery of metabolic-protein networks in pan-cancer

In a continued effort to investigate the capabilities of DeepMSProfiler in analysing metabolomics data across multiple cancer types, raw lipid metabolomic data of 928 cell lines spanning 23 cancer types were collected from the Cancer Cell Line Encyclopaedia (CCLE) database 2 and then processed by DeepMSProfiler. Notably, in addition to the raw metabolomic data, these cell lines also have valuable data on annotated metabolites, methylation, copy number variations, and mutations 2. DeepMSProfiler constructed a model encompassing the 23 distinct categories, and features were then extracted from this 23-category model to identify the crucial metabolic signals of each category. Because many cancer types had a limited number of samples, particularly biliary tract, pleura, prostate, and salivary gland cancers, each with fewer than 10 samples, we did not set aside a separate independent testing dataset for performance validation. Twenty sub-models were trained, and for each sub-model, 80% of all samples were randomly allocated for training to ensure that every sample could contribute to the training process, especially for cancer types with very few samples. The final ensemble model used for explainable analysis achieved 99.3% accuracy, 97.2% sensitivity, and 100% specificity. Next, the prize-collecting Steiner forest optimisation algorithm 39 was employed to unveil the correlation between pivotal metabolic signals and proteins using the HMDB 41, Recon2 42 and iRefIndex 43 databases (see Methods).

As a result, we successfully generated disease-specific metabolite-protein networks (Fig. 6a–c) along with a contribution score heatmap (Fig. 6e), where contribution scores exceeding 0.70 were considered indicative of disease-specific metabolites. Metabolites identified within the metabolite-protein network were directly inferred from the mass-to-charge ratio (m/z) of metabolic signals from the raw data using feature spectra extracted by the DeepMSProfiler model. Notably, we identified 14 metabolites and 3 proteins that co-occurred within the 23 cancer-related metabolite-protein networks (Fig. 6d). Finally, we correlated the metabolic data with the methylation information and subsequently verified the associations between the PLA and UGT gene families and the disease-specific metabolites of high contribution (Fig. 6f). Previous studies 44 – 48 have reported the important roles of PLA and UGT gene families in a variety of diseases, such as PLA2G7 and PLA2G6 in breast and prostate cancers and neurodegenerative diseases, as well as UGT3A2 in head and neck cancers. This evidence supports our findings obtained with DeepMSProfiler. In summary, our extended analysis spanning pan-cancer scenarios highlights the capability of DeepMSProfiler in the discovery of potential disease-associated metabolites and proteins.

Fig. 6

Metabolite-protein networks for ( a ) lung cancer, ( b ) gastric cancer, and ( c ) leukaemia. Yellow squares: metabolites. Red circles: proteins. Blue labels: metabolites and proteins shared in 23 cancer metabolite-protein networks. d Metabolites and proteins shared in the metabolite-protein networks of 23 cancer types. e Heatmap of the classification contribution of different lipid metabolites across 23 cancer types. f Correlation of important pan-cancer-related metabolites with methylation of the PLA and UGT gene families.

Discussions

Metabolomics faces challenges in precision medicine due to complex analytical processes, metabolic diversity, and database limitations 5 , 6 . DeepMSProfiler starts with raw untargeted metabolomic data and retains essential information, enabling more effective global analysis. It offers an alternative approach by directly processing the raw data of metabolomic samples, bypassing time-consuming experiments such as quality control or reference sample preparation and the subsequent normalisation analysis.

In metabolomic studies, systematic variations in the measured metabolite profiles may occur during sample collection, processing, and analysis, or between different batches of reagents or instruments. Batch effects can significantly impact the interpretation of the results, leading to inconsistencies in replicating findings across different studies 15 . While batch effects can manifest as variations in retention time (RT) offset, peak area, and peak shape, conventional quantitative methods often prioritise peak area integration while overlooking peak shape 49 , 50 . Significantly, our results demonstrate that DeepMSProfiler is able to automatically eliminate cross-hospital variations during the end-to-end forward propagation process (Fig. 3d), effectively revealing classification profiles.

Moreover, DeepMSProfiler can address the challenges of unidentified metabolites. LC-MS metabolomics can reveal tens of thousands of metabolite peaks in a biological sample. A substantial number of these peaks remains unidentified or unannotated in existing databases. In this study, we demonstrated that among all detected peaks, only 16.5% are identified by HMDB and KEGG. However, the presence of a significant proportion of unknown metabolites has a considerable influence on the accuracy of classification (Fig.  4b ). Indeed, annotating metabolomic peaks has remained a major study focus in the field 16 . A common approach involves comparing the exact mass of detected peaks with authenticated standards, along with either the retention time or the fragmentation spectra obtained through tandem mass spectrometry (MS2). Despite significant development of molecular structural databases and MS2 spectral databases, their current capabilities and coverage remain limited 51 . In addition, network analysis, which examines complex peak relationships and clusters, has also been developed to facilitate the comprehensive identification of metabolites 52 . In this study, we employed the deep learning method to capture original signals in LC-MS metabolomic analysis without compromising data integrity. We further implemented a direct transition from m/z to pathway annotations by taking advantage of the network-based analysis tool PIUMet 39 , effectively identifying 82 proteins and 121 metabolites in the cancer group, compared with 9 metabolites annotated by MS2.

Furthermore, our method is able to cover the metabolites identified by conventional annotation and simultaneously uncovers the undetected disease-specific features. In the traditional metabolomic analysis, biomarkers specific to the disease of interest are usually sought by comparison of metabolite levels between control and case samples. Therefore, peak alignment and metabolite annotation are crucial to the end results. Here, by employing the end-to-end strategy, we unveiled the complete biological outputs that contribute to the distinct metabolomic profiles of each group. For example, tryptophan metabolism was identified among the characteristics of lung adenocarcinoma profile (Fig.  5l ). The result was consistent with our previous discovery by the conventional annotation method that metabolites in the tryptophan pathway were decreased in the early-stage lung adenocarcinoma compared with benign nodules and healthy controls 40 . Serine and glycine are also important for nucleotide synthesis by mediating one-carbon metabolism, which is relevant to therapeutic strategy targeting non-small cell lung cancer 53 – 56 . Intriguingly, we also observed the contribution of bile secretion in the lung adenocarcinoma profile (Fig.  5l ), which aligns with another report of aberrant bile acid metabolism in invasive lung adenocarcinoma 57 . However, it should be noted that the resolution of our model may be limited to distinguish all individual peaks contributing to the disease classification.

We additionally demonstrated that among deep learning models, ensemble models are more stable and class-balanced than single models. Although we have not fully understood why the “background category” occurs, the ensemble strategy effectively mitigated this phenomenon (Fig. 5a, b). Our investigation on the PACS image dataset suggests that the “background category” may generally exist in multi-class classification tasks using single models. Understanding its underlying mechanism requires further investigation with a broader range of datasets.

The high-resolution heatmaps generated by DeepMSProfiler display the feature contributions to the predicted classes and the precise location of specific metabolomic signals (Fig. 5c), providing explainable analysis to assure researchers of the biological soundness of the prediction. With the capability of batch effect removal, comprehensive metabolomic profiling, and an ensemble strategy, DeepMSProfiler demonstrates consistent and robust performance across different categories. It achieves AUCs over 0.99 for the predictions of lung adenocarcinoma, benign nodules, and healthy samples, and an accuracy of 96.1% for early-stage (stage-I) lung adenocarcinoma. Moreover, its extended analysis in pan-cancer illustrates its ability to uncover potential disease-associated metabolites and proteins beyond lung cancer. In conclusion, DeepMSProfiler offers a straightforward and reliable method suitable for applications in disease diagnosis and mechanism discovery, potentially advancing the use of metabolomics in precision medicine. Its effective end-to-end strategy applied to raw metabolomic data can benefit a broader population in non-invasive clinical practices for disease screening and diagnosis.

Clinical sample collection

This study was approved by the Ethics Committees of the Sun Yat-Sen University Cancer Centre, the First Affiliated Hospital of Sun Yat-Sen University and the Affiliated Cancer Hospital of Zhengzhou University. A total of 210 healthy individuals, 323 patients with benign nodules and 326 patients with lung adenocarcinoma were enrolled. Serum from patients with lung adenocarcinoma was collected prior to lung resection surgery, and all cases had pathological confirmation. Serum from participants with benign nodules was collected from individuals undergoing annual physical examinations. Participants with benign nodules were defined as those with stable 3–5 years follow-up Computed Tomography (CT) scans at the time of analysis. The sample collection period was from January 2018 to October 2022. The sex of the participants was determined by self-report. Informed consent was obtained from all participants. The study design and conduct complied with all relevant regulations regarding the use of human study participants and were in accordance with the criteria set by the Declaration of Helsinki.

In addition, we collected serum samples from 100 healthy blood donors, including 50 males and 50 females, aged between 40 and 55 years, from the Department of Cancer Prevention and Medical Examination, Sun Yat-Sen University Cancer Centre. All these samples were mixed in equal amounts and the resultant mixture was aliquoted and stored. These mixtures were used as reference samples for quality control and data normalisation in the conventional metabolomic analysis as previously described 34 .

Serum metabolite extraction

Fasting blood samples were collected in serum separation tubes without the addition of anticoagulants, allowed to clot for 1 h at room temperature, and then centrifuged at 2851 ×  g for 10 min at 4 °C to collect the serum supernatant. The serum was aliquoted and then frozen at −80 °C until metabolite extraction.

Reference serum and study samples were thawed and a combined extraction method (methyl tert-butyl ether/methanol/water) was used to extract metabolites. Briefly, 50 μL of serum was mixed with 225 μL of ice-cold methanol and 750 μL of ice-cold methyl tert-butyl ether (MTBE). The mixture was vortexed and incubated for 1 h on ice. Then 188 μL MS grade water containing internal standards (13C-lactate, 13C3-pyruvate, 13C-methionine and 13C6-isoleucine, all from Cambridge Isotope Laboratories) was added and vortexed. The mixture was centrifuged at 15,000 × g for 10 min at 4 °C, and then the bottom phase was transferred to two tubes (each containing 125 μL) for LC-MS analysis in positive and negative modes. Finally, the samples were dried in a high-speed vacuum concentrator.

Untargeted liquid chromatography-mass spectrometry

The dried metabolites were resuspended in 120 μL of 80% acetonitrile, vortexed for 5 min and centrifuged at 15,000 ×  g for 10 min at 4 °C. The supernatant was transferred to a glass amber vial with a micro insert for metabolomic analysis. Untargeted metabolomic analysis was performed on an ultra-performance liquid chromatography-high resolution mass spectrometry (UPLC-HRMS) platform. The metabolites were separated using the Dionex Ultimate 3000 UPLC system with an ACQUITY BEH Amide column (2.1 × 100 mm, 1.7 μm, Waters). In positive mode, the mobile phase comprised 95% (A) and 50% acetonitrile (B), containing 10 mmol/L ammonium acetate and 0.1% formic acid. In negative mode, the mobile phase was composed of 95% and 50% acetonitrile for phases A and B, respectively, both containing 10 mmol/L ammonium acetate and adjusted to pH 9. Gradient elution was performed as follows: 0–0.5 min, 2% B; 0.5–12 min, 2–50% B; 12–14 min, 50–98% B; 14–16 min, 98% B; 16–16.1 min, 98–2% B; 16.1–20 min, 2% B. The column temperature was maintained at 40 °C, and the autosampler was set at 10 °C. The flow rate was 0.3 mL/min, and the injection volume was 3 μL. A Q-Exactive orbitrap mass spectrometer (Thermo Fisher Scientific) with an electrospray ionisation (ESI) source was operated in full scan mode coupled with ddMS2 monitoring mode for mass data acquisition. The following mass spectrometer settings were used: spray voltage +3.8 kV/−3.2 kV; capillary temperature 320 °C; sheath gas 40 arb; auxiliary gas 10 arb; probe heater temperature 350 °C; scan range 70–1050 m/z; resolution 70000. Xcalibur 4.1 (Thermo Fisher Scientific) was used for data acquisition.

In this study, all serum samples were analysed by LC-MS in 10 batches. To assess data quality, a mixed quality control (QC) sample was generated by pooling 10 μL of supernatant from each sample in the batch. Six QC samples were injected at the beginning of the analytical sequence to assess the stability of the UPLC-MS system, with additional QC samples injected periodically throughout the batch. Serum pooled from 100 healthy donors was used as reference material in each batch to monitor extraction and batch effects. All untargeted metabolomic analysis was performed at the Sun Yat-Sen University Metabolomics Centre.

Public dataset collection

The raw dataset for pan-cancer lipid metabolomics data of CCLE was downloaded from the Metabolomics Workbench database with accession ST001142 (https://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Study&StudyID=ST001142). There are 946 samples in total, including 23 cancer types. The quantitative lipid metabolite matrix and the DNA methylation matrix were downloaded from the appendix of the article 2 .

The LC-MS dataset of colon cancer was downloaded from the MetaboLights database ( https://www.ebi.ac.uk/metabolights/editor/MTBLS1129/descriptors ) with 236 samples in total, including 197 colon cancer cases and 39 healthy controls. Due to the differences of disease samples, classification purposes, instruments, and parameters of LC-MS between the public dataset and the private lung adenocarcinoma dataset, the DeepMSProfiler model needs to be re-trained on the public dataset.

Data format conversion

The raw format files of LC-MS data were converted to mzML format using the MSConvert software. The data used to train the end-to-end model were sampled directly from the mzML files without any further processing, so the raw data could be used directly as input to the model. In the mzML file, the ion intensity and mass-to-charge ratio of each ion point were recorded for each time interval. Ion points were mapped into a 3D space by their RT and m/z. A 2D matrix was sampled from this 3D point array using a max-pooling convolution kernel. The sampling ranges of retention time and mass-to-charge ratio were set with RT 0.5 min and m/z 50 as the starting point and with 0.016 min (RT) and 1 (m/z) as the sampling intervals. Using the sampling interval as a sliding window, the maximum ion intensity in each interval was sampled to obtain a two-dimensional matrix of 1024 × 1024 ion intensities.
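The sampling step can be illustrated with a short NumPy sketch that bins ion points on the stated grid and keeps the maximum intensity per bin. The function name `bin_ion_points` and the choice of rows for RT and columns for m/z are assumptions; only the starting point, intervals, and matrix size come from the description above.

```python
import numpy as np

def bin_ion_points(rt, mz, intensity, rt_start=0.5, mz_start=50.0,
                   rt_step=0.016, mz_step=1.0, size=1024):
    """Max-pool ion points (RT, m/z, intensity) into a size x size matrix."""
    rt = np.asarray(rt, dtype=float)
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    # Map each ion point to its bin index on the sampling grid.
    rows = np.floor((rt - rt_start) / rt_step).astype(int)
    cols = np.floor((mz - mz_start) / mz_step).astype(int)
    # Discard points that fall outside the sampling range.
    keep = (rows >= 0) & (rows < size) & (cols >= 0) & (cols < size)
    matrix = np.zeros((size, size))
    # Unbuffered maximum: each bin keeps the largest intensity assigned to it.
    np.maximum.at(matrix, (rows[keep], cols[keep]), intensity[keep])
    return matrix
```

With these defaults the grid spans roughly RT 0.5–16.9 min and m/z 50–1074, which covers most of the 20 min gradient and the 70–1050 m/z scan range.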

Extraction and annotation of metabolic peaks

We used Compound Discoverer v3.1 and TraceFinder v4.0 (Thermo Fisher Scientific) for peak alignment and extraction. These steps resulted in a matrix containing retention time, mass-to-charge ratio and peak area information for each metabolite. To eliminate potential batch-to-batch variation, we used the Ref-M method to correct peak areas. This involved dividing the peak area of each feature in the study sample by the peak area of the reference compound from the same batch, yielding relative abundances. We then used several data annotation tools such as MetID, MetDNA and NetID in MZmine to annotate the metabolite features and combined the annotation results 52 , 58 – 60 . These analysis tools include mass spectral information from databases such as the HMDB, MassBank and MassBank of North America (MoNA) 41 , 61 . In addition, we performed data annotation using MS2 data in Compound Discoverer v3.1 and finally selected, as the precise annotation results, bio-compounds with MS2 matches to certified standards or compounds with inferred annotations based on complete matches in mzCloud (score > 85) or ChemSpider 62 .
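The reference-based correction can be sketched as a per-batch division. This is a minimal illustration of the division described above, not the published Ref-M implementation; `ref_m_correct`, the data layout, and the handling of missing reference peaks (set to NaN) are assumptions.

```python
import numpy as np

def ref_m_correct(areas, batches, ref_areas):
    """Divide each sample's peak areas by the reference areas of its batch.

    areas: (n_samples, n_features) peak areas.
    batches: sequence of batch labels, one per sample.
    ref_areas: dict mapping batch label -> (n_features,) reference peak areas.
    """
    areas = np.asarray(areas, dtype=float)
    corrected = np.empty_like(areas)
    for i, batch in enumerate(batches):
        ref = np.asarray(ref_areas[batch], dtype=float)
        # Features with no reference signal are marked NaN instead of dividing by zero.
        corrected[i] = np.divide(areas[i], ref,
                                 out=np.full_like(ref, np.nan), where=ref > 0)
    return corrected
```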

Raw data visualisation

The mzML files were read using the Pyteomics package in Python. Records were traversed for all times in the sampling interval 63 . For each time index, the mzML file recorded the preset scan configuration, the scan window, the ion injection time, the intensity array, and the m/z array. The intensity array and m/z array were selected to form an array of data points, with retention time, mass-to-charge ratio, and intensity as the row names. The intensity values were log2-transformed. Then, the 3D point cloud data was visualised using the Matplotlib Toolkits package in Python 64 . The 2D matrices were obtained by down-sampling the 3D point cloud and pooling the 3D data using median and maximum convolution kernels. Convolution spans were RT: 0.1 min and m/z: 0.001. Heatmaps and contours were plotted using Matplotlib. Retention time-intensity curves were also plotted using Matplotlib with an m/z span of 0.003.
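As an illustration of how the point array might be assembled, the sketch below takes spectrum records shaped like the dictionaries that the Pyteomics mzML reader yields and stacks them into (RT, m/z, log2 intensity) rows. The dictionary keys are assumed to match the reader's output; restricting to MS1 scans and dropping zero intensities before the log2 transform are additional assumptions, and `spectra_to_points` is a hypothetical helper.

```python
import numpy as np

def spectra_to_points(spectra):
    """Build an (n_points, 3) array of [RT, m/z, log2(intensity)] rows.

    spectra: iterable of dicts with 'm/z array', 'intensity array' and
    spec['scanList']['scan'][0]['scan start time'] (assumed key layout).
    """
    rows = []
    for spec in spectra:
        if spec.get("ms level", 1) != 1:
            continue  # assumption: only full-scan (MS1) spectra are plotted
        rt = float(spec["scanList"]["scan"][0]["scan start time"])
        mz = np.asarray(spec["m/z array"], dtype=float)
        inten = np.asarray(spec["intensity array"], dtype=float)
        keep = inten > 0  # log2 is undefined for zero intensity
        rows.append(np.column_stack([np.full(int(keep.sum()), rt),
                                     mz[keep], np.log2(inten[keep])]))
    return np.vstack(rows) if rows else np.empty((0, 3))
```

In practice the `spectra` argument would be the iterator returned by opening an mzML file with Pyteomics; the resulting array can be passed to a Matplotlib 3D scatter plot.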

Dataset division and assignment

The dataset of each batch was randomly divided into a training dataset and an independent testing dataset in a ratio of 4:1. The data from the first to the seventh batch contained 90 samples each, including 30 healthy individuals, 30 lung nodules, and 30 lung adenocarcinoma samples. The data for the eighth and ninth batches did not contain healthy samples. The data for the tenth batch only contained nodule samples. To avoid the effect of classification imbalance, we constrained the same sample type and sex ratio in the training and independent testing dataset. Because the samples came from patients of different ages and sexes, the lesion sizes of lung nodules and lung adenocarcinoma patients also varied. In order to avoid these attributes affecting the authenticity of the model, sex, age, and lesion size were also used as constraints for dataset division. The difference in the distribution of sample attributes between the training dataset and the independent testing dataset was verified by an unpaired t-test.

Deep learning model construction in detail

In this step, we aimed to construct a model to predict the class labels for each metabolomic sample. For this, we first set X and Y as the input and label spaces, respectively. A single end-to-end model consisted of three parts: a dimension converter based on pool processing, a feature detector based on convolutional neural networks, and a multi-layer perceptron (MLP) classifier. The input taken directly from the raw data was extremely large and contained a lot of redundant information, so a pooling process was required to reduce the resolution for downstream convolution operations. The input data of the model was reduced by the maximum pooling layer to obtain D(X). Next, D(X) was passed to the feature extractor dominated by convolutional layers to obtain F(D(X)). The convolutional neural network had local displacement invariance and was well adapted to the RT offset problem in metabolomic data. Due to the relatively large size of the data, more than 5 layers of convolutional operations were required to reduce the dimensionality of the data to a level that the available computing power could handle. Different architectures were used to compare performance on the tuning set. The architectures used in different models included VGGNet (VGG16, VGG19), Inception Model (InceptionV3), ResNet (ResNet50), DenseNet (DenseNet121), and EfficientNet (EfficientNetB0-B5) 33 , 65 – 67 . In addition, two optimisation models based on DenseNet121 were created to simplify the DenseNet network. The direct connection route replaced the last dense layer of DenseNet121 with a convolutional layer. The optimisation route replaced the last dense layer of DenseNet121 with a convolutional layer that retained a one-hop connection. The pre-training parameters in pre-trained models were derived from ImageNet. Each architecture was tested on the TensorFlow + Keras platform and the PyTorch platform, respectively. To reduce overfitting, we used only one linear layer for our MLP layer. In the TensorFlow + Keras model, there was a SoftMax activation layer before the output layer. The output of the model was C(F(D(X))).
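The dimension converter D(X) is a max-pooling step, which can be illustrated in NumPy as follows. The pool size is an illustrative assumption (the text does not state it), and `max_pool_2d` is a hypothetical helper, not the layer used in the actual model.

```python
import numpy as np

def max_pool_2d(x, pool=2):
    """Non-overlapping max pooling of a 2D matrix; trailing edges are trimmed."""
    h, w = x.shape
    h_trim, w_trim = h - h % pool, w - w % pool
    # Split the matrix into pool x pool blocks and take the maximum of each block.
    blocks = x[:h_trim, :w_trim].reshape(h_trim // pool, pool, w_trim // pool, pool)
    return blocks.max(axis=(1, 3))
```

For example, pooling a 1024 × 1024 intensity matrix with `pool=4` gives a 256 × 256 matrix in which each entry is the strongest signal of its block, so isolated peaks survive the reduction in resolution.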

The positive and negative spectral data used different convolutional layers for feature extraction. Their features were combined before inputting the fully connected layer. Their pre-training parameters were shared. For a model trained on both positive and negative spectral data, a cross entropy loss was used.

Model training

Twenty per cent of the discovery dataset was set aside as tuning sets, which were used to guide model architecture selection, hyperparameter selection, and model optimisation, and the remaining 80% was used for model training. Sample category balancing was used as a constraint for dataset segmentation. The model architecture was evaluated by both validation performance and operational performance. We counted the number of model parameters and evaluated the complexity of the model. The average of 10 running times of each model was used as its runtime. Hyperas was used to preliminarily select the optimal model hyperparameters and initialisation weights 68 . The optimal initialisation method was he_normal, but we opted for pretraining with the ImageNet dataset due to its comparable performance and faster execution. After reducing the size of the parameter search space, we used the grid search method for hyperparameter tuning.

Ensemble strategy

DeepMSProfiler consists of several trained end-to-end sub-models as an ensemble model, where the average of the classification prediction probabilities of the samples from all sub-models was used as the final prediction probability for classification. The ensemble model calculated a score vector of length 3 in each of the three classifications, and the category with the maximum score was selected as the predicted classification result.
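The averaging rule described above can be written in a few lines. This is a minimal sketch with hypothetical names, assuming each sub-model has already produced class probabilities for every sample.

```python
import numpy as np

def ensemble_predict(sub_model_probs):
    """Average class probabilities over sub-models and pick the top class.

    sub_model_probs: array-like of shape (n_models, n_samples, n_classes).
    Returns (mean probabilities, predicted class indices).
    """
    probs = np.asarray(sub_model_probs, dtype=float)
    mean_probs = probs.mean(axis=0)  # unweighted average: every sub-model counts equally
    return mean_probs, mean_probs.argmax(axis=1)
```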

Each end-to-end sub-model was trained on the discovery dataset. The architecture of each sub-model is the same, but some hyperparameters are different. Two different learning rates of 1e-3 and 1e-4 were used. The optimiser used is ‘adam’ with parameter settings of beta_1 as 0.9, beta_2 as 0.999, epsilon as 0.001, decay as 0.0, and amsgrad as False. The batch size was set as 8 and the training was run for 200 epochs. A model typically took about 2 h to complete training on a GP100GL (16GB) GPU server. Each sub-model participated fairly in the final prediction result without setting a specific weight. The independent testing dataset was not used in model training and hyperparameter selection.

Machine learning models for comparison

To compare DeepMSProfiler with other existing tools, we selected several common traditional machine learning methods to build tri-classification models based on the peak area data obtained from the previous steps. These methods included Extreme Gradient Boosting (XGBoost), RF, Adaptive Boosting (Adaboost), SVM, and DNN. The training dataset and independent testing dataset were divided in the same way as for the deep learning algorithm, and the numbers of estimators for the Adaboost and XGBoost algorithms were the same as those of DeepMSProfiler. XGBoost was implemented by the XGBClassifier function in the xgboost library. Other machine learning methods were implemented using the scikit-learn library. SVM was adopted using the svm function with a ‘linear’ kernel. RF was implemented through the RandomForestClassifier function. Adaboost was adopted through the AdaBoostClassifier function. DNN was implemented using the MLPClassifier function. The optimal hyperparameters were obtained by the grid search method.

Performance metrics

We evaluated model performance on the independent testing dataset. The evaluation metrics were accuracy, precision, sensitivity and F1 score, computed with micro-averaging for the multi-class model. Confidence intervals were estimated using 1000 bootstrap iterations. During bootstrapping, our model was evaluated as an ensemble of 20 models trained on the discovery dataset. In addition, we calculated a confusion matrix and a receiver operating characteristic (ROC) curve with its AUC to show model performance across the three classes: lung adenocarcinoma, benign nodules and healthy individuals. Specificity at sensitivities of 0.7 and 0.9 was read from the sensitivity-specificity curve, which was interpolated using the NEAST method.
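A minimal sketch of the bootstrap procedure follows. Two points are assumptions rather than details from the paper: the interval is taken with the percentile method, and the metric shown is micro-averaged accuracy (for single-label multi-class predictions, micro-averaged precision, sensitivity and F1 all reduce to this same value). The function names are illustrative.

```python
import random


def micro_accuracy(y_true, y_pred):
    """Micro-averaged score for single-label multi-class predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def bootstrap_ci(y_true, y_pred, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (the percentile method is an assumption)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_iter):
        # Resample test samples with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(
            micro_accuracy([y_true[i] for i in idx], [y_pred[i] for i in idx])
        )
    stats.sort()
    lower = stats[int((alpha / 2) * n_iter)]
    upper = stats[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return lower, upper


# Hypothetical labels: 0 = healthy, 1 = benign nodule, 2 = adenocarcinoma.
y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2, 2]
y_pred = [0, 1, 2, 0, 2, 2, 0, 1, 1, 2]
print(micro_accuracy(y_true, y_pred), bootstrap_ci(y_true, y_pred))
```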

Visualisation of “black-box”

In end-to-end neural network prediction, data flow along the chain X → D(X) → F(D(X)) → C(F(D(X))) from the input layer through the hidden layers to the output layer. Within the feature extraction layers, which are dominated by convolutional layers, information is passed along in the same chained manner. After inputting X, we extracted the corresponding output L at different hidden layers to open up the black box. To inspect the space of intermediate features, PCA was used to reduce the dimensionality of the layer outputs to principal components. The PCA results were visualised with the Matplotlib package in Python.

To evaluate how strongly the hidden layer outputs correlate with the batch labels and with the type labels, we calculated the normalised mutual information (NMI), adjusted Rand index (ARI) and average silhouette coefficient (ASC) using the following formulas, where L is the layer output and C is the set of cluster labels used for the cluster evaluation.

In the above equations, the mutual information (MI) is computed between the layer outputs L and the cluster labels C. P_{i,j} represents the joint distribution probability between i and j, P_i refers to the distribution probability of i, and P_j refers to the distribution probability of j. H_L and H_C represent the entropy values of L and C, respectively. The layer outputs are clustered by the K-nearest neighbour algorithm.
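For reference, a standard form of these quantities that matches the symbols defined here is sketched below. The arithmetic-mean normalisation of NMI is an assumption; other normalisations (for example by the geometric mean or the maximum of the two entropies) are also in common use.

```latex
\mathrm{MI}(L, C) = \sum_{i}\sum_{j} P_{i,j}\,\log\frac{P_{i,j}}{P_i\,P_j},
\qquad
H_L = -\sum_{i} P_i \log P_i,
\qquad
H_C = -\sum_{j} P_j \log P_j,
\qquad
\mathrm{NMI}(L, C) = \frac{2\,\mathrm{MI}(L, C)}{H_L + H_C}
```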

In the above equation, TP represents the number of point pairs belonging to the same cluster in both the real and experimental cases, and FN represents the number of point pairs belonging to the same cluster in the real case but not in the experimental case. FP represents the number of point pairs not belonging to the same cluster in the real case but belonging to the same cluster in the experimental case, and TN represents the number of point pairs not belonging to the same cluster in either case. The range of ARI is [−1, 1]; the larger the value, the more consistent the clustering is with the real result, that is, the better the clustering.
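The pair counts described here can be turned into an ARI value with the pair-counting form of the index. The sketch below is a plain-Python illustration under that assumption, not the authors' code; the function names are hypothetical.

```python
from itertools import combinations


def pair_counts(true_labels, pred_labels):
    """Count TP, FN, FP, TN over all unordered point pairs."""
    tp = fn = fp = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1
        elif same_true:
            fn += 1
        elif same_pred:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def adjusted_rand_index(true_labels, pred_labels):
    """ARI = 2(TP*TN - FN*FP) / ((TP+FN)(FN+TN) + (TP+FP)(FP+TN))."""
    tp, fn, fp, tn = pair_counts(true_labels, pred_labels)
    denom = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    if denom == 0:
        # Degenerate case, e.g. both labelings put every point in one cluster.
        return 1.0
    return 2.0 * (tp * tn - fn * fp) / denom


# Cluster names do not matter, only which points are grouped together.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The loop over all pairs is quadratic in the number of samples, which is fine for illustration; library implementations usually work from a contingency table instead.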

The output layer was first dimensionally reduced by PCA, and the cluster was specified by the real label. In the above equation, a_i represents the average of the dissimilarity of the vector i to other points in the same cluster, and b_i represents the minimum of the dissimilarity of the vector i to points in other clusters.
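With a_i and b_i as defined here, the usual silhouette form would read as follows. This is a sketch of the standard definition; N, the number of samples, is a symbol introduced for this illustration.

```latex
s_i = \frac{b_i - a_i}{\max(a_i,\, b_i)},
\qquad
\mathrm{ASC} = \frac{1}{N}\sum_{i=1}^{N} s_i
```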

Feature contributions

In previous feature contribution studies, different branches of research used different methods to compute feature contributions to the final classification. These methods help to better understand features and their impact on model predictions. Gradient-based methods, such as GradCAM, calculate the gradients of the last convolutional layer by backpropagation from the category with the highest confidence [69]. Because of its convenience, this approach is widely used in computer vision tasks. However, it has a significant drawback: the resolution of the feature contribution heatmap is extremely low and is insufficient to distinguish most signal peaks. The size of the heatmap corresponds to the last convolutional layer of the model, and the weight of the feature contribution is the average of the gradients of all features. Perturbation-based methods, such as RISE and Local Interpretable Model-Agnostic Explanations, instead measure the importance of features by obscuring some pixels in the raw data [37, 70]. The predictive ability of the black box is then observed to show how much the obscured pixels affect the prediction. Perturbation-based methods can give higher-resolution and more accurate contribution estimates, but their runtimes are longer. To improve computing speed in this study, we made some improvements based on RISE, using boost sampling for the mask process.
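The perturbation idea can be sketched in a few lines. This is a deliberately simplified version of RISE and rests on several assumptions: masks are drawn independently per cell rather than upsampled from a coarse grid, the boost sampling mentioned above is not reproduced, and `rise_saliency`, `score_fn` and the input layout are hypothetical names chosen for illustration.

```python
import random


def rise_saliency(x, score_fn, n_masks=500, keep_prob=0.5, seed=0):
    """Estimate per-cell importance by random masking (simplified RISE).

    x is a 2D list (rows = m/z bins, columns = RT bins, an assumed layout);
    score_fn maps a masked 2D list to the model's score for one class.
    """
    rng = random.Random(seed)
    rows, cols = len(x), len(x[0])
    saliency = [[0.0] * cols for _ in range(rows)]
    for _ in range(n_masks):
        # Keep each cell with probability keep_prob, zero it otherwise.
        mask = [[1 if rng.random() < keep_prob else 0 for _ in range(cols)]
                for _ in range(rows)]
        masked = [[x[r][c] * mask[r][c] for c in range(cols)]
                  for r in range(rows)]
        score = score_fn(masked)
        # Cells that were visible share the credit for this score.
        for r in range(rows):
            for c in range(cols):
                saliency[r][c] += score * mask[r][c]
    norm = n_masks * keep_prob
    return [[v / norm for v in row] for row in saliency]


# Toy "model" whose score depends only on the cell at row 1, column 2.
toy_input = [[1.0] * 4 for _ in range(3)]
heatmap = rise_saliency(toy_input, lambda m: m[1][2])
```

In this toy case the cell at row 1, column 2 receives the largest contribution, because the score is non-zero only when that cell is left visible. The cost is one model evaluation per mask, which is why runtime grows with the number of masks.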

Using RISE, we can determine the feature contributions across RT and m/z for each sample according to its true category. The feature contribution heatmap uses RT as the horizontal axis and m/z as the vertical axis to show the feature contribution at different positions in each sample. The average feature contribution over all samples correctly predicted as their true category is taken as the feature contribution of that category. In addition, the peak extraction performed in the previous steps gives the RT value range and the median m/z value of each signal peak. The feature contribution at the RT and median m/z coordinates is then taken as the contribution of that signal peak.

Network analysis and pathway enrichment

The extracted metabolic peaks with a contribution score greater than 0.70 to the lung cancer classification were retained. The mass-to-charge ratios and available substance identification information of these metabolites, together with their classification contribution scores, were used as input data. For some of the metabolic signal peaks, we accurately identified the molecular formulae and substance names by secondary mass spectrometry and used these as substance identification information. Because of the limitations of existing databases, many unknown signals cannot be identified through secondary mass spectrometry. Therefore, PIUMet was also adopted to search for hidden metabolites and related proteins.

PIUMet built disease-related metabolite-protein networks based on the prize-collecting Steiner Forest algorithm. First, PIUMet integrated iRefIndex (v13), HMDB (v3) and Recon2 databases to obtain the relationship between m/z, metabolites and proteins, and generated an overall metabolite-protein network. The edges were weighted by the confidence level of the correlation reported in these databases. iRefIndex provides details on the physical interactions between proteins, which are detected through different experimental methods. The protein-metabolite relationships in Recon2 are based on their involvement in the same reactions. HMDB includes proteins that are involved in metabolic pathways or enzymatic reactions, as well as metabolites that play a role in protein pathways, based on known biochemical and molecular biological data. The disease-related metabolite peaks obtained by DeepMSProfiler were matched to the metabolite nodes of the overall network by their m/z, and directly to the terminal metabolite nodes of the overall network after annotation. The corresponding feature contributions obtained by DeepMSProfiler served as prizes for these metabolite nodes. The network structure was then optimised using the prize-collecting Steiner Forest algorithm to minimise the total network cost and connect all terminal nodes, thereby removing low-confidence relationships and obtaining disease-related metabolite sub-networks.

Metabolite identification is an important issue in metabolomics research, and there are different levels of confidence in identification. Referring to the highest level considered [71], we analysed authentic chemical standards and validated 11 of the metabolites discovered by PIUMet with only m/z (Supplementary Table 5). Disease-related metabolites and proteins were then used to analyse their pathways [39]. The hidden metabolites and proteins from PIUMet were processed for KEGG pathway enrichment analysis using MetaboAnalyst (v6.0). We used joint pathway analysis in MetaboAnalyst and chose the hypergeometric test for enrichment analysis and degree centrality as the topology measure. MetaboAnalyst used the integrated version of KEGG pathways (year 2019). Pathways were filtered using 1e-5 as the p-value cut-off [72]. The SYMBOL IDs of the proteins were converted to KEGG IDs with the ClusterProfiler package in R [73].
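As an illustration of the hypergeometric test used for enrichment, the sketch below computes a one-sided over-representation p-value with the standard library. It is not MetaboAnalyst's implementation; the function name and the example counts are hypothetical, and treating the test as one-sided is an assumption.

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) when drawing n items from N, of which K belong to the pathway.

    N: size of the background set, K: pathway members in the background,
    n: size of the query list, k: pathway members found in the query list.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts: 8 of 40 query features fall in a 50-member pathway
# drawn from a background of 3000 features.
p = hypergeom_pvalue(N=3000, K=50, n=40, k=8)
print(p, p < 1e-5)
```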

Ablation experiment

We searched the PubMed database using the terms “serum”, “lung cancer” and “metabolism” for the period 2010 to 2022 and retrieved a total of 5088 articles. After reading their titles and abstracts, we excluded publications that used non-serum samples such as urine and tissue, as well as publications that used non-mass spectrometry methods such as chromatography, nuclear magnetic resonance and infrared spectroscopy. We then screened the selected literature further to exclude studies that did not result in the discovery of metabolic biomarkers. Finally, 49 publications remained, reporting 811 serum metabolic biomarkers for lung cancer. Some of the publications provide the retention time and mass-to-charge ratio of the biomarkers, whereas others give only the name of the identified biomarker. We therefore looked up the molecular weights of these metabolites in the HMDB database, based on the information in the literature, to match the corresponding m/z. The use of metabolite molecular weights to match the m/z took full account of the effect of adducts. Based on the number of publications reporting each biomarker, we defined the retained signals as the m/z values of biomarkers whose publication count exceeded a threshold. We filtered the raw data to exclude signals that did not fall within the 3 ppm intervals around these m/z values. The filtered raw data were used as input to the model.
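The 3 ppm filtering step can be sketched as follows. The tuple layout, the function names and the example values are hypothetical, and adduct handling is left out of this illustration.

```python
def within_ppm(mz, target_mz, tol_ppm=3.0):
    """True if mz lies within tol_ppm parts per million of target_mz."""
    return abs(mz - target_mz) / target_mz * 1e6 <= tol_ppm


def filter_signals(signals, target_mzs, tol_ppm=3.0):
    """Keep (rt, mz, intensity) tuples whose m/z matches any target m/z."""
    return [
        sig for sig in signals
        if any(within_ppm(sig[1], t, tol_ppm) for t in target_mzs)
    ]


# Hypothetical signals as (retention time, m/z, intensity).
signals = [(1.2, 100.0002, 5e4), (1.3, 100.0005, 7e4), (2.0, 250.1000, 1e3)]
kept = filter_signals(signals, target_mzs=[100.0, 250.1004])
print(kept)
```

Here the first and third signals are kept (about 2 ppm and 1.6 ppm from a target) and the second is dropped (about 5 ppm away). Because the tolerance is relative, the absolute window widens with m/z.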

Statistical analysis

All statistical calculations were performed using the stat package in Python. Data distributions were plotted using the Seaborn package in Python. The correlation between patient information and labels was calculated using Pearson’s, Spearman’s and Kendall’s correlation coefficients. Pearson’s correlation coefficient was preferred for finding linear correlations, whereas Spearman’s and Kendall’s rank correlation coefficients were used to discover non-linear correlations. P-values below 0.05 were considered significant.

Figure preparation

The main figures in this paper were assembled in Adobe Illustrator. The photo of the mass spectrometry instruments was taken of the actual equipment. The data structure diagrams were obtained by fitting simulated functions in Python. Some cartoon components were drawn with FigDraw (www.figdraw.com) under a license code (TAIRAff614) for free use.

AI-assisted technologies in the writing process

At the end of the preparation of this work, the authors used ChatGPT to proofread the manuscript. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Supplementary information

Source data

Acknowledgements

This work was supported by grants from the National Key R&D Program of China (2020YFA0803302 to Peng Huang; 2021YFF1200903, 2016YFC0901604 and 2018YFC0910401 to Weizhong Li), the Major Project of Guangzhou National Laboratory (GZNL2024A01003 to Weizhong Li), and the Guangdong Basic and Applied Basic Research Foundation (2022B1515120077 to Weizhong Li). We thank Prof Zhi Xie (Zhongshan Ophthalmic Center at Sun Yat-sen University) and Prof Kai Ye (Xi’An Jiaotong University) for their helpful suggestions on the paper.

Author contributions

Y. Deng, W. Li and Y. Hu designed the method. Y. Deng and Y. Yao implemented the method. Y. Deng, Y. Yao and Y. Wang conducted data analysis. T. Yu conducted the metabolomic experiments. Y. Deng and W. Cai visualised the results. Y. Deng and D. Zhou collected the public data. F. Yin, W. Liu, Y. Liu, C. Xie and J. Guan collected clinical samples and patient information. Y. Deng, W. Li and Y. Hu wrote the manuscript. P. Huang, W. Li and Y. Hu contributed to conceptualisation, supervision, management, manuscript reviewing and editing.

Peer review

Peer review information

Nature Communications thanks Timothy Ebbels, Kun Qian and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Data availability

Code availability

Competing interests

All authors declare the following competing interests. All authors have filed patents for both the technology and the use of the technology to analyse metabolomic data.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Yumin Hu, Email: huym@sysucc.org.cn.

Peng Huang, Email: huangpeng@sysucc.org.cn.

Weizhong Li, Email: liweizhong@mail.sysu.edu.cn.

The online version contains supplementary material available at 10.1038/s41467-024-51433-3.

Optimized Wireless Sensing and Deep Learning for Enhanced Human-Vehicle Recognition


Recommendations

Compact, Isolation Enhanced, Band-Notched SWB–MIMO Antenna Suited for Wireless Personal Communications

Herein, a Conductor Backed Co-Planar Waveguide fed, compact, slotted Multiple–Input–Multiple–Output or MIMO antenna having Super Wideband (SWB) response and tunable band-notching feature is presented. In addition, an improved method for cut-off ...

Bandwidth Enhanced L-Shaped Patch Antenna with Parasitic Element for 5.8-GHz Wireless Local Area Network Applications

Bandwidth enhancement of a compact microstrip antenna is achieved by using a L-shaped patch and a parasitic patch. The L-shaped patch and the parasitic patch are half-wavelength resonators, and their resonant frequencies close to each other and merge to ...

Dual Band, Reduced Size, Enhanced Gain Frequency Selective Surface Based Monopole Antenna for Wireless Communications

In this work, a dual band, reduced size, enhanced gain monopole antenna has been designed by using Frequency Selective Surface (FSS). The volume of the proposed monopole antenna is 0.14λ₀ × 0.12λ₀ × 0.02λ₀, where λ₀ is the free space wavelength at 3.5 ...


IMAGES

  1. Deep_Learning_Project_Work/Module_1_AI and_Deep_Learning_Intro

  2. GitHub

  3. Overview of Deep Learning Algorithms

  4. GitHub

  5. deeplearning-assignment/assignment3.ipynb at master · Wasim37

  6. deep-learning-visualization · GitHub Topics · GitHub

COMMENTS

  1. amanchadha/coursera-deep-learning-specialization

    Notes, programming assignments and quizzes from all courses within the Coursera Deep Learning specialization offered by deeplearning.ai: (i) Neural Networks and Deep Learning; (ii) Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization; (iii) Structuring Machine Learning Projects; (iv) Convolutional Neural Networks; (v) Sequence Models - amanchadha/coursera-deep ...

  2. greyhatguy007/deep-learning-specialization

    Contains Solutions to Deep Learning Specialization - Coursera. Topics: python, machine-learning, deep-learning, neural-network, tensorflow, coursera, neural-networks, convolutional-neural-networks, coursera-specialization, assignment-solutions

  3. GitHub

    This repository contains all the solutions of the programming assignments along with few output images. It also has some of the important papers which are referred during the course. NOTE : Use the solutions only for reference purpose :) This specialisation has five courses. Courses: Course 1: Neural Networks and Deep Learning. Learning Objectives:

  4. Deep Learning Specialization Coursera [UPDATED Version 2021]

    Announcement [!IMPORTANT] Check our latest paper (accepted in ICDAR'23) on Urdu OCR — This repo contains all of the solved assignments of Coursera's most famous Deep Learning Specialization of 5 courses offered by deeplearning.ai. Instructor: Prof. Andrew Ng What's New. This Specialization was updated in April 2021 to include developments in deep learning and programming frameworks.

  5. DeepLearning.ai

    View on GitHub DeepLearning.ai. This is my assignment on Andrew Ng's special course "Deep Learning Specialization" This course consists of five courses: Course Contents. Neural Networks and Deep Learning. Week1 Introduction to deep learning. Week2 Neural Networks Basics. Week3 Shallow Neural networks. Week4 Deep Neural Networks

  6. Deep-Learning-Specialization

    View on GitHub Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization. This course will teach you the "magic" of getting deep learning to work well. Rather than the deep learning process being a black box, you will understand what drives performance, and be able to more systematically get good results.

  7. deeplearning.ai

    This Specialization will teach you best practices for using TensorFlow, a popular open-source framework for machine learning. In Course 3 of the deeplearning.ai TensorFlow Specialization, you will build natural language processing systems using TensorFlow. You will learn to process text, including tokenizing and representing sentences as ...

  8. Neural Networks and Deep Learning

    In this course, you will learn the foundations of deep learning. When you finish this class, you will: Understand the major technology trends driving Deep Learning. Be able to build, train and apply fully connected deep neural networks. Know how to implement efficient (vectorized) neural networks. Understand the key parameters in a neural ...

  9. Assignment 5

    The differences come in how the RNN computes its output. The basic recurrency can be seen in equation 10.5 of the deep learning book, with more details in equations 10.8-10.11. The important idea is that, at each time step, the RNN essentially works like an MLP with a single hidden layer, but two inputs (last state and current input).

  10. deeplearning.ai

    The Machine Learning course and Deep Learning Specialization from Andrew Ng teach the most important and foundational principles of Machine Learning and Deep Learning. This new deeplearning.ai TensorFlow Specialization teaches you how to use TensorFlow to implement those principles so that you can start building and applying scalable models to ...

  11. abdur75648/Deep-Learning-Specialization-Coursera

    This repo contains the updated version of all the assignments/labs (done by me) of Deep Learning Specialization on Coursera by Andrew Ng. It includes building various deep learning models from scratch and implementing them for object detection, facial recognition, autonomous driving, neural machine translation, trigger word detection, etc. - abdur75648/Deep-Learning-Specialization-Coursera

  12. Convolutional Neural Networks

    Analyze the dimensionality reduction of a volume in a very deep network; Understand and Implement a Residual network; Build a deep neural network using Keras; Implement a skip-connection in your network; Clone a repository from github and use transfer learning; Assignment of Week 2. Quiz 2: Deep convolutional models; Programming Assignment ...

  13. Applied AI with DeepLearning, Anomaly Detection assignment · GitHub

    Applied AI with DeepLearning, Anomaly Detection assignment - Anomaly Detection.ipynb

  14. Download all programming assignments Notebook. #Coursera # ...

    Download all programming assignments Notebook. #Coursera #DeepLearning.ai #Jupyter #Python - JupyterNotebookDownloader.sh

  15. How to make AI training faster

    Many deep learning frameworks (e.g., PyTorch, TensorFlow) use NVIDIA's NCCL for communication across multiple GPUs. Each GPU trains on its data subset and synchronizes model weights using NCCL's AllReduce at the end of each step.

  16. Programming assignments and lecture notes of the Deep Learning ...

    Learn about the key technology trends driving the rise of deep learning; build, train, and apply fully connected deep neural networks; implement efficient (vectorized) neural networks; identify key parameters in a neural network's architecture; and apply deep learning to applications. Week 2 - A1: Logistic Regression with a Neural Network mindset

  17. academic_support_services

    0 likes, 0 comments - academic_support_services on August 22, 2024: "Python, Machine Learning, Deep Learning, Artificial Intelligence, Thesis Writing, Assignments, Final Year Projects, Medical Research, Social Sciences, Management Sciences, Economics, Statistics #Python #MachineLearning #DeepLearning #ArtificialIntelligence #ThesisWriting #Assignments #FinalYearProjects #MedicalResearch # ...

  18. Sequence Models

    Sequence Models. This course will teach you how to build models for natural language, audio, and other sequence data. Thanks to deep learning, sequence algorithms are working far better than just two years ago, and this is enabling numerous exciting applications in speech recognition, music synthesis, chatbots, machine translation, natural ...

  19. An end-to-end deep learning method for mass spectrometry data ...

    The overview of the ensemble end-to-end deep-learning model. The DeepMSProfiler method includes three main components: the serum-based mass spectrometry, the ensemble end-to-end model, and the ...

  20. What AI Developer Skills Do You Need in 2024?

    Machine Learning and Deep Learning. As an AI developer, you must be skilled in training and evaluating model performance using metrics like accuracy, precision, F1 score and other machine learning techniques. ... the first glimpse of the possibilities AI offers is through using automated code completion and generation tools like GitHub Copilot ...

  21. GitHub

    Welcome to the DeepLearning.AI Deep Learning Specialization by Andrew Ng repository! This repository contains all my coursework, assignments, and projects completed during the Deep Learning Specialization offered by DeepLearning.AI on Coursera, taught by Andrew Ng.

  22. A Primer on Deep Learning for Causal Inference

    Current questions include how AI became dominated by deep learning, why problematic race science persists in psychology, and how music genres develop. He also has eccentric interests in deep learning, causal inference, networks, and Bayesian modeling. His work has been published at Sociological Methodology, NeurIPS, and WWW, among other venues.

  23. deep-learning-specialization · GitHub Topics · GitHub

    Graded assignments of all the courses that are being offered in Coursera Deep Learning Specialization by DeepLearning.AI. (i) Neural Networks and Deep Learning; (ii) Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization; (iii) Structuring Machine Learning Projects; (iv) Convolutional Neural Networks; (v) Sequence Models

  24. An end-to-end deep learning method for mass spectrometry data analysis

    Explainable deep learning method avoids limitations of batch effects in conventional methods. a Batch effects in 3D point array and 2D mapped heatmap of reference samples. RT: retention time; m/z: mass-to-charge ratio. b Isotope peaks of the same concentration in different samples. Different colours represent the batches to which the samples ...

  25. Optimized Wireless Sensing and Deep Learning for Enhanced Human-Vehicle

    In the realm of traffic parameter measurement, wireless sensing-based human-vehicle recognition methods have been pivotal due to their low cost and non-invasive nature. Traditionally, these methods have relied on the 2.4 GHz frequency band, often ...

  26. pranavvk18/deep_learning_assignments

    Repository containing deep learning assignments with implementations and explanations of key concepts. - pranavvk18/deep_learning_assignments

  27. GitHub

    This is my assignment on Andrew Ng's special course "Deep Learning Specialization" This course consists of five courses: Course Contents. Neural Networks and Deep Learning. Week1 Introduction to deep learning. Week2 Neural Networks Basics. Week3 Shallow Neural networks. Week4 Deep Neural Networks. Improving Deep Neural Networks

  28. Deep_Learning_2024/DL_2024_Assignment 1.ipynb at main

  29. DeepLearningAI · GitHub

    GitHub is where DeepLearningAI builds software.