Deep generative modeling in sleep diagnostics

Published:

Hans van Gorp, “Deep generative modeling in sleep diagnostics”. [link] [pdf]

Summary

According to clinical standards, the diagnosis of many sleep disorders requires an overnight sleep measurement known as polysomnography (PSG). During a PSG, multiple physiological signals—such as brain activity (EEG), eye movements (EOG), and muscle tone (EMG)—are recorded to assess sleep structure and detect abnormalities. One of the most widely used methods to analyze the recorded data is sleep staging. Following the guidelines of the American Academy of Sleep Medicine (AASM) manual, gold-standard sleep staging is performed by trained human experts who divide the night into 30-second segments (known as epochs) and classify each as belonging to one of several distinct sleep stages by visually inspecting the recorded signals.

Even though it is the gold standard, the current way in which sleep is measured and analyzed suffers from several key drawbacks. First, measuring the full PSG requires the application of many sensors, which is costly, complex, and highly obtrusive, and calls for specialized equipment and trained personnel. Second, manual annotation of the PSG is both time-consuming and expensive. Third, while the AASM scoring guidelines aim to standardize the way in which scoring is performed, there is still a considerable amount of inter-rater disagreement. On average, a human scorer has an agreement of 83% when compared to the consensus of a human panel. This agreement depends on the sleep stage itself and drops to as low as 63% and 67% for N1 and N3 sleep, respectively.

As a way to address these limitations, the application of deep learning systems has been studied extensively in the literature. These systems learn how to discriminate between sleep stages by observing human-annotated data and are able to score sleep in a fraction of the time at low cost. Additionally, these systems can be trained to perform sleep staging using a reduced set of PSG sensors or even surrogate measurement modalities, such as smartwatches and skin patches. Unlike human scorers, who are bound by the AASM guidelines that require information from the full PSG, deep learning models can attempt to infer the same staging decisions from limited or alternative inputs.

However, not everything is as rosy as it seems for automatic sleep staging algorithms. Some challenges and open questions that these automatic systems still face are: (1) Since these models are trained on human-annotated data, they inherit the uncertainty and variability introduced by inter-rater disagreement. How should such systems interpret and manage this inherent ambiguity? (2) While the results of automatic scoring algorithms on PSG data reach the upper limit of performance, the accuracy on surrogate modalities remains limited, and it is an open question how different surrogate modalities compare with one another. (3) The diversity of available sensor modalities raises the question of scalability: should a separate model be trained for each (combination of) sensor(s), or can a unified model be developed that generalizes across modalities? (4) Finally, many automatic sleep staging algorithms, particularly those developed for surrogate modalities, are only validated on healthy subjects, leading to an over-optimistic view of their performance when compared to that on subjects with sleep disorders.

Challenges (1) and (3) are particularly difficult to address using current discriminative approaches. However, by adopting a Bayesian perspective, where sleep staging is treated as a sampling problem from the posterior distribution given measured data, we can begin to address these limitations more effectively. In our context, the posterior distribution represents the range of possible sleep stage assignments, weighted by how likely they are given the measured signals and our prior knowledge of how sleep stages evolve during the night. This thesis explores how deep generative models, which naturally support posterior sampling, can improve sleep diagnostics.
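The Bayesian framing above can be made concrete with a toy example. The sketch below is a deliberately simplified stand-in, not the thesis's model: the stage likelihoods and the "sticky" transition prior are made-up numbers, and it samples forward from the filtered posterior epoch by epoch (exact posterior sampling over whole sequences would require a forward-backward pass). It only illustrates the idea that staging can yield samples of plausible hypnograms rather than a single hard label per epoch.

```python
import random

# AASM stage labels; likelihoods below are invented for illustration.
STAGES = ["W", "N1", "N2", "N3", "REM"]

# Hypothetical per-epoch likelihoods p(signal | stage) for 3 observed epochs.
likelihood = [
    {"W": 0.60, "N1": 0.30, "N2": 0.05, "N3": 0.02, "REM": 0.03},
    {"W": 0.10, "N1": 0.40, "N2": 0.40, "N3": 0.05, "REM": 0.05},
    {"W": 0.05, "N1": 0.10, "N2": 0.50, "N3": 0.30, "REM": 0.05},
]

def transition_prob(prev, nxt):
    """Toy first-order Markov prior: sleep stages tend to persist."""
    return 0.6 if prev == nxt else 0.1

def sample_hypnogram(rng):
    """Draw one stage sequence, sampling each epoch from
    p(stage | signal, previous stage) via Bayes' rule."""
    seq = []
    for t, lik in enumerate(likelihood):
        weights = []
        for s in STAGES:
            prior = 1 / len(STAGES) if t == 0 else transition_prob(seq[-1], s)
            weights.append(lik[s] * prior)  # posterior ∝ likelihood × prior
        seq.append(rng.choices(STAGES, weights=weights, k=1)[0])
    return seq
```

Repeated calls yield different plausible hypnograms; their variability directly reflects the posterior ambiguity discussed above.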

Chapter 1 provides the necessary background information for this thesis. It introduces clinical sleep monitoring, the principles of sleep staging, and discusses deep generative modeling as an alternative to discriminative modeling.

Chapter 2 further introduces (conditional) generative modeling by providing a detailed mathematical breakdown of the subject and by comparing its statistical assumptions to those of classical discriminative approaches. In addition, this chapter discusses both normalizing flows and score-based diffusion models as explicit implementations of conditional generative sleep staging models, as these two approaches are employed throughout this thesis.
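To give a flavor of the kind of machinery this chapter covers, here is the change-of-variables formula at the heart of normalizing flows, instantiated for the simplest possible case: a 1-D affine flow with a standard-normal base distribution. This is a generic textbook illustration, not a model from the thesis.

```python
import math

def log_prob_affine_flow(x, mu, sigma):
    """log p(x) under an affine flow z = (x - mu) / sigma with a
    standard-normal base: log p(x) = log N(z; 0, 1) + log |dz/dx|."""
    z = (x - mu) / sigma
    log_base = -0.5 * (z * z + math.log(2 * math.pi))  # log N(z; 0, 1)
    log_det = -math.log(sigma)                         # log |dz/dx| = -log sigma
    return log_base + log_det
```

For this affine case, the formula recovers exactly the log-density of a Gaussian with mean `mu` and standard deviation `sigma`; deeper flows stack many such invertible transforms, accumulating the log-determinant terms.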

Chapter 3 establishes a theoretical framework for discussing uncertainty and inter-rater disagreement in the context of sleep staging, addressing both the variability in human scoring and the limitations of automatic algorithms trained on human-labeled data. To that end, we introduce two types of uncertainty into sleep staging: aleatoric and epistemic uncertainty. We discuss what these uncertainties mean in the context of sleep staging, where they come from, and provide recommendations on how they can be used to improve sleep staging.
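One common way (not necessarily the one used in the thesis) to separate these two kinds of uncertainty is an ensemble decomposition: the entropy of the averaged prediction (total uncertainty) splits into the average entropy of the members (aleatoric) plus the mutual information between prediction and model (epistemic). A minimal sketch, with made-up probability vectors standing in for per-epoch stage predictions:

```python
import math

def entropy(p):
    """Shannon entropy in nats of a probability vector."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def decompose(ensemble_probs):
    """ensemble_probs: list of per-member probability vectors for one epoch.
    Returns (total, aleatoric, epistemic) uncertainty in nats."""
    n = len(ensemble_probs)
    k = len(ensemble_probs[0])
    mean = [sum(m[i] for m in ensemble_probs) / n for i in range(k)]
    total = entropy(mean)                                # predictive entropy
    aleatoric = sum(entropy(m) for m in ensemble_probs) / n
    epistemic = total - aleatoric                        # mutual information
    return total, aleatoric, epistemic
```

When all members agree, the epistemic term vanishes and any remaining uncertainty is aleatoric (irreducible ambiguity, e.g. genuine N1/N2 border epochs); when confident members disagree, the epistemic term dominates, signaling that more training data could help.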

Chapter 4 shows how a deep generative network for sleep staging can be leveraged to achieve much better calibration of the estimated aleatoric uncertainty in overnight sleep statistics when compared to discriminative networks. This is shown both empirically, on a dataset scored by six human scorers, and theoretically, by analyzing the output and loss functions used in discriminative and generative modeling.
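The underlying mechanism can be sketched in a few lines: once a model can sample whole hypnograms from the posterior, the uncertainty of any overnight statistic (total REM time, sleep efficiency, and so on) falls out of the samples directly. The "sampler" below is a random stand-in for a trained generative model, used only to show the bookkeeping; the numbers have no clinical meaning.

```python
import random

STAGES = ["W", "N1", "N2", "N3", "REM"]
EPOCH_MIN = 0.5  # each epoch is 30 seconds

def fake_posterior_sample(rng, n_epochs=8):
    """Stand-in for model.sample(signals): returns one hypnogram."""
    return [rng.choice(STAGES) for _ in range(n_epochs)]

def rem_minutes_distribution(rng, n_samples=1000):
    """Mean and standard deviation of total REM minutes across
    posterior hypnogram samples."""
    mins = [sum(EPOCH_MIN for s in fake_posterior_sample(rng) if s == "REM")
            for _ in range(n_samples)]
    mean = sum(mins) / len(mins)
    var = sum((m - mean) ** 2 for m in mins) / len(mins)
    return mean, var ** 0.5
```

A discriminative classifier that outputs one hard label per epoch yields a single point estimate of such statistics; the sampling route instead yields a full distribution, which is what calibration is measured against.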

Chapter 5 validates a single-channel EOG-based sleep staging algorithm on a clinical population. EOG is a less obtrusive sensor modality than the EEG, as it can be applied below the hairline. High-quality sleep staging at the level of human inter-rater agreement is possible with the EOG, likely because the EOG measurement is contaminated with EEG signals, which, insofar as sleep staging is concerned, is desirable.

Chapter 6 tackles the challenge of integrating different sensor modalities. In this chapter, the Factorized Score-based Diffusion Model (FSDM) is introduced as a way to combine any set of sensors for automatic sleep staging. We show that the FSDM can easily be extended to new signals, as it only needs to be trained on the new sensor and can then perform zero-shot inference on unseen sensor combinations. On EEG and EOG signals, the FSDM reaches the inter-rater agreement performance limit, and it also achieves good results on cardio-respiratory modalities and unconventional signals such as single-channel EMG.
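The factorization that makes such zero-shot sensor combination possible can be illustrated with Bayes' rule: if the sensor observations are conditionally independent given the sleep stages x, then log p(x | s_1, …, s_K) = log p(x) + Σ_k log p(s_k | x) + const, so the posterior score is the prior score plus a sum of per-sensor terms, and a new sensor contributes a new term without retraining the others. The sketch below illustrates this additive structure with scalar Gaussian stand-ins; it is not the FSDM architecture itself.

```python
def prior_score(x):
    """Score of a standard-normal prior: d/dx log N(x; 0, 1) = -x."""
    return -x

def sensor_score(x, obs, noise_var):
    """Score of a Gaussian likelihood p(obs | x) = N(obs; x, noise_var)."""
    return (obs - x) / noise_var

def posterior_score(x, observations, noise_var=1.0):
    """Factorized posterior score: prior term + one term per sensor."""
    return prior_score(x) + sum(sensor_score(x, o, noise_var)
                                for o in observations)
```

With a standard-normal prior and unit-variance sensors, the posterior score vanishes at the posterior mean sum(obs) / (K + 1); adding a fourth observation simply adds one more term and shifts that zero, which is the additive, plug-in behavior the chapter exploits.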

Chapter 7 shows that the FSDM algorithm, as introduced in Chapter 6, can be leveraged to effectively score REM epochs in subjects with REM sleep behavior disorder using only single-channel chin EMG. This is of crucial interest, as subjects with this disorder display so-called REM sleep without atonia (RSWA), i.e. their muscles are not paralyzed during REM sleep. We show that there are no significant differences between the clinical RSWA index as derived from this automatic scoring procedure and that derived from human scoring, which requires the full PSG. This approach could thereby enhance the accessibility of RSWA quantification, particularly in ambulatory settings.

Chapter 8 summarizes the main findings and outlines directions for future research.

In conclusion, this thesis shows several ways in which deep generative networks can be applied to sleep diagnostics and how doing so helps to overcome current challenges in automatic sleep staging. In particular, it demonstrates how concerns about uncertainty, inter-rater disagreement, fusion of sensor modalities, and application to clinical populations can be addressed.