Longitudinal Psychometric Analysis of the PHQ-9 (2005–2022)

Depression is one of the most common mental health conditions in the United States, with approximately 1 in 5 adults reporting a lifetime diagnosis and millions experiencing depressive symptoms each year (CDC, 2023).

However, prevalence estimates vary depending on diagnostic criteria and the measurement tools used, making it critical to evaluate whether widely used screening instruments consistently measure the same underlying construct over time.

The PHQ-9 is one of the most widely used tools to measure depressive symptoms in clinical, research, and public health settings (Kroenke et al., 2001). This project examined whether the PHQ-9 consistently measures the underlying construct of depression over time using U.S. population data from 2005 to 2022. Using longitudinal psychometric and item response theory methods, I evaluated the stability, reliability, and measurement precision of the scale.

Technical Methods

Data Source: Harmonized data from five NHANES survey cycles (2005–2022).

Sample Size: N = 22,433 U.S. adults (complete cases)

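Measurement Stability: Established full scalar invariance using multi-group confirmatory factor analysis (CFA), with survey cycles as the groups.

Technical Thresholds: Judged each added set of equality constraints by the ΔCFI criterion (a drop in CFI of less than 0.01 between nested models), because with a sample this large the traditional chi-square difference test flags even negligible differences as significant.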
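The ΔCFI rule can be sketched as a small decision function. This is an illustration only: the function name and the CFI values are hypothetical placeholders, not results from this project, and the sketch assumes the configural model already fits acceptably.

```python
from typing import Dict, List

# Invariance levels in the order they are tested; each adds equality constraints.
LEVELS: List[str] = ["configural", "metric", "scalar"]


def highest_invariance_level(cfi: Dict[str, float], threshold: float = 0.01) -> str:
    """Return the most constrained level whose CFI drop stays under the threshold."""
    supported = LEVELS[0]
    for previous, current in zip(LEVELS, LEVELS[1:]):
        drop = cfi[previous] - cfi[current]  # positive when fit worsens
        if drop >= threshold:
            break  # stop at the first step that degrades fit too much
        supported = current
    return supported


# Placeholder CFI values for illustration only; not values from this analysis.
example_cfi = {"configural": 0.975, "metric": 0.972, "scalar": 0.968}
print(highest_invariance_level(example_cfi))
```

In practice the CFI values would come from fitted multi-group CFA models in an SEM package; the function only encodes the comparison step.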

Precision Modeling: Applied Item Response Theory (IRT) via a Graded Response Model to evaluate item discrimination and scale information.
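As a rough sketch of what a graded response model computes for one PHQ-9 item, the function below turns a latent severity value, an item discrimination, and three ordered thresholds into probabilities for the four response options. The parameter values are invented for illustration and are not estimates from this analysis.

```python
import math
from typing import List, Sequence


def grm_category_probs(theta: float, a: float, thresholds: Sequence[float]) -> List[float]:
    """Response-category probabilities for one item under the graded response model.

    theta: latent severity; a: item discrimination;
    thresholds: ordered boundaries, one fewer than the number of categories.
    """
    # P(response >= k | theta) for k = 1..K-1, using the logistic function
    cumulative = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    # Pad with P(response >= 0) = 1 and P(response >= K) = 0
    bounds = [1.0] + cumulative + [0.0]
    # Each category probability is the difference of adjacent boundaries
    return [bounds[k] - bounds[k + 1] for k in range(len(bounds) - 1)]


# Hypothetical item: discrimination 2.0, thresholds at 0.5, 1.5, 2.5
low = grm_category_probs(-1.0, 2.0, [0.5, 1.5, 2.5])
high = grm_category_probs(2.0, 2.0, [0.5, 1.5, 2.5])
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

A respondent low on the latent trait is most likely to answer "Not at all", while a respondent high on the trait shifts probability toward the upper categories.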

Figure 1. Mean PHQ-9 scores across NHANES survey cycles from 2005 to 2022. The figure shows a gradual upward trend in self-reported depressive symptoms in the U.S. adult population, with notable increases in later survey cycles.

Methods Overview: Mean PHQ-9 scores were calculated for U.S. adults in five NHANES cycles (2005–2022). Survey weights were applied to produce nationally representative estimates.
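A minimal sketch of the weighted-mean step, assuming each respondent has a PHQ-9 total and a survey weight. The row layout, cycle labels, and numbers are made up for illustration; the write-up does not state which NHANES weight variable was used, and standard errors would additionally require the survey's strata and cluster information, which is not shown here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def weighted_mean_by_cycle(rows: Iterable[Tuple[str, float, float]]) -> Dict[str, float]:
    """Survey-weighted mean PHQ-9 total per cycle.

    Each row is (cycle_label, phq9_total, survey_weight).
    """
    weighted_sum: Dict[str, float] = defaultdict(float)
    weight_total: Dict[str, float] = defaultdict(float)
    for cycle, score, weight in rows:
        weighted_sum[cycle] += score * weight
        weight_total[cycle] += weight
    return {cycle: weighted_sum[cycle] / weight_total[cycle] for cycle in weighted_sum}


# Made-up respondents for illustration; not NHANES records.
toy_rows = [
    ("cycle_A", 0, 1000.0),
    ("cycle_A", 3, 3000.0),
    ("cycle_A", 10, 1000.0),
    ("cycle_B", 2, 2000.0),
    ("cycle_B", 6, 2000.0),
]
means = weighted_mean_by_cycle(toy_rows)
print(means)
```

Respondents with larger weights stand in for more of the population, so the weighted mean can differ noticeably from the simple average of the sampled scores.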

This analysis suggests that average depressive symptom severity has increased over time in the U.S. population. However, interpreting longitudinal trends requires ensuring that the PHQ-9 measures depression consistently across survey cycles. To address this, I tested longitudinal measurement invariance and item-level measurement precision using confirmatory factor analysis and item response theory models.

Figure 2. Distribution of PHQ-9 scores across survey cycles, showing changes in variability and symptom severity distribution over time.

Beyond the overall average, a closer look at symptomatic respondents (N = 15,380) highlights a shifting baseline. Between 2005 and 2022, the distribution of scores moved from minimal levels toward the mild threshold (a PHQ-9 total of 5), with the symptomatic subsample reaching a mean score of 4.88. This suggests that the rise reflects not only more people reporting symptoms but also greater intensity in the symptoms reported.

Figure 3 shows a one-factor Confirmatory Factor Analysis (CFA) of the PHQ-9, a widely used screening instrument for depressive symptoms. The model estimates a single latent construct, Depression, measured by nine self-reported symptom items assessing affective, cognitive, behavioral, and somatic dimensions of depression over the past two weeks.

PHQ-9 Items

Responses of “Refused” and “Don’t know” were treated as missing and excluded during data cleaning. Analyses were conducted on complete cases only.

DPQ010 – Little interest or pleasure in doing things
DPQ020 – Feeling down, depressed, or hopeless
DPQ030 – Trouble falling or staying asleep, or sleeping too much
DPQ040 – Feeling tired or having little energy
DPQ050 – Poor appetite or overeating
DPQ060 – Feeling bad about yourself, or that you are a failure or have let yourself or your family down
DPQ070 – Trouble concentrating on things, such as reading the newspaper or watching television
DPQ080 – Moving or speaking slowly, or being unusually fidgety or restless
DPQ090 – Thoughts that you would be better off dead or of hurting yourself in some way

All items are scored on a 4-point ordinal scale:

0 = Not at all
1 = Several days
2 = More than half the days
3 = Nearly every day
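The scoring and complete-case rule can be sketched as follows. The severity bands are the standard PHQ-9 cut points from Kroenke et al. (2001). The function names are hypothetical, and the use of codes 7 and 9 for "Refused" and "Don't know" is an assumption about the NHANES coding that should be checked against the codebook for each cycle.

```python
from typing import Optional, Sequence

VALID_RESPONSES = {0, 1, 2, 3}  # the 4-point ordinal scale


def phq9_total(responses: Sequence[int]) -> Optional[int]:
    """Sum the nine item scores; return None unless all nine are valid (complete case)."""
    if len(responses) != 9:
        raise ValueError("PHQ-9 requires exactly nine item responses")
    # Any other code (assumed here: 7 = Refused, 9 = Don't know) counts as missing
    if any(r not in VALID_RESPONSES for r in responses):
        return None
    return sum(responses)


def severity_band(total: int) -> str:
    """Map a PHQ-9 total (0-27) to its standard severity label."""
    if not 0 <= total <= 27:
        raise ValueError("PHQ-9 totals range from 0 to 27")
    if total <= 4:
        return "minimal"
    if total <= 9:
        return "mild"
    if total <= 14:
        return "moderate"
    if total <= 19:
        return "moderately severe"
    return "severe"


print(phq9_total([1, 1, 0, 2, 0, 0, 1, 0, 0]))  # a complete case
print(phq9_total([1, 1, 0, 2, 7, 0, 1, 0, 0]))  # one refused item, so excluded
```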

The PHQ-9 demonstrated longitudinal measurement invariance

The scale passed configural, metric, and scalar invariance tests across survey cycles, supporting valid comparisons of depression scores over time.

Depressive symptom scores showed a gradual upward trend over time

Mean PHQ-9 scores increased across survey cycles, with notable shifts in later years, consistent with broader population mental health trends.

The PHQ-9 is most precise in the mild–moderate depression range

Item response theory analysis showed the scale provides the greatest measurement information for mild to moderate latent depression severity, with lower precision at the extreme ends of the trait continuum.

Item-level discrimination varied across symptoms

Some PHQ-9 items contributed more strongly to latent depression measurement than others, indicating that symptoms differ in how sharply they separate respondents at different levels of latent severity.
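The two IRT findings above rest on item and scale information. The sketch below computes graded response model information for individual items and sums it across items. The item parameters are hypothetical and chosen only to show the mechanics: a more discriminating item contributes more information, and information falls off toward the extremes of the latent trait.

```python
import math
from typing import List, Sequence, Tuple


def grm_item_information(theta: float, a: float, thresholds: Sequence[float]) -> float:
    """Fisher information of one graded-response item at latent severity theta."""
    cumulative = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    bounds = [1.0] + cumulative + [0.0]
    info = 0.0
    for k in range(len(bounds) - 1):
        p_k = bounds[k] - bounds[k + 1]  # probability of category k
        # Derivative of p_k with respect to theta
        slope = a * (bounds[k] * (1 - bounds[k]) - bounds[k + 1] * (1 - bounds[k + 1]))
        if p_k > 0:
            info += slope ** 2 / p_k
    return info


def scale_information(theta: float, items: Sequence[Tuple[float, Sequence[float]]]) -> float:
    """Scale information is the sum of item information across items."""
    return sum(grm_item_information(theta, a, b) for a, b in items)


# Hypothetical (discrimination, thresholds) pairs; not estimates from this analysis.
items: List[Tuple[float, List[float]]] = [(2.5, [0.0, 1.0, 2.0]), (1.0, [0.0, 1.0, 2.0])]
grid = [x / 10 for x in range(-30, 31)]  # theta from -3.0 to 3.0
peak_theta = max(grid, key=lambda t: scale_information(t, items))
print(peak_theta, round(scale_information(peak_theta, items), 3))
```

Because the standard error of a severity estimate is roughly the inverse square root of the information, the region where this curve peaks is where the scale measures most precisely.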

Validating a Public Health Standard

The Patient Health Questionnaire-9 (PHQ-9) remains a foundational tool in mental health surveillance due to its brevity, ease of administration, and established presence in clinical and research settings. By establishing full scalar invariance across nearly two decades of NHANES data, this study supports treating the instrument as a stable longitudinal metric. This psychometric stability indicates that the observed increases in depressive symptoms, most notably during the 2008 financial crisis and the COVID-19 pandemic, reflect genuine population-level shifts rather than changes in how the survey was interpreted or administered.

Precision in Early Intervention

My Item Response Theory (IRT) analysis reveals that the PHQ-9 achieves peak measurement precision within the mild-to-moderate range of depression severity.

Targeted Screening: This makes the tool exceptionally well-suited for settings where early identification is critical, such as primary care offices, insurance assessments, and university or community counseling centers.

Low-Investment, High-Impact: Because the PHQ-9 is a low-burden, self-report instrument, it can be widely distributed with minimal training requirements.

Complementary Approaches: While more sophisticated or multi-modal diagnostic methods exist, the established nature of the PHQ-9 makes it a robust option for tracking long-term trends and maintaining continuity in population health monitoring.

The Economic and Societal Imperative

Depression represents a significant and growing public health challenge, carrying immense costs in the form of healthcare utilization and lost workforce productivity.

Proactive Monitoring: Accurate measurement through established tools allows public health officials to detect symptomatic shifts early, potentially reducing the long-term economic burden.

Research Continuity: With the instrument's stability confirmed, the focus can shift from measurement validation to modeling the complex social, technological, and economic drivers of these mental health trends.

Liz Soethe

lizsoethe@icloud.com
