Do Antidepressants work? (First Take)
Published on January 12, 2025 5:11 PM GMT

I've been researching the controversy over whether antidepressants truly work or are no better than a placebo. The latter possibility really contains two possibilities itself: either placebos are effective at treating depression, or the apparent placebo effect just reflects mean reversion. Here, the term "antidepressant" refers to drugs classified as SSRIs and SNRIs.

Some stylized facts from the RCT literature:

How is depression measured?

Most studies of antidepressant efficacy use the Hamilton Depression Rating Scale (HDRS) to measure changes in depressive symptoms from baseline, although entry into the clinical trials themselves is often not based on this measure. There are several versions of this scale, but the most popular is HDRS-17, a seventeen-question survey administered by a clinician. HDRS-17 scores range from 0 to 57. Scores of 0-7 are generally considered normal, while a score of 20 or greater indicates at least moderate depression.
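For concreteness, the scoring bands just described can be encoded directly. A minimal sketch (in Python, for illustration; the 8-19 range is not labeled in the text, so the function reports it as such rather than guessing a label):

```python
def hdrs17_band(score):
    """Map an HDRS-17 total score to the severity bands described above."""
    if not 0 <= score <= 57:
        raise ValueError("HDRS-17 scores range from 0 to 57")
    if score <= 7:
        return "normal"
    if score >= 20:
        return "at least moderate depression"
    # The text does not label this middle range, so we don't either.
    return "between the stated bands (8-19)"

print(hdrs17_band(5))   # normal
print(hdrs17_band(24))  # at least moderate depression
```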

You can look at the questions on HDRS-17 here. To my eye, it strangely underweights the feelings of subjective psychological distress that define depression, and it includes many symptoms that are only contingently associated with what we actually mean when we talk about depression. I think the heterogeneity in the types of questions is a potential problem for using the scale to quantitatively assess the effectiveness of interventions on depressive disorder, but for the most part I'd like to leave that aside for now.

What does it mean for antidepressants to work?

Most studies that use a scale like HDRS-17 to assess interventions on depression either define a cutoff for how much improvement counts as partial or full remission and use that cutoff as a binary indicator of success, or look at the average treatment effect in points of improvement for the treatment arm relative to the placebo arm. In the former case, a decrease of 50% or more from the baseline HDRS-17 score is often used as the cutoff, corresponding to at least partial remission of depression. For the continuous endpoint, an improvement of three points or more is often considered clinically significant, although that bar might be too low (a three-point swing in HDRS would usually be considered noise for an individual patient in a clinical setting).
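The two endpoints can be made concrete with a short sketch (in Python; the numbers are illustrative and not taken from any study):

```python
def binary_response(baseline, post, cutoff=0.5):
    """Binary endpoint: responder if HDRS-17 fell by at least `cutoff`
    (default 50%) of the baseline score."""
    return (baseline - post) / baseline >= cutoff

def average_treatment_effect(change_treat, change_placebo):
    """Continuous endpoint: difference in mean HDRS-17 improvement
    between the treatment arm and the placebo arm."""
    return (sum(change_treat) / len(change_treat)
            - sum(change_placebo) / len(change_placebo))

print(binary_response(24, 11))                    # True: 13/24 is over 50%
print(binary_response(24, 13))                    # False: 11/24 is under 50%
print(average_treatment_effect([10, 8], [6, 4]))  # 4.0 points
```

Note that the same trial can look very different depending on which of these two functions you apply to its data, which is the tension the rest of this essay explores.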

What do these studies find?

In this essay, I'm going to focus on one particular meta-analysis of the effectiveness of fluoxetine (Prozac) by Bech P, Cialdella P, Haugh MC, et al in BJP. However, I believe this study is representative of most similar analyses, and my discussion of it is broadly applicable. Other studies are discussed in a review of meta-analyses in the Annals of Hospital Psychiatry and a critical review of the literature in BMJ. As far as I can find, the pattern that studies with a continuous endpoint report a small, clinically insignificant improvement holds nearly universally.

Conversely, most studies that examine the efficacy of antidepressants in terms of a cutoff on percent improvement show a clinically significant effect. Sometimes the same meta-analysis includes both endpoints. The Bech, Cialdella, Haugh et al (henceforth BCH) analysis that I want to look into more closely is one of them.

What does BCH find?

BCH is a meta-analysis of 30 studies, some of which compare fluoxetine to a placebo and others to older tricyclic antidepressants (TCAs). I want to focus on the placebo comparisons, of which there were 16, all conducted in the US. The authors consider three outcome measures: Intention-to-treat, which counts any study dropout as having failed the treatment; Efficacy analysis, which takes anyone who completed four weeks of treatment and uses their last check-in as the post-treatment result; and End-point, which takes anyone who had any post-treatment visit and uses their last check-in as the post-treatment result. The results are summarized in the following table from the paper:

I think the intention-to-treat measure is the best, since antidepressants are not supposed to continue working after discontinuation, so anyone who does not stay on treatment throughout the post-baseline period should be expected not to show any long-term improvement (though for some, short-term relief could be a meaningful benefit). However, only the third measure (End-point) is used for the average treatment effect analysis. In every case, the results seem clinically significant: to be a responder, you have to halve your HDRS-17 score relative to baseline. If a treated subject is 13.6 or 21.4 or 16.8 percentage points more likely to be a responder by this definition than a recipient of the placebo, I think it is reasonable to call that a positive result for fluoxetine.

The average treatment effect analysis uses the End-point measure: it takes the mean change in HDRS-17 scores from before to after treatment and compares that change between the treatment arm and the placebo arm. Unfortunately, the paper reports neither the original averages nor the single-differenced results; it displays only the final difference-in-differences figure. The individual study results and the average are shown below:

 

The average treatment effect across all studies was an improvement for fluoxetine relative to the placebo of 0.303 HDRS-17 points. Recall that HDRS-17 is out of 57 points, most participants probably started with a score of 20 or greater, and even a score difference of three (ten times what this study shows) is at best marginally clinically significant. This result is, in effect, a pretty precisely estimated zero.

How do you reconcile this?

It is of course possible to get different results when you analyze a variable as continuous versus dichotomizing it. Indeed, it's possible to make a dichotomous result look material even with a small underlying change by choosing the right threshold. For example, if 50% of treated participants improved their score by two points and 50% saw no effect, while no placebo respondents changed at all, then choosing a threshold of one point of improvement would show the treatment group as 50 percentage points more likely to respond to treatment.
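A minimal sketch of that worked example (in Python, purely for illustration) confirms how a one-point mean effect can become a 50-percentage-point gap in "response rates":

```python
# Hypothetical cohort matching the example: half of treated subjects
# improve by 2 points, half by 0; all placebo subjects improve by 0.
treated = [2] * 500 + [0] * 500
placebo = [0] * 1000

# Continuous endpoint: mean improvement difference.
mean_effect = sum(treated) / len(treated) - sum(placebo) / len(placebo)

# Binary endpoint with a friendly threshold of 1 point of improvement.
threshold = 1
treated_rate = sum(x >= threshold for x in treated) / len(treated)
placebo_rate = sum(x >= threshold for x in placebo) / len(placebo)

print(mean_effect)                           # 1.0 point: clinically negligible
print((treated_rate - placebo_rate) * 100)   # 50.0 percentage points
```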

However, I don't think this is sufficient to explain the discrepancy in this case. The only way I can come up with numbers that approach what the study found is to assume an extremely heterogeneous treatment effect, to the point where the heterogeneity should be a first-order object of concern more than the average treatment effect itself.

Simulation

There are a lot of parameters to the study that aren't given, so I made them up. However, essentially all reasonable parameters give about the same result. Feel free to try to find some assumptions under which the difference in treatment versus control remission rates approaches 16.8 percentage points.

Note that the only relevant figure the study does state is that the mean difference in change between the pre- and post-periods for treated versus placebo respondents was 0.303 HDRS-17 points. Here's some R code that shows my approach; in it, I assume a homogeneous treatment effect.

ENTRY_CUTOFF <- 20L
MAX_HDRS_SCORE <- 57L
AVERAGE_HDRS_SCORE <- 30L
MEAN_PLACEBO_CHANGE <- 8L
set.seed(100)
# Simulated entrants, clamped to the plausible HDRS-17 entry range
entrants <- rnorm(1000000L, mean = AVERAGE_HDRS_SCORE, sd = 4)
entrants <- ifelse(entrants < ENTRY_CUTOFF, ENTRY_CUTOFF, entrants)
entrants <- ifelse(entrants > MAX_HDRS_SCORE, MAX_HDRS_SCORE, entrants)
treat <- entrants
pla <- entrants
# Homogeneous treatment effect of -0.303 points on top of the placebo effect
treatment_effect <- rep_len(-0.303, 1000000)
placebo_effect <- rnorm(1000000, mean = -1 * MEAN_PLACEBO_CHANGE, sd = 4)
treatment_diff <- treatment_effect + placebo_effect
post_treat <- treat + treatment_diff
post_pla <- pla + placebo_effect
post_treat <- ifelse(post_treat < 0, 0, post_treat)
post_pla <- ifelse(post_pla < 0, 0, post_pla)
diff_diff <- post_treat - post_pla
mean(diff_diff)
# partial remission: 50% decline
treatment_comp <- (post_treat - treat) / treat
treat_remission <- ifelse(treatment_comp < -0.5, 1, 0)
tm <- mean(treat_remission)
tm * 100
pla_comp <- (post_pla - pla) / pla
pla_remission <- ifelse(pla_comp < -0.5, 1, 0)
tp <- mean(pla_remission)
tp * 100
print((tm - tp) * 100)
# 0.8337%

With these parameters, I find that treated participants should be 0.84 percentage points more likely to be in remission than placebo respondents. 

In the following code, I allow for a heterogeneous treatment effect. To get anywhere near the results of the study, I have to let the treatment effect itself (recall, its mean is an improvement of 0.303 points) be normally distributed with a massive standard deviation of 7.75.

ENTRY_CUTOFF <- 20L
MAX_HDRS_SCORE <- 57L
AVERAGE_HDRS_SCORE <- 30L
MEAN_PLACEBO_CHANGE <- 8L
set.seed(100)
entrants <- rnorm(1000000L, mean = AVERAGE_HDRS_SCORE, sd = 4)
entrants <- ifelse(entrants < ENTRY_CUTOFF, ENTRY_CUTOFF, entrants)
entrants <- ifelse(entrants > MAX_HDRS_SCORE, MAX_HDRS_SCORE, entrants)
treat <- entrants
pla <- entrants
# Heterogeneous treatment effect: mean -0.303 points, sd 7.75
treatment_effect <- rnorm(1000000, mean = -0.303, sd = 7.75)
placebo_effect <- rnorm(1000000, mean = -1 * MEAN_PLACEBO_CHANGE, sd = 4)
treatment_diff <- treatment_effect + placebo_effect
post_treat <- treat + treatment_diff
post_pla <- pla + placebo_effect
post_treat <- ifelse(post_treat < 0, 0, post_treat)
post_pla <- ifelse(post_pla < 0, 0, post_pla)
diff_diff <- post_treat - post_pla
mean(diff_diff)
# partial remission: 50% decline
treatment_comp <- (post_treat - treat) / treat
treat_remission <- ifelse(treatment_comp < -0.5, 1, 0)
tm <- mean(treat_remission)
tm * 100
pla_comp <- (post_pla - pla) / pla
pla_remission <- ifelse(pla_comp < -0.5, 1, 0)
tp <- mean(pla_remission)
tp * 100
print((tm - tp) * 100)
# 16.8374%

The implications of this are very strange. Under such a heterogeneous treatment effect, about 48% of respondents were made worse off by the drug. Among those who were made worse off, their HDRS-17 score was increased by an average of 6.07 points, 20 times the average treatment effect in the opposite direction. Among the 52% of those who were made better off, their HDRS-17 score decreased by an average of 6.29 points. This is a histogram of the treatment effect (the actual assumed effect of the drug, not including mean reversion or the placebo effect):

The overwhelming story here is the heterogeneity, not whether the mean is different from zero. Many patients appear to be helped quite a bit by fluoxetine, and almost as many are harmed. 
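The shares and conditional means above can also be checked in closed form with the truncated-normal formulas, rather than by simulation. A quick sketch (in Python, used here only for the check) under the same assumed N(-0.303, 7.75) treatment effect:

```python
# Closed-form check of the heterogeneity claims, assuming the treatment
# effect is N(mu, sigma) with mu = -0.303, sigma = 7.75 (negative = improvement).
from math import erf, exp, pi, sqrt

mu, sigma = -0.303, 7.75

def phi(x):
    """Standard normal pdf."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def Phi(x):
    """Standard normal cdf, via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

alpha = (0 - mu) / sigma
share_harmed = 1 - Phi(alpha)                                # P(effect > 0)
# Conditional means of a truncated normal:
mean_if_harmed = mu + sigma * phi(alpha) / (1 - Phi(alpha))  # E[effect | effect > 0]
mean_if_helped = mu - sigma * phi(alpha) / Phi(alpha)        # E[effect | effect < 0]

print(share_harmed)    # ~0.484, matching the ~48% made worse off
print(mean_if_harmed)  # ~+6.07 HDRS points among the harmed
print(mean_if_helped)  # ~-6.30 among the helped (the simulation gave -6.29)
```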

Conclusion

There are several possibilities that I can think of from this analysis: the meta-analytic figures could simply be misreported; the dichotomized endpoint could be an artifact of threshold choice; or the treatment effect could be genuinely and extremely heterogeneous.

If the last option, genuine heterogeneity, is true, then the endpoints of this study, and of most other studies on antidepressants, are entirely missing the forest for the trees. A small average treatment effect and a difference in binary remission rates would be masking the fact that these drugs are highly effective for some and quite harmful for others, at least in the short term. Is it possible to determine a priori who will fall in which group? Is the heterogeneity limited to a subset of the questions on the HDRS-17, like sleep or appetite? Do people who benefit from fluoxetine also benefit from duloxetine and other drugs, and vice versa? Without more information about the heterogeneity, it seems very difficult to come to any conclusions about the effectiveness of SSRI antidepressants.

 

