Defining Successful Studies

Recently, on a couple of occasions, I've gotten into discussions about the definition of a successful clinical trial.  Sometimes quite heated discussions. So this blog posting is my definition of success as applied to clinical trials aimed at curing type-1 diabetes.

Defining Success of Clinical Trials Aimed at Curing T1D

I consider a trial successful if there are good, statistically significant results for the primary outcome, using standard data analysis.  I think this is the most common definition of success in scientific trials, and I don't think it is controversial to use it.   However, it is not the only definition, and it is certainly possible to argue about every part of it.

This is an issue because some researchers, if they don't get a successful result on their primary end point, will point at good results from a secondary end point (or, if desperate, a tertiary or ad-hoc end point) and claim success based on that.  But I don't count those as successes.  Other researchers will point at results which are not statistically significant (but are close), and argue that these lesser results should be accepted as successful.

Primary vs Secondary (or Tertiary or Post-Hoc) Results

All clinical trials have a primary end point, chosen by the researchers when they are designing the study.  Recently, some trials have started to have co-primary end points, or even multiple primary end points, but there is always at least one.  It is standard to use this primary end point as the most important result, and thus the determiner of success.

Part of this is the difference between basic research and trying to find a cure.  If you are doing basic research, then any news might be good news.  Unusual findings are paths to more research, and so on.  So just about any finding might be important, and can be seized on as a "successful" result.  On the other hand, if you are trying to find a cure or a treatment, then what matters is effectiveness and safety.  Random new knowledge is not the goal.  In this case, the primary end point is generally the most important effectiveness measure, and is therefore much more important than secondary end points, which are generally less important measures of effectiveness.  (Sometimes the primary end point is exactly what the FDA requires for new drug approval for the disease being studied, which means it is particularly important for future approvals.)

The situation with tertiary end points is even worse.  Tertiary results are usually internal or poorly understood markers, so they might lead to more research, but are very unlikely to lead directly to a cure or better treatments.  Post-hoc results are determined after the data has been collected and analysed, which allows researchers to go looking for new or unexpected findings.  But success for a clinical trial aimed at a cure should mean confirming the safety and effectiveness of a treatment, not finding some new or unexpected result.

In addition to all the reasons above, it is important to remember that primary end points are selected by the researcher.  So if they are "talking up" a secondary end point because their primary end point failed, it's quite reasonable to ask why they chose the primary end point that they did.  Obviously they thought it was more important at the start of the trial, so it's a bad sign if they are now pointing at something different as being the measure of success.  It's especially bad if the FDA has already identified their primary end point as the important measure for the disease they are studying.

For curing type-1 diabetes, the US FDA has made it very clear that the best marker of success is C-peptide generation.  Treatments that raise C-peptide generation are assumed to be on the path to a cure.   Therefore, an argument can be made that all clinical trials aimed at curing type-1 diabetes should use C-peptide as the primary end point.  But for now, I use whatever the researcher considers "primary" as their measure of success.

Why Statistically Significant?

Statistical significance is not a yes/no thing.  There are different measures of statistical significance, and different levels of significance using the same measure.  However, as I am not a statistician, I use the most common, which is a P value below 0.05.  Of course, people can argue this point from both sides.  Researchers who have gotten P values just above 0.05 will sometimes argue that a P value of 0.1 is good enough.  Or they might say that a P value of 0.06 is a "trend" in the right direction, or that having several results with P values of 0.06 or 0.07 suggests they are on the right track.  On the other hand, some statisticians argue that 0.05 is too weak a standard, or that the P value is not the best way to judge statistical significance.

At the end of the day, I use a P value of 0.05 or less, because it is by far the most common definition of statistical significance, and I'm not an expert in the field.
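
It may help to see what the 0.05 threshold actually buys.  Below is a minimal simulation sketch (the group size of 50 and the normal-approximation z-test are my illustrative assumptions, not from any real trial): when the treatment truly does nothing, roughly 5% of trials still cross the p < 0.05 line by pure chance.

```python
import random
from math import erf, sqrt
from statistics import mean, stdev

random.seed(1)

def two_sample_p(a, b):
    """Two-sided p-value from a normal-approximation z-test."""
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = (mean(a) - mean(b)) / se
    return 1 - erf(abs(z) / sqrt(2))  # = 2 * (1 - Phi(|z|))

N_TRIALS = 2000
false_positives = 0
for _ in range(N_TRIALS):
    # Both arms drawn from the SAME distribution: no real effect exists.
    treated = [random.gauss(0, 1) for _ in range(50)]
    control = [random.gauss(0, 1) for _ in range(50)]
    if two_sample_p(treated, control) < 0.05:
        false_positives += 1

print(f"False-positive rate: {false_positives / N_TRIALS:.1%}")  # ≈ 5%
```

So a single p < 0.05 result caps the chance of a fluke at about 1 in 20 — reasonable for one pre-declared end point, but (as discussed below) easy to game if you test many end points.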

What Is Standard Data Analysis?

When I require standard data analysis, I exclude results that are only seen by using certain specific techniques which are generally considered bad practice.  For example, post-hoc end points, post-hoc subgroups, and "P-value hacking" are all techniques which give results that I ignore.  Also, if the success is seen only by using unusual data analysis, I tend to ignore it.  The phrase "post hoc" refers to decisions made after the researchers know the result data for the trial.  If a researcher decides to measure C-peptide after 12 months and makes that decision before starting the trial, that's good.  But if the researcher looks at the results afterwards, sees that CD4+ cells were raised after 6 months, and reports that as a success, that is not good, because the decision to even look at that data was made after the data was known.  Let me discuss all of this in more detail:

Post-Hoc End Points: In this case, the researchers listed a bunch of end points, and did not see statistically significant results in any of them.  So they analyse the data looking for some difference between the treated and untreated groups.  They eventually find one, make that an end point, and report on it as though it were a success.  This can be thought of as "shooting at a blank wall, and then painting a target where you hit".

Post-Hoc Subgroups: In this case, if the researchers look at everyone in the trial, there is no overall good result.  Often some people improved and others got worse.  Some researchers will then try to figure out who did better, and say that the treatment was successful for this one subgroup.  (Conveniently ignoring those who got worse.)  I tend to ignore these results, because even if the researchers are right, they will have to run another trial to confirm it.  That trial will only include the successful subgroup, so the results will be obvious, and I'll report on that second trial.  For the first trial, it is just too easy to test dozens, even hundreds, of potential subgroups to find one that works by luck.  (This is similar to P-value hacking, described below.)

A related term is "responders".  "Responders" are patients who got better.  But almost all trials have some responders, so the existence of responders is not -- by itself -- a measure of success.  This is especially true if some people got better and some people got worse and it is not clear why.  This is likely to be just the random fluctuations of a disease or a drug.  Responders can be important to future research, but I don't consider them a measure of success of completed research.

P-value Hacking: The more end points you measure (or the more different subgroups you analyse), the more likely it is that you will get a low P-value by chance, rather than because your data is good.  Therefore, one way to "manufacture" a success is to measure a large number of end points or a large number of subgroups.  If you do this enough, you're bound to get one or two low P-value results.  Especially when done "post hoc" (i.e., after the trial is over and you can analyse the data), such a result is more likely to be spurious (due to luck) than real (due to effectiveness).
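
A quick simulation makes the danger concrete.  The 100-end-point count below is an illustrative assumption, not from any specific trial; the key fact is that under the null hypothesis (treatment does nothing), each end point's p-value is uniformly distributed between 0 and 1.

```python
import random

random.seed(42)

N_ENDPOINTS = 100   # illustrative: a trial reporting 100 end points
N_TRIALS = 10_000   # simulated trials, ALL with zero real treatment effect

hits = 0
for _ in range(N_TRIALS):
    # With no real effect, each end point's p-value is Uniform(0, 1).
    pvals = [random.random() for _ in range(N_ENDPOINTS)]
    if any(p < 0.05 for p in pvals):
        hits += 1

print(f"Trials with at least one 'significant' end point: {hits / N_TRIALS:.1%}")
# Analytically: 1 - 0.95**100 ≈ 99.4%, so a spurious "success"
# is almost guaranteed if you measure enough end points.
```

In other words, with 100 end points a useless treatment will nearly always hand the researchers at least one p < 0.05 result to point at.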

There is a lot of statistics behind deciding how many end points are too many, and what P values to use as you test more and more different end points.  My rule of thumb is simple: there should be only a very few primary end points (which precludes P-value hacking), and if there are more than about 20 other results, then some specific discussion of P values should be included.  If a study measures 100 different end points, uses the standard P value of 0.05, and doesn't discuss this issue, then I'm likely to consider it P-value hacking.  In these cases, I try to get an opinion from a trusted statistician.
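
One standard way that such a discussion of P values can be handled -- offered here only as an illustration, since the rule of thumb above doesn't require any particular method -- is the Bonferroni correction: divide the 0.05 threshold by the number of tests.  The p-values below are made-up numbers, not from any real trial.

```python
ALPHA = 0.05

def bonferroni_significant(p_values, alpha=ALPHA):
    """Return indices of end points that remain significant after
    dividing alpha by the number of tests (Bonferroni correction),
    which keeps the overall false-positive rate near alpha."""
    threshold = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p < threshold]

# 20 end points: with the naive 0.05 cutoff, end points 0, 1, and 2
# all look "significant"; after correction only the strongest survives.
pvals = [0.0005, 0.01, 0.04] + [0.5] * 17

print("naive:", [i for i, p in enumerate(pvals) if p < ALPHA])  # [0, 1, 2]
print("Bonferroni:", bonferroni_significant(pvals))             # [0]
```

Bonferroni is deliberately conservative; statisticians have gentler alternatives, but a study reporting many end points should be applying (and naming) some correction of this kind.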

But at the end of the day, if the researchers have used a truly unique data analysis technique, one that I have never seen used before in type-1 diabetes research, then I'm likely to exclude those findings.  That's true even if I don't see an obvious mistake.  Luckily, this is rare in type-1 diabetes research.  I think it is far more likely that researchers adopt a new, unique technique because the standard ones did not give them the results they wanted, than because the new technique is really better.  In any case, I'll wait for people with more statistical expertise than I have to offer their judgement before I go out on a limb for an unusual data analysis technique.

What Is a Good Result?

By "good" result, I mean an improvement that a patient would notice.  Something good enough to justify the hassle (and side effects) of using the treatment.  That's because I focus on treatments and cures.  A basic researcher might reasonably consider a good result to be anything new, anything that supports their theory, or, in the most extreme case, anything that they don't understand.

These researchers might say, of an otherwise unsuccessful trial, "we saw a new reaction, which we don't understand, so we learned something new, so the trial was a success".  Maybe to a basic researcher, but not to me.  Because I focus on research aimed at curing T1D, I only consider something a success if it moves us in the direction of a cure or treatment.  Learning "something new" is not enough.

Another related issue is reporting results so small that most people would not notice them.  For me, those results are not good, even if they are statistically valid.  That said, I do pay more attention if there is a plausible way for the treatment to improve results in the future.

The FDA Does Not Judge Trials

When I talk to people about clinical trials, they often assume that the FDA determines if a trial is successful or not.  They are often shocked that the FDA doesn't pass judgement on every clinical trial.  This is a natural mistake, because at the very end of the process, the FDA does give (or withhold) marketing approval.  But at no point does the FDA review a single study to decide if it was successful or not.  Some details:
• At the very end of the process, the FDA does give a decision on marketing approval.  This decision is based on all the clinical data in front of it, especially two phase-III clinical trials.  Except in very rare situations, two phase-III trials are required, and the FDA looks at all available data (all human trials and many animal trials as well).  The manufacturing process is also reviewed, as is the literature that goes with the treatment.  All of this put together results in a go / no-go decision.  Although the two phase-III trials are the most important, there is no single trial success or failure outcome.
• At the very start of the process, the FDA gives a decision called IND (investigational new drug/device).  This is an agreement on safety (not effectiveness); it is given before human tests start, and is based on safety seen in animal testing.
• The FDA (and local Institutional Review Boards) do review the design of studies done on people, to assure the safety of the people in the study.  For example, they may require studies with adults before those same studies are started on children (if the disease occurs in both adults and children).  This review is aimed at safety, not effectiveness, and is done before the trial starts.
• Finally, the FDA monitors trials once they are underway.  Trials which have unexpected bad results (especially deaths) may be stopped by the agency.  This is very rare, and I've never heard of it happening in a type-1 study.  In any case, this is not based on overall safety or effectiveness, but rather on catastrophic bad outcomes.
As you can see, none of these actions involve looking at one study and deciding if it was successful or not in terms of effectiveness.  The FDA does not do that.

Many people seem to think that a phase-II study will only happen if a phase-I study was "successful" (and the same for phase-III studies based on the "success" of a phase-II study).  This is not true.  A researcher can do a phase-II or even a phase-III study if all of these things are true: (a) they have the desire to do the study, (b) they have the money to do the study, (c) they have IND approval for the drug, device, or treatment (if needed), and finally (d) the FDA approves the study design, which is a safety approval and has little to do with the effectiveness seen in previous studies.

The real issue is money.  A successful trial needs to generate enough excitement so that money can be raised for the next step in the process.  Therefore, one could argue that the true definition of an unsuccessful trial is one that does not generate enough excitement to fund the next step.  But I don't use that definition, because it measures hype, and I measure results.  Indeed, the whole point of this blog is to report on results rather than excitement.

So I stick with the following definition:

A trial is successful if there are good, statistically significant results for the primary outcome, using standard data analysis.
