How to reduce predictors the right way for a logistic regression model


I have been reading some books (or parts of them) on modeling (F. Harrell's "Regression Modeling Strategies" among others), since my current situation is that I need to build a logistic model based on binary response data. My data set has continuous, categorical, and binary predictors. Basically, I have around 100 predictors right now, which is obviously far too many for a good model. Also, many of these predictors are related to each other, since they are often based on the same underlying metric, just computed slightly differently.



Anyhow, from what I have been reading, univariate screening and step-wise techniques are among the worst things you can do to reduce the number of predictors. The LASSO seems quite okay (if I understood it correctly), but presumably you can't just throw it at 100 predictors and expect anything good to come of it.



So what are my options here? Do I really just have to sit down, talk to all my supervisors and the smart people at work, and really think about what the top 5 predictors could/should be (we might be wrong), or which approach(es) should I consider instead?



And yes, I also know that this topic is heavily discussed (online and in books), but it sometimes seems a bit overwhelming when you are fairly new to this modeling field.



EDIT:



First of all, my sample size is 1000+ patients (which is a lot in my field), and out of those there are between 70 and 170 positive responses, depending on the outcome (e.g. 170 yes responses vs. roughly 900 no responses in one of the cases).

Basically, the idea is to predict toxicity after radiation treatment. I have prospective binary response data (the toxicity: either you have it (1) or you don't (0)), and then I have several types of metrics. Some metrics are patient specific, e.g. age, drugs used, organ and target volume, diabetes, etc., and then I have some treatment-specific metrics based on the simulated treatment field for the target. From that I can retrieve many predictors, which is highly relevant in my field, since most toxicity is strongly correlated with the amount of radiation (i.e. dose) received. For example, if I treat a lung tumour, there is a risk of hitting the heart with some amount of dose. I can then calculate how much dose a given fraction of the heart volume receives, e.g. "how much dose does 50% of the heart volume receive", and then do that in steps, so I check for example 30%, 35%, 40%, 45%, 50%, and so on. In turn I get a lot of similar predictors, but I can't just pick one to start with (although that is what past experiments have tried, and what I wish to do as well), because I need to know "exactly" at which volume fraction there actually is a strong correlation between heart toxicity and dose (again, as an example; there are other similar metrics where the same strategy applies). So that is pretty much what my data set looks like: some different metrics, and some metrics that are somewhat similar.



What I then want to do is build a predictive model so I can hopefully predict which patients are at risk of developing some kind of toxicity. Since the response is binary, my main idea was of course to use a logistic regression model; at least, that is what other people in my field have done. However, going through many of the papers where this has already been done, some of it just seems wrong (at least judging from modeling books like F. Harrell's). Many use univariate regression analysis to pick predictors and then use them in a multivariable analysis (something that is advised against, if I'm not mistaken), and many also use step-wise techniques to reduce the number of predictors.
Of course it's not all bad. Many also use the LASSO, PCA, cross-validation, bootstrapping, etc., but in the ones I have looked at there always seems to be one or two steps (at the beginning, middle, or end) that use the kinds of techniques I have read are not a good idea.



Concerning feature selection, this is probably where I am stuck now: how do I choose/find the right predictors for my model? I have tried the univariate/step-wise approaches, but every time I think, "Why even do it, if it's wrong?" Maybe it is still a good way to show, at least in the end, how a "good model" built the correct way compares against a "bad model" built the wrong way. So I could probably do it the somewhat-wrong way for now; what I need help with is a direction for doing it the right way.



Sorry for the edit, and for it being so long.



EDIT 2:
Just a quick example of what my data looks like (str() output):



'data.frame': 1151 obs. of 100 variables:
 $ Toxicity : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
 $ Age : num 71.9 64 52.1 65.1 63.2 ...
 $ Diabetes : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...
 $ Risk.Category : Ord.factor w/ 3 levels "LOW"<"INTERMEDIATE"<..: 1 1 1 1 2 1 1 1 1 3 ...
 $ Organ.Volume.CC : num 136.1 56.7 66 136.6 72.8 ...
 $ Target.Volume.CC : num 102.7 44.2 58.8 39.1 56.3 ...
 $ D1perc : num 7961 7718 7865 7986 7890 ...
 $ D1.5CC : num 7948 7460 7795 7983 7800 ...
 $ D1CC : num 7996 7614 7833 7997 7862 ...
 $ D2perc : num 7854 7570 7810 7944 7806 ...
 $ D2.5CC : num 7873 7174 7729 7952 7604 ...
 $ D2CC : num 7915 7313 7757 7969 7715 ...
 $ D3perc : num 7737 7379 7758 7884 7671 ...
 $ D3.5CC : num 7787 6765 7613 7913 7325 ...
 $ D3CC : num 7827 6953 7675 7934 7480 ...
 $ D4perc : num 7595 7218 7715 7798 7500 ...
 $ D5perc : num 7428 7030 7638 7676 7257 ...
 $ DMEAN : num 1473 1372 1580 1383 1192 ...
 $ V2000CGY : num 24.8 23.7 25.9 22.3 19.3 ...
 $ V2000CGY_CC : num 33.7 13.4 17.1 30.4 14 ...
 $ V2500CGY : num 22.5 21.5 24 20.6 17.5 ...
 $ V2500CGY_CC : num 30.7 12.2 15.9 28.2 12.7 ...
 $ V3000CGY : num 20.6 19.6 22.4 19.1 15.9 ...
 $ V3000CGY_CC : num 28.1 11.1 14.8 26.2 11.6 ...
 $ V3500CGY : num 18.9 17.8 20.8 17.8 14.6 ...
 $ V3500CGY_CC : num 25.7 10.1 13.7 24.3 10.6 ...
 $ V3900CGY : num 17.5 16.5 19.6 16.7 13.6 ...
 $ V3900CGY_CC : num 23.76 9.36 12.96 22.85 9.91 ...
 $ V4500CGY : num 15.5 14.4 17.8 15.2 12.2 ...
 $ V4500CGY_CC : num 21.12 8.18 11.76 20.82 8.88 ...
 $ V5000CGY : num 13.9 12.8 16.4 14 11 ...
 $ V5000CGY_CC : num 18.91 7.25 10.79 19.09 8.03 ...
 $ V5500CGY : num 12.23 11.14 14.84 12.69 9.85 ...
 $ V5500CGY_CC : num 16.65 6.31 9.79 17.33 7.17 ...
 $ V6000CGY : num 10.56 9.4 13.19 11.34 8.68 ...
 $ V6000CGY_CC : num 14.37 5.33 8.7 15.49 6.32 ...
 $ V6500CGY : num 8.79 7.32 11.35 9.89 7.44 ...
 $ V6500CGY_CC : num 11.96 4.15 7.49 13.51 5.42 ...
 $ V7000CGY : num 6.76 5.07 9.25 8.27 5.86 ...
 $ V7000CGY_CC : num 9.21 2.87 6.1 11.3 4.26 ...
 $ V7500CGY : num 4.61 2.37 6.22 6.13 4 ...
 $ V7500CGY_CC : num 6.27 1.34 4.11 8.38 2.91 ...
 $ V8000CGY : num 0.7114 0.1521 0.0348 0.6731 0.1527 ...
 $ V8000CGY_CC : num 0.9682 0.0863 0.023 0.9194 0.1112 ...
 $ V8200CGY : num 0.087 0 0 0 0 ...
 $ V8200CGY_CC : num 0.118 0 0 0 0 ...
 $ V8500CGY : num 0 0 0 0 0 0 0 0 0 0 ...
 $ V8500CGY_CC : num 0 0 0 0 0 0 0 0 0 0 ...
 $ n_0.02 : num 7443 7240 7371 7467 7350 ...
 $ n_0.03 : num 7196 6976 7168 7253 7112 ...
 $ n_0.04 : num 6977 6747 6983 7055 6895 ...
 $ n_0.05 : num 6777 6542 6811 6871 6693 ...
 $ n_0.06 : num 6592 6354 6649 6696 6503 ...
 $ n_0.07 : num 6419 6180 6496 6531 6325 ...
 $ n_0.08 : num 6255 6016 6350 6374 6155 ...
 $ n_0.09 : num 6100 5863 6211 6224 5994 ...
 $ n_0.1 : num 5953 5717 6078 6080 5840 ...
 $ n_0.11 : num 5813 5579 5950 5942 5692 ...
 $ n_0.12 : num 5679 5447 5828 5809 5551 ...
 $ n_0.13 : num 5551 5321 5709 5681 5416 ...
 $ n_0.14 : num 5428 5201 5595 5558 5285 ...
 $ n_0.15 : num 5310 5086 5485 5439 5160 ...
 $ n_0.16 : num 5197 4975 5378 5324 5039 ...
 $ n_0.17 : num 5088 4868 5275 5213 4923 ...
 $ n_0.18 : num 4982 4765 5176 5106 4811 ...
 $ n_0.19 : num 4881 4666 5079 5002 4702 ...
 $ n_0.2 : num 4783 4571 4985 4901 4597 ...
 $ n_0.21 : num 4688 4478 4894 4803 4496 ...
 $ n_0.22 : num 4596 4389 4806 4708 4398 ...
 $ n_0.23 : num 4507 4302 4720 4616 4303 ...
 $ n_0.24 : num 4421 4219 4636 4527 4210 ...
 $ n_0.25 : num 4337 4138 4555 4440 4121 ...
 $ n_0.26 : num 4256 4059 4476 4355 4035 ...
 $ n_0.27 : num 4178 3983 4398 4273 3951 ...
 $ n_0.28 : num 4102 3909 4323 4193 3869 ...
 $ n_0.29 : num 4027 3837 4250 4115 3790 ...
 $ n_0.3 : num 3955 3767 4179 4039 3713 ...
 $ n_0.31 : num 3885 3699 4109 3966 3639 ...
 $ n_0.32 : num 3817 3633 4041 3894 3566 ...
 $ n_0.33 : num 3751 3569 3975 3824 3496 ...
 $ n_0.34 : num 3686 3506 3911 3755 3427 ...
 $ n_0.35 : num 3623 3445 3847 3689 3361 ...
 $ n_0.36 : num 3562 3386 3786 3624 3296 ...
 $ n_0.37 : num 3502 3328 3725 3560 3233 ...
 $ n_0.38 : num 3444 3272 3666 3498 3171 ...
 $ n_0.39 : num 3387 3217 3609 3438 3111 ...
 $ n_0.4 : num 3332 3163 3553 3379 3053 ...
 $ n_0.41 : num 3278 3111 3498 3321 2996 ...
 $ n_0.42 : num 3225 3060 3444 3265 2941 ...
 $ n_0.43 : num 3173 3010 3391 3210 2887 ...
 $ n_0.44 : num 3123 2961 3339 3156 2834 ...
 $ n_0.45 : num 3074 2914 3289 3103 2783 ...
 $ n_0.46 : num 3026 2867 3239 3052 2733 ...
 $ n_0.47 : num 2979 2822 3191 3002 2684 ...
 $ n_0.48 : num 2933 2778 3144 2953 2637 ...
 $ n_0.49 : num 2889 2734 3097 2905 2590 ...


And if I run table(data$Toxicity) the output is:



> table(data$Toxicity)
   0    1
1088   63


Again, this is for one type of toxicity. I have 3 others as well.

logistic predictive-models feature-selection regression-strategies

asked Mar 21 at 0:21 by Denver Dang, edited Mar 21 at 13:17

  • What are you aiming to do? Prediction or inference, or something else? – Stephan Kolassa, Mar 21 at 6:32

  • This is called feature selection. If you must use regression, then categorical features have to be one-hot encoded, but tree methods can use them as-is. You could even figure out your most predictive n-way interaction or association terms and use those. – smci, Mar 21 at 7:31

  • "Do I really just have to sit down, talk to people, and really think about/reason out the top-n predictors?" Hell no; intuition is a starting point, but that's why there are feature-selection methods: results from lots of experimentation beat intuition. – smci, Mar 21 at 7:35

  • @smci sorry for being unclear. In my field (radiation oncology) we make treatment plans, which are basically a 3D representation of how radiation/dose is distributed around a target. Unfortunately, this can't be done without hitting at least a small amount of healthy tissue. From this 3D map, so to speak, I can for example get information about how large a volume receives a given amount of radiation/dose. But as you can imagine, I can "ask" in steps like "how much radiation does 1% of this structure volume receive", and then 2%, 3%, and so on. In principle, the values will be somewhat similar. – Denver Dang, Mar 21 at 9:58

  • @smci, if prediction is the OP's goal, then correlation should be of no concern. High correlation among the variables would really only be a great concern when trying to interpret the variables included in the model. – StatsStudent, Mar 21 at 14:15

3 Answers

Some of the answers you have received that push feature selection are off base.



The lasso, or better the elastic net, will do feature selection, but as pointed out above you will be quite disappointed by the volatility of the set of "selected" features. I believe the only real hope in your situation is data reduction, i.e., unsupervised learning, as I emphasize in my book. Data reduction brings more interpretability and especially more stability. I very much recommend sparse principal components, or variable clustering followed by regular principal components on clusters.



The information content in your dataset is far, far too low for any feature selection algorithm to be reliable.
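
As a rough sketch of what "variable clustering followed by regular principal components on clusters" could look like in R (this assumes Hmisc::varclus and base prcomp; the number of clusters and the clinical covariates kept at the end are illustrative choices, not part of the answer):

library(Hmisc)   # for varclus()

# Unsupervised step: the outcome Toxicity is never looked at here.
num <- data[, setdiff(names(data), "Toxicity")]
num <- num[, sapply(num, is.numeric)]                      # cluster the numeric predictors

vc <- varclus(as.matrix(num), similarity = "spearman")
plot(vc)                                                   # inspect the dendrogram
cl <- cutree(vc$hclust, k = 5)                             # k = 5 clusters is illustrative only

# Summarise each cluster by its first principal component ("cluster score").
scores <- sapply(sort(unique(cl)), function(k)
  prcomp(scale(num[, cl == k, drop = FALSE]))$x[, 1])
colnames(scores) <- paste0("cluster", sort(unique(cl)))

# Supervised step: a logistic model with only a handful of parameters to estimate.
fit <- glm(data$Toxicity ~ scores + data$Age + data$Diabetes, family = binomial)
summary(fit)

The cluster scores (plus whatever clinical variables matter) then carry the dose information into the model with far fewer parameters, which is the spirit of the "100 candidate features to about 5 cluster scores" comment below.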






  • First of all, thank you for taking the time to comment. Secondly, if I'm not mistaken, unsupervised learning is when you don't make use of (or don't have) the response variable (i.e. the 1 or 0), and let the computer "guess" how the data should be divided. But logistic (and linear) regression is supervised as far as I know? So your recommendation is to abandon that methodology? On one hand I like the idea, but on the other, logistic and probit regression is what almost every modeling paper in my field (on data similar to mine) has used so far. – Denver Dang, Mar 21 at 13:01

  • So wouldn't I be going out on a limb here, or do I just have to assume that everyone else has been doing it "wrong" forever? – Denver Dang, Mar 21 at 13:05

  • Not everyone, but most people have definitely been doing it wrong. This was a prime motivator for writing Regression Modeling Strategies. The goal of data reduction is to lessen the amount of supervised learning that logistic regression gets asked to do. For example, you might reduce 100 candidate features to effectively 5 cluster scores, and then have to estimate only 5 parameters + intercept. – Frank Harrell, Mar 21 at 13:27

  • How do you feel about varying-coefficient models in this case (as I added to my answer)? – Ben Bolker, Mar 21 at 13:34

  • @FrankHarrell it sounds rather interesting. But sorry for asking: why is supervised learning bad, or at least somewhat bad, as you seem to be implying? – Denver Dang, Mar 21 at 20:59

+1 for "sometimes seems a bit overwhelming". It really depends (as Harrell clearly states; see the section at the end of Chapter 4) whether you want to do




  • confirmatory analysis (→ reduce your predictor complexity to a reasonable level without looking at the responses, by PCA or subject-area considerations or ...)


  • predictive analysis (→ use appropriate penalization methods; a minimal glmnet sketch follows this list). Lasso could very well work OK with 100 predictors, if you have a reasonably large sample. Feature selection will be unstable, but that's OK if all you care about is prediction. I have a personal preference for ridge-like approaches that don't technically "select features" (because they never reduce any parameter to exactly zero), but whatever works ...



    You'll have to use cross-validation to choose the degree of penalization, which will destroy your ability to do inference (construct confidence intervals on predictions) unless you use cutting-edge high-dimensional inference methods (e.g. Dezeure et al 2015; I have not tried these approaches but they seem sensible ...)




  • exploratory analysis: have fun, be transparent and honest, don't quote any p-values.
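
Picking up the predictive-analysis bullet: a minimal sketch of the penalized route in R, assuming the glmnet package and the data frame from the question (the choice of alpha = 1 and of the "lambda.1se" rule is illustrative, not a recommendation):

library(glmnet)

# Model matrix: expands factors to dummy columns and drops the intercept column.
X <- model.matrix(Toxicity ~ ., data = data)[, -1]
y <- data$Toxicity

# alpha = 1 is the lasso, alpha = 0 is ridge, values in between give the elastic net;
# cv.glmnet chooses the penalty lambda by cross-validation.
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1, nfolds = 10)

plot(cvfit)                                   # cross-validated deviance vs. log(lambda)
coef(cvfit, s = "lambda.1se")                 # coefficients at the sparser, 1-SE penalty
p <- predict(cvfit, newx = X, s = "lambda.1se", type = "response")   # predicted risks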

For the particular use case you have now described (a bunch of your predictors essentially represent a cumulative distribution of the dose received by different fractions of the heart), you might want to look into varying-coefficient models (a little hard to search for), which basically fit a smooth curve for the effect of the CDF (these can be implemented in R's mgcv package).
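
I am not certain this is exactly the varying-coefficient formulation meant above, but one way to encode "a smooth weight across the dose-volume curve" in mgcv is a linear functional term (see ?linear.functional.terms), where a smooth function of the volume fraction multiplies the corresponding dose values and is summed per patient. The n_0.02 ... n_0.49 column names come from the question; everything else is an illustrative assumption:

library(mgcv)

# Dose-at-volume-fraction columns from the question (n_0.02 ... n_0.49).
frac_cols <- grep("^n_0\\.", names(data), value = TRUE)
Dose <- as.matrix(data[, frac_cols])                      # n x k matrix of doses
Frac <- matrix(as.numeric(sub("^n_", "", frac_cols)),     # matching n x k matrix of fractions
               nrow = nrow(Dose), ncol = ncol(Dose), byrow = TRUE)

# logit P(toxicity) = b0 + sum_j f(Frac[i, j]) * Dose[i, j] + clinical covariates,
# i.e. one smooth weight function over the dose-volume curve instead of ~50 raw columns.
fit <- gam(Toxicity ~ s(Frac, by = Dose) + Age + Diabetes,
           family = binomial, data = data, method = "REML")
plot(fit)        # estimated weight curve over the volume fractions
summary(fit)

The smoother choice, the clinical covariates, and REML smoothing selection here are placeholders; the point is only that the many nearly collinear dose-volume columns collapse into a single smooth term.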






  • My sample size is 1000+, and depending on which response variable (I have 4), I have between 75 and 170 positive (or negative, depending on how you look at it) responses out of the 1000+. I don't know if that makes anything easier, i.e. whether I can skip some steps since the sample is rather large (at least in my field). – Denver Dang, Mar 21 at 10:13

  • Would cross-validation destroy the ability to do inference? Maybe. I'm thinking one could bootstrap before cross-validation to obtain confidence intervals for predictions. This may be feasible with 1000 observations. – JTH, Mar 21 at 15:35

  • Post-selection inference is really hard; if you use the whole data set to tune hyperparameters (such as the strength of penalization), then you're in the same situation. You'd have to outline your bootstrap + CV approach before I can say whether I'd believe it could work ... – Ben Bolker, Mar 21 at 16:18

There are many different approaches. What I would recommend is trying some simple ones, in the following order:



  • L1 regularization (with increasing penalty; the larger the regularization coefficient, the more features will be eliminated)

  • Recursive Feature Elimination (https://scikit-learn.org/stable/modules/feature_selection.html#recursive-feature-elimination) -- removes features incrementally by eliminating the features associated with the smallest model coefficients (assuming that those are the least important ones; obviously, it is crucial here to normalize the input features)

  • Sequential Feature Selection (http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/) -- removes features based on how important they are for predictive performance
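
A minimal sketch of the RFE approach from the list above, assuming scikit-learn; the base estimator, the scaling step, and the target of 10 retained features are illustrative choices:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=100, n_informative=10,
                           random_state=0)

# Standardize first so coefficient magnitudes are comparable across features,
# then repeatedly refit and drop the feature with the smallest |coefficient|.
selector = make_pipeline(
    StandardScaler(),
    RFE(estimator=LogisticRegression(max_iter=5000),
        n_features_to_select=10, step=1),
).fit(X, y)

print(selector[-1].support_.nonzero()[0])  # indices of the 10 surviving features

scikit-learn also ships its own SequentialFeatureSelector, which can stand in for the mlxtend selector linked above; whichever of the three methods is used, the instability concern raised in the comments below applies.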





$endgroup$

answered Mar 21 at 0:33 by resnet












  • $begingroup$
    I believe that all three of these methods will be found to be unstable.
    $endgroup$
    – Frank Harrell
    Mar 21 at 19:20










  • $begingroup$
    It depends on how you define unstable. In practice, you usually use some sort of cross-validation, like k-fold or leave-one-out, and judge which features you end up choosing based on overall performance + variance (aka the 1SE method).
    $endgroup$
    – resnet
    Mar 21 at 21:39










  • $begingroup$
    Bootstrapping and cross-validation only validate some predictive index for the process generating the model. This results in a good estimate of that index for a model selected using that process but does not provide any comfort for the structure of a model that was developed once, i.e., the overall model. Look at the structure selected (i.e., features selected) across the resamples to see the volatility.
    $endgroup$
    – Frank Harrell
    Mar 22 at 13:10
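
A minimal sketch of the stability check suggested in the comment above: refit an L1-penalized logistic regression on bootstrap resamples and tabulate how often each feature survives (scikit-learn assumed; the fixed penalty C=0.1 and the synthetic data are illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=100, n_informative=10,
                           random_state=0)
X = StandardScaler().fit_transform(X)

rng = np.random.default_rng(0)
n, p = X.shape
times_selected = np.zeros(p)

for _ in range(200):                          # 200 bootstrap resamples
    idx = rng.integers(0, n, size=n)          # draw rows with replacement
    lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    lasso.fit(X[idx], y[idx])
    times_selected += (lasso.coef_.ravel() != 0)

# Stable selection would show frequencies near 0 or 1; many features hovering
# in between is exactly the volatility being warned about.
print(np.round(times_selected / 200, 2))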










