Need help understanding what a natural log transformation is actually doing and why specific transformations are required for linear regression [duplicate]





This question already has an answer here:



  • In linear regression, when is it appropriate to use the log of an independent variable instead of the actual values?

    8 answers



  • Regression: Transforming Variables

    1 answer



  • Box-Cox like transformation for independent variables?

    3 answers



  • When (and why) should you take the log of a distribution (of numbers)?

    2 answers



I’m taking an online “Intro to AI” course for which I’m doing some Azure Machine Learning labs. The course is largely about how to apply Azure ML solutions and, while there is an “essential math for ML” module, it doesn’t do any sort of deep dive into the math/statistics. But I want to understand what’s happening mathematically, at least at a basic level. (I've taken an intro to statistics; that's it so far.) Below are some details about the specific things I’d like to understand, and my specific questions.



At a particular point in the “training a classification model” lab, there is this text: “The distribution of the Age column in the diabetes.csv dataset is skewed because most patients are in the youngest age bracket. Creating a version of this feature that uses a natural log transformation can help create a more linear relationship between Age and other features, and improve the ability to predict the Diabetic label. This kind of feature engineering as it’s called is common in machine learning data preparation.”



Then the instructions show how to use Azure ML Studio to apply the natural log operation to the Age column in the dataset:
[screenshot: natural log transform in Azure ML Studio]



And after doing that, the data includes the original/raw age data and the transformed age data:
[screenshot: resulting data visualized]
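(For reference, a minimal pandas/NumPy sketch of what that Studio step does; the values and column names here are made up, not from the lab's dataset:)

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the lab's diabetes.csv: a right-skewed Age column.
df = pd.DataFrame({"Age": [21, 21, 22, 23, 24, 26, 30, 38, 55, 77]})

# The Studio step simply adds a new feature equal to ln(Age);
# the original column is kept alongside the transformed one.
df["LogAge"] = np.log(df["Age"])
print(df)
```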



Question #1: What is that transformation actually doing? I don't mean the nitty-gritty math, but what is it doing conceptually?



Question #2: The next general question is why a transformation is necessary at all. On that, I did some research and found this article (https://www.r-statistics.com/2013/05/log-transformations-for-skewed-and-wide-distributions-from-practical-data-science-with-r/) that describes which type of log transformation to use in a few scenarios. Here’s a snippet from the article:
“The need for data transformation can depend on the modeling method that you plan to use. For linear and logistic regression, for example, you ideally want to make sure that the relationship between input variables and output variables is approximately linear, that the input variables are approximately normal in distribution, and that the output variable is constant variance (that is, the variance of the output variable is independent of the input variables). You may need to transform some of your input variables to better meet these assumptions.”



I don’t understand why any of that is necessary. I’ve broken my question down:



For linear and logistic regression, for example, you ideally want to make sure that:

  • the relationship between input variables and output variables is approximately linear – why?
  • the input variables are approximately normal in distribution – why?
  • the output variable has constant variance (that is, the variance of the output variable is independent of the input variables) – why?



Could someone help with information for those whys, or point me to material that would help?













marked as duplicate by whuber Mar 23 at 14:05


This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.













  • I hate it when questions such as yours are put on hold or marked as duplicate, as we cannot answer them properly. So we're reduced to having to just add comments instead of answers! – Isabella Ghement, Mar 23 at 18:02





  • One intuitive way to think about what the log transformation is doing is to look at the distribution of Age - it's right skewed, so many of the ages are 'clumped together' at the lower end of the age scale, while fewer ages are spread out at the higher end of the age scale. When you log transform the distribution of Age, you 'stretch out' the clumped values to make it easier to see what is going on. In other words, the log-transformed age values at the lower end of the scale are better separated from each other, instead of crowding each other. – Isabella Ghement, Mar 23 at 18:10






  • Look at the high spike in the distribution of Age (a clear sign of clumping or crowdedness). That spike was attenuated in the distribution of the log-transformed Age. – Isabella Ghement, Mar 23 at 18:13






  • Strictly speaking, transforming Age before including it in your model is not necessary. In fact, you could fit your model using the original Age variable first and then check the model diagnostics for any evidence that a transformation of Age might improve the model fit. People tend to transform predictor variables such as Age when they have skewed distributions because they think that will lead to an improved model. Whether or not that is the case is debatable. Additionally, transforming variables can complicate the interpretation of the model. – Isabella Ghement, Mar 23 at 18:17






  • To give you a concrete example, let's say that all predictor variables in your model except for Age are continuous and have roughly symmetric distributions. You fit a linear regression model including all the predictors as they are (i.e., untransformed) and look at the distribution of the model residuals. If this distribution reflects the skewness of the Age distribution, then you'd be entitled to think that log-transforming Age prior to including it in your model would resolve the skewness of the residual distribution. – Isabella Ghement, Mar 23 at 18:21
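(A minimal sketch of the fit-first, check-residuals workflow these comments describe; the synthetic data and the use of statsmodels are assumptions for illustration, not part of the lab:)

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
age = rng.lognormal(mean=3.5, sigma=0.4, size=500)     # right-skewed "ages"
y = 2.0 + 1.5 * np.log(age) + rng.normal(0, 0.5, 500)  # true relationship is logarithmic

# Fit with the untransformed Age first, then with log(Age), and compare.
raw_fit = sm.OLS(y, sm.add_constant(age)).fit()
log_fit = sm.OLS(y, sm.add_constant(np.log(age))).fit()
print(raw_fit.rsquared, log_fit.rsquared)

# Skew or curvature in raw_fit.resid that disappears in log_fit.resid
# is the diagnostic evidence that the transformation earns its keep.
```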

















Tags: regression, linear, logarithm






edited Mar 22 at 23:22 by Ferdi
asked Mar 22 at 22:07 by redmondcoffehead




1 Answer

There's a lot here to break down. I hate to say it, but some of the advice in your course is quite misguided and wrong.




What is that transformation actually doing? I don't mean the nitty-gritty math, but what is it doing conceptually?




The math here is pretty simple. You have a bunch of measurements of people's age that you would like to use as a feature in predicting some other measurement (looks like the probability of something happening). You're simply creating a new feature which is the logarithm of the original feature. I'll explain why you would want to do this below.




For linear and logistic regression, for example, you ideally want to make sure that: the relationship between input variables and output variables is approximately linear – why?




This is a structural assumption of the linear and logistic regression models. I'll focus on linear regression, because it's a bit simpler, but the same thing holds for logistic regression.



The linear regression model makes predictions by building a formula based on the data you feed into the algorithm. All prediction models work this way, but linear regression is distinguished by building the simplest possible formula. If $y$ is the thing you are trying to predict, and $x_1, x_2, \ldots$ are the features you are using to predict it, then the linear regression formula is:



$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$



Here, the $\beta_i$'s are just numbers, and the job of the algorithm is to determine what numbers work best.



Notice that if you vary one of the $x$'s and look at how the output $y$ changes as a result, you'll get a line. This is a direct consequence of the way the linear regression model works. If you want it to give you sensible results, you need to make sure this line-drawing assumption is at least approximately true.
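A quick sketch, if it helps make this concrete (synthetic data and plain NumPy, not anything from your course):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(0, 10, 200)
x2 = rng.uniform(0, 10, 200)
y = 3.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0, 1, 200)

# Least squares chooses the beta_i's that best fit the data.
X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # roughly [3.0, 2.0, -0.5]

# Holding x2 fixed and varying x1, the prediction
# beta[0] + beta[1]*x1 + beta[2]*x2 traces out a straight line.
```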




the input variables are approximately normal in distribution – why?




This is simply wrong. Linear regression works fine even if the distribution of the input variables is highly non-normal. What is important is the relationship between the inputs and outputs, not the distribution of the inputs themselves.



This is what I meant by the advice in the course being misguided. You don't transform input variables because their distribution is skewed; you transform them so that the linear shape the model is trying to draw through your data is reasonable.



For example, here is a scatterplot I found online of a country's GDP vs. its average life expectancy (attribution is in the image):



[scatterplot: GDP vs. life expectancy]



Clearly, drawing a line through the scatter plot is completely unreasonable, so the linear regression equation:



$$ \text{Life Expectancy} = \beta_0 + \beta_1 \, \text{GDP} $$



is a bad choice for the data. On the other hand, it looks like a logarithmic relationship is reasonable, so something like:



$$ \text{Life Expectancy} = \beta_0 + \beta_1 \log(\text{GDP}) $$



looks like it would work a lot better. This is the type of situation where transforming the GDP measurements with a logarithm is a good idea. But it has nothing to do with the distribution of GDP itself: you can't tell it's a good idea by drawing a histogram of GDP; it's about the relationship between GDP and life expectancy.
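To see the same thing numerically, here is a sketch on synthetic data generated from a logarithmic relationship (the numbers are made up; only the comparison between the two fits matters):

```python
import numpy as np

rng = np.random.default_rng(2)
gdp = rng.lognormal(mean=9.0, sigma=1.2, size=300)             # skewed, GDP-like
life = 20.0 + 5.0 * np.log(gdp) + rng.normal(0, 2.0, size=300)

def r2(x, y):
    """R^2 of a simple least-squares line of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

print(r2(gdp, life))          # poor: a straight line misses the curvature
print(r2(np.log(gdp), life))  # much better: linear in log(GDP)
```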




the output variable has constant variance (that is, the variance of the output variable is independent of the input variables) – why?




This is a deeper issue, of a different nature than the others. For prediction models it doesn't really matter, so if you're focusing on learning to build good predictive models, don't worry about it for now.



In summary, this assumption is intended to support the computation of the sampling distribution of the parameter estimates. For example, if you want to say something like "the probability that I would collect data in which the relationship between log(GDP) and Life Expectancy is greater than what I actually observed, even when there is truly no relationship, is very, very small", you need to be able to compute the sampling distribution of the parameter estimates. There are various assumptions that allow this to be done, and this constant variance assumption is one of them.



That said, if you're only trying to make predictions, this isn't really relevant. And in no case is the distribution of the input data assumed to be normal; that's just a misconception.
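If you're curious what "the sampling distribution of the parameter estimates" looks like concretely, here is a small simulation sketch (entirely synthetic; the last line compares the empirical spread of the slope estimates with the textbook standard-error formula, which relies on the constant variance assumption):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
X = np.column_stack([np.ones_like(x), x])

# Redraw the (constant-variance) noise many times and refit;
# the spread of the slope estimates is its sampling distribution.
slopes = [
    np.linalg.lstsq(X, 1.0 + 0.5 * x + rng.normal(0, 1.0, 100), rcond=None)[0][1]
    for _ in range(2000)
]

# Empirical spread vs. the formula sigma / sqrt(sum((x - xbar)^2)), with sigma = 1.
print(np.std(slopes), 1.0 / np.sqrt(np.sum((x - x.mean()) ** 2)))
```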






edited Mar 23 at 1:14
answered Mar 22 at 23:52 by Matthew Drury












