Is it ever recommended to use mean/multiple imputation when using tree-based predictive models?Orthogonal sets of variables in multiple imputation --> separate imputation models?Multiple Imputation Using Different Data Setsusing cluster information in multiple imputationMultiple Imputation for Spatial Modelsmultiple imputation models containing categorical variablesWhen to use multiple imputation chained equations vs regression to impute data?Multiple imputation when explained variance of imputation model is lowPredictive Mean Matching as Single Imputation?How to apply a model built using Multiple Imputation to predict on dataset with missing data?How NULLs in numerical variables are treated in tree-based models?

Changing Color of error messages

Fewest number of steps to reach 200 using special calculator

Naive Monte Carlo, MCMC and their use in Bayesian Theory

Replace four times with sed

How are passwords stolen from companies if they only store hashes?

Is there a hypothetical scenario that would make Earth uninhabitable for humans, but not for (the majority of) other animals?

How can an organ that provides biological immortality be unable to regenerate?

What is the plural TO / OF something

Have the tides ever turned twice on any open problem?

Generic TVP tradeoffs?

Constant Current LED Circuit

Is it true that good novels will automatically sell themselves on Amazon (and so on) and there is no need for one to waste time promoting?

What does "Four-F." mean?

Can a wizard cast a spell during their first turn of combat if they initiated combat by releasing a readied spell?

Print last inputted byte

Practical application of matrices and determinants

Is there a term for accumulated dirt on the outside of your hands and feet?

Calculate the frequency of characters in a string

Is it correct to say "which country do you like the most?"

PTIJ What is the inyan of the Konami code in Uncle Moishy's song?

How to generate binary array whose elements with values 1 are randomly drawn

Why Choose Less Effective Armour Types?

Is honey really a supersaturated solution? Does heating to un-crystalize redissolve it or melt it?

Usage and meaning of "up" in "...worth at least a thousand pounds up in London"



Is it ever recommended to use mean/multiple imputation when using tree-based predictive models?


Orthogonal sets of variables in multiple imputation --> separate imputation models?Multiple Imputation Using Different Data Setsusing cluster information in multiple imputationMultiple Imputation for Spatial Modelsmultiple imputation models containing categorical variablesWhen to use multiple imputation chained equations vs regression to impute data?Multiple imputation when explained variance of imputation model is lowPredictive Mean Matching as Single Imputation?How to apply a model built using Multiple Imputation to predict on dataset with missing data?How NULLs in numerical variables are treated in tree-based models?













4












$begingroup$


Everytime that I am making some predictive model and I have missing data I impute categorical variables with something like "UNKNOWN" and numerical variables with some absurd number that will never be seen in practice (even if the variable is unbounded I can take the exponent of the variable and make the unknown values negative).



The main advantage is that the model knows that the variable is missing, which is not the case for say mean imputation. I can see that this could be disastrous in linear models or neural networks but in tree-based models this is handled really smoothly.



I know that there is a great deal of literature on missing data imputation, but when and why would I ever use these methods when missing data for predictive (tree-based) models?










share|cite|improve this question









$endgroup$











  • $begingroup$
    Imputing a large number for numeric data could be very bad for tree based models. Think of it this way, if your split is for example on income and the split is at say 100k, now everyone that was missing is going to be in the split with the high income earners
    $endgroup$
    – astel
    17 hours ago










  • $begingroup$
    The model will be fitted with that imputed values as well - so if they are significantly different than people with true high income the tree should make a split with true high and fake high (missing) income. If variability is low inside the tree node then there is not much to worry.
    $endgroup$
    – gsmafra
    17 hours ago















4












$begingroup$


Everytime that I am making some predictive model and I have missing data I impute categorical variables with something like "UNKNOWN" and numerical variables with some absurd number that will never be seen in practice (even if the variable is unbounded I can take the exponent of the variable and make the unknown values negative).



The main advantage is that the model knows that the variable is missing, which is not the case for say mean imputation. I can see that this could be disastrous in linear models or neural networks but in tree-based models this is handled really smoothly.



I know that there is a great deal of literature on missing data imputation, but when and why would I ever use these methods when missing data for predictive (tree-based) models?










share|cite|improve this question









$endgroup$











  • $begingroup$
    Imputing a large number for numeric data could be very bad for tree based models. Think of it this way, if your split is for example on income and the split is at say 100k, now everyone that was missing is going to be in the split with the high income earners
    $endgroup$
    – astel
    17 hours ago










  • $begingroup$
    The model will be fitted with that imputed values as well - so if they are significantly different than people with true high income the tree should make a split with true high and fake high (missing) income. If variability is low inside the tree node then there is not much to worry.
    $endgroup$
    – gsmafra
    17 hours ago













4












4








4


1



$begingroup$


Everytime that I am making some predictive model and I have missing data I impute categorical variables with something like "UNKNOWN" and numerical variables with some absurd number that will never be seen in practice (even if the variable is unbounded I can take the exponent of the variable and make the unknown values negative).



The main advantage is that the model knows that the variable is missing, which is not the case for say mean imputation. I can see that this could be disastrous in linear models or neural networks but in tree-based models this is handled really smoothly.



I know that there is a great deal of literature on missing data imputation, but when and why would I ever use these methods when missing data for predictive (tree-based) models?










share|cite|improve this question









$endgroup$




Everytime that I am making some predictive model and I have missing data I impute categorical variables with something like "UNKNOWN" and numerical variables with some absurd number that will never be seen in practice (even if the variable is unbounded I can take the exponent of the variable and make the unknown values negative).



The main advantage is that the model knows that the variable is missing, which is not the case for say mean imputation. I can see that this could be disastrous in linear models or neural networks but in tree-based models this is handled really smoothly.



I know that there is a great deal of literature on missing data imputation, but when and why would I ever use these methods when missing data for predictive (tree-based) models?







missing-data cart boosting data-imputation multiple-imputation






share|cite|improve this question













share|cite|improve this question











share|cite|improve this question




share|cite|improve this question










asked 18 hours ago









gsmafragsmafra

17018




17018











  • $begingroup$
    Imputing a large number for numeric data could be very bad for tree based models. Think of it this way, if your split is for example on income and the split is at say 100k, now everyone that was missing is going to be in the split with the high income earners
    $endgroup$
    – astel
    17 hours ago










  • $begingroup$
    The model will be fitted with that imputed values as well - so if they are significantly different than people with true high income the tree should make a split with true high and fake high (missing) income. If variability is low inside the tree node then there is not much to worry.
    $endgroup$
    – gsmafra
    17 hours ago
















  • $begingroup$
    Imputing a large number for numeric data could be very bad for tree based models. Think of it this way, if your split is for example on income and the split is at say 100k, now everyone that was missing is going to be in the split with the high income earners
    $endgroup$
    – astel
    17 hours ago










  • $begingroup$
    The model will be fitted with that imputed values as well - so if they are significantly different than people with true high income the tree should make a split with true high and fake high (missing) income. If variability is low inside the tree node then there is not much to worry.
    $endgroup$
    – gsmafra
    17 hours ago















$begingroup$
Imputing a large number for numeric data could be very bad for tree based models. Think of it this way, if your split is for example on income and the split is at say 100k, now everyone that was missing is going to be in the split with the high income earners
$endgroup$
– astel
17 hours ago




$begingroup$
Imputing a large number for numeric data could be very bad for tree based models. Think of it this way, if your split is for example on income and the split is at say 100k, now everyone that was missing is going to be in the split with the high income earners
$endgroup$
– astel
17 hours ago












$begingroup$
The model will be fitted with that imputed values as well - so if they are significantly different than people with true high income the tree should make a split with true high and fake high (missing) income. If variability is low inside the tree node then there is not much to worry.
$endgroup$
– gsmafra
17 hours ago




$begingroup$
The model will be fitted with that imputed values as well - so if they are significantly different than people with true high income the tree should make a split with true high and fake high (missing) income. If variability is low inside the tree node then there is not much to worry.
$endgroup$
– gsmafra
17 hours ago










2 Answers
2






active

oldest

votes


















2












$begingroup$

One reason you may not want to use "insert impossible value" methods is that means that your predictive model works conditional on the distribution of the data missingness remaining unchanged. Thus, if after building your tree model, it is realized that we can start using certain features more often, we can no longer use the model that was built using the "impute impossible value" method without retraining the model.



In fact, this problem is even further compounded if the rates of missingness changes during the data collection process itself. Then, even immediately after building the model, it is already "out of date", as the current rates of missingness will be different than the rates of missingness during when the data was collected.



To illustrate the issue, let's suppose a bank is building a database to help predict if clients will default on a loan. Early in the data collection process, loan officers have the option to conduct a background investigation, but they almost never do for clients they deem as trustworthy. Thus, for the especially trustworthy customers, the background check variable is almost always missing. If you use the "impute impossible value" method, having a possible value for background checks indicates high risk.



If background check rates don't change at all, then this "impute impossible value" method will likely still provide valid predictions. However, let's suppose the bank realizes that background checks are really helpful for assessing risk, so they change their policy to include background checks for everyone. Then, everyone will have a possible value for background checks and using the "impute impossible value" method, everyone will be flagged as "high risk".



Cross validation will not catch this issue, as the missingness distribution will be the same between the training and testing sets. So even though the "impute impossible value" method may lead to pretty results during cross-validation, this will lead to poor predictions upon deployment!



Note that you will essentially need to throw away all your data everytime your data collection policy changes! Alternatively, if you can correctly impute the missing values and their uncertainty, you can now use the data that was collected under the old policy.






share|cite|improve this answer











$endgroup$












  • $begingroup$
    That's a good point, imputation could be more robust on changes in the way data is missing. I will take your statement on throwing away past data as an exaggeration though - including a time variable and retraining the model should be enough make it usable again.
    $endgroup$
    – gsmafra
    16 hours ago










  • $begingroup$
    @gsmafra: In general, I don't think adding a time variable will fix the problem. For example, in a random forest, the time variable will only be included in 1/3 of the trees, so it won't even be included in the majority of the decision trees in your random forest.
    $endgroup$
    – Cliff AB
    16 hours ago










  • $begingroup$
    To be clear, I don't think you should throw out your data...but I'd only advise doing "impossible value imputation" on variables you don't think will be very predictive to start with or you're fairly certain that the missingness distribution is fairly stable.
    $endgroup$
    – Cliff AB
    16 hours ago



















0












$begingroup$

I disagree. You can use "missing" as an additional set of information, just as anything else. Imputation, in this case, emphasizes the lack of information along some dimension in your sample. That could have additional explanatory power.



The issue is more that "missing" will become a dichotomic variable and mixing it with continous one can lead to inefficient state space segmentation in the way the regression trees are constructed. The key is to force the tree to only split dichotomically too, too increase efficiency. In other words, your split decision is already known beforehand.



It does not really matter whether the frequency of the missing values increase or decrease. This is more related to the problem being stationary or not and has nothing specifically to do with the decision tree.



Non-stationarity is handled via updating of decision trees or using an ensemble of decision trees.






share|cite|improve this answer









$endgroup$












    Your Answer





    StackExchange.ifUsing("editor", function ()
    return StackExchange.using("mathjaxEditing", function ()
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
    );
    );
    , "mathjax-editing");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "65"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader:
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    ,
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );













    draft saved

    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f397942%2fis-it-ever-recommended-to-use-mean-multiple-imputation-when-using-tree-based-pre%23new-answer', 'question_page');

    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    2












    $begingroup$

    One reason you may not want to use "insert impossible value" methods is that means that your predictive model works conditional on the distribution of the data missingness remaining unchanged. Thus, if after building your tree model, it is realized that we can start using certain features more often, we can no longer use the model that was built using the "impute impossible value" method without retraining the model.



    In fact, this problem is even further compounded if the rates of missingness changes during the data collection process itself. Then, even immediately after building the model, it is already "out of date", as the current rates of missingness will be different than the rates of missingness during when the data was collected.



    To illustrate the issue, let's suppose a bank is building a database to help predict if clients will default on a loan. Early in the data collection process, loan officers have the option to conduct a background investigation, but they almost never do for clients they deem as trustworthy. Thus, for the especially trustworthy customers, the background check variable is almost always missing. If you use the "impute impossible value" method, having a possible value for background checks indicates high risk.



    If background check rates don't change at all, then this "impute impossible value" method will likely still provide valid predictions. However, let's suppose the bank realizes that background checks are really helpful for assessing risk, so they change their policy to include background checks for everyone. Then, everyone will have a possible value for background checks and using the "impute impossible value" method, everyone will be flagged as "high risk".



    Cross validation will not catch this issue, as the missingness distribution will be the same between the training and testing sets. So even though the "impute impossible value" method may lead to pretty results during cross-validation, this will lead to poor predictions upon deployment!



    Note that you will essentially need to throw away all your data everytime your data collection policy changes! Alternatively, if you can correctly impute the missing values and their uncertainty, you can now use the data that was collected under the old policy.






    share|cite|improve this answer











    $endgroup$












    • $begingroup$
      That's a good point, imputation could be more robust on changes in the way data is missing. I will take your statement on throwing away past data as an exaggeration though - including a time variable and retraining the model should be enough make it usable again.
      $endgroup$
      – gsmafra
      16 hours ago










    • $begingroup$
      @gsmafra: In general, I don't think adding a time variable will fix the problem. For example, in a random forest, the time variable will only be included in 1/3 of the trees, so it won't even be included in the majority of the decision trees in your random forest.
      $endgroup$
      – Cliff AB
      16 hours ago










    • $begingroup$
      To be clear, I don't think you should throw out your data...but I'd only advise doing "impossible value imputation" on variables you don't think will be very predictive to start with or you're fairly certain that the missingness distribution is fairly stable.
      $endgroup$
      – Cliff AB
      16 hours ago
















    2












    $begingroup$

    One reason you may not want to use "insert impossible value" methods is that means that your predictive model works conditional on the distribution of the data missingness remaining unchanged. Thus, if after building your tree model, it is realized that we can start using certain features more often, we can no longer use the model that was built using the "impute impossible value" method without retraining the model.



    In fact, this problem is even further compounded if the rates of missingness changes during the data collection process itself. Then, even immediately after building the model, it is already "out of date", as the current rates of missingness will be different than the rates of missingness during when the data was collected.



    To illustrate the issue, let's suppose a bank is building a database to help predict if clients will default on a loan. Early in the data collection process, loan officers have the option to conduct a background investigation, but they almost never do for clients they deem as trustworthy. Thus, for the especially trustworthy customers, the background check variable is almost always missing. If you use the "impute impossible value" method, having a possible value for background checks indicates high risk.



    If background check rates don't change at all, then this "impute impossible value" method will likely still provide valid predictions. However, let's suppose the bank realizes that background checks are really helpful for assessing risk, so they change their policy to include background checks for everyone. Then, everyone will have a possible value for background checks and using the "impute impossible value" method, everyone will be flagged as "high risk".



    Cross validation will not catch this issue, as the missingness distribution will be the same between the training and testing sets. So even though the "impute impossible value" method may lead to pretty results during cross-validation, this will lead to poor predictions upon deployment!



    Note that you will essentially need to throw away all your data everytime your data collection policy changes! Alternatively, if you can correctly impute the missing values and their uncertainty, you can now use the data that was collected under the old policy.






    share|cite|improve this answer











    $endgroup$












    • $begingroup$
      That's a good point, imputation could be more robust on changes in the way data is missing. I will take your statement on throwing away past data as an exaggeration though - including a time variable and retraining the model should be enough make it usable again.
      $endgroup$
      – gsmafra
      16 hours ago










    • $begingroup$
      @gsmafra: In general, I don't think adding a time variable will fix the problem. For example, in a random forest, the time variable will only be included in 1/3 of the trees, so it won't even be included in the majority of the decision trees in your random forest.
      $endgroup$
      – Cliff AB
      16 hours ago










    • $begingroup$
      To be clear, I don't think you should throw out your data...but I'd only advise doing "impossible value imputation" on variables you don't think will be very predictive to start with or you're fairly certain that the missingness distribution is fairly stable.
      $endgroup$
      – Cliff AB
      16 hours ago














    2












    2








    2





    $begingroup$

    One reason you may not want to use "insert impossible value" methods is that means that your predictive model works conditional on the distribution of the data missingness remaining unchanged. Thus, if after building your tree model, it is realized that we can start using certain features more often, we can no longer use the model that was built using the "impute impossible value" method without retraining the model.



    In fact, this problem is even further compounded if the rates of missingness changes during the data collection process itself. Then, even immediately after building the model, it is already "out of date", as the current rates of missingness will be different than the rates of missingness during when the data was collected.



    To illustrate the issue, let's suppose a bank is building a database to help predict if clients will default on a loan. Early in the data collection process, loan officers have the option to conduct a background investigation, but they almost never do for clients they deem as trustworthy. Thus, for the especially trustworthy customers, the background check variable is almost always missing. If you use the "impute impossible value" method, having a possible value for background checks indicates high risk.



    If background check rates don't change at all, then this "impute impossible value" method will likely still provide valid predictions. However, let's suppose the bank realizes that background checks are really helpful for assessing risk, so they change their policy to include background checks for everyone. Then, everyone will have a possible value for background checks and using the "impute impossible value" method, everyone will be flagged as "high risk".



    Cross validation will not catch this issue, as the missingness distribution will be the same between the training and testing sets. So even though the "impute impossible value" method may lead to pretty results during cross-validation, this will lead to poor predictions upon deployment!



    Note that you will essentially need to throw away all your data everytime your data collection policy changes! Alternatively, if you can correctly impute the missing values and their uncertainty, you can now use the data that was collected under the old policy.






    share|cite|improve this answer











    $endgroup$



    One reason you may not want to use "insert impossible value" methods is that means that your predictive model works conditional on the distribution of the data missingness remaining unchanged. Thus, if after building your tree model, it is realized that we can start using certain features more often, we can no longer use the model that was built using the "impute impossible value" method without retraining the model.



    In fact, this problem is even further compounded if the rates of missingness changes during the data collection process itself. Then, even immediately after building the model, it is already "out of date", as the current rates of missingness will be different than the rates of missingness during when the data was collected.



    To illustrate the issue, let's suppose a bank is building a database to help predict if clients will default on a loan. Early in the data collection process, loan officers have the option to conduct a background investigation, but they almost never do for clients they deem as trustworthy. Thus, for the especially trustworthy customers, the background check variable is almost always missing. If you use the "impute impossible value" method, having a possible value for background checks indicates high risk.



    If background check rates don't change at all, then this "impute impossible value" method will likely still provide valid predictions. However, let's suppose the bank realizes that background checks are really helpful for assessing risk, so they change their policy to include background checks for everyone. Then, everyone will have a possible value for background checks and using the "impute impossible value" method, everyone will be flagged as "high risk".



    Cross validation will not catch this issue, as the missingness distribution will be the same between the training and testing sets. So even though the "impute impossible value" method may lead to pretty results during cross-validation, this will lead to poor predictions upon deployment!



    Note that you will essentially need to throw away all your data everytime your data collection policy changes! Alternatively, if you can correctly impute the missing values and their uncertainty, you can now use the data that was collected under the old policy.







    share|cite|improve this answer














    share|cite|improve this answer



    share|cite|improve this answer








    edited 17 hours ago

























    answered 17 hours ago









    Cliff ABCliff AB

    13.5k12567




    13.5k12567











    • $begingroup$
      That's a good point, imputation could be more robust on changes in the way data is missing. I will take your statement on throwing away past data as an exaggeration though - including a time variable and retraining the model should be enough make it usable again.
      $endgroup$
      – gsmafra
      16 hours ago










    • $begingroup$
      @gsmafra: In general, I don't think adding a time variable will fix the problem. For example, in a random forest, the time variable will only be included in 1/3 of the trees, so it won't even be included in the majority of the decision trees in your random forest.
      $endgroup$
      – Cliff AB
      16 hours ago










    • $begingroup$
      To be clear, I don't think you should throw out your data...but I'd only advise doing "impossible value imputation" on variables you don't think will be very predictive to start with or you're fairly certain that the missingness distribution is fairly stable.
      $endgroup$
      – Cliff AB
      16 hours ago

















    • $begingroup$
      That's a good point, imputation could be more robust on changes in the way data is missing. I will take your statement on throwing away past data as an exaggeration though - including a time variable and retraining the model should be enough make it usable again.
      $endgroup$
      – gsmafra
      16 hours ago










    • $begingroup$
      @gsmafra: In general, I don't think adding a time variable will fix the problem. For example, in a random forest, the time variable will only be included in 1/3 of the trees, so it won't even be included in the majority of the decision trees in your random forest.
      $endgroup$
      – Cliff AB
      16 hours ago










    • $begingroup$
      To be clear, I don't think you should throw out your data...but I'd only advise doing "impossible value imputation" on variables you don't think will be very predictive to start with or you're fairly certain that the missingness distribution is fairly stable.
      $endgroup$
      – Cliff AB
      16 hours ago
















    $begingroup$
    That's a good point, imputation could be more robust on changes in the way data is missing. I will take your statement on throwing away past data as an exaggeration though - including a time variable and retraining the model should be enough make it usable again.
    $endgroup$
    – gsmafra
    16 hours ago




    $begingroup$
    That's a good point, imputation could be more robust on changes in the way data is missing. I will take your statement on throwing away past data as an exaggeration though - including a time variable and retraining the model should be enough make it usable again.
    $endgroup$
    – gsmafra
    16 hours ago












    $begingroup$
    @gsmafra: In general, I don't think adding a time variable will fix the problem. For example, in a random forest, the time variable will only be included in 1/3 of the trees, so it won't even be included in the majority of the decision trees in your random forest.
    $endgroup$
    – Cliff AB
    16 hours ago




    $begingroup$
    @gsmafra: In general, I don't think adding a time variable will fix the problem. For example, in a random forest, the time variable will only be included in 1/3 of the trees, so it won't even be included in the majority of the decision trees in your random forest.
    $endgroup$
    – Cliff AB
    16 hours ago












    $begingroup$
    To be clear, I don't think you should throw out your data...but I'd only advise doing "impossible value imputation" on variables you don't think will be very predictive to start with or you're fairly certain that the missingness distribution is fairly stable.
    $endgroup$
    – Cliff AB
    16 hours ago





    $begingroup$
    To be clear, I don't think you should throw out your data...but I'd only advise doing "impossible value imputation" on variables you don't think will be very predictive to start with or you're fairly certain that the missingness distribution is fairly stable.
    $endgroup$
    – Cliff AB
    16 hours ago














    0












    $begingroup$

    I disagree. You can use "missing" as an additional set of information, just as anything else. Imputation, in this case, emphasizes the lack of information along some dimension in your sample. That could have additional explanatory power.



    The issue is more that "missing" will become a dichotomic variable and mixing it with continous one can lead to inefficient state space segmentation in the way the regression trees are constructed. The key is to force the tree to only split dichotomically too, too increase efficiency. In other words, your split decision is already known beforehand.



    It does not really matter whether the frequency of the missing values increase or decrease. This is more related to the problem being stationary or not and has nothing specifically to do with the decision tree.



    Non-stationarity is handled via updating of decision trees or using an ensemble of decision trees.






    share|cite|improve this answer









    $endgroup$

















      0












      $begingroup$

      I disagree. You can use "missing" as an additional set of information, just as anything else. Imputation, in this case, emphasizes the lack of information along some dimension in your sample. That could have additional explanatory power.



      The issue is more that "missing" will become a dichotomic variable and mixing it with continous one can lead to inefficient state space segmentation in the way the regression trees are constructed. The key is to force the tree to only split dichotomically too, too increase efficiency. In other words, your split decision is already known beforehand.



      It does not really matter whether the frequency of the missing values increase or decrease. This is more related to the problem being stationary or not and has nothing specifically to do with the decision tree.



      Non-stationarity is handled via updating of decision trees or using an ensemble of decision trees.






      share|cite|improve this answer









      $endgroup$















        0












        0








        0





        $begingroup$

        I disagree. You can use "missing" as an additional set of information, just as anything else. Imputation, in this case, emphasizes the lack of information along some dimension in your sample. That could have additional explanatory power.



        The issue is more that "missing" will become a dichotomic variable and mixing it with continous one can lead to inefficient state space segmentation in the way the regression trees are constructed. The key is to force the tree to only split dichotomically too, too increase efficiency. In other words, your split decision is already known beforehand.



        It does not really matter whether the frequency of the missing values increase or decrease. This is more related to the problem being stationary or not and has nothing specifically to do with the decision tree.



        Non-stationarity is handled via updating of decision trees or using an ensemble of decision trees.






        share|cite|improve this answer









        $endgroup$



        I disagree. You can use "missing" as an additional set of information, just as anything else. Imputation, in this case, emphasizes the lack of information along some dimension in your sample. That could have additional explanatory power.



        The issue is more that "missing" will become a dichotomic variable and mixing it with continous one can lead to inefficient state space segmentation in the way the regression trees are constructed. The key is to force the tree to only split dichotomically too, too increase efficiency. In other words, your split decision is already known beforehand.



        It does not really matter whether the frequency of the missing values increase or decrease. This is more related to the problem being stationary or not and has nothing specifically to do with the decision tree.



        Non-stationarity is handled via updating of decision trees or using an ensemble of decision trees.







        share|cite|improve this answer












        share|cite|improve this answer



        share|cite|improve this answer










        answered 12 hours ago









        Gkhan CebsGkhan Cebs

        1644




        1644



























            draft saved

            draft discarded
















































            Thanks for contributing an answer to Cross Validated!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid


            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.

            Use MathJax to format equations. MathJax reference.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f397942%2fis-it-ever-recommended-to-use-mean-multiple-imputation-when-using-tree-based-pre%23new-answer', 'question_page');

            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Благоевград Съдържание География | История | Население | Политика | Икономика и инфрастуктура | Здравеопазване | Образование и наука | Култура и забавления | Забележителности | Личности | Литература | Външни препратки | Бележки | Навигация42°01′18.99″ с. ш. 23°05′51″ и. д. / 42.021944° с. ш. 23.0975° и. д.*БлагоевградразширитередактиранеОфициален уебсайт на община БлагоевградНовинарски портал на Благоевград – blagoevgrad.euСайтове за БлагоевградНационален статистически институтdariknews.bgГригоровичъ, Викторъ. „Очеркъ путешествія по Европейской Турціи“. Москва, 1877.Стрезов, Георги. Два санджака от Източна Македония. Периодично списание на Българското книжовно дружество в Средец, кн. XXXVII и XXXVIII, 1891, стр. 18 – 19.Македония. Етнография и статистикаГаджанов, Димитър Г. Мюсюлманското население в Новоосвободените земи, в: Научна експедиция в Македония и Поморавието 1916, Военноиздателски комплекс „Св. Георги Победоносец“, Университетско издателство „Св. Климент Охридски“, София, 1993, стр. 244.паметник на незнайния четник&cd=18&hl=en&ct=clnk&client=firefox-a „История на днешен Благоевград“, взето от www.museumblg.com на 16 март 2010 г.„Справка за населението на град Благоевград, община Благоевград, област Благоевград, НСИ“„The population of all towns and villages in Blagoevgrad Province with 50 inhabitants or more according to census results and latest official estimates“„Ethnic composition, all places: 2011 census“История на Неврокопска епархия.Национален статистически институтМюсюлманско изповедание. Главно мюфтийствоНационален публичен регистър на храмовете в БългарияМюсюлманско изповедание. Главно мюфтийствоwww.dnes.bg Джамията в Благоевград не била паленаwww.sesc-bg.orgСписък на побратимени градовеТехническо побратимяванеГУМ грейва в цветовете на нощен Лас Вегас под името „Largo“, „МОЛ Благоевград“..., в. „Струма“grabo.bgwww.cinemaxbg.comррр4238731-067cad53a-0546-417b-a3d3-51e49b1d2232147736077147736077

            What is the best defense strategy for Survival in Grand Theft Auto Online?What is JP used for in Grand Theft Auto Online?How do I setup a Crew HQ in Grand Theft Auto Online?How does stealth work in Grand Theft Auto Online?Is it possible to own more than 10 cars in Grand Theft Auto online?Where to find truck/trailers in Grand Theft Auto OnlineWhat are some of the best missions to do on Grand Theft Auto 5 onlineFastest Car in Grand Theft Auto V PCHow to setup a Crew vs Crew online session in Grand Theft Auto Online?Grand theft auto 5 crossplayingRestart Grand Theft Auto V Online?

            How does Billy Russo acquire his 'Jigsaw' mask? Unicorn Meta Zoo #1: Why another podcast? Announcing the arrival of Valued Associate #679: Cesar Manara Favourite questions and answers from the 1st quarter of 2019Why does Bane wear the mask?Why does Kylo Ren wear a mask?Why did Captain America remove his mask while fighting Batroc the Leaper?How did the OA acquire her wisdom?Is Billy Breckenridge gay?How does Adrian Toomes hide his earnings from the IRS?What is the state of affairs on Nootka Sound by the end of season 1?How did Tia Dalma acquire Captain Barbossa's body?How is one “Deemed Worthy”, to acquire the Greatsword “Dawn”?How did Karen acquire the handgun?