I'm performing a linear regression against a large data set of ~6400 elements t

ID: 3325748 • Letter: I

Question

I'm performing a linear regression against a large data set of ~6400 elements. Each element has a minimum age and a maximum age, and these ranges often overlap. For example, one record might have an age range of 5-18, another 12-50, another 30-65, and another 3-100.

My best guess was to recode the variables as a series of dummy variables, each representing whether some fixed age range overlaps with the record's age range. For example:

| has_toddlers | has_youths | has_teens | has_youngadults | has_adults | has_seniors |
|--------------|------------|-----------|-----------------|------------|-------------|
| 0 | 1 | 1 | 1 | 0 | 0 |
| 0 | 1 | 1 | 1 | 1 | 0 |
| 0 | 0 | 0 | 0 | 1 | 1 |
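
The recoding described above can be sketched roughly as follows. The bin boundaries in `AGE_BINS` and the helper name `overlap_dummies` are assumptions made for illustration; the question does not state the actual cut-offs. These particular boundaries were picked only because they happen to reproduce the three table rows from the example ranges 5-18, 12-50, and 30-65.

```python
# Hypothetical inclusive bin boundaries; the real cut-offs are not given in the question.
AGE_BINS = {
    "has_toddlers": (0, 4),
    "has_youths": (5, 12),
    "has_teens": (13, 17),
    "has_youngadults": (18, 29),
    "has_adults": (30, 64),
    "has_seniors": (65, 120),
}


def overlap_dummies(min_age, max_age, bins=AGE_BINS):
    """Return 1 for each bin whose inclusive range intersects [min_age, max_age]."""
    if min_age > max_age:
        raise ValueError("min_age must not exceed max_age")
    # Two closed intervals overlap when each one starts before the other ends.
    return {
        name: int(min_age <= hi and max_age >= lo)
        for name, (lo, hi) in bins.items()
    }


# The example age ranges mentioned in the question.
records = [(5, 18), (12, 50), (30, 65), (3, 100)]
for lo, hi in records:
    print(f"{lo}-{hi}:", list(overlap_dummies(lo, hi).values()))
```

Under these assumed boundaries, the 3-100 record sets every dummy to 1.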

While this method appears to work fine, it seems fairly icky, particularly since the baseline is impossible (all people have one of those ages) and since all non-contiguous combinations of inputs are impossible.
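
To put a number on that concern, here is a small sketch that counts how many of the possible 0/1 patterns can actually occur. It assumes the six bins are adjacent and that each record's ages form a single unbroken interval, so a feasible row is one non-empty contiguous run of 1s; the helper name `is_contiguous_run` is made up for this illustration.

```python
from itertools import product


def is_contiguous_run(bits):
    """True when the 1s form a single unbroken, non-empty block."""
    ones = [i for i, b in enumerate(bits) if b]
    return bool(ones) and ones[-1] - ones[0] + 1 == len(ones)


n_bins = 6  # number of dummy columns in the table above
patterns = list(product((0, 1), repeat=n_bins))
feasible = [p for p in patterns if is_contiguous_run(p)]

print(len(feasible), "of", len(patterns), "patterns are feasible")
# prints "21 of 64 patterns are feasible"
```

The all-zero row (the regression baseline) and every row with a gap between 1s fall among the 43 patterns that can never appear in the data.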

Any recommendations? Is this method fine as long as I don't use it to compute impossible values? Is it better if I merge these 0s and 1s into a single categorical string?

Explanation / Answer
