Re: Efficient creation of regression design matrix
- To: mathgroup at smc.vnet.net
- Subject: [mg82305] Re: [mg82235] Efficient creation of regression design matrix
- From: Darren Glosemeyer <darreng at wolfram.com>
- Date: Wed, 17 Oct 2007 04:03:43 -0400 (EDT)
- References: <200710160720.DAA08572@smc.vnet.net>
Coleman, Mark wrote:
> Hi,
>
> I'm searching for an efficient bit of code to create a design matrix of
> 1's and 0's computed from categorical (non-numeric) variables, suitable
> for use in regression problems. More precisely, imagine one has an n x 1
> vector of k different non-numeric values. For argument's sake, let
> k={Red,Blue,Green,Yellow}. I would like to create an n x k matrix
> consisting of 1's and 0's, where a '1' appears in the row and column
> location corresponding to the presence of an element of k. For example,
> say the original data is
>
> Red
> Blue
> Blue
> Yellow
> Red
> Green
>
If you know the possible categories, you can use the following, which
takes the categories as the second argument.
In[1]:= categoryDesign[xx_, vals_] :=
          xx /. Thread[Rule[vals, IdentityMatrix[Length[vals]]]]
In[2]:= categoryDesign[{Red,Blue,Blue,Yellow,Red,Green},{Red,Blue,Green,Yellow}]
Out[2]= {{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 1, 0, 0}, {0, 0, 0, 1},
         {1, 0, 0, 0}, {0, 0, 1, 0}}
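To see why the replacement works, it can help to look at the rule list that Thread builds. The snippet below is only an illustration using the four example categories; it is not part of the definition above.

```mathematica
(* Thread pairs each category with the matching row of the identity matrix, *)
(* so every category symbol is replaced by its own indicator row.           *)
Thread[Rule[{Red, Blue, Green, Yellow}, IdentityMatrix[4]]]
(* {Red -> {1, 0, 0, 0}, Blue -> {0, 1, 0, 0},
    Green -> {0, 0, 1, 0}, Yellow -> {0, 0, 0, 1}} *)
```

The column position of each 1 therefore follows the order of the categories in the second argument.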
If the possible categories are not known, the following can be used.
In[3]:= categoryDesign[xx_] :=
          Block[{vals = Union[xx]},
            xx /. Thread[Rule[vals, IdentityMatrix[Length[vals]]]]]
Note that with this definition the columns follow the sorted order of the
categories (here Blue, Green, Red, Yellow), because Union sorts its result.
In[4]:= categoryDesign[{Red,Blue,Blue,Yellow,Red,Green}]
Out[4]= {{0, 0, 1, 0}, {1, 0, 0, 0}, {1, 0, 0, 0}, {0, 0, 0, 1},
         {0, 0, 1, 0}, {0, 1, 0, 0}}
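If the columns should instead follow the order in which the categories first appear in the data, one possible variant swaps Union for DeleteDuplicates. This is a sketch only: it assumes a Mathematica version that provides DeleteDuplicates (later than the one used for the timings in this post), and the name categoryDesignByAppearance is made up for illustration.

```mathematica
(* Hypothetical variant: columns in order of first appearance, not sorted. *)
(* Assumes DeleteDuplicates is available in the version being used.        *)
categoryDesignByAppearance[xx_] :=
  Module[{cats = DeleteDuplicates[xx]},
    xx /. Thread[Rule[cats, IdentityMatrix[Length[cats]]]]]

categoryDesignByAppearance[{Red, Blue, Blue, Yellow, Red, Green}]
(* {{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0},
    {1, 0, 0, 0}, {0, 0, 0, 1}} *)
```

Here the columns correspond to Red, Blue, Yellow, Green, the order in which the values first occur.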
In terms of efficiency, the first definition takes about a third of a
second for a million values on my machine.
In[5]:= vals = RandomChoice[{Red, Blue, Green, Yellow}, 10^6];
In[6]:= categoryDesign[vals,{Red,Blue,Green,Yellow}];//Timing
Out[6]= {0.344, Null}
The second definition will be slower by the amount of time needed by Union.
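A rough way to check this on a given machine, assuming vals from In[5] is still defined, is to time Union on its own alongside the one-argument definition. No timings are shown here because they will vary by machine and version.

```mathematica
(* Time spent finding the distinct categories alone *)
Timing[Union[vals];]

(* Time for the one-argument definition, which calls Union internally *)
Timing[categoryDesign[vals];]
```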
Darren Glosemeyer
Wolfram Research
- References:
- Efficient creation of regression design matrix
- From: "Coleman, Mark" <Mark.Coleman@LibertyMutual.com>