Re: Efficient creation of regression design matrix
- To: mathgroup at smc.vnet.net
- Subject: [mg82305] Re: [mg82235] Efficient creation of regression design matrix
- From: Darren Glosemeyer <darreng at wolfram.com>
- Date: Wed, 17 Oct 2007 04:03:43 -0400 (EDT)
- References: <200710160720.DAA08572@smc.vnet.net>
Coleman, Mark wrote:
> Hi,
>
> I'm searching for an efficient bit of code to create a design matrix of
> 1's and 0's computed from categorical (non-numeric) variables, suitable
> for use in regression problems. More precisely, imagine one has an n x 1
> vector of k different non-numeric values. For argument sakes, let
> k={Red,Blue,Green,Yellow}. I would like to create an n x k matrix
> consisting of 1's and 0's, where a '1' appears in the row and column
> location corresponding to the presence of an element of k. For example,
> say the original data is
>
> Red
> Blue
> Blue
> Yellow
> Red
> Green
>

If you know the possible categories, you can use the following, which
takes the categories as the second argument.

In[1]:= categoryDesign[xx_, vals_] :=
         xx /. Thread[Rule[vals, IdentityMatrix[Length[vals]]]]

In[2]:= categoryDesign[{Red,Blue,Blue,Yellow,Red,Green},{Red,Blue,Green,Yellow}]

Out[2]= {{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 1, 0, 0}, {0, 0, 0, 1}, {1, 0, 0, 0}, {0, 0, 1, 0}}

If the possible categories are not known, the following can be used.

In[3]:= categoryDesign[xx_] := Block[{vals = Union[xx]},
         xx /. Thread[Rule[vals, IdentityMatrix[Length[vals]]]]]

Note that the categories using this definition are coded in Sort order
because of the Union.

In[4]:= categoryDesign[{Red,Blue,Blue,Yellow,Red,Green}]

Out[4]= {{0, 0, 1, 0}, {1, 0, 0, 0}, {1, 0, 0, 0}, {0, 0, 0, 1}, {0, 0, 1, 0}, {0, 1, 0, 0}}

In terms of efficiency, the first definition takes about a third of a
second for a million values on my machine.

In[5]:= vals = RandomChoice[{Red, Blue, Green, Yellow}, 10^6];

In[6]:= categoryDesign[vals,{Red,Blue,Green,Yellow}];//Timing

Out[6]= {0.344, Null}

The second definition will be slower by the amount of time needed by
Union.

Darren Glosemeyer
Wolfram Research
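Because Union sorts the categories, the columns of the one-argument form do not follow the order in which the values appear in the data. If first-appearance order is wanted, one possible variant is sketched below. The name categoryDesignOrdered is hypothetical, and the sketch assumes DeleteDuplicates is available (Mathematica 7 or later), so treat it as an illustration rather than a tested replacement.

```mathematica
(* Hypothetical variant of the one-argument definition: the categories are
   coded in order of first appearance instead of sorted order.
   Assumes DeleteDuplicates exists (Mathematica 7 or later). *)
categoryDesignOrdered[xx_List] :=
  Module[{vals = DeleteDuplicates[xx]},
    (* Map each distinct category to one row of the identity matrix *)
    xx /. Thread[Rule[vals, IdentityMatrix[Length[vals]]]]]

(* Columns correspond to Red, Blue, Yellow, Green, in that order *)
categoryDesignOrdered[{Red, Blue, Blue, Yellow, Red, Green}]
(* {{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {1, 0, 0, 0}, {0, 0, 0, 1}} *)
```

The replacement step is the same as in the definitions above. Only the way the category list is built changes, so the extra cost over the two-argument form should be roughly that of the DeleteDuplicates call.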
- References:
- Efficient creation of regression design matrix
- From: "Coleman, Mark" <Mark.Coleman@LibertyMutual.com>