MathGroup Archive 2007

Re: Efficient creation of regression design matrix

  • To: mathgroup at smc.vnet.net
  • Subject: [mg82305] Re: [mg82235] Efficient creation of regression design matrix
  • From: Darren Glosemeyer <darreng at wolfram.com>
  • Date: Wed, 17 Oct 2007 04:03:43 -0400 (EDT)
  • References: <200710160720.DAA08572@smc.vnet.net>

Coleman, Mark wrote:
> Hi,
>
> I'm searching for an efficient bit of code to create a design matrix of
> 1's and 0's computed from categorical (non-numeric) variables, suitable
> for use in regression problems. More precisely, imagine one has an n x 1
> vector of k different non-numeric values. For argument's sake, let
> k={Red,Blue,Green,Yellow}. I would like to create an n x k matrix
> consisting of 1's and 0's, where a '1' appears in the row and column
> location corresponding to the presence of an element of k. For example,
> say the original data is
>
> Red
> Blue
> Blue
> Yellow
> Red
> Green
>   

If you know the possible categories, you can use the following, which 
takes the categories as the second argument.

In[1]:= categoryDesign[xx_, vals_] :=
         xx /. Thread[Rule[vals, IdentityMatrix[Length[vals]]]]

In[2]:= categoryDesign[{Red,Blue,Blue,Yellow,Red,Green},{Red,Blue,Green,Yellow}]

Out[2]= {{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 1, 0, 0}, {0, 0, 0, 1},
         {1, 0, 0, 0}, {0, 0, 1, 0}}
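
To see why the replacement works, it may help to look at the rule list that Thread builds. The following is a small illustration, not part of the original reply; the lowercase symbols a, b and c are hypothetical stand-ins for category names and are assumed to have no values assigned.

```mathematica
(* Thread pairs each category with the matching row of the identity matrix. *)
rules = Thread[{a, b, c} -> IdentityMatrix[3]]
(* {a -> {1, 0, 0}, b -> {0, 1, 0}, c -> {0, 0, 1}} *)

(* ReplaceAll (/.) then substitutes each element of the data with its row. *)
{a, c, c, b} /. rules
(* {{1, 0, 0}, {0, 0, 1}, {0, 0, 1}, {0, 1, 0}} *)
```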


If the possible categories are not known, the following can be used.

In[3]:= categoryDesign[xx_] :=
         Block[{vals = Union[xx]},
          xx /. Thread[Rule[vals, IdentityMatrix[Length[vals]]]]]


Note that the categories using this definition are coded in Sort order 
because of the Union.

In[4]:= categoryDesign[{Red,Blue,Blue,Yellow,Red,Green}]

Out[4]= {{0, 0, 1, 0}, {1, 0, 0, 0}, {1, 0, 0, 0}, {0, 0, 0, 1},
         {0, 0, 1, 0}, {0, 1, 0, 0}}
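
If the columns should instead follow the order in which the categories first appear in the data, one possible variant replaces Union with DeleteDuplicates. This is only a sketch under two assumptions: the name categoryDesignOrdered is made up for illustration, and DeleteDuplicates is available only in Mathematica 7 and later.

```mathematica
(* Hypothetical variant: columns follow first-appearance order, not Sort order. *)
categoryDesignOrdered[xx_List] :=
  Module[{vals = DeleteDuplicates[xx]},  (* keeps the first occurrence of each value *)
    xx /. Thread[vals -> IdentityMatrix[Length[vals]]]]

categoryDesignOrdered[{Red, Blue, Blue, Yellow, Red, Green}]
(* Column order is Red, Blue, Yellow, Green:
   {{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0},
    {1, 0, 0, 0}, {0, 0, 0, 1}} *)
```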


In terms of efficiency, the first definition takes about a third of a 
second for a million values on my machine.

In[5]:= vals = RandomChoice[{Red, Blue, Green, Yellow}, 10^6];

In[6]:= categoryDesign[vals,{Red,Blue,Green,Yellow}];//Timing

Out[6]= {0.344, Null}

The second definition will be slower by the amount of time needed by Union.
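
One rough way to check this on your own machine is to time Union by itself on data of the same size. The variable name data below is illustrative, and no timing result is shown because it depends on the machine and the Mathematica version.

```mathematica
(* A million random category values, generated the same way as in In[5]. *)
data = RandomChoice[{Red, Blue, Green, Yellow}, 10^6];

(* Time taken by Union alone, in seconds; the inner semicolon discards the result. *)
First[Timing[Union[data];]]
```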


Darren Glosemeyer
Wolfram Research

