Re: Efficient creation of regression design matrix
- To: mathgroup at smc.vnet.net
- Subject: [mg82320] Re: Efficient creation of regression design matrix
- From: Ray Koopman <koopman at sfu.ca>
- Date: Wed, 17 Oct 2007 04:11:32 -0400 (EDT)
- References: <ff1ovf$8g5$1@smc.vnet.net>
On Oct 16, 12:24 am, "Coleman, Mark" <Mark.Cole... at LibertyMutual.com> wrote:
> Hi,
>
> I'm searching for an efficient bit of code to create a design matrix of
> 1's and 0's computed from categorical (non-numeric) variables, suitable
> for use in regression problems. More precisely, imagine one has an n x 1
> vector of k different non-numeric values. For argument sakes, let
> k={Red,Blue,Green,Yellow}. I would like to create an n x k matrix
> consisting of 1's and 0's, where a '1' appears in the row and column
> location corresponding to the presence of an element of k. For example,
> say the original data is
>
> Red
> Blue
> Blue
> Yellow
> Red
> Green
> .
> .
> .
>
> Then the corresponding design matrix would be (assuming we use the same
> ordering of k):
>
> Original   Red   Blue   Green   Yellow
> ========   ===========================
> Red         1     0       0       0
> Blue        0     1       0       0
> Blue        0     1       0       0
> Yellow      0     0       0       1
> Red         1     0       0       0
> Green       0     0       1       0
>
> And so on. I have some code that does this, but as is the norm, I'm sure
> there are some great Mathematica one-liners that do a better job. In applied
> problems that I work with, n can be up to 100,000 and k = 30
>
> Thanks,
>
> -Mark

If v is a list of values of variables, such as
{red, blue, blue, yellow, red, green, ...}, and u is a list of the
possible values in v, such as {red, blue, green, yellow}, then
probably the simplest way to get what you asked for is

  x = Boole@Outer[SameQ, v, u]

A slightly more complicated, but much faster, way is

  x = v /. Thread[u -> IdentityMatrix@Length@u]
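
As a rough check, the two one-liners can be tried on the six-row example from the question. This is only a sketch: the sample lists are the ones quoted above, while the names x1, x2, uBig and vBig, the string category labels, and the sizes n = 100000 and k = 30 (taken from the question) are illustrative assumptions rather than anything measured in this thread.

```mathematica
(* Sample data from the question; u fixes the column order *)
u = {red, blue, green, yellow};
v = {red, blue, blue, yellow, red, green};

(* Method 1: compare every element of v with every element of u,
   then convert True/False to 1/0 *)
x1 = Boole@Outer[SameQ, v, u];

(* Method 2: replace each value by the matching row of an identity matrix *)
x2 = v /. Thread[u -> IdentityMatrix@Length@u];

x1 === x2   (* both give the same 6 x 4 matrix of 0's and 1's *)

(* Hypothetical larger test, sized like the question: 30 string labels
   and 100000 observations drawn at random *)
uBig = Table["c" <> ToString[i], {i, 30}];
vBig = RandomChoice[uBig, 100000];

First@AbsoluteTiming[Boole@Outer[SameQ, vBig, uBig];]
First@AbsoluteTiming[vBig /. Thread[uBig -> IdentityMatrix@Length@uBig];]
```

The two timings should show how the methods compare on a given machine and version. Both build the full n x k matrix in memory, so for much larger n or k a sparse representation may be worth considering.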