MathGroup Archive 2007

Re: Efficient creation of regression design matrix

  • To: mathgroup at smc.vnet.net
  • Subject: [mg82320] Re: Efficient creation of regression design matrix
  • From: Ray Koopman <koopman at sfu.ca>
  • Date: Wed, 17 Oct 2007 04:11:32 -0400 (EDT)
  • References: <ff1ovf$8g5$1@smc.vnet.net>

On Oct 16, 12:24 am, "Coleman, Mark" <Mark.Cole... at LibertyMutual.com>
wrote:
> Hi,
>
> I'm searching for an efficient bit of code to create a design matrix of
> 1's and 0's computed from categorical (non-numeric) variables, suitable
> for use in regression problems. More precisely, imagine one has an n x 1
> vector of k different non-numeric values. For argument's sake, let
> k={Red,Blue,Green,Yellow}. I would like to create an n x k matrix
> consisting of 1's and 0's, where a '1' appears in the row and column
> location corresponding to the presence of an element of k. For example,
> say the original data is
>
> Red
> Blue
> Blue
> Yellow
> Red
> Green
> .
> .
> .
>
> Then the corresponding design matrix would be (assuming we use the same
> ordering of k):
>
> Original                Red     Blue    Green   Yellow
> ======                  ==============================
> Red                     1       0       0       0
> Blue                    0       1       0       0
> Blue                    0       1       0       0
> Yellow                  0       0       0       1
> Red                     1       0       0       0
> Green                   0       0       1       0
>
> And so on. I have some code that does this, but as is the norm, I'm sure
> there are some great Mathematica one-liners that do a better job. In applied
> problems that I work with, n can be up to 100,000 and k = 30.
>
> Thanks,
>
> -Mark

If v is the list of observed values, such as {red, blue, blue,
yellow, red, green, ...}, and u is the list of distinct values that can
occur in v, such as {red, blue, green, yellow}, given in the desired
column order, then probably the simplest way to get what you asked for is

x = Boole@Outer[SameQ, v, u]

A slightly more complicated, but much faster, way is to replace each
value by the corresponding row of the k x k identity matrix:

x = v /. Thread[u -> IdentityMatrix@Length@u]
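
As a quick check, here is a minimal sketch that runs both expressions on
the six sample values from the question and confirms that they give the
same matrix. The names x1, x2, uBig and vBig, the string labels "c1",
"c2", ..., and the random test data are assumptions made up for
illustration; only the sizes n = 100,000 and k = 30 come from the
question. Timings depend on machine and version, so none are quoted.

```mathematica
(* Sample data from the question; u fixes the column order. *)
v = {red, blue, blue, yellow, red, green};
u = {red, blue, green, yellow};

(* Method 1: compare every observation with every category. *)
x1 = Boole@Outer[SameQ, v, u];

(* Method 2: replace each category by the matching row of the
   identity matrix. *)
x2 = v /. Thread[u -> IdentityMatrix@Length@u];

(* Both should be the same 6 x 4 matrix of 0's and 1's. *)
x1 === x2
x2 // MatrixForm

(* Hypothetical larger test at the sizes mentioned in the question. *)
n = 100000; k = 30;
uBig = Table["c" <> ToString[i], {i, k}];   (* made-up category labels *)
vBig = RandomChoice[uBig, n];               (* made-up observations *)

(* Seconds taken by each method; compare on your own system. *)
First@AbsoluteTiming[Boole@Outer[SameQ, vBig, uBig];]
First@AbsoluteTiming[vBig /. Thread[uBig -> IdentityMatrix@Length@uBig];]
```

If u is not known in advance, Union[v] returns the distinct values of v
in sorted order, which then determines the column order.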


