Re: Efficient creation of regression design matrix

• To: mathgroup at smc.vnet.net
• Subject: [mg82282] Re: Efficient creation of regression design matrix
• From: mcmcclur at unca.edu
• Date: Wed, 17 Oct 2007 03:51:55 -0400 (EDT)
• References: <ff1ovf$8g5$1@smc.vnet.net>

```On Oct 16, 3:24 am, "Coleman, Mark" <Mark.Cole... at LibertyMutual.com>
wrote:
> I'm searching for an efficient bit of code to create a
> design matrix of 1's and 0's computed from  categorical
> (non-numeric) variables, suitable for use in regression
> problems.

Suppose that your data is chosen from the integers 1 through
30.  For example,

SeedRandom[1];
data = RandomInteger[{1, 30}, {100000}];

Then you can set up your matrix, with one row per observation
and one column per category, via:

matrix = SparseArray[MapIndexed[{#2[[1]], #1} -> 1 &, data]]

This takes about half a second on my machine.  If your data
is categorical, you can first convert it to integer codes.
For example, the word "black" has 32 synonyms, according to
WordData.  We can take that collection of words to be the
categorical terms.

terms = Union[Flatten[Last /@ WordData["black", "Synonyms"]]];
terms // Length

Converting the data and applying the above scheme now
takes about a second and a half.

dataCategorical = RandomChoice[terms, {100000}];
data = Flatten[Position[terms, #] & /@ dataCategorical];
matrix = SparseArray[MapIndexed[{#2[[1]], #1} -> 1 &, data]];

Mark

```
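
The conversion step in the post looks up each observation with Position, which scans the list of terms once per data point. A possible variant, sketched below on a tiny stand-in data set, builds one Dispatch table of rules and applies it in a single pass. It also passes explicit dimensions to SparseArray, so the matrix keeps a column for every category even when the highest-numbered one never occurs in the data. The names toIndex and design and the three-word terms list are made up for illustration and are not from the original post; no timings were measured for this variant.

```mathematica
(* Stand-in categories and observations, for illustration only. *)
terms = {"black", "dark", "inky"};
dataCategorical = {"dark", "black", "inky", "dark"};

(* toIndex (hypothetical name): one rule per category, term -> column index. *)
toIndex = Dispatch[Thread[terms -> Range[Length[terms]]]];

(* Replace every observation by its integer code in one pass. *)
data = dataCategorical /. toIndex;

(* design (hypothetical name): entry {i, data[[i]]} is 1, all others 0.
   The explicit dimensions fix the number of columns at Length[terms]. *)
design = SparseArray[
   Transpose[{Range[Length[data]], data}] -> ConstantArray[1, Length[data]],
   {Length[data], Length[terms]}];

Normal[design]
(* {{0, 1, 0}, {1, 0, 0}, {0, 0, 1}, {0, 1, 0}} *)
```

For a regression that also includes an intercept, one of the indicator columns would normally be dropped to avoid collinearity; that choice is left out of the sketch.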
