Re: Efficient creation of regression design matrix
- To: mathgroup at smc.vnet.net
- Subject: [mg82282] Re: Efficient creation of regression design matrix
- From: mcmcclur at unca.edu
- Date: Wed, 17 Oct 2007 03:51:55 -0400 (EDT)
- References: <ff1ovf$8g5$1@smc.vnet.net>
On Oct 16, 3:24 am, "Coleman, Mark" <Mark.Cole... at LibertyMutual.com> wrote:
> I'm searching for an efficient bit of code to create a
> design matrix of 1's and 0's computed from categorical
> (non-numeric) variables, suitable for use in regression
> problems.
Suppose that your data is chosen from the integers 1 through
30. For example,
SeedRandom[1];
data = RandomInteger[{1, 30}, {100000}];
Then, you can set your matrix up via:
matrix = SparseArray[MapIndexed[{#2[[1]], #1} -> 1 &, data]]
This takes about half a second on my machine. If your data is
categorical, you can convert it to numerical first. For
example, the word "black" has 32 synonyms, according to
WordData. We can take that collection of words to be the
categorical terms.
terms = Union[Flatten[Last /@ WordData["black", "Synonyms"]]];
terms // Length
Converting the data and applying the above scheme now
takes about a second and a half.
dataCategorical = RandomChoice[terms, {100000}];
data = Flatten[Position[terms, #] & /@ dataCategorical];
matrix = SparseArray[MapIndexed[{#2[[1]], #1} -> 1 &, data]];
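A possible variation on the conversion above, offered only as a sketch: the Position call scans the whole list of terms once per data point, so a Dispatch lookup table might be quicker for long data (I have not timed it). Passing explicit dimensions to SparseArray is my own addition; it keeps one column per category even if some category never shows up in the sample. The names index and rows are just illustrative.

```mathematica
(* Lookup table: each term -> its position in terms, i.e. its column index *)
index = Dispatch[Thread[terms -> Range[Length[terms]]]];
data = dataCategorical /. index;

(* One nonzero entry per row; explicit dimensions (an assumption, not in
   the code above) so that unused categories still get a column of zeros *)
rows = Range[Length[data]];
matrix = SparseArray[
   Transpose[{rows, data}] -> ConstantArray[1, Length[data]],
   {Length[data], Length[terms]}];
```

With these definitions, Dimensions[matrix] should come out as {100000, Length[terms]}, and Total[matrix] gives the number of observations in each category.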
Mark