Re: Efficient creation of regression design matrix
- To: mathgroup at smc.vnet.net
- Subject: [mg82282] Re: Efficient creation of regression design matrix
- From: mcmcclur at unca.edu
- Date: Wed, 17 Oct 2007 03:51:55 -0400 (EDT)
- References: <ff1ovf$8g5$1@smc.vnet.net>
On Oct 16, 3:24 am, "Coleman, Mark" <Mark.Cole... at LibertyMutual.com> wrote:
> I'm searching for an efficient bit of code to create a
> design matrix of 1's and 0's computed from categorical
> (non-numeric) variables, suitable for use in regression
> problems.

Suppose that your data is chosen from the integers 1 through 30. For example,

    SeedRandom[1];
    data = RandomInteger[{1, 30}, {100000}];

Then, you can set your matrix up via:

    matrix = SparseArray[MapIndexed[{#2[[1]], #1} -> 1 &, data]]

Takes about half a second on my machine.

If your data is categorical, you can convert it to numerical first. For example, the word "black" has 32 synonyms, according to WordData. We can take that collection of words to be the categorical terms.

    terms = Union[Flatten[Last /@ WordData["black", "Synonyms"]]];
    terms // Length

Converting the data and applying the above scheme now takes about a second and a half.

    dataCategorical = RandomChoice[terms, {100000}];
    data = Flatten[Position[terms, #] & /@ dataCategorical];
    matrix = SparseArray[MapIndexed[{#2[[1]], #1} -> 1 &, data]];

Mark
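
P.S. A possible variation on the scheme above, offered only as a sketch and not timed here: the category-to-index conversion can be done with a Dispatch table rather than one Position call per observation, and the SparseArray can be given its dimensions explicitly so that every category keeps a column even if it never occurs in the sample. The four-word terms list below is a hypothetical stand-in for the WordData synonyms, and the name rules is introduced just for this illustration.

```mathematica
(* Hypothetical stand-in categories; any list of distinct strings would do. *)
terms = {"ebony", "jet", "sable", "inky"};
SeedRandom[1];
dataCategorical = RandomChoice[terms, {100000}];

(* Map each category to its column index through a dispatch table,
   instead of calling Position once per observation. *)
rules = Dispatch[Thread[terms -> Range[Length[terms]]]];
data = dataCategorical /. rules;

(* Row i gets a single 1 in column data[[i]]. The explicit dimensions
   keep one column per category even if some category is never drawn. *)
matrix = SparseArray[
   Transpose[{Range[Length[data]], data}] -> ConstantArray[1, Length[data]],
   {Length[data], Length[terms]}];

Dimensions[matrix]  (* {100000, 4} *)
```

The replacement with /. is safe here because the categories are strings; if the categories were themselves integers, the rules could also rewrite values that are not meant to be replaced, so a different encoding step would be needed.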