Re: find and count partially identical sublist

*To*: mathgroup at smc.vnet.net*Subject*: [mg85921] Re: find and count partially identical sublist*From*: markus.roellig at googlemail.com*Date*: Thu, 28 Feb 2008 02:40:45 -0500 (EST)*References*: <fphd98$8mq$1@smc.vnet.net> <fpm79i$2u7$1@smc.vnet.net>

On 22 Feb., 11:14, Szabolcs Horv=E1t <szhor... at gmail.com> wrote: > markus.roel... at googlemail.com wrote: > > Hello group, > > > I am trying to find and count sublists that are partially identical to > > each other and then modify parts of this sublist with the > > multiplicity. It's easier to understand if I give an example. > > > Say I have an array (strings and numbers mixed) like: > > > {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1}, {"B", "A", 4, > > =A0 1}, {"A", "B", 1, 1}, {"B", "A", 5, 1}, {"B", "A", 2, 1}, {"A", "B",= > > 10, 1}} > > > I need to find successive sublists which have the same first two > > elements (here {3,4} and {7,6}). Depending on > > how many repetitions occur I want to divide the 4th element of each > > sublist by the number of repetitions. In the example the result would > > be: > > > {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1/2}, {"B", "A", 4, > > =A0 =A01/2}, {"A", "B", 1, 1}, {"B", "A", 5, 1/2}, {"B", "A", 2, 1/ > > =A0 2}, {"A", "B", 10, 1}} > > > The code I came up with is: > > > tst = Table[{RandomChoice[{"A", "B"}], RandomChoice[{"A", "B"}], > > =A0 =A0 RandomInteger[{0, 10}], 1}, {i, 1, 30}]; > > tstSplt = Split[tst, #1[[1 ;; 2]] === #2[[1 ;; 2]] &] // MatrixF= orm > > tab = Table[tstSplt[[1, i]] // Length, {i, 1, Length[tstSplt[[1]]]}] > > rpl = MapThread[#1[[All, 4]]/#2 &, {tstSplt[[1, All]], tab}] // > > =A0 Flatten > > tst[[All, 4]] = tst[[#, 4]] & @@@ rpl; > > tst > > > This works, but I am somewhat concerned with run speed (my actual > > array is much larger, roughly 50000x20). And I have the feeling that I > > am wasting too much memory. > > > One additional comment: The above code only finds successive > > duplicates. How would I have to modify it to find all occurences ? > > I don't have time to benchmark this, so I cannot make any guarantees > about performance ... but I suspect that this will be useful: > > I'll take the case when all duplicates are taken into account (not only > successive ones). > > Extract the relevant part of the data, i.e. the first 2 elements, and > the last element of the sublists: > > dat1 = data[[All, {1,2}]] > dat2 = data[[All, -1]] > > It may (or may not) help performance to work with simple integers > instead of strings. =A0Can you transform the strings into integers? =A0If > yes, integers can be stored in a packed array ... but I'm not sure that > this will help a lot. > > Anyway, let's generate something that looks like dat1: > > dat1 = Table[RandomChoice[{"A", "B"}, 2], {50000}]; > > And replace all elements with their multiplicities: > > mult = dat1 /. Dispatch[Rule @@@ Tally[dat1]]; // Timing > > {0.219, Null} > > Now you only need to divide dat2 by mult (dat2/mult) and assemble the > lists again: > > Tranpose[Append[Transpose[dat1], dat2/mult]] > > I hope this helps, > Szabolcs > > P.S. =A0Do you actually need only successive duplicates or all duplicates?= - Zitierten Text ausblenden - > > - Zitierten Text anzeigen - Thank you all for you input. > P.S. Do you actually need only successive duplicates or all duplicates? Actually, what would be the way to search for all (not only successive) duplicates if I don't want to use Sort? Markus