Re: find and count partially identical sublist

*To*: mathgroup at smc.vnet.net*Subject*: [mg85791] Re: find and count partially identical sublist*From*: Szabolcs Horvát <szhorvat at gmail.com>*Date*: Fri, 22 Feb 2008 05:05:46 -0500 (EST)*Organization*: University of Bergen*References*: <fphd98$8mq$1@smc.vnet.net>

markus.roellig at googlemail.com wrote: > Hello group, > > I am trying to find and count sublists that are partially identical to > each other and then modify parts of this sublist with the > multiplicity. It's easier to understand if I give an example. > > Say I have an array (strings and numbers mixed) like: > > {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1}, {"B", "A", 4, > 1}, {"A", "B", 1, 1}, {"B", "A", 5, 1}, {"B", "A", 2, 1}, {"A", "B", > 10, 1}} > > I need to find successive sublists which have the same first two > elements (here {3,4} and {7,6}). Depending on > how many repetitions occur I want to divide the 4th element of each > sublist by the number of repetitions. In the example the result would > be: > > {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1/2}, {"B", "A", 4, > 1/2}, {"A", "B", 1, 1}, {"B", "A", 5, 1/2}, {"B", "A", 2, 1/ > 2}, {"A", "B", 10, 1}} > > The code I came up with is: > > > tst = Table[{RandomChoice[{"A", "B"}], RandomChoice[{"A", "B"}], > RandomInteger[{0, 10}], 1}, {i, 1, 30}]; > tstSplt = Split[tst, #1[[1 ;; 2]] === #2[[1 ;; 2]] &] // MatrixForm > tab = Table[tstSplt[[1, i]] // Length, {i, 1, Length[tstSplt[[1]]]}] > rpl = MapThread[#1[[All, 4]]/#2 &, {tstSplt[[1, All]], tab}] // > Flatten > tst[[All, 4]] = tst[[#, 4]] & @@@ rpl; > tst > > > This works, but I am somewhat concerned with run speed (my actual > array is much larger, roughly 50000x20). And I have the feeling that I > am wasting too much memory. > > > One additional comment: The above code only finds successive > duplicates. How would I have to modify it to find all occurences ? I don't have time to benchmark this, so I cannot make any guarantees about performance ... but I suspect that this will be useful: I'll take the case when all duplicates are taken into account (not only successive ones). Extract the relevant part of the data, i.e. the first 2 elements, and the last element of the sublists: dat1 = data[[All, {1,2}]] dat2 = data[[All, -1]] It may (or may not) help performance to work with simple integers instead of strings. Can you transform the strings into integers? If yes, integers can be stored in a packed array ... but I'm not sure that this will help a lot. Anyway, let's generate something that looks like dat1: dat1 = Table[RandomChoice[{"A", "B"}, 2], {50000}]; And replace all elements with their multiplicities: mult = dat1 /. Dispatch[Rule @@@ Tally[dat1]]; // Timing {0.219, Null} Now you only need to divide dat2 by mult (dat2/mult) and assemble the lists again: Tranpose[Append[Transpose[dat1], dat2/mult]] I hope this helps, Szabolcs P.S. Do you actually need only successive duplicates or all duplicates?