Re: find and count partially identical sublist

*To*: mathgroup at smc.vnet.net
*Subject*: [mg85751] Re: find and count partially identical sublist
*From*: Albert Retey <awnl at arcor.net>
*Date*: Thu, 21 Feb 2008 17:55:47 -0500 (EST)
*References*: <fphd98$8mq$1@smc.vnet.net>

Hi Markus,

> I am trying to find and count sublists that are partially identical to
> each other and then modify parts of these sublists with the
> multiplicity. It's easier to understand if I give an example.
>
> Say I have an array (strings and numbers mixed) like:
>
> {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1}, {"B", "A", 4, 1},
>  {"A", "B", 1, 1}, {"B", "A", 5, 1}, {"B", "A", 2, 1}, {"A", "B", 10, 1}}
>
> I need to find successive sublists which have the same first two
> elements (here at positions {3, 4} and {6, 7}). Depending on how many
> repetitions occur I want to divide the 4th element of each sublist by
> the number of repetitions. In the example the result would be:
>
> {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1/2}, {"B", "A", 4, 1/2},
>  {"A", "B", 1, 1}, {"B", "A", 5, 1/2}, {"B", "A", 2, 1/2}, {"A", "B", 10, 1}}
>
> One additional comment: The above code only finds successive
> duplicates. How would I have to modify it to find all occurrences?

I hope I understood correctly what you want to achieve. Here is my
version, which finds only successive duplicates. I did not analyze run
times and memory consumption, but this could certainly be done more
efficiently:

    Flatten[
      Map[
        Replace[#, {a__, x_} :> {a, x/Length[#]}, {1}] &,
        Split[data, #1[[{1, 2}]] == #2[[{1, 2}]] &],
        {1}
      ],
      1
    ]

If you really need to find all occurrences, I would use a completely
different approach:

    ClearAll[$numocc]
    $numocc[__] = 0;
    Scan[
      ($numocc[#[[1]], #[[2]]] = $numocc[#[[1]], #[[2]]] + 1) &,
      data
    ];
    Apply[{#1, #2, #3, #4/$numocc[#1, #2]} &, data, {1}]

Here I abuse the downvalues of the helper symbol $numocc as a
"dictionary" to count occurrences of the first two arguments. To
understand what happens, you might want to look at DownValues[$numocc].
Again I did not analyze run times and memory consumption, but in this
case I believe they shouldn't be too bad: after all, you are only
scanning the data twice, once to count the occurrences and once to
divide the fourth argument.
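In case it helps to cross-check the two approaches outside Mathematica,
here is a rough Python sketch of the same logic. The function names
divide_successive and divide_all are my own invention; itertools.groupby
plays the role of Split (grouping successive rows with equal first two
elements), and collections.Counter stands in for the $numocc downvalues:

```python
from collections import Counter
from fractions import Fraction
from itertools import groupby

def divide_successive(rows):
    """Divide the 4th field of each row by the length of its run of
    successive rows sharing the same first two fields (like Split)."""
    out = []
    for _, group in groupby(rows, key=lambda r: r[:2]):
        run = list(group)
        out += [r[:3] + [Fraction(r[3], len(run))] for r in run]
    return out

def divide_all(rows):
    """Divide the 4th field by the total number of rows (anywhere in the
    list) sharing the same first two fields (like the $numocc counter)."""
    counts = Counter(tuple(r[:2]) for r in rows)
    return [r[:3] + [Fraction(r[3], counts[tuple(r[:2])])] for r in rows]

data = [["B", "A", 0, 1], ["A", "B", 6, 1], ["B", "A", 4, 1],
        ["B", "A", 4, 1], ["A", "B", 1, 1], ["B", "A", 5, 1],
        ["B", "A", 2, 1], ["A", "B", 10, 1]]

print(divide_successive(data))
print(divide_all(data))
```

For the example data, divide_successive halves the 4th element in the
two runs of length two, while divide_all divides every {"B", "A", ...}
row by 5 and every {"A", "B", ...} row by 3.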
Memory-wise you are making one copy of your data. If really necessary,
it would with some effort be possible to avoid that copy, but most
probably that is not needed.

hth,

albert