Re: find and count partially identical sublist
- To: mathgroup at smc.vnet.net
- Subject: [mg85751] Re: find and count partially identical sublist
- From: Albert Retey <awnl at arcor.net>
- Date: Thu, 21 Feb 2008 17:55:47 -0500 (EST)
- References: <fphd98$8mq$1@smc.vnet.net>
Hi Markus,
> I am trying to find and count sublists that are partially identical to
> each other and then modify parts of this sublist with the
> multiplicity. It's easier to understand if I give an example.
>
> Say I have an array (strings and numbers mixed) like:
>
> {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1}, {"B", "A", 4,
> 1}, {"A", "B", 1, 1}, {"B", "A", 5, 1}, {"B", "A", 2, 1}, {"A", "B",
> 10, 1}}
>
> I need to find successive sublists which have the same first two
> elements (here at positions {3, 4} and {6, 7}). Depending on
> how many repetitions occur I want to divide the 4th element of each
> sublist by the number of repetitions. In the example the result would
> be:
>
> {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1/2}, {"B", "A", 4,
> 1/2}, {"A", "B", 1, 1}, {"B", "A", 5, 1/2}, {"B", "A", 2, 1/
> 2}, {"A", "B", 10, 1}}
>
> One additional comment: The above code only finds successive
> duplicates. How would I have to modify it to find all occurrences?
I hope I understood correctly what you want to achieve. Here is my
version, which finds only successive duplicates. I did not analyze run
times or memory consumption, but this could certainly be done more
efficiently:
Flatten[
  Map[
    (* divide the last element of each sublist in a run by the run length *)
    Replace[#, {a__, x_} :> {a, x/Length[#]}, {1}] &,
    (* group successive sublists whose first two elements agree *)
    Split[data, #1[[{1, 2}]] == #2[[{1, 2}]] &],
    {1}
  ],
  1
]
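For reference, assuming the example data from your post is stored in
data:

data = {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1},
   {"B", "A", 4, 1}, {"A", "B", 1, 1}, {"B", "A", 5, 1},
   {"B", "A", 2, 1}, {"A", "B", 10, 1}};

the expression above should then return

{{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1/2},
 {"B", "A", 4, 1/2}, {"A", "B", 1, 1}, {"B", "A", 5, 1/2},
 {"B", "A", 2, 1/2}, {"A", "B", 10, 1}}

which matches the result you gave.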
If you really need to find all occurrences, I would use a completely
different approach:
ClearAll[$numocc]
$numocc[__] = 0;  (* default: any combination not seen yet counts as 0 *)
(* count how often each combination of the first two elements occurs *)
Scan[
  ($numocc[#[[1]], #[[2]]] = $numocc[#[[1]], #[[2]]] + 1) &,
  data
];
(* divide the fourth element of each sublist by that count *)
Apply[{#1, #2, #3, #4/$numocc[#1, #2]} &, data, {1}]
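Note that this version divides by the total number of occurrences, not
by the length of a successive run: in your example {"B", "A", ...}
occurs five times and {"A", "B", ...} three times, so (if I am not
mistaken) the result would be

{{"B", "A", 0, 1/5}, {"A", "B", 6, 1/3}, {"B", "A", 4, 1/5},
 {"B", "A", 4, 1/5}, {"A", "B", 1, 1/3}, {"B", "A", 5, 1/5},
 {"B", "A", 2, 1/5}, {"A", "B", 10, 1/3}}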
Here I abuse the downvalues of the helper symbol $numocc as a
"dictionary" to count occurrences of the first two arguments. To
understand what happens you might want to look at DownValues[$numocc]
(a sketch of what it contains is shown below). Again I did not analyze
run times and memory consumption, but in this case I believe they
shouldn't be too bad: after all you are only scanning the data twice,
once to count the occurrences and once to divide the fourth argument.
Memory-wise you are making one copy of your data. If really necessary,
that copy could be avoided with some effort, but most probably this is
not needed.
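For illustration, after the Scan above DownValues[$numocc] for the
example data should look roughly like this (the exact ordering may
differ):

{HoldPattern[$numocc["B", "A"]] :> 5,
 HoldPattern[$numocc["A", "B"]] :> 3,
 HoldPattern[$numocc[__]] :> 0}

i.e. the specific definitions created while counting are tried before
the catch-all default.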
hth,
albert