Re: find and count partially identical sublist
- To: mathgroup at smc.vnet.net
- Subject: [mg85791] Re: find and count partially identical sublist
- From: Szabolcs Horvát <szhorvat at gmail.com>
- Date: Fri, 22 Feb 2008 05:05:46 -0500 (EST)
- Organization: University of Bergen
- References: <fphd98$8mq$1@smc.vnet.net>
markus.roellig at googlemail.com wrote:
> Hello group,
>
> I am trying to find and count sublists that are partially identical to
> each other and then modify parts of this sublist with the
> multiplicity. It's easier to understand if I give an example.
>
> Say I have an array (strings and numbers mixed) like:
>
> {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1}, {"B", "A", 4,
> 1}, {"A", "B", 1, 1}, {"B", "A", 5, 1}, {"B", "A", 2, 1}, {"A", "B",
> 10, 1}}
>
> I need to find successive sublists which have the same first two
> elements (here {3,4} and {6,7}). Depending on
> how many repetitions occur I want to divide the 4th element of each
> sublist by the number of repetitions. In the example the result would
> be:
>
> {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1/2}, {"B", "A", 4,
> 1/2}, {"A", "B", 1, 1}, {"B", "A", 5, 1/2}, {"B", "A", 2, 1/
> 2}, {"A", "B", 10, 1}}
>
> The code I came up with is:
>
>
> tst = Table[{RandomChoice[{"A", "B"}], RandomChoice[{"A", "B"}],
> RandomInteger[{0, 10}], 1}, {i, 1, 30}];
> tstSplt = Split[tst, #1[[1 ;; 2]] === #2[[1 ;; 2]] &] // MatrixForm
> tab = Table[tstSplt[[1, i]] // Length, {i, 1, Length[tstSplt[[1]]]}]
> rpl = MapThread[#1[[All, 4]]/#2 &, {tstSplt[[1, All]], tab}] //
> Flatten
> tst[[All, 4]] = tst[[#, 4]] & @@@ rpl;
> tst
>
>
> This works, but I am somewhat concerned with run speed (my actual
> array is much larger, roughly 50000x20). And I have the feeling that I
> am wasting too much memory.
>
>
> One additional comment: The above code only finds successive
> duplicates. How would I have to modify it to find all occurrences?
I don't have time to benchmark this, so I cannot make any guarantees
about performance ... but I suspect that this will be useful:
I'll take the case when all duplicates are taken into account (not only
successive ones).
Extract the relevant parts of the data (your array, called data here):
the first two elements and the last element of each sublist:
dat1 = data[[All, {1,2}]]
dat2 = data[[All, -1]]
It may (or may not) help performance to work with simple integers
instead of strings. Can you transform the strings into integers? If
yes, integers can be stored in a packed array ... but I'm not sure that
this will help a lot.
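A rough sketch of the integer idea, not benchmarked. The names keys, toInt and dat1Int are made up for this illustration, and the three-row dat1 only stands in for the real data:

```mathematica
(* Small stand-in for the first two columns of the real array *)
dat1 = {{"B", "A"}, {"A", "B"}, {"B", "A"}};

(* Assign each distinct string an integer code: "A" -> 1, "B" -> 2 *)
keys = Union[Flatten[dat1]];
toInt = Dispatch[Thread[keys -> Range[Length[keys]]]];

(* Replace the strings by their codes and pack the resulting integer matrix *)
dat1Int = Developer`ToPackedArray[dat1 /. toInt];

Developer`PackedArrayQ[dat1Int]
```

Whether the packed version is actually faster in the later steps would have to be measured on the real array.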
Anyway, let's generate something that looks like dat1:
dat1 = Table[RandomChoice[{"A", "B"}, 2], {50000}];
And replace all elements with their multiplicities:
mult = dat1 /. Dispatch[Rule @@@ Tally[dat1]]; // Timing
{0.219, Null}
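To see what the replacement does, here it is on the eight-row example from the question (the variable names rules and data are only for this illustration):

```mathematica
data = {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1},
   {"B", "A", 4, 1}, {"A", "B", 1, 1}, {"B", "A", 5, 1},
   {"B", "A", 2, 1}, {"A", "B", 10, 1}};

dat1 = data[[All, {1, 2}]];

(* Tally gives {{key, count}, ...}; Rule @@@ turns each pair into key -> count *)
rules = Rule @@@ Tally[dat1]
(* {{"B", "A"} -> 5, {"A", "B"} -> 3} *)

(* Each key pair is replaced by the number of times it occurs in total *)
mult = dat1 /. Dispatch[rules]
(* {5, 3, 5, 5, 3, 5, 5, 3} *)
```

Note that this counts every occurrence of a key pair, not only runs of neighbours, so {"B", "A"} gets 5 here rather than 2.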
Now you only need to divide dat2 by mult (dat2/mult) and assemble the
lists again:
Transpose[Append[Transpose[dat1], dat2/mult]]
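The expression above contains only the two key columns and the rescaled last column. If the other columns are needed as well (the real array is said to have about 20), one possible variation is to overwrite the last column of a copy. This is a sketch on the example data from the question; result is a name chosen here:

```mathematica
data = {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1},
   {"B", "A", 4, 1}, {"A", "B", 1, 1}, {"B", "A", 5, 1},
   {"B", "A", 2, 1}, {"A", "B", 10, 1}};

dat1 = data[[All, {1, 2}]];
dat2 = data[[All, -1]];
mult = dat1 /. Dispatch[Rule @@@ Tally[dat1]];

(* Copy the array and replace only its last column; all others stay as they are *)
result = data;
result[[All, -1]] = dat2/mult;
result
(* {{"B", "A", 0, 1/5}, {"A", "B", 6, 1/3}, {"B", "A", 4, 1/5},
    {"B", "A", 4, 1/5}, {"A", "B", 1, 1/3}, {"B", "A", 5, 1/5},
    {"B", "A", 2, 1/5}, {"A", "B", 10, 1/3}} *)
```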
I hope this helps,
Szabolcs
P.S. Do you actually need only successive duplicates or all duplicates?
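If only successive duplicates matter, a possible sketch (again not benchmarked) is to take the multiplicities from run lengths instead of Tally. The names runs and result are chosen here for illustration:

```mathematica
data = {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1},
   {"B", "A", 4, 1}, {"A", "B", 1, 1}, {"B", "A", 5, 1},
   {"B", "A", 2, 1}, {"A", "B", 10, 1}};

dat1 = data[[All, {1, 2}]];
dat2 = data[[All, -1]];

(* Split groups identical neighbouring key pairs into runs *)
runs = Split[dat1];

(* Every member of a run gets the length of that run as its multiplicity *)
mult = Flatten[ConstantArray[Length[#], Length[#]] & /@ runs]
(* {1, 1, 2, 2, 1, 2, 2, 1} *)

result = data;
result[[All, -1]] = dat2/mult;
result
```

On the example this gives 1/2 in the last column of rows 3, 4, 6 and 7 and leaves the other rows at 1, which is the output asked for in the question.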