MathGroup Archive: February 2008 [00573]

[Date Index] [Thread Index] [Author Index]

Re: find and count partially identical sublist

To: mathgroup at smc.vnet.net
Subject: [mg85791] Re: find and count partially identical sublist
From: Szabolcs Horvát <szhorvat at gmail.com>
Date: Fri, 22 Feb 2008 05:05:46 -0500 (EST)
Organization: University of Bergen
References: <fphd98$8mq$1@smc.vnet.net>

markus.roellig at googlemail.com wrote:
> Hello group,
> 
> I am trying to find and count sublists that are partially identical to
> each other and then modify parts of this sublist with the
> multiplicity. It's easier to understand if I give an example.
> 
> Say I have an array (strings and numbers mixed) like:
> 
> {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1}, {"B", "A", 4,
>   1}, {"A", "B", 1, 1}, {"B", "A", 5, 1}, {"B", "A", 2, 1}, {"A", "B",
> 10, 1}}
> 
> I need to find successive sublists which have the same first two
> elements (here {3,4} and {7,6}). Depending on
> how many repetitions occur I want to divide the 4th element of each
> sublist by the number of repetitions. In the example the result would
> be:
> 
> {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1/2}, {"B", "A", 4,
>    1/2}, {"A", "B", 1, 1}, {"B", "A", 5, 1/2}, {"B", "A", 2, 1/
>   2}, {"A", "B", 10, 1}}
> 
> The code I came up with is:
> 
> 
> tst = Table[{RandomChoice[{"A", "B"}], RandomChoice[{"A", "B"}],
>     RandomInteger[{0, 10}], 1}, {i, 1, 30}];
> tstSplt = Split[tst, #1[[1 ;; 2]] === #2[[1 ;; 2]] &] // MatrixForm
> tab = Table[tstSplt[[1, i]] // Length, {i, 1, Length[tstSplt[[1]]]}]
> rpl = MapThread[#1[[All, 4]]/#2 &, {tstSplt[[1, All]], tab}] //
>   Flatten
> tst[[All, 4]] = tst[[#, 4]] & @@@ rpl;
> tst
> 
> 
> This works, but I am somewhat concerned with run speed (my actual
> array is much larger, roughly 50000x20). And I have the feeling that I
> am wasting too much memory.
> 
> 
> One additional comment: The above code only finds successive
> duplicates. How would I have to modify it to find all occurences ?

I don't have time to benchmark this, so I cannot make any guarantees 
about performance ... but I suspect that this will be useful:

I'll take the case when all duplicates are taken into account (not only 
successive ones).

Extract the relevant part of the data, i.e. the first 2 elements, and 
the last element of the sublists:

dat1 = data[[All, {1,2}]]
dat2 = data[[All, -1]]

It may (or may not) help performance to work with simple integers 
instead of strings.  Can you transform the strings into integers?  If 
yes, integers can be stored in a packed array ... but I'm not sure that 
this will help a lot.

Anyway, let's generate something that looks like dat1:

dat1 = Table[RandomChoice[{"A", "B"}, 2], {50000}];

And replace all elements with their multiplicities:

mult = dat1 /. Dispatch[Rule @@@ Tally[dat1]]; // Timing

{0.219, Null}

Now you only need to divide dat2 by mult (dat2/mult) and assemble the 
lists again:

Tranpose[Append[Transpose[dat1], dat2/mult]]

I hope this helps,
Szabolcs

P.S.  Do you actually need only successive duplicates or all duplicates?

Prev by Date: Re: find and count partially identical sublist

Next by Date: Re: Using Mathematica figures in MS Word documents

Previous by thread: Re: find and count partially identical sublist

Next by thread: Re: find and count partially identical sublist