MathGroup Archive 2008

[Date Index] [Thread Index] [Author Index]

Search the Archive

Re: find and count partially identical sublist

  • To: mathgroup at smc.vnet.net
  • Subject: [mg85921] Re: find and count partially identical sublist
  • From: markus.roellig at googlemail.com
  • Date: Thu, 28 Feb 2008 02:40:45 -0500 (EST)
  • References: <fphd98$8mq$1@smc.vnet.net> <fpm79i$2u7$1@smc.vnet.net>

On 22 Feb., 11:14, Szabolcs Horv=E1t <szhor... at gmail.com> wrote:
> markus.roel... at googlemail.com wrote:
> > Hello group,
>
> > I am trying to find and count sublists that are partially identical to
> > each other and then modify parts of this sublist with the
> > multiplicity. It's easier to understand if I give an example.
>
> > Say I have an array (strings and numbers mixed) like:
>
> > {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1}, {"B", "A", 4,
> > =A0 1}, {"A", "B", 1, 1}, {"B", "A", 5, 1}, {"B", "A", 2, 1}, {"A", "B",=

> > 10, 1}}
>
> > I need to find successive sublists which have the same first two
> > elements (here {3,4} and {7,6}). Depending on
> > how many repetitions occur I want to divide the 4th element of each
> > sublist by the number of repetitions. In the example the result would
> > be:
>
> > {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1/2}, {"B", "A", 4,
> > =A0 =A01/2}, {"A", "B", 1, 1}, {"B", "A", 5, 1/2}, {"B", "A", 2, 1/
> > =A0 2}, {"A", "B", 10, 1}}
>
> > The code I came up with is:
>
> > tst = Table[{RandomChoice[{"A", "B"}], RandomChoice[{"A", "B"}],
> > =A0 =A0 RandomInteger[{0, 10}], 1}, {i, 1, 30}];
> > tstSplt = Split[tst, #1[[1 ;; 2]] === #2[[1 ;; 2]] &] // MatrixF=
orm
> > tab = Table[tstSplt[[1, i]] // Length, {i, 1, Length[tstSplt[[1]]]}]
> > rpl = MapThread[#1[[All, 4]]/#2 &, {tstSplt[[1, All]], tab}] //
> > =A0 Flatten
> > tst[[All, 4]] = tst[[#, 4]] & @@@ rpl;
> > tst
>
> > This works, but I am somewhat concerned with run speed (my actual
> > array is much larger, roughly 50000x20). And I have the feeling that I
> > am wasting too much memory.
>
> > One additional comment: The above code only finds successive
> > duplicates. How would I have to modify it to find all occurences ?
>
> I don't have time to benchmark this, so I cannot make any guarantees
> about performance ... but I suspect that this will be useful:
>
> I'll take the case when all duplicates are taken into account (not only
> successive ones).
>
> Extract the relevant part of the data, i.e. the first 2 elements, and
> the last element of the sublists:
>
> dat1 = data[[All, {1,2}]]
> dat2 = data[[All, -1]]
>
> It may (or may not) help performance to work with simple integers
> instead of strings. =A0Can you transform the strings into integers? =A0If
> yes, integers can be stored in a packed array ... but I'm not sure that
> this will help a lot.
>
> Anyway, let's generate something that looks like dat1:
>
> dat1 = Table[RandomChoice[{"A", "B"}, 2], {50000}];
>
> And replace all elements with their multiplicities:
>
> mult = dat1 /. Dispatch[Rule @@@ Tally[dat1]]; // Timing
>
> {0.219, Null}
>
> Now you only need to divide dat2 by mult (dat2/mult) and assemble the
> lists again:
>
> Tranpose[Append[Transpose[dat1], dat2/mult]]
>
> I hope this helps,
> Szabolcs
>
> P.S. =A0Do you actually need only successive duplicates or all duplicates?=
- Zitierten Text ausblenden -
>
> - Zitierten Text anzeigen -

Thank you all for you input.

> P.S.  Do you actually need only successive duplicates or all duplicates?

Actually, what would be the way to search for all (not only
successive) duplicates
if I don't want to use Sort?

Markus


  • Prev by Date: Re: Using Fourier and InverseFourier instead of Convolve
  • Next by Date: Re: How to select terms wanted in a series
  • Previous by thread: Re: find and count partially identical sublist
  • Next by thread: about scoping in modules