Re: find and count partially identical sublist
- To: mathgroup at smc.vnet.net
- Subject: [mg85751] Re: find and count partially identical sublist
- From: Albert Retey <awnl at arcor.net>
- Date: Thu, 21 Feb 2008 17:55:47 -0500 (EST)
- References: <fphd98$8mq$1@smc.vnet.net>
Hi Markus,

> I am trying to find and count sublists that are partially identical to
> each other and then modify parts of this sublist with the
> multiplicity. It's easier to understand if I give an example.
>
> Say I have an array (strings and numbers mixed) like:
>
> {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1}, {"B", "A", 4,
> 1}, {"A", "B", 1, 1}, {"B", "A", 5, 1}, {"B", "A", 2, 1}, {"A", "B",
> 10, 1}}
>
> I need to find successive sublists which have the same first two
> elements (here {3,4} and {7,6}). Depending on how many repetitions
> occur, I want to divide the 4th element of each sublist by the number
> of repetitions. In the example the result would be:
>
> {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1/2}, {"B", "A", 4,
> 1/2}, {"A", "B", 1, 1}, {"B", "A", 5, 1/2}, {"B", "A", 2, 1/2},
> {"A", "B", 10, 1}}
>
> One additional comment: The above code only finds successive
> duplicates. How would I have to modify it to find all occurrences?

I hope I understood correctly what you want to achieve. Here is my
version, which finds only successive duplicates. I did not analyze run
times or memory consumption, but this could surely be done more
efficiently:

Flatten[
  Map[
    Replace[#, {a__, x_} :> {a, x/Length[#]}, {1}] &,
    Split[data, #1[[{1, 2}]] == #2[[{1, 2}]] &],
    {1}
  ],
  1
]

If you really need to find all occurrences, I would use a completely
different approach:

ClearAll[$numocc]
$numocc[__] = 0;
Scan[
  ($numocc[#[[1]], #[[2]]] = $numocc[#[[1]], #[[2]]] + 1) &,
  data
];
Apply[{#1, #2, #3, #4/$numocc[#1, #2]} &, data, {1}]

Here I abuse the downvalues of the helper symbol $numocc as a
"dictionary" to count occurrences of the first two arguments. To
understand what happens, you might want to look at
DownValues[$numocc]. Again, I did not analyze run times or memory
consumption, but in this case I believe they shouldn't be too bad:
you are only scanning the data twice, once to count the occurrences
and once to divide the fourth argument. Memory-wise you are making one
copy of your data. If really necessary, that copy could be avoided
with some effort, but most probably this is not needed.

hth,

albert
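
P.S.: to see the downvalue "dictionary" in action, here is a minimal
sketch using the example data from above. $numocc["B", "A"] should end
up at 5 and $numocc["A", "B"] at 3, since those pairs occur five and
three times in the sample data:

(* the example data from the original question *)
data = {{"B", "A", 0, 1}, {"A", "B", 6, 1}, {"B", "A", 4, 1},
        {"B", "A", 4, 1}, {"A", "B", 1, 1}, {"B", "A", 5, 1},
        {"B", "A", 2, 1}, {"A", "B", 10, 1}};

ClearAll[$numocc]
$numocc[__] = 0;                                (* default for unseen pairs *)
Scan[($numocc[#[[1]], #[[2]]] += 1) &, data];   (* accumulate the counts *)

$numocc["B", "A"]    (* should give 5 *)
$numocc["A", "B"]    (* should give 3 *)
DownValues[$numocc]  (* shows the accumulated "dictionary" entries *)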