MathGroup Archive: January 2010 [00155]

[Date Index] [Thread Index] [Author Index]

Re: Combining data from indexed lists efficiently

To: mathgroup at smc.vnet.net
Subject: [mg106221] Re: [mg106182] Combining data from indexed lists efficiently
From: Daniel Lichtblau <danl at wolfram.com>
Date: Tue, 5 Jan 2010 01:47:28 -0500 (EST)
References: <201001041059.FAA21250@smc.vnet.net>

Steve W. Brewer wrote:
> I have several lists of the format:
> 
>     { {index1, value}, {index2, value}, ... {indexN, value} }
> 
> For example:
> 
>     list1 = { {"A", 1}, {"B",  2}, {"C",  3}, {"D", 4} }
>     list2 = { {"A", 5}, {"B",  6}, {"D",  7}, {"E", 8} }
>     list3 = { {"A", 9}, {"B", 10}, {"C", 11} }
> 
> The indexes are not necessarily strings; they may be any expression.  (In
> the specific case I'm addressing now, each index is a list representing a
> date/time in the format returned by DateList[].)  The lists are not
> necessarily the same length.  Also, while most of the indexes appear in all
> lists, there are some holes (missing data).
> 
> I want to combine the lists into a single list of the format:
> 
>     { { index1, {value1, value2, ... valueN} },
>       { index2, {value1, value2, ... valueN} },
>       ...
>       { indexN, {value1, value2, ... valueN} } }
> 
> Only the data points with indexes appearing in all lists should be included;
> the rest should be dropped.  Also, I want to include some derived values
> along with the original data values.
> 
> Using the sample data above, let's say I want to include two derived values
> from the functions:
> 
>     f1[list1Data_, list2Data_] := list1Data + list2Data
>     f2[list2Data_, list3Data_] := list2Data + list3Data
> 
> The result would be:
> 
>     combinedList = { { "A", {1, 5,  9, 6, 14} },
>                      { "B", {2, 6, 10, 8, 16} } }
> 
> I have a solution that works fine on "small" data sets. However, it's
> impractically slow on the "large" data sets I really need to run it on (over
> 100k elements in each list).
> 
> Here's what I'm doing now:
> 
> 
>     (* This part executes pretty quickly *)
> 
>     indexesToUse =
>         Intersection[First /@ list1, First /@ list2, First /@ list3];
> 
>     valueAtIndex[index_, list_] :=
>         Cases[list, {index, _}, 1, 1] // First // Last;
> 
>     dataAtIndex[index_] := Block[
>         {v1, v2, v3, vf1, vf2},
> 
>         v1 = valueAtIndex[index, list1];
>         v2 = valueAtIndex[index, list2];
>         v3 = valueAtIndex[index, list3];
> 
>         vf1 = f1[v1, v2];
>         vf2 = f2[v2, v3];
> 
>         {v1, v2, v3, vf1, vf2}
>     ];
> 
>     (* This is where it bogs down *)
> 
>     combinedList =
>         Function[{index}, {index, dataAtIndex[index]}] /@ indexesToUse;
> 
> 
> This is all inside an enclosing Module[] along with some other code, and the
> actual code is a little more complex (e.g. more than three lists, more than
> two derived-value functions).  The derived-value functions themselves are
> mostly simple algebra; I doubt they're the source of the bottleneck, and in
> any case, I can't change them.  (I *can* change the way they're applied,
> though, if it makes a difference.)
> 
> I *think* the bottleneck is probably in my repeated calls to Cases[] to find
> particular data points, but that's just a guess.
> 
> Is there a more efficient way of doing this that would speed things up
> significantly?
> 
> Thanks!
> 
> 
> Steve W. Brewer
> 

One can do this with pedestrian procedural code. The idea is that after 
culling out the indices of interest, only loop once over the entire list 
of lists. Initialize a structure that has the correct shape of the 
output. Keep a set of subindices to tell you which field in each entry 
of the result is ready to be set. Set the fields in the iteration. When 
done with that, compute the functions of each partial result, augment 
those values to the partial results.

indexesToUse[lists_List] := Apply[Intersection,Map[#[[All,1]]&,lists]]

initValueTable[indices_,len_] :=
   Table[{indices[[j]],ConstantArray[0,len]}, {j,Length[indices]}]

valueTables[lists_List,funcs_List] := Module[
   {indices,valtab,subindices,ll,list,listi,indexset,ipos,vtj},
   ll = Length[lists];
   indices = indexesToUse[lists];
   valtab = initValueTable[indices,ll+Length[funcs]];
   subindices = ConstantArray[0,Length[indices]];
   Do[indexset[indices[[j]]] = j, {j,Length[indices]}];
   Do[
     list = lists[[j]];
	Do[
	  listi = list[[i]];
	  ipos = indexset[listi[[1]]];
	  If [!IntegerQ[ipos], Continue[]];
	  subindices[[ipos]] += 1;
	  valtab[[ipos,2,subindices[[ipos]]]] = listi[[2]];
	  ,{i,Length[list]}];
     ,{j,ll}];
   Clear[indexset];
   Do[
	vtj = Take[valtab[[j,2]],ll];
     Do[
	  valtab[[j,2,ll+i]] = funcs[[i]][vtj];
	  ,{i,Length[funcs]}];
     ,{j,Length[valtab]}];
   valtab
   ]

Here is an example that uses 100 lists of (initially) 10^5 elements in 
each. I do some preprocessing so that the first elements in eah pair are 
unique to each list, yet likely to be found in all lists.

In[88]:= lists = Table[{RandomInteger[10^4],RandomReal[]}, {10^2}, {10^5}];
lists2 = Map[GatherBy[#,First]&, lists];
lists3 = Table[Map[{#[[1,1]],Total[#[[All,2]]]}&, lists2[[j]]],
   {j,Length[lists2]}];

In[92]:= Timing[vt = valueTables[lists3,
   {#[[1]]+#[[2]]&,#[[1]]+#[[3]]&}];]
Out[92]= {10.9563, Null}

Daniel Lichtblau
Wolfram Research

References:
- Combining data from indexed lists efficiently
  - From: "Steve W. Brewer" <steve@take5.org>

Prev by Date: Re: Re: Re: More /.{I->-1} craziness

Next by Date: Re: Combining data from indexed lists efficiently

Previous by thread: Combining data from indexed lists efficiently

Next by thread: Re: Combining data from indexed lists efficiently