Re: Combining data from indexed lists efficiently
- To: mathgroup at smc.vnet.net
- Subject: [mg106209] Re: [mg106182] Combining data from indexed lists efficiently
- From: Leonid Shifrin <lshifr at gmail.com>
- Date: Tue, 5 Jan 2010 01:45:03 -0500 (EST)
- References: <201001041059.FAA21250@smc.vnet.net>
Hi Steve,

I think you correctly identified at least some of your bottlenecks. Your code is fine as long as the number of different indices is small. For a large number of indices, using Cases to find individual data elements indeed becomes wasteful. Instead, you can use the fact that the indices you want to keep must be present in every list, which means that your final structure will be rectangular. The implementation below uses that:

(* Slightly improved over yours *)
Clear[indexesToUseNew];
indexesToUseNew[lists__List] :=
  Intersection @@ List[lists][[All, All, 1]];

(* Collect all results in a single run using GatherBy *)
Clear[coreValuesFromLists];
coreValuesFromLists[lists__List] :=
  With[{len = Length[{lists}]},
   Transpose[{#[[All, 1, 1]], #[[All, All, 2]]}] &@
    Cases[GatherBy[Join[lists], First], x_ /; Length[x] == len]];

(* This is optimized assuming that your functions for derived values
   only use algebraic operations *)
Clear[addDerivedValues];
addDerivedValues[values_?MatrixQ, {}] := values;
addDerivedValues[values_?MatrixQ, functions_List] :=
  Flatten[Transpose[
    {values, Transpose@Through[functions[Transpose[values]]]}],
   {{1}, {2, 3}}];

(* Putting everything together *)
Clear[valuesFromLists];
valuesFromLists[{lists__List}, functions : _List : {}] :=
  If[# === {}, {},
     Transpose[{#[[All, 1]],
       addDerivedValues[#[[All, 2]], functions]}]] &@
   coreValuesFromLists[lists];

The logic is to first compute the core values and then add the derived values obtained by applying your functions; the latter you supply as a list in the second argument. Here is how this works on your toy example:

In[5]:= indexesToUseNew[list1, list2, list3]

Out[5]= {"A", "B"}

In[6]:= coreValuesFromLists[list1, list2, list3]

Out[6]= {{"A", {1, 5, 9}}, {"B", {2, 6, 10}}}

In[7]:= addDerivedValues[{{1, 5, 9}, {2, 6, 10}},
  {#[[1]] + #[[2]] &, #[[2]] + #[[3]] &}]

Out[7]= {{1, 5, 9, 6, 14}, {2, 6, 10, 8, 16}}

In[8]:= valuesFromLists[{list1, list2, list3},
  {#[[1]] + #[[2]] &, #[[2]] + #[[3]] &}]

Out[8]= {{"A", {1, 5, 9, 6, 14}}, {"B", {2, 6, 10, 8, 16}}}
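To see why the Length[x] == len filter inside coreValuesFromLists keeps exactly the indices common to all lists, it may help to look at the intermediate grouping (shown here only as an illustration, not as part of the code above). On your toy lists, GatherBy[Join[list1, list2, list3], First] evaluates to

{{{"A", 1}, {"A", 5}, {"A", 9}}, {{"B", 2}, {"B", 6}, {"B", 10}},
 {{"C", 3}, {"C", 11}}, {{"D", 4}, {"D", 7}}, {{"E", 8}}}

Only the "A" and "B" groups have length 3 (the number of lists), so only they survive the Cases filter, and their second elements become the value rows {1, 5, 9} and {2, 6, 10} you see in Out[6].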
Here is your code, which I modified somewhat so that it can work on any number of lists:

Clear[f1, f2, indexesToUse, valueAtIndex, dataAtIndex, combinedList];

f1[list1Data_, list2Data_] := list1Data + list2Data

f2[list2Data_, list3Data_] := list2Data + list3Data

indexesToUse[lists__List] := indexesToUseNew[lists];

valueAtIndex[index_, list_] :=
  Cases[list, {index, _}, 1, 1] // First // Last;

dataAtIndex[{lists__}, index_] :=
  Block[{vf1, vf2, values},
   values = valueAtIndex[index, #] & /@ {lists};
   vf1 = f1[values[[1]], values[[2]]];
   vf2 = f2[values[[2]], values[[3]]];
   {Sequence @@ values, vf1, vf2}];

combinedList[lists__] :=
  Function[{index}, {index, dataAtIndex[{lists}, index]}] /@
   indexesToUse[lists];

Here is the code for a test generator that produces test data with larger numbers of lists and indices:

Clear[makeTestLists];
makeTestLists[listNum_Integer, indexNum_Integer,
   usedFraction : (_?NumericQ) : 0.5,
   valueRange : {_, _} : {1, 100}] /; 0 < usedFraction <= 1 :=
  Module[{used, indices, generateUsedIndices, generateRandomIndices,
    randomPermutation, addValues,
    indexLetters = Characters["ABCDEFGHIJ"]},
   indices =
    Flatten@Outer[StringJoin, indexLetters, ToString /@ Range[indexNum]];
   used = IntegerPart[Length[indices]*usedFraction];
   randomPermutation = RandomSample[#, Length[#]] &;
   addValues = Map[{#, RandomInteger[valueRange]} &, #] &;
   generateUsedIndices[] := Take[indices, used];
   generateRandomIndices[] :=
    RandomSample[#, RandomInteger[{Length[#] - 1, Length[#]}]] &@
     Drop[indices, used];
   Table[randomPermutation@
     addValues[generateUsedIndices[]~Join~generateRandomIndices[]],
    {listNum}]];

It produces a list of lists in your format: <listNum> lists in total, <indexNum> different numerical indices per index letter, a fraction <usedFraction> of all indices guaranteed to be present in every list (the rest appearing probabilistically), and values in the range <valueRange>. For example:

In[10]:= makeTestLists[5, 1]

Out[10]= {{{"B1", 39}, {"I1", 96}, {"E1", 55}, {"G1", 30}, {"A1", 23},
   {"H1", 16}, {"J1", 59}, {"C1", 6}, {"D1", 39}},
  {{"B1", 13}, {"C1", 45}, {"D1", 75}, {"A1", 3}, {"J1", 26}, {"I1", 39},
   {"H1", 41}, {"F1", 9}, {"G1", 39}, {"E1", 35}},
  {{"J1", 78}, {"C1", 59}, {"D1", 69}, {"H1", 92}, {"A1", 13}, {"F1", 75},
   {"I1", 7}, {"B1", 15}, {"G1", 43}, {"E1", 40}},
  {{"I1", 61}, {"H1", 80}, {"B1", 13}, {"E1", 20}, {"F1", 56}, {"A1", 86},
   {"G1", 89}, {"J1", 72}, {"D1", 12}, {"C1", 57}},
  {{"J1", 54}, {"D1", 29}, {"E1", 65}, {"F1", 34}, {"A1", 18}, {"B1", 61},
   {"G1", 88}, {"I1", 91}, {"C1", 87}}}

When only one index is present in all lists, your method is slightly faster:

In[11]:= test = makeTestLists[1000, 5, 0.03];

In[12]:= (res1 = valuesFromLists[test,
     {#[[1]] + #[[2]] &, #[[2]] + #[[3]] &}]) // Shallow // Timing

Out[12]= {0.13, {{"A1", {<<1002>>}}}}

In[13]:= (res2 = combinedList @@ test) // Shallow // Timing

Out[13]= {0.091, {{"A1", {<<1002>>}}}}

In[14]:= Sort@res1 === Sort@res2

Out[14]= True

In most other cases, my code is faster or much faster. In the opposite case, with 5 lists and 1000 indices, for example:

In[15]:= test = makeTestLists[5, 100];

In[16]:= (res1 = valuesFromLists[test,
     {#[[1]] + #[[2]] &, #[[2]] + #[[3]] &}]) // Shallow // Timing

Out[16]= {0.02, {{"G3", {<<7>>}}, {"C20", {<<7>>}}, {"H27", {<<7>>}},
   {"E3", {<<7>>}}, {"E90", {<<7>>}}, {"A5", {<<7>>}}, {"F58", {<<7>>}},
   {"B59", {<<7>>}}, {"A13", {<<7>>}}, {"G70", {<<7>>}}, <<988>>}}

In[17]:= (res2 = combinedList @@ test) // Shallow // Timing

Out[17]= {2.163, {{"A1", {<<7>>}}, {"A10", {<<7>>}}, {"A100", {<<7>>}},
   {"A11", {<<7>>}}, {"A12", {<<7>>}}, {"A13", {<<7>>}}, {"A14", {<<7>>}},
   {"A15", {<<7>>}}, {"A16", {<<7>>}}, {"A17", {<<7>>}}, <<988>>}}

In[18]:= Sort@res1 === Sort@res2

Out[18]= True

Intermediate cases like the following, with 300 lists and 100 indices, fall somewhere in between (Cases is still reasonably fast for lists of length ~100):

In[19]:= test = makeTestLists[300, 10];

In[20]:= (res1 = valuesFromLists[test,
     {#[[1]] + #[[2]] &, #[[2]] + #[[3]] &}]) // Shallow // Timing

Out[20]= {0.07, {{"A8", {<<302>>}}, {"D1", {<<302>>}}, {"D4", {<<302>>}},
   {"B5", {<<302>>}}, {"B4", {<<302>>}}, {"E1", {<<302>>}},
   {"C10", {<<302>>}}, {"C6", {<<302>>}}, {"D10", {<<302>>}},
   {"B7", {<<302>>}}, <<41>>}}

In[21]:= (res2 = combinedList @@ test) // Shallow // Timing

Out[21]= {0.932, {{"A1", {<<302>>}}, {"A10", {<<302>>}}, {"A2", {<<302>>}},
   {"A3", {<<302>>}}, {"A4", {<<302>>}}, {"A5", {<<302>>}},
   {"A6", {<<302>>}}, {"A7", {<<302>>}}, {"A8", {<<302>>}},
   {"A9", {<<302>>}}, <<41>>}}

In[22]:= Sort@res1 === Sort@res2

Out[22]= True

Note that in my code I implicitly used the assumption that your functions for derived values are Listable (which is true for most numerical/algebraic operations).
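If some of your derived-value functions happen not to be Listable, one simple (though slower) fallback is to apply them row by row instead of to whole columns. The following is only a sketch to illustrate the idea; addDerivedValuesRowwise is an illustrative name and is not used by valuesFromLists above:

(* Row-wise variant: apply each function to one row of core values,
   e.g. {1, 5, 9}, rather than to whole columns. This also works for
   non-Listable functions, but loses the vectorized speed-up. *)
Clear[addDerivedValuesRowwise];
addDerivedValuesRowwise[values_?MatrixQ, {}] := values;
addDerivedValuesRowwise[values_?MatrixQ, functions_List] :=
  Join[#, Through[functions[#]]] & /@ values;

The same slot-based functions work here, for example

addDerivedValuesRowwise[{{1, 5, 9}, {2, 6, 10}},
  {#[[1]] + #[[2]] &, #[[2]] + #[[3]] &}]

gives {{1, 5, 9, 6, 14}, {2, 6, 10, 8, 16}}, the same result as Out[7].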
Hope this helps.

Regards,
Leonid


On Mon, Jan 4, 2010 at 2:59 AM, Steve W. Brewer <steve at take5.org> wrote:

> I have several lists of the format:
>
> { {index1, value}, {index2, value}, ... {indexN, value} }
>
> For example:
>
> list1 = { {"A", 1}, {"B", 2}, {"C", 3}, {"D", 4} }
> list2 = { {"A", 5}, {"B", 6}, {"D", 7}, {"E", 8} }
> list3 = { {"A", 9}, {"B", 10}, {"C", 11} }
>
> The indexes are not necessarily strings; they may be any expression. (In
> the specific case I'm addressing now, each index is a list representing a
> date/time in the format returned by DateList[].) The lists are not
> necessarily the same length. Also, while most of the indexes appear in all
> lists, there are some holes (missing data).
>
> I want to combine the lists into a single list of the format:
>
> { { index1, {value1, value2, ... valueN} },
>   { index2, {value1, value2, ... valueN} },
>   ...
>   { indexN, {value1, value2, ... valueN} } }
>
> Only the data points with indexes appearing in all lists should be
> included; the rest should be dropped. Also, I want to include some
> derived values along with the original data values.
>
> Using the sample data above, let's say I want to include two derived
> values from the functions:
>
> f1[list1Data_, list2Data_] := list1Data + list2Data
> f2[list2Data_, list3Data_] := list2Data + list3Data
>
> The result would be:
>
> combinedList = { { "A", {1, 5, 9, 6, 14} },
>                  { "B", {2, 6, 10, 8, 16} } }
>
> I have a solution that works fine on "small" data sets. However, it's
> impractically slow on the "large" data sets I really need to run it on
> (over 100k elements in each list).
>
> Here's what I'm doing now:
>
>
> (* This part executes pretty quickly *)
>
> indexesToUse =
>   Intersection[First /@ list1, First /@ list2, First /@ list3];
>
> valueAtIndex[index_, list_] :=
>   Cases[list, {index, _}, 1, 1] // First // Last;
>
> dataAtIndex[index_] := Block[
>   {v1, v2, v3, vf1, vf2},
>
>   v1 = valueAtIndex[index, list1];
>   v2 = valueAtIndex[index, list2];
>   v3 = valueAtIndex[index, list3];
>
>   vf1 = f1[v1, v2];
>   vf2 = f2[v2, v3];
>
>   {v1, v2, v3, vf1, vf2}
> ];
>
> (* This is where it bogs down *)
>
> combinedList =
>   Function[{index}, {index, dataAtIndex[index]}] /@ indexesToUse;
>
>
> This is all inside an enclosing Module[] along with some other code, and
> the actual code is a little more complex (e.g. more than three lists, more
> than two derived-value functions). The derived-value functions themselves
> are mostly simple algebra; I doubt they're the source of the bottleneck,
> and in any case, I can't change them. (I *can* change the way they're
> applied, though, if it makes a difference.)
>
> I *think* the bottleneck is probably in my repeated calls to Cases[] to
> find particular data points, but that's just a guess.
>
> Is there a more efficient way of doing this that would speed things up
> significantly?
>
> Thanks!
>
>
> Steve W. Brewer
- References:
- Combining data from indexed lists efficiently
- From: "Steve W. Brewer" <steve@take5.org>