Re: Combining data from indexed lists efficiently
- To: mathgroup at smc.vnet.net
- Subject: [mg106209] Re: [mg106182] Combining data from indexed lists efficiently
- From: Leonid Shifrin <lshifr at gmail.com>
- Date: Tue, 5 Jan 2010 01:45:03 -0500 (EST)
- References: <201001041059.FAA21250@smc.vnet.net>
Hi Steve,

I think you correctly identified at least some of your bottlenecks. Your code is fine as long as the number of different indices is small. For a large number of indices, using Cases to find individual data elements indeed becomes wasteful. Instead, you can use the fact that the indices you want to keep must be present in every list, which means that your final structure will be rectangular. The implementation below uses that:

(* Slightly improved over yours *)
Clear[indexesToUseNew];
indexesToUseNew[lists__List] :=
  Intersection @@ List[lists][[All, All, 1]];

(* Collect all results in a single run using GatherBy *)
Clear[coreValuesFromLists];
coreValuesFromLists[lists__List] :=
  With[{len = Length[{lists}]},
   Transpose[{#[[All, 1, 1]], #[[All, All, 2]]}] &@
    Cases[GatherBy[Join[lists], First], x_ /; Length[x] == len]];

(* This is optimized assuming that your functions for derived values
   only use algebraic operations *)
Clear[addDerivedValues];
addDerivedValues[values_?MatrixQ, {}] := values;
addDerivedValues[values_?MatrixQ, functions_List] :=
  Flatten[Transpose[
    {values, Transpose@Through[functions[Transpose[values]]]}],
   {{1}, {2, 3}}];

(* Putting everything together *)
Clear[valuesFromLists];
valuesFromLists[{lists__List}, functions : _List : {}] :=
  If[# === {}, {},
     Transpose[{#[[All, 1]],
       addDerivedValues[#[[All, 2]], functions]}]] &@
   coreValuesFromLists[lists];

The logic is to first compute the core values and then add the derived values obtained by applying your functions; the latter you supply as a list in the second argument. Here is how this works on your toy example:

In[5]:= indexesToUseNew[list1, list2, list3]

Out[5]= {"A", "B"}

In[6]:= coreValuesFromLists[list1, list2, list3]

Out[6]= {{"A", {1, 5, 9}}, {"B", {2, 6, 10}}}

In[7]:= addDerivedValues[{{1, 5, 9}, {2, 6, 10}},
  {#[[1]] + #[[2]] &, #[[2]] + #[[3]] &}]

Out[7]= {{1, 5, 9, 6, 14}, {2, 6, 10, 8, 16}}

In[8]:= valuesFromLists[{list1, list2, list3},
  {#[[1]] + #[[2]] &, #[[2]] + #[[3]] &}]

Out[8]= {{"A", {1, 5, 9, 6, 14}}, {"B", {2, 6, 10, 8, 16}}}
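To see why the Length[x] == len filter inside coreValuesFromLists keeps exactly the indices common to all lists, it may help to look at the intermediate grouping (shown here only as an illustration, not as part of the code above). On your toy lists, GatherBy[Join[list1, list2, list3], First] evaluates to

{{{"A", 1}, {"A", 5}, {"A", 9}}, {{"B", 2}, {"B", 6}, {"B", 10}},
 {{"C", 3}, {"C", 11}}, {{"D", 4}, {"D", 7}}, {{"E", 8}}}

Only the "A" and "B" groups have length 3 (the number of lists), so only they survive the Cases filter, and their second elements become the value rows {1, 5, 9} and {2, 6, 10} you see in Out[6].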
Here is your code, which I modified somewhat so that it can work on any number of lists:

Clear[f1, f2, indexesToUse, valueAtIndex, dataAtIndex, combinedList];

f1[list1Data_, list2Data_] := list1Data + list2Data

f2[list2Data_, list3Data_] := list2Data + list3Data

indexesToUse[lists__List] := indexesToUseNew[lists];

valueAtIndex[index_, list_] :=
  Cases[list, {index, _}, 1, 1] // First // Last;

dataAtIndex[{lists__}, index_] :=
  Block[{vf1, vf2, values},
   values = valueAtIndex[index, #] & /@ {lists};
   vf1 = f1[values[[1]], values[[2]]];
   vf2 = f2[values[[2]], values[[3]]];
   {Sequence @@ values, vf1, vf2}];

combinedList[lists__] :=
  Function[{index}, {index, dataAtIndex[{lists}, index]}] /@
   indexesToUse[lists];

Here is the code for a test generator that produces test data with larger numbers of lists and indices:

Clear[makeTestLists];
makeTestLists[listNum_Integer, indexNum_Integer,
   usedFraction : (_?NumericQ) : 0.5,
   valueRange : {_, _} : {1, 100}] /; 0 < usedFraction <= 1 :=
  Module[{used, indices, generateUsedIndices, generateRandomIndices,
    randomPermutation, addValues,
    indexLetters = Characters["ABCDEFGHIJ"]},
   indices =
    Flatten@Outer[StringJoin, indexLetters, ToString /@ Range[indexNum]];
   used = IntegerPart[Length[indices]*usedFraction];
   randomPermutation = RandomSample[#, Length[#]] &;
   addValues = Map[{#, RandomInteger[valueRange]} &, #] &;
   generateUsedIndices[] := Take[indices, used];
   generateRandomIndices[] :=
    RandomSample[#, RandomInteger[{Length[#] - 1, Length[#]}]] &@
     Drop[indices, used];
   Table[randomPermutation@
     addValues[generateUsedIndices[]~Join~generateRandomIndices[]],
    {listNum}]];

It produces a list of lists in your format: <listNum> lists in total, <indexNum> different numerical indices per index letter, a fraction <usedFraction> of all indices guaranteed to be present in every list (the rest appearing probabilistically), and values in the range <valueRange>. For example:

In[10]:= makeTestLists[5, 1]

Out[10]= {{{"B1", 39}, {"I1", 96}, {"E1", 55}, {"G1", 30}, {"A1", 23},
   {"H1", 16}, {"J1", 59}, {"C1", 6}, {"D1", 39}},
  {{"B1", 13}, {"C1", 45}, {"D1", 75}, {"A1", 3}, {"J1", 26}, {"I1", 39},
   {"H1", 41}, {"F1", 9}, {"G1", 39}, {"E1", 35}},
  {{"J1", 78}, {"C1", 59}, {"D1", 69}, {"H1", 92}, {"A1", 13}, {"F1", 75},
   {"I1", 7}, {"B1", 15}, {"G1", 43}, {"E1", 40}},
  {{"I1", 61}, {"H1", 80}, {"B1", 13}, {"E1", 20}, {"F1", 56}, {"A1", 86},
   {"G1", 89}, {"J1", 72}, {"D1", 12}, {"C1", 57}},
  {{"J1", 54}, {"D1", 29}, {"E1", 65}, {"F1", 34}, {"A1", 18}, {"B1", 61},
   {"G1", 88}, {"I1", 91}, {"C1", 87}}}

When only one index is present in all lists, your method is slightly faster:

In[11]:= test = makeTestLists[1000, 5, 0.03];

In[12]:= (res1 = valuesFromLists[test,
     {#[[1]] + #[[2]] &, #[[2]] + #[[3]] &}]) // Shallow // Timing

Out[12]= {0.13, {{"A1", {<<1002>>}}}}

In[13]:= (res2 = combinedList @@ test) // Shallow // Timing

Out[13]= {0.091, {{"A1", {<<1002>>}}}}

In[14]:= Sort@res1 === Sort@res2

Out[14]= True

In most other cases, my code is faster or much faster. In the opposite case, with 5 lists and 1000 indices, for example:

In[15]:= test = makeTestLists[5, 100];

In[16]:= (res1 = valuesFromLists[test,
     {#[[1]] + #[[2]] &, #[[2]] + #[[3]] &}]) // Shallow // Timing

Out[16]= {0.02, {{"G3", {<<7>>}}, {"C20", {<<7>>}}, {"H27", {<<7>>}},
   {"E3", {<<7>>}}, {"E90", {<<7>>}}, {"A5", {<<7>>}}, {"F58", {<<7>>}},
   {"B59", {<<7>>}}, {"A13", {<<7>>}}, {"G70", {<<7>>}}, <<988>>}}

In[17]:= (res2 = combinedList @@ test) // Shallow // Timing

Out[17]= {2.163, {{"A1", {<<7>>}}, {"A10", {<<7>>}}, {"A100", {<<7>>}},
   {"A11", {<<7>>}}, {"A12", {<<7>>}}, {"A13", {<<7>>}}, {"A14", {<<7>>}},
   {"A15", {<<7>>}}, {"A16", {<<7>>}}, {"A17", {<<7>>}}, <<988>>}}

In[18]:= Sort@res1 === Sort@res2

Out[18]= True

Intermediate cases like the following, with 300 lists and 100 indices, fall somewhere in between (Cases is still reasonably fast for lists of length ~100):

In[19]:= test = makeTestLists[300, 10];

In[20]:= (res1 = valuesFromLists[test,
     {#[[1]] + #[[2]] &, #[[2]] + #[[3]] &}]) // Shallow // Timing

Out[20]= {0.07, {{"A8", {<<302>>}}, {"D1", {<<302>>}}, {"D4", {<<302>>}},
   {"B5", {<<302>>}}, {"B4", {<<302>>}}, {"E1", {<<302>>}},
   {"C10", {<<302>>}}, {"C6", {<<302>>}}, {"D10", {<<302>>}},
   {"B7", {<<302>>}}, <<41>>}}

In[21]:= (res2 = combinedList @@ test) // Shallow // Timing

Out[21]= {0.932, {{"A1", {<<302>>}}, {"A10", {<<302>>}}, {"A2", {<<302>>}},
   {"A3", {<<302>>}}, {"A4", {<<302>>}}, {"A5", {<<302>>}},
   {"A6", {<<302>>}}, {"A7", {<<302>>}}, {"A8", {<<302>>}},
   {"A9", {<<302>>}}, <<41>>}}

In[22]:= Sort@res1 === Sort@res2

Out[22]= True

Note that in my code I implicitly used the assumption that your functions for derived values are Listable (which is true for most numerical/algebraic operations).
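If some of your derived-value functions happen not to be Listable, one simple (though slower) fallback is to apply them row by row instead of to whole columns. The following is only a sketch to illustrate the idea; addDerivedValuesRowwise is an illustrative name and is not used by valuesFromLists above:

(* Row-wise variant: apply each function to one row of core values,
   e.g. {1, 5, 9}, rather than to whole columns. This also works for
   non-Listable functions, but loses the vectorized speed-up. *)
Clear[addDerivedValuesRowwise];
addDerivedValuesRowwise[values_?MatrixQ, {}] := values;
addDerivedValuesRowwise[values_?MatrixQ, functions_List] :=
  Join[#, Through[functions[#]]] & /@ values;

The same slot-based functions work here, for example

addDerivedValuesRowwise[{{1, 5, 9}, {2, 6, 10}},
  {#[[1]] + #[[2]] &, #[[2]] + #[[3]] &}]

gives {{1, 5, 9, 6, 14}, {2, 6, 10, 8, 16}}, the same result as Out[7].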
Hope this helps.

Regards,
Leonid


On Mon, Jan 4, 2010 at 2:59 AM, Steve W. Brewer <steve at take5.org> wrote:

> I have several lists of the format:
>
> { {index1, value}, {index2, value}, ... {indexN, value} }
>
> For example:
>
> list1 = { {"A", 1}, {"B", 2}, {"C", 3}, {"D", 4} }
> list2 = { {"A", 5}, {"B", 6}, {"D", 7}, {"E", 8} }
> list3 = { {"A", 9}, {"B", 10}, {"C", 11} }
>
> The indexes are not necessarily strings; they may be any expression. (In
> the specific case I'm addressing now, each index is a list representing a
> date/time in the format returned by DateList[].) The lists are not
> necessarily the same length. Also, while most of the indexes appear in all
> lists, there are some holes (missing data).
>
> I want to combine the lists into a single list of the format:
>
> { { index1, {value1, value2, ... valueN} },
>   { index2, {value1, value2, ... valueN} },
>   ...
>   { indexN, {value1, value2, ... valueN} } }
>
> Only the data points with indexes appearing in all lists should be
> included; the rest should be dropped. Also, I want to include some
> derived values along with the original data values.
>
> Using the sample data above, let's say I want to include two derived
> values from the functions:
>
> f1[list1Data_, list2Data_] := list1Data + list2Data
> f2[list2Data_, list3Data_] := list2Data + list3Data
>
> The result would be:
>
> combinedList = { { "A", {1, 5, 9, 6, 14} },
>                  { "B", {2, 6, 10, 8, 16} } }
>
> I have a solution that works fine on "small" data sets. However, it's
> impractically slow on the "large" data sets I really need to run it on
> (over 100k elements in each list).
>
> Here's what I'm doing now:
>
>
> (* This part executes pretty quickly *)
>
> indexesToUse =
>   Intersection[First /@ list1, First /@ list2, First /@ list3];
>
> valueAtIndex[index_, list_] :=
>   Cases[list, {index, _}, 1, 1] // First // Last;
>
> dataAtIndex[index_] := Block[
>   {v1, v2, v3, vf1, vf2},
>
>   v1 = valueAtIndex[index, list1];
>   v2 = valueAtIndex[index, list2];
>   v3 = valueAtIndex[index, list3];
>
>   vf1 = f1[v1, v2];
>   vf2 = f2[v2, v3];
>
>   {v1, v2, v3, vf1, vf2}
> ];
>
> (* This is where it bogs down *)
>
> combinedList =
>   Function[{index}, {index, dataAtIndex[index]}] /@ indexesToUse;
>
>
> This is all inside an enclosing Module[] along with some other code, and
> the actual code is a little more complex (e.g. more than three lists, more
> than two derived-value functions). The derived-value functions themselves
> are mostly simple algebra; I doubt they're the source of the bottleneck,
> and in any case, I can't change them. (I *can* change the way they're
> applied, though, if it makes a difference.)
>
> I *think* the bottleneck is probably in my repeated calls to Cases[] to
> find particular data points, but that's just a guess.
>
> Is there a more efficient way of doing this that would speed things up
> significantly?
>
> Thanks!
>
>
> Steve W. Brewer
- References:
- Combining data from indexed lists efficiently
- From: "Steve W. Brewer" <steve@take5.org>