Re: Efficiency of repeatedly appending to a List

*To*: mathgroup at smc.vnet.net*Subject*: [mg71170] Re: Efficiency of repeatedly appending to a List*From*: "Philipp" <Philipp.M.O at gmail.com>*Date*: Thu, 9 Nov 2006 03:40:36 -0500 (EST)*References*: <eisg8h$n5a$1@smc.vnet.net>

Andrew, The following addresses your original question & some issues from this question/comment. The most efficient/fastest way of keeping/extending a database of sorted values I have discovered is by using Mathematica Indexed Object. Let's assume that the data is initially stored in a list of a form {{key1, data1}, ..., {keyM, dataM}} for clarity let's further assume that keyK and dataK are String type. For example, In[1]:= data = {{"key01", "data01"}, {"key02", "data02"}, {"key03", "data03"}, {"key04", "data04"}, {"key05", "data05"}, {"key06", "data06"}, {"key07", "data07"}, {"key08", "data08"}, {"key09", "data09"}, {"key10", "data10"}, {"key11", "data11"}, {"key12", "data12"}, {"key13", "data13"}, {"key14", "data14"}}; The following statement will assign the data to the Indexed Object called dataArray, In[2]:= Clear[dataArray]; In[3]:= Scan[(dataArray[First[#]] = Append[dataArray[First[#]] /. {dataArray[_] -> {}}, #]) &, data] or, in more elegant¹ way, Scan[((#1[#2[1]] = Append[#1[#2[1]] /. {#1[_] -> {}}, #2]) &)[dataArray, #1] &, data]; The advantages of this solution are as follows: 1. the key could be "anything" (any syntactically correct expression) 2. several "records" can be associated with a single "key", e.g., In[4]:= addData[dataArray_, r : {k_String, d_String}] := (dataArray[First[#]] = Append[dataArray[First[#]] /. {dataArray[_] -> {}}, #]) &[{k, d}] In[5]:= addData[dataArray, {"key10", "data10B"}] Out[5]= {{key10, data10}, {key10, data10B}} In[6]:= dataArray["key10"] Out[6]= {{key10, data10}, {key10, data10B}} 3. The addition of a new "key" (In[4], In[5]) or retrieval of "data" with a key (In[6]) is of O(1) time complexity, (more precisely, almost constant in time, say O(>1), since Mathematica most probably uses a hashing algorithm and there would be some collisions). 4. deleting a record is also O(1) In[7]:= deleteKey[dataArray_, key_String] := Unset[dataArray[key]] In[8]:= deleteKey[dataArray, "key03"] Extracting the indices is O(n) and can be performed by the following statement, In[9]:= idx = ReleaseHold[Apply[(#1 &), DownValues[#], {1}] /. #[x_] :> x] &[dataArray] Out[9]= {key01, key02, key04, key05, key06, key07, key08, key09, key10, key11, key12, key13, key14} Converting values to a list is also O(n), In[10]:= vlist = ReleaseHold[Apply[(#2 &), DownValues[#], {1}] /. #[x_] :> x] &[dataArray] Out[10]= {{{key01, data01}}, {{key02, data02}}, {{key04, data04}}, {{key05, data05}}, {{key06, data06}}, {{key07, data07}}, {{key08, data08}}, {{key09, data09}}, {{key10, data10}, {key10, data10B}}, {{key11, data11}}, {{key12, data12}}, {{key13, data13}}, {{key14, data14}}} And, of course, operations such as nextKey[dataArray_, key_String] previousKey[dataArray_, key_String] could be implemented also as O(n) by performing binary search on indices (O(n) + O(Log[n]) ~= O(n)). ________________________ ¹obscure, yet cryptic (but hey, we can do it, we've got the technology) ===================================================================== Some numbers indicating O(n) time complexity of Assignment (In[3]) and Index Extraction (In[9]) operations. data comprises a list of "records" in the form {"IRV", {1991, 7, 24}, 41.97, 51.25, 40.45, 44.98, 696000} i.e., each record is a randomly generated stock-market daily price-volume data. The following plot shows normalised execution times of #=Assignment (In[3]) and &=Index Extraction (In[9]) as functions of a number of records in the data list. The number of records ranges from 20,000 to 400,000 in 20,000 steps. # 30## # # && # & Index extraction & && 25## In[9] & # && # && ## & 20# & ## # & & # # & & ## ## & # ## 15# && ## # # & # # && ## 10## && # ## # & ## # # Assignment In[3] # && ## # && && # 5## & && ## # & ## # && && #####&&################################################################ # # # # 100000 200000 300000 It appears from the above picture that both operations are "strongly" linear indicating O(n) time complexity. Cheers, Philipp. Appendix ======== Code to obtain the results and the plot. RandomString[chars_List, len_Integer] := StringJoin[chars[[Table[Random[Integer, {1, Length[chars]}], {len}]]]] Random3Date[] := Take[ToDate[Random[Integer, {1, IntegerPart[FromDate[Date[]]]}]], 3] n = 400000; crangeAZ = CharacterRange["A", "Z"]; data = Table[{RandomString[crangeAZ, Random[Integer, {1, 150}]], Random3Date[], Apply[Sequence, Table[Random[Real, {0., 100.}], {4}]], Random[Integer, {1, 10^6}]}, {n}]; Length[data] x = Table[Clear[dataArray]; {#, First[Timing[Scan[(dataArray[First[#]] = Append[dataArray[First[#]] /. {dataArray[_] -> {}}, #]) &, Take[data, #]]]], First[Timing[t = ReleaseHold[Apply[(#1 &), DownValues[#], {1}] /. #[x_] :> x]] &[dataArray]], Length[t]} &[i], {i, 20000, n, 20000}] /. Second -> 1 {y, z} = Transpose[Apply[{{#1, #2/x[[1, 2]]}, {#1, #3/x[[1, 2]]}} &, x, {1}]] Needs["Graphics`MultipleListPlot`"]; MultipleListPlot[y, z] Phil > Hi all, > > Recently I asked about efficient ways to insert items into a sorted > List such that the new list was also sorted [1]. I received suggestions > to linearly search through the list and find the right place to insert > the new time (search time = O(N), where N is the size of the list), and > also suggestions to use a faster binary search to locate the right > place to insert the new time (search time = O(Log[N])). > > This caused me to wonder about the efficiency of appending items to > Lists in Mathematica in general. Is it an O(N) operation (i.e., > something akin to a realloc in C)? Or does Mathematica employ some sort > of anticipatory memory allocation scheme to make adding new elements to > lists usually faster than O(N)? > > Cheers, > Andrew > > [1] > http://groups-beta.google.com/group/comp.soft-sys.math.mathematica/browse_thread/thread/0c329c83dec5a1b1