Using GatherBy, Select, and Sort to process data.

*To*: mathgroup at smc.vnet.net
*Subject*: [mg110961] Using GatherBy, Select, and Sort to process data.
*From*: Robert McHugh <r62m10 at gmail.com>
*Date*: Wed, 14 Jul 2010 05:36:06 -0400 (EDT)

The two code snippets below are my attempt to process some data. I am looking for recommendations on how to improve them and for pointers to relevant references. I searched the site for similar examples but did not find any.

Regards

=========================================================================
Code fragment #1 -- group data by two parameters:

The intent of the code is the following.
1) Take a list of data, where each row in the list is a set of measurements at a specific time. (For this test we use randomly generated data.)
2) Keep only those data which satisfy some selection criteria. In this case the criteria are that FieldA is close to a multiple of 5 and FieldB is close to a multiple of 2. "Close" is defined by a tolerance parameter, which may be different for each field.
3) Group the results based on these two fields. At the lowest level, each group contains the records that share the same rounded values of FieldA and FieldB. For example, one group holds all records with FieldA close to 5.0 and FieldB close to 2.0, another group holds all records with FieldA close to 5.0 and FieldB close to 4.0, etc. These groups are then grouped according to FieldA.
4) Sort the results. See the final result for what is meant by this.
5) Plot the result, making one plot for each value of FieldA, with a separate point series for each value of FieldB. (Simulated data is noisy; data from the experiments will have more structure.)

I am still learning how to manipulate lists of lists. One issue that I am still working through:
1) How to generate a table with a count of the number of records in each group.
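For the record-count table, one possible approach is sketched here. It is only a sketch under assumptions, not a tested solution for the real data: the four-row list `toy` and the names `toy`, `grouped`, and `counts` are made up for illustration, and the field positions (FieldA in column 2, FieldB in column 3) mirror the settings used in this post. The idea is that GatherBy with a list of two functions returns a nested list, so a function mapped at level 2 sees each innermost group of records.

```mathematica
(* Hypothetical toy data, one record per row: {time, FieldA, FieldB, value} *)
toy = {{1, 5.0, 2.0, 0.3}, {2, 5.0, 2.0, 0.7},
       {3, 5.0, 4.0, 0.1}, {4, 10.0, 2.0, 0.9}};

(* Two-level grouping: first by rounded FieldA, then by rounded FieldB *)
grouped = GatherBy[toy, {Round[#[[2]], 5] &, Round[#[[3]], 2] &}];

(* For each innermost group, report {rounded FieldA, rounded FieldB, record count} *)
counts = Map[{Round[#[[1, 2]], 5], Round[#[[1, 3]], 2], Length[#]} &, grouped, {2}];
(* counts is {{{5, 2, 2}, {5, 4, 1}}, {{10, 2, 1}}} *)

(* Flatten one level to get one table row per group *)
TableForm[Flatten[counts, 1],
  TableHeadings -> {None, {"FieldA", "FieldB", "Count"}}]
```

Applied to the grouped list built in the code that follows, the same Map at level 2 would presumably work unchanged, with iFieldA, iFieldB, iStepA, and iStepB in place of the literal column indices and step sizes.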
=============================================================================
iFieldA = 2;    (* Index of first field used to group data *)
iStepA = 5;     (* Retain data that is a multiple of iStepA +/- iTolA *)
iTolA = 0.01;   (* Tolerance for FieldA *)
iFieldB = 3;    (* Index of second field used to group data *)
iStepB = 2;     (* Retain data that is a multiple of iStepB +/- iTolB *)
iTolB = 0.1;    (* Tolerance for FieldB *)
iFieldC = 1;    (* Index of field used to sort the records in all data groups *)

dataRaw = RandomReal[{0, 12}, {75000, 4}];

(* Keep only records whose FieldA and FieldB are both close to a multiple of their step *)
u = Select[dataRaw,
   Abs[Round[#[[iFieldA]], iStepA] - #[[iFieldA]]] < iTolA &&
    Abs[Round[#[[iFieldB]], iStepB] - #[[iFieldB]]] < iTolB &];

(* Group by rounded FieldA, then by rounded FieldB within each FieldA group *)
v = GatherBy[u, {Round[#[[iFieldA]], iStepA] &, Round[#[[iFieldB]], iStepB] &}];

(* Sort the FieldA groups, then the FieldB groups inside each, then the records inside each *)
w = Sort[v, #1[[1, 1, iFieldA]] < #2[[1, 1, iFieldA]] &];
x = Sort[#, #1[[1, iFieldB]] < #2[[1, iFieldB]] &] & /@ w;
y = Sort[#, #1[[iFieldC]] < #2[[iFieldC]] &] & /@ # & /@ x;

u // MatrixForm;
v // TableForm;
w // TableForm;
x // TableForm;
y // TableForm

(* Make plots for each value of FieldA *)
iFieldx = 1;
iFieldy = 4;
ListPlot[{#[[iFieldx]], #[[iFieldy]]} & /@ # & /@ #,
   PlotStyle -> {AbsolutePointSize[5]},
   PlotRange -> {{0, 12}, {0, 12}}] & /@ y
===============================================================================
===============================================================================
Code snippet #2 -- group data by a single parameter:

The intent of the code is the following.
1) Take a list of data, where each row (or record) in the list is a set of measurements at a specific time. (For this test we use randomly generated data.)
2) Keep only those data records which satisfy some selection criteria. In this case the selection criterion is that the value of iField is sufficiently close to a multiple of 10.
3) Group the data into lists -- one list for each multiple of 10 found in the original data set.

The implementation does the following steps:
a) Use GatherBy to group the data into separate lists. All data which doesn't satisfy the criteria is put into one list of its own (grouping key -1).
b) Find the index of the list which doesn't meet the selection criteria (it could be anywhere in the result).
c) Drop that list, then sort the remaining lists.
d) Report some summary statistics for each multiple of 10 found in the list.
e) Make a plot. (Actual data shows more structure.)
================================================================================
iField = 2;    (* Index of field used to group data *)
iStep = 10;    (* Retain data that is a multiple of iStep +/- iTol *)
iTol = 0.01;   (* Tolerance for the selected field *)

(* Generate simulated data set *)
Timing[dataRaw = RandomReal[{0, 100}, {1000000, 5}];]

(* Group by nearest multiple of iStep; records outside the tolerance share the key -1 *)
Timing[w = GatherBy[dataRaw,
    If[Abs@(Round[#, iStep] - #) < iTol, Round[#, iStep], -1] &@ Part[#, iField] &];]

(* Position of the rejected group, found by testing the first record of each group *)
m = ((Position[#, a_ /; ! (Abs@(Round[a, iStep] - a) < iTol)] &)@
     w[[All, 1, iField]]) // Flatten;

(* Drop the rejected group and sort the remaining groups by the selected field *)
w1 = Drop[w, m] // Sort[#, #1[[1, iField]] < #2[[1, iField]] &] &;

(* Summary per group: rounded value, number of records, minimum, maximum *)
Table[{Part[#, iField]} & /@ w1[[i]] //
   {Round[#1][[1, 1]], Length[#], Min[#], Max[#]} &, {i, Length[w1]}] // MatrixForm

(* Plot field 3 against field 1, one series per group *)
xyData = Table[{#[[1]], #[[3]]} & /@ w1[[i, All]], {i, Length[w1]}];
ListPlot[xyData]
================================================================================
Again, recommendations and pointers to relevant references are sought.

Thanks.