MathGroup Archive 2010

Using GatherBy, Select, and Sort to process data.

  • To: mathgroup at smc.vnet.net
  • Subject: [mg110961] Using GatherBy, Select, and Sort to process data.
  • From: Robert McHugh <r62m10 at gmail.com>
  • Date: Wed, 14 Jul 2010 05:36:06 -0400 (EDT)

The two code snippets below are my attempts to process some data.
Recommendations on how to improve them, and pointers to relevant
references, are sought.
I searched the site for similar examples but did not find any.
Regards

=========================================================================
Code snippet #1 -- group data by two parameters:
The intent of the code is the following.
1) Take a list of data, where each row in the list is a set of measurements
at a specific time. (For this test we use randomly generated data.)

2) Keep only those data which satisfy some selection criteria.  In this
case the selection criterion is that FieldA is close to a multiple of 5
and FieldB is close to a multiple of 2.  "Close" is defined by a tolerance
parameter, which may be different for each field.

3) Group the results based on these two fields.  At the lowest level, the
grouping will contain a list of records that match the same pair of
multiples. For example, one group with all records having FieldA close to
5.0 and FieldB close to 2.0, another group with all records having FieldA
close to 5.0 and FieldB close to 4.0, etc.  These groups are then grouped
according to FieldA.

4) Sort the results.  See final result for what is meant by this.

5) Plot the result, making one plot for each value of FieldA, with a
separate data series for each value of FieldB.  (Simulated data is noisy;
data from the experiments will have more structure.)
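
A minimal sketch of steps 2 and 3 on five hand-made records, assuming each
record is laid out as {FieldC, FieldA, FieldB, value}.  The helper name near
and the sample values are illustrative only and are not taken from the real
data:

```mathematica
(* near: hypothetical helper, True when x is within tol of a multiple of step *)
near[x_, step_, tol_] := Abs[Round[x, step] - x] < tol;

(* Made-up records: {FieldC, FieldA, FieldB, value} *)
recs = {{1, 5.001, 2.03, 7.}, {2, 10.002, 4.05, 1.},
        {3, 5.003, 3.96, 2.}, {4, 5.002, 1.98, 3.}, {5, 7.5, 2.0, 9.}};

(* Step 2: keep records whose FieldA and FieldB are both close to a multiple *)
kept = Select[recs, near[#[[2]], 5, 0.01] && near[#[[3]], 2, 0.1] &];

(* Step 3: group by FieldA, then by FieldB within each FieldA group *)
groups = GatherBy[kept, {Round[#[[2]], 5] &, Round[#[[3]], 2] &}];

Map[Length, groups, {2}]   (* {{2, 1}, {1}} *)
```

The fifth record is rejected because its FieldA value (7.5) is 2.5 away from
the nearest multiple of 5.  The remaining four fall into two FieldA groups
(5 and 10), and the FieldA = 5 group splits into FieldB = 2 and FieldB = 4.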

I am still learning how to manipulate lists of lists.  One issue that I am
still working through:
1) Generating a table with a count of the number of records in each group.

=============================================================================
iFieldA = 2;   (* Index of the first field used to group data *)
iStepA = 5;    (* Retain data where FieldA is a multiple of iStepA +/- iTolA *)
iTolA = 0.01;  (* Tolerance for FieldA *)
iFieldB = 3;   (* Index of the second field used to group data *)
iStepB = 2;    (* Retain data where FieldB is a multiple of iStepB +/- iTolB *)
iTolB = 0.1;   (* Tolerance for FieldB *)
iFieldC = 1;   (* Index of the field used to sort records within each group *)
dataRaw = RandomReal[{0, 12}, {75000, 4}] ;
u = Select[dataRaw,
   Abs[Round[#[[iFieldA]], iStepA] - #[[iFieldA]]] < iTolA &&
    Abs[Round[#[[iFieldB]], iStepB] - #[[iFieldB]]] < iTolB &];
v = GatherBy[
   u, {Round [#[[iFieldA]], iStepA] &,
    Round [#[[iFieldB]], iStepB] &}];
w = Sort[v, #1[[1, 1, iFieldA]] < #2[[1, 1, iFieldA]] & ];
x = Sort[ #, #1[[1, iFieldB]] < #2[[1, iFieldB]] &] & /@ w;
y = Sort[ #, #1[[iFieldC]] < #2[[iFieldC]] &] & /@ # & /@ x;
u // MatrixForm;
v // TableForm;
w // TableForm;
x // TableForm;
y // TableForm
(*Make plots for each value of FieldA*)
iFieldx = 1;
iFieldy = 4;
ListPlot[{#[[iFieldx]], #[[iFieldy]]} & /@ # & /@ #,
   PlotStyle -> {AbsolutePointSize[5]},
   PlotRange -> {{0, 12}, {0, 12}}] & /@ y

===============================================================================
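
For the open issue mentioned earlier (a table with the number of records in
each group), one possible sketch applies Tally to the rounded
(FieldA, FieldB) pair of each retained record.  The data set, the seed, and
the helper name near are stand-ins chosen for illustration; the field
indices, steps, and tolerances mirror the values used in snippet #1:

```mathematica
SeedRandom[1];                              (* reproducible stand-in data *)
data = RandomReal[{0, 12}, {75000, 4}];

(* near: hypothetical helper, True when x is within tol of a multiple of step *)
near[x_, step_, tol_] := Abs[Round[x, step] - x] < tol;

(* Keep records with field 2 near a multiple of 5 and field 3 near a multiple of 2 *)
kept = Select[data, near[#[[2]], 5, 0.01] && near[#[[3]], 2, 0.1] &];

(* Tally returns {{fieldA, fieldB}, count} entries; Sort orders them by key *)
counts = Sort[Tally[{Round[#[[2]], 5], Round[#[[3]], 2]} & /@ kept]];

(* Flatten each entry to a row {fieldA, fieldB, count} for display *)
TableForm[Flatten /@ counts,
 TableHeadings -> {None, {"FieldA", "FieldB", "Count"}}]
```

If the nested list y from the snippet is already available,
Map[Length, y, {2}] should give the same counts as a nested list, one row
per FieldA value.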
===============================================================================
Code snippet #2 -- group data by a single parameter:
The intent of the code is the following.
1) Take a list of data, where each row (or record) in the list is a set of
measurements at a specific time. (For this test we use randomly generated
data.)
2) Keep only those data records which satisfy some selection criterion.  In
this case the selection criterion is that the value of the field with index
iField is sufficiently close to a multiple of 10.
3) Group the data into lists -- one list for each multiple of 10 found in
the original data set.

The implementation does the following steps:
a)  Use GatherBy to group the data into separate lists.  Data which doesn't
satisfy the criterion is put into a single list of its own.
b)  Find the index of the list holding the data which doesn't meet the
selection criterion (it could be anywhere).
c)  Drop that list, then sort the remaining lists.
d)  Report some summary statistics for each multiple of 10 found in the
list.
e)  Make a plot.  (Actual data shows more structure.)
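
Steps a) and b) can be illustrated on a handful of made-up numbers.  This is
only a sketch; the helper name key and the sample values are hypothetical:

```mathematica
(* key: hypothetical helper; the nearest multiple of step when x is within
   tol of it, otherwise the sentinel value -1 *)
key[x_, step_, tol_] := If[Abs[Round[x, step] - x] < tol, Round[x, step], -1];

vals = {10.001, 37.2, 19.995, 55.5, 9.998};

GatherBy[vals, key[#, 10, 0.01] &]
(* {{10.001, 9.998}, {37.2, 55.5}, {19.995}} *)
```

GatherBy orders the groups by first appearance of each key, so the rejected
values (37.2 and 55.5, both keyed -1) end up in the second position here.
That is why the index of the rejected group has to be looked up rather than
assumed.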
================================================================================
iField = 2;   (* Index of the field used to group data *)
iStep = 10;   (* Retain data that is a multiple of iStep +/- iTol *)
iTol = 0.01;  (* Tolerance around each multiple of iStep *)
Timing[dataRaw =
   RandomReal[{0, 100}, {1000000, 5}];] (*Generate simulated data set*)
Timing[w =
   GatherBy[dataRaw,
    If[Abs@(Round[#, iStep] - #) < iTol, Round[#, iStep], -1] &@
      Part[#, iField] &];]
m = ((Position[#, a_ /; ! (Abs@(Round[a, iStep] - a) < iTol)] &)@
     w[[All, 1, iField]]) // Flatten;
w1 = Drop[w, m] // Sort[#, #1[[1, iField]] < #2[[1, iField]] &] &;
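(* Per group: nearest multiple of iStep, record count, min and max of the field *)
Table[
  With[{vals = w1[[i, All, iField]]},
   {Round[vals[[1]], iStep], Length[vals], Min[vals], Max[vals]}],
  {i, Length[w1]}] // MatrixForm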
xyData = Table[{#[[1]], #[[3]]} & /@ w1[[i, All]], {i, Length[w1]}];
ListPlot[xyData]
================================================================================
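
One possible alternative to steps a) through d), offered only as a sketch:
filter with Select first, so that no sentinel group is created and nothing
has to be located and dropped afterwards, then group and order with GatherBy
and SortBy.  The names data, field, step, tol, kept, groups, and stats are
illustrative, and the data set is smaller than the one above to keep the run
short:

```mathematica
SeedRandom[1];                                   (* reproducible stand-in data *)
data = RandomReal[{0, 100}, {100000, 5}];
field = 2; step = 10; tol = 0.01;                (* same roles as iField, iStep, iTol *)

(* Keep only records whose chosen field is within tol of a multiple of step *)
kept = Select[data, Abs[Round[#[[field]], step] - #[[field]]] < tol &];

(* Group by the nearest multiple, then order the groups by that multiple *)
groups = SortBy[GatherBy[kept, Round[#[[field]], step] &],
   Round[#[[1, field]], step] &];

(* Per group: multiple, record count, min and max of the grouping field *)
stats = {Round[#[[1, field]], step], Length[#],
     Min[#[[All, field]]], Max[#[[All, field]]]} & /@ groups;

TableForm[stats, TableHeadings -> {None, {"Multiple", "Count", "Min", "Max"}}]
```

Whether this is faster than the single GatherBy pass on the full data set is
not something I have measured; wrapping each step in Timing, as in the
snippet, would show the difference.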

Again, recommendations and pointers to relevant references are sought.
Thanks.

