Re: Processing large data sets
- To: mathgroup at smc.vnet.net
- Subject: [mg94186] Re: Processing large data sets
- From: David Bailey <dave at removedbailey.co.uk>
- Date: Sat, 6 Dec 2008 19:56:42 -0500 (EST)
- References: <ghdmvj$fr0$1@smc.vnet.net>
Ravi Balasubramanian wrote:
> Hello friends,
>
> I have to process many (10) large data sets (each ~150 MB, 500K rows,
> 19 columns) of numbers with 4 significant digits with operations like
> Principal Component Analysis (on all the files at the same time!). The
> large data sets are causing the memory to fill up quickly.
>
> Admittedly, the data can still be downsampled along the rows, and I
> will do that. I also maxed out the RAM on my machine (OS X 10.4,
> MacBook Pro, 3 GB RAM).
>
> But are there any other ways to do this, for example, modify the
> amount of memory each entry of the table takes up? Currently, I use
> Import to load the table into memory. Help is appreciated.
>
> Ravi
> University of Washington
> Seattle, WA.

It would help to see some of your code. Perhaps the first question is to
determine whether even one data set will run through your code.

Reals are always stored within Mathematica as 8-byte numbers, but an
array of Reals will use substantially more memory than that unless it is
packed. See the help for Developer`PackedArrayQ and the other functions
referenced from there. It is probably vital to ensure that your large
arrays are all packed (a short packing sketch is appended below).

Although Import is very convenient, it reads the whole array in one go.
If you are going to downsample the data, it may be better to read it a
little at a time using Read, so that you don't run short of memory
before you can thin the array out (see the chunked-read sketch appended
below).

David Bailey
http://www.dbaileyconsultancy.co.uk
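
For the packing point, something like the following is a minimal sketch;
the file name "data.csv" and the assumption that the file is a plain
rectangular numeric table are illustrative, not taken from Ravi's setup:

  (* Import typically returns an unpacked list of lists; repack it once
     it is purely numeric. *)
  raw = Import["data.csv"];                    (* hypothetical file name *)
  Developer`PackedArrayQ[raw]                  (* often False straight after Import *)

  packed = Developer`ToPackedArray[N[raw]];    (* packs only if every element is
                                                  numeric and the rows are rectangular *)
  Developer`PackedArrayQ[packed]               (* should now be True *)

  {ByteCount[raw], ByteCount[packed]}          (* compare memory use before and after *)

If ToPackedArray hands back its argument unchanged, there is usually a
stray string (a header row, say) or a ragged row preventing packing.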
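
And for reading a little at a time, here is one possible shape for a
read-and-thin loop. The file name, column count, chunk size, and
thinning factor are all placeholders, and it assumes whitespace-separated
numbers rather than a comma-delimited file:

  (* Read ncols numbers per row, chunkSize rows at a time, keeping every
     keepEvery-th row, so the full table never sits in memory at once. *)
  thinFile[file_, ncols_, chunkSize_, keepEvery_] :=
    Module[{stream = OpenRead[file], chunk, kept = {}},
      While[(chunk = ReadList[stream, Table[Number, {ncols}], chunkSize]) =!= {},
        kept = Join[kept, Take[chunk, {1, -1, keepEvery}]]
      ];
      Close[stream];
      Developer`ToPackedArray[N[kept]]
    ]

  sampled = thinFile["data.txt", 19, 10000, 10];

Because each chunk is discarded as soon as its kept rows have been joined
on, the peak memory use is roughly one chunk plus the thinned result,
rather than the whole 500K-row table.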