Re: Processing large data sets
- To: mathgroup at smc.vnet.net
- Subject: [mg94186] Re: Processing large data sets
- From: David Bailey <dave at removedbailey.co.uk>
- Date: Sat, 6 Dec 2008 19:56:42 -0500 (EST)
- References: <ghdmvj$fr0$1@smc.vnet.net>
Ravi Balasubramanian wrote:
> Hello friends,
>
> I have to process many (10) large data sets (each ~150 MB, 500K rows,
> 19 columns) of numbers with 4 significant digits with operations like
> Principal Component Analysis (on all the files at the same time!). The
> large data sets are causing the memory to fill up quickly.
>
> Admittedly, the data can still be downsampled along the rows, and I
> will do that. I also maxed out the RAM on my machine (OS X 10.4,
> MacBook Pro, 3 GB RAM).
>
> But are there any other ways to do this, for example, modify the
> amount of memory each entry of the table takes up? Currently, I use
> Import to load the table into memory. Help is appreciated.
>
> Ravi
> University of Washington
> Seattle, WA.

It would help to see some of your code. Perhaps the first question is to
determine whether even one data set will run through your code.

Reals are always stored within Mathematica as 8-byte numbers, but an
array of Reals will use substantially more memory than that unless it is
packed. See the help for Developer`PackedArrayQ and the other functions
referenced from there. It is probably vital to ensure that your large
arrays are all packed (a short packing sketch is appended below).

Although Import is very convenient, it reads the whole array in one go.
If you are going to downsample the data, it may be better to read it a
little at a time using Read, so that you don't run short of memory
before you can thin the array out (see the chunked-read sketch appended
below).

David Bailey
http://www.dbaileyconsultancy.co.uk
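
For the packing point, something like the following is a minimal sketch;
the file name "data.csv" and the assumption that the file is a plain
rectangular numeric table are illustrative, not taken from Ravi's setup:

  (* Import typically returns an unpacked list of lists; repack it once
     it is purely numeric. *)
  raw = Import["data.csv"];                    (* hypothetical file name *)
  Developer`PackedArrayQ[raw]                  (* often False straight after Import *)

  packed = Developer`ToPackedArray[N[raw]];    (* packs only if every element is
                                                  numeric and the rows are rectangular *)
  Developer`PackedArrayQ[packed]               (* should now be True *)

  {ByteCount[raw], ByteCount[packed]}          (* compare memory use before and after *)

If ToPackedArray hands back its argument unchanged, there is usually a
stray string (a header row, say) or a ragged row preventing packing.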
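
And for reading a little at a time, here is one possible shape for a
read-and-thin loop. The file name, column count, chunk size, and
thinning factor are all placeholders, and it assumes whitespace-separated
numbers rather than a comma-delimited file:

  (* Read ncols numbers per row, chunkSize rows at a time, keeping every
     keepEvery-th row, so the full table never sits in memory at once. *)
  thinFile[file_, ncols_, chunkSize_, keepEvery_] :=
    Module[{stream = OpenRead[file], chunk, kept = {}},
      While[(chunk = ReadList[stream, Table[Number, {ncols}], chunkSize]) =!= {},
        kept = Join[kept, Take[chunk, {1, -1, keepEvery}]]
      ];
      Close[stream];
      Developer`ToPackedArray[N[kept]]
    ]

  sampled = thinFile["data.txt", 19, 10000, 10];

Because each chunk is discarded as soon as its kept rows have been joined
on, the peak memory use is roughly one chunk plus the thinned result,
rather than the whole 500K-row table.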