MathGroup Archive 2008


Re: Processing large data sets

  • To: mathgroup at smc.vnet.net
  • Subject: [mg94185] Re: Processing large data sets
  • From: Szabolcs Horvát <szhorvat at gmail.com>
  • Date: Sat, 6 Dec 2008 19:56:31 -0500 (EST)
  • Organization: University of Bergen
  • References: <ghdmvj$fr0$1@smc.vnet.net>

Ravi Balasubramanian wrote:
> Hello friends,
> 
> I have to process many (10) large data sets (each ~150 MB, 500K rows, 19 
> columns) of numbers with 4 significant digits with operations like 
> Principal Component Analysis (on all the files at the same time!).  The 
> large data sets are causing the memory to fill up quickly.
> 
> Admittedly, the data can still be downsampled along the rows, and I will 
> do that.  I also maxed out the RAM on my machine (OSX 10.4, MacBook 
> Pro, 3 GB RAM).
> 
> But are there any other ways to do this, for example, modify the amount 
> of memory each entry of the table takes up?  Currently, I use Import to 
> load the table into memory.  Help is appreciated.
> 

Hello Ravi,

I have run into similar problems myself (on a machine with even less 
memory), and if I remember correctly, the following helped:

1. I had the data in plain text files.  Reading it with ReadList[..., 
Real, RecordLists -> True] was faster than Import[..., "Table"].

Import[..., "Table"] will autoidentify data types, and import numbers 
without a decimal point as integers, giving longer processing times and 
a less efficient representation.
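For example (a minimal sketch; the file name is hypothetical, the 
19-column layout is taken from your description):

data = ReadList["data1.txt", Real, RecordLists -> True];
Dimensions[data]  (* should give {numberOfRows, 19} *)

RecordLists -> True makes ReadList return one sublist per input line, 
so the result is a matrix rather than a flat list of numbers.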

2. After you've read in the data, make sure that you convert it to a 
packed array:

data = Developer`ToPackedArray[data];

This will only work if the data is homogeneous (i.e. all 
machine-precision reals; apply N if it is not) and forms a proper 
matrix (i.e. each row has the same length).
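You can check whether the conversion succeeded, and how much memory it 
saves, like this (a small sketch; the N takes care of any stray 
integers):

data = Developer`ToPackedArray[N[data]];
Developer`PackedArrayQ[data]  (* True if data is now packed *)
ByteCount[data]               (* much smaller for a packed array *)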

3. To avoid filling up the memory, make sure that Mathematica does not 
remember earlier results: set $HistoryLength = 0 immediately after 
kernel startup.
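In a fresh session this is just a one-line assignment (you can also put 
it in the kernel's init.m so it is always in effect):

$HistoryLength = 0;  (* do not keep Out[..] values of earlier evaluations *)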

4. If possible, process the data files one by one.  Don't keep them all 
in memory unless strictly necessary.  Free up variables like this:

data =.
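Putting the pieces together, a minimal sketch of per-file processing 
might look like this (the file names are hypothetical, and Mean stands 
in for whatever per-file statistics your PCA actually needs):

Do[
 data = Developer`ToPackedArray[
   ReadList["data" <> ToString[i] <> ".txt", Real, RecordLists -> True]];
 Print[Mean[data]];  (* replace with the real per-file computation *)
 data =.,            (* free the memory before the next file is read *)
 {i, 10}]

This way only one data set is in memory at a time instead of all ten.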

