Re: Processing large data sets
- To: mathgroup at smc.vnet.net
- Subject: [mg94185] Re: Processing large data sets
- From: Szabolcs Horvát <szhorvat at gmail.com>
- Date: Sat, 6 Dec 2008 19:56:31 -0500 (EST)
- Organization: University of Bergen
- References: <ghdmvj$fr0$1@smc.vnet.net>
Ravi Balasubramanian wrote:
> Hello friends,
>
> I have to process many (10) large data sets (each ~150 MB, 500K rows, 19
> columns) of numbers with 4 significant digits with operations like
> Principal Component Analysis (on all the files at the same time!).  The
> large data sets are causing the memory to fill up quickly.
>
> Admittedly, the data can still be downsampled along the rows, and I will
> do that.  I also maxed out the RAM on my machine (OSX 10.4, Mac Book
> Pro, 3 GB RAM).
>
> But are there any other ways to do this, for example, modify the amount
> of memory each entry of the table takes up?  Currently, I use Import to
> load the table into memory.  Help is appreciated.

Hello Ravi,

I ran into similar problems myself (on a machine with less memory), and if I remember correctly, the following helped:

1. I had the data in plain text files.  Reading them with

      ReadList[..., Real, RecordLists -> True]

   was faster than Import[..., "Table"].  Import[..., "Table"] will auto-detect data types and import numbers without a decimal point as integers, giving longer processing times and a less memory-efficient representation.

2. After you have read in the data, make sure that you change the representation to a packed array:

      data = Developer`ToPackedArray[data];

   This works if the data types are homogeneous (i.e. all machine-precision reals; use N if this is not the case) and you have a proper matrix (i.e. each row has the same length).

3. To avoid filling up the memory, make sure that Mathematica does not remember earlier results: set $HistoryLength = 0 immediately after kernel startup.

4. If possible, process the data files one by one, and don't keep them all in memory unless strictly necessary.  Free up variables like this:

      data =.
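Putting the pieces together, a rough sketch might look like the following (untested; the file names and the running column-sum computation are only placeholders for whatever per-file processing you actually need):

   $HistoryLength = 0;  (* step 3: don't keep earlier outputs alive *)

   files = {"data1.dat", "data2.dat"};  (* placeholder file names *)

   columnTotal = 0;
   rowCount = 0;

   Do[
     (* step 1: ReadList is faster than Import[..., "Table"]
        and reads every number as a machine real *)
     data = ReadList[f, Real, RecordLists -> True];

     (* step 2: pack the array; Developer`PackedArrayQ[data]
        should now return True *)
     data = Developer`ToPackedArray[data];

     (* placeholder per-file processing: accumulate column sums *)
     columnTotal += Total[data];
     rowCount += Length[data];

     (* step 4: free the data before reading the next file *)
     data =.,

     {f, files}
   ];

   columnMeans = columnTotal/rowCount

This keeps only one file in memory at a time; the column-mean accumulation is just there to show where the per-file processing would go.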