Re: Slow Import of CSV files
- To: mathgroup at smc.vnet.net
- Subject: [mg79448] Re: Slow Import of CSV files
- From: j.f.b.payne at tesco.net
- Date: Fri, 27 Jul 2007 05:46:03 -0400 (EDT)
- References: <f86ren$ph5$1@smc.vnet.net><f89pti$5iq$1@smc.vnet.net>
Hi Jean-Marc,

Thanks for the suggestion. You are right, it depends on the csv file. For your example I get

    $Version
    "6.0 for Microsoft Windows (32-bit) (June 19, 2007)"

    Timing[data = Import["c:/temp/myfile.csv"];]
    {52.786, Null}

    $Version
    "5.2 for Microsoft Windows (June 20, 2005)"

    Timing[data = Import["c:/temp/myfile.csv"];]
    {49.641 Second, Null}

So version 5.2 is quicker on my system, but only marginally (the difference from your result will be because this is a Pentium III whereas you, I suppose, have a Pentium IV which has speeded up some instructions more than others).

However, the relevant result is with a file more like mine, which has fairly small fixed point numbers (your file has almost all very large floating point numbers, close to $MaxMachineNumber). So

    data = RandomReal[{0, 1500}, {3 10^5, 3}];
    Timing[Export["c:/temp/myfile2.csv", data] ]

(I like / because it doesn't have to be doubled, and it works on both Linux and Windows.)

The speed of version 6 is not really affected by the difference in the files

    In[1]:= $Version
    Out[1]= "6.0 for Microsoft Windows (32-bit) (June 19, 2007)"

    In[2]:= Timing[data = Import["c:/temp/myfile2.csv"];]
    Out[2]= {51.885, Null}

    In[3]:= MaxMemoryUsed[]
    Out[3]= 139109800

but version 5.2 is about 2.5x quicker for this file

    In[1]:= $Version
    Out[1]= "5.2 for Microsoft Windows (June 20, 2005)"

    In[2]:= Timing[data = Import["c:/temp/myfile2.csv"];]
    Out[2]= {20.239 Second, Null}

    In[3]:= MaxMemoryUsed[]
    Out[3]= 33143280

So it seems that the improvement in version 6 is that Import speed is insensitive to number form!

I've now had an update from Technical Support which says

"Most of the differences about the importer are internal. The new importer can import much larger files than version 5.2. The amount of memory taken to import a large file has been reduced by almost 3 times. Parsing of dates has been considerably improved. We can import many more variations than before."
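As a quick check of the figures quoted for myfile2.csv, the ratios can be worked out directly. This is only a sketch of the arithmetic: the variable names (time6, time52, mem6, mem52) are made up for illustration, and the numbers are simply copied from the two sessions above.

```mathematica
(* Figures copied from the sessions above for myfile2.csv *)
time6 = 51.885; time52 = 20.239;       (* seconds reported by Timing *)
mem6 = 139109800; mem52 = 33143280;    (* bytes reported by MaxMemoryUsed[] *)

time6/time52     (* import time, version 6 relative to version 5.2 *)
N[mem6/mem52]    (* peak memory, version 6 relative to version 5.2 *)
```

The first ratio comes out at roughly 2.6 and the second at roughly 4.2. Note that MaxMemoryUsed[] reports the peak for the whole kernel session, so the memory ratio is at best a rough indication of what Import itself needs.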
On the face of it, the above results show that version 6 uses about 4 times _more_ memory than version 5.2, but maybe there's memory used in a Java process in 5.2 or something?

In version 6, Import has a "DateStringFormat" option (I _think_ set to "None" by default, but the Help Table doesn't have headings) whereas version 5.2 had DateStyle. It would be nice if there was a way to turn off date import (maybe "DateStringFormat"->"NoneAtAll" ?) and get the factor ~3 speed improvement of version 5.2.

Regards

John Payne