Re: Slow Import of CSV files
- To: mathgroup at smc.vnet.net
- Subject: [mg79499] Re: Slow Import of CSV files
- From: Jean-Marc Gulliet <jeanmarc.gulliet at gmail.com>
- Date: Sat, 28 Jul 2007 05:26:20 -0400 (EDT)
- Organization: The Open University, Milton Keynes, UK
- References: <f86ren$ph5$1@smc.vnet.net><f89pti$5iq$1@smc.vnet.net> <f8cf9c$2r7$1@smc.vnet.net>
j.f.b.payne at tesco.net wrote:
> Hi Jean-Marc
>
> Thanks for the suggestion. You are right, it depends on the csv file.
> For your example I get
>
> $Version
>
> "6.0 for Microsoft Windows (32-bit) (June 19, 2007)"
>
> Timing[data = Import["c:/temp/myfile.csv"];]
>
> {52.786, Null}
>
> $Version
>
> "5.2 for Microsoft Windows (June 20, 2005)"
>
> Timing[data=Import["c:/temp/myfile.csv"];]
>
> {49.641 Second, Null}
>
> So version 5.2 is quicker on my system, but only marginally (the
> difference from your result will be because this is a Pentium III
> whereas you, I suppose, have a Pentium IV which has speeded up some
> instructions more than others).

You are right: Pentium IV HT 2.6 GHz, 1 GB RAM.

> However, the relevant result is with a file more like mine, which has
> fairly small fixed point numbers (your file has almost all very large
> floating point numbers, close to $MaxMachineNumber).
>
> So
>
> data = RandomReal[{0, 1500}, {3 10^5, 3}];
>
> Timing[Export["c:/temp/myfile2.csv", data] ]

OK, here is my timing with the above code/data (about 16 MB on disk):

In[1]:= data = RandomReal[{0, 1500}, {3 10^5, 3}];
Timing[Export["c:/temp/myfile2.csv", data]]

Out[2]= {33.516, "c:/temp/myfile2.csv"}

> (I like / because it doesn't have to be doubled, works on both Linux
> and Windows)
> The speed of version 6 is not really affected by the difference in the
> files
>
> In[1]:= $Version
>
> Out[1]= "6.0 for Microsoft Windows (32-bit) (June 19, 2007)"
>
> In[2]:= Timing[data = Import["c:/temp/myfile2.csv"];]
>
> Out[2]= {51.885, Null}
>
> In[3]:= MaxMemoryUsed[]
>
> Out[3]= 139109800

In[1]:= $Version
Timing[data = Import["c:/temp/myfile2.csv"];]
MaxMemoryUsed[]

Out[1]= "6.0 for Microsoft Windows (32-bit) (June 19, 2007)"
Out[2]= {17.375, Null}
Out[3]= 139230320

> but version 5.2 is about 2.5x quicker for this file
>
> In[1]:=$Version
>
> Out[1]= "5.2 for Microsoft Windows (June 20, 2005)"
>
> In[2]:=Timing[data=Import["c:/temp/myfile2.csv"];]
>
> Out[2]= {20.239 Second, Null}
>
> In[3]:=MaxMemoryUsed[]
>
> Out[3]= 33143280

In[1]:= $Version
Timing[data=Import["c:/temp/myfile2.csv"];]
MaxMemoryUsed[]

Out[1]= 5.2 for Microsoft Windows (June 20, 2005)
Out[2]= {9.468 Second,Null}
Out[3]= 33145168

> So it seems that the improvement in version 6 is that Import speed is
> insensitive to number form!
>
> I've now had an update from Technical Support which says
> "Most of the differences about the importer are internal. The new
> importer can import much larger files than version 5.2. The amount of
> memory taken to import a large file has been reduced by almost 3
> times. Parsing of dates has been considerably improved. We can import
> many more variations than before."
>
> On the face of it, the above results show that version 6 uses about 4
> times _more_ memory than version 5.2, but maybe there's memory used in
> a Java process in 5.2 or something?
>
> In version 6, Import has a "DateStringFormat" option (I _think_ set to
> "None" by default, but the Help Table doesn't have headings) whereas
> version 5.2 had DateStyle.
> It would be nice if there was a way to turn off date import (maybe
> "DateStringFormat"->"NoneAtAll" ?) and get the factor ~3 speed
> improvement of version 5.2
>
> Regards
>
> John Payne

So, with your example file, I have got results similar to yours: v6.0.1
/is/ slower than 5.2 in this case (17 s vs 9 s, respectively), and 5.2
also has the lower memory consumption (about 33 MB vs 139 MB).

Regards,
Jean-Marc
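
For reference, the ratios implied by the figures reported above for myfile2.csv can be checked with a few lines. This is only a sketch: the numbers are copied from the two sets of results in this thread, and the variable names are illustrative.

```mathematica
(* Figures copied from this thread for myfile2.csv; variable names are illustrative *)
{t6jm, t52jm} = {17.375, 9.468};     (* Jean-Marc: Import time in seconds, v6 and v5.2 *)
{t6jp, t52jp} = {51.885, 20.239};    (* John: Import time in seconds, v6 and v5.2 *)
{m6, m52} = {139230320, 33145168};   (* Jean-Marc: MaxMemoryUsed[] in bytes, v6 and v5.2 *)

t6jm/t52jm    (* slowdown of v6 relative to v5.2 on Jean-Marc's machine *)
t6jp/t52jp    (* the same ratio on John's machine *)
N[m6/m52]     (* peak memory ratio, v6 over v5.2 *)
```

The three ratios come out at roughly 1.8, 2.6 and 4.2, consistent with the "about 2.5x quicker" and "about 4 times more memory" estimates quoted above.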