Re: Slow Import of CSV files
- To: mathgroup at smc.vnet.net
- Subject: [mg79499] Re: Slow Import of CSV files
- From: Jean-Marc Gulliet <jeanmarc.gulliet at gmail.com>
- Date: Sat, 28 Jul 2007 05:26:20 -0400 (EDT)
- Organization: The Open University, Milton Keynes, UK
- References: <f86ren$ph5$1@smc.vnet.net><f89pti$5iq$1@smc.vnet.net> <f8cf9c$2r7$1@smc.vnet.net>
j.f.b.payne at tesco.net wrote:
> Hi Jean-Marc
>
> Thanks for the suggestion. You are right, it depends on the csv file.
> For your example I get
>
> $Version
>
> "6.0 for Microsoft Windows (32-bit) (June 19, 2007)"
>
> Timing[data = Import["c:/temp/myfile.csv"];]
>
> {52.786, Null}
>
> $Version
>
> "5.2 for Microsoft Windows (June 20, 2005)"
>
> Timing[data=Import["c:/temp/myfile.csv"];]
>
> {49.641 Second, Null}
>
> So version 5.2 is quicker on my system, but only marginally (the
> difference from your result will be because this is a Pentium III
> whereas you, I suppose, have a Pentium IV which has speeded up some
> instructions more than others).
You are right: Pentium IV HT, 2.6 GHz, 1 GB RAM.
> However, the relevant result is with a file more like mine, which has
> fairly small fixed point numbers (your file has almost all very large
> floating point numbers, close to $MaxMachineNumber).
>
> So
>
> data = RandomReal[{0, 1500}, {3 10^5, 3}];
>
> Timing[Export["c:/temp/myfile2.csv", data] ]
OK, here is my timing with the above code/data (about 16 MB on disk):
In[1]:= data = RandomReal[{0, 1500}, {3 10^5, 3}];
Timing[Export["c:/temp/myfile2.csv", data]]
Out[2]= {33.516, "c:/temp/myfile2.csv"}
> (I like / because it doesn't have to be doubled, works on both Linux
> and Windows)
> The speed of version 6 is not really affected by the difference in the
> files
>
> In[1]:= $Version
>
> Out[1]= "6.0 for Microsoft Windows (32-bit) (June 19, 2007)"
>
> In[2]:= Timing[data = Import["c:/temp/myfile2.csv"];]
>
> Out[2]= {51.885, Null}
>
> In[3]:= MaxMemoryUsed[]
>
> Out[3]= 139109800
In[1]:= $Version
Timing[data = Import["c:/temp/myfile2.csv"];]
MaxMemoryUsed[]
Out[1]= "6.0 for Microsoft Windows (32-bit) (June 19, 2007)"
Out[2]= {17.375, Null}
Out[3]= 139230320
> but version 5.2 is about 2.5x quicker for this file
>
> In[1]:=$Version
>
> Out[1]= "5.2 for Microsoft Windows (June 20, 2005)"
>
> In[2]:=Timing[data=Import["c:/temp/myfile2.csv"];]
>
> Out[2]= {20.239 Second, Null}
>
> In[3]:=MaxMemoryUsed[]
>
> Out[3]= 33143280
In[1]:=
$Version
Timing[data=Import["c:/temp/myfile2.csv"];]
MaxMemoryUsed[]
Out[1]=
5.2 for Microsoft Windows (June 20, 2005)
Out[2]=
{9.468 Second,Null}
Out[3]=
33145168
> So it seems that the improvement in version 6 is that Import speed is
> insensitive to number form!
>
> I've now had an update from Technical Support which says
> "Most of the differences about the importer are internal. The new
> importer
> can import much larger files than version 5.2. The amount of memory
> taken
> to import a large file has been reduced by almost 3 times. Parsing of
> dates
> has been considerably improved. We can import many more variations
> than
> before."
>
> On the face of it, the above results show that version 6 uses about 4
> times _more_ memory than version 5.2, but maybe there's memory used in
> a Java process in 5.2 or something?
>
> In version 6, Import has a "DateStringFormat" option (I _think_ set to
> "None" by default, but the Help Table doesn't have headings) whereas
> version 5.2 had DateStyle.
> It would be nice if there was a way to turn off date import (maybe
> "DateStringFormat"->"NoneAtAll" ?) and get the factor ~3 speed
> improvement of version 5.2
>
> Regards
>
> John Payne
So, with your example file, I get results similar to yours: v6.0.1
/is/ slower than 5.2 in this case (17 s vs 9 s, respectively), and 5.2
also uses less memory (about 33 MB vs 139 MB at peak).
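For anyone who wants to repeat the comparison, the measurements above can be collected in one call. This is only a minimal sketch: the helper name importStats and its output labels are made up for illustration and appear nowhere in this thread. Note that Timing counts CPU time spent in the kernel only and MaxMemoryUsed[] reports the kernel's peak memory, so work done in an external process (if any) would show up in the wall-clock figure but not in the other two.

```mathematica
(* Hypothetical helper: the name importStats and its labels are illustrative only. *)
importStats[file_String] := Module[{wall, cpu, data},
  (* AbsoluteTiming wraps Timing, so a single Import yields both measurements. *)
  {wall, {cpu, data}} = AbsoluteTiming[Timing[Import[file]]];
  {"CPUTime" -> cpu,                  (* kernel CPU time, as reported by Timing *)
   "WallTime" -> wall,                (* elapsed real time, as reported by AbsoluteTiming *)
   "Dimensions" -> Dimensions[data],  (* shape of the imported table *)
   "ResultBytes" -> ByteCount[data],  (* size of the imported expression itself *)
   "PeakKernelBytes" -> MaxMemoryUsed[]}
]

(* Run in a fresh kernel, since MaxMemoryUsed[] is a peak over the whole session. *)
importStats["c:/temp/myfile2.csv"]
```

A large gap between "WallTime" and "CPUTime" would suggest that part of the import happens outside the kernel, which is one way to check the Java-process guess quoted above.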
Regards,
Jean-Marc