MathGroup Archive 2007

Re: Slow Import of CSV files

  • To: mathgroup at smc.vnet.net
  • Subject: [mg79499] Re: Slow Import of CSV files
  • From: Jean-Marc Gulliet <jeanmarc.gulliet at gmail.com>
  • Date: Sat, 28 Jul 2007 05:26:20 -0400 (EDT)
  • Organization: The Open University, Milton Keynes, UK
  • References: <f86ren$ph5$1@smc.vnet.net><f89pti$5iq$1@smc.vnet.net> <f8cf9c$2r7$1@smc.vnet.net>

j.f.b.payne at tesco.net wrote:
> Hi Jean-Marc
> 
> Thanks for the suggestion.  You are right, it depends on the csv file.
> For your example I get
> 
> $Version
> 
> "6.0 for Microsoft Windows (32-bit) (June 19, 2007)"
> 
> Timing[data = Import["c:/temp/myfile.csv"];]
> 
> {52.786, Null}
>
> $Version
> 
> "5.2 for Microsoft Windows (June 20, 2005)"
> 
> Timing[data=Import["c:/temp/myfile.csv"];]
> 
> {49.641 Second, Null}
> 
> So version 5.2 is quicker on my system, but only marginally (the
> difference from your result will be because this is a Pentium III
> whereas you, I suppose, have a Pentium IV which has speeded up some
> instructions more than others).

You are right: Pentium IV HT, 2.6 GHz, 1 GB RAM.

> However, the relevant result is with a file more like mine, which has
> fairly small fixed point numbers (your file has almost all very large
> floating point numbers, close to $MaxMachineNumber).
> 
> So
> 
> data = RandomReal[{0, 1500}, {3 10^5, 3}];
> 
> Timing[Export["c:/temp/myfile2.csv", data] ]

OK, here is my timing with the above code/data (about 16 MB on disk):

In[1]:= data = RandomReal[{0, 1500}, {3 10^5, 3}];

Timing[Export["c:/temp/myfile2.csv", data]]

Out[2]= {33.516, "c:/temp/myfile2.csv"}
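
The on-disk size quoted above can be checked from within the same session. This is only a minimal sketch: it assumes the file is still at the path used in the Export call, and the variable name file is just illustrative.

```mathematica
(* Path taken from the Export call above; adjust for your system *)
file = "c:/temp/myfile2.csv";

(* FileByteCount gives the size in bytes; divide by 2^20 for megabytes *)
N[FileByteCount[file]/2^20]
```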

> (I like / because it doesn't have to be doubled, works on both Linux
> and Windows)
> The speed of version 6 is not really affected by the difference in the
> files
> 
> In[1]:= $Version
> 
> Out[1]= "6.0 for Microsoft Windows (32-bit) (June 19, 2007)"
> 
> In[2]:= Timing[data = Import["c:/temp/myfile2.csv"];]
> 
> Out[2]= {51.885, Null}
> 
> In[3]:= MaxMemoryUsed[]
> 
> Out[3]= 139109800

In[1]:= $Version
Timing[data = Import["c:/temp/myfile2.csv"];]
MaxMemoryUsed[]

Out[1]= "6.0 for Microsoft Windows (32-bit) (June 19, 2007)"

Out[2]= {17.375, Null}

Out[3]= 139230320

> but version 5.2 is about 2.5x quicker for this file
> 
> In[1]:=$Version
> 
> Out[1]= "5.2 for Microsoft Windows (June 20, 2005)"
> 
> In[2]:=Timing[data=Import["c:/temp/myfile2.csv"];]
> 
> Out[2]= {20.239 Second, Null}
> 
> In[3]:=MaxMemoryUsed[]
> 
> Out[3]= 33143280

In[1]:=
$Version
Timing[data=Import["c:/temp/myfile2.csv"];]
MaxMemoryUsed[]

Out[1]=
5.2 for Microsoft Windows (June 20, 2005)

Out[2]=
{9.468 Second,Null}

Out[3]=
33145168

> So it seems that the improvement in version 6 is that Import speed is
> insensitive to number form!
> 
> I've now had an update from Technical Support which says
> "Most of the differences about the importer are internal. The new
> importer can import much larger files than version 5.2. The amount of
> memory taken to import a large file has been reduced by almost 3 times.
> Parsing of dates has been considerably improved. We can import many
> more variations than before."
> 
> On the face of it, the above results show that version 6 uses about 4
> times _more_ memory than version 5.2, but maybe there's memory used in
> a Java process in 5.2 or something?
> 
> In version 6, Import has a "DateStringFormat" option (I _think_ set to
> "None" by default, but the Help Table doesn't have headings) whereas
> version 5.2 had DateStyle.
> It would be nice if there was a way to turn off date import (maybe
> "DateStringFormat"->"NoneAtAll" ?) and get the factor ~3 speed
> improvement of version 5.2
> 
> Regards
> 
> John Payne

So, with your example file, I get results similar to yours: v6.0.1
/is/ slower than 5.2 in this case (about 17 s vs 9 s), and 5.2 also
uses much less memory (about 33 MB vs 139 MB).
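
For reference, the ratios behind that comparison can be worked out directly from the numbers reported in this thread. The variable names below are only illustrative, and the figures are copied from the Timing and MaxMemoryUsed outputs above rather than measured again.

```mathematica
(* Import timings in seconds, from the sessions on my machine *)
t6 = 17.375; t52 = 9.468;

(* MaxMemoryUsed[] in bytes, same sessions *)
m6 = 139230320; m52 = 33145168;

t6/t52            (* about 1.84: v6 takes nearly twice as long *)
N[m6/m52]         (* about 4.2: v6 peak memory is roughly four times higher *)

(* Your timings on the Pentium III, for comparison *)
51.885/20.239     (* about 2.56, in line with the "about 2.5x" you quote *)
```

So the two machines disagree on the size of the slowdown (roughly 1.8x here against 2.5x on yours) but agree on the memory ratio.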

Regards,
Jean-Marc

