Re: Read/Write streams in parallel
- To: mathgroup at smc.vnet.net
- Subject: [mg109872] Re: Read/Write streams in parallel
- From: David Bailey <dave at removedbailey.co.uk>
- Date: Thu, 20 May 2010 07:24:36 -0400 (EDT)
- References: <ht1unu$36q$1@smc.vnet.net>
ChrisL wrote: > For a future project, I'll need to read and write some streams in > parallel. I've started playing with a simple example and can't make it > work. > What I want to do is to read an entry from a stream ('testfile') and > write its value to another stream ('testfile2'). This is > straightforward with one processor but in parallel, I get either a > wrong result or errors messages. > For example (testfile contains the first 100 prime numbers): > ParallelEvaluate[sr = OpenRead@FileNameJoin[{$TemporaryDirectory, > "testfile"}]] > ParallelEvaluate[wr = OpenAppend@FileNameJoin[{$TemporaryDirectory, > "testfile2"}]] > ParallelDo[ > x = Read[sr]; Print[{x, $KernelID}]; > Write[wr, {x, $KernelID}], {10}]; > ParallelEvaluate[Close[sr]] > ParallelEvaluate[Close[wr]] > > testfile2 then contains: > {2, 2} > {2, 1} > {3, 2} > {3, 1} > {5, 2} > {5, 1} > {7, 2} > {7, 1} > {11, 2} > {11, 1} > This is not satisfactory: the entries are read twice, once per > processor. Obviously, it's because the two streams are opened by both > processors in parallel. So we want the streams to be opened only once, > and their 'pointers' (location of last entry read) to be updated > concurrently. > I try this instead: > sr = OpenRead@FileNameJoin[{$TemporaryDirectory, "testfile"}] > wr = OpenAppend@FileNameJoin[{$TemporaryDirectory, "testfile2"}] > SetSharedVariable[sr, wr] > ParallelDo[x = Read[sr]; Print[{x, $KernelID}]; > Write[wr, {x, $KernelID}], {4}]; > Close[sr] > Close[wr] > > This time the two streams should be open only once and sync-ed thanks > to SetSharedVariable. But testfile2 ends up being empty*, and > Mathematica sends some error messages: > InputStream["/tmp/testfile", 33] (* Value of sr *) > OutputStream["/tmp/testfile2", 34] (* Value of wr *) > (kernel 2) Read::openx: InputStream[/tmp/testfile,33] is not open. (* > yes it is! *) > (kernel 1) Read::openx: InputStream[/tmp/testfile,33] is not open. (* > yes it is! *) > etc. > > Is there any way I can make it work? The correct content for testfile2 > should be the list of the 10 first prime numbers, possibly in a > different order, and the kernelID that wrote the entry. > > * if it starts from an empty file. Notice the OpenAppend (!=OpenWrite) > Thank you very much in advance, > Cheers, > Chris > Sharing the stream is obviously a necessary step here, but it is hardly surprising that it is not sufficient. The structure of an InputStream object contains an integer, that is presumably a pointer to some opaque information inside the kernel, which will not have been set up on the other kernel. I am trying to imagine why you need to do this. One possibility is that the input stream in your real application will be a list of 'things to do' and each kernel will read another one when it has finished processing the one before. In that case, perhaps the quantity of information in the file is not that large, and it would make sense to read the entire file into a shared list, and use a shared integer to sequence down the list in parallel. As regards writing, since you obviously don't care what order the results get written to the output file (I assume!) you could just write them to separate files and join them up later. Certainly, the less interaction between the parallel kernels, the faster things will go. To be honest, I'd also ask the question as to whether it is worth using Mathematica parallelism at all. Certainly, I would put most effort into optimising the task on one kernel first. David Bailey http://www.dbaileyconsultancy.co.uk