Import [ #, Data ]&
- To: mathgroup at smc.vnet.net
- Subject: [mg97622] Import [ #, Data ]&
- From: Fred Klingener <gigabitbucket at BrockEng.com>
- Date: Tue, 17 Mar 2009 04:59:29 -0500 (EST)
Group,

Over the last few days, I've spent enough time trying to manipulate and plot web data using brute force techniques to appreciate the immense power (and a few really nasty gotchas) of the way Mathematica handles Import[]. It seemed to be a good idea to share some of the ideas and maybe to reap recommendations about how it really should be done.

The immediate task was to follow the evolving Iditarod Trail Sled Dog Race and plot the position vs. time chart for the leaders. During the race, the organizer posts regular updates of the standings, the core of which are the web pages, one for each musher, that contain data tables of arrival and departure times at each checkpoint. It's these tables that I wanted to take apart and reassemble in a way that would let me plot the comparative progress.

As a starting point, the page listing the top five mushers contains hyperlinks to the pages for each. This list is available from:

    top5=Import["http://www.iditarod.com/race/race/topfive.html","Hyperlinks"]

There are a lot of links here, but the ones I want are distinguished by a form that includes "musher_" followed by a decimal number. So these can be picked out by

    top5=Select[top5,{}!=StringPosition[#,"musher_"]&]

Here, as one of the nastiest little surprises, the musher files are not returned in running order even though they appear in order in the source. So I had to prospect each file to find where the current position was located (I found it at [[4, 3]]), get the order, and use that to sort the top5 list. Here's the result:

    top5data=Import[#,"Data"]&/@top5;
    sorted=Ordering[top5data[[All,4,3]]];
    top5=top5[[sorted]];

So this sorted list of the top five musher pages could be used to retrieve all the latest checkpoint/time data, and here's the crux of the power in the process. Import[#, "Data"] dissects the target page, evidently recognizes tables, and assembles the results into what it thinks are useful Mathematica structures.
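The select-and-sort step can be tried offline with a minimal sketch like the one below. Everything in it is made up for illustration: the example.com links, the mockPage helper, and the assumption that each "Data" result carries the current standing at [[4, 3]] as described above. It also swaps the StringPosition test for a StringMatchQ pattern that insists on digits after "musher_", which is an alternative, not the code used in this post.

```mathematica
(* Hypothetical stand-in for the result of Import[url, "Hyperlinks"] *)
links = {
   "http://example.com/race/index.html",
   "http://example.com/race/musher_12.html",
   "http://example.com/race/musher_7.html",
   "http://example.com/race/musher_31.html"};

(* Keep only links containing "musher_" followed by at least one digit *)
musherLinks = Select[links,
   StringMatchQ[#, ___ ~~ "musher_" ~~ (DigitCharacter ..) ~~ ___] &];

(* Hypothetical stand-in for Import[#, "Data"]: a nested list with the
   current standing at [[4, 3]] *)
mockPage[standing_] := {"header", "nav", "title", {"Pos", "Current", standing}, {}};
pages = mockPage /@ {2, 3, 1};

(* Ordering returns the permutation that sorts the standings ascending;
   applying it to the links puts them in running order *)
order = Ordering[pages[[All, 4, 3]]];
sortedLinks = musherLinks[[order]]
```

With these mock standings of {2, 3, 1}, the link for musher_31 comes first, then musher_12, then musher_7. The same Part specification with All is what pulls one nested block out of every imported page at once.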
In particular, I found the checkpoint/arrival time/departure time [[5, 3]] block for each return.

    data=(Import[#,"Data"]&/@top5)[[All,5,3]];

The checkpoint list might be different for each musher.

    trackLength=Length[data[[#]]]&/@Range[Length[data]];

The idea from here was to construct lists of musher positions (in checkpoint miles from the start) vs. time for plotting. DateListPlot[] would have been handy here, but I couldn't get many of its advertised options to work. So I hacked ListPlot[] to do it. Mathematica will complain, but it will convert the time/date stamps in the Iditarod data files into AbsoluteTime[], so it remains to populate the plotting array. The process was complicated by the inconsistency in the way in and out times were recorded among the checkpoints, so I eventually settled on the following Monument to Incorrectness:

    pos={{Quiet@AbsoluteTime[#[[1,3]]],ToExpression[#[[1,2]]]}}&/@data;
    For[musher=1
     ,musher<=Length[data]
     ,musher++
     ,Quiet@
      For[checkPoint=1
       ,checkPoint<=trackLength[[musher]]
       ,checkPoint++
       ,AppendTo[pos[[musher]],{AbsoluteTime[data[[musher,checkPoint,3]]],ToExpression[data[[musher,checkPoint,2]]]}];
        If[Length[data[[musher,checkPoint]]]>4
         ,AppendTo[pos[[musher]],{AbsoluteTime[data[[musher,checkPoint,5]]],ToExpression[data[[musher,checkPoint,2]]]}]];
      ]
    ];

As embarrassing as it is to include that bit of code, on reflection it seems to be eminently readable and maintainable. Maybe I don't feel so bad. After I post this, I'll get to work on the encrypted one-liner.

From here, a simple

    ListPlot[pos,Joined->True]

gets me a recognizable form.
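For comparison, here is a minimal sketch of how the nested For loops might be written with Map instead. It is not the promised one-liner and has not been run against the live pages. The names rowPoints, track, and sampleData are hypothetical, and the sample rows only imitate the layout the loop relies on (miles at position 2, arrival at 3, departure at 5 when the row has more than four entries). The sample uses date lists rather than the site's time strings so that AbsoluteTime has nothing to guess at.

```mathematica
(* Hypothetical data for one musher, in the layout assumed by the loop:
   {name, miles, arrival, (unused), departure}; the last row has no departure *)
sampleData = {{
    {"Start (0)", "0", {2009, 3, 8, 14, 0, 0}, "", {2009, 3, 8, 14, 0, 0}},
    {"Checkpoint A (50)", "50", {2009, 3, 8, 20, 30, 0}, "", {2009, 3, 9, 1, 0, 0}},
    {"Checkpoint B (90)", "90", {2009, 3, 9, 7, 15, 0}}}};

(* One row gives a {time, miles} point for the arrival, plus a second
   point for the departure when the row is long enough to hold one *)
rowPoints[row_] := With[{miles = ToExpression[row[[2]]]},
   {AbsoluteTime[#], miles} & /@ row[[If[Length[row] > 4, {3, 5}, {3}]]]];

(* One musher gives all points in checkpoint order; Quiet mirrors the
   original, which suppresses date-parsing complaints *)
track[rows_] := Quiet[Join @@ (rowPoints /@ rows)];

pos = track /@ sampleData;
```

The result has the same shape as before, so ListPlot[pos, Joined->True] applies unchanged. One difference from the loop: this version does not repeat the first arrival point that the loop seeds pos with, which should not change the plotted line.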
Then, I spent all the time saved by Import[#, "Data"] on prettification:

    chkPtNames=StringSplit[#," ("][[1]]&/@data[[1,All,1]];
    chkPtMiles=data[[1,All,2]];
    yTicksTable=Table[{chkPtMiles[[j]],chkPtNames[[j]]<>" "<>ToString[chkPtMiles[[j]]]},{j,1,Length[data[[1]]]}];
    xGridPositions=Table[AbsoluteTime[{2009,03,j}],{j,6,17}];
    xLabelPositions=Table[AbsoluteTime[{2009,03,j,12}],{j,6,17}];
    xLabels=DateString[#,{"DayNameShort","\n","Day"}]&/@xLabelPositions;
    xTicksTable=Table[{xLabelPositions[[j]],xLabels[[j]]},{j,1,Length[xLabelPositions]}];
    ListPlot[pos
     ,PlotRange->{AbsoluteTime[{2009,03,#}]&/@{7,17},{0,1100}}
     ,Joined->True
     ,Ticks->{xTicksTable,yTicksTable}
     ,GridLines->{xGridPositions,chkPtMiles}
     ,ImageSize->{640,480}
     ,AxesOrigin->{AbsoluteTime[{2009,03,7}],0}
     ,BaseStyle->"Label"]

There's still some prettification to be done, like a legend so I could tell who is who, but Mackey is going to win it no matter what.

Cheers,
Fred Klingener