ISO data management, graphing, and visualization
- To: mathgroup@smc.vnet.net
- Subject: [mg11078] ISO data management, graphing, and visualization
- From: Andrew Glew <glew@cs.wisc.edu>
- Date: Wed, 18 Feb 1998 20:32:29 -0500
- Organization: CS Department, University of Wisconsin
Brief
=====

I seek recommendations about software packages to manage
experimental data, and to prepare graphs and visualizations.

Short list of wished-for features:

a) underlying database - i.e. I would like to do joins of differing
   datasets
b) decent graphs: e.g. log axes, different forms of graph
c) interactive data browsing: e.g. click on a point, jump to a
   related variable, drag a box to control zooming, etc.
   (but I also want good batch-mode graph control)

Justification for Posting
=========================

I have probably posted to enough newsgroups to set off some spam
filters. Here's why I chose these newsgroups:

comp.arch
    My application domain is the field of computer architecture.

comp.soft-sys.matlab
comp.soft-sys.math.mathematica
comp.soft-sys.sas
comp.soft-sys.stat.spss
comp.soft-sys.stat.systat
comp.graphics.apps.gnuplot
    Some applications that I know are in this field - I seek advice
    to help me choose between these, and others.

comp.graphics.visualization
    The generic field of visualization.

Detail
======

Context - Improving My Personal BKMs
------------------------------------

comp.arch readers may be aware that I returned to school to finish
my Ph.D. after having worked in industry for many years. I am
finally beginning to collect experimental data for my research, and
I would like to have a system to manage that data.

I would like to improve my personal "Best Known Methods" for this
task of data management and visualization. I.e. I would like to
find better tools than the Perl scripts and GNUplot that I used for
my MS 8 years ago, or the not-much-better technology that I have
used in industry. In particular, in my last industrial stint Excel
was considered the state of the art for preparing graphs. I
consider this distinctly unsatisfactory - especially since Excel
was extremely slow in handling large datasets (upwards of 64000
data points), which I have always been able to handle in GNUplot.
Furthermore, I consider the injunction "reduce your data set" not
always desirable.

I have been playing with computer performance data for more than 15
years now. Surely the state of the art has improved somewhat?

Types of Data
-------------

The datasets that I wish to manage range over:

Profiles: 2-tuples of (address, count) for many of the locations in
a program - often giving 100s of thousands of data points, which I
wish to scroll around in quickly, viewing at a coarse scale
(matched to the resolution of my screen) so that I can zoom in and
out.

Simulator Results: Typical measurements such as IPC (Instructions
Per Clock) and estimated run time for particular benchmarks. 100 or
so parameters or measurements per benchmark.

Traces: Ideally, I believe that the approach I describe here (that
of a database) could be extended to traces of events such as start,
execute, retirement for instructions in a program => millions of
measurements, limited mainly by disk space.

I should note that version control of measurements is an issue -
i.e. my measurements are almost never just streams of numbers, but
are, at the minimum, streams of numbers augmented with a version, a
few dates, and a comment description, which I wish to be an
intrinsic part of the data.

The Simulator Result measurements are often ad hoc - new metrics
are created on a daily basis, and discarded as readily. Thus tools
that make it painful to create new metric types - e.g. by requiring
a database schema to be created by hand - are undesirable.
Similarly, measurements are sparse.

Wish List - Database
--------------------

It has long seemed to me that many of the problems of data
manipulation could be handled by a Relational Database Management
System. For example, if you have (in a very schematic notation)

    Experiment = {
        name = "Baseline",
        results = {
            { benchmark = spec95,126.gcc,      I=test, ipc = 45 }
            { benchmark = spec95,126.gcc,      I=ref,  ipc = 22 }
            { benchmark = spec95,129.compress, I=test, ipc = 988 }
            ...
        }
    }
    Experiment = {
        name = "LatestGreatIdea",
        results = {
            { benchmark = spec95,126.gcc,      I=test, ipc = 450 }
            { benchmark = spec95,126.gcc,      I=ref,  ipc = 212 }
            { benchmark = spec95,129.compress, I=test, ipc = 98 }
            ...
        }
    }

then something like an SQL join should be able to perform the
comparison (see the sketch at the end of this section).

Straightforward attempts to put this into an RDBMS usually founder
on

a) the extreme slowness of RDBMSes
b) cost (perhaps the MiniSQL freeware will make this okay)
c) the painful contortions that most RDBMSes require in handling
   inconsistent data - i.e. I don't want to define a schema; my
   schema is in my datafiles (i.e. the tuple field names)

Nonetheless, I have hope that an RDBMS, or perhaps an OODBMS, or
even an ORDBMS, may be okay.
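To make that concrete, here is a minimal sketch of the comparison
as an SQL join. The table and column names are invented for
illustration - nothing here is specific to any one product:

    -- Hypothetical schema: one row per (experiment, benchmark,
    -- input set) result, carrying the version/date/comment
    -- metadata mentioned under Types of Data.
    CREATE TABLE results (
        experiment  VARCHAR(40),   -- e.g. 'Baseline'
        benchmark   VARCHAR(40),   -- e.g. 'spec95,126.gcc'
        input_set   VARCHAR(10),   -- e.g. 'test' or 'ref'
        ipc         FLOAT,
        version     VARCHAR(20),
        run_date    DATE,
        descr       VARCHAR(200)   -- free-form comment
    );

    -- Compare the two experiments benchmark-by-benchmark.
    SELECT b.benchmark, b.input_set,
           b.ipc         AS baseline_ipc,
           n.ipc         AS new_ipc,
           n.ipc / b.ipc AS ratio
    FROM   results b, results n
    WHERE  b.experiment = 'Baseline'
      AND  n.experiment = 'LatestGreatIdea'
      AND  b.benchmark  = n.benchmark
      AND  b.input_set  = n.input_set;

Of course, the whole point of complaint c) is that I do not want to
write that CREATE TABLE by hand every time a new metric appears.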
Wish List - Graphing
--------------------

Of course, I wish to be able to produce any graph that I can find
in the computer architecture literature, and/or those that I am
familiar with from other fields...

Wish List - GUI *and* Batch
---------------------------

I would like to be able to interactively walk into and out of my
data, drawing boxes to zoom, etc. Tools like GNUplot seem to fall
down in this regard --- it is a real pain to have to type
"set xrange [4:1000]", as opposed to using the mouse.

Wish List - Automation
----------------------

There seem to be a number of interactive tools around. Each, of
course, insists on slightly different input formats. It is
straightforward to write Perl scripts to do the conversion, but...
it would be *really* *nice* if those scripts could be wired into
the interface, so that, e.g., MATLAB could automatically use my
.SSOUT to .MAT converter when I try to read a .SSOUT file. I.e. I
don't want to pre-generate every possible datafile format - that is
a data management nightmare. Rather, I want to convert as needed.

Wish List - Good Built-In Handling of Missing Data
--------------------------------------------------

As mentioned above, my data is often sparse - many metrics are not
available in all configurations. Math systems that permit e.g.
calculation of ratios and that handle missing data nicely are
highly desirable. I.e. can the system give "NA" (Not Available) as
an answer? Giving spurious values such as 0 (Perl's default when
the programmer forgets to check for undefined values) is highly
undesirable and misleading. Note, especially, that you do not want
simply to join on records where all fields are defined - explicitly
listing the undefined ones is very useful, as the sketch below
illustrates.
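Continuing the sketch from the database wish above (same invented
"results" table; note that outer-join support varies from product
to product), this is the behavior I am after:

    -- An inner join silently drops any benchmark measured in only
    -- one experiment. A full outer join keeps such rows, with NULL
    -- (SQL's spelling of "NA") standing in for the missing
    -- measurement - and a ratio computed over a NULL operand is
    -- itself NULL, not a spurious 0.
    SELECT COALESCE(b.benchmark, n.benchmark) AS benchmark,
           b.ipc         AS baseline_ipc, -- NULL if not measured
           n.ipc         AS new_ipc,      -- NULL if not measured
           n.ipc / b.ipc AS ratio         -- NULL if either is NULL
    FROM   (SELECT * FROM results
            WHERE experiment = 'Baseline') b
    FULL OUTER JOIN
           (SELECT * FROM results
            WHERE experiment = 'LatestGreatIdea') n
    ON     b.benchmark = n.benchmark
       AND b.input_set = n.input_set;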
Candidates
----------

The list of candidate tools that I am aware of includes:

    DEVise
    Datadesk
    SPSS
    Sigmaplot
    SAS
    Statsoft SYSTAT
    JMP
    IDL
    IPL
    gnuplot
    xmgr
    MATLAB
    Mathematica

I am not, however, familiar with all of these - I do not feel that
I know enough to make a good decision. Hence this post, inviting
recommendations from others. (Please reply to me personally by
mail - I leave it up to you to decide if you want to reply to the
newsgroups as well.)

I don't know much, but I have gathered some random impressions and
information about these tools, which I will summarize here - hoping
that others may correct me.

DEVise
------

"Database Exploration and Visualization". Academic work in
progress, from the University of Wisconsin. Contains an underlying
database - apparently ad-hoc coded, although there are plans to
interface to a standard commercial database. Can use an SQL subset.
Primitive graphs - doesn't even have logarithmic axes. Limited GUI
querying - click on a graphical object, see the data. Most data
exploration (e.g. range setting) seems to have to be done via
menus. Tcl/Tk based => extensible? (in the sense that you can
extend any source code that you can read).

Datadesk
--------

Commercial. Seems to provide the most interactive data browsing:
twirling axes in 3D, zooming in and out via boxes, etc. Linked
variable windows. Data structures seem primitive - straight
vectors. Q: does it support structures?

MATLAB
------

Commercial. Matrix based, but supports matrices of matrices, and
matrices with named tuple fields. Good internal programming
language. Many packages. Theoretically, database-style JOINs could
be written, but they don't seem to exist in the standard set of
add-ons. Q: is there a database add-on? Theoretically, good
interactive graph browsing could be written, but the standard
graphing tools seem to be command/text based. (Part of me thinks
that what I want is SQL for MATLAB, with interactive graph
browsing.)

SPSS, SAS
---------

Commercial. I used SPSS (and SAS) extensively back in the mainframe
days, so I feel confident of their computing abilities. (I am
considerably less confident of the computing abilities of many of
the newer, graphics-oriented tools.) I am less familiar with the
GUI interfaces that have been added in the last decade.
Q: do they provide the interactive features, such as drag a box to
zoom?
Q: in the old days, SAS and SPSS were basically oriented around
sequential files of records. Do they have any database features,
like JOIN?

IDL
---

Commercial. Widely advertised; seems to have a general programming
language with good graphics hooks. Q: database?

gnuplot
-------

Freeware. Very textual; does a good job for a very limited
repertoire of graphs. Nice in that it seems to be one of the few
programs that allow filters to be specified when importing data,
e.g.

    plot '< grep-ssout VariableName*100 datafile', \
         '< grep-ssout Variable2/Variable1 datafile'

IPL
---

Seems to be mainly batch oriented. Semi-freeware.

xmgr
----

Does the graphical "zoom in by drawing a box" thing. Doesn't do
much else.

Mathematica
-----------

Might be able to handle stuff like this, but seems to have
performance problems. No GUI that I can see.

Conclusion
==========

Any help in choosing tools that I can use will be appreciated.
Basically, I want something that works out of the box, but which I
can also extend.

I am somewhat shy of investing in tools without good, detailed
explanations or demonstrations:

a) $$$ - actually, I am willing to spend typical commercial
   software prices, but I get somewhat fatigued by the hassle of
   returning software when it turns out not to do what I want.
   Hence, I will appreciate people pointing me in the direction of
   demoware packages - ideally time-limited demoware, because in my
   experience crippleware that is limited in, e.g., the data set
   sizes it can manage often looks good on small data sets but then
   dies on the large ones.
b) personal time - it takes a long time to learn how to use many of
   these packages, and I am somewhat reluctant to dedicate such
   time.

---
Andy "Krazy" Glew, glew@cs.wisc.edu
Place URGENT in email subject line for mail filter prioritization.
DISCLAIMER: private posting, not representative of employer.
{{ VLIW: the leading edge of the last generation of computer architecture }}