ISO data management, graphing, and visualization
- To: mathgroup@smc.vnet.net
- Subject: [mg11078] ISO data management, graphing, and visualization
- From: Andrew Glew <glew@cs.wisc.edu>
- Date: Wed, 18 Feb 1998 20:32:29 -0500
- Organization: CS Department, University of Wisconsin
Brief
=====

I seek recommendations about software packages to manage
experimental data, and to prepare graphs and visualizations.

Short list of wished-for features:

a) underlying database - i.e. I would like to do joins of differing
   datasets
b) decent graphs: e.g. log axes, different forms of graph
c) interactive data browsing: e.g. click on a point, jump to a
   related variable, drag a box to control zooming, etc.
   (but I also want good batch-mode graph control)

Justification for Posting
=========================

I have probably posted to enough newsgroups to set off some spam
filters. Here's why I chose these newsgroups:

comp.arch
    My application domain is the field of computer architecture.

comp.soft-sys.matlab
comp.soft-sys.math.mathematica
comp.soft-sys.sas
comp.soft-sys.stat.spss
comp.soft-sys.stat.systat
comp.graphics.apps.gnuplot
    Some applications that I know are in this field - I seek advice
    to help me choose between these, and others.

comp.graphics.visualization
    The generic field of visualization.

Detail
======

Context - Improving My Personal BKMs
------------------------------------

comp.arch readers may be aware that I returned to school to finish
my Ph.D. after having worked in industry for many years. I am
finally beginning to collect experimental data for my research, and
I would like to have a system to manage that data.

I would like to improve my personal "Best Known Methods" for this
task of data management and visualization. I.e. I would like to
find better tools than the Perl scripts and GNUplot that I used for
my MS 8 years ago, or the not-much-better technology that I have
used in industry. In particular, in my last industrial stint Excel
was considered the state of the art for preparing graphs. I
consider this distinctly unsatisfactory - especially since Excel
was extremely slow in handling large datasets (upwards of 64000
data points), which I have always been able to handle in GNUplot.
Furthermore, I consider the injunction "reduce your data set" not
always desirable.

I have been playing with computer performance data for more than 15
years now. Surely the state of the art has improved somewhat?

Types of Data
-------------

The datasets that I wish to manage range over:

Profiles: 2-tuples of (address, count) for many of the locations in
a program - often giving 100s of thousands of data points, which I
wish to scroll around in quickly, viewing at a coarse scale
(matched to the resolution of my screen) so that I can zoom in and
out.

Simulator Results: Typical measurements such as IPC (Instructions
Per Clock) and estimated run time for particular benchmarks. 100 or
so parameters or measurements per benchmark.

Traces: Ideally, I believe that the approach I describe here (that
of a database) could be extended to traces of events such as start,
execute, retirement for instructions in a program => millions of
measurements, limited mainly by disk space.

I should note that version control of measurements is an issue -
i.e. my measurements are almost never just streams of numbers, but
are, at the minimum, streams of numbers augmented with a version, a
few dates, and a comment description, which I wish to be an
intrinsic part of the data.

The Simulator Result measurements are often ad hoc - new metrics
are created on a daily basis, and discarded as readily. Thus tools
that make it painful to create new metric types - e.g. by requiring
a database schema to be created by hand - are undesirable.
Similarly, measurements are sparse.

Wish List - Database
--------------------

It has long seemed to me that many of the problems of data
manipulation could be handled by a Relational Database Management
System. For example, if you have (in a very schematic notation)

    Experiment = {
        name = "Baseline",
        results = {
            { benchmark = spec95,126.gcc,      I=test, ipc = 45 }
            { benchmark = spec95,126.gcc,      I=ref,  ipc = 22 }
            { benchmark = spec95,129.compress, I=test, ipc = 988 }
            ...
        }
    }
    Experiment = {
        name = "LatestGreatIdea",
        results = {
            { benchmark = spec95,126.gcc,      I=test, ipc = 450 }
            { benchmark = spec95,126.gcc,      I=ref,  ipc = 212 }
            { benchmark = spec95,129.compress, I=test, ipc = 98 }
            ...
        }
    }

then something like an SQL join should be able to perform the
comparison (see the sketch at the end of this section).

Straightforward attempts to put this into an RDBMS usually founder
on

a) the extreme slowness of RDBMSes
b) cost (perhaps the MiniSQL freeware will make this okay)
c) the painful contortions that most RDBMSes require in handling
   inconsistent data - i.e. I don't want to define a schema; my
   schema is in my datafiles (i.e. the tuple field names)

Nonetheless, I have hope that an RDBMS, or perhaps an OODBMS, or
even an ORDBMS, may be okay.
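To make that concrete, here is a minimal sketch of the comparison
as an SQL join. The table and column names are invented for
illustration - nothing here is specific to any one product:

    -- Hypothetical schema: one row per (experiment, benchmark,
    -- input set) result, carrying the version/date/comment
    -- metadata mentioned under Types of Data.
    CREATE TABLE results (
        experiment  VARCHAR(40),   -- e.g. 'Baseline'
        benchmark   VARCHAR(40),   -- e.g. 'spec95,126.gcc'
        input_set   VARCHAR(10),   -- e.g. 'test' or 'ref'
        ipc         FLOAT,
        version     VARCHAR(20),
        run_date    DATE,
        descr       VARCHAR(200)   -- free-form comment
    );

    -- Compare the two experiments benchmark-by-benchmark.
    SELECT b.benchmark, b.input_set,
           b.ipc         AS baseline_ipc,
           n.ipc         AS new_ipc,
           n.ipc / b.ipc AS ratio
    FROM   results b, results n
    WHERE  b.experiment = 'Baseline'
      AND  n.experiment = 'LatestGreatIdea'
      AND  b.benchmark  = n.benchmark
      AND  b.input_set  = n.input_set;

Of course, the whole point of complaint c) is that I do not want to
write that CREATE TABLE by hand every time a new metric appears.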
Wish List - Graphing
--------------------

Of course, I wish to be able to produce any graph that I can find
in the computer architecture literature, and/or those that I am
familiar with from other fields...

Wish List - GUI *and* Batch
---------------------------

I would like to be able to interactively walk into and out of my
data, drawing boxes to zoom, etc. Tools like GNUplot seem to fall
down in this regard --- it is a real pain to have to type
"set xrange [4:1000]", as opposed to using the mouse.

Wish List - Automation
----------------------

There seem to be a number of interactive tools around. Each, of
course, insists on slightly different input formats. It is
straightforward to write Perl scripts to do the conversion, but...
it would be *really* *nice* if those scripts could be wired into
the interface, so that, e.g., MATLAB could automatically use my
.SSOUT to .MAT converter when I try to read a .SSOUT file. I.e. I
don't want to pre-generate every possible datafile format - that is
a data management nightmare. Rather, I want to convert as needed.

Wish List - Good Built-In Handling of Missing Data
--------------------------------------------------

As mentioned above, my data is often sparse - many metrics are not
available in all configurations. Math systems that permit e.g.
calculation of ratios and that handle missing data nicely are
highly desirable. I.e. can the system give "NA" (Not Available) as
an answer? Giving spurious values such as 0 (Perl's default when
the programmer forgets to check for undefined values) is highly
undesirable and misleading. Note, especially, that you do not want
simply to join on records where all fields are defined - explicitly
listing the undefined ones is very useful, as the sketch below
illustrates.
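Continuing the sketch from the database wish above (same invented
"results" table; note that outer-join support varies from product
to product), this is the behavior I am after:

    -- An inner join silently drops any benchmark measured in only
    -- one experiment. A full outer join keeps such rows, with NULL
    -- (SQL's spelling of "NA") standing in for the missing
    -- measurement - and a ratio computed over a NULL operand is
    -- itself NULL, not a spurious 0.
    SELECT COALESCE(b.benchmark, n.benchmark) AS benchmark,
           b.ipc         AS baseline_ipc, -- NULL if not measured
           n.ipc         AS new_ipc,      -- NULL if not measured
           n.ipc / b.ipc AS ratio         -- NULL if either is NULL
    FROM   (SELECT * FROM results
            WHERE experiment = 'Baseline') b
    FULL OUTER JOIN
           (SELECT * FROM results
            WHERE experiment = 'LatestGreatIdea') n
    ON     b.benchmark = n.benchmark
       AND b.input_set = n.input_set;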
Candidates
----------

The list of candidate tools that I am aware of includes:

    DEVise
    Datadesk
    SPSS
    Sigmaplot
    SAS
    Statsoft SYSTAT
    JMP
    IDL
    IPL
    gnuplot
    xmgr
    MATLAB
    Mathematica

I am not, however, familiar with all of these - I do not feel that
I know enough to make a good decision. Hence this post, inviting
recommendations from others. (Please reply to me personally by
mail - I leave it up to you to decide if you want to reply to the
newsgroups as well.)

I don't know much, but I have gathered some random impressions and
information about these tools, which I will summarize here - hoping
that others may correct me.

DEVise
------

"Database Exploration and Visualization". Academic work in
progress, from the University of Wisconsin. Contains an underlying
database - apparently ad-hoc coded, although there are plans to
interface to a standard commercial database. Can use an SQL subset.
Primitive graphs - doesn't even have logarithmic axes. Limited GUI
querying - click on a graphical object, see the data. Most data
exploration (e.g. range setting) seems to have to be done via
menus. Tcl/Tk based => extensible? (in the sense that you can
extend any source code that you can read).

Datadesk
--------

Commercial. Seems to provide the most interactive data browsing:
twirling axes in 3D, zooming in and out via boxes, etc. Linked
variable windows. Data structures seem primitive - straight
vectors. Q: does it support structures?

MATLAB
------

Commercial. Matrix based, but supports matrices of matrices, and
matrices with named tuple fields. Good internal programming
language. Many packages. Theoretically, database-style JOINs could
be written, but they don't seem to exist in the standard set of
add-ons. Q: is there a database add-on? Theoretically, good
interactive graph browsing could be written, but the standard
graphing tools seem to be command/text based. (Part of me thinks
that what I want is SQL for MATLAB, with interactive graph
browsing.)

SPSS, SAS
---------

Commercial. I used SPSS (and SAS) extensively back in the mainframe
days, so I feel confident of their computing abilities. (I am
considerably less confident of the computing abilities of many of
the newer, graphics-oriented tools.) I am less familiar with the
GUI interfaces that have been added in the last decade.
Q: do they provide the interactive features, such as drag a box to
zoom?
Q: in the old days, SAS and SPSS were basically oriented around
sequential files of records. Do they have any database features,
like JOIN?

IDL
---

Commercial. Widely advertised; seems to have a general programming
language with good graphics hooks. Q: database?

gnuplot
-------

Freeware. Very textual; does a good job for a very limited
repertoire of graphs. Nice in that it seems to be one of the few
programs that allow filters to be specified when importing data,
e.g.

    plot '< grep-ssout VariableName*100 datafile', \
         '< grep-ssout Variable2/Variable1 datafile'

IPL
---

Seems to be mainly batch oriented. Semi-freeware.

xmgr
----

Does the graphical "zoom in by drawing a box" thing. Doesn't do
much else.

Mathematica
-----------

Might be able to handle stuff like this, but seems to have
performance problems. No GUI that I can see.

Conclusion
==========

Any help in choosing tools that I can use will be appreciated.
Basically, I want something that works out of the box, but which I
can also extend.

I am somewhat shy of investing in tools without good, detailed
explanations or demonstrations:

a) $$$ - actually, I am willing to spend typical commercial
   software prices, but I get somewhat fatigued by the hassle of
   returning software when it turns out not to do what I want.
   Hence, I will appreciate people pointing me in the direction of
   demoware packages - ideally time-limited demoware, because in my
   experience crippleware that is limited in, e.g., the data set
   sizes it can manage often looks good on small data sets but then
   dies on the large ones.
b) personal time - it takes a long time to learn how to use many of
   these packages, and I am somewhat reluctant to dedicate such
   time.

---
Andy "Krazy" Glew, glew@cs.wisc.edu
Place URGENT in email subject line for mail filter prioritization.
DISCLAIMER: private posting, not representative of employer.
{{ VLIW: the leading edge of the last generation of computer architecture }}