MathGroup Archive 2012

Re: Extracting Information from XBRL Files

  • To: mathgroup at
  • Subject: [mg127067] Re: Extracting Information from XBRL Files
  • From: "Hans Michel" <hmichel at>
  • Date: Thu, 28 Jun 2012 04:00:59 -0400 (EDT)
  • Delivered-to:
  • References: <> <000601cd546f$e99a77d0$bccf6770$@net> <>


To get you started: the DEF 14A forms do not seem to be provided even in XML
format.

For example, if you query today's (6/27/2012) filings so far, before the SEC
closes receipt of

This should return a list of the last 40 SEC DEF 14A filings.

Let's choose "FREDS INC (0000724571) (Filer)". The number in the first
parentheses is the CIK.
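
As a rough illustration of pulling the CIK out of such a listing entry, here is a minimal Python sketch. The function name `parse_filer` and the assumption that every entry has the shape "NAME (digits) (Role)" are mine, not something the SEC page guarantees.

```python
import re

# Assumed entry shape: NAME (CIK digits) (Role)
FILER_RE = re.compile(
    r"^\s*(?P<name>.+?)\s*\((?P<cik>\d+)\)\s*\((?P<role>[^)]+)\)\s*$"
)

def parse_filer(entry):
    """Split a listing entry into company name, CIK, and role."""
    m = FILER_RE.match(entry)
    if m is None:
        raise ValueError("unexpected entry format: %r" % entry)
    return {
        "name": m.group("name"),
        "cik_padded": m.group("cik"),   # as shown on the page
        "cik": int(m.group("cik")),     # leading zeros dropped
        "role": m.group("role"),
    }

info = parse_filer(" FREDS INC (0000724571) (Filer)")
print(info["name"], info["cik"])
```

The integer form (leading zeros dropped) is the one used in the CIK= query parameter further down.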

Other definitive proxy statements
Accession Number: 0001193125-12-285304  Act: 34  Size: 801 KB  2012-06-27
15:01:51 2012-06-27 001-14565 

The [text] link provided on the page is the better link to process.

This will contain a combination of SGML-XML and usually well-formed HTML or
XHTML, in addition to some encoded image or PDF files.

The HTML file is good for viewing as well as processing, but the text file
also contains some header information.

The wolfram financial data can help if you are doing portfolio research and
you only know the ticker.

So from the ticker you want to get the CIK. From the CIK you can go to the
SEC, or through a well-formatted URI (URL) for the browse-edgar CGI you can
get the forms for a particular company.
So this URL gets just the latest DEF 14A for FREDS INC

All I did was change the CIK=& to CIK=724571& to get the specific form.
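
The CIK substitution can also be done programmatically rather than by hand. The sketch below is only an illustration: the host `example.invalid` and every query parameter other than `CIK` are placeholders I made up, since the actual URL is not reproduced in this message. Swap in the real browse-edgar URL you are using.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def set_cik(url, cik):
    """Return url with its CIK query parameter set to cik."""
    parts = urlsplit(url)
    # keep_blank_values=True so the empty "CIK=" pair is not dropped
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    pairs = [(k, str(cik) if k == "CIK" else v) for k, v in pairs]
    return urlunsplit(parts._replace(query=urlencode(pairs)))

# Placeholder template: host and non-CIK parameters are hypothetical.
TEMPLATE = ("https://example.invalid/cgi-bin/browse-edgar"
            "?action=getcurrent&CIK=&type=DEF+14A&count=40")

company_url = set_cik(TEMPLATE, 724571)
print(company_url)
```

Parsing the query string instead of doing a plain text replace avoids accidentally touching another parameter whose name happens to end in "CIK".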

The returned list is an HTML page, and one can easily look for all links that
have the word "[text]" on that resulting page.
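
A minimal sketch of that link search, using only Python's standard library. The sample HTML and its paths are invented for illustration; the real result page will have more markup around the links, and the class name `TextLinkFinder` is my own.

```python
from html.parser import HTMLParser

class TextLinkFinder(HTMLParser):
    """Collect href values of <a> elements whose text contains '[text]'."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._buf = []      # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._buf = []

    def handle_data(self, data):
        if self._href is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "[text]" in "".join(self._buf):
                self.links.append(self._href)
            self._href = None

def find_text_links(html):
    finder = TextLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.links

# Hypothetical sample page, not real SEC output.
SAMPLE = """<table>
<tr><td><a href="/Archives/a-index.htm">[html]</a>
<a href="/Archives/a.txt">[text]</a> 801 KB</td></tr>
<tr><td><a href="/Archives/b.txt">[text]</a></td></tr>
</table>"""

text_links = find_text_links(SAMPLE)
print(text_links)
```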

Wolfram's Financial Data or W|A can help with getting a list of executives at a

Back to the *.txt file. This is the file you want to process. All of these
forms will contain lots of legal blah blah blah text. You want to find the
HTML table or segments that contain compensation data. The HTML is daunting,
but applications such as HTML Tidy may help, or simply finding the correct
regex pattern to find well-formed chunks of HTML to process will be
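
One possible shape for the regex approach, as a hedged Python sketch. The sample text and its values are made up, the function names are mine, and the non-greedy pattern assumes tables are not nested; a filing with nested tables would need a real HTML parser or a Tidy pass first.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL
TABLE_RE = re.compile(r"<table\b.*?</table>", FLAGS)
ROW_RE = re.compile(r"<tr\b.*?</tr>", FLAGS)
CELL_RE = re.compile(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", FLAGS)
TAG_RE = re.compile(r"<[^>]+>")

def plain(fragment):
    """Strip tags and collapse whitespace."""
    return " ".join(TAG_RE.sub(" ", fragment).split())

def find_tables(text, keyword=None):
    """Return <table> chunks, optionally only those mentioning keyword."""
    tables = TABLE_RE.findall(text)
    if keyword is not None:
        tables = [t for t in tables if keyword.lower() in plain(t).lower()]
    return tables

def table_rows(table_html):
    """Turn one table chunk into a list of rows of cell text."""
    return [[plain(c) for c in CELL_RE.findall(row)]
            for row in ROW_RE.findall(table_html)]

# Made-up sample, not taken from any filing.
SAMPLE = """<p>Legal text.</p>
<TABLE><TR><TH>Name</TH><TH>Salary ($)</TH></TR>
<TR><TD>Executive A</TD><TD><B>100</B> (1)</TD></TR></TABLE>
<p>More text.</p>
<table><tr><td>Unrelated</td></tr></table>"""

salary_tables = find_tables(SAMPLE, keyword="salary")
rows = table_rows(salary_tables[0])
print(rows)
```

Footnote markers such as "(1)" are left in the cell text on purpose, so they can be matched to the footnotes later.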

Once you have the tables the rest is linking executives with their CIK and
the data and footnotes from the table. The footnotes are needed. The table
may say one number and the footnote may explain how to use it or what it
really means.

There is more information for retrieving SEC data via FTP or HTML and how to
form a URL to get archived data on the SEC website. You may want to pursue
these questions offline.

So if you choose to pursue this then welcome to data mining using


-----Original Message-----
From: Gregory Lypny [mailto:gregory.lypny at] 
Sent: Wednesday, June 27, 2012 10:13 AM
To: MathGroup; Hans Michel
Subject: [mg127067] Re: Extracting Information from XBRL Files

Thanks Hans,

I'm just flying by the seat of my pants.  I will try your suggestion.  I
need compensation tables from the DEF 14A.  I spoke to an SEC representative
yesterday, and she told me that DEF 14A is not yet available in XBRL format.
I like your download-to-notebook-format idea.

Thanks once again,


On Wed, Jun 27, 2012, at 10:19 AM, Hans Michel wrote:

> Gregory:
> I have used Mathematica to extract data from the SEC. (Mostly the 
> older EDGAR format).
> Not all of the data on the SEC website is available in XBRL format.
> For some forms I prefer the EDGAR SGML-XML-HTML-Text hybrid fixed 
> schema and taxonomy without the PDF.
> The XBRL structure brings so much framing with it that parsing the
> core XML file in Mathematica should be straightforward. But attaching
> the associated schemas and definitions is not so easy.
> The SEC provides an RSS feed of interactive data. Mathematica can take
> an RSS feed and change it to Notebook format.
> Import["", "RSS"]
> With that notebook format you can write code to download and extract
> the source zip files.
> Parsing the main XML data file in an XBRL file is straightforward. It
> is attaching the schema and the meaning that is hard. That could be done
> in Mathematica, but to do so one would have to have a compelling reason
> not to use other tools that are specifically made for such tasks.
> I am familiar with SGML data. I would consider the XBRL format a hybrid
> of SGML-XML, given its use of schemas (DTD), entity files, etc.
> What are you trying to do?
> Hans
> -----Original Message-----
> From: Gregory Lypny [mailto:gregory.lypny at]
> Sent: Wednesday, June 27, 2012 3:09 AM
> To: mathgroup at
> Subject: Extracting Information from XBRL Files
> Hello everyone,
> This is a long shot, but has anyone used Mathematica to parse XBRL
> files, such as those accessible from the SEC's (US Securities and
> Exchange Commission) EDGAR system?  XBRL is a tagged format, an
> offshoot of XML.
> Gregory Lypny
