MathGroup Archive: June 2012 [00394]

Re: Extracting Information from XBRL Files

To: mathgroup at smc.vnet.net
Subject: [mg127094] Re: Extracting Information from XBRL Files
From: "Hans Michel" <hmichel at cox.net>
Date: Fri, 29 Jun 2012 04:53:20 -0400 (EDT)
Delivered-to: l-mathgroup@mail-archive0.wolfram.com
References: <201206270808.EAA18598@smc.vnet.net> <000601cd546f$e99a77d0$bccf6770$@net> <2BA08E08-A524-41B6-8EA9-CA9624254120@videotron.ca> <002201cd550b$28aab610$7a002230$@net> <36539422-2080-431D-8927-69739ECA8ABA@videotron.ca>

Greg:

What ran through my mind was the typical catch-phrase (if it is one) from
late night commercials selling some goods: "But, wait! There's More!"

The SEC provides a file with Company name and CIK.

http://www.sec.gov/edgar/NYU/cik.coleft.c

The "NYU" folder name is one of the few remaining indications of how the
EDGAR public data dissemination got started. Some people went down to a
DC government office, wrote some Perl and C code, and set up some computers
to demonstrate the value of the data to the public and how easy it is to
set up a computer system to do this.

Also there is the ftp site

ftp://ftp.sec.gov/edgar/docs/cik.txt

The SEC provides another website where, given a particular CIK, one can get
some of the Header information that is in most EDGAR filings.

For example:
Import["http://www.edgarcompany.sec.gov/servlet/CompanyDBSearch?page=detailed&cik=0000040545&main_back=1", "FullData"]

The CIK is for ticker "GE" but the detailed data for this CIK lacks the
state of incorporation.

Import["http://www.edgarcompany.sec.gov/servlet/CompanyDBSearch?page=detailed&cik=0001141391&main_back=1", "FullData"]

This CIK is for MASTERCARD INC, which is registered in Delaware. It seems
that the state of incorporation, which used to be provided in the past, is
no longer provided, although the field name is still present. But two tries
are not enough to confirm this. The zip code is in the address.
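
Both lookups differ only in the zero-padded CIK, so the URL can be built by a small helper. This is only a sketch: the name cikDetailURL is made up for illustration, and it assumes the CIK fits in 10 digits, as in the two URLs above.

```mathematica
(* Hypothetical helper: build the company-detail URL for a CIK.   *)
(* IntegerString[n, 10, 10] pads the base-10 digits to length 10. *)
cikDetailURL[cik_Integer] :=
  "http://www.edgarcompany.sec.gov/servlet/CompanyDBSearch?page=detailed&cik=" <>
   IntegerString[cik, 10, 10] <> "&main_back=1";

(* Also accept the CIK as a string of digits, padded or not. *)
cikDetailURL[cik_String] := cikDetailURL[FromDigits[cik]];

cikDetailURL[40545]
cikDetailURL["0001141391"]
```

The result can then be passed to Import[..., "FullData"] as in the examples above.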

http://www.edgarcompany.sec.gov/servlet/CompanyDBSearch?page=main (one can
no longer search by state; they may have removed this.)

Mathematica parses out the returned HTML table fairly well. Nevertheless,
this information is in the Header of most filings. I have code somewhere to
parse out the header by looking for the fixed all-caps text followed by ":".
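
I do not have that code at hand, but the idea can be sketched roughly as follows. The function name parseHeader and the sample text are made up for illustration (the field labels are of the kind found in EDGAR headers, the values are placeholders), and the pattern assumes each label consists only of capital letters and spaces.

```mathematica
(* Made-up sample in the style of an EDGAR filing header. *)
sampleHeader = "COMPANY CONFORMED NAME:\t\t\tEXAMPLE CORP\n" <>
   "CENTRAL INDEX KEY:\t\t\t0000000000\n" <>
   "STATE OF INCORPORATION:\t\t\tDE\n" <>
   "this line has no label";

(* A label character: a capital letter or a space. *)
upperOrSpace = Alternatives @@ Append[CharacterRange["A", "Z"], " "];

(* Hypothetical parser: one "LABEL" -> "value" rule per matching line. *)
parseHeader[text_String] := Flatten[
   StringCases[
    StringSplit[text, "\n"],
    StartOfString ~~ WhitespaceCharacter ... ~~
      (key : (upperOrSpace ..)) ~~ ":" ~~ WhitespaceCharacter .. ~~
      (value : (Except[WhitespaceCharacter] ~~ ___)) ~~ EndOfString :>
     (StringTrim[key] -> StringTrim[value])]];

header = parseHeader[sampleHeader];
"STATE OF INCORPORATION" /. header
```

Lines without an all-caps label and a value after the colon are simply skipped, so section titles in the header produce no rule.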

Somewhere on the SEC site is a Ticker-Company Name-CIK file. I just don't
remember where it is. I will forward the link when I find it.

International companies registered with the SEC:

http://www.sec.gov/divisions/corpfin/internatl/companies.shtml

Hans Michel
Michel Information Systems

-----Original Message-----
From: Gregory Lypny [mailto:gregory.lypny at videotron.ca] 
Sent: Thursday, June 28, 2012 7:12 AM
To: Hans Michel; MathGroup
Subject: [mg127094] Re: Extracting Information from XBRL Files

Hi Hans,

This looks cool, and it is very kind of you!  I'm going to dissect it and
run some tests.  I have, in fact, downloaded the CIKs using FinancialData[].
I flag the missing ones, double-check those where the ticker symbol may
be ambiguous, and then manually gather those from EDGAR by searching for
ticker or company name.  With the CIKs, I download the text of each company's
main page from EDGAR (the one that lists the documents available) and
extract the US state location and US state of incorporation, which are
variables in my research.  Your function will help me learn to drill down
through the documents.  Good stuff!
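
The extraction step can be sketched roughly like this. The function name stateFields and the sample line are made up for illustration, and the sketch assumes the page text carries labels of the form "State location:" and "State of Inc.:" followed by a two-letter code.

```mathematica
(* Made-up sample line in the assumed style of the company page text. *)
samplePage = "State location: QC | State of Inc.: DE | Fiscal Year End: 1231";

(* Hypothetical extractor: pick out the two-letter code after each label. *)
stateFields[text_String] := StringCases[text,
   (label : ("State location" | "State of Inc.")) ~~ ":" ~~
     WhitespaceCharacter ... ~~
     (code : Repeated[LetterCharacter, {2}]) :> (label -> code)];

stateFields[samplePage]
```

A label that is missing from the text simply yields no rule, which makes the missing cases easy to flag.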

No matter where this goes, you have earned an acknowledgment on our research
paper!

Regards,

Gregory

On Thu, Jun 28, 2012, at 4:51 AM, Hans Michel wrote:

> Greg:
> 
> Try
> 
> processSECDEF14A[ticker_] :=
>  Module[{cik, paddedCIK, urlfullpath, searchResults, textOnlylinks, 
>    top1linkfromList},
>   (* Look up the CIK for the ticker and zero-pad it to 10 digits *)
>   cik = FinancialData[ticker, "CIK"];
>   paddedCIK = IntegerString[ToExpression[cik], 10, 10];
>   (* EDGAR search for DEF filings under this CIK *)
>   urlfullpath = 
>    "http://www.sec.gov/cgi-bin/srch-edgar?text=CIK%3D" <> paddedCIK <> 
>     "+TYPE%3DDEF&first=1994&last=2012";
>   searchResults = Import[urlfullpath, "Hyperlinks"];
>   (* Keep only the links to plain-text filings *)
>   textOnlylinks = Select[searchResults, StringMatchQ[#, "*.txt"] &];
>   top1linkfromList = First[textOnlylinks];
>   (* The imported text is the value of the Module *)
>   Import[top1linkfromList, "Plaintext"]
>   ];
> 
> The Module above returns the raw text file from the SEC website 
> for the most recent available Form DEF 14A.
> 
> This file contains SGML, HTML and uuencoded JPGs, GIFs or PDFs.
> 
> I tried
> processSECDEF14A["GE"]
> and
> processSECDEF14A["MSFT"]
> with success.
> 
> The code needs to be tightened to handle cases that fail. Parsing the 
> HTML between the <TEXT> </TEXT> tags is left for review.
> 
> The bulk of the data mining is to search for key terms such as "Summary 
> Compensation" and look for the HTML tables near those terms.
> 
> Hans
> -----Original Message-----
> From: Gregory Lypny [mailto:gregory.lypny at videotron.ca]
> Sent: Wednesday, June 27, 2012 10:13 AM
> To: MathGroup; Hans Michel
> Subject: Re: Extracting Information from XBRL Files
> 
> Thanks Hans,
> 
> I'm just flying by the seat of my pants.  I will try your suggestion.  
> I need compensation tables from the DEF 14A.  I spoke to an SEC 
> representative yesterday, and she told me that DEF 14A is not yet
> available in XBRL format.
> I like your download-to-notebook-format idea.
> 
> Thanks once again,
> 
> Gregory
> 
> 
> 
> On Wed, Jun 27, 2012, at 10:19 AM, Hans Michel wrote:
> 
>> Gregory:
>> 
>> I have used Mathematica to extract data from the SEC. (Mostly the 
>> older EDGAR format).
>> 
>> Not all of the data on the SEC website is available in XBRL format.
>> 
>> For some forms I prefer the EDGAR SGML-XML-HTML-Text hybrid fixed 
>> schema and taxonomy without the PDF.
>> 
>> The XBRL structure brings so much framing with it that parsing the 
>> core XML file in Mathematica should be straightforward. But 
>> attaching the associated schemas and definitions is not so easy.
>> 
>> The SEC provides an RSS feed of interactive data. Mathematica can take 
>> an RSS feed and change it to Notebook format.
>> 
>> Import["http://www.sec.gov/Archives/edgar/xbrlrss.all.xml", "RSS"]
>> 
>> http://xbrl.sec.gov/
>> 
>> With that notebook format you can write code to download and extract 
>> the source zip files.
>> 
>> Parsing the main XML data file in an XBRL filing is straightforward. It 
>> is attaching the schema and the meaning that is hard. That could be done 
>> in Mathematica, but one would need a compelling reason not to use other 
>> tools that are specifically made for such tasks.
>> 
>> I am familiar with SGML data. I would consider the XBRL format a 
>> hybrid of SGML-XML, given its use of schemas (DTD), entity files,
>> etc.
>> 
>> What are you trying to do?
>> 
>> Hans
>> 
>> -----Original Message-----
>> From: Gregory Lypny [mailto:gregory.lypny at videotron.ca]
>> Sent: Wednesday, June 27, 2012 3:09 AM
>> To: mathgroup at smc.vnet.net
>> Subject: Extracting Information from XBRL Files
>> 
>> Hello everyone,
>> 
>> This is a long shot, but has anyone used Mathematica to parse XBRL 
>> files, such as those accessible from the SEC's (US Securities and 
>> Exchange Commission) EDGAR system?  XBRL is a tagged format, an 
>> offshoot of XML.
>> 
>> Gregory Lypny
>> 
>> 
>
