EAD 1.0 to 2002 Conversion Toolkit




EAD Conversion Toolkit README file (21 August 2012)


Objective:

Provide a simple, bare-bones, open source toolkit for converting EAD 1.0 to EAD 2002. The toolkit aims to: convert SGML to XML; update and substitute the EAD 1.0 tag set with the EAD 2002 tag set; build local values, as well as substitution attributes and/or elements, into the EAD 2002 output; and generate an exhaustive report_x.htm that informs the user of every change made by the stylesheet, ensuring that data is retained according to the user's judgment.

Disclaimer:

This toolkit is client side and has been tested in Windows 2000 and XP environments only.

This toolkit is concerned primarily with conversion. I have made every effort to create a conversion toolkit that meets the immediate needs of the Library of Congress as well as those of the typical coder. The stylesheet that converts EAD 1.0 to EAD 2002 (v1v2002_4.xsl) was generated from a workflow that compares the two DTDs, which is intended to give the best possible conversion results. The HTML stylesheet (ead_2002_html_conv-1.xsl) gives the user a non-coded display of their data. The decision to respect some EAD elements and not others in generating this HTML document is based on the current draft of the Library of Congress best practices.

The software used is open source. The limitations and usage restrictions stated by its owners and distributors apply to the user. I fully acknowledge my debt to their work.

Development:

I encourage the user to develop the toolkit further. All folders that appear in this toolkit when it is fully unpacked are necessary to its operation.

EAD 2002 users can use the HTML stylesheet in this toolkit to view their in-process coding. Please review the Post Conversion Implementation section below for details.

Technical Support:

This EAD Conversion toolkit comes with no warranty and no formal technical support service.

Contact:

lcead@loc.gov

Dependencies:

Java 1.5 (or higher)
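
One way to confirm the Java dependency is to inspect the banner printed by `java -version`. The sketch below is not part of the toolkit: it assumes Python 3.9 or later is available on the machine, and the function names are hypothetical.

```python
import re
import subprocess


def parse_java_version(banner: str) -> tuple[int, int]:
    """Extract (major, minor) from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        raise ValueError("no version string found in banner")
    return int(match.group(1)), int(match.group(2) or 0)


def meets_requirement(banner: str, minimum: tuple[int, int] = (1, 5)) -> bool:
    """True when the reported version is at least the README's Java 1.5."""
    return parse_java_version(banner) >= minimum


if __name__ == "__main__":
    try:
        # `java -version` writes its banner to stderr, not stdout.
        result = subprocess.run(["java", "-version"],
                                capture_output=True, text=True)
        print("Java OK:", meets_requirement(result.stderr))
    except (FileNotFoundError, ValueError):
        print("Java was not found on the PATH or printed an unexpected banner")
```

Newer Java releases report versions such as "17.0.2" instead of "1.x"; the tuple comparison treats those as higher than 1.5.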

Included in Full Toolkit:

Setup:

1. Download and unzip my_ead_2002_conv.zip [Date: 21 August 2012].
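
The Troubleshooting FAQ notes that some zip utilities unpack each archive into its own subfolder instead of the root directory. A minimal sketch of a layout check follows. It is not part of the toolkit, and the `EXPECTED` list is an assumption assembled from names mentioned elsewhere in this README, so adjust it to match your download.

```python
from pathlib import Path

# Assumed list, gathered from the batch file contents and the FAQ in this
# README; it may not be the complete set of files the toolkit needs.
EXPECTED = [
    "bin", "doc", "pubtext",        # SP folders
    "related_optional", "ead.dtd",  # unpacked from ead2002.exe
    "saxon_6_5_5/saxon.jar",        # referenced by SAX_2002_01.bat
    "xmlint/xmlint.exe",            # referenced by XMLINT.bat
    "v1v2002_4.xsl",
    "ead_2002_html_conv-1.xsl",
]


def missing_entries(root, expected=EXPECTED):
    """Return the expected files or folders that are absent under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    for name in missing_entries(r"c:\my_ead_2002_conv"):
        print("missing:", name)
```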

Run:

Note: Batch files can be run in a number of different ways.

  1. Double click on the file (fastest)
  2. Open the file (single click; right click; Open)
  3. Windows command line (DOS prompt). At the prompt, type the file name and press Enter.

1. Run sample SGML file

SGML file included: archivista.sgm

Simply run the following batch files. The archivista.sgm file is a sample SGML document taken from the LC EAD Practices Technical Documents page.

  1. SX.bat
    • Input file: archivista.sgm
    • Output file: archivista.xml
      • Compare to test files: c:\my_ead_2002_conv\test_archivista_docs\archivista.xml
  2. SAX_2002_01.bat
    • Input file: archivista.xml
    • Output file[1]: archivista_x.xml
      • Compare to test files: c:\my_ead_2002_conv\test_archivista_docs\archivista_x.xml
    • Output file[2]: report_03152006.91154.16.htm
      • Compare to test files: c:\my_ead_2002_conv\test_archivista_docs\report_03152006.91154.16.htm
  3. SAX_HTML_01.bat
    • Input file: archivista_x.xml
    • Output file: archivista_x.htm
      • Compare to test files: c:\my_ead_2002_conv\test_archivista_docs\archivista_x.htm
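
The "Compare to test files" steps can be done by eye or in an editor. As an alternative, here is a minimal sketch that diffs a produced file against its reference copy. It is not part of the toolkit; the function name is hypothetical and UTF-8 encoding is assumed. The generated report is named from the run's timestamp, so its file name will not match the sample.

```python
import difflib
from pathlib import Path


def diff_files(produced, reference, encoding="utf-8"):
    """Return unified-diff lines; an empty list means the files match."""
    a = Path(produced).read_text(encoding=encoding, errors="replace").splitlines()
    b = Path(reference).read_text(encoding=encoding, errors="replace").splitlines()
    return list(difflib.unified_diff(a, b, str(produced), str(reference),
                                     lineterm=""))


if __name__ == "__main__":
    root = Path(r"c:\my_ead_2002_conv")
    for name in ("archivista.xml", "archivista_x.xml", "archivista_x.htm"):
        produced = root / name
        reference = root / "test_archivista_docs" / name
        if produced.exists() and reference.exists():
            changes = diff_files(produced, reference)
            print(name, "matches" if not changes else f"differs ({len(changes)} diff lines)")
```

Comparing by lines rather than bytes keeps differences in line endings from being reported as changes.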

2. Run user document

  1. Edit batch files with current document name. (select; right click; edit)
    1. Example file name: archivista.sgm
    2. SX.bat
      • contents: bin\sx -inotation -wall -wnet -wno-unused-param -wno-mixed -xnotation -xndata -xlower -xempty -xcdata -cead_catalog -ferror.txt archivista.sgm > archivista.xml
      • edit: archivista.sgm >archivista.xml
    3. SAX_2002_01.bat
      • contents: java -jar saxon_6_5_5/saxon.jar -l -w0 -o archivista_x.xml archivista.xml v1v2002_4.xsl "replace=v1v2002_4" 2>error.txt
      • edit: archivista_x.xml archivista.xml
    4. XMLINT.bat
      • contents: xmlint\xmlint.exe archivista_x.xml > error.txt
      • edit: archivista_x.xml
    5. SAX_HTML_01.bat
      • contents: java -jar saxon_6_5_5/saxon.jar -l -w0 -o archivista_x.htm archivista_x.xml ead_2002_html_conv-1.xsl "replace=ead_2002_html_conv-1.xsl" 2>error.txt
      • edit: archivista_x.htm archivista_x.xml
  2. Run batch files (double click)
    1. timestamp.bat
      • creates datestamp.txt; it_settings.txt; timestamp.txt
      • If you do not run this batch file, your report_x.htm will be overwritten each time you run SAX_2002_01.bat.
      • If you skip this batch file because of OS date configuration issues (see FAQ), be aware that your report_x.htm will be overwritten each time you run SAX_2002_01.bat. Save your report to another folder before your next run.
    2. SX.bat
      • Open error.txt (should be empty)
    3. SAX_2002_01.bat
      • Open error.txt (should be empty)
      • Open report_x.htm in browser (file name created from timestamp.bat files)
    4. XMLINT.bat
      • Open error.txt (should contain file name only ?_x.xml)
    5. SAX_HTML_01.bat
      • Open error.txt (should be empty)
      • Open ?_x.htm in browser
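
Editing four batch files by hand for every document invites typos. The sketch below generates the command line for each batch file from a single document name. It is not part of the toolkit: the function name is hypothetical, and the command templates are copied from the "contents" lines above.

```python
def batch_contents(stem: str) -> dict[str, str]:
    """Map each batch file name to its command line for <stem>.sgm."""
    if " " in stem:
        # The FAQ states that file names cannot contain spaces.
        raise ValueError("file names cannot contain spaces")
    sgm, xml = f"{stem}.sgm", f"{stem}.xml"
    out_xml, out_htm = f"{stem}_x.xml", f"{stem}_x.htm"
    return {
        "SX.bat": (
            r"bin\sx -inotation -wall -wnet -wno-unused-param -wno-mixed "
            "-xnotation -xndata -xlower -xempty -xcdata -cead_catalog "
            f"-ferror.txt {sgm} > {xml}"
        ),
        "SAX_2002_01.bat": (
            f"java -jar saxon_6_5_5/saxon.jar -l -w0 -o {out_xml} {xml} "
            'v1v2002_4.xsl "replace=v1v2002_4" 2>error.txt'
        ),
        "XMLINT.bat": rf"xmlint\xmlint.exe {out_xml} > error.txt",
        "SAX_HTML_01.bat": (
            f"java -jar saxon_6_5_5/saxon.jar -l -w0 -o {out_htm} {out_xml} "
            "ead_2002_html_conv-1.xsl "
            '"replace=ead_2002_html_conv-1.xsl" 2>error.txt'
        ),
    }


if __name__ == "__main__":
    for name, command in batch_contents("archivista").items():
        print(f"{name}:\n  {command}")
```

The printed lines can be pasted into the corresponding batch files. timestamp.bat is left out because it does not depend on the document name.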

3. Troubleshooting FAQ

  1. Why ...when I run SX.bat or SAX_2002_01.bat or SAX_HTML_01.bat ...xml file is 0 KB ...run time errors (...nothing works)?
    • These types of errors indicate that the batch file is pointing to a file that does not exist (incorrect path). Compare the file list given above with your c:\my_ead_2002_conv folder. More than likely, your zip utility created a folder for each zip file (sp_1_3_4.zip becomes folder sp_1_3_4) instead of extracting the contents to the root directory (c:\my_ead_2002_conv). SP has three folders that need to appear in the root directory (bin; doc; pubtext). SAXON has two files that need to appear in the root directory (instant.html; SAXON.exe). The executable file ead2002.exe has one folder (related_optional) and one file (ead.dtd) that need to appear in the root directory.
    • Does your file name have spaces in it? File names CANNOT have spaces in them.
    • Is your file (sgm) in the root directory?
    • Is there a typo in your batch file?
    • Did you change AND save each batch file?
    • Are your ENT files in the root directory?
    • Do your ENTITY paths need to be edited?
  2. Why is my document... ( ?_x.xml | ?_x.htm )... ?
    • EAD 2002 XML output has elements built into it (e.g., <langmaterial>). These builds are implemented to ensure DTD validation. They are also listed in the report.htm document. These elements may not follow the sequence found in the current draft of the Library of Congress best practices.
    • HTML output does not display the following elements:
      • EDITIONSTMT
      • DESCRULES
      • FRONTMATTER
    • HTML output is formatted in accordance with Library of Congress best practices.
    • timestamp.bat: The date for the conversion is determined by the correct syntax in the datestamp.txt and the it_settings.txt files. If your operating system has unusual configurations, ignore the timestamp.bat in the User Run. Edit the sample_datestamp.txt files and rename them accordingly. Please respect the syntax order of datestamp.txt (mm/dd/yyyy). The stylesheet will use the text in these files.
    • seal_x.txt: This file contains the text of the ENTITY pointing to the institution seal. Edit it accordingly. Do not add any other entities to this file. If you are not using a seal ENTITY, just remove the ENTITY from your ?_x.xml DOCTYPE.
  3. Why can't I get my document (?_x.xml) to validate?
    • Did you validate the sgm document before you began the conversion?
    • Did you check the report_x.htm to see if you have *unparsed* entities that you need to declare in the ?_x.xml DOCTYPE? You will need to copy these into the ?_x.xml DOCTYPE from the ?.xml DOCTYPE. XSL does not handle DOCTYPE or ENTITY.
    • If you moved your document out of the conversion folder, you will have to edit the DOCTYPE statement by adding the path to the ead.dtd file.
  4. I don't want... (EAD 2002 @* built into my ?_x.xml)
    • The @* that are built into the ?_x.xml document are in accordance with the best practices of the Library of Congress. They mainly affect two non-repeatable elements, EADHEADER and EADID. The user can edit these in the converted document (?_x.xml). Check the report_x.htm for specific details (search: 'EAD 2002').
  5. What happened to my &eacute; &#233; &#x00E9;?
    • James Clark's SX software converts all of these into Unicode bytes. If you see unreadable text in your editor, try viewing the document in Internet Explorer. SAXON carries the Unicode characters into your new document without change.
  6. Do I need my ENT files anymore?
    • James Clark's SX grabs all *parsed* entities (per SGML/XML design) replacing the entities with the values designated. *Unparsed* entities are retained in the ?.xml document DOCTYPE. After the SX.bat is run the ENT files are not needed for the ?.xml or ?_x.xml files.
    • Archive your sgm files. I suggest burning them to CDs.
  7. Why... line numbers... ?
    • You need an editor that displays line numbers. When I took Daniel Pitti's class (2001), we used NoteTab Pro.
    • Are you looking in the ?.xml document (NOT the sgm document and NOT the ?_x.xml document)?
    • The formatting of the ?.xml document may look strange and take some getting used to. James Clark's SX software puts line breaks inside the element tags (check the options in SX.bat). This ensures that no parameter errors occur when lines are too long. It also ensures that no characters are added to the data of the document. SAXON returns the tags to their normal format.
    • The built-in elements (REVISIONDESC; LANGUAGE; etc.) carry the line number of their anchor element. This is the closest possible line number that can be given for the build. The idea is that you look at the pre-build document, make changes, and then begin again.
  8. Why... links in my HTML document are not working... ?
    • Validate the SGML (1.0) or EAD (2002) before running the stylesheet. Every @target must have an @id. Every @id must be unique. Validation will check this.
    • Update your ead_2002_html_conv-1.xsl stylesheet.
    • External links not working could mean a number of things. As always, validate the EAD first. Validation will check any entity syntax. If the document is valid the error is out of scope for this help page. The URL may be dead or have a typo, etc.
    • If your document is valid and links still fail in the HTML, consult the list I have created of all @id that are respected by the ead_2002_html_conv-1.xsl stylesheet (ead_2002_html_get_id_report.xml).
    • If none of this helps, please contact me.
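
For the link question above, validation against the DTD is the authoritative check that every @target has an @id and that every @id is unique. As a quick pre-check, the rule can be sketched as follows. This is not part of the toolkit; the function name and the sample markup are hypothetical, and the input is assumed to be well-formed XML without undeclared entities.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def check_links(xml_text: str):
    """Return (duplicate ids, targets that match no id) for an XML string."""
    root = ET.fromstring(xml_text)
    ids = Counter(el.get("id") for el in root.iter()
                  if el.get("id") is not None)
    targets = {el.get("target") for el in root.iter()
               if el.get("target") is not None}
    duplicates = sorted(i for i, count in ids.items() if count > 1)
    dangling = sorted(t for t in targets if t not in ids)
    return duplicates, dangling


# Illustrative fragment only; it is not a complete or valid finding aid.
SAMPLE = """<ead>
  <archdesc level="collection">
    <scopecontent id="scope">
      <p>See <ref target="series1">Series 1</ref> and
         <ref target="series9">Series 9</ref>.</p>
    </scopecontent>
    <dsc type="combined">
      <c01 id="series1" level="series"/>
      <c01 id="series1" level="series"/>
    </dsc>
  </archdesc>
</ead>"""

if __name__ == "__main__":
    duplicates, dangling = check_links(SAMPLE)
    print("duplicate ids:", duplicates)
    print("targets without an id:", dangling)
```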

4. XSLT 1.0 Stylesheet Files:

These stylesheets are regularly updated. They are offered to users who already have the conversion toolkit set up and working.

Download the zip files and unpack them into your root directory.

  1. v1v2002_4.zip
    • v1v2002_4.xsl (converts EAD 1.0 to EAD 2002)
      • SAXON extensions used to generate report.htm
        • <saxon:output>
        • saxon:line-number()
        • saxon:node-set()
    • marc_lang_codes.zip
      • MARC_lang_codes-1.xml (look up table for MARC language codes)
    • dtd_path.zip
      • dtd_path_1.txt
        • Note: The System Identifier value is supplied by the contents of dtd_path_1.txt. If the user desires to change the path, edit dtd_path_1.txt accordingly. See: sample_dtd_path_1.txt for an example of full path syntax. If unpacked, this file could overwrite the customized user content of your current dtd_path_1.txt.
      • sample_dtd_path_1.txt
        • Contents: http://lcweb2.loc.gov/xmlcommon/dtds/ead2002/ead.dtd
  2. ead_2002_html_1.zip [Date: 21 August 2012]
    • ead_2002_html_conv-1.xsl (transforms EAD 2002 to HTML 4.01 loose.dtd)
      • Note: The HTML output is intended to be a basic representation of many but not all of the elements and attributes of the EAD 2002 XML document. Please check your ?_x.xml document in an editor for the actual code.
      • Uses: EAD_LINKS-1.xml (HTML color settings for each division)
      • Uses: EAD_SPECIAL_FA-1.xml (HTML nested tables when aggregate/eadid = true)
    • ead_2002_html_xml_1.zip
      • Note: These two files contain user nodes specifically created for Library of Congress coders. If unpacked, these files could overwrite the customized user XML documents.
      • EAD_LINKS-1.xml (HTML color settings for each division)
      • EAD_SPECIAL_FA-1.xml (HTML nested tables when aggregate/eadid = true)
  3. ead_2002_schema_html_1.zip [Date: 21 August 2012]
    • ead_2002_schema_html_conv-1.xsl (transforms EAD 2002 SCHEMA to HTML 4.01 loose.dtd)
      • Transforms EAD 2002 SCHEMA to HTML.
      • This stylesheet uses the same values for generating special finding aid formatting and link colors. Two variables in the stylesheet are set to persistent identifiers for these two documents. Users who are not Library of Congress EAD coders will need to change these values to point to their local documents.
      • Uses: $divLinksUrl : http://hdl.loc.gov/loc.music/eadmus.divLinks (HTML color settings for each division)
      • Uses: $specialFaUrl : http://hdl.loc.gov/loc.music/eadmus.specialFa (HTML nested tables when aggregate/eadid = true)
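
The note on dtd_path_1.txt says its contents supply the System Identifier. A minimal sketch of that relationship follows. It is not taken from the stylesheet: the function name is hypothetical, and a SYSTEM-only declaration is an assumption made for illustration, since the DOCTYPE the toolkit actually writes may also carry a public identifier.

```python
from pathlib import Path


def doctype_for(dtd_path_file, root_element="ead"):
    """Build a SYSTEM-only DOCTYPE from the path stored in a text file."""
    system_id = Path(dtd_path_file).read_text(encoding="utf-8").strip()
    return f'<!DOCTYPE {root_element} SYSTEM "{system_id}">'


if __name__ == "__main__":
    path_file = Path("dtd_path_1.txt")
    if path_file.exists():
        print(doctype_for(path_file))
```

If a converted document is moved out of the conversion folder, the same path value is what needs to change in its DOCTYPE statement.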

Post Conversion Implementation:

Instructions for using this toolkit to transform in-process EAD 2002 documents.

Post Conversion Setup:

  1. Follow Setup above.

Post Conversion Run:

  1. Follow Run above.
  2. Edit SAX_HTML_01.bat as indicated above.
  3. Run SAX_HTML_01.bat
    • Open error.txt (should be empty)
    • Open ?.htm in browser
  4. Troubleshooting FAQ

Special thanks to Daniel Pitti, instructor at the Rare Book School, University of Virginia.

Toolkit compiled by and XSL stylesheets written and designed by Michael Ferrando at the Library of Congress.

ισχύω πάντα εν ενδυναμουντι με σταυρωθέντι




The Library of Congress
August 21, 2011
