#!/usr/bin/perl -w
#
# TabularFormats.pm
#
# The main package:
#     package TabularFormats;
#     package pull_parser;
#     package ExpatNB;
#
# Data management:
#     package DataSchema;
#     package FieldDef;
#     package DataSource;
#     package DataReaders;
#     package DataCurrent;
#     package DataOptions;
#
# Written 2010-03-23 by Steven J. DeRose, sderose@acm.org, as csvFormat.pm
#     (many changes/improvements).
# 2012-03-30 sjd: Rename to TabularFormats.pm, major reorg.
#     More work on adding XSV (XML Tuples) support.
#     Refactor to have package per form, plus RecordDef and FieldDef.
#     Use sjdUtils. Integrate sniffing code from lessCSV.
# 2012-04-13 sjd: Track lastMessage.
# 2012-04-20 sjd: Debugging, cleaning up separation of sub-packages.
# 2012-04-25f sjd: Back to pretty much working. Start SAX i/f.
#     Let user actually *set* field-specific callbacks. Make setExpectedFields()
#     adjust fDefsByName to match. Add stubs for remaining formats.
# 2012-04-30 sjd: Put all options into main pkg, add opt() in all pkgs.
# 2012-05-23 sjd: Add PERL. Rationalize parseRecord... methods. Decide that
#     hash is the definitive internal form. Drop expectedFields notion.
#     Drop readRecordToHash() and readRecordToArray().
#     Do parseRecordToArray() by ...ToHash() and assemble in order.
# 2012-05-25 sjd: Option to set up col specs for XML table output.
#     Implement option and field datatype checking.
#     Better way to organize subclasses.
# 2012-05-29ff sjd: Pull in parsestringtoDOM() from FakeParser.pm. Shift methods
#     between EntityStack/EntityFrame/EntityDef. Implement ARFF. Escaping.
#     Fix @class for XML, attribute names for XSV output. Working for CSV again.
#     Drop {theRecord}, escapeMap option, add xDocument, -XMLDecl, systemId.
#     Move tfWarn and tfError into UNIVERSAL. Add assembleComment().
#     Make TF-level impl of all parseXXX() calls also do setRecord().
#     Add setFieldNumbersByPosition().
# 2012-06-04 sjd: Support options hash arg for parse_start, parsefile, parse.
#     Implement actual pull parsing. Make parse_more() etc. like XML::Parser.
# 2012-06-08 sjd: Add readBalanced(), readToUnquotedDelim().
#     Finish hooking up and documenting 'DataSource' package.
#     Finish readRecord() (incl. comments) for JSON, MANCH.
#     Make readRecord() really do exactly one record (sexp, xml, mime, manch...
# 2012-06-11 sjd: Make some use of null value settings, esp. for output.
# 2012-06-13 sjd: Add postProcessFields(), splitter/joiner. Fix JSON escaping.
#     Add notion of sub-fields. Trap CSV quoting error for output.
#     Add assembleComment() to more formats.
# 2012-06-21 sjd: Sync w/ TabularFormats.pm changes. Add option help strings.
# 2012-07-05 sjd: Implement ARFF readAndParseHeader(). Add readRealLine().
#     Add XML 'attrFields' option and support. Improve setFieldPosition()
#     and make width arg optional.
# 2012-07-13 sjd: Check for nil from getFieldDef(); create field names at need.
#     setFieldPositions(), getAvailableWidth(), getNearestFollowingFieldDef().
# 2012-07-30ff sjd: Better sub-field handling. Catch undef names. Fix
#     postProcessFields(). Error-check arg to setRecordFromXXX().
# 2012-08-14 sjd: getOptions() call getFormatImplementation, for getOptions().
# 2012-10-29 sjd: Improve unescaping.
# 2012-11-02 sjd: Make args for assembleField() consistent. "FIXED"->"COLUMNS".
# 2012-11-26 sjd: Add open() to pass through to DataSource.
#     Add DataSource::binmode(). Discard FieldDef->{fTruncate}.
# 2012-12-17ff sjd: Add getFieldsArray() and getFieldsHash(). Fix bug where it
#     lost [0] at one point. Treat theFields consistently as a hash.
#     Make assembleRecordFromHash default to current data. Move escapeJson to
#     sjdUtils. Omit empty fields for XSV output. Prettify output layouts.
#     Start fixing header handling.
# 2012-12-19 sjd: Drop readHeader() for readAndParseHeader(). Make consistent.
#     Fiddle w/ XmlTuples API to make like the rest.
# 2013-01-18 sjd: Add tell(), mainly for RecordFile.pm.
# 2013-02-06ff sjd: Don't call sjdUtils::SUset("verbose"). Work on -stripRecord.
#     Break out DataSchema package, and tell it and DataSource what they need,
#     so they don't have to know 'owner' any more. Clean up virtuals a bit.
#     Also break out DataOptions and DataCurrent packages. Fix order of events
#     in pull-parser interface. Format-support packages to separate file.
#     Support repetition indicators on datatypes.
# 2013-02-14 sjd: Sync package DataSource's API, closer to RecordFile.pm.
# 2013-04-02 sjd: Forward a few more calls down to sub-packages (for tab2xml).
#     Add dprev for prior data record. Centralize setFieldNamesFromArray() call
#     from parseHeader() and readAndParseHeader() -- not in TFormatSupport.pm.
# 2013-04-03 sjd: Make getField() create unknown fields as needed.
#     Let addField and FieldDef::new take some optional params.
# 2013-04-23 sjd: Add package prefixes to sub dcls. Debug getFieldValue().
#     Distinguish getNSchemaFields() vs getNCurrentFields().
# 2013-06-03: Special-case XSV, which provides its own input handling.
#     Make sure -basicType shows up with addOptionsToGetoptLongArg().
# 2013-06-17ff: Add TabularFormats::getOptionsHash(). Make tfError() print
#     package and function names. Fix \-codes in options. Add sniffFormat().
#
# To do:
#     Fix handling of XSV headers.
#     If no header, setRecordFromArray messed up? Cf cutData on GNG.
#     Add supportsFieldNames().
#     Move DataSource into Recordfile?
#     Handle blank records better (integrate readRealLine).
#     Option to default specific fields to what they were in dprev!
#
#     Do something with date formats.
#     Protect against UTF encoding errors.
#     Replace getRecordAsString (and Array).
#     Way to control order of writing fields where it doesn't matter:
#         xml, xsv (done?), json, mime
#     Integrate into C, C, C.
#     Right-justify numeric fields
#     FormatSniffer.
#
# Format-specific:
#     COLUMNS: add fixFieldWidths() to setFieldPosition().
#     COLUMNS: easier way to pass in column positions?
#     COLUMNS: support reorderings in assembleRecordFromArray()
#     JSON: Option to write JSON arrays vs. dicts.
#     MANCH: Manage additional keywords per tupleset.
#     MANCH: add options for TypeName(s), SuperClass, ID,
#         inclusions. Implement header, prettyPrint.
#     MANCH, XML: finish readAndParseHeader().
#     XML: support tag@attr values with attrFields?
#     XML: Option to parse up attribute values as subfields.
#     XML: Select elements by QGI@, hand back list of children of each???
#     XML: switch to XML::Parser, HTML::Parser, etc.
#     XSV: output: omit defaults.
#
# Low priority:
#     Write merge: n files, (compound?) key designation per, field renaming.
#     Add compound-key-reifier to deriveField.
#     Rotate embedded layer (esp. for SEXP, XML, JSON, etc.)
#     Switch messaging to use sjdUtils?
#     Way to get the original offset/length of each field in record?
#     Add quick options to set tags for docbook, tei, nlm (in and out)
#     Improve handling of missing/extra/duplicate fields.
#     Should chooseFormat go into TFormatSupport? Maybe ditch subclassing there?
#
# Additional formats? See TFormatSupport.pm.
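#
# Typical use (a sketch only: the method names below appear in the change
# log and POD, but the exact signatures of readRecord()/parseRecordToHash(),
# the option value shown, and the field name are assumptions):
#
#     use TabularFormats;
#     my $tf = new TabularFormats("CSV", { "fieldSep" => "," });
#     $tf->open($ARGV[0]) || die "Can't open input file.\n";
#     $tf->readAndParseHeader();                # if the data has a header
#     while (defined (my $rec = $tf->readRecord())) {
#         my $fields = $tf->parseRecordToHash($rec);
#         print($fields->{"someField"} . "\n");
#     }
#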
#
###############################################################################
# Messaging ("UNIVERSAL" is inherited by everything)
#
use strict;
use feature 'unicode_strings';

use sjdUtils;

sjdUtils::try_module("XML::DOM") || warn
    "Can't access CPAN XML::DOM module.\n";
sjdUtils::try_module("HTML::Entities") || warn
    "Can't access CPAN HTML::Entities module.\n";
#sjdUtils::try_module("MIME::QuotedPrint") || warn
#    "Can't access CPAN MIME::QuotedPrint module.\n";

use TFormatSupport;
sjdUtils::try_module("Datatypes") || warn
    "Can't access sjd Datatypes module.\n";
sjdUtils::try_module("XmlTuples") || warn
    "Can't access sjd XmlTuples module (needed for XSV support).\n";
sjdUtils::try_module("FakeParser") || warn
    "Can't access sjd FakeParser module (needed for quasi-XML support).\n";

our $VERSION = "3.0";

# SAX (XML parser) events (just the ones we actually generate)
#
my %saxEvents = (
    "Init"    => 1,
    "Fin"     => 1,
    "Start"   => 1,
    "End"     => 1,
    "Text"    => 1,
    "Default" => 1,
);

# List of supported formats
#
my @bt = qw/ARFF COLUMNS CSV JSON MIME MANCH PERL SEXP XSV XML/;
my $formatNamesExpr = join("|", @bt);

our $lastMessage = "";
our $tfMsgLevel  = 0;

sub UNIVERSAL::tfWarn {
    my ($level, $m1, $m2) = @_;
    if (!$m1) { $m1 = ""; }
    if (!$m2) { $m2 = ""; }
    $lastMessage = $m1 . $m2;
    ($tfMsgLevel >= $level) || return;
    sjdUtils::vMsg(0, $m1, $m2);
}

sub UNIVERSAL::tfError {
    my ($level, $m1, $m2) = @_;
    if (!$m1) { $m1 = ""; }
    if (!$m2) { $m2 = ""; }
    $lastMessage = $m1 . $m2;
    sjdUtils::SUset("locs", 4);
    sjdUtils::eMsg($level, sjdUtils::whereAmI(1) . ": " . $m1, $m2);
    #sjdUtils::eMsg($level, ((caller(0))[3]).$m1, $m2);
    ($level < 0) && die " ******* Error is fatal *******\n";
}

sub UNIVERSAL::getLastMessage {
    return($lastMessage);
}

package DataSchema;
package FieldDef;
package DataSource;
package DataCurrent;
package DataOptions;

package TabularFormats;

###############################################################################
###############################################################################
###############################################################################
# The main package.
#
# Instantiates one of the specific formats, and dispatches calls
# to it. The top-level package handles messaging, options, field defs,
# a current data record, and some interfaces (like SAX).
# The others handle format-specific i/o.
#
sub TabularFormats::new {
    my ($class, $format, $optionsHash) = @_;
    if (!$format) { $format = "CSV"; }        # Manage the 'basicType'
    if ($optionsHash && ref($optionsHash) ne "HASH") {
        UNIVERSAL::tfError(
            0, "Arg 2 to constructor (options) is not a hash.");
        return(undef);
    }
    my $self = {
        format        => $format,   # Name of format in use
        formatImpl    => undef,     # -> instance for basicType impl
        dsrc          => undef,     # -> DataSource instance
        dsch          => undef,     # -> DataSchema instance
        dprev         => undef,     # -> Prior dcur object.
        dcur          => undef,     # -> DataCurrent instance
        dopt          => undef,     # -> DataOptions instance
        parsedARecord => 0,         # Finished w/ record 1 yet?
        saxCallbacks  => {},        # In case they want to parse this way
        lastMessage   => "",        # Most recent error message
        gaveObsMsg    => 0,         # Already showed readRecord obsolete msg?
    }; # self
    bless $self, $class;

    $self->{dopt} = new DataOptions();
    if (defined $optionsHash && ref($optionsHash) eq "HASH") {
        $self->{dopt}->setOptionsFromHash($optionsHash);
    }
    $self->{dsrc} = new DataSource();
    $self->{dsch} = new DataSchema();
    $self->{dcur} = new DataCurrent();
    $self->chooseFormat($format);
    return($self);
} # new

sub TabularFormats::reset { # TabularFormats
    my ($self) = @_;
    $self->{dsch}->reset();
    $self->{dcur}->reset();
    if ($self->{dprev}) { $self->{dprev}->reset(); }
    $self->{lastMessage} = "";
}


###############################################################################
# Facilitate callers supporting our options, by providing a single method
# that adds them to a hash for the argument to Getopt::Long::GetOptions().
# The options invoke commands that store their values back here, so caller
# doesn't have to know about them at all.
# Options already defined before calling us are ok (warning on conflict).
#
sub TabularFormats::addOptionsToGetoptLongArg {
    my ($self,
        $getoptHash,              # The hash to pass to GetOptions()
        $prefix                   # String to put on front of option names
        ) = @_;
    if (!defined $prefix) { $prefix = ""; }
    $self->{optionsPrefix} = $prefix;
    (ref($getoptHash) eq "HASH") || UNIVERSAL::tfError(
        -1, "Must provide a hashref.");
    my %getOptTypeMap = (
        "boolean" => "!",
        "integer" => "=i",
        "BaseInt" => "=o",
        "string"  => "=s",
        "Name"    => "=s",
    );
    my $i = 0;
    for my $name (sort keys(%{$self->{dopt}->{options}})) {
        $i++;
        ($name =~ m/^\w+$/) ||
            UNIVERSAL::tfError(0, "Bad option name '$name'");
        my $dt = $self->{dopt}->getOptionType($name);
        my $suffix = $getOptTypeMap{$dt};
        if (!$suffix) {
            UNIVERSAL::tfError(
                0, "Unknown type '$dt' for option '$name'.\n " .
                "Known types: (" . join(", ", keys(%getOptTypeMap)) . ").");
            $suffix = "!";
        }
        if (defined $getoptHash->{"$prefix$name$suffix"}) {
            UNIVERSAL::tfError(0, "'$prefix$name$suffix' already in hash.");
        }
        $getoptHash->{"$prefix$name$suffix"} =
            sub { $self->setOption("$name", $_[1]); };
        #UNIVERSAL::tfWarn(
        #    3, sprintf("  Adding %-16s => %s", "\"$prefix$name$suffix\"",
        #    "sub { \$self->setOption('" . $name . "',\t\$_[1]); }"));
    }
    return($i);
} # addOptionsToGetoptLongArg

sub TabularFormats::SniffFormat {
    my ($self, $path) = @_;
    (my $ext = lc($path)) =~ s/^.*\.//;    # Extension should be sufficient
    if ($ext eq "csv") {
        return("CSV\t" . $self->{dopt}->getOption("fieldSep"));
    }
    if ($ext eq "tsv")                  { return("CSV\t\t"); }
    if ($ext =~ m/^(zip|Z|gz|lz|tar)$/) { return("COMPRESSED\t$ext"); }
    if ($ext =~ m/^(xlsx)$/)            { return("XLSX"); }
    if ($ext =~ m/^(htm|html|xml)$/)    { return("XML"); }
    if ($ext =~ m/^(mbox)$/)            { return("MIME"); }
    if ($self->hasFormat($ext))         { return($ext); }

    # Unix 'file' command
    my $ufile = `file $path`;
    if ($ufile !~ m/ text/)      { return(undef); }
    if ($ufile =~ m/(HTML|XML)/) { return("XML"); }
    if ($ufile =~ m/mail text/)  { return("MIME"); }

    # Sniff the beginning of the data
    my $head = `head -n 10 $path`;
    if ($head =~ m/\n\@RELATION/si) { return("ARFF"); }
    if ($head =~ m/


=for nobody ===================================================================

=head1 Managing options

=over

=item * B<addOptionsToGetoptLongArg>I<(hashRef, prefix?)>

Add all of this package's options to the hash at I<hashRef>, in the form you
would pass to Perl's C<Getopt::Long> package. The options will be set up to
store their values directly to the C<TabularFormats> instance, via the
I<setOption>() method.

If I<prefix> is defined, it will be added to the beginning of each option
name; this allows you to avoid name conflicts with the caller, or between
multiple instances of C<TabularFormats> (for example, one for input and one
for output).

If an option is already present in the hash (note that the key, as always
for C<Getopt::Long>, includes aliases and suffixes like "=s"), a warning is
issued and the new one replaces the old.

Returns: The number of options added.
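For example, a caller might merge this package's options into its own
C<Getopt::Long> setup like this (a minimal sketch; the C<verbose> option and
the C<tf_> prefix are the caller's own choices, not part of this package):

    use Getopt::Long;
    use TabularFormats;

    my $tf      = new TabularFormats("CSV");
    my $verbose = 0;
    my %getoptHash = (
        "verbose|v+" => \$verbose,      # the caller's own options
    );
    my $nAdded = $tf->addOptionsToGetoptLongArg(\%getoptHash, "tf_");
    GetOptions(%getoptHash) || die "Bad options.\n";
    # Options such as "--tf_fieldSep ';'" are now stored back into $tf
    # via setOption(), with no further work by the caller.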
=back

=for nobody ===================================================================

=for nobody ===================================================================

=head1 Internal package "DataSource"

Get a reference to the active instance of this package, from the
I<TabularFormats> instance, using I().

All of the format readers read their data through an internal package called
C<DataSource>. It provides this interface (which can also be used
independently). This will likely be removed from here and integrated into
C or C.

=over

=item B()

=item B<open>(path)

Open the file at I<path> and make it the current source of data. Any
previously opened or attached file is closed.

Returns: undef on failure, otherwise the file handle to the open file.

=item B<close>()

Close any currently-open input file, and discard any pushed-back or added
text.

=item B<seek>

Move the open file to position I<n>, and clear any pushback data. I<whence>
is 0 to count from start of file, 1 to count forward, and -1 to count
backward from end of file.

=item B<tell>

Return the current offset into the open file.

=item B<attach>(self, fh)

Make the file handle I<fh> the current source of data. Any previously-open
file is detached.

=item B(self, text)

Add I<text> to be read. Any previously-attached or opened file is detached.
If there is text data still unread from prior I() or I() calls, the new
I<text> is appended.

=item B(self, text)

Add I<text> to be read I<before> any still-unread text from prior I() or I()
calls (if a file is open, it stays open).

=item B(self)

Read and return one physical line (terminated by \n). Input comes first from
the buffer, then from the open file if any.

=item B(self, commentDelim)

Read a I<logical> line, as defined for the active format. For example, some
types of CSV files permit newlines within quoted fields, and this method
accounts for that. Used by I<readRecord>.

=item B<readToUnquotedDelim>(self, endExpr, quoters, qdouble, escapes, comment)

Reads up to (but not including) the first unquoted occurrence of the regular
expression I<endExpr>. This is used to read to the ";" that ends a Perl
declaration, the "@DATA" that separates header and data in ARFF, etc.
The parameters are as the like-named parameters of the following method,
plus:

=over

=item I<endExpr> -- a (Perl) regex, the first match to which ends the scan.

=back

=item B(self, openers, closers, quoters, qdouble, escapes, comment)

Set up the parameters needed for I<readBalanced> (q.v.). All parameters must
be provided, even if some or all are "". Openers and closers that are
quoted, doubled, or escaped when those parameters are in effect, do not
count towards balancing.

=over

=item I<openers> -- a string containing the characters that can open
expressions. Default: "(". Characters in corresponding positions in
I<openers> and I<closers> must correspond (for example, "([{" goes with
")]}", not "])}").

=item I<closers> -- a string containing the characters that can close
expressions. Default: ")". Characters in corresponding positions in
I<openers> and I<closers> must correspond (for example, "([{" goes with
")]}", not "])}").

=item I<quoters> -- a string containing characters that function as quotes,
disabling the effect of openers and closers within their scope.
Default: "\"".

=item I<qdouble> -- 0 or 1 to indicate whether 2 of the same quote
characters in a row count as data rather than closing an open quote group.

=item I<escapes> -- a string containing the characters that cause the
character following them to be treated as data rather than as an opener,
closer, quoter, escape, or comment.

=item I<comment> -- a string (just one) that (when not escaped or in quotes)
causes the rest of the physical line to be discarded as a comment.

=back

=item B<readBalanced>(self)

If extra parameters are found, then the preceding setup method will be
called with the same parameter list, before processing as usual.

Return text up through the next balance point in terms of parentheses,
brackets, braces, or similar delimiters. For example, this method can read a
complete SEXP S-expression, or a complete JSON group, allowing for nested
constructs. If the expression ends in mid-line, the rest of the line is
pushed back to be read later.
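For example, one balanced S-expression could be pulled from a file like this
(a sketch: it relies on passing the setup parameters directly to
I<readBalanced>(), as described above; the file name and delimiter choices
are only illustrative):

    my $dsrc = new DataSource();
    $dsrc->open("input.sexp") || die "Can't open input.sexp.\n";

    # Arguments: openers, closers, quoters, qdouble, escapes, comment.
    my $expr = $dsrc->readBalanced("(", ")", "\"", 0, "\\", ";");
    print "First expression: $expr\n";
    $dsrc->close();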
=back

=for nobody ===================================================================

=for nobody ===================================================================

=for nobody ===================================================================

=head1 Internal package: DataSchema

Get a reference to the active instance of this package, from the
I<TabularFormats> instance, using I().

Many of these methods allow you to identify a specific field by either name
or number. Fields always have both.

=over

=item * B

Call this to enable the schema to create new fields on the fly, if it is
ever asked for a field that isn't known. This is particularly useful when a
file has no schema or header.

=item * B<getNSchemaFields>()

Return the number of fields known to the B<schema> (that is, the number of
existing field definitions). This is not necessarily the same as the number
of fields of the current record (for which see I<getNCurrentFields>()).

=item * B<addField>(name)

Append a field definition to the list of known fields. Returns the number of
fields defined so far (including the new one). See also I(), below, and the
example later in this section.

=item * B

Ensure that there are at least I<n> fields defined.

=item * B(n, name)

Change the name of field I<n> (name or number).

=item * B

Return the name of field I<n> (name or number).

=item * B<setFieldNamesFromArray>(arrayRef)

Rename fields en masse. The names in the array referenced by I<arrayRef>
will be assigned to the fields in the current field order (if there are more
names than defined fields, new fields will be quietly defined as needed).
I<arrayRef>->[0] should be undefined or empty. Undefined or empty elements
will not cause renaming of the corresponding fields.

B<Note>: Changing field names while in the middle of reading a file is
unwise, at least for formats that have explicit field names in the data (as
in many or most formats other than CSV and ARFF).

=item * B()

Return an array of the names of the fields, in field-number order. As
always, [0] will be present but empty.
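Putting a few of the preceding calls together, a schema for a file with no
header might be set up like this (a sketch; the field names are made up, and
a C<DataSchema> is constructed directly rather than fetched from a
C<TabularFormats> instance):

    my $dsch = new DataSchema();
    for my $name ("id", "surname", "givenName") {
        $dsch->addField($name);       # returns the number of fields so far
    }
    # Rename en masse; element [0] is present but unused, as always.
    $dsch->setFieldNamesFromArray([ "", "id", "last", "first" ]);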
=item * B(n, dtName)

Change the datatype of field I<n> (name or number). The names are as
supported by C<Datatypes>, which include the built-in XML Schema Datatypes
plus some extensions.

=item * B(n)

Get the datatype of field I<n> (name or number).

=item * B(n, defaultValue)

Set the default value for field I<n> (name or number). This will be filled
in when the field is missing in the input (in most formats, whitespace
counts as empty). For formats that have their own defaulting mechanism, this
operates I that mechanism. If the option is set, fields that match their
default will not have their values written to the output (this is only
supported for XSV so far).

Exactly what "missing" means depends on the specific format in use. For
example, XSV fields are identified by name so can be entirely omitted, while
CSV and COLUMNS necessitate a placeholder.

=item * B(n)

Returns the present default value for field I<n> (name or number).

=item * B(n, regex, joinerString)

Enable support for "sub-fields" (experimental). No splitting is done by
default.

On input, any field for which I<regex> has been set will be split() to make
an array. On output, any field whose value is an array reference will
combine the array elements into a single field.

Most of the supported formats do not define such a notion, so the output
field will simply be created by doing a Perl join() using the specified
I<joinerString>, and putting quotes around the outside. For example, if the
second field is a reference to an array of the first five integers and the
I<joinerString> is a space, for CSV the second field ends up as shown here:

    field1, "1 2 3 4 5", field3

For output to formats that have a notion of hierarchy, their syntax is used:

=over

=item * For XML, sub-elements are created using the name specified as
I<joinerString>. A typical example might be dividing table cells into "p" or
similar elements. If I<joinerString> contains a space, anything after the
space will be deleted when writing the end-tag; this allows specifying
attributes if desired (like 'p class="foo"').

=item * For JSON and Perl, the array elements will be separated by ", ", and
the whole list parenthesized (I<joinerString> is ignored).

=item * For SEXP, a parenthesized quoted list is created, with individual
items quoted if needed (I<joinerString> is ignored).

=back

=item * B(n, newNumber)

Move field I<n> (name or number) to field ordering place I<newNumber>. In
effect, the field is deleted from the ordering (with all later fields
therefore moving down by 1 position), and then inserted before field
I<newNumber> (with all later fields moving up by 1 position). See also I().

B<Note>: The fields of the current data record are always organized by name,
not number. So if you change field numbers after loading a record, the data
for the field is "moved" along with the field. However, the field ordering
is used when parsing formats that are defined by order (mainly ARFF,
COLUMNS, CSV, and some variants of SEXP not yet supported). So if you use
this method, any records you later parse will assume the new ordering.

To modify the order of fields in such formats, create two instances of this
package, one for input (where you never call this method), and one for
output (where you do). Define the desired fields for the output with
I<addField>(), perhaps copying them from the input instance, perhaps
renaming or reordering. Then call I() in the first instance, and pass the
returned hashes to I() in the second instance.

=item * B

Return the field number corresponding to field I<n> (name or number), or 0
if there is no such field.

=item * B<setFieldPositions>(startArrayRef)

Call I<setFieldPosition>() for each element in the array referenced by
I<startArrayRef> (as always, [0] should be present but empty). These entries
should be the start columns for the respective fields. The widths will be
set to be everything up to the next start column (except for the last one,
whose width is presently undefined). Field alignments will not be set.

=item * B<setFieldPosition>(n, startCol, width?, align?)

Sets the column range (counting from 1) that field I<n> (name or number)
occupies. This only applies when dealing with COLUMNS format. This is the
only way to tell COLUMNS where the fields are. Note that it uses a I<width>,
not an ending column.

If I<width> is omitted, it will be set to occupy everything up to just
before the nearest following field (or undef if no following field has been
defined yet).

The optional I<align> argument may be L (left), R (right), C (center), D
(decimal), or A (automatic), to specify how the data will be padded if
needed. "D" is limited to using "." to align on, and aligning that character
to the center of the permitted width.

This method checks for position conflicts (overlap). If there is a conflict
with an already-defined column range for another field, it returns 0
(otherwise 1).

B<Note>: This method does I<not> change any fields' sequence number; you may
want to call I<setFieldNumbersByPosition>() afterward to do so.

=item * B(n)

Return the starting column, width, and alignment for field I<n> (name or
number).

=item * B<setFieldNumbersByPosition>()

If you moved fields around with I<setFieldPosition>(), this will re-number
them (like I()) to be in ascending order by position.

=item * B<getAvailableWidth>(n)

Return the number of columns available, by searching for the nearest
following field by start position, and subtracting start positions.

=item * B<getNearestFollowingFieldDef>(n)

Return the field definition of the next field, in order of start position,
after field I<n> (name or number).
=item * B (experimental)

Attach a callback function I<theCallback> to field I<n> (name or number).
Whenever that field is parsed out of input data, the callback will be
called, being passed a reference to the TabularFormats instance calling it,
and the string form of the field value, and the returned value will be used
in place of the value passed:

    theCallback($tf, $s)

B<Note>: There should be a way for the callback to do internal parsing and
return more than one field; but there isn't. However, the callback can do
explicit calls to I<< $tf->setFieldValue($n, $x) >>.

This feature is not yet integrated with sub-fields/splitters (cf), and the
result if you use both is undefined.

=back

=for nobody ===================================================================

=for nobody ===================================================================

=head1 Internal package "DataCurrent"

This package keeps the fields of the current data record. Get a reference to
the active instance of this package, from the I<TabularFormats> instance,
using I().

B<Note>: parseXXX methods of C<TabularFormats> modify the current data
record automatically; you don't need to call I<setRecord>() again after (for
example) each parse call.

=over

=item B<setRecordFromArray>(values, names)

Copy the data items from the array referenced by I<values> into the current
data record. It copies only the fields named in the array referenced by
I<names>, in that order (skipping [0]). No checking against the list of
defined fields is done (this package is not directly related to DataSchema).

=item B(hRef)

Copy the data items from the hash referenced by I<hRef> into the current
data record. Items not present in the hash are unchanged (clear the record
first if desired).

=item B()

Undefine all fields for the current data record.

=item B<setFieldValue>(n, value)

Change the value stored for field I<n> (name or number) of the current data
record.

=item B<getFieldValue>(n)

Return the value stored for field I<n> (name or number) of the current data
record.

=item B<getRecordAsString>()

Return the current data record's field values as a string in the appropriate
format.

=item B(names)

Return some or all of the current data record's field values as an array;
[0] will be "" as always; the rest of the array will be filled in by the
values of the fields named in the array referenced by I<names>. Any
nonexistent names will result in undefined array elements.

=item B()

Return a hash of the current data record's fields.

=back
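For example, a record can be built up by hand and then serialized in the
chosen output format (a sketch; it assumes that I<addField>() and
I<setFieldValue>() are forwarded by the top-level object, and that
I<assembleRecordFromHash>() with no argument uses the current data record,
as the change log suggests; the values are made up):

    my $tf = new TabularFormats("CSV");
    $tf->addField("id");
    $tf->addField("name");

    $tf->setFieldValue("id",   "A37");
    $tf->setFieldValue("name", "Pat Example");
    print $tf->assembleRecordFromHash() . "\n";  # something like: A37,Pat Example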
=for nobody ===================================================================

=for nobody ===================================================================

=head1 Related commands

=over

=item * C and C are basic wrappers on top of this, that just convert from
one form to another. The names are historical.

=item * C -- a somewhat similar conversion, but specialized for Penn
TreeBank files, which are kind of like SEXP but contain many other embedded
syntaxes, which this script also converts.

=item * C -- take a file and measure all the fields, then space-pad them so
they line up nicely. Can also do box-drawing in ASCII or Unicode.

=item * C<XmlTuples.pm> -- support for the XSV format.

=item * C<RecordFile.pm> -- provides record-oriented i/o, with cached
offsets. Looks basically like a file, but handles logical rather than
physical records.

=item * C, C -- simple parsers for XML. Much like CPAN's C, but more
forgiving of errors, and thus not fully-conforming XML parsers. C also
supports some extra minimization conventions, especially for tables.

=item * Some sjd utilities that use C<TabularFormats>: C, C, C, C, C, C, C,
C, C, C, C, C, C, C.

=item * Some OA utilities that use C<TabularFormats>: C, C, C, C, C, C.

=item * Some utilities that may not give access to all TF options yet: C*,
C*, C* (unfinished).

=back

=for nobody ===================================================================

=head1 Known bugs and limitations

See also C<TFormatSupport.pm>.

=over

=item * Not safe against UTF-8 encoding errors. Use C if needed.

=item * Leading spaces on records are not reliably stripped.

=item * Particular formats set their own values for the I option. This means
you can't override it until after calling I, which is annoying.

=item * The I option is supported for JSON, Perl, XML, and XSV. For some
other formats it is not clear how to escape non-ASCII characters. ARFF
appears to provide no way at all. MIME headers use I<quoted-printable> form,
but support for full Unicode is not yet finished.

=item * The behavior if using regexes rather than strings for I, I, I, etc.,
for CSVs is undefined. Most likely it will work ok for input, but not for
output.

=item * Support for decoding HTML entity references is implemented but
commented out; to use it, uncomment things starting C<HTML::Entities> and
install the eponymous CPAN package.

=item * Datatype checking is experimental.

=item * The behavior if a given field is found more than once in an input
record is undefined. This is only possible with some formats (essentially
those that identify fields by name, not position). Some options may be added
for this, perhaps taking the first or last, or concatenating them with some
separator, or serializing them somehow.

=back

=head1 Ownership

This work by Steven J. DeRose is licensed under a Creative Commons
Attribution-Share Alike 3.0 Unported License. For further information on
this license, see L<http://creativecommons.org/licenses/by-sa/3.0/>.

The author's present email is sderose at acm.org.

For the most recent version, see L.

=cut