#!/usr/bin/perl -w
#
# TabularFormats.pm
#
# The main package:
#     package TabularFormats;
#     package pull_parser;
#     package ExpatNB;
#
# Data management:
#     package DataSchema;
#     package FieldDef;
#     package DataSource;
#     package DataReaders;
#     package DataCurrent;
#     package DataOptions;
#
# Written 2010-03-23 by Steven J. DeRose, sderose@acm.org, as csvFormat.pm
#     (many changes/improvements).
# 2012-03-30 sjd: Rename to TabularFormats.pm, major reorg.
#     More work on adding XSV (XML Tuples) support.
#     Refactor to have package per form, plus RecordDef and FieldDef.
#     Use sjdUtils. Integrate sniffing code from lessCSV.
# 2012-04-13 sjd: Track lastMessage.
# 2012-04-20 sjd: Debugging, cleaning up separation of sub-packages.
# 2012-04-25f sjd: Back to pretty much working. Start SAX i/f.
#     Let user actually *set* field-specific callbacks. Make setExpectedFields()
#     adjust fDefsByName to match. Add stubs for remaining formats.
# 2012-04-30 sjd: Put all options into main pkg, add opt() in all pkgs.
# 2012-05-23 sjd: Add PERL. Rationalize parseRecord... methods. Decide that
#     hash is the definitive internal form. Drop expectedFields notion.
#     Drop readRecordToHash() and readRecordToArray().
#     Do parseRecordToArray() by ...ToHash() and assemble in order.
# 2012-05-25 sjd: Option to set up col specs for XML table output.
#     Implement option and field datatype checking.
#     Better way to organize subclasses.
# 2012-05-29ff sjd: Pull in parsestringtoDOM() from FakeParser.pm. Shift methods
#     between EntityStack/EntityFrame/EntityDef. Implement ARFF. Escaping.
#     Fix @class for XML, attribute names for XSV output. Working for CSV again.
#     Drop {theRecord}, escapeMap option, add xDocument, -XMLDecl, systemId.
#     Move tfWarn and tfError into UNIVERSAL. Add assembleComment().
#     Make TF-level impl of all parseXXX() calls also do setRecord().
#     Add setFieldNumbersByPosition().
# 2012-06-04 sjd: Support options hash arg for parse_start, parsefile, parse.
#     Implement actual pull parsing. Make parse_more() etc. like XML::Parser.
# 2012-06-08 sjd: Add readBalanced(), readToUnquotedDelim().
#     Finish hooking up and documenting 'DataSource' package.
#     Finish readRecord() (incl. comments) for JSON, MANCH.
#     Make readRecord() really do exactly one record (sexp, xml, mime, manch...
# 2012-06-11 sjd: Make some use of null value settings, esp. for output.
# 2012-06-13 sjd: Add postProcessFields(), splitter/joiner. Fix JSON escaping.
#     Add notion of sub-fields. Trap CSV quoting error for output.
#     Add assembleComment() to more formats.
# 2012-06-21 sjd: Sync w/ TabularFormats.pm changes. Add option help strings.
# 2012-07-05 sjd: Implement ARFF readAndParseHeader(). Add readRealLine().
#     Add XML 'attrFields' option and support. Improve setFieldPosition()
#     and make width arg optional.
# 2012-07-13 sjd: Check for nil from getFieldDef(); create field names at need.
#     setFieldPositions(), getAvailableWidth(), getNearestFollowingFieldDef().
# 2012-07-30ff sjd: Better sub-field handling. Catch undef names. Fix
#     postProcessFields(). Error-check arg to setRecordFromXXX().
# 2012-08-14 sjd: getOptions() call getFormatImplementation, for getOptions().
# 2012-10-29 sjd: Improve unescaping.
# 2012-11-02 sjd: Make args for assembleField() consistent. "FIXED"->"COLUMNS".
# 2012-11-26 sjd: Add open() to pass through to DataSource.
#     Add DataSource::binmode(). Discard FieldDef->{fTruncate}.
# 2012-12-17ff sjd: Add getFieldsArray() and getFieldsHash(). Fix bug where it
#     lost [0] at one point. Treat theFields consistently as a hash.
#     Make assembleRecordFromHash default to current data. Move escapeJson to
#     sjdUtils. Omit empty fields for XSV output. Prettify output layouts.
#     Start fixing header handling.
# 2012-12-19 sjd: Drop readHeader() for readAndParseHeader(). Make consistent.
#     Fiddle w/ XmlTuples API to make like the rest.
# 2013-01-18 sjd: Add tell(), mainly for RecordFile.pm.
# 2013-02-06ff sjd: Don't call sjdUtils::SUset("verbose"). Work on -stripRecord.
#     Break out DataSchema package, and tell it and DataSource what they need,
#     so they don't have to know 'owner' any more. Clean up virtuals a bit.
#     Also break out DataOptions and DataCurrent packages. Fix order of events
#     in pull-parser interface. Format-support packages to separate file.
#     Support repetition indicators on datatypes.
# 2013-02-14 sjd: Sync package DataSource's API, closer to RecordFile.pm.
# 2013-04-02 sjd: Forward a few more calls down to sub-packages (for tab2xml).
#     Add dprev for prior data record. Centralize setFieldNamesFromArray() call
#     from parseHeader() and readAndParseHeader() -- not in TFormatSupport.pm.
# 2013-04-03 sjd: Make getField() create unknown fields as needed.
#     Let addField and FieldDef::new take some optional params.
# 2013-04-23 sjd: Add package prefixes to sub dcls. Debug getFieldValue().
#     Distinguish getNSchemaFields() vs getNCurrentFields().
# 2013-06-03: Special-case XSV, which provides its own input handling.
#     Make sure -basicType shows up with addOptionsToGetoptLongArg().
# 2013-06-17ff: Add TabularFormats::getOptionsHash(). Make tfError() print
#     package and function names. Fix \-codes in options. Add sniffFormat().
#
# To do:
#     Fix handling of XSV headers.
#     If no header, setRecordFromArray messed up? Cf cutData on GNG.
#     Add supportsFieldNames().
#     Move DataSource into Recordfile?
#     Handle blank records better (integrate readRealLine).
#     Option to default specific fields to what they were in dprev!
#
#     Do something with date formats.
#     Protect against UTF encoding errors.
#     Replace getRecordAsString (and Array).
#     Way to control order of writing fields where it doesn't matter:
#         xml, xsv (done?), json, mime
#     Integrate into C, C, C.
#     Right-justify numeric fields
#     FormatSniffer.
#
# Format-specific:
#     COLUMNS: add fixFieldWidths() to setFieldPosition().
#     COLUMNS: easier way to pass in column positions?
#     COLUMNS: support reorderings in assembleRecordFromArray()
#     JSON: Option to write JSON arrays vs. dicts.
#     MANCH: Manage additional keywords per tupleset.
#     MANCH: add options for TypeName(s), SuperClass, ID,
#         inclusions. Implement header, prettyPrint.
#     MANCH, XML: finish readAndParseHeader().
#     XML: support tag@attr values with attrFields?
#     XML: Option to parse up attribute values as subfields.
#     XML: Select elements by QGI@, hand back list of children of each???
#     XML: switch to XML::Parser, HTML::Parser, etc.
#     XSV: output: omit defaults.
#
# Low priority:
#     Write merge: n files, (compound?) key designation per, field renaming.
#     Add compound-key-reifier to deriveField.
#     Rotate embedded layer (esp. for SEXP, XML, JSON, etc.)
#     Switch messaging to use sjdUtils?
#     Way to get the original offset/length of each field in record?
#     Add quick options to set tags for docbook, tei, nlm (in and out)
#     Improve handling of missing/extra/duplicate fields.
#     Should chooseFormat go into TFormatSupport? Maybe ditch subclassing there?
#
# Additional formats? See TFormatSupport.pm.
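#
# Typical use (a sketch only: the method names below appear in the change
# log and POD, but the exact signatures of readRecord()/parseRecordToHash(),
# the option value shown, and the field name are assumptions):
#
#     use TabularFormats;
#     my $tf = new TabularFormats("CSV", { "fieldSep" => "," });
#     $tf->open($ARGV[0]) || die "Can't open input file.\n";
#     $tf->readAndParseHeader();                # if the data has a header
#     while (defined (my $rec = $tf->readRecord())) {
#         my $fields = $tf->parseRecordToHash($rec);
#         print($fields->{"someField"} . "\n");
#     }
#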
#
###############################################################################
# Messaging ("UNIVERSAL" is inherited by everything)
#
use strict;
use feature 'unicode_strings';

use sjdUtils;

sjdUtils::try_module("XML::DOM") || warn
    "Can't access CPAN XML::DOM module.\n";
sjdUtils::try_module("HTML::Entities") || warn
    "Can't access CPAN HTML::Entities module.\n";
#sjdUtils::try_module("MIME::QuotedPrint") || warn
#    "Can't access CPAN MIME::QuotedPrint module.\n";

use TFormatSupport;
sjdUtils::try_module("Datatypes") || warn
    "Can't access sjd Datatypes module.\n";
sjdUtils::try_module("XmlTuples") || warn
    "Can't access sjd XmlTuples module (needed for XSV support).\n";
sjdUtils::try_module("FakeParser") || warn
    "Can't access sjd FakeParser module (needed for quasi-XML support).\n";

our $VERSION = "3.0";

# SAX (XML parser) events (just the ones we actually generate)
#
my %saxEvents = (
    "Init"    => 1,
    "Fin"     => 1,
    "Start"   => 1,
    "End"     => 1,
    "Text"    => 1,
    "Default" => 1,
);

# List of supported formats
#
my @bt = qw/ARFF COLUMNS CSV JSON MIME MANCH PERL SEXP XSV XML/;
my $formatNamesExpr = join("|", @bt);

our $lastMessage = "";
our $tfMsgLevel  = 0;

sub UNIVERSAL::tfWarn {
    my ($level, $m1, $m2) = @_;
    if (!$m1) { $m1 = ""; }
    if (!$m2) { $m2 = ""; }
    $lastMessage = $m1 . $m2;
    ($tfMsgLevel >= $level) || return;
    sjdUtils::vMsg(0, $m1, $m2);
}

sub UNIVERSAL::tfError {
    my ($level, $m1, $m2) = @_;
    if (!$m1) { $m1 = ""; }
    if (!$m2) { $m2 = ""; }
    $lastMessage = $m1 . $m2;
    sjdUtils::SUset("locs", 4);
    sjdUtils::eMsg($level, sjdUtils::whereAmI(1) . ": " . $m1, $m2);
    #sjdUtils::eMsg($level, ((caller(0))[3]).$m1, $m2);
    ($level < 0) && die " ******* Error is fatal *******\n";
}

sub UNIVERSAL::getLastMessage {
    return($lastMessage);
}

package DataSchema;
package FieldDef;
package DataSource;
package DataCurrent;
package DataOptions;

package TabularFormats;

###############################################################################
###############################################################################
###############################################################################
# The main package.
#
# Instantiates one of the specific formats, and dispatches calls
# to it. The top-level package handles messaging, options, field defs,
# a current data record, and some interfaces (like SAX).
# The others handle format-specific i/o.
#
sub TabularFormats::new {
    my ($class, $format, $optionsHash) = @_;
    if (!$format) { $format = "CSV"; }        # Manage the 'basicType'
    if ($optionsHash && ref($optionsHash) ne "HASH") {
        UNIVERSAL::tfError(
            0, "Arg 2 to constructor (options) is not a hash.");
        return(undef);
    }
    my $self = {
        format        => $format,   # Name of format in use
        formatImpl    => undef,     # -> instance for basicType impl
        dsrc          => undef,     # -> DataSource instance
        dsch          => undef,     # -> DataSchema instance
        dprev         => undef,     # -> Prior dcur object.
        dcur          => undef,     # -> DataCurrent instance
        dopt          => undef,     # -> DataOptions instance
        parsedARecord => 0,         # Finished w/ record 1 yet?
        saxCallbacks  => {},        # In case they want to parse this way
        lastMessage   => "",        # Most recent error message
        gaveObsMsg    => 0,         # Already showed readRecord obsolete msg?
    }; # self
    bless $self, $class;

    $self->{dopt} = new DataOptions();
    if (defined $optionsHash && ref($optionsHash) eq "HASH") {
        $self->{dopt}->setOptionsFromHash($optionsHash);
    }
    $self->{dsrc} = new DataSource();
    $self->{dsch} = new DataSchema();
    $self->{dcur} = new DataCurrent();
    $self->chooseFormat($format);
    return($self);
} # new

sub TabularFormats::reset { # TabularFormats
    my ($self) = @_;
    $self->{dsch}->reset();
    $self->{dcur}->reset();
    if ($self->{dprev}) { $self->{dprev}->reset(); }
    $self->{lastMessage} = "";
}


###############################################################################
# Facilitate callers supporting our options, by providing a single method
# that adds them to a hash for the argument to Getopt::Long::GetOptions().
# The options invoke commands that store their values back here, so caller
# doesn't have to know about them at all.
# Options already defined before calling us are ok (warning on conflict).
#
sub TabularFormats::addOptionsToGetoptLongArg {
    my ($self,
        $getoptHash,              # The hash to pass to GetOptions()
        $prefix                   # String to put on front of option names
        ) = @_;
    if (!defined $prefix) { $prefix = ""; }
    $self->{optionsPrefix} = $prefix;
    (ref($getoptHash) eq "HASH") || UNIVERSAL::tfError(
        -1, "Must provide a hashref.");
    my %getOptTypeMap = (
        "boolean" => "!",
        "integer" => "=i",
        "BaseInt" => "=o",
        "string"  => "=s",
        "Name"    => "=s",
    );
    my $i = 0;
    for my $name (sort keys(%{$self->{dopt}->{options}})) {
        $i++;
        ($name =~ m/^\w+$/) ||
            UNIVERSAL::tfError(0, "Bad option name '$name'");
        my $dt = $self->{dopt}->getOptionType($name);
        my $suffix = $getOptTypeMap{$dt};
        if (!$suffix) {
            UNIVERSAL::tfError(
                0, "Unknown type '$dt' for option '$name'.\n " .
                "Known types: (" . join(", ", keys(%getOptTypeMap)) . ").");
            $suffix = "!";
        }
        if (defined $getoptHash->{"$prefix$name$suffix"}) {
            UNIVERSAL::tfError(0, "'$prefix$name$suffix' already in hash.");
        }
        $getoptHash->{"$prefix$name$suffix"} =
            sub { $self->setOption("$name", $_[1]); };
        #UNIVERSAL::tfWarn(
        #    3, sprintf("  Adding %-16s => %s", "\"$prefix$name$suffix\"",
        #    "sub { \$self->setOption('" . $name . "',\t\$_[1]); }"));
    }
    return($i);
} # addOptionsToGetoptLongArg

sub TabularFormats::SniffFormat {
    my ($self, $path) = @_;
    (my $ext = lc($path)) =~ s/^.*\.//;    # Extension should be sufficient
    if ($ext eq "csv") {
        return("CSV\t" . $self->{dopt}->getOption("fieldSep"));
    }
    if ($ext eq "tsv")                  { return("CSV\t\t"); }
    if ($ext =~ m/^(zip|Z|gz|lz|tar)$/) { return("COMPRESSED\t$ext"); }
    if ($ext =~ m/^(xlsx)$/)            { return("XLSX"); }
    if ($ext =~ m/^(htm|html|xml)$/)    { return("XML"); }
    if ($ext =~ m/^(mbox)$/)            { return("MIME"); }
    if ($self->hasFormat($ext))         { return($ext); }

    # Unix 'file' command
    my $ufile = `file $path`;
    if ($ufile !~ m/ text/)      { return(undef); }
    if ($ufile =~ m/(HTML|XML)/) { return("XML"); }
    if ($ufile =~ m/mail text/)  { return("MIME"); }

    # Sniff the beginning of the data
    my $head = `head -n 10 $path`;
    if ($head =~ m/\n\@RELATION/si) { return("ARFF"); }
    if ($head =~ m/


=for nobody ===================================================================

=head1 Managing options

=over

=item * B<addOptionsToGetoptLongArg>I<(hashRef, prefix?)>

Add all of this package's options to the hash at I<hashRef>, in the form you
would pass to Perl's C<Getopt::Long> package. The options will be set up to
store their values directly to the C<TabularFormats> instance, via the
I<setOption>() method.

If I<prefix> is defined, it will be added to the beginning of each option
name; this allows you to avoid name conflicts with the caller, or between
multiple instances of C<TabularFormats> (for example, one for input and one
for output).

If an option is already present in the hash (note that the key, as always
for C<Getopt::Long>, includes aliases and suffixes like "=s"), a warning is
issued and the new one replaces the old.

Returns: The number of options added.
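For example, a caller might merge this package's options into its own
C<Getopt::Long> setup like this (a minimal sketch; the C<verbose> option and
the C<tf_> prefix are the caller's own choices, not part of this package):

    use Getopt::Long;
    use TabularFormats;

    my $tf      = new TabularFormats("CSV");
    my $verbose = 0;
    my %getoptHash = (
        "verbose|v+" => \$verbose,      # the caller's own options
    );
    my $nAdded = $tf->addOptionsToGetoptLongArg(\%getoptHash, "tf_");
    GetOptions(%getoptHash) || die "Bad options.\n";
    # Options such as "--tf_fieldSep ';'" are now stored back into $tf
    # via setOption(), with no further work by the caller.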
=back

=for nobody ===================================================================

=for nobody ===================================================================

=head1 Internal package "DataSource"

Get a reference to the active instance of this package, from the
I<TabularFormats> instance, using I().

All of the format readers read their data through an internal package called
C<DataSource>. It provides this interface (which can also be used
independently). This will likely be removed from here and integrated into
C or C.

=over

=item B()

=item B<open>(path)

Open the file at I<path> and make it the current source of data. Any
previously opened or attached file is closed.

Returns: undef on failure, otherwise the file handle to the open file.

=item B<close>()

Close any currently-open input file, and discard any pushed-back or added
text.

=item B<seek>

Move the open file to position I<n>, and clear any pushback data. I<whence>
is 0 to count from start of file, 1 to count forward, and -1 to count
backward from end of file.

=item B<tell>

Return the current offset into the open file.

=item B<attach>(self, fh)

Make the file handle I<fh> the current source of data. Any previously-open
file is detached.

=item B(self, text)

Add I<text> to be read. Any previously-attached or opened file is detached.
If there is text data still unread from prior I() or I() calls, the new
I<text> is appended.

=item B(self, text)

Add I<text> to be read I<before> any still-unread text from prior I() or I()
calls (if a file is open, it stays open).

=item B(self)

Read and return one physical line (terminated by \n). Input comes first from
the buffer, then from the open file if any.

=item B(self, commentDelim)

Read a I<logical> line, as defined for the active format. For example, some
types of CSV files permit newlines within quoted fields, and this method
accounts for that. Used by I<readRecord>.

=item B<readToUnquotedDelim>(self, endExpr, quoters, qdouble, escapes, comment)

Reads up to (but not including) the first unquoted occurrence of the regular
expression I<endExpr>. This is used to read to the ";" that ends a Perl
declaration, the "@DATA" that separates header and data in ARFF, etc.
The parameters are as the like-named parameters of the following method,
plus:

=over

=item I<endExpr> -- a (Perl) regex, the first match to which ends the scan.

=back

=item B(self, openers, closers, quoters, qdouble, escapes, comment)

Set up the parameters needed for I<readBalanced> (q.v.). All parameters must
be provided, even if some or all are "". Openers and closers that are
quoted, doubled, or escaped when those parameters are in effect, do not
count towards balancing.

=over

=item I<openers> -- a string containing the characters that can open
expressions. Default: "(". Characters in corresponding positions in
I<openers> and I<closers> must correspond (for example, "([{" goes with
")]}", not "])}").

=item I<closers> -- a string containing the characters that can close
expressions. Default: ")". Characters in corresponding positions in
I<openers> and I<closers> must correspond (for example, "([{" goes with
")]}", not "])}").

=item I<quoters> -- a string containing characters that function as quotes,
disabling the effect of openers and closers within their scope.
Default: "\"".

=item I<qdouble> -- 0 or 1 to indicate whether 2 of the same quote
characters in a row count as data rather than closing an open quote group.

=item I<escapes> -- a string containing the characters that cause the
character following them to be treated as data rather than as an opener,
closer, quoter, escape, or comment.

=item I<comment> -- a string (just one) that (when not escaped or in quotes)
causes the rest of the physical line to be discarded as a comment.

=back

=item B<readBalanced>(self)

If extra parameters are found, then the preceding setup method will be
called with the same parameter list, before processing as usual.

Return text up through the next balance point in terms of parentheses,
brackets, braces, or similar delimiters. For example, this method can read a
complete SEXP S-expression, or a complete JSON group, allowing for nested
constructs. If the expression ends in mid-line, the rest of the line is
pushed back to be read later.
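For example, one balanced S-expression could be pulled from a file like this
(a sketch: it relies on passing the setup parameters directly to
I<readBalanced>(), as described above; the file name and delimiter choices
are only illustrative):

    my $dsrc = new DataSource();
    $dsrc->open("input.sexp") || die "Can't open input.sexp.\n";

    # Arguments: openers, closers, quoters, qdouble, escapes, comment.
    my $expr = $dsrc->readBalanced("(", ")", "\"", 0, "\\", ";");
    print "First expression: $expr\n";
    $dsrc->close();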
=back

=for nobody ===================================================================

=for nobody ===================================================================

=for nobody ===================================================================

=head1 Internal package: DataSchema

Get a reference to the active instance of this package, from the
I<TabularFormats> instance, using I().

Many of these methods allow you to identify a specific field by either name
or number. Fields always have both.

=over

=item * B

Call this to enable the schema to create new fields on the fly, if it is
ever asked for a field that isn't known. This is particularly useful when a
file has no schema or header.

=item * B<getNSchemaFields>()

Return the number of fields known to the B<schema> (that is, the number of
existing field definitions). This is not necessarily the same as the number
of fields of the current record (for which see I<getNCurrentFields>()).

=item * B<addField>(name)

Append a field definition to the list of known fields. Returns the number of
fields defined so far (including the new one). See also I(), below, and the
example later in this section.

=item * B

Ensure that there are at least I<n> fields defined.

=item * B(n, name)

Change the name of field I<n> (name or number).

=item * B

Return the name of field I<n> (name or number).

=item * B<setFieldNamesFromArray>(arrayRef)

Rename fields en masse. The names in the array referenced by I<arrayRef>
will be assigned to the fields in the current field order (if there are more
names than defined fields, new fields will be quietly defined as needed).
I<arrayRef>->[0] should be undefined or empty. Undefined or empty elements
will not cause renaming of the corresponding fields.

B<Note>: Changing field names while in the middle of reading a file is
unwise, at least for formats that have explicit field names in the data (as
in many or most formats other than CSV and ARFF).

=item * B()

Return an array of the names of the fields, in field-number order. As
always, [0] will be present but empty.
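Putting a few of the preceding calls together, a schema for a file with no
header might be set up like this (a sketch; the field names are made up, and
a C<DataSchema> is constructed directly rather than fetched from a
C<TabularFormats> instance):

    my $dsch = new DataSchema();
    for my $name ("id", "surname", "givenName") {
        $dsch->addField($name);       # returns the number of fields so far
    }
    # Rename en masse; element [0] is present but unused, as always.
    $dsch->setFieldNamesFromArray([ "", "id", "last", "first" ]);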
=item * B(n, dtName)

Change the datatype of field I<n> (name or number). The names are as
supported by C<Datatypes>, which include the built-in XML Schema Datatypes
plus some extensions.

=item * B(n)

Get the datatype of field I<n> (name or number).

=item * B(n, defaultValue)

Set the default value for field I<n> (name or number). This will be filled
in when the field is missing in the input (in most formats, whitespace
counts as empty). For formats that have their own defaulting mechanism, this
operates I that mechanism. If the option is set, fields that match their
default will not have their values written to the output (this is only
supported for XSV so far).

Exactly what "missing" means depends on the specific format in use. For
example, XSV fields are identified by name so can be entirely omitted, while
CSV and COLUMNS necessitate a placeholder.

=item * B(n)

Returns the present default value for field I<n> (name or number).

=item * B(n, regex, joinerString)

Enable support for "sub-fields" (experimental). No splitting is done by
default.

On input, any field for which I<regex> has been set will be split() to make
an array. On output, any field whose value is an array reference will
combine the array elements into a single field.

Most of the supported formats do not define such a notion, so the output
field will simply be created by doing a Perl join() using the specified
I<joinerString>, and putting quotes around the outside. For example, if the
second field is a reference to an array of the first five integers and the
I<joinerString> is a space, for CSV the second field ends up as shown here:

    field1, "1 2 3 4 5", field3

For output to formats that have a notion of hierarchy, their syntax is used:

=over

=item * For XML, sub-elements are created using the name specified as
I<joinerString>. A typical example might be dividing table cells into "p" or
similar elements. If I<joinerString> contains a space, anything after the
space will be deleted when writing the end-tag; this allows specifying
attributes if desired (like 'p class="foo"').

=item * For JSON and Perl, the array elements will be separated by ", ", and
the whole list parenthesized (I<joinerString> is ignored).

=item * For SEXP, a parenthesized quoted list is created, with individual
items quoted if needed (I<joinerString> is ignored).

=back

=item * B(n, newNumber)

Move field I<n> (name or number) to field ordering place I<newNumber>. In
effect, the field is deleted from the ordering (with all later fields
therefore moving down by 1 position), and then inserted before field
I<newNumber> (with all later fields moving up by 1 position). See also I().

B<Note>: The fields of the current data record are always organized by name,
not number. So if you change field numbers after loading a record, the data
for the field is "moved" along with the field. However, the field ordering
is used when parsing formats that are defined by order (mainly ARFF,
COLUMNS, CSV, and some variants of SEXP not yet supported). So if you use
this method, any records you later parse will assume the new ordering.

To modify the order of fields in such formats, create two instances of this
package, one for input (where you never call this method), and one for
output (where you do). Define the desired fields for the output with
I<addField>(), perhaps copying them from the input instance, perhaps
renaming or reordering. Then call I() in the first instance, and pass the
returned hashes to I() in the second instance.

=item * B

Return the field number corresponding to field I<n> (name or number), or 0
if there is no such field.

=item * B<setFieldPositions>(startArrayRef)

Call I<setFieldPosition>() for each element in the array referenced by
I<startArrayRef> (as always, [0] should be present but empty). These entries
should be the start columns for the respective fields. The widths will be
set to be everything up to the next start column (except for the last one,
whose width is presently undefined). Field alignments will not be set.

=item * B<setFieldPosition>(n, startCol, width?, align?)

Sets the column range (counting from 1) that field I<n> (name or number)
occupies. This only applies when dealing with COLUMNS format. This is the
only way to tell COLUMNS where the fields are. Note that it uses a I<width>,
not an ending column.

If I<width> is omitted, it will be set to occupy everything up to just
before the nearest following field (or undef if no following field has been
defined yet).

The optional I<align> argument may be L (left), R (right), C (center), D
(decimal), or A (automatic), to specify how the data will be padded if
needed. "D" is limited to using "." to align on, and aligning that character
to the center of the permitted width.

This method checks for position conflicts (overlap). If there is a conflict
with an already-defined column range for another field, it returns 0
(otherwise 1).

B<Note>: This method does I<not> change any fields' sequence number; you may
want to call I<setFieldNumbersByPosition>() afterward to do so.

=item * B(n)

Return the starting column, width, and alignment for field I<n> (name or
number).

=item * B<setFieldNumbersByPosition>()

If you moved fields around with I<setFieldPosition>(), this will re-number
them (like I()) to be in ascending order by position.

=item * B<getAvailableWidth>(n)

Return the number of columns available, by searching for the nearest
following field by start position, and subtracting start positions.

=item * B<getNearestFollowingFieldDef>(n)

Return the field definition of the next field, in order of start position,
after field I<n> (name or number).
=item * B (experimental)

Attach a callback function I<theCallback> to field I<n> (name or number).
Whenever that field is parsed out of input data, the callback will be
called, being passed a reference to the TabularFormats instance calling it,
and the string form of the field value, and the returned value will be used
in place of the value passed:

    theCallback($tf, $s)

B<Note>: There should be a way for the callback to do internal parsing and
return more than one field; but there isn't. However, the callback can do
explicit calls to I<< $tf->setFieldValue($n, $x) >>.

This feature is not yet integrated with sub-fields/splitters (cf), and the
result if you use both is undefined.

=back

=for nobody ===================================================================

=for nobody ===================================================================

=head1 Internal package "DataCurrent"

This package keeps the fields of the current data record. Get a reference to
the active instance of this package, from the I<TabularFormats> instance,
using I().

B<Note>: parseXXX methods of C<TabularFormats> modify the current data
record automatically; you don't need to call I<setRecord>() again after (for
example) each parse call.

=over

=item B<setRecordFromArray>(values, names)

Copy the data items from the array referenced by I<values> into the current
data record. It copies only the fields named in the array referenced by
I<names>, in that order (skipping [0]). No checking against the list of
defined fields is done (this package is not directly related to DataSchema).

=item B(hRef)

Copy the data items from the hash referenced by I<hRef> into the current
data record. Items not present in the hash are unchanged (clear the record
first if desired).

=item B()

Undefine all fields for the current data record.

=item B<setFieldValue>(n, value)

Change the value stored for field I<n> (name or number) of the current data
record.

=item B<getFieldValue>(n)

Return the value stored for field I<n> (name or number) of the current data
record.

=item B<getRecordAsString>()

Return the current data record's field values as a string in the appropriate
format.

=item B(names)

Return some or all of the current data record's field values as an array;
[0] will be "" as always; the rest of the array will be filled in by the
values of the fields named in the array referenced by I<names>. Any
nonexistent names will result in undefined array elements.

=item B()

Return a hash of the current data record's fields.

=back
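For example, a record can be built up by hand and then serialized in the
chosen output format (a sketch; it assumes that I<addField>() and
I<setFieldValue>() are forwarded by the top-level object, and that
I<assembleRecordFromHash>() with no argument uses the current data record,
as the change log suggests; the values are made up):

    my $tf = new TabularFormats("CSV");
    $tf->addField("id");
    $tf->addField("name");

    $tf->setFieldValue("id",   "A37");
    $tf->setFieldValue("name", "Pat Example");
    print $tf->assembleRecordFromHash() . "\n";  # something like: A37,Pat Example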
=for nobody ===================================================================

=for nobody ===================================================================

=head1 Related commands

=over

=item * C and C are basic wrappers on top of this, that just convert from
one form to another. The names are historical.

=item * C -- a somewhat similar conversion, but specialized for Penn
TreeBank files, which are kind of like SEXP but contain many other embedded
syntaxes, which this script also converts.

=item * C -- take a file and measure all the fields, then space-pad them so
they line up nicely. Can also do box-drawing in ASCII or Unicode.

=item * C<XmlTuples.pm> -- support for the XSV format.

=item * C<RecordFile.pm> -- provides record-oriented i/o, with cached
offsets. Looks basically like a file, but handles logical rather than
physical records.

=item * C, C -- simple parsers for XML. Much like CPAN's C, but more
forgiving of errors, and thus not fully-conforming XML parsers. C also
supports some extra minimization conventions, especially for tables.

=item * Some sjd utilities that use C<TabularFormats>: C, C, C, C, C, C, C,
C, C, C, C, C, C, C.

=item * Some OA utilities that use C<TabularFormats>: C, C, C, C, C, C.

=item * Some utilities that may not give access to all TF options yet: C*,
C*, C* (unfinished).

=back

=for nobody ===================================================================

=head1 Known bugs and limitations

See also C<TFormatSupport.pm>.

=over

=item * Not safe against UTF-8 encoding errors. Use C if needed.

=item * Leading spaces on records are not reliably stripped.

=item * Particular formats set their own values for the I option. This means
you can't override it until after calling I, which is annoying.

=item * The I option is supported for JSON, Perl, XML, and XSV. For some
other formats it is not clear how to escape non-ASCII characters. ARFF
appears to provide no way at all. MIME headers use I<quoted-printable> form,
but support for full Unicode is not yet finished.

=item * The behavior if using regexes rather than strings for I, I, I, etc.,
for CSVs is undefined. Most likely it will work ok for input, but not for
output.

=item * Support for decoding HTML entity references is implemented but
commented out; to use it, uncomment things starting C<HTML::Entities> and
install the eponymous CPAN package.

=item * Datatype checking is experimental.

=item * The behavior if a given field is found more than once in an input
record is undefined. This is only possible with some formats (essentially
those that identify fields by name, not position). Some options may be added
for this, perhaps taking the first or last, or concatenating them with some
separator, or serializing them somehow.

=back

=head1 Ownership

This work by Steven J. DeRose is licensed under a Creative Commons
Attribution-Share Alike 3.0 Unported License. For further information on
this license, see L<http://creativecommons.org/licenses/by-sa/3.0/>.

The author's present email is sderose at acm.org.

For the most recent version, see L.

=cut