How does one find out how a DBF file or any file i

2019-06-08 16:07发布

问题:

I am attempting to pick apart a DBF file using the code in THIS AskTom post however I have no clue where I would even go to figure out how the files I'm wanting to pick apart are formatted? How was the original AskTom answer even produced? How do I figure out the dbf Header. How do I know where within the dbf file the data is stored or even how to pull out that data?

My goal is to work with the code provided and come up with a solution as others have done but I'm stuck at the very first part.

While researching this I found that there are many systems that use DBF files. ACCESS and MS Excel have the ability to import DBF files but none of those features worked when I attempted to import the DBF files from my Oracle Express installation. When in ACCESS there are options to choose from several different dbf file formats including dBASE III, dBASE IV, dBASE 5, and dBASE 7. I have no idea if Oracle is even any of these formats and wish I knew how to find out this information. Fox Pro appears to be the main user of dbf files regarding the formats I listed.

The code in that AskTom link is very long and isn't necessarily relevant to this question but just so you know where I'm coming from I've included it anyway. I can run the code but it just doesn't work. I'd love to fix it but I would need knowledge of Oracles DBF files or perhaps there is something else I am missing?

How to Read a DBase File in Java or Oracle PL/SQL

What is the best opensource dbf driver for java?

create or replace package dbase_fox as
  -- procedure to a load a table with records
  -- from a DBASE file.
  --
  -- Uses a BFILE to read binary data and dbms_sql
  -- to dynamically insert into any table you
  -- have insert on.
  --
  -- p_filename is the name of a file in that directory
  --            will be the name of the DBASE file
  -- p_colnames is an optional list of comma separated
  --            column names.  If not supplied, this pkg
  --            assumes the column names in the DBASE file
  --            are the same as the column names in the
  --            table
  -- p_rownum boolean that activates an autonum
  --          functionality such that a sequential
  --          numbered virtual column is created
  /* EXAMPLE USAGE
   | BEGIN
   |   
   |   dbase_fox.showtable(p_filename => 'CONTROL' 
   |                     , p_colnames => null
   |                     , p_rownum   => false);
   | 
   | END;
  */


  procedure loadtable(p_filename in varchar2
                    , p_colnames in varchar2 default null
                    , p_rownum in boolean default false);

  -- procedure to print (and not insert) what we find in
  -- the DBASE files (not the data, just the info
  -- from the dbase headers....)
  --
  -- p_filename is the name of a file in that directory
  --            will be the name of the DBASE file
  -- p_colnames is an optional list of comma separated
  --            column names.  If not supplied, this pkg
  --            assumes the column names in the DBASE file
  --            are the same as the column names in the
  --            table
  -- p_rownum boolean that activates an autonum
  --          functionality such that a sequential
  --          numbered virtual column is created
  procedure showtable(p_filename in varchar2
                    , p_colnames in varchar2 default null
                    , p_rownum in boolean default false);
end;

/

/* Package BODY */
    create or replace package body dbase_fox as
PREFIX     constant varchar2(32) default 'stage_';
CR         constant varchar(2)   default chr(13)||chr(10);
MEMODTYPE  constant varchar2(32) default 'varchar2(4000)';
ROWNUMNAME constant varchar2(32) default '"ROWNUM"';
FRAMESIZE  constant integer      default 1000;

addrownum  boolean := false;
colnames   varchar2(255) := '';
filename   varchar2(32) := '';
dbfbfile   bfile := null;
fptbfile   bfile := null;

DBF_HEADER_SIZE constant number default 32;
type dbf_header_type is record (
    version    varchar2(25) -- dBASE version number
   ,year       int          -- 1 byte int year, add to 1900
   ,month      int          -- 1 byte month
   ,day        int          -- 1 byte day
   ,no_records int          -- number of records in file, 4 byte int
   ,hdr_len    int          -- length of header, 2 byte int
   ,rec_len    int          -- number of bytes in record, 2 byte int
   ,no_fields  int          -- number of fields
);
dbf_header dbf_header_type := null;
subtype    dbf_header_data is raw(32);

DBF_FIELD_DESCRIPTOR_SIZE constant number default 32;
type dbf_field_descriptor_type is record (
    name     varchar2(11)
   ,type     char(1)
   ,length   int    -- 1 byte length
   ,decimals int    -- 1 byte scale
);
type dbf_field_descriptor_array is table of dbf_field_descriptor_type index by binary_integer;
subtype dbf_field_descriptor_data is raw(32);
dbf_field_descriptor dbf_field_descriptor_array;

type rowarray_type is table of dbms_sql.varchar2_table index by binary_integer;
rowarray rowarray_type;

subtype raw_type is raw(4000);
type rawarray_type is table of raw_type index by binary_integer;
rawarray rawarray_type;

loadcursor binary_integer;
mblocksize number := 0;

procedure get_header is
  l_data dbf_header_data;
begin
  l_data := dbms_lob.substr(dbfbfile, DBF_HEADER_SIZE, 1);
  dbf_header.version    := utl_raw.cast_to_binary_integer(utl_raw.substr(l_data, 1, 1));
  dbf_header.year       := 1900 + utl_raw.cast_to_binary_integer(utl_raw.substr( l_data, 2, 1));
  dbf_header.month      := utl_raw.cast_to_binary_integer(utl_raw.substr(l_data, 3, 1));
  dbf_header.day        := utl_raw.cast_to_binary_integer(utl_raw.substr(l_data, 4, 1));
  dbf_header.no_records := utl_raw.cast_to_binary_integer(utl_raw.substr(l_data, 5, 4),2);
  dbf_header.hdr_len    := utl_raw.cast_to_binary_integer(utl_raw.substr(l_data, 9, 2),2);
  dbf_header.rec_len    := utl_raw.cast_to_binary_integer(utl_raw.substr(l_data, 11, 2),2);
  dbf_header.no_fields  := trunc((dbf_header.hdr_len - DBF_HEADER_SIZE) / DBF_FIELD_DESCRIPTOR_SIZE);
end;

procedure get_header_fields is
  l_data dbf_field_descriptor_data;
begin
  for i in 1 .. dbf_header.no_fields loop
    l_data := dbms_lob.substr(dbfbfile, DBF_FIELD_DESCRIPTOR_SIZE, 1+DBF_HEADER_SIZE+(i-1)*DBF_FIELD_DESCRIPTOR_SIZE); -- starting past the header
    dbf_field_descriptor(i).name     := rtrim(utl_raw.cast_to_varchar2(utl_raw.substr(l_data,1,11)),chr(0));
    dbf_field_descriptor(i).type     := utl_raw.cast_to_varchar2(utl_raw.substr(l_data, 12, 1));
    dbf_field_descriptor(i).length   := utl_raw.cast_to_binary_integer(utl_raw.substr(l_data, 17, 1));
    dbf_field_descriptor(i).decimals := utl_raw.cast_to_binary_integer(utl_raw.substr(l_data,18,1));
  end loop;
end;

procedure show_field_header_columns is
begin
  dbms_output.put_line(CR||'Num'
             ||chr(9)||'Name       '
             ||chr(9)||'Type'
             ||chr(9)||'Length'
             ||chr(9)||'Decimals');
  dbms_output.put_line('==='
             ||chr(9)||'====       '
             ||chr(9)||'===='
             ||chr(9)||'======'
             ||chr(9)||'========');
end;

procedure show_header(p_file_length in integer) is
begin
  dbms_output.put_line(chr(9)||dbf_header.version
                     ||chr(9)||dbf_header.year
                     ||chr(9)||dbf_header.month
                     ||chr(9)||dbf_header.day
                     ||chr(9)||dbf_header.no_records
                     ||chr(9)||dbf_header.hdr_len
                     ||chr(9)||dbf_header.rec_len
                     ||chr(9)||dbf_header.no_fields
                     ||chr(9)||p_file_length);
end;

procedure show_fields is
begin
  for i in dbf_field_descriptor.first .. dbf_field_descriptor.last loop
    dbms_output.put_line(i
               ||chr(9)||rpad(dbf_field_descriptor(i).name,11,' ')
               ||chr(9)||dbf_field_descriptor(i).type
               ||chr(9)||dbf_field_descriptor(i).length
               ||chr(9)||dbf_field_descriptor(i).decimals);
  end loop;
end;

function build_insert return varchar2 is
  l_statement long;
begin
  l_statement := 'insert into ' || PREFIX || filename || ' (';

  if colnames is not null then
    l_statement := l_statement || colnames;
  else
    for i in dbf_field_descriptor.first .. dbf_field_descriptor.last loop
      if i <> 1 then
        l_statement := l_statement || ',';
      end if;
      l_statement := l_statement || '"'||  dbf_field_descriptor(i).name || '"';
    end loop;
    if addrownum then
      l_statement := l_statement || ',' || ROWNUMNAME;
    end if;
  end if;

  l_statement := l_statement || ') values (';

  for i in dbf_field_descriptor.first .. dbf_field_descriptor.last loop
    if i <> 1 then
      l_statement := l_statement || ',';
    end if;
    if dbf_field_descriptor(i).type = 'D' then
      l_statement := l_statement || 'to_date(:bv' || i || ',''yyyymmdd'' )';
    else
      l_statement := l_statement || ':bv' || i;
    end if;
  end loop;
  if addrownum then
    l_statement := l_statement || ',:bv' || (dbf_field_descriptor.last + 1);
  end if;
  l_statement := l_statement || ')';
  return l_statement;
end;

function build_create return varchar2 is
  l_statement long;
begin
  l_statement := 'create table ' || PREFIX || filename || ' (';

  for i in dbf_field_descriptor.first .. dbf_field_descriptor.last loop
    l_statement := l_statement || CR;
    if i <> dbf_field_descriptor.first then
      l_statement := l_statement ||',';
    else
      l_statement := l_statement ||' ';
    end if;
    l_statement := l_statement || '"'||  dbf_field_descriptor(i).name || '"'||chr(9);
    if dbf_field_descriptor(i).type = 'D' then
      l_statement := l_statement || 'date';
    elsif dbf_field_descriptor(i).type = 'F' then
      l_statement := l_statement || 'float';
    elsif dbf_field_descriptor(i).type = 'N' then
      if dbf_field_descriptor(i).decimals > 0 then
        l_statement := l_statement || 'number('||dbf_field_descriptor(i).length||','|| dbf_field_descriptor(i).decimals || ')';
      else
        l_statement := l_statement || 'number('||dbf_field_descriptor(i).length||')';
      end if;
    elsif dbf_field_descriptor(i).type = 'M' then
      l_statement := l_statement || MEMODTYPE;
    else
      l_statement := l_statement || 'varchar2(' || dbf_field_descriptor(i).length || ')';
    end if;
  end loop;
  if addrownum then
    l_statement := l_statement || CR || ',' || ROWNUMNAME || chr(9) || 'number';
  end if;
  l_statement := l_statement ||CR||');'||CR||'/';
  return l_statement;
end;

procedure show_header_columns is
begin
  dbms_output.put_line(CR||'DBASE File'
             ||chr(9)||'Version'
             ||chr(9)||'Year'
             ||chr(9)||'Month'
             ||chr(9)||'Day'
             ||chr(9)||'#Recs'
             ||chr(9)||'HdrLen'
             ||chr(9)||'RecLen'
             ||chr(9)||'#Fields'
             ||chr(9)||'Size');
  dbms_output.put_line('=========='
             ||chr(9)||'======='
             ||chr(9)||'===='
             ||chr(9)||'====='
             ||chr(9)||'==='
             ||chr(9)||'====='
             ||chr(9)||'======'
             ||chr(9)||'======'
             ||chr(9)||'======='
             ||chr(9)||'====');
end;

procedure loadtablerecord(i in number) is
  l_n      number;
  l_fblock number;
  l_data   raw_type;
begin
  l_data := dbms_lob.substr(dbfbfile,dbf_header.rec_len,2+DBF_HEADER_SIZE+dbf_header.no_fields*DBF_FIELD_DESCRIPTOR_SIZE+(i-1)*dbf_header.rec_len); -- starting past the header and field descriptors
  rawarray(0) := utl_raw.substr(l_data, 1, 1);
  l_n := 2;
  for j in 1 .. dbf_header.no_fields loop
    rawarray(j) := utl_raw.substr(l_data,l_n,dbf_field_descriptor(j).length);
    if dbf_field_descriptor(j).type = 'F' and rawarray(j) = '.' then
      rawarray(j) := null;
    elsif dbf_field_descriptor(j).type = 'M' then
      if dbms_lob.isopen(fptbfile) != 0 then
        l_fblock := nvl(utl_raw.cast_to_binary_integer(dbms_lob.substr(fptbfile, 4, to_number(trim(utl_raw.cast_to_varchar2(rawarray(j))))*mblocksize+5)),0);
        rawarray(j) := dbms_lob.substr(fptbfile, l_fblock, to_number(trim(utl_raw.cast_to_varchar2(rawarray(j))))*mblocksize+9);
      else
        dbms_output.put_line(filename || '.fpt not found');
      end if;
    end if;
    l_n := l_n + dbf_field_descriptor(j).length;
  end loop;
end;

procedure loadtablearray(p_cntarr in int) is
  l_bulkcnt number;
begin
  for j in 1 .. dbf_header.no_fields loop
    dbms_sql.bind_array(loadcursor, ':bv'||j, rowarray(j),1,p_cntarr);
  end loop;
  if addrownum then
    dbms_sql.bind_array(loadcursor, ':bv'||(dbf_header.no_fields+1), rowarray(dbf_header.no_fields+1),1,p_cntarr);
  end if;
  begin
    l_bulkcnt := dbms_sql.execute(loadcursor);
    --dbms_output.put_line('Bulk insert count ' || l_bulkcnt);
  exception
    when others then
      dbms_output.put_line('Bulk insert failed ' || sqlerrm);
      dbms_output.put_line(build_insert);
  end;
end;

procedure loadtablebulk is
  l_cntrow int default 0;
  l_cntarr int default 0;
begin
  loadcursor := dbms_sql.open_cursor;
  dbms_sql.parse(loadcursor, build_insert, dbms_sql.native);

  for i in 1 .. dbf_header.no_records loop
    loadtablerecord(i);
    if utl_raw.cast_to_varchar2(rawarray(0)) <> '*' then
      l_cntarr := l_cntarr + 1;
      for j in 1 .. dbf_header.no_fields loop
        rowarray(j)(l_cntarr) := trim(utl_raw.cast_to_varchar2(rawarray(j)));
      end loop;
      if addrownum then
        l_cntrow := l_cntrow + 1;
        rowarray((dbf_header.no_fields+1))(l_cntarr) := l_cntrow;
      end if;
      if l_cntarr >= FRAMESIZE then
        loadtablearray(l_cntarr);
        l_cntarr := 0;
      end if;
    end if;
  end loop;
  if l_cntarr > 0 then
    loadtablearray(l_cntarr);
  end if;

  dbms_sql.close_cursor(loadcursor);
exception
  when others then
    if dbms_sql.is_open(loadcursor) then
      dbms_sql.close_cursor(loadcursor);
    end if;
    dbms_output.put_line('loadtable failed for ' || filename);
    dbms_output.put_line('insert ' || build_insert);
end;

procedure open_dbf is
begin
  dbfbfile := bfilename('FILE_GET_DIR', filename || '.dbf');
  dbms_lob.fileopen(dbfbfile);
end;

procedure open_fpt is
begin
  fptbfile := bfilename('FILE_GET_DIR', filename || '.fpt');
  if dbms_lob.fileexists(fptbfile) != 0 then
    dbms_lob.fileopen(fptbfile);
  end if;
end;

procedure close_dbf is
begin
  if dbms_lob.isopen(dbfbfile) > 0 then
    dbms_lob.fileclose(dbfbfile);
  end if;
end;

procedure close_fpt is
begin
  if dbms_lob.isopen(fptbfile) > 0 then
    dbms_lob.fileclose(fptbfile);
  end if;
end;

procedure initialize is
  l_empty_dbf_field_descriptor dbf_field_descriptor_array;
  l_empty_rowarray rowarray_type;
  l_empty_rawarray rawarray_type;
begin
  dbfbfile := null;
  fptbfile := null;
  dbf_field_descriptor := l_empty_dbf_field_descriptor;
  dbf_header := null;
  rowarray := l_empty_rowarray;
  rawarray := l_empty_rawarray;
  loadcursor := 0;
  mblocksize := 0;
end;

procedure showtable(p_filename in varchar2, p_colnames in varchar2 default null, p_rownum in boolean default false) is

  errorAtLine NUMBER  := 0;

begin

  filename := p_filename;
  addrownum := p_rownum;
  colnames := p_colnames;  

  initialize;

    errorAtLine := 1;

  open_dbf;

    errorAtLine := 2;

  get_header;

    errorAtLine := 3;

  get_header_fields;

    errorAtLine := 4;

  show_header_columns;

    errorAtLine := 5;

  dbms_output.put(filename || '.dbf');

    errorAtLine := 6;

  show_header(dbms_lob.getlength(dbfbfile));

    errorAtLine := 7;

  show_field_header_columns;

    errorAtLine := 8;

  show_fields;

    errorAtLine := 9;

  dbms_output.put_line(CR||'Insert statement:');
  dbms_output.put_line(build_insert);

  dbms_output.put_line(CR||'Create statement:');
  dbms_output.put_line(build_create);

  close_dbf;
exception
  when others then
    close_dbf;
    dbms_output.put_line('Error At: ' || errorAtLine);
    raise;
end;

procedure loadtable(p_filename in varchar2, p_colnames in varchar2 default null, p_rownum in boolean default false) is
begin
  filename := p_filename;
  addrownum := p_rownum;
  colnames := p_colnames;

  initialize;

  open_dbf;
  open_fpt;

  if dbms_lob.isopen(fptbfile) > 0 then
    mblocksize := utl_raw.cast_to_binary_integer(dbms_lob.substr(fptbfile, 2, 7));
  else
    mblocksize := 0;
  end if;

  get_header;
  get_header_fields;

  loadtablebulk;

  close_dbf;
  close_fpt;
exception
  when others then
    close_dbf;
    close_fpt;
    raise;
end;

end;

回答1:

I had a similar problem and this is how I did it.

TL;DR: You'll need to use Apache Tika to parse DBase files. It converts the content into a XHTML table and returns it as a java.lang.String, which you can parse via a DOM or a SAX parser to get the data in the format you need. Here are some examples: https://tika.apache.org/1.20/examples.html

To start, add the following Maven dependency to your POM:

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.21</version>
</dependency>

Then initialize the parser:

Parser parser =  new DBFParser(); //Alternatively, you can use AutoDetectParser
ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler()); //This is tells the parser to produce an XHTML as an output.
parser.parse(dbaseInputStream, handler, new Metadata(), new ParseContext()); // Here, dbaseInputStream is a FileInputStream object for the DBase file.
String dbaseAsXhtml = handler.toString(); //This will have the content in XHTML format

Now, to convert the data into a more convenient format (in this case CSV), I did the following:

First, convert the whole String into a DOM object:

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document xhtmlDoc= builder.parse(new InputSource(new StringReader(xmlString.trim().replaceAll("\t", "")))); //I'm trimming out the tabs and whitespaces here, so that I don't have to dealt with them later

Now, to get the headers:

XPath xPath = XPathFactory.newInstance().newXPath();
NodeList tableHeader = (NodeList)xPath.evaluate("//table/thead/th", xhtmlDoc, XPathConstants.NODESET);

String [] headers = new String[tableHeader.getLength()];
for(int i = 0; i < tableHeader.getLength(); i++) {
    headers[i] = tableHeader.item(i).getTextContent();
}

Then the records:

XPath xPath = XPathFactory.newInstance().newXPath();
NodeList tableRecords = (NodeList)xPath.evaluate("//table/tbody/tr", xhtmlDoc, XPathConstants.NODESET);

List<String[]> records = new ArrayList<String[]>(tableRecords.getLength());

for(int i = 0; i < tableRecords.getLength(); i++) {
    NodeList recordNodes = tableRecords.item(i).getChildNodes();
    String[] record = new String[recordNodes.getLength()];
    for(int j = 0; j < recordNodes.getLength(); j++)
        record[j] = recordNodes.item(j).getTextContent();
        records.add(record);
    }

Finally, we put them together to form a CSV:

StringBuilder dbaseCsvStringBuilder = new StringBuilder(String.join(",", headers) + "\n");
for(String[] record : records)
        dbaseCsvStringBuilder.append(String.join(",", record) + "\n");
String csvString = dbaseCsvStringBuilder.toString();

Here's the complete source code: https://github.com/Debojit/DbaseTranslater/blob/master/src/main/java/nom/side/poc/file/dbf/DbaseReader.java