Dear users, I am working with Apache Lucene for indexing and searching. I have to index HTML files stored on the computer's local disk, on both the file name and the contents of each file. I am able to store the file names in the Lucene index, but not the HTML file contents. The index should cover not only the text but the entire page, including images, links, and URLs. How can I access the contents of the indexed files afterwards? For indexing I am using the following code:
    File indexDir = new File(indexpath);
    File dataDir = new File(datapath);
    String suffix = ".htm";
    IndexWriter indexWriter = new IndexWriter(
            /* index location: argument missing from the post */,
            new SimpleAnalyzer()
            /* remaining arguments missing from the post */);
    indexDirectory(indexWriter, dataDir, suffix);
    numIndexed = indexWriter.maxDoc();
    private void indexDirectory(IndexWriter indexWriter, File dataDir, String suffix) throws IOException {
        try {
            for (File f : dataDir.listFiles()) {
                if (f.isDirectory()) {
                    indexDirectory(indexWriter, f, suffix);
                } else {
                    indexFileWithIndexWriter(indexWriter, f, suffix);
                }
            }
        } catch (Exception ex) {
            System.out.println("exception 2 is" + ex);
        }
    }
    private void indexFileWithIndexWriter(IndexWriter indexWriter, File f,
            String suffix) throws IOException {
        try {
            if (f.isHidden() || f.isDirectory() || !f.canRead() || !f.exists()) {
                return; // skip files that cannot be indexed
            }
            if (suffix != null && !f.getName().endsWith(suffix)) {
                return; // skip files without the wanted suffix
            }
            Document doc = new Document();
            doc.add(new Field("contents", new FileReader(f)));
            doc.add(new Field("filename", f.getName(),
                    Field.Store.YES, Field.Index.ANALYZED));
            indexWriter.addDocument(doc);
        } catch (Exception ex) {
            System.out.println("exception 4 is" + ex);
        }
    }
Thanks in advance.
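This line of code is the reason why your contents are not being stored:
    doc.add(new Field("contents", new FileReader(f)));
The Field constructor that takes a Reader DOES NOT STORE the contents being indexed. The text is tokenized and made searchable, but it cannot be read back from the index afterwards.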
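One way around this is to read the file into a String first and add that String as a stored field. The helper below is only a sketch of the reading step using the standard library; the class name `HtmlFileText` and the explicit charset parameter are assumptions for illustration, not part of the original code.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper: loads a whole file into memory as text.
class HtmlFileText {
    /** Reads the entire file into a String using the given charset. */
    static String readAll(File f, Charset cs) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(new FileInputStream(f), cs)) {
            char[] buf = new char[4096];
            int n;
            // read() returns -1 at end of stream
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

Assuming the same Lucene 2.9/3.x-style Field API that the code in the question already uses for the filename, the result could then be added with doc.add(new Field("contents", text, Field.Store.YES, Field.Index.ANALYZED)); and read back from a search hit with document.get("contents"). Storing whole pages makes the index larger, so this is a trade-off rather than a free fix.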
If you are trying to index HTML files, try using JTidy. It will make the process much easier.
Sample code:
To get an InputStream from a URL:
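The code for this step is missing from the post, so the following is only a minimal sketch using the standard `java.net.URL` class. The class name `UrlStreams` and the `address` parameter are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

// Hypothetical helper: opens a stream on whatever the URL points to.
class UrlStreams {
    /** Opens an InputStream for the given URL string (http:, file:, ...). */
    static InputStream open(String address) throws IOException {
        // The caller is responsible for closing the returned stream.
        return new URL(address).openStream();
    }
}
```

A stream obtained this way could then be handed to JTidy, for example through its Tidy.parseDOM(InputStream, OutputStream) method, to get a DOM tree from which text, links, and image URLs can be collected.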
To get an InputStream from a File:
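The code for this step is also missing from the post. A minimal sketch with the standard library might look like this; the class name `FileStreams` is hypothetical, and the buffering wrapper is an optional choice rather than something the answer prescribes.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;

// Hypothetical helper: opens a buffered stream on a local file.
class FileStreams {
    /** Opens a buffered InputStream for a file on the local disk. */
    static InputStream open(File f) throws FileNotFoundException {
        // The caller is responsible for closing the returned stream.
        return new BufferedInputStream(new FileInputStream(f));
    }
}
```

For the directory walk in the question, this would be called once per file, for example FileStreams.open(f) inside indexFileWithIndexWriter, before parsing the page.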