I have downloaded the indexes generated for Maven Central from http://mirrors.ibiblio.org/pub/mirrors/maven2/dot-index/nexus-maven-repository-index.gz
I would like to list the artifacts information from these index files (groupId, artifactId, version for example). I have read that there is a high level API for that. It seems that I have to use the following maven dependency. However, I don't know what is the entry point to use (which class?) and how to use it to access those files:
<dependency>
<groupId>org.sonatype.nexus</groupId>
<artifactId>nexus-indexer</artifactId>
<version>3.0.4</version>
</dependency>
The legacy zip index is a simple lucene index. I was able to open it with Luke and write some simple lucene code to dump out the headers of interest ("u" in this case)
Sample output ...
There may be better ways to achieve this though ...
Take a peek at https://github.com/cstamas/maven-indexer-examples project.
In short: you dont need to download the GZ/ZIP (new/legacy format) manually, it will indexer take care of doing it for you (moreover, it will handle incremental updates for you too, if possible).
GZ is the "new" format, independent of Lucene index-format (hence, independent of Lucene version) containing data only, while the ZIP is "old" format, which is actually plain Lucene 2.4.x index zipped up. No data content change happens currently, but is planned in future.
As I said, there is no data content difference between two, but some fields (like you noticed) are Indexed but not stored on index, hence, if you consume the ZIP format, you will have them searchable, but not retrievable.
The https://github.com/cstamas/maven-indexer-examples is obsolete. And the build fails (tests do not pass).
The Nexus Indexer has moved along and included the examples too: https://github.com/apache/maven-indexer/tree/master/indexer-examples
That builds, and the code works.
Here is a simplified version if you want to roll your own:
Maven:
Java:
We use this in the Windup project - JBoss migration tool.