I have downloaded the indexes generated for Maven Central from http://mirrors.ibiblio.org/pub/mirrors/maven2/dot-index/nexus-maven-repository-index.gz
I would like to list the artifacts information from these index files (groupId, artifactId, version for example). I have read that there is a high level API for that. It seems that I have to use the following maven dependency. However, I don't know what is the entry point to use (which class?) and how to use it to access those files:
<dependency>
<groupId>org.sonatype.nexus</groupId>
<artifactId>nexus-indexer</artifactId>
<version>3.0.4</version>
</dependency>
Take a peek at https://github.com/cstamas/maven-indexer-examples project.
In short: you dont need to download the GZ/ZIP (new/legacy format) manually, it will indexer take care of doing it for you (moreover, it will handle incremental updates for you too, if possible).
GZ is the "new" format, independent of Lucene index-format (hence, independent of Lucene version) containing data only, while the ZIP is "old" format, which is actually plain Lucene 2.4.x index zipped up. No data content change happens currently, but is planned in future.
As I said, there is no data content difference between two, but some fields (like you noticed) are Indexed but not stored on index, hence, if you consume the ZIP format, you will have them searchable, but not retrievable.
The https://github.com/cstamas/maven-indexer-examples is obsolete. And the build fails (tests do not pass).
The Nexus Indexer has moved along and included the examples too:
https://github.com/apache/maven-indexer/tree/master/indexer-examples
That builds, and the code works.
Here is a simplified version if you want to roll your own:
Maven:
<dependencies>
<dependency>
<groupId>org.apache.maven.indexer</groupId>
<artifactId>indexer-core</artifactId>
<version>6.0-SNAPSHOT</version>
<scope>compile</scope>
</dependency>
<!-- For ResourceFetcher implementation, if used -->
<dependency>
<groupId>org.apache.maven.wagon</groupId>
<artifactId>wagon-http-lightweight</artifactId>
<version>2.3</version>
<scope>compile</scope>
</dependency>
<!-- Runtime: DI, but using Plexus Shim as we use Wagon -->
<dependency>
<groupId>org.eclipse.sisu</groupId>
<artifactId>org.eclipse.sisu.plexus</artifactId>
<version>0.2.1</version>
</dependency>
<dependency>
<groupId>org.sonatype.sisu</groupId>
<artifactId>sisu-guice</artifactId>
<version>3.2.4</version>
</dependency>
Java:
public IndexToGavMappingConverter(File dataDir, String id, String url)
throws PlexusContainerException, ComponentLookupException, IOException
{
this.dataDir = dataDir;
// Create Plexus container, the Maven default IoC container.
final DefaultContainerConfiguration config = new DefaultContainerConfiguration();
config.setClassPathScanning( PlexusConstants.SCANNING_INDEX );
this.plexusContainer = new DefaultPlexusContainer(config);
// Lookup the indexer components from plexus.
this.indexer = plexusContainer.lookup( Indexer.class );
this.indexUpdater = plexusContainer.lookup( IndexUpdater.class );
// Lookup wagon used to remotely fetch index.
this.httpWagon = plexusContainer.lookup( Wagon.class, "http" );
// Files where local cache is (if any) and Lucene Index should be located
this.centralLocalCache = new File( this.dataDir, id + "-cache" );
this.centralIndexDir = new File( this.dataDir, id + "-index" );
// Creators we want to use (search for fields it defines).
// See https://maven.apache.org/maven-indexer/indexer-core/apidocs/index.html?constant-values.html
List<IndexCreator> indexers = new ArrayList();
// https://maven.apache.org/maven-indexer/apidocs/org/apache/maven/index/creator/MinimalArtifactInfoIndexCreator.html
indexers.add( plexusContainer.lookup( IndexCreator.class, "min" ) );
// https://maven.apache.org/maven-indexer/apidocs/org/apache/maven/index/creator/JarFileContentsIndexCreator.html
//indexers.add( plexusContainer.lookup( IndexCreator.class, "jarContent" ) );
// https://maven.apache.org/maven-indexer/apidocs/org/apache/maven/index/creator/MavenPluginArtifactInfoIndexCreator.html
//indexers.add( plexusContainer.lookup( IndexCreator.class, "maven-plugin" ) );
// Create context for central repository index.
this.centralContext = this.indexer.createIndexingContext(
id + "Context", id, this.centralLocalCache, this.centralIndexDir,
url, null, true, true, indexers );
}
final IndexSearcher searcher = this.centralContext.acquireIndexSearcher();
try
{
final IndexReader ir = searcher.getIndexReader();
Bits liveDocs = MultiFields.getLiveDocs(ir);
for ( int i = 0; i < ir.maxDoc(); i++ )
{
if ( liveDocs == null || liveDocs.get( i ) )
{
final Document doc = ir.document( i );
final ArtifactInfo ai = IndexUtils.constructArtifactInfo( doc, this.centralContext );
if (ai == null)
continue;
if (ai.getSha1() == null)
continue;
if (ai.getSha1().length() != 40)
continue;
if ("javadoc".equals(ai.getClassifier()))
continue;
if ("sources".equals(ai.getClassifier()))
continue;
out.append(StringUtils.lowerCase(ai.getSha1())).append(' ');
out.append(ai.getGroupId()).append(":");
out.append(ai.getArtifactId()).append(":");
out.append(ai.getVersion()).append(":");
out.append(StringUtils.defaultString(ai.getClassifier()));
out.append('\n');
}
}
}
finally
{
this.centralContext.releaseIndexSearcher( searcher );
}
We use this in the Windup project - JBoss migration tool.
The legacy zip index is a simple lucene index. I was able to open it with Luke
and write some simple lucene code to dump out the headers of interest ("u" in this case)
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
public class Dumper {
public static void main(String[] args) throws Exception {
IndexSearcher searcher = new IndexSearcher("c:/PROJECTS/Test/index");
for (int i = 0; i < searcher.maxDoc(); i++) {
Document doc = searcher.doc(i);
String metadata = doc.get("u");
if (metadata != null) {
System.out.println(metadata);
}
}
}
}
Sample output ...
org.ioke|ioke-lang-lib|P-0.4.0-p11|NA
org.jboss.weld.archetypes|jboss-javaee6-webapp|1.0.1.CR2|sources|jar
org.jboss.weld.archetypes|jboss-javaee6-webapp|1.0.1.CR2|NA
org.nutz|nutz|1.b.37|javadoc|jar
org.nutz|nutz|1.b.37|sources|jar
org.nutz|nutz|1.b.37|NA
org.openengsb.wrapped|com.google.gdata|1.41.5.w1|NA
org.openengsb.wrapped|openengsb-wrapped-parent|6|NA
There may be better ways to achieve this though ...