While implementing a Solr indexer, I ran into the problem of how to normalize dates into the format Solr expects. Surprisingly, this question is asked quite often on Stack Overflow, so I decided to write a Solr plugin in the hope of solving it. At the end of the day, it turned out that this approach does not actually solve my problem because of a Solr peculiarity. Nonetheless, I hope this write-up helps anyone who is trying to develop a Solr plugin, since most examples are outdated. I will also explain why it doesn't work (in my case).

I'm working with Solr 4.4. Creating the plugin takes 2 + 2 steps, two to write the code and two to deploy it:

  1. Create the plugin factory class
  2. Create the actual plugin implementation, which the factory from (1) instantiates
  3. Create a .jar file and upload it to Solr home
  4. Register the plugin in schema.xml

Factory

package org.shulhi.solr.analyzers;

/**
 * Created by Shulhi on 12/30/13.
 */

import org.apache.lucene.analysis.util.TokenFilterFactory;  
import org.apache.lucene.analysis.TokenStream;

import java.util.Map;

public class IsoDateFilterFactory extends TokenFilterFactory {  
    public IsoDateFilterFactory(Map<String, String> args) {
        super(args);
    }

    @Override
    public TokenStream create(TokenStream tokenStream) {
        return new IsoDateFilter(tokenStream);
    }
}

Implementation

package org.shulhi.solr.analyzers;

/**
 * Created by Shulhi on 12/30/13.
 */

import org.apache.lucene.analysis.TokenFilter;  
import org.apache.lucene.analysis.TokenStream;  
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.IOException;  
import java.text.DateFormat;  
import java.text.ParseException;  
import java.text.ParsePosition;  
import java.text.SimpleDateFormat;  
import java.util.Date;  
import java.util.Locale;  
import java.util.TimeZone;

public final class IsoDateFilter extends TokenFilter {  
    private CharTermAttribute charTermAttr;

    public IsoDateFilter(TokenStream tokenStream) {
        super(tokenStream);
        this.charTermAttr = addAttribute(CharTermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if(!input.incrementToken()) {
            return false;
        }

        String parsed;
        try {
            parsed = formatizeIsoDate(charTermAttr.toString());
        } catch (ParseException e) {
            // Not a parsable date: pass the token through unchanged
            // instead of hitting a NullPointerException further down.
            return true;
        }

        // Replace the term text with the normalized date.
        charTermAttr.setEmpty().append(parsed);

        return true;
    }

    private String formatizeIsoDate(String term) throws ParseException {
        DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSSZ"){
            public Date parse(String source,ParsePosition pos) {
                return super.parse(source.replaceFirst(":(?=[0-9]{2}$)",""),pos);
            }
        };

        Date parsed = df.parse(term);

        // mm is minutes; MM would print the month in the minutes position.
        DateFormat outFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ENGLISH);
        outFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
        String utc = outFormat.format(parsed);

        return utc;
    }
}
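
To check the date handling without deploying anything to Solr, the conversion can be pulled out into a plain Java class. This is only a minimal sketch: the class name IsoDateNormalizerSketch and the method normalize are hypothetical and not part of the plugin, and it assumes input shaped like 2013-12-30T10:15:30.000+08:00. Unlike the filter, it throws an IllegalArgumentException on unparsable input so the failure is easy to see.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper class for experimenting; not part of the plugin.
final class IsoDateNormalizerSketch {

    static String normalize(String term) {
        // Drop the colon in a trailing offset such as +08:00 so that the
        // "Z" pattern letter (which expects +0800) can parse it.
        String input = term.replaceFirst(":(?=[0-9]{2}$)", "");

        SimpleDateFormat in =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSSZ", Locale.ENGLISH);
        SimpleDateFormat out =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ENGLISH);
        out.setTimeZone(TimeZone.getTimeZone("UTC"));

        try {
            Date parsed = in.parse(input);
            return out.format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a parsable date: " + term, e);
        }
    }

    public static void main(String[] args) {
        // prints 2013-12-30T02:15:30Z
        System.out.println(normalize("2013-12-30T10:15:30.000+08:00"));
    }
}
```

The output has the shape Solr date fields expect: UTC, second precision, and a literal Z suffix.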

Creating .jar file

At first I was stuck on where to place the jar file. I tried placing it inside Solr's home directory as suggested, and also tried loading it through solrconfig.xml, but neither worked for me. So I ended up placing it in /var/lib/solr/lib/, so that whenever Solr loads its dependencies, it also loads my jar file.

Updating schema.xml

<fieldType name="ztdate" class="solr.TextField">  
  <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="org.shulhi.solr.analyzers.IsoDateFilterFactory"/>
  </analyzer>
</fieldType>  

This is how you register your plugin for use in Solr. You only need to specify your factory class; Solr then asks the factory to create the filter.

Why it doesn't work in my case

I tested the plugin and it works well on its own, but it doesn't solve my problem because of how copyField works in Solr.

<copyField source="isodate_t" dest="ztisodate"/>  
<copyField source="ztisodate" dest="trieisodate"/>  

My objective is something like this:

{unformatted date} -> isodate_t -> ztisodate -> trieisodate

The unformatted date is indexed into the isodate_t field. It is then copied to ztisodate, where my analyzer parses the date and reformats it to Solr's format, and the result is copied to trieisodate.

It doesn't work because copyField copies a field's original input rather than its analyzed output. So when I try to copy the result of ztisodate to trieisodate, it actually copies the raw isodate_t value to trieisodate. Face palm, wasted effort. But I'm glad I learned something from this.
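
The behaviour described above can be illustrated with a toy model. This is not Solr code: CopyFieldToyModel, copyField, analyze and index are hypothetical names, and the "normalized(...)" wrapper merely stands in for the IsoDateFilter. The sketch assumes, as described in this post, that copying happens on raw values and that analysis only produces the indexed terms of the field it is attached to.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Toy model only: none of these names exist in Solr.
final class CopyFieldToyModel {

    // Copies the raw (pre-analysis) value of source into dest.
    static void copyField(Map<String, String> rawDoc, String source, String dest) {
        String raw = rawDoc.get(source);
        if (raw != null) {
            rawDoc.put(dest, raw);
        }
    }

    // Runs each field's analyzer over its raw value. The output only becomes
    // that field's indexed terms; it is never written back into rawDoc.
    static Map<String, String> analyze(Map<String, String> rawDoc,
                                       Map<String, UnaryOperator<String>> analyzers) {
        Map<String, String> indexed = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : rawDoc.entrySet()) {
            UnaryOperator<String> analyzer =
                    analyzers.getOrDefault(field.getKey(), UnaryOperator.identity());
            indexed.put(field.getKey(), analyzer.apply(field.getValue()));
        }
        return indexed;
    }

    static Map<String, String> index(String rawDate) {
        Map<String, String> rawDoc = new LinkedHashMap<>();
        rawDoc.put("isodate_t", rawDate);

        // Copies are applied to raw values, before any analysis runs.
        copyField(rawDoc, "isodate_t", "ztisodate");
        copyField(rawDoc, "ztisodate", "trieisodate");

        // Stand-in for the IsoDateFilter attached to ztisodate.
        Map<String, UnaryOperator<String>> analyzers = new LinkedHashMap<>();
        analyzers.put("ztisodate", value -> "normalized(" + value + ")");

        return analyze(rawDoc, analyzers);
    }

    public static void main(String[] args) {
        System.out.println(index("2013-12-30T10:15:30.000+08:00"));
    }
}
```

In this model only ztisodate ever sees the normalized value; trieisodate ends up with the same raw string that went into isodate_t, which is exactly the problem I ran into.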