Wednesday, October 15, 2014

CXF becomes friends with Tika and Lucene

You may have been thinking for a while: wouldn't it be cool to get some experience with Apache Lucene and Apache Tika and enhance the JAX-RS services you work on along the way? Lucene and Tika are those cool projects people are talking about, but as it happens there has never been an opportunity to use them in your project...

Apache Lucene is a well-known project whose community keeps innovating, improving and optimizing the capabilities of its various text analyzers. Apache Tika is a cool project which can be used to get the metadata and content out of binary resources in formats such as PDF and ODT, with lots of other formats supported too. As a side note, Apache Tika is not only a cool project, it is also a very democratic project where everyone is welcomed from the get-go - the perfect project to start your Apache career with if you are thinking of getting involved in one of the Apache projects.

Now, a number of the services you have written may support uploads of binary resources; for example, you may have a JAX-RS server accepting multipart/form-data uploads.

As it happens, Lucene plus Tika is what one needs to analyze binary content easily and effectively. Tika gives you the metadata and the content, Lucene tokenizes it and helps you search over it. As such, you can let your users search for and download only those PDF or other binary resources which match the search query. It is something your users will appreciate.
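
To make the "tokenize it and search over it" idea a bit more concrete, here is a minimal plain-JDK sketch of a toy inverted index. It is not the Lucene or Tika API and it is not what the CXF demo does internally; the class name TinySearchIndex and its methods are made up for illustration, and the text passed to add() is simply assumed to be what a content extractor such as Tika would have returned for an uploaded file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: a hypothetical, plain-JDK stand-in for what Lucene
// does with real analyzers, scoring and on-disk storage.
public class TinySearchIndex {
    // term -> names of the documents containing that term
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // 'extractedText' is assumed to come from a content extractor such as Tika.
    public void add(String docName, String extractedText) {
        for (String term : tokenize(extractedText)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docName);
        }
    }

    // Return the documents that contain every term of the query (AND semantics).
    public Set<String> search(String query) {
        Set<String> result = null;
        for (String term : tokenize(query)) {
            Set<String> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinySearchIndex index = new TinySearchIndex();
        index.add("guide.pdf", "Apache CXF JAX-RS search guide");
        index.add("notes.odt", "Meeting notes about Apache Tika");
        System.out.println(index.search("apache"));      // [guide.pdf, notes.odt]
        System.out.println(index.search("apache tika")); // [notes.odt]
    }
}
```

The search returns document names rather than content, which mirrors the use case above: the user searches first and then downloads only the matching files. Lucene's analyzers do far more than this split-and-lowercase step, which is exactly why you would use Lucene rather than roll your own.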

CXF 3.1.0, which is under active development, offers utility support for working with Tika and Lucene. Andriy Redko worked on improving the integration with Lucene and introducing content extraction support with the help of Tika. It is all shown in a nice jax_rs/search demo which offers a Bootstrap UI for uploading, searching and downloading PDF and ODT files. The demo will be shipped in the CXF distribution.

Please start experimenting with the demo today (download the CXF 3.1.0-SNAPSHOT distribution), let us know what you think, and get your JAX-RS project to the next level.

You are also encouraged to experiment with Apache Solr, which offers an advanced search engine on top of Lucene and also utilizes Tika.

Enjoy!





