Tuesday, September 12, 2017

The Real Data Processing with Apache Beam and Tika

When we talk about data ingestion in big data streaming pipelines, it is fair to say that in the vast majority of cases the source data comes from files in CSV and other easy-to-parse text formats.

Things become more complex when the task is to read and parse files in a format such as PDF. One would need to create a reader/receiver capable of parsing the PDF files and feeding the content fragments (the regular text, the text found in the embedded attachments and the file metadata) into the processing pipelines. That was tricky to get right but you did it just fine.

The next morning you get a call from your team lead letting you know the customer actually needs the content ingested not only from the PDF files but also from files in a format you've never heard of before. You spend the rest of the week looking for a library which can parse such files, and when you finish writing the code involving that library's poorly documented API all you can think of is that the weekend has arrived just in time.

On Monday your new task is to ensure that the pipelines are initialized from the same network folder where the PDF files and the files in the other format will be dropped. You end up writing frontend reader code which reads the file, checks the extension, and then chooses a more specific reader.
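To make the pain concrete, here is a minimal sketch of what such frontend reader code tends to look like. Every name in it (FrontendReaderSketch, the stub readers) is made up for illustration, and the "readers" only report which parser would run rather than parse anything.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical frontend reader: every name here is invented for illustration.
final class FrontendReaderSketch {

    // Lower-case extension -> format-specific reader. Each "reader" is a stub
    // that only reports which parser would run; a real one would open the file.
    private static final Map<String, Function<String, String>> READERS = new HashMap<>();

    static {
        READERS.put("pdf", file -> "pdf-reader parses " + file);
        READERS.put("csv", file -> "csv-reader parses " + file);
    }

    // Returns the text after the last dot, lower-cased, or "" if there is none.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Picks the reader by extension and fails loudly for unknown formats.
    static String read(String fileName) {
        Function<String, String> reader = READERS.get(extensionOf(fileName));
        if (reader == null) {
            throw new IllegalArgumentException("No reader registered for " + fileName);
        }
        return reader.apply(fileName);
    }

    public static void main(String[] args) {
        System.out.println(read("report.PDF"));
    }
}
```

Every new format means another entry in the map and another hand-written reader behind it, and an extension is only a hint about what the file really contains.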

The next day, when you are told that Microsoft Excel and Word documents, which may or may not be zipped, will have to be parsed as well, you report back asking for a holiday...

I'm sure you already know I've been preparing you for a couple of pieces of good news.

The first one is the well-known fact that Apache Tika lets you write generic code which can collect data from a massive number of text, binary, image and video formats. One only has to prepare or update the dependencies and configuration to have the same code serving data from a variety of formats.

The other, and main, piece of news is that Apache Beam 2.2.0-SNAPSHOT now ships a new TikaIO module (thanks to my colleague JB for reviewing and merging the PR). With Apache Beam capable of running the pipelines on top of Spark, Flink and other runners, and Apache Tika taking care of the various file formats, you get a very flexible data streaming system.

Do give it a try, help to improve TikaIO with new PRs, and if you are really serious about supporting a variety of data formats in the pipelines, start planning on integrating it into your products :-)

Enjoy!



Wednesday, September 6, 2017

Mostly On Topic: CXF and Swagger Integration Keeps Getting Better

While thinking about a title for this post I thought the current title line, with the "Keeps Getting Better" finishing touch, might work well; I knew I had used a similar line before, and after looking through my posts I found it.

Oh dear. I'm transported back to 2008. I can see myself, 9 years younger, walking to the Iona Technologies office, completely wired on trying to stop the Jersey JAX-RS domination :-), spotting an ad for Christina Aguilera's latest album at the exit from the Lansdowne DART station and thinking it would be fun to try to blog about it and link to CXF. Welcome to the start of the [OT] series. I'm not sure now if I'm more surprised that it was actually me who wrote that post or that 9 years later I'm still here, talking about CXF :-).

Let me get back to the actual subject of this post. You know CXF started quite late with embracing Swagger, and I still get nervous whenever I remind myself that Swagger does not support 'matrix' parameters :-). But the Swagger team has made a massive effort through the years; my CXF hat is off to them.

I'm happy to say that Apache CXF now offers one of the best Swagger2 integrations around, at both the JSON-only and UI levels, and it just keeps getting better.

We talked recently with Dennis Kieselhorst, and one can now configure Swagger2Feature with an external properties file, which can be especially handy when this feature is auto-discovered.

Just at the last minute we resolved an issue reported by a CXF user to do with accessing Swagger UI via reverse proxies.

Finally, Freeman contributed a java2swagger Maven plugin.

Swagger 3 will be supported as soon as possible too.

Enjoy!

Thursday, August 31, 2017

Apache CXF 3.2.0 NIO Extension

In CXF 3.2.0 we have also introduced a server-side NIO extension which is based on the very first JAX-RS NIO API prototype done by Santiago Pericas-Geertsen. The client NIO API prototype was not ready but the server one was off to a promising start. It was implemented in CXF as soon as the long-awaited first 2.1 API jar got published to Maven.

However, once the JAX-RS 2.1 group finally resumed its work and started working on finalizing NIO API, the early NIO API was unfortunately dropped (IMHO it could've stayed as an entry point, 'easy' NIO API), while the new NIO API did not materialize primarily due to the time constraints of the JCP process.

The spec leads did all they could but the schedule was too tight for them to get it right. As sad as it was, they made the right decision: rather than do something in a hurry, better to do it right at some later stage...

It was easily the major omission from the final 2.1 API. How long will JAX-RS users have to wait until a new JAX-RS version is finalized and the new NIO API becomes available to them, given that it takes years for the major Java EE umbrella of various specs to be done?

In the meantime the engineering minds in the SpringBoot, RxJava and other teams will come up with some new brilliant ways of doing it. They will be not one but several steps ahead.

Which brings me to this point: if I were to offer a single piece of advice to the Java EE process designers, I'd recommend that they make sure new features can be easily added after the EE release date, with minor EE releases embracing these new features following soon, without waiting for N years. If that were an option then we could've seen a JAX-RS 2.2 NIO in, say, 6 months - just a dream at the moment, I know. The current mechanism, where EE users wait several years for new features, is out of sync with the competitive reality of the software industry and only works because of the great teams doing EE, the EE users' loyalty and the power of the term 'standard'.

Anyway, throwing away our own implementation of that NIO API prototype now gone from 2.1 API just because it immediately became the code supporting a non-standard feature was not a good idea.

It offers an easy link to the Servlet 3.1 NIO extensions from the JAX-RS code and offers real value. Thus the code stayed and is now available for CXF users to experiment with.

It's not very shiny but it will deliver. Seriously, if you need to have a massive InputStream copied to/from the HTTP connection with NIO and asynchronous callbacks involved, what else do you need but a simple and easy way to do it from the code ? Well, nothing can be simpler than this option for sure.

Worried a bit it is not a standard feature ? No, it is fine, doing it the CXF way is a standard :-)
  

JAX-RS 2.1 is Released

JAX-RS 2.1 (JSR 370) has finally been released and JAX-RS users can now look forward to experimenting with the new features very soon, with a number of final JAX-RS 2.1 implementations already available (such as Jersey) or nearly ready to be released.

Apache CXF 3.2.0 is about to be released, and all of the new JAX-RS 2.1 features have been implemented: reactive client API extensions, client/server Server Sent Events support, returning CompletableFuture from the resource methods and other minor improvements.

As part of the 2.1 work (but also based on a CXF JIRA request) we also introduced RxJava Observable and, more recently, RxJava2 Flowable/Observable client and server extensions. One can use them as an alternative to using CompletableFuture on the client and/or the server side. Note that the combination of RxJava2 Flowable with JAX-RS AsyncResponse on the server is quite cool.

The other new CXF extension which was introduced as part of the JAX-RS 2.1 work is the NIO extension; this will be the topic of the next post.

Pavel Bucek and Santiago Pericas-Geertsen were great JAX-RS 2.1 spec leads. Andriy Redko spent a lot of his time getting CXF 3.2.0 ready for JAX-RS 2.1.

Thursday, July 13, 2017

[OT] I Work with CXF and I Want It That Way

The time has come for a regular OT post.

The journey of the software developer is always about finding the home where he or she can enjoy being every day, can look forward to contributing to the bigger effort every day.

In addition to that the journey of the web services developer is always about finding the web services framework which will help with creating the coolest HTTP service on the Web. We all know there are many quality HTTP service frameworks around.

My software developer's journey so far has been mostly about supporting one such web services framework, Apache CXF. It has been a great journey.

Some of you helped by using and contributing to Apache CXF earlier, some of you are long term Apache CXF users and contributors, preparing the ground for the new users and contributors who are yet to discover CXF.

No matter which group you are in, even if you're no longer with CXF, I'm sure you've had that feeling at least once that you'd like your CXF experience to last forever :-).

Listen to a message from the best boy band in the world. Enjoy :-)

  


Monday, July 3, 2017

Multiple JWE Encryptions POC With Apache CXF in two hours

The summer has been great so far, and as usual, instead of watching yet another sports final, you've decided to catch up with your colleagues after work and do a new round of Apache CXF JOSE coding. Nice idea, they said.

The idea of creating an application which processes content encrypted for multiple recipients has captured your imagination.

After reviewing the CXF JWE JSON documentation you've decided to start with the following client code. This code creates a client proxy which posts some text.

The JWE JSON filter registered with the proxy will encrypt whatever content the proxy is sending (it does not have to be only text) only once, and the content encryption key (CEK) will be encrypted with the recipient-specific key encryption keys. Thus if you have 2 recipients then the CEK will be encrypted twice.

Registering jwejson1.properties and jwejson2.properties with the proxy tells the JWE JSON filter that a JWE JSON container for 2 recipients needs to be created, that the content encryption algorithm is A128GCM and the key encryption algorithm is A128KW, and that each recipient is using its own symmetric key encryption key. Each recipient-specific entry will also include a 'kid' key identifier of the key encryption key so that the service can figure out which JWE JSON entry is targeted at which recipient.
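If you'd like to see the idea with the JOSE packaging stripped away, here is a rough sketch using only the plain JDK crypto API, not the CXF JOSE API. It assumes the JDK's "AESWrap" cipher as a stand-in for A128KW and "AES/GCM/NoPadding" with a 128-bit key for A128GCM; the class and method names are made up, and the JWE protected header (which a real JWE feeds into GCM as additional authenticated data), the 'kid' values and the JSON container are all left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Illustration only: plain JCA, hypothetical names, no JWE headers or JSON.
final class MultiRecipientSketch {

    static SecretKey newAes128Key() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Encrypts the CEK with a recipient's key encryption key (AES Key Wrap).
    static byte[] wrapCek(SecretKey kek, SecretKey cek) {
        try {
            Cipher cipher = Cipher.getInstance("AESWrap");
            cipher.init(Cipher.WRAP_MODE, kek);
            return cipher.wrap(cek);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Recovers the CEK on the recipient side.
    static SecretKey unwrapCek(SecretKey kek, byte[] wrappedCek) {
        try {
            Cipher cipher = Cipher.getInstance("AESWrap");
            cipher.init(Cipher.UNWRAP_MODE, kek);
            return (SecretKey) cipher.unwrap(wrappedCek, "AES", Cipher.SECRET_KEY);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // AES-GCM with a 128-bit tag; mode is Cipher.ENCRYPT_MODE or DECRYPT_MODE.
    static byte[] gcm(int mode, SecretKey cek, byte[] iv, byte[] data) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(mode, cek, new GCMParameterSpec(128, iv));
            return cipher.doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SecretKey cek = newAes128Key();
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        // The content is encrypted exactly once...
        byte[] ciphertext = gcm(Cipher.ENCRYPT_MODE, cek, iv,
                "some text".getBytes(StandardCharsets.UTF_8));
        // ...while the CEK is wrapped once per recipient.
        SecretKey kek1 = newAes128Key();
        SecretKey kek2 = newAes128Key();
        byte[] wrappedForRecipient1 = wrapCek(kek1, cek);
        byte[] wrappedForRecipient2 = wrapCek(kek2, cek);
        // Recipient 2 unwraps its own entry and decrypts the shared ciphertext.
        SecretKey recovered = unwrapCek(kek2, wrappedForRecipient2);
        byte[] plain = gcm(Cipher.DECRYPT_MODE, recovered, iv, ciphertext);
        System.out.println(new String(plain, StandardCharsets.UTF_8));
    }
}
```

The payload can be as large as you like, yet adding a recipient costs only one more wrapped key entry, which is why the container scales to many recipients.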

Setting up the client took you all of one hour.

The next task was to prototype the service code. That was even easier. Loading the recipient-specific properties, locating the recipient-specific entry and getting the decrypted content was all that was needed.

Two hours in total. Note that I did not promise it would take you 30 minutes to do the whole POC; that would have been child's play, which is not realistic. A two-hour project is more complex, yet it still felt like a walk in the park :-)



 

Friday, June 16, 2017

How to do JOSE in Apache CXF service code

This blog entry continues the series started with the introduction to the Apache CXF JOSE implementation, followed recently by the post about the signing of HTTP attachments.

CXF ships JOSE filters which can protect the application data by wrapping it into JOSE JWS or JWE envelopes, or verify that the data has been properly encrypted and/or signed. In these cases the application code is not even aware that the JOSE processors are involved.

How would one approach the task of signing/verifying and/or encrypting/decrypting the data directly in the application code ? For example, what if an individual property of the bigger payload needs to be JOSE protected ?

The most obvious approach is to use either CXF JOSE or your preferred 3rd party library to deal with the JOSE primitives in the application code. This is Option 1. It is a must if one needs closer control over the JOSE envelope creation process.

Or you can do nearly nothing at all and let CXF handle it for you. This is Option 2. This is the CXF Way option: make it as easy as possible for users to embrace advanced technologies fast. It is not only about making it easy, though; it is also about having more flexible and even portable JOSE-aware code.

In this case requirements such as "sign only", "encrypt only" or "sign and encrypt", and similarly for "verify/decrypt", are not encoded in the code; they are managed when the JOSE helpers are configured from the application contexts (by default they only sign/verify).

Likewise, the signature and encryption algorithm and key properties are controlled externally.

I know, it is hard to believe that it can be so easy. Try it to believe it. Enjoy !



Tuesday, May 23, 2017

Signing HTTP Attachments with Apache CXF JOSE

JOSE, the primary mechanism for securing various OAuth2/OIDC tokens, slowly but surely is becoming the main technology for securing the data in the wider contexts. JOSE, alongside COSE, will become more and more visible going forward.

I talked about Apache CXF JOSE implementation in this post. One of the practical aspects of this implementation is that one can apply JOSE to securing the regular HTTP payloads, with the best attempt at keeping the streaming going made by the sender side filters, with the JOSE protection of these payloads (JWS signature or JWE encryption) being able to 'stay' with the data even beyond the HTTP request-response time if needed.

In CXF 3.1.12 I have enhanced this feature to support the signing of HTTP attachments. It depends on the JWS Detached Content and Unencoded Content features, which make it possible to integrity-protect a payload which can continue flowing to its destination in clear form.
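For those who have not met these two JWS features before, here is a rough sketch of what they boil down to, written against the plain JDK rather than the CXF JOSE API. It assumes an HS256 signature and the RFC 7797 'b64' header; the class and method names are made up. The payload bytes are signed as they are (not base64url-encoded), and the resulting compact JWS has an empty middle part, so the payload itself travels separately and untouched.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Illustration only: plain JDK, hypothetical names, HS256 assumed.
final class DetachedJwsSketch {

    // Protected header announcing an unencoded payload (RFC 7797).
    static final String HEADER = "{\"alg\":\"HS256\",\"b64\":false,\"crit\":[\"b64\"]}";

    static String base64Url(byte[] bytes) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    // Signing input is BASE64URL(header) + "." + the raw payload bytes.
    static byte[] hmac(byte[] key, String encodedHeader, byte[] payload) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            mac.update((encodedHeader + ".").getBytes(StandardCharsets.US_ASCII));
            mac.update(payload);
            return mac.doFinal();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns "header..signature": the payload part is left empty (detached).
    static String sign(byte[] key, byte[] payload) {
        String encodedHeader = base64Url(HEADER.getBytes(StandardCharsets.UTF_8));
        return encodedHeader + ".." + base64Url(hmac(key, encodedHeader, payload));
    }

    // Checks the detached signature against a payload received in clear form.
    static boolean verify(byte[] key, byte[] payload, String detachedJws) {
        String[] parts = detachedJws.split("\\.", -1);
        String expectedHeader = base64Url(HEADER.getBytes(StandardCharsets.UTF_8));
        if (parts.length != 3 || !parts[1].isEmpty() || !parts[0].equals(expectedHeader)) {
            return false;
        }
        byte[] actual;
        try {
            actual = Base64.getUrlDecoder().decode(parts[2]);
        } catch (IllegalArgumentException e) {
            return false;
        }
        // Constant-time comparison of the expected and received signatures.
        return MessageDigest.isEqual(hmac(key, parts[0], payload), actual);
    }

    public static void main(String[] args) {
        byte[] key = new byte[32]; // all-zero demo key, never use in practice
        byte[] attachment = "attachment bytes".getBytes(StandardCharsets.UTF_8);
        String jws = sign(key, attachment);
        System.out.println(verify(key, attachment, jws));
    }
}
```

Since the signature is computed over the raw bytes, the sender can feed the attachment stream into the signature as it is written out, and the receiver can do the same while reading it.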

Combining it with the super-flexible mechanism for processing attachments in Apache CXF, and particularly with the newly introduced Multipart filters which let one pre-process individual multipart attachment streams, helped produce the final solution.

Besides, as part of this effort, the optional binding of the outer HTTP headers to the secure JWS or JWE payloads has also been realized.

Be among the first to experiment with this IMHO very cool feature: try it and provide feedback. Enjoy!


Thursday, May 18, 2017

Distributed Tracing with CXF: New Features

As you may already know, Apache CXF has been offering simple but effective support for tracing CXF client and server calls with HTrace since 2015.

What is interesting about this feature is that it was done after the DevMind attended ApacheCon NA 2015 and got inspired to integrate CXF with HTrace.

You'll be glad to know this feature has now been enhanced to propagate the trace details to the logs, which is the least intrusive way of working with HTrace. Should you need more advanced control, CXF will help; see this section for example.

CXF has also been integrated with Brave. That should work better for CXF OSGi users. The integration work with Brave 4 is under way now.