Apache OpenNLP and ONNX Models

I got started using the Apache OpenNLP project some time around 2009. I had a large amount of unstructured text that I wanted to process, and I didn't know how. As a Java programmer, I found that Apache OpenNLP provided the tools I needed to make that text usable. Back then, the NLP options were largely limited to Apache OpenNLP and Stanford NLP, and the latter's GPL license limited its commercial use.

Around 2015, following the introduction of Word2vec, the NLP community shifted almost entirely to Python (and for good reason). That shift didn't help the Java ecosystem, where NLP needs are still present. Since then, nearly all NLP development has happened in the Python ecosystem, leaving Java engineers with options such as calling Python-based NLP services over RPC, often to containerized NLP processes.

With the introduction of ONNX and the ONNX Runtime, we now have a common format and runtime for machine learning models. This means, in theory, that we can train an NLP model in the Python ecosystem, convert it to an ONNX model, and use it from any of the programming languages supported by the ONNX Runtime, and that includes Java. This is very promising and exciting! We may be able to replace our RPC calls with regular Java code that uses the models directly.

I recently implemented support for the ONNX Runtime in Apache OpenNLP and opened a pull request. The pull request adds person named-entity recognition and document classification backed by ONNX models, implemented behind the existing OpenNLP interfaces. You just need to add a dependency on the ONNX Runtime, an ONNX model, and the vocabulary file for the model. Here's an example for document classification:


import java.io.File;
import java.util.Arrays;

// The exported ONNX model and its vocabulary file.
final File model = new File("model.onnx");
final File vocab = new File("vocab.txt");

// getCategories() maps the model's output indices to category labels.
final DocumentCategorizerDL documentCategorizerDL =
        new DocumentCategorizerDL(model, vocab, getCategories());

// Categorize the input text; the result holds one score per category.
final double[] result = documentCategorizerDL.categorize(new String[]{"I am happy"});
System.out.println(Arrays.toString(result));

We provide the locations of our ONNX model file and vocabulary file. Next, we instantiate the document categorizer using those files. Then we call the categorize() function with the input text. The result is an array of scores, one per category.
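
Because the categorizer sits behind OpenNLP's existing DocumentCategorizer interface, you should also be able to pull out the top label directly. Here is a minimal sketch, assuming getBestCategory() behaves here as it does in the interface's other implementations:

// Sketch only: getBestCategory() comes from OpenNLP's DocumentCategorizer
// interface and returns the label with the highest score.
final String best = documentCategorizerDL.getBestCategory(result);
System.out.println("Best category: " + best);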

You can try it out with the nlptown/bert-base-multilingual-uncased-sentiment model available on Hugging Face. You just need to convert the model to ONNX and download its vocab.txt file. Here's how to do the ONNX conversion:


python -m transformers.onnx -m nlptown/bert-base-multilingual-uncased-sentiment --feature sequence-classification exported

This exports the model to ONNX in a directory named exported. We can now reference the exported model and the model's vocab.txt file in the code snippet above! (For more examples, check out the unit tests in the pull request.)
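
One thing the snippet above does not show is the getCategories() helper. The nlptown model predicts a star rating from 1 to 5, so a minimal sketch of that helper, assuming the constructor takes a map from the model's output indices to labels, could look like this:

import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: maps the nlptown model's five outputs to readable labels.
// Adjust the mapping to match whatever model you exported.
private Map<Integer, String> getCategories() {
    final Map<Integer, String> categories = new HashMap<>();
    categories.put(0, "1 star");
    categories.put(1, "2 stars");
    categories.put(2, "3 stars");
    categories.put(3, "4 stars");
    categories.put(4, "5 stars");
    return categories;
}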

So what did we just do? We took a document classification (sentiment) model that was built in the Python ecosystem and published to the Hugging Face model hub. We converted the model to ONNX, and then we used it directly from Java using the ONNX Runtime.
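
The pull request also covers person named-entity recognition using the same pattern. The exact API is in the pull request, but a rough sketch, assuming a NameFinderDL class that takes the model file, the vocab file, and a map from output ids to labels, and that implements OpenNLP's TokenNameFinder interface, might look something like this:

// Rough sketch; the class name, constructor arguments, and getIds2Labels()
// helper are assumptions, so check the pull request for the actual API.
final File nerModel = new File("namefinder.onnx");
final File nerVocab = new File("vocab.txt");

final NameFinderDL nameFinder = new NameFinderDL(nerModel, nerVocab, getIds2Labels());

// find() is OpenNLP's TokenNameFinder method; it returns Spans marking the
// entities found in the tokenized input.
final Span[] spans = nameFinder.find(new String[]{"George", "Washington", "was", "president"});
System.out.println(Arrays.toString(spans));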

I’m very excited to be able to provide this new capability in Apache OpenNLP. Look for implementations of the other OpenNLP interfaces in the future along with an Apache OpenNLP 2.0 release.
