Author: jzemerick

Introducing ngramdb

ngramdb is a distributed store for n-grams (or bags of words) organized under contexts. Its REST interface lets you insert n-grams, run “starts with” and “top” queries, and calculate similarity metrics between contexts. Apache Ignite provides the distributed, highly available persistence and powers the queries. ngramdb is open […]
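
To make the query types above concrete, here is a minimal in-memory sketch. It is not ngramdb's code or its REST API: the `NgramStore` class, its method names, and the choice of Jaccard similarity as the metric are all assumptions made for illustration, and nothing here involves Apache Ignite.

```python
from collections import Counter, defaultdict


class NgramStore:
    """Toy, single-process stand-in for n-grams organized under contexts.

    Hypothetical illustration only; ngramdb itself is distributed and is
    accessed over REST.
    """

    def __init__(self):
        # context name -> Counter mapping each n-gram to its count
        self._contexts = defaultdict(Counter)

    def insert(self, context, ngram, count=1):
        """Add `count` occurrences of an n-gram to a context."""
        self._contexts[context][ngram] += count

    def starts_with(self, context, prefix):
        """Return the n-grams in a context that begin with `prefix`, sorted."""
        return sorted(g for g in self._contexts[context] if g.startswith(prefix))

    def top(self, context, n):
        """Return the n most frequent n-grams in a context with their counts."""
        return self._contexts[context].most_common(n)

    def jaccard(self, a, b):
        """Jaccard similarity of two contexts' n-gram sets (assumed metric)."""
        set_a, set_b = set(self._contexts[a]), set(self._contexts[b])
        union = set_a | set_b
        return len(set_a & set_b) / len(union) if union else 0.0


store = NgramStore()
store.insert("doc1", "new york", 3)
store.insert("doc1", "new jersey")
store.insert("doc1", "york city", 2)
store.insert("doc2", "new york")
store.insert("doc2", "los angeles")

print(store.starts_with("doc1", "new"))  # n-grams in doc1 beginning with "new"
print(store.top("doc1", 2))              # two most frequent n-grams in doc1
print(store.jaccard("doc1", "doc2"))     # overlap between the two contexts
```

Keeping one counter per context keeps “starts with” and “top” queries scoped to a single context, while a similarity metric compares two contexts against each other.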

Lucidworks Activate Search and AI Conference

Back in October 2018 I had the privilege of attending and presenting at the Lucidworks Activate Search and AI Conference in Montreal, Canada. It was a first-class event with lots of great, informative sessions set in the middle of a remarkable city. I co-presented “Embracing Diversity: Searching over multiple languages” with Suneel Marthi in […]

Apache OpenNLP Language Detection in Apache NiFi

When building an NLP pipeline in Apache NiFi, you may need to route text through the pipeline based on its language. But how do we get the language of the text inside our pipeline? This blog post introduces a processor for Apache NiFi that utilizes Apache OpenNLP’s language detection […]

NLP Pipeline using Apache NiFi and NLP Building Blocks

This blog post shows how we can create an NLP pipeline to perform named-entity extraction on natural language text using the NLP Building Blocks and Apache NiFi. The NLP Building Blocks provide the ability to perform sentence extraction, string tokenization, and named-entity extraction. They are implemented as microservices and can be deployed almost anywhere, such […]

OpenNLP’s RegexNameFinder and Tokenizing

OpenNLP’s RegexNameFinder takes one or more regular expressions and uses them to extract entities from the input text. This is very useful when you want to extract things that follow a set format, like phone numbers and email addresses. However, be careful when tokenizing the input to the RegexNameFinder, because it can affect […]
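
The interaction between tokenization and regex matching can be sketched without OpenNLP at all. The example below is a plain-Python illustration, not the RegexNameFinder API: the `find_in_tokens` helper is hypothetical, and it assumes a finder that matches the pattern against the tokens re-joined with single spaces and keeps only matches that line up with token boundaries.

```python
import re


def find_in_tokens(pattern, tokens):
    """Match a regex against tokens re-joined with single spaces.

    Returns (start_token, end_token_exclusive) spans for matches that begin
    and end exactly on token boundaries; other matches are dropped.
    This is an assumed model of token-based regex matching, not OpenNLP code.
    """
    starts, ends = {}, {}
    pos = 0
    for i, tok in enumerate(tokens):
        starts[pos] = i               # character offset where token i begins
        ends[pos + len(tok)] = i + 1  # character offset where token i ends
        pos += len(tok) + 1           # +1 for the joining space
    text = " ".join(tokens)
    spans = []
    for m in re.finditer(pattern, text):
        if m.start() in starts and m.end() in ends:
            spans.append((starts[m.start()], ends[m.end()]))
    return spans


PHONE = r"\d{3}-\d{3}-\d{4}"

# Whitespace tokenization keeps the phone number in one token.
whitespace_tokens = "Call 555-123-4567 today".split()

# A tokenizer that splits on punctuation breaks the number apart, so the
# re-joined text reads "555 - 123 - 4567" and the pattern no longer matches.
split_tokens = ["Call", "555", "-", "123", "-", "4567", "today"]

print(find_in_tokens(PHONE, whitespace_tokens))
print(find_in_tokens(PHONE, split_tokens))
```

Under these assumptions the same pattern finds the phone number with whitespace tokens and finds nothing once the tokenizer has split on the hyphens, so the pattern and the tokenizer need to be chosen together.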