Category: natural-language-processing

Dataworks Summit 2019

Recently (in May 2019) I had the honor of attending and speaking at the Dataworks Summit in Washington, D.C. The conference had many interesting topics and keynote speakers focused on big-data technologies and business applications. I also always enjoy exploring downtown Washington, D.C. Whether it is doing the “hike” across the National Mall taking in […]

Creating an N-gram Language Model

A statistical language model is a probability distribution over sequences of words. (source) We can build a language model using n-grams and query it to determine the probability of an arbitrary sentence (a sequence of words) belonging to that language. Language modeling has uses in various NLP applications such as statistical machine translation and speech […]
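As a rough illustration of the idea in this excerpt, here is a minimal sketch of a bigram (n = 2) model in Python. It is not the code from the post: the function names, the toy corpus, the whitespace tokenization, and the use of unsmoothed maximum-likelihood estimates are all assumptions made for the example.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count bigrams and their left-hand contexts (hypothetical helper)."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        # Assumption: lowercase whitespace tokenization with boundary markers.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return contexts, bigrams


def sentence_probability(sentence, contexts, bigrams):
    """Multiply P(word | previous word) over the sentence, with no smoothing."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0  # unseen context: the unsmoothed estimate is zero
        prob *= bigrams[(prev, word)] / contexts[prev]
    return prob


# Toy corpus, invented for illustration only.
corpus = ["the cat sat", "the dog sat", "the cat ran"]
contexts, bigrams = train_bigram_model(corpus)

print(sentence_probability("the cat sat", contexts, bigrams))
print(sentence_probability("the dog ran", contexts, bigrams))
```

Because nothing is smoothed, any sentence containing a bigram that never occurred in the training text gets probability zero. A practical model would usually add smoothing or backoff to avoid that.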

Lucidworks Activate Search and AI Conference

Back in October 2018 I had the privilege of attending and presenting at the Lucidworks Activate Search and AI Conference in Montreal, Canada. It was a first-class event with lots of great, informative sessions set in the middle of a remarkable city. I was a co-presenter of “Embracing Diversity: Searching over multiple languages” with Suneel Marthi in […]

Apache OpenNLP Language Detection in Apache NiFi

When building an NLP pipeline in Apache NiFi, you may need to route text through the pipeline based on its language. But how do we get the language of the text inside our pipeline? This blog post introduces a processor for Apache NiFi that utilizes Apache OpenNLP’s language detection […]

NLP Pipeline using Apache NiFi and NLP Building Blocks

This blog post shows how we can create an NLP pipeline to perform named-entity extraction on natural language text using the NLP Building Blocks and Apache NiFi. The NLP Building Blocks provide the ability to perform sentence extraction, string tokenization, and named-entity extraction. They are implemented as microservices and can be deployed almost anywhere, such […]

OpenNLP’s RegexNameFinder and Tokenizing

OpenNLP’s RegexNameFinder takes one or more regular expressions and uses them to extract entities from the input text. This is very useful when you want to extract things that follow a set format, like phone numbers and email addresses. However, when tokenizing the input to the RegexNameFinder, be careful because it can affect […]