NLP Pipeline using Apache NiFi and NLP Building Blocks

This blog post shows how we can create an NLP pipeline to perform named-entity extraction on natural language text using the NLP Building Blocks and Apache NiFi. The NLP Building Blocks provide the ability to perform sentence extraction, string tokenization, and named-entity extraction. They are implemented as microservices and can be deployed almost anywhere, such as on AWS, on Azure, or as Docker containers.

By the end of this blog post we will have a system that reads natural language text stored in files on the file system, pulls out the sentences in each file, finds the tokens in each sentence, and finds the named-entities in the tokens.

Apache NiFi is an open-source application that provides data flow capabilities. Using NiFi you can visually define how data should flow through your system. Using what NiFi calls “processors”, you can ingest data from many data sources, perform operations on the data such as transformations and aggregations, and then output the data to an external system. We will be using NiFi to facilitate the flow of text through our NLP pipeline. The text will be read from plain text files on the file system. We will then:

  • Identify the sentences in input text.
  • For each sentence, extract the tokens in the sentence.
  • Process the tokens for named-entities.

To get started we will stand up the NLP Building Blocks. We will launch these applications using a docker-compose script.

git clone https://github.com/mtnfog/nlp-building-blocks
cd nlp-building-blocks
docker-compose up

This will pull the Docker images from Docker Hub and run the containers. We now have each NLP Building Block up and running. Let’s get Apache NiFi up and running, too.
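
Note that docker-compose up runs in the foreground. From a second terminal (or after starting the stack with docker-compose up -d instead) you can confirm the containers are running:

docker-compose ps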

To get started with Apache NiFi we will download it. It is a big download at just over 1 GB. You can download it from the Apache NiFi Downloads page or directly from an Apache mirror; this post uses NiFi 1.4.0. Once the download is done we will unzip it and start NiFi:

unzip nifi-1.4.0-bin.zip
cd nifi-1.4.0/bin
./nifi.sh start

NiFi will start and after a few minutes it will be available at http://localhost:8080/nifi. (If you are curious you can see the NiFi log under logs/nifi-app.log.) Open your browser to that page and you will see the NiFi canvas as shown below. We can now design our data flow around the NLP Building Blocks!
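
If NiFi seems slow to come up, you can check on it and follow its log from the bin directory we are already in:

./nifi.sh status
tail -f ../logs/nifi-app.log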

If you want to skip to the meat and potatoes you can get the NiFi template described below in the nlp-building-blocks repository.

Our source data is going to be read from text files on our computer stored under /tmp/in/. We will use NiFi’s GetFile processor to read these files. Add a GetFile processor to the canvas:


Right-click the GetFile processor and click Configure to bring up the processor’s properties. The only property we are going to set is the Input Directory property. Set it to /tmp/in/ and click Apply:
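
GetFile will be watching /tmp/in/ for files, so make sure that directory exists (a quick step from the shell, assuming a Linux or macOS machine):

mkdir -p /tmp/in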

We will use the InvokeHTTP processor to send API requests to the NLP Building Blocks, so add a new InvokeHTTP processor to the canvas:

This first InvokeHTTP processor will be used to send the data to Prose Sentence Extraction Engine to extract the sentences in the text. Open the InvokeHTTP processor’s properties and set the following values:

  • HTTP Method – POST
  • Remote URL – http://localhost:7070/api/sentences
  • Content-Type – text/plain

Set the processor to automatically terminate every relationship except Response. We will also set the processor’s name to ProseSentenceExtractionEngine; since we will be using multiple InvokeHTTP processors, this lets us easily differentiate between them. We can now create a connection between the GetFile and InvokeHTTP processors by clicking and dragging a line between them. Our flow now reads files from the file system and sends their contents to Prose.
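
If you want to sanity-check the sentence endpoint outside of NiFi, you can call it directly with curl. This is just a sketch using the port and Content-Type configured above; the response should be the JSON array of sentences that the next step relies on:

# Send plain text to the sentence extraction endpoint
curl -s -X POST \
  -H "Content-Type: text/plain" \
  --data "George Washington was president. This is another sentence." \
  http://localhost:7070/api/sentences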

The sentences returned from Prose will be in a JSON array. We can split this array into individual FlowFiles with the SplitJson processor. Add a SplitJson processor to the canvas and set its JsonPath Expression property to $.* as shown below:

Connect the ProseSentenceExtractionEngine processor to the SplitJson processor using the Response relationship. The canvas should now look like this:

Now that we have the individual sentences in the text we can send those sentences to Sonnet Tokenization Engine to tokenize them. Similar to before, add an InvokeHTTP processor and name it SonnetTokenizationEngine. Set its HTTP Method to POST, the Remote URL to http://localhost:9040/api/tokenize, and the Content-Type to text/plain. Automatically terminate every relationship except Response. Connect the SplitJson processor to it using the split relationship. The result of this processor will be an array of tokens from the input sentence.
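
The tokenization endpoint can be exercised the same way with curl (again a sketch, using the port and Content-Type we just configured); it should return a JSON array of tokens:

# Send a single sentence to the tokenization endpoint
curl -s -X POST \
  -H "Content-Type: text/plain" \
  --data "George Washington was president." \
  http://localhost:9040/api/tokenize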

While we are at it, let’s go ahead and add an InvokeHTTP processor for Idyl E3 Entity Extraction Engine. Add the processor to the canvas and set its name to IdylE3EntityExtractionEngine. Set its properties:

  • HTTP Method – POST
  • Remote URL – http://localhost:9000/api/extract
  • Content-Type – application/json

Connect the SonnetTokenizationEngine processor to the IdylE3EntityExtractionEngine processor via the Response relationship. All other relationships can be set to automatically terminate. To make things easier to see, we are going to add an UpdateAttribute processor that sets the filename of each FlowFile to a random UUID. Add an UpdateAttribute processor and add a new property called filename with the value ${uuid}.txt. We will also add a PutFile processor to write the FlowFiles to disk so we can see what happened during the flow’s execution; set its Directory property to /tmp/out/.
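
You can exercise the entity extraction endpoint from the shell, too. The sketch below simply pipes the tokenizer’s JSON response straight into Idyl E3, which is what the flow’s Response connection does; it assumes Idyl E3 accepts that body unchanged:

# Tokenize a sentence, then feed the tokens to Idyl E3 for entity extraction
curl -s -X POST \
  -H "Content-Type: text/plain" \
  --data "George Washington was president." \
  http://localhost:9040/api/tokenize \
| curl -s -X POST \
  -H "Content-Type: application/json" \
  --data-binary @- \
  http://localhost:9000/api/extract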

Our finished flow looks like this:

To test our flow we are going to use a super simple text file. The full contents of the text file are:

George Washington was president. This is another sentence. Martha Washington was first lady.

Save this file as /tmp/in/test.txt.
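
You can create the file straight from the shell:

echo "George Washington was president. This is another sentence. Martha Washington was first lady." > /tmp/in/test.txt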

The NLP Building Blocks we started earlier with docker-compose should still be running; if you stopped them, bring them back up the same way before continuing.

Now you can start the processors in the flow! The file /tmp/in/test.txt will disappear and three files will appear in /tmp/out/. The three files will have random UUIDs for filenames thanks to the UpdateAttribute processor.
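
A quick way to check the results from the shell:

# List and print the files written by the PutFile processor
ls /tmp/out/
cat /tmp/out/*.txt

If we look at the contents of each of these files we see: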

First file:

{"entities":[{"text":"George Washington","confidence":0.96,"span":{"tokenStart":0,"tokenEnd":2},"type":"person","languageCode":"eng","extractionDate":1514488188929,"metadata":{"x-model-filename":"mtnfog-en-person.bin"}}],"extractionTime":84}

Second file:

{"entities":[],"extractionTime":7}

Third file:

{"entities":[{"text":"Martha Washington","confidence":0.89,"span":{"tokenStart":0,"tokenEnd":2},"type":"person","languageCode":"eng","extractionDate":1514488189026,"metadata":{"x-model-filename":"mtnfog-en-person.bin"}}],"extractionTime":2}

The input text was broken into three sentences so we have three output files. In the first file we see that George Washington was extracted as a person entity. The second file did not have any entities. The third file had Martha Washington as a person entity. Our NLP pipeline orchestrated by Apache NiFi read the input, broke it into sentences, broke each sentence into tokens, and then identified named-entities from the tokens.

This flow assumed the language would always be English, but if you are unsure you can add another InvokeHTTP processor that calls Renku Language Detection Engine. This enables language detection inside your flow, letting you route FlowFiles based on the detected language and giving you a very powerful NLP pipeline.

There’s a lot of cool stuff here, but arguably one of the coolest things is that by using the NLP Building Blocks you don’t have to pay the per-request pricing that many NLP services charge. You can run this pipeline as much as you need to. And if you are in an environment where your text can’t leave your network, this pipeline can run completely behind a firewall (just like we did in this post).

 
