For over 10 years I have worked with many companies to help them either migrate to the cloud or develop new cloud applications. A very common requirement is that the designed architecture avoid any cloud-vendor-specific technologies or services. The rationale is usually that although we are running our application on vendor X […]
Author: jzemerick
Querqy Chorus
For the past couple of months I have attended occasional presentations about Chorus, an open source stack for search, created by Querqy. The presentations have focused on the stack's components, including Apache Solr, SMUI (Search Management UI), and the search relevancy tool Quepid, among others. There are a decent number of search-related open source projects out […]
Hello, world!
My 2021 resolution (one of them) is to blog more. I have never been much of a post author even though through my life and career I have benefited greatly from the posts of others. Not only does writing help others, it also helps the author form their thoughts and become more knowledgeable about the […]
Some First Steps for a New NiFi Cluster
After installing Apache NiFi, there are a few steps you might want to take before making your cluster available for prime time. None of these steps are required, so make sure they are appropriate for your use case before implementing them. Lowering NiFi’s Log File Retention Properties By default, Apache NiFi’s nifi-app.log files are capped at […]
A Tool for Every Data Engineer’s Toolbox
Collecting data from edge devices in manufacturing, processing medical records from electronic health systems, and analyzing text all sound like very different problems, each requiring unique solutions. While that certainly is true, these tasks have some things in common. Each task requires a scalable method of data ingestion, predictable performance, and capabilities for […]
Monitoring Apache NiFi’s Log with AWS CloudWatch
It’s inevitable that at some point while running Apache NiFi on a single node or as a cluster you will want to see what’s in NiFi’s log and maybe even be alerted when certain logged events are found. Maybe you are debugging your own processor or just looking for more insight into your data flow. […]
Monitoring Apache NiFi with Datadog
One of the most common requirements when using Apache NiFi is a means to adequately monitor the NiFi cluster. Insights into a NiFi cluster’s use of memory, disk space, CPU, and NiFi-level metrics are crucial to operating and optimizing data flows. NiFi’s Reporting Tasks provide the capability to publish metrics to external services. Datadog is […]
Apache NiFi’s MergeContent Processor
The MergeContent processor in Apache NiFi is one of the most useful processors but can also be one of the biggest sources of confusion. The processor (you guessed it!) merges flowfiles together based on a merge strategy. The processor’s purpose is straightforward but its properties can be tricky. In this post we describe how it […]
Dataworks Summit 2019
Recently (in May 2019) I had the honor of attending and speaking at the Dataworks Summit in Washington, D.C. The conference had many interesting topics and keynote speakers focused on big-data technologies and business applications. I also always enjoy exploring downtown Washington, D.C. Whether it is doing the “hike” across the National Mall taking in […]
Creating an N-gram Language Model
A statistical language model is a probability distribution over sequences of words. (source) We can build a language model using n-grams and query it to determine the probability of an arbitrary sentence (a sequence of words) belonging to that language. Language modeling has uses in various NLP applications such as statistical machine translation and speech […]
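As a rough illustration of the idea in the excerpt above, here is a minimal sketch of an n-gram model using bigrams (n = 2). It is not necessarily the approach the full post takes: the function names `train_bigram_model` and `sentence_log_probability`, the whitespace tokenization, the `<s>`/`</s>` boundary markers, and the add-one (Laplace) smoothing are all assumptions made for this example.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count bigrams and their left-hand contexts in the training sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocabulary = set()
    for sentence in sentences:
        # Assumption: lowercase, whitespace tokenization with boundary markers.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocabulary.update(tokens)
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, len(vocabulary)


def sentence_log_probability(sentence, model):
    """Return the log probability of a sentence under the bigram model."""
    context_counts, bigram_counts, vocabulary_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    log_probability = 0.0
    for previous, word in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        numerator = bigram_counts[(previous, word)] + 1
        denominator = context_counts[previous] + vocabulary_size
        log_probability += math.log(numerator / denominator)
    return log_probability


model = train_bigram_model(["the cat sat", "the dog sat"])
likely = sentence_log_probability("the cat sat", model)
unlikely = sentence_log_probability("sat cat the", model)
```

With this toy training set, a sentence whose word order matches the training data (`likely`) scores higher than the same words shuffled (`unlikely`). Working in log space avoids numeric underflow when sentences get long.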