Streaming Data from Various Sources

One of the main research and development topics today is data processing and analysis, which helps companies discover relevant information about their customers or technologies through reports, visualizations, dashboards, and other business intelligence outputs. In the previous article, I recalled our team's workshop where we laid the foundation for our data mining and analysis efforts. The open-source product ASAB, which you can view and contribute to on GitHub, forms a basis for request processing, event management, and metrics computation. However, its focus is not to process data from various sources and send them to business intelligence applications, data warehouses, or databases. Rather, these tasks are handled by another layer of data processing, which this article is about: BSPump, short for Black Swan Pump.

Origins of Black Swan

As with other foundations for our applications, this one also started with a workshop. After reviewing ASAB's functions and possibilities, we discussed asynchronous data processing and its implementation in Python's asyncio library. We then realized that we could draw on experience from our previous products and implement data processing as multiple independent instances of so-called pipelines. Generally, a pipeline is a linear chain of connected data processors: the first processor receives raw data from a specified source, and the last one pushes the transformed, processed, and enriched bulk of data into a specified database, file, or application.
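
To make the idea concrete, here is a rough sketch of that linear flow in plain Python. It illustrates the concept only; it is not BSPump's actual API:

    # Conceptual illustration only; BSPump's real classes are asynchronous
    # and richer than this.

    class Processor:
        """One pipeline stage: receives an event, returns a (possibly new) event."""
        def process(self, event):
            return event

    class UppercaseProcessor(Processor):
        def process(self, event):
            return event.upper()

    class Pipeline:
        """A linear chain: source -> processors -> sink."""
        def __init__(self, source, processors, sink):
            self.source = source          # iterable that yields raw events
            self.processors = processors  # ordered list of Processor instances
            self.sink = sink              # callable that stores the final event

        def run(self):
            for event in self.source:
                for processor in self.processors:
                    event = processor.process(event)
                self.sink(event)

    # Read events from a list, transform them, and "store" them by printing.
    Pipeline(["raw event"], [UppercaseProcessor()], print).run()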

The BSPump pipeline

Picture: Schema of a BSPump pipeline.

My own task was to focus on InfluxDB outputs and the kind of processor that pushes bulks of data into databases, the kind we call a sink. In the beginning, we had no clear definition of what the pipelines and processors should look like in code or how they could be easily connected and configured. After a series of talks, however, a solution emerged, which you can view on GitHub. Like ASAB, BSPump is open-source and you are free to contribute to it. The basic idea lies in a publish-subscribe mechanism, which can start, stop, or temporarily pause individual processors in a pipeline, and in a simple data flow captured in the design of abstract classes and their methods. I hope this description did not overwhelm you; you should now have an idea of what happens to data inside BSPump.
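
As a taste of what emerged, here is a minimal sketch in the spirit of the examples in the BSPump repository. Treat the concrete names (bspump.file.FileLineSource, bspump.common.PPrintSink, and the "bspump.PumpService" service) as assumptions to be checked against the current code on GitHub:

    #!/usr/bin/env python3
    # Minimal sketch following the style of the BSPump examples; verify the
    # exact class and service names against the repository.
    import bspump
    import bspump.common
    import bspump.file

    class SamplePipeline(bspump.Pipeline):
        def __init__(self, app, pipeline_id):
            super().__init__(app, pipeline_id)
            self.build(
                # Source: read an input file line by line (assumed class name)
                bspump.file.FileLineSource(app, self, config={"path": "./data.log"}),
                # Sink: pretty-print each event instead of writing to a database
                bspump.common.PPrintSink(app, self),
            )

    if __name__ == "__main__":
        app = bspump.BSPumpApplication()
        svc = app.get_service("bspump.PumpService")
        svc.add_pipeline(SamplePipeline(app, "SamplePipeline"))
        app.run()

In a production pipeline, the PPrintSink would be replaced by a database sink, such as an InfluxDB or Elasticsearch one.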

Pipelines

The concept of pipelines with a publish-subscribe mechanism is flexible and powerful. Not only can pipelines run alongside one another and process data in real time, they can also subscribe to events (such as system interrupts) and finish sending data through their sinks to output data stores or applications before they are shut down. In this way, we can be sure no data is lost along the way. While I was working on the concept of database sinks and my colleague Mila was focusing on source processors (reading data from logs and other inputs), Honza worked on porting an Elasticsearch connector from our previous project, which would also be used in sink processors. We work with Elasticsearch a lot and build Kibana visualizations from its indexed data, so implementing an Elasticsearch connector was one of our first decisions when it came to BSPump. Our team was quite busy implementing all these features, and we had to decide what to do next after the workshop had finished. Ales made a few refinements to the design and architecture afterwards, but the workshop itself was successful and created a basis for BSPump, which we have been extending ever since.
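
Because BSPump builds on ASAB, subscribing to such a lifecycle event can be as simple as hooking a handler onto the application's PubSub. A minimal sketch, assuming ASAB's "Application.stop!" message name (worth verifying against the ASAB documentation):

    import asab

    class GracefulApp(asab.Application):
        def __init__(self):
            super().__init__()
            # Called when the application is being stopped, e.g. on SIGINT.
            self.PubSub.subscribe("Application.stop!", self._on_stop)

        def _on_stop(self, message_type, *args):
            # A pipeline would flush its sinks here so no buffered data is lost.
            print("Stopping: flushing pending events to sinks...")

    if __name__ == "__main__":
        GracefulApp().run()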

A real-time stream processor

So, technically speaking, BSPump can process data coming from a source stream in real time, enrich them with information (such as a precise location), and then transform them into a specified output format or send them to data stores like Elasticsearch. One of the most exciting features is the computation of defined metrics (which form the basis for data mining analysis) and anomaly detection. Data transformation can also be used to anonymize personal information such as emails and names as part of a GDPR solution; the sketch below shows what such a processor might look like.
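
This is an illustration only: the "email" field and the SHA-256 hashing are my assumptions, not a prescribed BSPump recipe, and the bspump.Processor base class with its process(context, event) method should be checked against the repository:

    import hashlib

    import bspump

    class EmailAnonymizer(bspump.Processor):
        """Replace the (assumed) 'email' field of each event with a SHA-256 digest."""

        def process(self, context, event):
            if isinstance(event, dict) and "email" in event:
                event["email"] = hashlib.sha256(
                    event["email"].encode("utf-8")
                ).hexdigest()
            return event

If you are interested in the project or would like to contribute to it, please see our GitHub project or contact us at info@teskalabs.com or on Gitter. BSPump is open-source and ready to integrate ideas and solutions from a wide community!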

Continue to next article

About the Author

Premysl Cerny

Software Developer at TeskaLabs



