Building an Expandable IoT Data Pipeline - Part 1
Background and problem description
In late 2016 we were approached with an opportunity to build a data pipeline for one of our large clients. The client needed to obtain sensor data and media (images, video, etc.) from custom IoT devices built by one of their vendors. The goal was to capture and analyze metrics to improve the efficiency of a business process.
The devices contained various sensors and cameras to capture environmental and visual data. There were two groups of devices, which for simplicity I'll call labeled and unlabeled. Labeled devices would be affixed to an object moving through a local wireless mesh network, gathering data from unlabeled devices while capturing their own sensor and visual data. Unlabeled devices would remain contained within a region but could potentially move around to various targeted areas within it. Labeled devices would later return to a 'home base' and wirelessly upload the collected data to the pipeline.
Our client wanted us to quickly prototype and release a working product. Over the next few posts, we'll describe our process and some of the implementation details of the project.
Determining an infrastructure and architecture
What did our client actually need? Working backwards, we knew that once they had their data, they needed some simple way to access it. This application might also need an additional data layer (for things like user data) and authentication/authorization. Since the data might need to service multiple sources or applications, it made sense to create an API to handle requests. Thus, our API would need to pull from a data warehouse, and potentially other persistence layers, holding aggregated, clean data. We also knew we needed to store the raw data so that nothing was lost. The most difficult piece would be the middle: how would we process and transform the raw data into aggregated, clean data?
To summarize, we needed:
- Raw data store
- Data processing / aggregation (ETL) infrastructure
- Data warehouse / application persistence layers
- API application
- Portal application
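To make the shape of the pipeline concrete, here's a minimal sketch in Python of how raw device CSV data might flow through the aggregation (ETL) step into a warehouse-style summary. The schema here (fields like `device_id` and `temperature`) is purely illustrative, not the project's actual data model.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw CSV as uploaded by a device (illustrative schema).
RAW_CSV = """device_id,timestamp,temperature
dev-1,2016-11-01T00:00:00Z,21.5
dev-1,2016-11-01T00:05:00Z,22.5
dev-2,2016-11-01T00:00:00Z,19.0
"""

def etl(raw_csv: str) -> dict:
    """Aggregate raw rows into per-device averages (the 'clean' layer)."""
    sums = defaultdict(lambda: [0.0, 0])
    for row in csv.DictReader(io.StringIO(raw_csv)):
        acc = sums[row["device_id"]]
        acc[0] += float(row["temperature"])
        acc[1] += 1
    return {dev: total / count for dev, (total, count) in sums.items()}

print(etl(RAW_CSV))  # {'dev-1': 22.0, 'dev-2': 19.0}
```

In the real pipeline, the raw CSV lives in the raw data store, a transform like `etl` runs in the processing layer, and its output lands in the warehouse for the API to serve.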
We chose to use the following technologies:
- AWS
- Go
- Python 3
- VueJS
AWS was convenient for quick infrastructure creation and provided everything we needed to implement and host the pipeline along with our applications.
The IoT devices captured data in CSV format and needed to upload it to our AWS S3 buckets. We didn't know whether the hardware architecture of the devices would change. Our team had Go experience, and Go makes it simple to cross-compile for different architectures. Hence, Go.
Python has many benefits. It's a great scripting language, and we could use it to implement any AWS Lambdas we needed. It also has excellent support for data analytics and is one of the major data science languages. To keep language variation to a minimum, we wrote the API in Python along with our scripts and Lambdas.
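As an example of the Lambda side, here's a minimal sketch of an S3-triggered handler that extracts the object references from an upload event. The `(event, context)` signature is the standard AWS Lambda Python interface and the event shape follows the S3 notification format, but the bucket and key names below are hypothetical.

```python
def handler(event, context):
    """Extract (bucket, key) pairs from an S3 put-event notification.

    In the real pipeline this is where the raw CSVs would be fetched
    and handed to the ETL step; here we just return the references.
    """
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        objects.append((s3["bucket"]["name"], s3["object"]["key"]))
    return objects

# Example invocation with a minimal S3-style event (hypothetical names):
fake_event = {
    "Records": [
        {"s3": {"bucket": {"name": "raw-data"},
                "object": {"key": "device-1/2016-11-01.csv"}}}
    ]
}
print(handler(fake_event, None))  # [('raw-data', 'device-1/2016-11-01.csv')]
```

Keeping the handler a plain function like this makes it trivial to unit-test locally with a fake event before deploying.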
Using a static site for the Portal was pretty much a no-brainer. We wanted a clean, reactive framework and settled on VueJS. No regrets.
Next up: Data
The next part of this series will cover some of the implementation details of the data pipeline.