Every time an object is written, the system inserts a message into an Apache Kafka message queue (https://kafka.apache.org/), so that further processing can happen asynchronously on the freshly written data. This pattern is called the “Data Pipeline” and is a key component of modern big-data architectures. The goal of the internship is to set up a Kafka cluster and a stream processing framework and to build a suite of interesting post-processing applications on top of them. Depending on your interests and prior experience we can choose from the following areas:
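As a rough illustration of the pattern, here is a minimal sketch in Python, with `queue.Queue` standing in for the Kafka topic (a real deployment would use a Kafka client such as kafka-python or confluent-kafka talking to a broker); all function and field names here are hypothetical:

```python
import json
import queue
import threading

# Stand-in for a Kafka topic. In the real pipeline this would be a
# KafkaProducer on the write path and a KafkaConsumer on the processing side.
events = queue.Queue()

def on_object_written(bucket: str, key: str, size: int) -> None:
    """Called by the object store after each write; publishes an event message."""
    message = json.dumps({"bucket": bucket, "key": key, "size": size})
    events.put(message)

def process_events(results: list) -> None:
    """Consumer side: post-process each freshly written object asynchronously."""
    while True:
        message = events.get()
        if message is None:          # sentinel used here to stop the worker
            break
        event = json.loads(message)
        results.append(f"processed {event['bucket']}/{event['key']}")

results = []
worker = threading.Thread(target=process_events, args=(results,))
worker.start()
on_object_written("photos", "cat.jpg", 123456)   # the write path publishes...
events.put(None)                                 # ...and the consumer catches up
worker.join()
print(results)  # → ['processed photos/cat.jpg']
```

The point of the pattern is that the write path only publishes a small message and returns immediately; all heavy post-processing happens on the consumer side, at its own pace.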
- Uploaded images could be resized, auto-enhanced, filtered, …; the resulting artifacts would be re-uploaded to the object store as auxiliary objects
- Feed images to an image-recognition algorithm (self-written or in the cloud) to categorize, tag, … the content and push the results to an external database/tool
- Transcode, post-process, … uploaded videos and re-upload the results as additional objects
- Feed audio to a speech-recognition algorithm (self-written or in the cloud) to auto-generate subtitles/transcripts
- Compute & visualize system statistics (averages, histograms, percentiles, …) and metrics such as object name and data size, object lifetime, capacity use per bucket, …
- Feed object names and MD5 checksums into a blockchain or Merkle tree for a form of ‘digital notarization’
- If you’re passionate about an interesting application, that’s even better.
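To make the notarization idea concrete: a Merkle tree reduces a set of object checksums to a single root hash, so publishing just that root later proves none of the objects changed. A minimal sketch using only the standard library (object names and contents below are invented for illustration):

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of raw bytes."""
    return hashlib.md5(data).hexdigest()

def merkle_root(leaves: list) -> str:
    """Reduce a list of hex digests to a single root by pairwise hashing."""
    if not leaves:
        return md5_hex(b"")
    level = leaves
    while len(level) > 1:
        if len(level) % 2 == 1:          # duplicate the last node on odd levels
            level = level + [level[-1]]
        level = [md5_hex((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]

# Each leaf commits to both the object's name and its content checksum.
objects = [("bucket/a.txt", b"hello"), ("bucket/b.txt", b"world")]
leaves = [md5_hex(name.encode() + md5_hex(data).encode())
          for name, data in objects]
root = merkle_root(leaves)
print(root)  # the published root; changing any object changes this value
```

Tampering with any single object changes its leaf and therefore the root, which is what makes the scheme useful as a lightweight notarization record.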
You will become familiar with cloud industry protocols such as the Amazon S3 API and with open-source projects (Apache Kafka, stream processing), and you will build valuable coding, prototyping and debugging experience with distributed and cloud-based applications.
- Programming language of your choice: Python, Java, C++, Go, …
- AWS S3 API
- Apache Kafka
Create a demo that we can show to our customers to demonstrate the Data Pipeline.
- 6-week internship
- Between July and September; you can choose when.
- Degree: Master of Science in Computer Engineering.
Upload your CV or send an e-mail to email@example.com!