Realtime Data Streaming With TCP Socket, Apache Spark, OpenAI LLM, Kafka and Elasticsearch


Introduction

This project is a comprehensive guide to building an end-to-end data engineering pipeline with a TCP/IP socket, Apache Spark, an OpenAI LLM, Kafka, and Elasticsearch. It covers each stage: data acquisition, stream processing, sentiment analysis with ChatGPT, production to a Kafka topic, and indexing into Elasticsearch.
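The acquisition stage can be sketched as a small TCP server that streams newline-delimited JSON records in chunks. This is a minimal, self-contained illustration: the sample records below are hypothetical stand-ins for the Yelp dataset, and the function names are not from the project's code.

```python
import json
import socket
import threading

# Hypothetical sample records standing in for the Yelp dataset.
SAMPLE_REVIEWS = [
    {"review_id": "r1", "text": "Great food!"},
    {"review_id": "r2", "text": "Slow service."},
]

def serve_chunks(host="127.0.0.1", port=0):
    """Send newline-delimited JSON records over a TCP socket."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, port))
    server.listen(1)
    port = server.getsockname()[1]  # port 0 lets the OS pick a free port

    def handle():
        conn, _ = server.accept()
        with conn:
            for record in SAMPLE_REVIEWS:
                # One JSON document per line -- the framing a
                # line-oriented socket consumer (e.g. Spark) expects.
                conn.sendall((json.dumps(record) + "\n").encode("utf-8"))
        server.close()

    threading.Thread(target=handle, daemon=True).start()
    return port

def read_stream(port, host="127.0.0.1"):
    """Consume the stream and parse each line back into a dict."""
    with socket.create_connection((host, port)) as client:
        lines = client.makefile("r", encoding="utf-8")
        return [json.loads(line) for line in lines]

port = serve_chunks()
records = read_stream(port)
print(records[0]["review_id"])  # → r1
```

In the real pipeline, the consumer side of this socket is the Spark cluster rather than a plain client.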

System Architecture

[System architecture diagram: System_architecture.png]

The project is designed with the following components:

  • Data Source: The Yelp dataset (yelp.com) feeds the pipeline.
  • TCP/IP Socket: Streams the data over the network in chunks.
  • Apache Spark: Processes the stream with its master and worker nodes.
  • Confluent Kafka: Our managed Kafka cluster in the cloud.
  • Control Center and Schema Registry: Monitoring and schema management for the Kafka streams.
  • Kafka Connect: Sinks the processed data into Elasticsearch.
  • Elasticsearch: Indexes the data for querying.
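The sentiment-analysis step sits between Spark and Kafka: each review's text is sent to the LLM and the reply is reduced to a label. The sketch below shows that prompt-and-parse logic with the LLM call stubbed out; the prompt wording, label set, and function names are illustrative assumptions, not the project's actual prompt or API code.

```python
# Illustrative prompt; the project's real prompt may differ.
PROMPT_TEMPLATE = (
    "Classify the sentiment of this Yelp review as POSITIVE, "
    "NEGATIVE, or NEUTRAL. Reply with the label only.\n\n"
    "Review: {text}"
)

VALID_LABELS = {"POSITIVE", "NEGATIVE", "NEUTRAL"}

def build_prompt(review_text: str) -> str:
    return PROMPT_TEMPLATE.format(text=review_text)

def parse_label(raw_reply: str) -> str:
    """Normalize the model reply; fall back to NEUTRAL on noise."""
    label = raw_reply.strip().upper()
    return label if label in VALID_LABELS else "NEUTRAL"

def classify(review_text: str, llm) -> str:
    # `llm` would wrap an OpenAI chat-completion call in the real
    # pipeline; here it is injected so the logic can run standalone.
    return parse_label(llm(build_prompt(review_text)))

# Stubbed LLM standing in for the API call.
print(classify("The tacos were amazing!", llm=lambda prompt: "POSITIVE"))
```

Keeping the LLM call injectable like this also makes the Spark job's sentiment step easy to test without network access.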

Technologies

  • Python
  • TCP/IP
  • Confluent Kafka
  • Apache Spark
  • Docker
  • Elasticsearch

Getting Started

  1. Clone the repository:

    git clone https://github.com/FroCode/Real_Streaming_Kafka.git
  2. Navigate to the project directory:

    cd Real_Streaming_Kafka
  3. Run Docker Compose to spin up the Spark cluster:

    docker-compose up
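Once the containers are running, the Elasticsearch sink is registered through Kafka Connect's REST API. A configuration along these lines could be POSTed to the Connect worker; the connector name, topic, and connection URL below are illustrative placeholders, not values taken from this repository.

```json
{
  "name": "yelp-reviews-es-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "yelp_reviews",
    "connection.url": "http://elasticsearch:9200",
    "key.ignore": "true",
    "schema.ignore": "false"
  }
}
```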