Commit f94e935

Merge pull request #3 from BeamStackProj/readme-update
Readme update
2 parents 7102d47 + 5c4b28c commit f94e935

1 file changed

Lines changed: 113 additions & 6 deletions

README.md

@@ -1,8 +1,115 @@
1-
# transforms
2-
Custom Ptransform collection for beamstack
1+
# Beamstack PTransforms
32

3+
Beamstack provides a collection of custom `PTransforms` designed to simplify and accelerate data processing and machine learning workflows. Whether you’re embedding text, scraping web data, managing Elasticsearch vector stores, or running machine learning inference, Beamstack gives you the components to build scalable and efficient data pipelines.
44

5-
## creating a package
6-
```sh
7-
python -m build --sdist
8-
```
5+
---
6+
7+
## Table of Contents
8+
9+
1. [Introduction](#introduction)
10+
2. [Features of Beamstack PTransforms](#features-of-beamstack-ptransforms)
11+
3. [Components of Beamstack PTransforms](#components-of-beamstack-ptransforms)
12+
4. [Detailed Overview of Beamstack PTransform Components](#detailed-overview-of-beamstack-ptransform-components)
13+
5. [Usage of Beamstack PTransforms](#usage-of-beamstack-ptransforms)
14+
15+
---
16+
17+
## Introduction
18+
19+
Beamstack extends Apache Beam's capabilities by introducing a set of specialized `PTransforms` tailored for modern data engineering and machine learning tasks. These transforms simplify complex workflows by providing pre-built components for common tasks such as text embeddings, web scraping, vector store management, and ML inference.
20+
21+
---
22+
23+
## Features of Beamstack PTransforms
24+
25+
Beamstack offers a comprehensive set of `PTransforms` that are easy to integrate into your Apache Beam pipelines. Key offerings include:
26+
27+
<details>
28+
<summary><b>Seamless Text Embedding in the processing pipeline:</b></summary>
29+
<ul>
30+
<li>Create embeddings using state-of-the-art models from Hugging Face, OpenAI, and more.</li>
31+
</ul>
32+
</details>
33+
34+
<details>
35+
<summary><b>Streamlined Web Scraping in the processing pipeline:</b></summary>
36+
<ul>
37+
<li>Efficiently extract data from websites and preprocess it for further analysis.</li>
38+
</ul>
39+
</details>
40+
41+
<details>
42+
<summary><b>Integrated Elasticsearch Vector Store in the processing pipeline:</b></summary>
43+
<ul>
44+
<li>Create, manage, and query Elasticsearch vector stores seamlessly within your Beam pipelines.</li>
45+
</ul>
46+
</details>
47+
48+
<details>
49+
<summary><b>Optimized Machine Learning Inference in the processing pipeline:</b></summary>
50+
<ul>
51+
<li>Run ML inference directly within your Beam pipelines using popular frameworks.</li>
52+
</ul>
53+
</details>
54+
55+
---
56+
57+
## Components of Beamstack PTransforms
58+
59+
Beamstack is organized into the following core components:
60+
61+
<details>
62+
<summary><b>Embedding Transforms:</b></summary>
63+
<ul>
64+
<li>Supports embedding models from providers such as Hugging Face and OpenAI.</li>
65+
<li>Easy integration with text preprocessing pipelines.</li>
66+
</ul>
67+
</details>
68+
69+
<details>
70+
<summary><b>Web Scraping Transforms:</b></summary>
71+
<ul>
72+
<li>Tools for fetching and parsing web content.</li>
73+
<li>Pre-built pipelines for common web scraping tasks.</li>
74+
</ul>
75+
</details>
76+
77+
<details>
78+
<summary><b>Elasticsearch Vector Store Transforms:</b></summary>
79+
<ul>
80+
<li>Manage vector stores for efficient similarity search.</li>
81+
<li>Tools for indexing and querying high-dimensional vectors.</li>
82+
</ul>
83+
</details>
84+
85+
<details>
86+
<summary><b>Machine Learning Inference Transforms:</b></summary>
87+
<ul>
88+
<li>Integrate ML models into your Beam pipelines for scalable inference.</li>
89+
<li>Supports frameworks such as TensorFlow and PyTorch.</li>
90+
</ul>
91+
</details>
92+
93+
---
94+
95+
## Detailed Overview of Beamstack PTransform Components
96+
97+
### 1. Embedding Transforms
98+
99+
The embedding transforms create vector representations of text data. They support various embedding models, including those from Hugging Face and OpenAI. They are customizable and can be integrated into larger NLP pipelines for tasks such as sentiment analysis and topic modeling.
100+
101+
### 2. Web Scraping Transforms
102+
103+
Beamstack provides a set of transforms specifically for web scraping. These transforms allow you to extract, clean, and preprocess data from websites, making it ready for downstream tasks like text analysis or machine learning.
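As a rough illustration of the "extract, clean, and preprocess" step, the sketch below shows the kind of per-element logic a scraping transform might apply to a fetched page. It uses only the Python standard library and is not Beamstack's API: `extract_text` and `_TextExtractor` are hypothetical names, and fetching the page is left out.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents.

    Hypothetical helper for illustration; not part of Beamstack.
    """

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Return the page's visible text with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return " ".join(" ".join(parser.chunks).split())
```

In an Apache Beam pipeline, a function like this could be applied to each page with `beam.Map(extract_text)`.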
104+
105+
### 3. Elasticsearch Vector Store Transforms
106+
107+
This component allows for the creation and management of Elasticsearch vector stores directly within your Beam pipelines. You can efficiently index and search for high-dimensional vectors, enabling tasks such as semantic search and nearest neighbor retrieval.
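To make "nearest neighbor retrieval" concrete, here is a minimal brute-force sketch of the query a vector store answers: rank stored vectors by cosine similarity to a query vector and return the top `k`. It is an in-memory stand-in, not Elasticsearch or Beamstack code, and `cosine_similarity`, `nearest_neighbors`, and the `index` dictionary are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, index, k=2):
    """Return the k (doc_id, score) pairs most similar to `query`.

    `index` maps a document id to its embedding vector. This scans every
    vector, which is only practical for small collections.
    """
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A dedicated vector store serves the same kind of query over far larger collections, which is the reason to delegate indexing and search to it rather than scanning in the pipeline.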
108+
109+
### 4. ML Inference Transforms
110+
111+
With ML Inference Transforms, you can seamlessly run machine learning models within your Apache Beam pipelines. This component supports popular ML frameworks and provides utilities for handling model inputs and outputs in a distributed manner.
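One common way to handle model inputs and outputs in a pipeline is to group elements into batches, call the model once per batch, and pair each prediction with its input. The sketch below shows that pattern in plain Python. It is not Beamstack's API: `batched`, `run_inference`, and `predict_batch` are hypothetical names, and the "model" in the usage note is a stand-in function.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def batched(items: Iterable, batch_size: int) -> Iterator[List]:
    """Yield consecutive lists of up to `batch_size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def run_inference(
    items: Iterable,
    predict_batch: Callable[[List], List],
    batch_size: int = 2,
) -> Iterator[Tuple]:
    """Yield (input, prediction) pairs, calling the model once per batch.

    `predict_batch` is assumed to return one prediction per input, in order.
    """
    for batch in batched(items, batch_size):
        predictions = predict_batch(batch)
        yield from zip(batch, predictions)
```

For example, with a toy model `lambda batch: [len(text) for text in batch]`, `run_inference(["a", "bb", "ccc"], ...)` pairs each string with its length while invoking the model twice instead of three times.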
112+
113+
---
114+
115+
## Usage of Beamstack PTransforms
