Commit 172f241

Add Documentation Page (#42)
* Use mkdocs material to create a documentation website on Github Pages (https://manatee-project.github.io/manatee/)
* Add CI/CD workflow to build document website on push
* Minor fixes
1 parent b163314 commit 172f241

File tree

16 files changed

+284
-139
lines changed

.github/workflows/docs.yml

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+name: docs
+on:
+  push:
+    branches: [main]
+permissions:
+  contents: write
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Configure Git Credentials
+        run: |
+          git config user.name github-actions[bot]
+          git config user.email 41898282+github-actions[bot]@users.noreply.github.com
+      - uses: actions/setup-python@v5
+        with:
+          python-version: 3.x
+      - run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV
+      - uses: actions/cache@v4
+        with:
+          key: mkdocs-material-${{ env.cache_id }}
+          path: .cache
+          restore-keys: |
+            mkdocs-material-
+      - run: pip install mkdocs-material
+      - run: mkdocs gh-deploy --force
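
Reviewer note: the workflow keys its cache on the ISO week number, so a new cache entry is saved at most once per week, while `restore-keys` lets a run fall back to the most recent `mkdocs-material-` entry. The following is a minimal sketch of how that key is formed, assuming GNU `date` (as shipped on `ubuntu-latest`). The `cache_key` variable name is illustrative and does not appear in the workflow.

```shell
# Reproduce the weekly cache key computed by the docs workflow.
# Assumes GNU date; %V is the ISO 8601 week number (01-53).
cache_id="$(date --utc '+%V')"
# cache_key is a hypothetical name used only in this sketch.
cache_key="mkdocs-material-${cache_id}"
echo "${cache_key}"
```

Running this locally prints the key a workflow run would use during the current week, which can help when checking which cache entry a given run should hit.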

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 .terraform.lock.hcl
+site/*
 *.bazel.lock
 .gitconfig
 bazel-*

LICENSE

Lines changed: 1 addition & 1 deletion
@@ -186,7 +186,7 @@
 same "printed page" as the copyright notice for easier
 identification within third-party archives.
 
-Copyright [yyyy] [name of copyright owner]
+Copyright [2024] TikTok Inc.
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.

README.md

Lines changed: 20 additions & 138 deletions
@@ -1,157 +1,39 @@
-<img width=100% src="logo.png">
+<img width=100% src="docs/assets/img/logo.png">
 
 # ManaTEE Project
 
-> Note: we are releasing an alpha version, which may miss some necessary features.
-
 ManaTEE is an open-source project for easily building and deploying data collaboration framework to the cloud using trusted execution environments (TEEs).
 It allows users to easily collaborate on private datasets without leaking privacy of individual data.
-ManaTEE achieves this by combining different privacy-enhancing technologies (PETs) in different programming stages.
-
-In summary, ManaTEE is great tool for data collaboration with the following features.
-
-* **Interactive Programming**: ManaTEE integrates with an existing Jupyter Notebook interface such that the data analysts can program interactively with popular languages like Python
-* **Multiparty**: ManaTEE allows multi-party data collaboration without needing to send the private data to each other
-* **Cloud-Ready**: ManaTEE can be easily deployed in TEEs in the cloud, including Google Confidential Space
-* **Accurate Results**: ManaTEE does not sacrifice accuracy for data privacy. This is achieved by a two-stage approach with different PETs applied to each stage.
-
-### What is Different from Other Data Collaboration Frameworks?
-
-Data collaboration is not a new concept, and numerous data collaboration frameworks already exist.
-However, different frameworks try to apply different privacy-enhancing technologies (PETs), which have different strengths and weaknesses.
-ManaTEE tries to utilize different PETs in different programming stages to maximize the usability while protecting individual data privacy.
-Specifically, ManaTEE divides data analytics in two stages: *Programming Stage* and *Secure Execution Stage*.
-
-![Alt text](two-stage.png)
-
-In Programming Stage, the data scientist uses Jupyter Notebook interface to explore the general data structure and statistical characteristics.
-The data providers can determine how they protect privacy of their data.
-For example, they can use differentially-private synthetic data, completely random data, or partial public data.
-This means that the it mathematically limits the leakage of privacy of individual data records.
-The finished notebook files can then be submitted to the Secure Execution Stage.
-
-In Secure Execution Stage, the submitted notebook file is built into an image, and scheduled to a confidential virtual machine (CVMs) in the cloud.
-The data providers can set up their data such that only *attested* program can fetch the data.
-By using attestation, the data providers can control which program can access their data.
-TEE also assures the data scientists that the integrity of their program and the legitimacy of the output from executions by providing JWT-based attestation report.
-
-### Use Cases
+ManaTEE achieves this by combining different privacy-enhancing technologies (PETs) in different stages.
 
-There are many potential use cases of the ManaTEE:
+# What does it offer?
 
-* **Trusted Research Environments (TREs)**: Some data may be valuable to various research on public health, economic impact, and many other fields.
-TREs are a secure environment where authorized/vetted researchers and organizations can access the data. The data provider can choose to use ManaTEE to build their TRE.
-Currently, [TikTok's Research Tools Virtual Compute Environment (VCE)](https://developers.tiktok.com/doc/vce-getting-started) is built on top of ManaTEE.
+ManaTEE allows organizations to quickly customize and deploy data collaboration framework in the cloud.
+The organizations can provide an programming environment to the external data scientists to conduct research, while protecting the data privacy with a custom policy.
 
-* **Advertisement and Marketing**: Ads is a popular use case of data collaboration frameworks. ManaTEE can be used for [lookalike segment analysis](https://en.wikipedia.org/wiki/Lookalike_audience) for advertisers, or [Ad Tracking](https://en.wikipedia.org/wiki/Ad_tracking) with private user data.
-
-* **Machine Learning**: ManaTEE can be useful for machine learning involving private data or models. For example, a private model provider can provide their model for fine-tuning, but do not reveal the actual model in the Programming Stage.
-
-### Project Status
-
-We are releasing an alpha version, which may miss some necessary features.
-
-| | Current (Alpha) | Future |
-|-------------------------|--------------------------|---------------------------|
-| **Users** | One-Way Collaboration | Multi-Way Collaboration |
-| **Backend** | Single Backend (Goole Cloud Platform) | Multiple Backend |
-| **Data Provisioning** | Manual | Automated |
-| **Policy and Attestation** | Manual | Automated |
-| **Compute** | CPU | CPU/GPU |
-
-# Getting Started
-
-## Prerequisites
-* A valid GCP account that has ability to create/destroy resources. For a GCP project, please enable the following apis:
-- serviceusage.googleapis.com
-- compute.googleapis.com
-- container.googleapis.com
-- cloudkms.googleapis.com
-- servicenetworking.googleapis.com
-- cloudresourcemanager.googleapis.com
-- sqladmin.googleapis.com
-- confidentialcomputing.googleapis.com
-* [Gcloud CLI](https://cloud.google.com/sdk/docs/install) Login to the GCP `gcloud auth login && gcloud auth application-default login && gcloud components install gke-gcloud-auth-plugin`
-* [Terraform](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli) Terraform is an infrastructure as code tool that enables you to safely and predictably provision and manage infrastructure in any cloud.
-* [Helm](https://helm.sh/docs/intro/install/) Helm is a package manager for Kubernetes that allows developers and operators to more easily package, configure, and deploy applications and services onto Kubernetes clusters.
-* [Hertz](https://github.com/cloudwego/hertz) Hertz is a high-performance, high-usability, extensible HTTP framework for Go. It’s designed to make it easy for developers to build microservices.
-
-## Deploying in Google Cloud Platform (GCP)
-### Defining environment variables
-First, copy the example environment variables template to the existing directory.
-```
-cp .env.example env.bzl
-```
-Edit the variables in `env.bzl`. The `env.bzl` file is the one that really takes effect, the other files are just templates. The double quotes around a variable name are needed. For example:
-```
-env="dev" # the deployment environment
-project_id="you project id" # gcp project id
-region="" # the region that the resources created in
-zone="" # the zone that the resources created in
-```
+> Note: ManaTEE is under active development, and it is not production-ready. We are looking forward to your feedback and contributions.
 
-### Preparing resources
-The resources are created and managed by the project administrator who has the `Owner` role in the GCP project. Make sure you have correctly defined environment variables in the `env.bzl`. Only the project administrator is responsible to run these commands to create resources.
+# Quick Start
 
-`resources/global` directory contains the global resources including: clusters, cloud sql instance, database, docker repositories, and service accounts. These resource are global and only created once.
-```
-pushd resources/global
-./apply.sh
-popd
+Install Bazel with [Bazelisk](https://github.com/bazelbuild/bazelisk):
+```sh
+brew install bazelisk # on MacOS
+choco install bazelisk # on Windows
 ```
+On Ubuntu, download the latest Bazelisk binary via [Releases](https://github.com/bazelbuild/bazelisk/releases)
 
-`resources/deployment` directory includes the resources releated to kunernates including: kubernetes namespace, role, secret. These resources are created under different namespace. So the namespace parameter is required, and you can create different deployments under different namespaces.
-```shell
-pushd resources/deployment
-./apply.sh --namespace=<namespace-to-deploy>
-popd
+Build all images
 ```
-
-### Building and Pushing Images
-`app` directory contains the source codes of the data clean room which has three components:
-
-* `dcr_tee` contains tools that are used in the base image of stage2 such as a tool generates custom attestation report within GCP confidential space.
-* `dcr_api` is the backend service of the data clean room that processes the request from jupyterlab.
-* `dcr_monitor` is a cron job that monitors the execution of each job. The monitor is deployed to Kubernetes cluster and scheduled to run every minute.
-* `jupyterlab_manatee` is an JupyterLab extension for data clean room that submits a job on the fronted and queries the status of the jobs.
-
-[Bazel](https://bazel.build/install) is required to build all of the binaries and push them to the artifact registry.
-
-```shell
-gcloud auth configure-docker us-docker.pkg.dev # authenticate to artifact registry
-bazel run //:push_all_images --action_env=namespace=<namespace-to-deploy>
+bazelisk build //...
 ```
 
-> [!IMPORTANT]
-> the `--action_env=namespace=<namespace-to-deploy>` flag is required.
-
-You can also push images separately by this command. Replace `<app>` by the directory name under `/app` (e.g., dcr_api)
-
+Run all tests
 ```
-bazel run //:push_<app>_image --action_env=namespace=<namespace-to-deploy>
+bazelisk test //...
 ```
 
+See [documents](https://manatee-project.github.io/manatee) for more details including cloud deployment.
+# License
 
-If you'd like to load the images in your local container runtime (e.g., Docker), you can use `oci_load` rules.
-
-```shell
-bazel query 'kind("oci_load", "//app/...")' | xargs -n1 bazel run
-```
-
-Find individual rules from corresponding `BUILD.bazel` files.
-
-### Deploying
-
-Deploy data clean room and jupyterhub by helm chart.
-```shell
-source env.bzl
-gcloud container clusters get-credentials dcr-$env-cluster --zone $zone --project $project_id
-
-pushd deployment
-./deploy.sh --namespace=<namespace-to-deploy>
-popd
-```
-When deployment is complete, you can follow the output of the script to get the public ip of jupyterhub.
-```
-kubectl --namespace=<namespace-to-deploy> get service proxy-public
-```
+ManaTEE is licensed under the Apache License 2.0.
+See [LICENSE](LICENSE) for details.
File renamed without changes.

docs/assets/img/manatee-white.png

35.9 KB

docs/assets/img/manatee.png

62.1 KB
File renamed without changes.

docs/developer/architecture.md

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+# Architecture
+

docs/getting-started/building.md

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+# Building
+
+ManaTEE uses [Bazel](https://bazel.build/install) for hermetic builds.
+Bazel is aware of all required tools and dependencies, thus building images is as easy as:
+
+```
+bazel build //...
+```
+
+Find individual rules from corresponding `BUILD.bazel` files.
+
+## Components
+
+`app` directory contains the source codes of the data clean room which has three components:
+
+* `dcr_tee` contains tools that are used in the base image of stage2 such as a tool generates custom attestation report within GCP confidential space.
+* `dcr_api` is the backend service of the data clean room that processes the request from jupyterlab.
+* `dcr_monitor` is a cron job that monitors the execution of each job. The monitor is deployed to Kubernetes cluster and scheduled to run every minute.
+* `jupyterlab_manatee` is an JupyterLab extension for data clean room that submits a job on the fronted and queries the status of the jobs.
+
+## Loading Container Images
+
+If you'd like to load the images in your local container runtime (e.g., Docker), you can use `oci_load` rules.
+
+```shell
+bazel query 'kind("oci_load", "//app/...")' | xargs -n1 bazel run
+```
+
+# Testing
+
+To run all tests, run:
+
+```
+bazel test //...
+```
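
Reviewer note: the `bazel query ... | xargs -n1 bazel run` one-liner under "Loading Container Images" keeps going after a failed `bazel run` and does not say which target is being loaded. A possible alternative is sketched below, assuming the same `oci_load` targets under `//app`. The `load_all_images` function name is hypothetical and not part of the repository.

```shell
# Load every oci_load target under //app into the local container runtime,
# one target at a time, stopping at the first failure.
# load_all_images is a hypothetical helper name used only in this sketch.
load_all_images() {
  targets="$(bazel query 'kind("oci_load", "//app/...")')" || return 1
  # Bazel labels contain no whitespace, so word splitting is safe here.
  for target in ${targets}; do
    echo "Loading ${target}"
    bazel run "${target}" || return 1
  done
}

# Usage (requires Bazel and a running container runtime):
#   load_all_images
```

Unlike the `xargs` form, this stops at the first target that fails to load, which may or may not be the preferred behavior for the documentation.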
