|
1 | | -<img width=100% src="logo.png"> |
| 1 | +<img width=100% src="docs/assets/img/logo.png"> |
2 | 2 |
|
3 | 3 | # ManaTEE Project |
4 | 4 |
|
5 | | -> Note: we are releasing an alpha version, which may miss some necessary features. |
6 | | -
|
7 | 5 | ManaTEE is an open-source project for easily building and deploying a data collaboration framework to the cloud using trusted execution environments (TEEs). |
8 | 6 | It allows users to easily collaborate on private datasets without leaking the privacy of individual data. |
9 | | -ManaTEE achieves this by combining different privacy-enhancing technologies (PETs) in different programming stages. |
10 | | - |
11 | | -In summary, ManaTEE is a great tool for data collaboration with the following features. |
12 | | - |
13 | | -* **Interactive Programming**: ManaTEE integrates with an existing Jupyter Notebook interface such that the data analysts can program interactively with popular languages like Python |
14 | | -* **Multiparty**: ManaTEE allows multi-party data collaboration without needing to send the private data to each other |
15 | | -* **Cloud-Ready**: ManaTEE can be easily deployed in TEEs in the cloud, including Google Confidential Space |
16 | | -* **Accurate Results**: ManaTEE does not sacrifice accuracy for data privacy. This is achieved by a two-stage approach with different PETs applied to each stage. |
17 | | - |
18 | | -### What is Different from Other Data Collaboration Frameworks? |
19 | | - |
20 | | -Data collaboration is not a new concept, and numerous data collaboration frameworks already exist. |
21 | | -However, different frameworks try to apply different privacy-enhancing technologies (PETs), which have different strengths and weaknesses. |
22 | | -ManaTEE tries to utilize different PETs in different programming stages to maximize the usability while protecting individual data privacy. |
23 | | -Specifically, ManaTEE divides data analytics in two stages: *Programming Stage* and *Secure Execution Stage*. |
24 | | - |
25 | | - |
26 | | - |
27 | | -In Programming Stage, the data scientist uses Jupyter Notebook interface to explore the general data structure and statistical characteristics. |
28 | | -The data providers can determine how they protect privacy of their data. |
29 | | -For example, they can use differentially-private synthetic data, completely random data, or partial public data. |
30 | | -This means that it mathematically limits the leakage of privacy of individual data records. |
31 | | -The finished notebook files can then be submitted to the Secure Execution Stage. |
32 | | - |
33 | | -In the Secure Execution Stage, the submitted notebook file is built into an image and scheduled to a confidential virtual machine (CVM) in the cloud. |
34 | | -The data providers can set up their data such that only an *attested* program can fetch the data. |
35 | | -By using attestation, the data providers can control which program can access their data. |
36 | | -The TEE also assures the data scientists of the integrity of their program and the legitimacy of the output from executions by providing a JWT-based attestation report. |
37 | | - |
38 | | -### Use Cases |
| 7 | +ManaTEE achieves this by combining different privacy-enhancing technologies (PETs) in different stages. |
39 | 8 |
|
40 | | -There are many potential use cases of the ManaTEE: |
| 9 | +# What does it offer? |
41 | 10 |
|
42 | | -* **Trusted Research Environments (TREs)**: Some data may be valuable to various research on public health, economic impact, and many other fields. |
43 | | -TREs are a secure environment where authorized/vetted researchers and organizations can access the data. The data provider can choose to use ManaTEE to build their TRE. |
44 | | -Currently, [TikTok's Research Tools Virtual Compute Environment (VCE)](https://developers.tiktok.com/doc/vce-getting-started) is built on top of ManaTEE. |
| 11 | +ManaTEE allows organizations to quickly customize and deploy a data collaboration framework in the cloud. |
| 12 | +The organizations can provide a programming environment to external data scientists to conduct research, while protecting data privacy with a custom policy. |
45 | 13 |
|
46 | | -* **Advertisement and Marketing**: Ads is a popular use case of data collaboration frameworks. ManaTEE can be used for [lookalike segment analysis](https://en.wikipedia.org/wiki/Lookalike_audience) for advertisers, or [Ad Tracking](https://en.wikipedia.org/wiki/Ad_tracking) with private user data. |
47 | | - |
48 | | -* **Machine Learning**: ManaTEE can be useful for machine learning involving private data or models. For example, a private model provider can provide their model for fine-tuning, but do not reveal the actual model in the Programming Stage. |
49 | | - |
50 | | -### Project Status |
51 | | - |
52 | | -We are releasing an alpha version, which may miss some necessary features. |
53 | | - |
54 | | -| | Current (Alpha) | Future | |
55 | | -|-------------------------|--------------------------|---------------------------| |
56 | | -| **Users** | One-Way Collaboration | Multi-Way Collaboration | |
57 | | -| **Backend** | Single Backend (Google Cloud Platform) | Multiple Backends | |
58 | | -| **Data Provisioning** | Manual | Automated | |
59 | | -| **Policy and Attestation** | Manual | Automated | |
60 | | -| **Compute** | CPU | CPU/GPU | |
61 | | - |
62 | | -# Getting Started |
63 | | - |
64 | | -## Prerequisites |
65 | | -* A valid GCP account that has the ability to create/destroy resources. For a GCP project, please enable the following APIs: |
66 | | - - serviceusage.googleapis.com |
67 | | - - compute.googleapis.com |
68 | | - - container.googleapis.com |
69 | | - - cloudkms.googleapis.com |
70 | | - - servicenetworking.googleapis.com |
71 | | - - cloudresourcemanager.googleapis.com |
72 | | - - sqladmin.googleapis.com |
73 | | - - confidentialcomputing.googleapis.com |
74 | | -* [Gcloud CLI](https://cloud.google.com/sdk/docs/install) Login to the GCP `gcloud auth login && gcloud auth application-default login && gcloud components install gke-gcloud-auth-plugin` |
75 | | -* [Terraform](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli) Terraform is an infrastructure as code tool that enables you to safely and predictably provision and manage infrastructure in any cloud. |
76 | | -* [Helm](https://helm.sh/docs/intro/install/) Helm is a package manager for Kubernetes that allows developers and operators to more easily package, configure, and deploy applications and services onto Kubernetes clusters. |
77 | | -* [Hertz](https://github.com/cloudwego/hertz) Hertz is a high-performance, high-usability, extensible HTTP framework for Go. It’s designed to make it easy for developers to build microservices. |
78 | | - |
79 | | -## Deploying in Google Cloud Platform (GCP) |
80 | | -### Defining environment variables |
81 | | -First, copy the example environment variables template to the existing directory. |
82 | | -``` |
83 | | -cp .env.example env.bzl |
84 | | -``` |
85 | | -Edit the variables in `env.bzl`. The `env.bzl` file is the one that really takes effect, the other files are just templates. The double quotes around a variable name are needed. For example: |
86 | | -``` |
87 | | -env="dev" # the deployment environment |
88 | | -project_id="your project id" # gcp project id |
89 | | -region="" # the region that the resources created in |
90 | | -zone="" # the zone that the resources created in |
91 | | -``` |
| 14 | +> Note: ManaTEE is under active development, and it is not production-ready. We are looking forward to your feedback and contributions. |
92 | 15 |
|
93 | | -### Preparing resources |
94 | | -The resources are created and managed by the project administrator who has the `Owner` role in the GCP project. Make sure you have correctly defined environment variables in the `env.bzl`. Only the project administrator is responsible to run these commands to create resources. |
| 16 | +# Quick Start |
95 | 17 |
|
96 | | -`resources/global` directory contains the global resources including: clusters, cloud sql instance, database, docker repositories, and service accounts. These resources are global and only created once. |
97 | | -``` |
98 | | -pushd resources/global |
99 | | -./apply.sh |
100 | | -popd |
| 18 | +Install Bazel with [Bazelisk](https://github.com/bazelbuild/bazelisk): |
| 19 | +```sh |
| 20 | +brew install bazelisk # on macOS |
| 21 | +choco install bazelisk # on Windows |
101 | 22 | ``` |
| 23 | +On Ubuntu, download the latest Bazelisk binary from the [Releases](https://github.com/bazelbuild/bazelisk/releases) page. |
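
The manual download on Linux can be scripted. The following is a minimal sketch, not part of the ManaTEE repository: the helper names `bazelisk_asset` and `bazelisk_url` are hypothetical, and it assumes the Bazelisk release assets are named `bazelisk-<os>-<arch>` (for example `bazelisk-linux-amd64`) and that GitHub's `releases/latest/download/<asset>` redirect is available. Check the Releases page if the download fails.

```sh
#!/bin/sh
# Sketch only: helper names are hypothetical, asset naming is an assumption.

# Map an OS name and machine architecture (as printed by `uname -s` and
# `uname -m`) to the assumed Bazelisk release asset name.
bazelisk_asset() {
    os=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$2" in
        x86_64|amd64)  arch=amd64 ;;
        aarch64|arm64) arch=arm64 ;;
        *) echo "unsupported architecture: $2" >&2; return 1 ;;
    esac
    printf 'bazelisk-%s-%s\n' "$os" "$arch"
}

# Build the download URL of the latest release asset for this machine.
bazelisk_url() {
    asset=$(bazelisk_asset "$(uname -s)" "$(uname -m)") || return 1
    printf 'https://github.com/bazelbuild/bazelisk/releases/latest/download/%s\n' "$asset"
}

# Example usage (downloads the binary and installs it system-wide):
#   curl -fsSL -o bazelisk "$(bazelisk_url)"
#   chmod +x bazelisk
#   sudo mv bazelisk /usr/local/bin/bazelisk
```

Keeping the name mapping in its own function makes it easy to check the result before any network access happens.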
102 | 24 |
|
103 | | -`resources/deployment` directory includes the resources related to Kubernetes including: kubernetes namespace, role, secret. These resources are created under different namespaces. So the namespace parameter is required, and you can create different deployments under different namespaces. |
104 | | -```shell |
105 | | -pushd resources/deployment |
106 | | -./apply.sh --namespace=<namespace-to-deploy> |
107 | | -popd |
| 25 | +Build all images:
108 | 26 | ``` |
109 | | - |
110 | | -### Building and Pushing Images |
111 | | -`app` directory contains the source code of the data clean room, which has four components: |
112 | | - |
113 | | -* `dcr_tee` contains tools that are used in the base image of stage2, such as a tool that generates a custom attestation report within GCP confidential space. |
114 | | -* `dcr_api` is the backend service of the data clean room that processes the request from jupyterlab. |
115 | | -* `dcr_monitor` is a cron job that monitors the execution of each job. The monitor is deployed to Kubernetes cluster and scheduled to run every minute. |
116 | | -* `jupyterlab_manatee` is a JupyterLab extension for data clean room that submits a job on the frontend and queries the status of the jobs. |
117 | | - |
118 | | -[Bazel](https://bazel.build/install) is required to build all of the binaries and push them to the artifact registry. |
119 | | - |
120 | | -```shell |
121 | | -gcloud auth configure-docker us-docker.pkg.dev # authenticate to artifact registry |
122 | | -bazel run //:push_all_images --action_env=namespace=<namespace-to-deploy> |
| 27 | +bazelisk build //... |
123 | 28 | ``` |
124 | 29 |
|
125 | | -> [!IMPORTANT] |
126 | | -> the `--action_env=namespace=<namespace-to-deploy>` flag is required. |
127 | | -
|
128 | | -You can also push images separately by this command. Replace `<app>` by the directory name under `/app` (e.g., dcr_api) |
129 | | - |
| 30 | +Run all tests:
130 | 31 | ``` |
131 | | -bazel run //:push_<app>_image --action_env=namespace=<namespace-to-deploy> |
| 32 | +bazelisk test //... |
132 | 33 | ``` |
133 | 34 |
|
| 35 | +See the [documentation](https://manatee-project.github.io/manatee) for more details, including cloud deployment. |
| 36 | +# License |
134 | 37 |
|
135 | | -If you'd like to load the images in your local container runtime (e.g., Docker), you can use `oci_load` rules. |
136 | | - |
137 | | -```shell |
138 | | -bazel query 'kind("oci_load", "//app/...")' | xargs -n1 bazel run |
139 | | -``` |
140 | | - |
141 | | -Find individual rules from corresponding `BUILD.bazel` files. |
142 | | - |
143 | | -### Deploying |
144 | | - |
145 | | -Deploy data clean room and jupyterhub by helm chart. |
146 | | -```shell |
147 | | -source env.bzl |
148 | | -gcloud container clusters get-credentials dcr-$env-cluster --zone $zone --project $project_id |
149 | | - |
150 | | -pushd deployment |
151 | | -./deploy.sh --namespace=<namespace-to-deploy> |
152 | | -popd |
153 | | -``` |
154 | | -When deployment is complete, you can follow the output of the script to get the public ip of jupyterhub. |
155 | | -``` |
156 | | -kubectl --namespace=<namespace-to-deploy> get service proxy-public |
157 | | -``` |
| 38 | +ManaTEE is licensed under the Apache License 2.0. |
| 39 | +See [LICENSE](LICENSE) for details. |