Internals

Data Root

Smallpond stores all data in a single directory called the data root.

This directory has the following structure:

data_root
└── 2024-12-11-12-00-28.2cc39990-296f-48a3-8063-78cf6dca460b # job_time.job_id
    ├── config  # configuration and state
    │   ├── exec_plan.pickle
    │   ├── logical_plan.pickle
    │   └── runtime_ctx.pickle
    ├── log     # logs
    │   ├── graph.png
    │   └── scheduler.log
    ├── queue   # message queue between scheduler and workers
    ├── output  # output data
    ├── staging # intermediate data
    │   ├── DataSourceTask.000001
    │   ├── EvenlyDistributedPartitionProducerTask.000002
    │   ├── completed_tasks  # output dataset of completed tasks
    │   └── started_tasks    # used for checkpoint
    └── temp    # temporary data
        ├── DataSourceTask.000001
        └── EvenlyDistributedPartitionProducerTask.000002
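
As a rough illustration of the layout above, the following sketch splits a job directory name into its job_time and job_id parts and resolves the subdirectories beneath it. The names JobDir, parse_job_dir, and list_jobs are hypothetical helpers, not part of the Smallpond API, and the timestamp format (YYYY-MM-DD-HH-MM-SS) and UUID-shaped job id are assumptions inferred from the single example name shown in the tree.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from uuid import UUID

# Subdirectory names taken from the layout shown above.
JOB_SUBDIRS = ("config", "log", "queue", "output", "staging", "temp")


@dataclass(frozen=True)
class JobDir:
    """Hypothetical helper describing one job directory under the data root."""
    path: Path
    job_time: datetime
    job_id: UUID

    def subdir(self, name: str) -> Path:
        # Only the subdirectories listed in the layout are accepted.
        if name not in JOB_SUBDIRS:
            raise ValueError(f"unknown job subdirectory: {name}")
        return self.path / name


def parse_job_dir(path: Path) -> JobDir:
    # Assumption: the directory name is "<job_time>.<job_id>", where job_time
    # looks like 2024-12-11-12-00-28 and job_id is a UUID.
    time_part, _, id_part = path.name.partition(".")
    job_time = datetime.strptime(time_part, "%Y-%m-%d-%H-%M-%S")
    return JobDir(path=path, job_time=job_time, job_id=UUID(id_part))


def list_jobs(data_root: Path) -> list:
    """Return parsed job directories, oldest first; unparseable names are skipped."""
    jobs = []
    for entry in data_root.iterdir():
        if not entry.is_dir():
            continue
        try:
            jobs.append(parse_job_dir(entry))
        except ValueError:
            continue  # not a "<job_time>.<job_id>" directory
    return sorted(jobs, key=lambda job: job.job_time)
```

With such a helper, list_jobs(Path("data_root"))[-1].subdir("log") would point at the log directory of the most recent job, assuming the naming convention holds.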

Failure Recovery

Smallpond can recover from a failure and resume execution from the last checkpoint. Checkpointing is done at the task level. A few tasks, such as ArrowBatchTask, also support checkpointing at the batch level.
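
The general idea of task-level checkpointing can be sketched as follows. This is a generic illustration under stated assumptions, not Smallpond's actual implementation: the function run_with_checkpoints, the checkpoint directory, and the ".done" marker files are all hypothetical, and the sketch does not describe how Smallpond uses its completed_tasks or started_tasks directories.

```python
from pathlib import Path
from typing import Callable, Dict, List


def run_with_checkpoints(
    tasks: Dict[str, Callable[[], None]], checkpoint_dir: Path
) -> List[str]:
    """Run tasks in insertion order, skipping those with a completion marker.

    Returns the names of the tasks that were actually executed in this call.
    """
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for name, task in tasks.items():
        marker = checkpoint_dir / f"{name}.done"  # hypothetical marker file
        if marker.exists():
            continue  # finished in an earlier run, so it is not repeated
        task()  # an exception here leaves no marker, so the task reruns on resume
        marker.touch()  # record completion only after the task succeeded
        executed.append(name)
    return executed
```

If a run fails partway through, calling the function again with the same checkpoint directory skips the tasks that already completed and restarts from the first unfinished one. Batch-level checkpointing would apply the same pattern inside a single task, recording progress per batch rather than per task.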