Commit aafc5bf

Switch task queue to procrastinate based on psql (#11)
* Add openaleph-procrastinate. Bump versions to satisfy dependencies (poetry lock).
* 🧑‍💻 Add pre-commit, use requirements.txt, upgrade to python3.13
* 🧑‍💻 Add dev requirements only for test build
* 🔥 (github) Drop daily cache job
* ✅ (tests/test_pdf) Fix whitespace errors from test results
* 🔨 (make) Build before test
* 👷 Inline base build
* 🚧 Tweak builds and tags
* 👷 (github) Skip intermediate arm46 build for tests
* 👷 (github) Skip cache-from [tmp]
* Revert "👷 (github) Skip cache-from [tmp]" This reverts commit 03f86fd.
* 👷 (github/docker) Try this
* 🚨 Apply black
* 👷 (github/docker) Don't use registry cache
* 🧪 (test_image) Skip gif test
* 👷 (github/docker) maybe this
* ✨ Boilerplate ingest task for procrastinate
* 🔧 Use pydantic_settings
* 📌 Use openaleph-procrastinate from git
* ⚰️ Drop TranscriptionSupport
* 🔥 Remove analysis part
* 🔥 Remove servicelayer worker
* ♻️ Refactor manager and supports to work with procrastinate
* 🚧 Make procrastinate task to work with manager
* ➖ languagecodes, pantomime -> rigour
* 🔊 Tweak global logging
* ♻️ Refactor cli with typer
* 🧪 Make tests work with procrastinate refactor
* 🩹 (ingestors/email) Use relative path
* ✨ (support/timestamp) Fall back to dateparser for unknown formats
* 🙈 Ignore more
* 🔥 Remove unused lid model
* 👷 (github) Tag base image properly
* 📦 (docker) Use entrypoint and run procrastinate worker
* 🧑‍💻 (contrib) Add non-docker debian install dependencies
* 🧪 Add end-to-end testing setup
* 🧪 (e2e) Working example
* 🔧 (settings) Move deferring settings up to openaleph-procrastinate
* 📌 requirements
* Pin Tesserocr to 2.6.2
* Add ENV LD_PRELOAD for Apple Silicone as comment
* Solve minor errors
* Bump openaleph_procrastinate version
* 🐛 (cli) Use defer settings correctly in debug mode
* ⬆️ openaleph-procrastinate v0.0.7
* 👽️ Adapt explicit defers from openaleph-procrastinate v0.0.7
* 👷 Tweak compose settings
* Bump openaleph-procrastinate version
* Add namespace to entities. Remove app user
* Add namespace info to test setup
* Explicitly set the testing DB to sqlite
* Pin procrastinate to 3.2.2 for tests
* Add transcription procrastinate task
* 📌 Pin procrastinate==3.2.2 for test docker build
* 🚧 (docker) Cleanup duplicated RUN
* 🚧 (cli) Adjust settings display
* 🔧 (tests) Properly set FTM_STORE_URI
* ⬆️ Dependencies
* ✨ Documentation
* 🔥 Drop google cloud vision support
* Always index entities after ingesting
* Replace get_dataset with get_fragments (ftmq.store)
* ⬆️ ftm(q) 4.1.x, openaleph-procrastinate 0.0.13
* 🐛 (support/email) Catch empty name
* 🎨 (support/transcription) Cleanup
* 🔥 Drop unused settings
* ⬆️ openaleph-procrastinate 0.0.14
* 🚧 (cli) Make foreign_id optional
* ✅ Add e2e testing with minio
* ⬆️ openaleph-procrastinate 0.0.16
* ⬆️ openaleph-procrastinate 0.0.16
* 💚 (github) Skip e2e
* ⬆️ ftmq 4.1.1, openaleph-procrastinate 0.0.18
* 🚧 (tests/e2e) Adjustments
* 🔖 Bump version: 3.24.0 → 5.0.0rc1
* 💚 (github) Enable e2e again
* ⬆️ openaleph-procrastinate 0.0.20
* 🚧 (ingestors/image) Explicitly close PIL obj after processing
* 🚧 (support/shell) Write to subprocess special DEVNULL
* 🚧 (ingestors/access) Wrap subprocess call in context manager
* 🚧 (ingestors/csv) Properly use context manager for file open
* 🚧 (support/ocr) Clean up OCR engine after use
* ⚗️ memray
* 🔧 (settings) Properly configure servicelayer tags
* 🚧 (tasks) Collect garbage, just in case
* 📌 Pin olefile<0.47 as this leaks crazy memory
* 📌 Fix RC version string
* ⬆️ All the things
* 🔖 Bump version: 5.0.0-rc1 → 5.0.0-rc2
* ⬆️ openaleph-procrastinate 0.0.25 and others
* 🔖 Bump version: 5.0.0-rc2 → 5.0.0-rc3
* 🩹 (tasks) Pass through batch (formerly job_id)
* 📌 Pin back tesserocr=2.6.2
* Dockerfile refactorings (tesserocr 2.6.2, openaleph-servicelayer etc.) (#16)
* Compile tesserocr with c++ 14; use openaleph-servicelayer
* Build tesserocr in Dockerfile.base; don't build Apple base docker image
* Separate test docker image
* Move tesserocr to ocr dependencies
* Only generate main requirements from pre-commit hook
* Move tesserocr to optional dependencies
* Add build-test to Makefile test, before running tests
* 🔖 Bump version: 5.0.0-rc3 → 5.0.0-rc4
* ⬆️ followthemoney 4.2.0
* ⬆️ ftmq 4.2.2 (psycopg3)
* ⬆️ openaleph-procrastinate 0.0.29
* 🔧 Ensure psycopg3 for sl tags db
* Temporarily disable daily ingest-file-base build
* Update poetry.lock
* 🔖 Bump version: 5.0.0-rc4 → 5.0.0-rc5

---------

Co-authored-by: Alex Ștefănescu <alex.stefanescu@pm.me>
Co-authored-by: Alex Ștefănescu <catileptic@users.noreply.github.com>
1 parent e9ef1db commit aafc5bf

75 files changed: 3036 additions and 3749 deletions

.bumpversion.cfg

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 3.24.0
+current_version = 5.0.0-rc5
 tag_name = {new_version}
 commit = True
 tag = True
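
The settings in this hunk can be read with Python's standard `configparser`. The following is a minimal sketch, assuming the file contains only the five lines visible above (the real file may carry more options); it does not reproduce how the bumpversion tool itself parses or bumps versions. The `read_bumpversion` helper is hypothetical.

```python
import configparser

# Assumption: only the lines visible in the hunk above; the real file may have more.
CFG = """\
[bumpversion]
current_version = 5.0.0-rc5
tag_name = {new_version}
commit = True
tag = True
"""


def read_bumpversion(text: str) -> dict:
    """Parse the [bumpversion] section into plain Python values."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    section = parser["bumpversion"]
    return {
        "current_version": section["current_version"],
        "tag_name": section["tag_name"],
        "commit": section.getboolean("commit"),
        "tag": section.getboolean("tag"),
    }


settings = read_bumpversion(CFG)
# tag_name is a format template; with no prefix, the tag equals the version.
print(settings["tag_name"].format(new_version=settings["current_version"]))
```

Because `tag_name` is the bare `{new_version}` placeholder, a release tagged from this configuration would carry the version string itself, without a `v` prefix.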

.dockerignore

Lines changed: 0 additions & 5 deletions
This file was deleted.

.dockerignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+.gitignore

.github/workflows/build.yml

Lines changed: 4 additions & 0 deletions
@@ -82,3 +82,7 @@ jobs:
           labels: ${{ steps.meta.outputs.labels }}
           cache-from: type=gha
           cache-to: type=gha,mode=max
+
+      - name: Run e2e test
+        working-directory: ./e2e
+        run: bash -c ./test_e2e.sh

.github/workflows/docker-base.yml

Lines changed: 5 additions & 4 deletions
@@ -2,8 +2,8 @@ name: Build ingest-file-base

 on:
   workflow_dispatch: {}
-  schedule:
-    - cron: "0 0 * * *"
+  # schedule:
+  #   - cron: "0 0 * * *"
   push:
     paths:
       - Dockerfile.base
@@ -28,7 +28,8 @@ jobs:
             type=ref,event=branch
             type=semver,pattern={{version}}
             type=sha
-            type=raw,value=latest
+            type=raw,value=cache
+            type=raw,value=latest,enable=${{ startsWith(github.ref, 'refs/tags') }}
       - name: Set up Docker Buildx
         uses: docker/setup-buildx-action@v2
         with:
@@ -44,7 +45,7 @@ jobs:
         with:
           context: .
           file: ./Dockerfile.base
-          platforms: linux/amd64,linux/arm64
+          platforms: linux/amd64 #,linux/arm64
           push: true
           tags: ${{ steps.meta.outputs.tags }}
           labels: ${{ steps.meta.outputs.labels }}
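
The effect of the two `type=raw` rules in the second hunk can be sketched in a few lines of Python. This is an illustration only: the `raw_tags` helper is hypothetical, it models just the two raw rules (not the `ref`, `semver`, or `sha` rules), and it assumes the `enable=` expression is evaluated against the pushed Git ref.

```python
def raw_tags(git_ref: str) -> list[str]:
    """Return the tags the two `type=raw` rules would contribute for a ref."""
    tags = ["cache"]  # type=raw,value=cache carries no condition
    # GitHub's startsWith() expression function compares case-insensitively.
    if git_ref.lower().startswith("refs/tags"):
        tags.append("latest")  # only emitted for tag pushes
    return tags


print(raw_tags("refs/heads/main"))   # ['cache']
print(raw_tags("refs/tags/v5.0.0"))  # ['cache', 'latest']
```

Under this reading, branch builds no longer move `latest`; only tagged builds do, while every build refreshes the `cache` tag.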

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -1,5 +1,10 @@
+contrib/*.bin
+data/archive
 data/model_type_prediction.ftz
+debug.sqlite3
 data/servicelayer-archive
+# documentation
+site
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]

.pre-commit-config.yaml

Lines changed: 4 additions & 3 deletions
@@ -5,20 +5,21 @@
 # * Run "pre-commit install".
 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v5.0.0
+    rev: v6.0.0
     hooks:
       - id: check-added-large-files
       - id: check-case-conflict
       - id: check-merge-conflict
       - id: check-symlinks
       - id: check-toml
       - id: check-yaml
+        exclude: "mkdocs.yml"
       - id: debug-statements
       - id: end-of-file-fixer
       - id: mixed-line-ending
         args: ["--fix=lf"]
       - id: trailing-whitespace
-        exclude: ".bumpversion.cfg" # wtf
+        exclude: ".bumpversion.cfg" # wtf

   # - repo: https://github.com/asottile/pyupgrade
   #   rev: v3.10.1
@@ -78,7 +79,7 @@ repos:
     rev: 1.9.0
     hooks:
       - id: poetry-export
-        args: ["--without-hashes", "-o", "requirements.txt"]
+        args: ["--without-hashes", "--with", "main", "-o", "requirements.txt"]
       - id: poetry-export
         args:
           ["--without-hashes", "--only", "dev", "-o", "requirements-dev.txt"]

Dockerfile

Lines changed: 12 additions & 5 deletions
@@ -1,15 +1,22 @@
 FROM ghcr.io/openaleph/ingest-file-base:latest

+# uncomment when running on Apple Silicon
+# ENV LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1
+ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libgomp.so.1
+
 COPY . /ingestors
+RUN rm -rf /ingestors/tests
 WORKDIR /ingestors
-RUN pip3 install --no-cache-dir -r /ingestors/requirements.txt
-RUN pip3 install --no-cache-dir /ingestors
+
+RUN pip3 install --no-cache-dir --no-deps -r /ingestors/requirements.txt
+RUN pip3 install --no-deps --no-cache-dir /ingestors

 ENV ARCHIVE_TYPE=file \
     ARCHIVE_PATH=/data \
-    FTM_STORE_URI=postgresql://aleph:aleph@postgres/aleph \
+    OPENALEPH_DB_URI=postgresql://aleph:aleph@postgres/aleph \
     REDIS_URL=redis://redis:6379/0 \
     TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata

-USER app
-CMD ingestors process
+ENV PROCRASTINATE_APP="ingestors.tasks.app"
+
+CMD ["procrastinate", "worker", "-q", "ingest"]
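
`PROCRASTINATE_APP` holds a dotted path (`ingestors.tasks.app`) that the worker command uses to locate the application object. The sketch below shows how such a path is commonly resolved with the standard library. It is an assumption about the general mechanism, not procrastinate's actual loader, and `load_object` is a hypothetical helper name.

```python
import importlib


def load_object(dotted_path: str):
    """Import `package.module` and return the attribute named by the last segment."""
    module_name, _, attribute = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.attribute', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# Stand-in demonstration with a stdlib path, since `ingestors` may not be installed.
join = load_object("os.path.join")
print(join("data", "archive"))
```

With the package installed, `load_object("ingestors.tasks.app")` would import `ingestors.tasks` and return its `app` attribute, which is why the module must be importable inside the image.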

Dockerfile.base

Lines changed: 9 additions & 28 deletions
@@ -12,12 +12,12 @@ RUN apt-get -qq -y update \
     # python deps (mostly to install their dependencies)
     python3-pip python3-dev python3-pil \
     # tesseract
-    tesseract-ocr libtesseract-dev libleptonica-dev pkg-config\
+    tesseract-ocr libtesseract-dev libleptonica-dev pkg-config \
     # libraries
     libxslt1-dev libpq-dev libldap2-dev libsasl2-dev \
     zlib1g-dev libicu-dev libxml2-dev \
     # package tools
-    unrar p7zip-full \
+    unrar \
     # audio & video metadata
     libmediainfo-dev \
     # image processing, djvu
@@ -116,41 +116,22 @@ ENV LANG='en_US.UTF-8' \
     OMP_THREAD_LIMIT='1' \
     OPENBLAS_NUM_THREADS='1'

-ENV LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libgomp.so.1"
+# force compile tesserocr 2.6.2 with C++ 14
+# to make it compatible with Tesseract 5
+RUN pip download --no-binary=:all: "tesserocr==2.6.2" \
+    && tar -xzf tesserocr-2.6.2.tar.gz \
+    && sed -i "s/-std=c++11/-std=c++14/" tesserocr-2.6.2/setup.py \
+    && cd tesserocr-2.6.2 \
+    && CXXFLAGS="-std=c++14" pip install --no-cache-dir .

 # tesseract 5
 ENV TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata

 RUN groupadd -g 1000 -r app \
     && useradd -m -u 1000 -s /bin/false -g app app

-# Download the ftm-typepredict model
-RUN mkdir /models/ && \
-    curl -o "/models/model_type_prediction.ftz" "https://public.data.occrp.org/develop/models/types/type-08012020-7a69d1b.ftz"
-
 RUN pip3 install --no-cache-dir --prefer-binary --upgrade pip
 RUN pip3 install --no-cache-dir --prefer-binary --upgrade setuptools wheel

-# Install spaCy
-RUN pip3 install --no-cache-dir spacy
 # Install PyICU
 RUN pip3 install --no-binary=:pyicu: pyicu
-# Install TesserOCR
-RUN pip3 install --no-binary=:tesserocr: tesserocr
-
-# Install default (small) spaCy models
-RUN python3 -m spacy download en_core_web_sm
-RUN python3 -m spacy download de_core_news_sm
-RUN python3 -m spacy download fr_core_news_sm
-RUN python3 -m spacy download es_core_news_sm
-RUN python3 -m spacy download ru_core_news_sm
-RUN python3 -m spacy download pt_core_news_sm
-RUN python3 -m spacy download ro_core_news_sm
-RUN python3 -m spacy download mk_core_news_sm
-RUN python3 -m spacy download el_core_news_sm
-RUN python3 -m spacy download pl_core_news_sm
-RUN python3 -m spacy download it_core_news_sm
-RUN python3 -m spacy download lt_core_news_sm
-RUN python3 -m spacy download nl_core_news_sm
-RUN python3 -m spacy download nb_core_news_sm
-RUN python3 -m spacy download da_core_news_sm
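
The `sed` step in the second hunk rewrites the compiler flag in tesserocr's `setup.py` before building. A rough Python equivalent is sketched below; the `force_cxx14` helper and the sample line are hypothetical, and the real `setup.py` contents are not shown in this commit.

```python
def force_cxx14(setup_py_source: str) -> str:
    """Mirror `sed "s/-std=c++11/-std=c++14/"`: patch the first match on each line."""
    # In sed's basic regular expressions an unescaped `+` is a literal character,
    # so a plain string replacement is equivalent here.
    return "\n".join(
        line.replace("-std=c++11", "-std=c++14", 1)
        for line in setup_py_source.split("\n")
    )


# Hypothetical sample line, for illustration only.
sample = "extra_compile_args = ['-std=c++11', '-DUSE_STD_NAMESPACE']\n"
print(force_cxx14(sample), end="")
```

Without the `g` flag, `sed` replaces only the first occurrence per line, which the `count=1` argument to `str.replace` reproduces.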

Dockerfile.test

Lines changed: 12 additions & 6 deletions
@@ -1,20 +1,26 @@
 FROM ghcr.io/openaleph/ingest-file-base:latest

+# uncomment when running on Apple Silicon
 # ENV LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1
+ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libgomp.so.1

 COPY . /ingestors
 WORKDIR /ingestors
-RUN pip3 install --no-cache-dir -r /ingestors/requirements.txt
-RUN pip3 install --no-cache-dir /ingestors

-RUN pip3 install -r /ingestors/requirements-dev.txt
+RUN pip3 install --no-cache-dir --no-deps -r /ingestors/requirements.txt
+RUN pip3 install --no-deps --no-cache-dir /ingestors
+
+RUN pip3 install --no-deps -r /ingestors/requirements-dev.txt
+RUN pip3 install --no-cache-dir procrastinate==3.2.2
 RUN chown -R app:app /ingestors

 ENV ARCHIVE_TYPE=file \
     ARCHIVE_PATH=/data \
     FTM_STORE_URI=postgresql://aleph:aleph@postgres/aleph \
     REDIS_URL=redis://redis:6379/0 \
-    TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata
+    TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata \
+    DEBUG=1
+
+ENV PROCRASTINATE_APP="ingestors.tasks.app"

-USER app
-CMD ingestors process
+CMD ["pytest"]
