Commit 5a311db

Tech optimisation tutorials (#11)
* Add technical optimisation tutorial with JSON serialisation and memory optimisation stuff
* Add new part of tutorials to README
* Bump medcat version to 1.8.0
* Update memory optimisation tutorial with details on undoing, warning about version compatibility, and hints regarding model saving
* Update workflow run target to latest macos
* Update medcat requirement to 1.8.0
* Update install version to 1.8.0 for medcat in all parts
* Updating macos target to macos-11
* Made the workflow target ubuntu 22.04 instead
* Bumped python version in workflow to 3.8
* Add additional dependency install
* Undo last change; Move back to macos-11 target
* Pin pandas version to less than 2.0
1 parent 15afd97 commit 5a311db

19 files changed (+14102, -60 lines changed)

.github/workflows/main.yml

Lines changed: 2 additions & 2 deletions
@@ -9,7 +9,7 @@ on:
 jobs:
   main:

-    runs-on: macos-10.15
+    runs-on: macos-11
     strategy:
       matrix:
         part: [
@@ -27,7 +27,7 @@ jobs:
       - name: Setup Python
         uses: actions/setup-python@v2
         with:
-          python-version: "3.7"
+          python-version: "3.8"
       - name: Install dependencies
         run: |
           pip install -U pip

README.md

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ In this tutorial, we will walk you through each stage of a basic MedCAT project.
 | 2 | [Data set Preparation and Basic Statistics](https://htmlpreview.github.io/?https://github.com/CogStack/MedCATtutorials/blob/main/notebooks/introductory/Part_2_Dataset_Analysis_and_Preparation.html) | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CogStack/MedCATtutorials/blob/main/notebooks/introductory/Part_2_Dataset_Analysis_and_Preparation.ipynb) | [TDS](https://towardsdatascience.com/medcat-dataset-analysis-and-preparation-be8bc910bd6d) |
 | 3.1 | [Building a new Concept Database (CDB) and Vocabulary (Vocab)](https://htmlpreview.github.io/?https://github.com/CogStack/MedCATtutorials/blob/main/notebooks/introductory/Part_3_1_Building_a_Concept_Database_and_Vocabulary.html) | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CogStack/MedCATtutorials/blob/main/notebooks/introductory/Part_3_1_Building_a_Concept_Database_and_Vocabulary.ipynb) | [TDS](https://towardsdatascience.com/medcat-extracting-diseases-from-electronic-health-records-f53c45b3d1c1) |
 | 3.2 | [Unsupervised training and NER+L](https://htmlpreview.github.io/?https://github.com/CogStack/MedCATtutorials/blob/main/notebooks/introductory/Part_3_2_Extracting_Diseases_from_Electronic_Health_Records.html) | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CogStack/MedCATtutorials/blob/main/notebooks/introductory/Part_3_2_Extracting_Diseases_from_Electronic_Health_Records.ipynb) | [TDS](https://towardsdatascience.com/medcat-extracting-diseases-from-electronic-health-records-f53c45b3d1c1) |
+| 3.3 | [Technical model optimisations](https://htmlpreview.github.io/?https://github.com/CogStack/MedCATtutorials/blob/main/notebooks/introductory/Part_3_3_Model_technical_optimisations.html) | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CogStack/MedCATtutorials/blob/main/notebooks/introductory/Part_3_3_Model_technical_optimisations.ipynb) | - |
 | 4.1 | [Creating a tokenizer model (huggingface) and embeddings for MetaAnnotations](https://htmlpreview.github.io/?https://github.com/CogStack/MedCATtutorials/blob/main/notebooks/introductory/Part_4_1_ByteLevelBPETokenizer_and_Embeddings.html) | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CogStack/MedCATtutorials/blob/main/notebooks/introductory/Part_4_1_ByteLevelBPETokenizer_and_Embeddings.ipynb) | - |
 | 4.2 | [Supervised training and fine-tuning + Meta-annotations](https://htmlpreview.github.io/?https://github.com/CogStack/MedCATtutorials/blob/main/notebooks/introductory/Part_4_2_Supervised_Training_and_Meta_annotations.html) | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CogStack/MedCATtutorials/blob/main/notebooks/introductory/Part_4_2_Supervised_Training_and_Meta_annotations.ipynb) | - |
 | 4.3 | [Annotating documents with the full MedCAT pipeline with MetaAnnotations](https://htmlpreview.github.io/?https://github.com/CogStack/MedCATtutorials/blob/main/notebooks/introductory/Part_4_3_Annotating_documents_with_the_full_MedCAT_pipeline_with_MetaAnnotations.html) | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CogStack/MedCATtutorials/blob/main/notebooks/introductory/Part_4_3_Annotating_documents_with_the_full_MedCAT_pipeline_with_MetaAnnotations.ipynb) | - |

notebooks/introductory/Part_1_1_OPTIONAL_Logging_With_MedCAT.html

Lines changed: 1 addition & 1 deletion
@@ -13095,7 +13095,7 @@ <h1 id="MedCAT-tutorial---logging-with-MedCAT">MedCAT tutorial - logging with Me
 <div class="inner_cell">
 <div class="input_area">
 <div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Install medcat</span>
-<span class="o">!</span> pip install <span class="nv">medcat</span><span class="o">==</span><span class="m">1</span>.5.0
+<span class="o">!</span> pip install <span class="nv">medcat</span><span class="o">==</span><span class="m">1</span>.8.0
 <span class="k">try</span><span class="p">:</span>
     <span class="kn">from</span> <span class="nn">medcat.cat</span> <span class="kn">import</span> <span class="n">CAT</span>
 <span class="k">except</span><span class="p">:</span>

notebooks/introductory/Part_1_1_OPTIONAL_Logging_With_MedCAT.ipynb

Lines changed: 9 additions & 1 deletion
@@ -1,6 +1,7 @@
 {
 "cells": [
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
@@ -18,7 +19,7 @@
 "outputs": [],
 "source": [
 "# Install medcat\n",
-"! pip install medcat==1.5.0\n",
+"! pip install medcat==1.8.0\n",
 "try:\n",
 "    from medcat.cat import CAT\n",
 "except:\n",
@@ -27,6 +28,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
@@ -62,13 +64,15 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "What we must now understand is that the `logging` library uses a hierarchical system for the loggers. That means that all the module-level loggers within MedCAT have the `medcat.logger` (which is the package-level logger) as their parent logger. So if we want to change the logging behaviour for the entire project, we can just interact with this one logger. However, if we want fine grained control, we can interact with each module-level logger separately."
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
@@ -100,6 +104,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
@@ -111,6 +116,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
@@ -136,6 +142,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
@@ -172,6 +179,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [

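Note on the logging paragraph in the `Part_1_1_OPTIONAL_Logging_With_MedCAT.ipynb` hunk above: the parent/child behaviour it describes comes from Python's standard `logging` module. The sketch below illustrates that behaviour with hypothetical logger names (`mypkg` and `mypkg.module_a`); these names are assumptions chosen for illustration and are not MedCAT's actual loggers.

```python
import logging

# Hypothetical names: "mypkg" stands in for a package-level logger,
# "mypkg.module_a" for one of its module-level loggers.
package_logger = logging.getLogger("mypkg")
module_logger = logging.getLogger("mypkg.module_a")

# The dotted name makes the package logger the parent of the module logger.
parent_is_package = module_logger.parent is package_logger

# A level set on the parent applies to children that have no level of their own.
package_logger.setLevel(logging.WARNING)
inherited_level = module_logger.getEffectiveLevel()

# Fine-grained control: override the level on one module-level logger only.
module_logger.setLevel(logging.DEBUG)
overridden_level = module_logger.getEffectiveLevel()
```

With MedCAT installed, the same pattern should apply to the package-level logger the tutorial refers to and to the module-level loggers beneath it.
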
notebooks/introductory/Part_3_1_Building_a_Concept_Database_and_Vocabulary.html

Lines changed: 7 additions & 6 deletions
@@ -13099,7 +13099,7 @@ <h3 id="First-we-need-to-install-MedCAT">First we need to install MedCAT<a class
 <div class="inner_cell">
 <div class="input_area">
 <div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Install MedCAT</span>
-<span class="o">!</span> pip install <span class="nv">medcat</span><span class="o">==</span><span class="m">1</span>.5.0
+<span class="o">!</span> pip install <span class="nv">medcat</span><span class="o">==</span><span class="m">1</span>.8.0
 <span class="c1"># Get the scispacy model</span>
 <span class="o">!</span> python -m spacy download en_core_web_md
 <span class="k">try</span><span class="p">:</span>
@@ -13445,7 +13445,8 @@ <h3 id="First-we-need-to-install-MedCAT">First we need to install MedCAT<a class
 <div class="prompt input_prompt">In&nbsp;[2]:</div>
 <div class="inner_cell">
 <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="n">DATA_DIR</span> <span class="o">=</span> <span class="s2">&quot;./data/&quot;</span>
+<div class=" highlight hl-ipython3"><pre><span></span><span class="n">DATA_DIR</span> <span class="o">=</span> <span class="s2">&quot;./data_p3.1/&quot;</span>
+<span class="o">!</span> <span class="nv">DATA_DIR</span><span class="o">=</span><span class="s2">&quot;./data_p3.1/&quot;</span>
 </pre></div>

 </div>
@@ -13459,9 +13460,9 @@ <h3 id="First-we-need-to-install-MedCAT">First we need to install MedCAT<a class
 <div class="inner_cell">
 <div class="input_area">
 <div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Load files if in google colab, otherwise skip this step</span>
-<span class="o">!</span>wget https://raw.githubusercontent.com/CogStack/MedCATtutorials/main/notebooks/introductory/data/cdb_simple.csv -P ./data/
-<span class="o">!</span>wget https://raw.githubusercontent.com/CogStack/MedCATtutorials/main/notebooks/introductory/data/cdb_advanced.csv -P ./data/
-<span class="o">!</span>wget https://raw.githubusercontent.com/CogStack/MedCATtutorials/main/notebooks/introductory/data/vocab_data.txt -P ./data/
+<span class="o">!</span>wget -N https://raw.githubusercontent.com/CogStack/MedCATtutorials/main/notebooks/introductory/data/cdb_simple.csv -P <span class="nv">$DATA_DIR</span>
+<span class="o">!</span>wget -N https://raw.githubusercontent.com/CogStack/MedCATtutorials/main/notebooks/introductory/data/cdb_advanced.csv -P <span class="nv">$DATA_DIR</span>
+<span class="o">!</span>wget -N https://raw.githubusercontent.com/CogStack/MedCATtutorials/main/notebooks/introductory/data/vocab_data.txt -P <span class="nv">$DATA_DIR</span>
 </pre></div>

 </div>
@@ -13660,7 +13661,7 @@ <h2 id="Building-a-Vocabulary">Building a Vocabulary<a class="anchor-link" href=
 <div class="inner_cell">
 <div class="input_area">
 <div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># If you want to add words manually (one by one) use:</span>
-<span class="n">vocab</span><span class="o">.</span><span class="n">add_word</span><span class="p">(</span><span class="s2">&quot;test&quot;</span><span class="p">,</span> <span class="n">cnt</span><span class="o">=</span><span class="mi">31</span><span class="p">,</span> <span class="n">vec</span><span class="o">=</span><span class="p">[</span><span class="mf">1.42</span><span class="p">,</span> <span class="mf">1.44</span><span class="p">,</span> <span class="mf">1.55</span><span class="p">],</span> <span class="n">replace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
+<span class="n">vocab</span><span class="o">.</span><span class="n">add_word</span><span class="p">(</span><span class="s2">&quot;test&quot;</span><span class="p">,</span> <span class="n">cnt</span><span class="o">=</span><span class="mi">31</span><span class="p">,</span> <span class="n">vec</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">1.42</span><span class="p">,</span> <span class="mf">1.44</span><span class="p">,</span> <span class="mf">1.55</span><span class="p">]),</span> <span class="n">replace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
 <span class="n">vocab</span><span class="o">.</span><span class="n">vocab</span><span class="o">.</span><span class="n">keys</span><span class="p">()</span>
 </pre></div>

notebooks/introductory/Part_3_1_Building_a_Concept_Database_and_Vocabulary.ipynb

Lines changed: 20 additions & 2 deletions
@@ -1,6 +1,7 @@
 {
 "cells": [
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {
 "id": "s_j_Gu7s3wTO"
@@ -10,6 +11,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {
 "id": "i4bQfWfXlKWJ"
@@ -320,7 +322,7 @@
 ],
 "source": [
 "# Install MedCAT\n",
-"! pip install medcat==1.5.0\n",
+"! pip install medcat==1.8.0\n",
 "# Get the scispacy model\n",
 "! python -m spacy download en_core_web_md\n",
 "try:\n",
@@ -331,6 +333,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {
 "id": "LWScf8BW0BpY"
@@ -428,6 +431,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {
 "id": "Kj24ZU79D-xE"
@@ -441,6 +445,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {
 "id": "9POZ_dwsk7gu"
@@ -492,6 +497,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {
 "id": "xPl6ghXUk7gy"
@@ -670,6 +676,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {
 "id": "xG3FCinSl_Sq"
@@ -691,6 +698,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {
 "id": "o6itJcEXk7hA"
@@ -711,6 +719,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {
 "id": "-YBbwcNUk7hD"
@@ -731,6 +740,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {
 "id": "ptRmHln9k7hG"
@@ -942,6 +952,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {
 "id": "Rasu5PajojYZ"
@@ -998,6 +1009,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {
 "id": "08agsFBnk7hQ"
@@ -1107,6 +1119,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {
 "id": "Lpx7zGvwk7ha"
@@ -1127,6 +1140,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {
 "id": "97uiDwvAk7hc"
@@ -1159,6 +1173,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {
 "id": "0xqmmAue-UE4"
@@ -1205,6 +1220,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {
 "id": "cqmVITvWCIr6"
@@ -1214,6 +1230,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {
 "id": "DZvhmkIL8433"
@@ -1326,6 +1343,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {
 "id": "9fwiKys4k7he"
@@ -1358,7 +1376,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.8.5"
+"version": "3.9.6"
 },
 "vscode": {
 "interpreter": {

notebooks/introductory/Part_3_2_Extracting_Diseases_from_Electronic_Health_Records.html

Lines changed: 7 additions & 6 deletions
@@ -13092,7 +13092,7 @@ <h1 id="Now-let's-start-extracting-concepts-from-unstructured-text!">Now let's s
 <div class="inner_cell">
 <div class="input_area">
 <div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Install medcat</span>
-<span class="o">!</span> pip install <span class="nv">medcat</span><span class="o">==</span><span class="m">1</span>.5.0
+<span class="o">!</span> pip install <span class="nv">medcat</span><span class="o">==</span><span class="m">1</span>.8.0
 <span class="k">try</span><span class="p">:</span>
     <span class="kn">from</span> <span class="nn">medcat.cat</span> <span class="kn">import</span> <span class="n">CAT</span>
 <span class="k">except</span><span class="p">:</span>
@@ -13417,7 +13417,8 @@ <h1 id="Now-let's-start-extracting-concepts-from-unstructured-text!">Now let's s
 <div class="prompt input_prompt">In&nbsp;[2]:</div>
 <div class="inner_cell">
 <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="n">DATA_DIR</span> <span class="o">=</span> <span class="s2">&quot;./data/&quot;</span>
+<div class=" highlight hl-ipython3"><pre><span></span><span class="n">DATA_DIR</span> <span class="o">=</span> <span class="s2">&quot;./data_p3.2/&quot;</span>
+<span class="o">!</span> <span class="nv">DATA_DIR</span><span class="o">=</span><span class="s2">&quot;./data_p3.2/&quot;</span>
 <span class="n">model_pack_path</span> <span class="o">=</span> <span class="n">DATA_DIR</span> <span class="o">+</span> <span class="s2">&quot;medmen_wstatus_2021_oct.zip&quot;</span>
 </pre></div>

@@ -13432,8 +13433,8 @@ <h1 id="Now-let's-start-extracting-concepts-from-unstructured-text!">Now let's s
 <div class="inner_cell">
 <div class="input_area">
 <div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Download the models and required data</span>
-<span class="o">!</span>wget https://medcat.rosalind.kcl.ac.uk/media/medmen_wstatus_2021_oct.zip -P ./data/
-<span class="o">!</span>wget https://raw.githubusercontent.com/CogStack/MedCATtutorials/main/notebooks/introductory/data/pt_notes.csv -P ./data/
+<span class="o">!</span>wget -N https://medcat.rosalind.kcl.ac.uk/media/medmen_wstatus_2021_oct.zip -P <span class="nv">$DATA_DIR</span>
+<span class="o">!</span>wget -N https://raw.githubusercontent.com/CogStack/MedCATtutorials/main/notebooks/introductory/data/pt_notes.csv -P <span class="nv">$DATA_DIR</span>
 </pre></div>

 </div>
@@ -14695,10 +14696,10 @@ <h2 id="Use-Multiprocessing">Use Multiprocessing<a class="anchor-link" href="#Us



-<div id="baa73754-e2b0-4efb-a3ff-3f7829c43498"></div>
+<div id="25a2996e-6950-4a46-b96a-2244bd1dbcec"></div>
 <div class="output_subarea output_widget_view ">
 <script type="text/javascript">
-var element = $('#baa73754-e2b0-4efb-a3ff-3f7829c43498');
+var element = $('#25a2996e-6950-4a46-b96a-2244bd1dbcec');
 </script>
 <script type="application/vnd.jupyter.widget-view+json">
 {"model_id": "05b18c97da9d4d05b9280df006a5fb82", "version_major": 2, "version_minor": 0}
