@aanastasiou

I am pre-processing a large number of .docx documents with really oddly shaped tables containing text that has to be extracted verbatim.

As useful as python-docx has been in this task, a subset of those documents revealed a small bug in this line.

This PR fixes cases of odd table shapes where the strategy of populating a cell with the value of the previous cell (e.g. in the case of row/cell merges) fails, because there simply has not been a 'previous cell' yet.

Please note, I would be glad to contribute a test case as well, but this might take a bit more time, since it means tracking down the exact table (within the XML) that causes the bug and creating an "equivalent" test case.

Hope this helps.

…l with the value of the previous cell (e.g. in the case of row/cell merges) fails, because there simply has not been a 'previous cell' yet
@scanny
Contributor

scanny commented Jun 11, 2024

@aanastasiou there was a recent update that addressed the "skipped-cells" condition that is actually a legitimate (although relatively unusual) table state.

If you use Table.rows to get rows and then iterate _Row.cells to get each cell you shouldn't have a problem there.

Depending on your needs for column alignment you may want to use _Row.grid_cols_before and .grid_cols_after to discover the empty leading and trailing cells.

There is also a new _Cell.grid_span property so you can tell how many grid-cells a horizontally-merged cell occupies.

I'm not sure what we'll do with Table._cells. It's possible that collection will be deprecated or perhaps we'll reimplement it based on the new "skipped-cell-aware" code, but for now it is probably better to avoid it in favor of the new methods.
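For readers following along, here is a minimal sketch of the approach described above. It assumes that each row exposes .cells, .grid_cols_before and .grid_cols_after as named in this comment, and that a horizontally merged cell shows up in .cells once per grid column it spans. The helper names row_texts and table_matrix and the filler parameter are hypothetical, not part of python-docx.

```python
def row_texts(row, filler=""):
    """Return one text value per layout-grid column for a single row.

    Assumes `row` behaves like a python-docx _Row: `.cells` holds only the
    cells actually present, while `.grid_cols_before` and `.grid_cols_after`
    count the skipped grid columns at the start and end of the row.
    """
    leading = [filler] * row.grid_cols_before
    trailing = [filler] * row.grid_cols_after
    return leading + [cell.text for cell in row.cells] + trailing


def table_matrix(table, filler=""):
    """Build a list of rows, each padded for its skipped grid columns."""
    return [row_texts(row, filler) for row in table.rows]
```

With python-docx installed, something like table_matrix(Document("some.docx").tables[0]) should give one list per row, where "some.docx" is a placeholder path. Whether every row ends up the same length depends on the merged-cell assumption stated above.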

@aanastasiou
Author

@scanny thank you very much for the prompt response. I was using the latest python-docx from PyPI. Has this recent update been applied only to the version on GitHub rather than the one on PyPI? Thanks for the rest of the information, it's good to know for our next code revision.

@scanny
Contributor

scanny commented Jun 25, 2024

This change appears in v1.1.2, which is the current PyPI version, released on May 1, 2024:
https://pypi.org/project/python-docx/
f4a48b5

@aanastasiou
Author

@scanny This is the version that I used (and which eventually led me to file this PR).

@scanny
Contributor

scanny commented Jun 28, 2024

Show me the client code that isn't working the way you want.

@aanastasiou
Author

@scanny The PR contains the exact problem that I dealt with (and how I dealt with it). What might take longer is locating the exact document that causes this behaviour.

@scanny
Contributor

scanny commented Jul 12, 2024

@aanastasiou the idea there is not that this problem with table._cells is fixed for your case, but rather that you should no longer need to use table._cells at all and can instead use something like (c for row in table.rows for c in row.cells).

If you can post the code you're using to traverse cells and which gives rise to the error you mention I expect I'll be able to describe how to modify it to avoid any exceptions for uneven row lengths.
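As an illustration of the row-by-row traversal suggested in this comment, the sketch below walks table.rows and then row.cells instead of touching table._cells. The function name iter_cell_texts is hypothetical, and the code only assumes that rows and cells behave as described in this thread, so rows of different lengths are simply yielded as they come.

```python
def iter_cell_texts(table):
    """Yield (row_index, position_in_row, text) for every cell present.

    Rows are read one at a time through `table.rows` and `row.cells`, so a
    row that is shorter than its neighbours just yields fewer items; no
    lookup into a fixed-width grid is involved.
    """
    for row_index, row in enumerate(table.rows):
        for position, cell in enumerate(row.cells):
            yield row_index, position, cell.text
```

Note that position here is the index within that row's .cells sequence, which may not match the layout-grid column when leading cells are skipped.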
