stream.read(8192) on image heavy repository returns 0 despite having more data #43


Closed
dividuum opened this issue Dec 18, 2017 · 4 comments


@dividuum

I'm using gitdb in combination with GitPython to handle files directly from git repositories. So far this worked perfectly. Thanks!

Something odd is happening in a repository I'm handling at the moment. I've traced it down to the
read function within the DecompressMemMapReader. More precisely, this line of code:

# if window is too small, make it larger so zip can decompress something

If I understand things correctly, this check tries to enlarge the input buffer (containing compressed data) to at least 8 bytes, so that the following self._zip.decompress call returns at least some data.

In my repository this doesn't help: len(dcompdat) is 0 and all the way back in my code, a read(8192) returns '' despite more data being available. Changing 8192 to 8191 (or other random values) most of the time "fixes" this. I suspect this is the result of different internal buffering.

Not sure if it helps with finding a reason why decompress doesn't return anything from 8 input bytes, but the file responsible is an already compressed JPEG file.

How to fix this? Changing the minimum window size to 48 seems to help

if self._cwe - self._cws < 48:
    self._cwe = self._cws + 48

but I'm not sure if this has consequences I don't fully understand. Any help would be appreciated.
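For context, zlib's decompressor can legitimately return no output when fed only a few bytes of input: nothing is emitted until a complete deflate block (or at least its header) can be decoded. A minimal stdlib sketch of the effect described above; the 8-byte slice size mirrors the minimum window gitdb enlarges to, and the payload is made up for illustration:

```python
import zlib

# A JPEG-like payload: magic bytes followed by filler data.
payload = b"\xff\xd8\xff\xe0" + bytes(range(256)) * 32

compressed = zlib.compress(payload)
decomp = zlib.decompressobj()

# Feed the decompressor in tiny 8-byte slices. Individual calls may
# return b"" even though more compressed data is pending, because a
# whole deflate block has to be available before output appears.
pieces = []
for i in range(0, len(compressed), 8):
    pieces.append(decomp.decompress(compressed[i : i + 8]))
pieces.append(decomp.flush())

# All data arrives eventually once enough input has been supplied.
assert b"".join(pieces) == payload
```

A caller that treats one empty return as end-of-stream (rather than feeding more input) would see exactly the truncation reported here.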

@Byron

Byron commented Dec 18, 2017

Thanks a lot for the detailed description of what's going on - it's a pleasure to read!
It's also a bit of a skeleton in my basement: I remember trying to get this to work in a streaming fashion even though Python's zlib interface doesn't provide all the information required to do that safely. So I devised a more indirect approach, which is hard to understand, a bit magical, and apparently prone to failing in certain edge cases.

Unless you have a specific requirement for using the GitDb implementation in GitPython, I would recommend replacing it with the GitCmdObjectDB available in GitPython. Performance-wise, it should be absolutely equivalent, if not faster. After all, the limiting factor for git streaming performance is the zlib compression, which is implemented similarly in both the git program and Python.

I hope that helps!
Alternatively, please submit a PR adjusting the window value to work for you. I don't think it will affect anyone else negatively.

@Byron Byron added the feedback label Dec 18, 2017
@dividuum

Thanks for the fast response. The GitCmdObjectDB does indeed solve my problem. Where does the higher memory usage ("When extracting large files, memory usage will be much higher") come from? For purely streaming out data from a repository, the code responsible seems to be

https://github.com/gitpython-developers/GitPython/blob/1c1e984b212637fe108c0ddade166bc39f0dd2ef/git/cmd.py#L423

which doesn't look like it's slurping in the complete file before returning anything to the caller. Am I missing something?
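For what it's worth, the pattern in that code corresponds to plain chunked reads from the child process's stdout pipe, which keeps memory bounded regardless of blob size. A stand-in sketch using only the stdlib (a small Python child process substitutes for git cat-file here, purely for illustration):

```python
import subprocess
import sys

# Spawn a child that writes 100,000 bytes to stdout, standing in for
# `git cat-file` streaming a blob.
proc = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.stdout.write('x' * 100000)"],
    stdout=subprocess.PIPE,
)

# Read the pipe in fixed-size chunks; at most 8192 bytes are buffered
# per iteration, so memory use stays flat no matter how large the
# stream is.
total = 0
while True:
    chunk = proc.stdout.read(8192)
    if not chunk:
        break
    total += len(chunk)
proc.wait()

assert total == 100000
```

Nothing in this pattern requires slurping the whole file first, which matches the reading of the linked code.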

@Byron Byron removed the feedback label Dec 18, 2017
@Byron

Byron commented Dec 18, 2017

I think that note can safely be ignored, unless you find yourself actually running out of memory. Maybe I wrote it because I found that a long-running git process didn't free memory, or built up a cache of some sort.
These days I would think it's not an issue anymore, and maybe has never been.

@dividuum

Thanks. Closing for now.
