Memory leak in pandas.read_msgpack when reading from string #16647

Closed
bluenote10 opened this issue Jun 9, 2017 · 3 comments
Labels: Bug, Performance (memory or execution speed performance)

Comments

@bluenote10

Code Sample (copy-pastable)

from __future__ import division, print_function
import pandas as pd
import numpy as np
import os
import gc
import psutil


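def log_memory(label):
    # Collect every GC generation first so pending garbage is not counted.
    for i in range(3):
        gc.collect(i)
    process = psutil.Process(os.getpid())
    # Resident set size, converted from bytes to MiB.
    mem_usage = process.memory_info().rss / float(2 ** 20)
    print("[Memory usage] {:<25s} {:12.1f} MB".format(
        label, mem_usage
    ))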


def generate_test_data(num_partitions=20):
    for i in range(num_partitions):
        N = 10 * 1000 * 1000
        # randomness required, identical files don't have the issue
        df = pd.DataFrame({
            "A": np.random.uniform(0, 1, size=N),
        })
        df.to_msgpack("/tmp/pd_test_{:02d}.msg".format(i), compress='zlib')


def load_msgpack(f):
    # Read the whole file into a byte string; the leak only occurs on this path.
    with open(f, "rb") as fh:
        data = fh.read()
    df = pd.read_msgpack(data)
    return df


def load_partitions_sequentially(num_partitions=20):
    for i in range(num_partitions):
        fn = "/tmp/pd_test_{:02d}.msg".format(i)
        df = load_msgpack(fn)
        del df
        log_memory("After partition {}".format(i+1))


log_memory("At initialization")
generate_test_data()
log_memory("After data generation")

load_partitions_sequentially()

Problem description

There is a memory leak in pandas.read_msgpack when reading from a string. Calling pandas.read_msgpack(str_data) increments the reference count of str_data, without ever releasing it, if and only if read_msgpack is seeing the content of str_data for the first time. The leak therefore shows up only when reading different files; when the same file is read over and over again, str_data leaks only once.
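
The reference-count claim can be checked directly with sys.getrefcount. The sketch below is only an illustration: refcount_delta, leaky_reader and clean_reader are hypothetical names, and the leaky stand-in merely mimics the reported symptom (a reference retained the first time some content is seen) without claiming that this is how read_msgpack leaks internally. On an affected installation, pd.read_msgpack would be passed as the reader instead.

```python
import os
import sys


def refcount_delta(reader, data):
    """Return how many extra references to `data` survive a call to `reader`."""
    before = sys.getrefcount(data)
    result = reader(data)
    del result  # drop the result so it cannot keep `data` alive by itself
    return sys.getrefcount(data) - before


# Hypothetical stand-in for a leaking reader: it remembers each distinct
# payload it has seen, so a reference is retained only on first sight.
_seen = set()


def leaky_reader(data):
    _seen.add(data)
    return len(data)


def clean_reader(data):
    return len(data)


payload = os.urandom(1024)
same_content = bytes(bytearray(payload))  # equal content, different object

print(refcount_delta(clean_reader, payload))       # reader keeps nothing
print(refcount_delta(leaky_reader, payload))       # first sight of this content
print(refcount_delta(leaky_reader, payload))       # same object again
print(refcount_delta(leaky_reader, same_content))  # same content, new object
```

On CPython the clean reader should leave the count unchanged, while the stand-in should retain exactly one reference, and only for the first call with a given content.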

The problem does not exist when reading from file handles or BytesIO.
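
Based on that observation, a possible workaround is to hand read_msgpack a buffer or an open file handle instead of the raw string. The sketch below assumes pandas < 1.0, where read_msgpack still exists. The helper names are hypothetical, and the reader is passed in as a parameter so the helpers do not depend on a particular pandas version.

```python
import io


def read_via_buffer(raw_bytes, reader):
    """Wrap already-loaded bytes in BytesIO before passing them to `reader`."""
    return reader(io.BytesIO(raw_bytes))


def read_via_handle(path, reader):
    """Pass an open binary file handle to `reader` instead of the file's contents."""
    with open(path, "rb") as handle:
        return reader(handle)


# Intended usage on an affected installation (pandas < 1.0):
#   import pandas as pd
#   df = read_via_handle("/tmp/pd_test_00.msg", pd.read_msgpack)
```

Whether this avoids the leak in a given setup is best confirmed with the log_memory helper from the code sample.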

Output of above example

The output shows resident memory growing by roughly 144 MB with every partition loaded sequentially, and none of it is released:

[Memory usage] At initialization                 39.4 MB
[Memory usage] After data generation             39.9 MB
[Memory usage] After partition 1                185.9 MB
[Memory usage] After partition 2                329.8 MB
[Memory usage] After partition 3                473.7 MB
[Memory usage] After partition 4                617.6 MB
[Memory usage] After partition 5                761.5 MB
[Memory usage] After partition 6                905.4 MB
[Memory usage] After partition 7               1049.3 MB
[Memory usage] After partition 8               1193.2 MB
[Memory usage] After partition 9               1337.1 MB
[Memory usage] After partition 10              1481.0 MB
[Memory usage] After partition 11              1624.9 MB
[Memory usage] After partition 12              1768.8 MB
[Memory usage] After partition 13              1912.7 MB
[Memory usage] After partition 14              2056.6 MB
[Memory usage] After partition 15              2200.4 MB
[Memory usage] After partition 16              2344.3 MB
[Memory usage] After partition 17              2488.2 MB
[Memory usage] After partition 18              2631.7 MB
[Memory usage] After partition 19              2775.6 MB
[Memory usage] After partition 20              2919.5 MB
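
The growth rate can be read off the log mechanically. The following sketch parses a few of the lines above and computes the increase between consecutive readings. The helper names parse_readings and growth_per_step are hypothetical, and only the first four partition readings are included for brevity.

```python
import re

LOG = """\
[Memory usage] After partition 1                185.9 MB
[Memory usage] After partition 2                329.8 MB
[Memory usage] After partition 3                473.7 MB
[Memory usage] After partition 4                617.6 MB
"""

LINE = re.compile(r"\[Memory usage\]\s+(.*?)\s+([\d.]+) MB")


def parse_readings(log_text):
    """Return (label, megabytes) pairs for every memory-usage line."""
    return [(m.group(1), float(m.group(2))) for m in LINE.finditer(log_text)]


def growth_per_step(readings):
    """Return the increase in MB between consecutive readings, rounded to 0.1."""
    values = [mb for _, mb in readings]
    return [round(b - a, 1) for a, b in zip(values, values[1:])]


print(growth_per_step(parse_readings(LOG)))
```

For these four readings every step rounds to 143.9 MB.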

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-100-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.13.0
scipy: None
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

@chris-b1
Contributor

chris-b1 commented Jun 9, 2017

This finally block that never runs looks like a possible cause.

finally:

chris-b1 added the Bug, Msgpack, and Performance (memory or execution speed performance) labels on Jun 9, 2017
chris-b1 added this to the Next Major Release milestone on Jun 9, 2017
jreback changed the title to "Memory leak in pandas.read_msgpack when reading from string" on Jun 9, 2017
@jreback
Contributor

jreback commented Jun 9, 2017

@bluenote10 you are welcome to have a look :>

@simonjayhawkins
Member

msgpack is deprecated #30112
