Memory leak in pandas.read_msgpack when reading from string #16647

bluenote10 · 2017-06-09T15:41:42Z

Code Sample (copy-pastable)

from __future__ import division, print_function
import pandas as pd
import numpy as np
import os
import gc
import psutil


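def log_memory(label):
    # Collect every GC generation first so only unreclaimed memory is reported.
    for i in range(3):
        gc.collect(i)
    process = psutil.Process(os.getpid())
    # Resident set size, converted from bytes to MB (2 ** 20 bytes).
    mem_usage = process.memory_info().rss / float(2 ** 20)
    print("[Memory usage] {:<25s} {:12.1f} MB".format(
        label, mem_usage
    ))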


def generate_test_data(num_partitions=20):
    for i in range(num_partitions):
        N = 10 * 1000 * 1000
        # randomness required, identical files don't have the issue
        df = pd.DataFrame({
            "A": np.random.uniform(0, 1, size=N),
        })
        df.to_msgpack("/tmp/pd_test_{:02d}.msg".format(i), compress='zlib')


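def load_msgpack(f):
    # Read the whole file into a string; passing this string (rather than a
    # file handle) to read_msgpack is what triggers the leak.
    with open(f, "rb") as fh:
        data = fh.read()
    df = pd.read_msgpack(data)
    return df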


def load_partitions_sequentially(num_partitions=20):
    for i in range(num_partitions):
        fn = "/tmp/pd_test_{:02d}.msg".format(i)
        df = load_msgpack(fn)
        del df
        log_memory("After partition {}".format(i+1))


log_memory("At initialization")
generate_test_data()
log_memory("After data generation")

load_partitions_sequentially()

Problem description

There is a memory leak in pandas.read_msgpack when reading from a string. Calling pandas.read_msgpack(str_data) increases the reference count of str_data if and only if read_msgpack sees the content of str_data for the first time. As a result, memory leaks only when reading different files: when the same file is read over and over again, str_data leaks only once.
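
One way to observe the reference-count behaviour described above is to compare sys.getrefcount before and after the call. The following is a minimal sketch, not part of the original report: refcount_delta, leaky_reader and clean_reader are hypothetical names, and the two readers are stand-ins so the snippet runs without pandas.

```python
import os
import sys


def refcount_delta(func, obj):
    """Return how many extra references to obj remain after func(obj) returns."""
    before = sys.getrefcount(obj)
    func(obj)  # the return value is discarded immediately
    after = sys.getrefcount(obj)
    return after - before


# Hypothetical stand-ins: one reader keeps a reference, the other does not.
_kept = []


def leaky_reader(data):
    _kept.append(data)  # holds on to the input, mimicking a leak
    return len(data)


def clean_reader(data):
    return len(data)


payload = os.urandom(1024)  # built at runtime, like file contents
print("leaky:", refcount_delta(leaky_reader, payload))
print("clean:", refcount_delta(clean_reader, payload))
```

On an affected pandas version, passing pd.read_msgpack as func and the file contents as obj should, if the description above holds, give a non-zero delta the first time new content is read.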

The problem does not exist when reading from file handles or BytesIO.
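
Since reading from BytesIO does not show the problem, a possible workaround is to wrap the file contents in a buffer before handing them to the reader. This is only a sketch: load_via_buffer and fake_reader are hypothetical names, and the reader is passed in as a parameter so the example runs without pandas.

```python
import io
import os
import tempfile


def load_via_buffer(path, reader):
    """Read path into memory and pass reader a BytesIO instead of a raw string."""
    with open(path, "rb") as fh:
        buf = io.BytesIO(fh.read())
    try:
        return reader(buf)
    finally:
        buf.close()


# Stand-in reader for demonstration only; it just returns the buffer contents.
def fake_reader(buf):
    return buf.read()


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"payload")
    tmp_path = tmp.name

result = load_via_buffer(tmp_path, fake_reader)
os.remove(tmp_path)
```

With a pandas version that still ships read_msgpack, the call would presumably be load_via_buffer(fn, pd.read_msgpack).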

Output of above example

The output clearly shows the effect of the memory leak when loading data frame partitions sequentially:

[Memory usage] At initialization                 39.4 MB
[Memory usage] After data generation             39.9 MB
[Memory usage] After partition 1                185.9 MB
[Memory usage] After partition 2                329.8 MB
[Memory usage] After partition 3                473.7 MB
[Memory usage] After partition 4                617.6 MB
[Memory usage] After partition 5                761.5 MB
[Memory usage] After partition 6                905.4 MB
[Memory usage] After partition 7               1049.3 MB
[Memory usage] After partition 8               1193.2 MB
[Memory usage] After partition 9               1337.1 MB
[Memory usage] After partition 10              1481.0 MB
[Memory usage] After partition 11              1624.9 MB
[Memory usage] After partition 12              1768.8 MB
[Memory usage] After partition 13              1912.7 MB
[Memory usage] After partition 14              2056.6 MB
[Memory usage] After partition 15              2200.4 MB
[Memory usage] After partition 16              2344.3 MB
[Memory usage] After partition 17              2488.2 MB
[Memory usage] After partition 18              2631.7 MB
[Memory usage] After partition 19              2775.6 MB
[Memory usage] After partition 20              2919.5 MB
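
To put a number on the growth, the average increase per partition can be computed from the first and last "After partition" values in the log. The short calculation below uses only figures from the log and the test data definition (N float64 values per partition); the variable names are illustrative.

```python
# RSS values in MB copied from the log (after partitions 1 and 20).
after_first = 185.9
after_last = 2919.5
num_steps = 20 - 1

growth_per_partition = (after_last - after_first) / num_steps

# For scale: uncompressed size of one partition's float64 column, in MB.
N = 10 * 1000 * 1000
raw_column_mb = N * 8 / float(2 ** 20)

print("Average growth per partition: {:.1f} MB".format(growth_per_partition))
print("Raw float64 column size:      {:.1f} MB".format(raw_column_mb))
```

This prints an average growth of about 143.9 MB per partition, against a raw column size of about 76.3 MB.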

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-100-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.13.0
scipy: None
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

chris-b1 · 2017-06-09T16:49:46Z

This finally block that never runs looks like a possible cause.

pandas/pandas/io/packers.py

Line 212 in 9fdea65

finally:

jreback · 2017-06-09T22:19:02Z

@bluenote10 you're welcome to have a look :>

simonjayhawkins · 2019-12-11T13:29:13Z

msgpack is deprecated #30112

chris-b1 added the Bug, Msgpack, and Performance (Memory or execution speed performance) labels Jun 9, 2017

chris-b1 added this to the Next Major Release milestone Jun 9, 2017

jreback changed the title ~~Memory leak in pandas.read_msgpack when reading from string~~ Memory leak in pandas.read_msgpack when reading from string Jun 9, 2017

jreback added Difficulty Intermediate labels Jun 9, 2017

jbrockmendel removed Difficulty Intermediate labels Oct 21, 2019

simonjayhawkins closed this as completed Dec 11, 2019
