Streaming queries #124

mcmcgrath13 · 2019-02-26T21:42:18Z

Implement a new type, StreamingQuery. It is similar to the Query structure, but instead of storing the result (mysql_store_result) it uses the result (mysql_use_result), fetching rows from the server one at a time. Because of this difference, StreamingQuery does not implement the Tables.jl interface, and the only way to fetch the result is by iteration. It is intended as a lower-level function for those who need to work with large query results that can't fit in memory.

Still to do:

Update README
Test on full scale data
Pull current master changes in
(unrelated) fix Libdl warning

For large datasets, you often don't want to load the full set at once; this allows fetching and processing rows one at a time.

src/types.jl

quinnj · 2019-02-27T07:12:55Z

@mcmcgrath13, I'm not sure I understand the reason for this change. I believe we already have this functionality w/ MySQL.Query itself, right? You can already just iterate the MySQL.Query and get NamedTuples, same as what you're proposing w/ this StreamingQuery type. Am I missing something that's different here?

mcmcgrath13 · 2019-02-27T13:36:12Z

@quinnj it's definitely a subtle difference. With MySQL.Query we use mysql_store_result, which reads the full result set into the client, and because of this we know things like how many rows are in a result upon calling MySQL.Query. With the proposed MySQL.StreamingQuery we would instead use mysql_use_result, which returns the rows one by one on mysql_fetch_row without intermediate storage. We therefore can't know the number of rows in the result after a call to MySQL.StreamingQuery, which I believe prevents using the Tables.jl interface (though my understanding is definitely far from complete).

From the MySQL documentation:

After invoking mysql_query() or mysql_real_query(), you must call mysql_store_result() or mysql_use_result() for every statement that successfully produces a result set (SELECT, SHOW, DESCRIBE, EXPLAIN, CHECK TABLE, and so forth). You must also call mysql_free_result() after you are done with the result set.

mysql_use_result() initiates a result set retrieval but does not actually read the result set into the client like mysql_store_result() does. Instead, each row must be retrieved individually by making calls to mysql_fetch_row(). This reads the result of a query directly from the server without storing it in a temporary table or local buffer, which is somewhat faster and uses much less memory than mysql_store_result(). The client allocates memory only for the current row and a communication buffer that may grow up to max_allowed_packet bytes.

I am running into an issue where I need to query a large amount of data from one MySQL database (a simple select a,b,c from y), compare/process it against a smaller set of data and then only store a small fraction for analysis. I am currently getting an Out of Memory error from MySQL.Query. I am going to try running the code with a MySQL.StreamingQuery once I have some time to port over the code later today and will report back.

I wanted to get the PR started to start the conversation around how to implement this functionality, but it's not quite ready for merging and I wonder if you think there may be a better solution.
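
The use case described in this comment (scan a large result row by row, compare each row against a small in-memory set, keep only a fraction) can be sketched without a database. This is a minimal illustration, not code from the PR: `rowsource`, `wanted`, and `keepmatching` are hypothetical names, and a `Channel` stands in for a streaming result because it also yields items lazily with no known length.

```julia
# Hypothetical row source: a Channel yields rows lazily, with no known
# length, standing in for a streaming result set. No database is involved.
const Row = NamedTuple{(:a, :b, :c), Tuple{Int, Int, Int}}

rowsource(n) = Channel{Row}() do ch
    for i in 1:n
        put!(ch, (a = i, b = 2i, c = 3i))
    end
end

# Small reference set that comfortably fits in memory.
wanted = Set([3, 7, 11])

# Keep only the rows whose `a` value is in the reference set.
# Apart from `kept`, only the current row is held at any time.
function keepmatching(rows, wanted)
    kept = Row[]
    for row in rows
        row.a in wanted && push!(kept, row)
    end
    return kept
end

kept = keepmatching(rowsource(1_000), wanted)
```

The same loop shape would apply to any iterator of NamedTuples, which is what the thread says iterating a query produces.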

mcmcgrath13 · 2019-02-27T18:15:05Z

@quinnj I was able to test this on the full scale database and it worked successfully

quinnj · 2019-02-27T18:47:59Z

@mcmcgrath13, thanks for the explanation, I get the idea now. So here's my thought so we can hopefully avoid the code duplication:

in the mutable struct Query definition, let's change the hasresult type parameter to be named resulttype which will have values of :default, :streaming, and :none (Symbols are allowed as type parameters)
in the Query constructor, we'll need to accept the streaming::Bool=false keyword argument, then set the resulttype type parameter as appropriate (to one of the symbol values)
add your logic to do mysql_use_result in the Query constructor itself
add Base.IteratorSize(::Type{Query{resulttype, names, T}}) where {resulttype, names, T} = resulttype == :streaming ? Base.SizeUnknown() : Base.HasLength() around line 99 of types.jl or so (by the other iteration interface definitions)

This should allow re-using the existing Query type, support the new streaming functionality, and keep the Tables.jl interface happy (if the Query is streaming, Tables.jl adjusts to accumulating rows by push!-ing instead of pre-allocating).
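
The suggestion above can be sketched in isolation. This is a simplified illustration under stated assumptions, not the code that was merged: `SketchQuery` is a hypothetical stand-in that wraps a plain vector instead of a MySQL result handle, and it drops the `names` type parameter for brevity. It shows a Symbol used as a type parameter, a `streaming` keyword selecting that parameter, and `Base.IteratorSize` dispatching on it.

```julia
# Hypothetical stand-in for Query: the real type wraps a MySQL result
# pointer; here a Vector plays the role of the result set.
struct SketchQuery{resulttype, T}
    rows::Vector{T}
end

# The keyword argument chooses the Symbol stored in the type parameter.
SketchQuery(rows::Vector{T}; streaming::Bool = false) where {T} =
    SketchQuery{streaming ? :streaming : :default, T}(rows)

# Streaming queries report an unknown size; buffered ones report a length.
Base.IteratorSize(::Type{SketchQuery{resulttype, T}}) where {resulttype, T} =
    resulttype == :streaming ? Base.SizeUnknown() : Base.HasLength()

Base.eltype(::Type{SketchQuery{resulttype, T}}) where {resulttype, T} = T

# Only the buffered variant defines a length.
Base.length(q::SketchQuery{:default}) = length(q.rows)

Base.iterate(q::SketchQuery, i::Int = 1) =
    i > length(q.rows) ? nothing : (q.rows[i], i + 1)

buffered = SketchQuery([(a = 1,), (a = 2,)])
streamed = SketchQuery([(a = 1,), (a = 2,)]; streaming = true)
```

With these definitions, `collect(buffered)` can pre-allocate from `length`, while `collect(streamed)` has to grow its output as rows arrive, which is the behaviour the comment attributes to Tables.jl for streaming queries.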

Does that sound ok?

mcmcgrath13 · 2019-02-27T19:52:01Z

@quinnj sounds good! I'll take a pass at these changes tomorrow and let you know when it's ready for review.

codecov-io · 2019-02-28T21:01:28Z

Codecov Report

Merging #124 into master will decrease coverage by 35.01%.
The diff coverage is 55.55%.

@@             Coverage Diff             @@
##           master     #124       +/-   ##
===========================================
- Coverage   87.07%   52.05%   -35.02%     
===========================================
  Files           5        5               
  Lines         147      267      +120     
===========================================
+ Hits          128      139       +11     
- Misses         19      128      +109

Impacted Files	Coverage Δ
src/api.jl	`27.11% <0%> (-66.64%)`	⬇️
src/types.jl	`66.15% <62.5%> (-21.03%)`	⬇️
src/MySQL.jl	`39.28% <0%> (-42.86%)`	⬇️
src/consts.jl	`55.1% <0%> (-24.31%)`	⬇️
src/prepared.jl	`81.57% <0%> (-15.09%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a0e70de...afa4b6e.

mcmcgrath13 · 2019-02-28T21:06:31Z

@quinnj I've incorporated your suggestions - could you review?

README.md

src/types.jl

quinnj

I left a few comments of things to tweak, but overall this looks great! Thanks @mcmcgrath13!

mcmcgrath13 · 2019-03-01T00:09:50Z

@quinnj thanks for reviewing! I've incorporated your suggestions and I believe it's ready to go

quinnj · 2019-03-01T04:03:18Z

Thanks @mcmcgrath13! I'm going to take another stab at updating the binary builder for the package and then do a release.

mcmcgrath13 added 4 commits January 18, 2019 14:19

refactor: add project.toml, set up so tests can run from pkg>

7c679f2

test: query -> query, materialize results into columntable

f6ce7a3

docs: update docs to reflect tables.jl interface

0aba615

feat(query): imlement streaming functionality

9038646

For large datasets, often don't want to load the full set at once, this allows fetching and processing rows one at a time

quinnj reviewed Feb 27, 2019

src/types.jl (outdated)

mcmcgrath13 added 6 commits February 27, 2019 10:08

Merge pull request #1 from mcmcgrath13/stream_query

fd7c38a

Stream query

Merge branch 'master' of https://github.com/JuliaDatabases/MySQL.jl

00f90b8

Merge branch 'master' of https://github.com/mcmcgrath13/MySQL.jl

95cab03

docs: table instead of data

7d83e1e

fix(libdl): add it to dependencies - used in deps.jl

932099a

docs(queries): add sreamingquery docs, add more to query docs

e21a29f

refactor(streaming): incorporate into mysql.query as kwarg, update do…

6d446d9

…cs, tests

quinnj reviewed Feb 28, 2019

README.md (outdated)

quinnj reviewed Feb 28, 2019

src/types.jl (outdated)

quinnj reviewed Feb 28, 2019

src/types.jl (outdated)

quinnj reviewed Feb 28, 2019

src/types.jl (outdated)

quinnj approved these changes Feb 28, 2019

fix: fixes per review, typos in docs, delete struct, declare streaming

afa4b6e

quinnj merged commit 70472ff into JuliaDatabases:master Mar 1, 2019
