Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Streaming queries #124

Merged
merged 12 commits into from
Mar 1, 2019
Merged

Streaming queries #124

merged 12 commits into from
Mar 1, 2019

Conversation

mcmcgrath13
Copy link
Contributor

@mcmcgrath13 mcmcgrath13 commented Feb 26, 2019

Implement new type StreamingQuery. This is similar to the Query structure, but instead of storing the result, uses the result. Due to this difference, StreamingQuery does not implement the tables interface and the only way to fetch the result is via iteration. This is intended to be used as a lower level function for those who need to work with large queries that can't fit in memory.

Still to do:

  • Update README
  • Test on full scale data
  • Pull current master changes in
  • (unrelated) fix Libdl warning

@quinnj
Copy link
Member

quinnj commented Feb 27, 2019

@mcmcgrath13, I'm not sure I understand the reason for this change. I believe we already have this functionality w/ MySQL.Query itself, right? You can already just iterate the MySQL.Query and get NamedTuples, same as what you're proposing w/ this StreamingQuery type. Am I missing something that's different here?

@mcmcgrath13
Copy link
Contributor Author

mcmcgrath13 commented Feb 27, 2019

@quinnj it's definitely a subtle difference. With MySQL.Query we use mysql_store_result which causes MySQL to store the full result on the server and because of this we know things like how many rows are in a result upon callling MySQL.Query. With the proposed MySQL.StreamingQuery we would instead use mysql_use_result which instead returns the rows one-by-one on mysql_fetch_row without intermediate storage and therefore we can't know the number of rows in the result after a call to MySQL.StreamingQuery, which I believes prevents using the Tables.jl interface (though my understanding is definitely far from complete).

From the MySQL documentation:

After invoking mysql_query() or mysql_real_query(), you must call mysql_store_result() or mysql_use_result() for every statement that successfully produces a result set (SELECT, SHOW, DESCRIBE, EXPLAIN, CHECK TABLE, and so forth). You must also call mysql_free_result() after you are done with the result set.

mysql_use_result() initiates a result set retrieval but does not actually read the result set into the client like mysql_store_result() does. Instead, each row must be retrieved individually by making calls to mysql_fetch_row(). This reads the result of a query directly from the server without storing it in a temporary table or local buffer, which is somewhat faster and uses much less memory than mysql_store_result(). The client allocates memory only for the current row and a communication buffer that may grow up to max_allowed_packet bytes.

I am running into an issue where I need to query a large amount of data from one MySQL database (a simple select a,b,c from y), compare/process it against a smaller set of data and then only store a small fraction for analysis. I am currently getting an Out of Memory error from MySQL.Query. I am going to try running the code with a MySQL.StreamingQuery once I have some time to port over the code later today and will report back.

I wanted to get the PR started to start the conversation around how to implement this functionality, but it's not quite ready for merging and I wonder if you think there may be a better solution.

@mcmcgrath13
Copy link
Contributor Author

@quinnj I was able to test this on the full scale database and it worked successfully

@quinnj
Copy link
Member

quinnj commented Feb 27, 2019

@mcmcgrath13, thanks for the explanation, I get the idea now. So here's my thought so we can hopefully avoid the code duplication:

  • in the mutable struct Query definition, let's change the hasresult type parameter to be named resulttype which will have values of :default, :streaming, and :none (Symbols are allowed as type parameters)
  • in the Query constructor, we'll need to accept the streaming::Bool=false keyword argument, then set the resulttype type parameter as appropriate (to one of the symbol values)
  • add your logic to do mysql_use_result in the Query constructor itself
  • add Base.IteratorSize(::Type{Query{resulttype, names, T}}) where {resulttype, names, T} = resulttype == :streaming ? Base.SizeUnknown() : Base.HasLength() around line 99 of types.jl or so (by the other iteration interface definitions)

This should allow re-using the existing Query type, support the new streaming functionality, and keep the Tables.jl interface happy (if the Query is streaming, Tables.jl adjusts to accumulating rows by push!-ing instead of pre-allocating).

Does that sound ok?

@mcmcgrath13
Copy link
Contributor Author

@quinnj sounds good! I'll take a pass at these changes tomorrow and let you know when it's ready for review.

@codecov-io
Copy link

codecov-io commented Feb 28, 2019

Codecov Report

Merging #124 into master will decrease coverage by 35.01%.
The diff coverage is 55.55%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master     #124       +/-   ##
===========================================
- Coverage   87.07%   52.05%   -35.02%     
===========================================
  Files           5        5               
  Lines         147      267      +120     
===========================================
+ Hits          128      139       +11     
- Misses         19      128      +109
Impacted Files Coverage Δ
src/api.jl 27.11% <0%> (-66.64%) ⬇️
src/types.jl 66.15% <62.5%> (-21.03%) ⬇️
src/MySQL.jl 39.28% <0%> (-42.86%) ⬇️
src/consts.jl 55.1% <0%> (-24.31%) ⬇️
src/prepared.jl 81.57% <0%> (-15.09%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a0e70de...afa4b6e. Read the comment docs.

@mcmcgrath13
Copy link
Contributor Author

@quinnj I've incorporated your suggestions - could you review?

Copy link
Member

@quinnj quinnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few comments of things to tweak, but overall this looks great! Thanks @mcmcgrath13!

@mcmcgrath13
Copy link
Contributor Author

@quinnj thanks for reviewing! I've incorporated your suggestions and I believe it's ready to go

@quinnj quinnj merged commit 70472ff into JuliaDatabases:master Mar 1, 2019
@quinnj
Copy link
Member

quinnj commented Mar 1, 2019

Thanks @mcmcgrath13! I'm going to take another stab at updating the binary builder for the package and then do a release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants