
crawl_google_results.py update, modularize, documentation and doctest #4847


Closed
wants to merge 24 commits into from

Conversation

@appledora appledora commented Oct 1, 2021

Made large changes to web_programming/crawl_google_results.py, adding documentation and doctests. Modularized the script and made it more customizable. Fixed formatting using black and flake8.

  • Add an algorithm?
  • Fix a bug or typo in an existing algorithm?
  • Documentation change?

Checklist:

  • I have read CONTRIBUTING.md.
  • This pull request is all my own work -- I have not plagiarized.
  • I know that pull requests will not be merged if they fail the automated tests.
  • This PR only changes one algorithm file. To ease review, please open separate PRs for separate algorithms.
  • All new Python files are placed inside an existing directory.
  • All filenames are in all lowercase characters with no spaces or dashes.
  • All functions and variable names follow Python naming conventions.
  • All function parameters and return values are annotated with Python type hints.
  • All functions have doctests that pass the automated testing.
  • All new algorithms have a URL in its comments that points to Wikipedia or other similar explanation.
  • [ ] If this pull request resolves one or more open issues then the commit message contains Fixes: #{$ISSUE_NO}.

@appledora appledora requested a review from cclauss as a code owner October 1, 2021 14:35
@ghost ghost added awaiting reviews This PR is ready to be reviewed enhancement This PR modified some existing files labels Oct 1, 2021
@appledora appledora changed the title Crawl google results crawl_google_results.py update, modularize, documentation and doctest Oct 1, 2021
@ghost ghost added the tests are failing Do not merge until tests pass label Oct 1, 2021
appledora and others added 2 commits October 1, 2021 20:51
Co-authored-by: Christian Clauss <cclauss@me.com>
Co-authored-by: Christian Clauss <cclauss@me.com>
@appledora appledora requested a review from cclauss October 1, 2021 17:37
@cclauss (Member) commented Oct 2, 2021

Please undo the changes to requirements.txt.

@appledora (Author)

> Please undo the changes to requirements.txt.

Done.
[screenshot]

@cclauss (Member) commented Oct 2, 2021

I think that this is a different algorithm than the original so let's keep both. Please rename this file to get_google_search_results.py with an algorithmic function get_google_search_results() which returns a list or tuple of search results. Put the writing of those results to a file in a separate function. Crawling is traversing each search result to go deeper, which this algorithm does not do. Please do not print in the get_google_search_results() function.
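
For readers following along, a minimal sketch of the structure being requested could look like this. It is not the PR's actual code: the CSS selector, the constants, and the write function's parameters are illustrative assumptions.

    from __future__ import annotations

    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "https://www.google.com/search"
    HEADERS = {"User-Agent": "Mozilla/5.0"}  # placeholder user-agent string


    def get_google_search_results(query: str = "potato") -> list[dict[str, str]]:
        """Return search results as a list of dicts; no printing, no file I/O."""
        response = requests.get(
            BASE_URL, params={"q": query}, headers=HEADERS, timeout=10
        )
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        results = []
        for anchor in soup.select("div.yuRUbf > a"):  # selector is illustrative only
            title = anchor.find("h3")
            if title:
                results.append({"title": title.text, "link": anchor["href"]})
        return results


    def write_google_search_results(results: list[dict[str, str]], filename: str) -> str:
        """Write the results to a file in a separate, easily testable step."""
        with open(filename, "w") as file:
            for result in results:
                file.write(f"{result['title']}\t{result['link']}\n")
        return filename

Keeping the network call in one function and the file writing in another isolates the part that cannot be doctested offline, which becomes relevant later in this thread.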

@appledora (Author)

> I think that this is a different algorithm than the original so let's keep both. Please rename this file to get_google_search_results.py with an algorithmic function get_google_search_results() which returns a list or tuple of search results. Put the writing of those results to a file in a separate function. Crawling is traversing each search result to go deeper, which this algorithm does not do. Please do not print in the get_google_search_results() function.

On it!

@ghost ghost added the require type hints https://docs.python.org/3/library/typing.html label Oct 2, 2021
@ghost ghost left a comment

Automated review generated by algorithms-keeper. If there's any problem regarding this review, please open an issue about it.

algorithms-keeper commands and options

algorithms-keeper actions can be triggered by commenting on this PR:

  • @algorithms-keeper review to trigger the checks for only added pull request files
  • @algorithms-keeper review-all to trigger the checks for all the pull request files, including the modified files. As we cannot post review comments on lines not part of the diff, this command will post all the messages in one comment.

NOTE: Commands are in beta and so this feature is restricted only to a member or owner of the organization.


@ghost ghost removed the require type hints https://docs.python.org/3/library/typing.html label Oct 2, 2021
@ghost ghost removed the tests are failing Do not merge until tests pass label Oct 2, 2021
@ghost ghost added the require tests Tests [doctest/unittest/pytest] are required label Oct 2, 2021
@appledora appledora requested a review from cclauss October 5, 2021 14:00
@ghost ghost mentioned this pull request Oct 5, 2021
@appledora (Author)

@cclauss, any comments on this one?

@appledora (Author)

@poyea I think I fixed the technical and conventional errors on this one. Could you take a look?

Co-authored-by: John Law <johnlaw.po@gmail.com>
@ghost ghost removed the tests are failing Do not merge until tests pass label Oct 18, 2021
@appledora appledora requested a review from poyea October 18, 2021 04:21
@poyea (Member) commented Oct 19, 2021

So if we add

    import doctest

    doctest.testmod()

Then every single time we run the tests, it fires requests to Google, and those .txt files will be generated.

@poyea (Member) commented Oct 19, 2021

Please see https://github.com/TheAlgorithms/Python/blob/master/web_programming/instagram_crawler.py for how this is currently handled. Ideally we would have to mock the requests, but in this case we may further factor out the processing functions, and let's not test the request part.
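
In other words, a sketch only (not the PR's code): pull the HTML processing out into a pure function that can be doctested offline, and leave the network call untested. The function name and the CSS class below are made-up examples.

    from __future__ import annotations

    from bs4 import BeautifulSoup


    def extract_links(html: str) -> list[str]:
        """Pure processing step with no network access, so it is safe to doctest.

        >>> extract_links('<a class="result" href="https://example.com">x</a>')
        ['https://example.com']
        """
        soup = BeautifulSoup(html, "html.parser")
        return [anchor["href"] for anchor in soup.find_all("a", class_="result")]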

@poyea poyea added the hacktoberfest-accepted Accepted to be counted towards Hacktoberfest label Oct 19, 2021
changing constant name

Co-authored-by: John Law <johnlaw.po@gmail.com>
@ghost ghost added the tests are failing Do not merge until tests pass label Oct 19, 2021
@appledora (Author)

@poyea, for the sake of clarification, are you suggesting I follow the coding pattern in https://github.com/TheAlgorithms/Python/blob/master/web_programming/instagram_crawler.py?

@poyea (Member) commented Oct 19, 2021

> @poyea, for the sake of clarification, are you suggesting I follow the coding pattern in https://github.com/TheAlgorithms/Python/blob/master/web_programming/instagram_crawler.py?

Not necessarily. Now it's just a matter of how we write our test suite.

@appledora (Author) commented Oct 20, 2021

@poyea, forgive me, but I believe I am still a little confused about the testing requirements. :|
What I think you are asking for is to test only the write_google_search_results() method, without actually calling the parse_results() method, which sends the actual request to Google. Hence, I could write a mock test method like test_instagram_user() as given in this script, and by using the doctest library I would call this mock method?
Am I going in the right direction here?
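
For what it's worth, that kind of mock test might look roughly like the following; the module path, the HTML markup, and the assumption that parse_results() calls requests.get() are taken from this conversation, not from the PR's actual code.

    from unittest.mock import MagicMock, patch

    import requests

    from web_programming.get_google_search_results import parse_results  # path assumed

    # Canned HTML imitating whatever structure parse_results() expects;
    # the markup here is purely illustrative.
    FAKE_HTML = '<div class="yuRUbf"><a href="https://example.com"><h3>Example</h3></a></div>'


    def test_parse_results_without_network() -> None:
        """Exercise parse_results() without ever sending a real request to Google."""
        fake_response = MagicMock(status_code=200, text=FAKE_HTML, content=FAKE_HTML.encode())
        with patch.object(requests, "get", return_value=fake_response):
            results = parse_results("potato")
        assert results  # the parser should find the canned result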

@cclauss (Member) commented Oct 20, 2021

web_programming/get_google_search_results.py:30: error: Name "headers" is not defined
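
One likely fix is sketched below; it assumes the code at line 30 meant to use the module-level HEADERS constant that appears elsewhere in this file, and the header value shown is a placeholder.

    # Define the headers once at module level and reference the constant,
    # instead of an undefined local name `headers`.
    HEADERS = {"User-Agent": "Mozilla/5.0"}  # placeholder user-agent string

    # ...inside the request helper:
    # response = requests.get(url, headers=HEADERS)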

@ghost ghost removed the tests are failing Do not merge until tests pass label Oct 21, 2021
@poyea (Member) commented Oct 26, 2021

> @poyea, forgive me, but I believe I am still a little confused about the testing requirements. :| What I think you are asking for is to test only the write_google_search_results() method, without actually calling the parse_results() method, which sends the actual request to Google. Hence, I could write a mock test method like test_instagram_user() as given in this script, and by using the doctest library I would call this mock method? Am I going in the right direction here?

@appledora You may try to run the tests locally.

@stale stale bot commented Apr 28, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Used to mark an issue or pull request stale. label Apr 28, 2022
@stale stale bot removed stale Used to mark an issue or pull request stale. labels Mar 18, 2023
@Isskta404

This comment was marked as off-topic.

}


def parse_results(query: str = "") -> list:
Contributor

Suggested change:
- def parse_results(query: str = "") -> list:
+ def parse_results(query: str = "") -> list[dict[str, str | None]]:

Annotate the list elements' type.

Comment on lines +54 to +57

Contributor

Suggested change:
- next_page = []
-
- for item in table_data:
-     next_page.append(item["href"])
+ next_page = (item["href"] for item in table_data)

new_link = BASE_URL + next_page_link
try:
    response = requests.get(new_link, headers=HEADERS)

Contributor

Suggested change: remove the blank line after the requests.get() call.
Comment on lines +106 to +107

Contributor

Suggested change:
- if filename == "":
-     filename = query + "-query.txt"
+ if not filename:
+     filename = query + "-query.txt"

if filename == "":
    filename = query + "-query.txt"
elif not filename.endswith(".txt"):
    filename = filename + ".txt"

Contributor

Suggested change:
- filename = filename + ".txt"
+ filename += ".txt"

Comment on lines +104 to +105

Contributor

Suggested change:
- if query == "":
-     query = "potato"
+ if not query:
+     query = "potato"

Comment on lines +89 to +102

Contributor

Make the doctests more strict by testing the outputs for equality:

Suggested change:
- >>> write_google_search_results("python", "test") != None
- True
- >>> write_google_search_results("", "tet.html") != None
- True
- >>> write_google_search_results("python", "") != None
- True
- >>> write_google_search_results("", "") != None
- True
- >>> "test" in write_google_search_results("python", "test")
- True
- >>> "test1" in write_google_search_results("", "test1")
- True
- >>> "potato" in write_google_search_results("", "")
- True
+ >>> write_google_search_results("python", "test") == "test.txt"
+ True
+ >>> write_google_search_results("", "tet.html") == "tet.html.txt"
+ True
+ >>> write_google_search_results("python", "") == "python.txt"
+ True
+ >>> write_google_search_results("", "test1") == "test1.txt"
+ True
+ >>> write_google_search_results("", "") == "potato.txt"
+ True

@tianyizheng02 tianyizheng02 (Contributor) left a comment

Closing this PR because the code no longer works. I tried running this file locally and it wasn't able to get any Google results. I believe this is because the CSS identifiers have changed, so this file is no longer able to find the results in the HTTP response.

Labels
  • awaiting reviews: This PR is ready to be reviewed
  • enhancement: This PR modified some existing files
  • hacktoberfest-accepted: Accepted to be counted towards Hacktoberfest
Projects
None yet

5 participants