+ An RStudio [guide](https://db.rstudio.com/getting-started/connect-to-database/) on how to connect to an existing database using the `odbc` and `DBI` packages.
+ An RStudio [guide](https://db.rstudio.com/databases/big-query/) on how to connect to Google BigQuery using `odbc` or `bigrquery`.
+ The official [vignette](https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html) on using `dbplyr` to perform SQL queries via the tidyverse API.
+ A great series of blog posts by Vebash Naidoo, with [part 1](https://sciencificity-blog.netlify.app/posts/2020-12-12-using-the-tidyverse-with-databases/), [part 2](https://sciencificity-blog.netlify.app/posts/2020-12-20-using-the-tidyverse-with-dbs-partii/) and [part 3](https://sciencificity-blog.netlify.app/posts/2020-12-31-using-tidyverse-with-dbs-partiii/) containing advanced tips for using `dbplyr` to query SQL databases in R.
In SQL, the order in which clauses are written does not match the order in which they are executed, which leads to common errors like the one in the example below.
  -- student_id is first filtered via WHERE and then renamed as hobbit_id via SELECT
  "
) %>%
  odbc::dbFetch() %>%
  knitr::kable()
```

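The ordering described above can be sketched in R on a local tibble rather than on the database connection used in this tutorial. The `students` table and its values are hypothetical; only the `student_id` and `hobbit_id` names come from the example. Because `dplyr` verbs run in the order they are written, the filter has to come before the rename, which mirrors SQL executing `WHERE` before `SELECT`.

```r
library(dplyr)

# Hypothetical stand-in for the student table (values are made up)
students <- tibble(
  student_id = 1:4,
  first_name = c("Frodo", "Sam", "Merry", "Pippin")
)

# Execution order: filter on the original name (WHERE), then rename (SELECT ... AS)
renamed <- students %>%
  filter(student_id > 2) %>%
  rename(hobbit_id = student_id)

# Filtering on the alias before it exists fails, much like using an alias in WHERE
alias_first_fails <- tryCatch(
  {
    students %>% filter(hobbit_id > 2)
    FALSE
  },
  error = function(e) TRUE
)
```

Here `renamed` keeps the two matching rows under the new column name, while `alias_first_fails` records that filtering on `hobbit_id` before the rename raises an error.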
## Writing SQL `JOIN` queries

The `dplyr` syntax for joining tables is very similar to the corresponding SQL syntax. The concepts of left, right, inner and full joins are shared across both languages. In SQL, `JOIN` is executed directly after `FROM`, and it is best practice to alias table names (rename them via `AS`) so that the join condition and the selected columns state which table each column comes from.

**Note:** Always ensure that you are joining on at least one column containing unique IDs to prevent unexpected many-to-many join results.

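As a rough illustration of the note on unique IDs, the sketch below uses small local tibbles with made-up values (the `platform_ok` and `platform_dup` names are hypothetical). It checks the join key for duplicates and compares row counts after joining on a unique key versus a duplicated one.

```r
library(dplyr)

# Hypothetical tables; values are made up for illustration
course <- tibble(course_id = 1:3, platform_id = c(1L, 1L, 2L))
platform_ok <- tibble(platform_id = 1:2, platform_name = c("A", "B"))
platform_dup <- tibble(
  platform_id = c(1L, 1L, 2L),
  platform_name = c("A", "A2", "B")
)

# Count keys that appear more than once in the lookup table
n_dup_keys <- platform_dup %>%
  count(platform_id) %>%
  filter(n > 1) %>%
  nrow()

# A unique key keeps one output row per course
rows_ok <- nrow(inner_join(course, platform_ok, by = "platform_id"))

# A duplicated key multiplies rows; newer dplyr versions warn about this,
# so the warning is suppressed here to keep the sketch quiet
rows_dup <- nrow(
  suppressWarnings(inner_join(course, platform_dup, by = "platform_id"))
)
```

With these values the join on the duplicated key returns more rows than there are courses, which is the kind of unexpected result the note warns about.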
```{r sql join query, results='hold'}
# Perform inner join using SQL query -------------------------------------------
odbc::dbSendQuery(
  tsql_conn,
  "
  SELECT
    c.course_name,
    c.course_desc,
    p.platform_name,
    p.company_name

  FROM education.course AS c
  INNER JOIN education.platform AS p
    ON c.platform_id = p.platform_id

  WHERE is_active = 1 OR is_active IS NULL

  -- select course name and description and their corresponding platform and
  -- company name for platforms that are active or NULL for is_active
  "
) %>%
  odbc::dbFetch() %>%
  knitr::kable()
```

The equivalent R `dplyr` syntax follows the execution order, rather than the written order, of the SQL join query.

**Note:** When using `dbplyr`, the education schema needs to be explicitly passed using the function `in_schema("schema", "table")` inside `tbl()`.

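The `in_schema()` pattern can be tried without a SQL Server instance. The sketch below is an assumption-laden stand-in: it uses an in-memory SQLite database (via `RSQLite`) attached under the name `education` instead of the tutorial's `tsql_conn`, and the table contents are made up.

```r
library(dplyr)
library(dbplyr)

# Assumption: in-memory SQLite stands in for the T-SQL connection
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbExecute(con, "ATTACH DATABASE ':memory:' AS education")
DBI::dbExecute(
  con,
  "CREATE TABLE education.course (course_name TEXT, platform_id INTEGER)"
)
DBI::dbExecute(
  con,
  "CREATE TABLE education.platform (platform_id INTEGER, platform_name TEXT)"
)
DBI::dbExecute(
  con,
  "INSERT INTO education.course VALUES ('R basics', 1), ('SQL basics', 2), ('Orphan', 9)"
)
DBI::dbExecute(
  con,
  "INSERT INTO education.platform VALUES (1, 'Platform A'), (2, 'Platform B')"
)

# Reference each table through its schema, then join in execution order
course <- tbl(con, in_schema("education", "course"))
platform <- tbl(con, in_schema("education", "platform"))

joined <- course %>%
  inner_join(platform, by = "platform_id") %>%
  select(course_name, platform_name) %>%
  collect()

DBI::dbDisconnect(con)
```

The course with no matching platform is dropped by the inner join, so `joined` holds only the two matched courses.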
```{r dbplyr join query, results='hold'}
# Perform inner join using R syntax --------------------------------------------
In SQL, grouping by column(s) causes individual records to be grouped together as record tuples. Because SQL queries return atomic records, `SELECT` can only reference the grouped column(s) and aggregations of other columns.
#   -- this query will generate an error as course_id is no longer an atomic
#   -- record once grouped by platform_id and platform_name
#   "
# )
```

In SQL, `WHERE` and `HAVING` are separate filtering methods. `WHERE` is executed first, across individual records, before the `GROUP BY` statement. `HAVING` is executed after `GROUP BY` to filter across grouped records, and therefore typically takes an aggregation as its input.

As `SELECT` is executed after `GROUP BY` and `HAVING`, the `SELECT` statement can only refer to the column(s) being grouped and to aggregations of other columns.

```{r sql group by query, results='hold'}
# Perform group by and aggregate SQL query -------------------------------------
odbc::dbSendQuery(
  tsql_conn,
  "
  SELECT
    p.platform_id,
    p.platform_name,
    COUNT(course_id) AS total_courses,
    AVG(course_length) AS avg_course_length,
    MIN(course_length) AS min_course_length,
    MAX(course_length) AS max_course_length

  FROM education.course AS c
  INNER JOIN education.platform AS p
    ON c.platform_id = p.platform_id

  WHERE course_length IS NOT NULL

  GROUP BY p.platform_id, p.platform_name

  HAVING COUNT(course_id) > 1

  -- join course and platform tables, filter out courses without a course length
  -- and then group records by platform_id and platform_name
  -- filter to exclude platforms which only have one course
  -- finally select platform name, total course count, average course length, min
  -- course length and max course length
  "
) %>%
  odbc::dbFetch() %>%
  knitr::kable()
```

The equivalent R `dplyr` syntax uses `filter()` before and after `group_by()`, with aggregations performed inside `summarise()`. R also allows `mutate()` to be used after `group_by()` to create a new column from a group-based transformation whose result is returned for every individual record.

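A minimal sketch of that pattern, assuming a local tibble in place of the database tables (the `courses` data below is hypothetical and its values are made up):

```r
library(dplyr)

# Hypothetical joined course/platform records (values are made up)
courses <- tibble(
  platform_name = c("A", "A", "A", "B", "C", "C"),
  course_id     = 1:6,
  course_length = c(10, 20, NA, 5, 8, 12)
)

platform_summary <- courses %>%
  filter(!is.na(course_length)) %>%   # plays the role of WHERE
  group_by(platform_name) %>%         # plays the role of GROUP BY
  summarise(
    total_courses     = n(),          # row count per group
    avg_course_length = mean(course_length),
    .groups = "drop"
  ) %>%
  filter(total_courses > 1)           # plays the role of HAVING

# group_by() + mutate() keeps every record and attaches the group-level value
with_group_mean <- courses %>%
  filter(!is.na(course_length)) %>%
  group_by(platform_name) %>%
  mutate(platform_avg_length = mean(course_length)) %>%
  ungroup()
```

`platform_summary` has one row per platform that still has more than one course after the first filter, whereas `with_group_mean` keeps one row per course and repeats the platform average on each of them.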
```{r dbplyr group by query, results='hold'}
# Perform group by and aggregate using R syntax --------------------------------
+ Emily Riederer's excellent [blog post](https://emilyriederer.netlify.app/post/sql-r-flow/) containing ideas for creating flexible SQL <> R project workflows.
+ Julia Evans's entertaining [programming zine](https://wizardzines.com/zines/sql/) explaining all things SQL. (Paywalled resource)