update sql part 2

erikaduan · erikaduan · commit ef901fad5252 · 2022-01-30T20:20:10.000+11:00
diff --git a/tutorials/p-sql_to_r_workflows/p-sql_to_r_workflows_part_1.Rmd b/tutorials/p-sql_to_r_workflows/p-sql_to_r_workflows_part_1.Rmd
@@ -356,4 +356,4 @@ bigrquery::dbDisconnect(bigquery_conn)
 + An RStudio [guide](https://db.rstudio.com/getting-started/connect-to-database/) on how to connect to an existing database using the `odbc` and `DBI` packages.    
 + An Rstudio [guide](https://db.rstudio.com/databases/big-query/) on how to connect to Google BigQuery using `odbc` or `bigrquery`.   
 + The official [vignette](https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html) to using `dbplyr` for performing SQL queries via the tidyverse API.      
-+ A great series of blog posts by Vebash Naidoo, with [part 1](https://sciencificity-blog.netlify.app/posts/2020-12-12-using-the-tidyverse-with-databases/), [part 2](https://sciencificity-blog.netlify.app/posts/2020-12-20-using-the-tidyverse-with-dbs-partii/) and [part 3](https://sciencificity-blog.netlify.app/posts/2020-12-31-using-tidyverse-with-dbs-partiii/) containing advanced tips for using `dbplyr` to query SQL databases in R .   
++ A great series of blog posts by Vebash Naidoo, with [part 1](https://sciencificity-blog.netlify.app/posts/2020-12-12-using-the-tidyverse-with-databases/), [part 2](https://sciencificity-blog.netlify.app/posts/2020-12-20-using-the-tidyverse-with-dbs-partii/) and [part 3](https://sciencificity-blog.netlify.app/posts/2020-12-31-using-tidyverse-with-dbs-partiii/) containing advanced tips for using `dbplyr` to query SQL databases in R.     
diff --git a/tutorials/p-sql_to_r_workflows/p-sql_to_r_workflows_part_1.md b/tutorials/p-sql_to_r_workflows/p-sql_to_r_workflows_part_1.md
@@ -1,7 +1,7 @@
 How to integrate SQL and R - Part 1
 ================
 Erika Duan
-2022-01-26
+2022-01-30
 
 -   [Introduction to the relational
     model](#introduction-to-the-relational-model)
@@ -467,4 +467,4 @@ bigrquery::dbDisconnect(bigquery_conn)
     and [part
     3](https://sciencificity-blog.netlify.app/posts/2020-12-31-using-tidyverse-with-dbs-partiii/)
     containing advanced tips for using `dbplyr` to query SQL databases
-    in R .
+    in R.
diff --git a/tutorials/p-sql_to_r_workflows/p-sql_to_r_workflows_part_2.Rmd b/tutorials/p-sql_to_r_workflows/p-sql_to_r_workflows_part_2.Rmd
@@ -10,13 +10,14 @@ output:
 
 ```{r setup, include=FALSE}
 # Set up global environment ----------------------------------------------------
-knitr::opts_chunk$set(echo=TRUE, results='hide', fig.show='hold', fig.align='center')  
+knitr::opts_chunk$set(echo=TRUE, warning=FALSE, results='hide', fig.show='hold', fig.align='center')  
 ```
 
-```{r load libraries, message=FALSE, warning=FALSE}  
+```{r load libraries, message=FALSE}  
 # Load required packages -------------------------------------------------------  
 if (!require("pacman")) install.packages("pacman")
 pacman::p_load(here,  
+               dbplyr,
                tidyverse,
                odbc,
                DBI,
@@ -48,8 +49,8 @@ knitr::include_graphics("../../figures/p-sql_to_r_workflows-execution_order.svg"
 
 In SQL, the sequence in which SQL code is written does not correspond to the sequence in which it is executed, which leads to common errors like the example below.   
 
-```{r, results='hold'}
-# Incorrect SQL code -----------------------------------------------------------
+```{r sql select queries, results='hold'}
+# Incorrect SQL query ----------------------------------------------------------
 # odbc::dbSendQuery(
 #   tsql_conn,
 #   "
@@ -62,11 +63,11 @@ In SQL, the sequence in which SQL code is written does not correspond to the seq
 #   WHERE hobbit_id <> 2
 #   
 #   -- this query will generate an error as WHERE is executed before SELECT and
-#   -- the renaming of student_id to hobbit_id happens after WHERE 
+#   -- the renaming of student_id to hobbit_id only happens via SELECT 
 #   "
 # ) 
 
-# Correct SQL code -------------------------------------------------------------
+# Correct SQL query ------------------------------------------------------------
 odbc::dbSendQuery(
   tsql_conn,
   "
@@ -78,26 +79,156 @@ odbc::dbSendQuery(
   FROM education.student
   WHERE student_id <> 2
   
-  -- student_id is first filtered via WHERE and lastly renamed to hobbit_id
+  -- student_id is first filtered via WHERE and then renamed as hobbit_id via SELECT
   "
 ) %>%
   odbc::dbFetch() %>%
   knitr::kable()
 ```
 
 
-## Writing `GROUP BY` SQL queries  
+## Writing SQL `JOIN` queries   
 
-## Writing SQL subqueries  
+The `dplyr` syntax for joining tables is very similar to the corresponding SQL syntax, and the concepts of left, right, inner and full joins are shared across both languages. In SQL, `JOIN` is executed directly after `FROM`, and it is best practice to alias table names (rename via `AS`) so that the source table of each column is explicit.  
+
+**Note:** Always ensure that you are joining on at least one column containing unique IDs to prevent unexpected many-to-many joined results.   
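+
+One way to check the uniqueness of a join key before joining is to count how often each key value appears. The chunk below is a minimal sketch, not part of the original workflow: `platform_example` is a hypothetical in-memory stand-in for `education.platform` and its values are invented for illustration.   
+
+```{r check join key uniqueness, results='hold'}
+# Check that the join key is unique in at least one table ----------------------
+library(dplyr)
+
+# Hypothetical stand-in for education.platform (invented values)
+platform_example <- tibble(
+  platform_id = c(1L, 2L, 3L),
+  platform_name = c("A", "B", "C")
+)
+
+# Key values that appear more than once; zero rows means platform_id is unique
+duplicated_keys <- platform_example %>%
+  count(platform_id) %>%
+  filter(n > 1)
+
+nrow(duplicated_keys) == 0
+```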
+
+```{r sql join query, results='hold'}
+# Perform inner join using SQL query -------------------------------------------
+odbc::dbSendQuery(
+  tsql_conn,
+  "
+  SELECT
+  c.course_name,
+  c.course_desc,
+  p.platform_name,
+  p.company_name
+  
+  FROM education.course AS c
+  INNER JOIN education.platform AS p
+    ON c.platform_id = p.platform_id
+  
+  WHERE is_active = 1 OR is_active IS NULL
+  
+  -- select course name and description and their corresponding platform and
+  -- company name for platforms that are active or NULL for is_active 
+  "
+) %>%
+  odbc::dbFetch() %>%
+  knitr::kable()
+```
+
+The equivalent R `dplyr` syntax follows the execution order, rather than written order, of the SQL join query.   
+
+**Note:** When using `dbplyr`, the education schema needs to be explicitly passed using the function `in_schema("schema", "table")` inside `tbl()`. 
+
+```{r dbplyr join query, results='hold'}
+# Perform inner join using R syntax --------------------------------------------
+tbl(tsql_conn, in_schema("education", "course")) %>%
+  inner_join(tbl(tsql_conn, in_schema("education", "platform")),
+             by = c("platform_id" = "platform_id")) %>%
+  filter(is_active == 1 | is.na(is_active)) %>%
+  select(course_name,
+         course_desc,
+         platform_name,
+         company_name) %>%
+  collect() 
+```
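+
+To see how `dbplyr` translates a join like the one above, the generated SQL can be inspected without a live database connection. The chunk below is a minimal sketch under stated assumptions: `course_sim` and `platform_sim` are hypothetical stand-ins for `education.course` and `education.platform`, built with `dbplyr::lazy_frame()` and `dbplyr::simulate_mssql()`, and their column values are invented. The exact SQL text may differ between `dbplyr` versions.   
+
+```{r dbplyr join translation, results='hold'}
+# Inspect the SQL generated for an inner join using simulated tables -----------
+library(dplyr)
+library(dbplyr)
+
+# Hypothetical stand-ins for the course and platform tables (invented values)
+course_sim <- lazy_frame(course_name = "a",
+                         course_desc = "b",
+                         platform_id = 1L,
+                         con = simulate_mssql())
+
+platform_sim <- lazy_frame(platform_id = 1L,
+                           platform_name = "c",
+                           company_name = "d",
+                           is_active = 1L,
+                           con = simulate_mssql())
+
+# Build the same pipeline lazily and render it as SQL instead of collecting
+join_sql <- course_sim %>%
+  inner_join(platform_sim, by = "platform_id") %>%
+  filter(is_active == 1 | is.na(is_active)) %>%
+  select(course_name,
+         course_desc,
+         platform_name,
+         company_name) %>%
+  sql_render()
+
+join_sql
+```
+
+On a real connection, replacing `collect()` with `show_query()` prints the SQL that would be sent to the database.   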
+
+
+## Writing SQL `GROUP BY` queries  
+
+In SQL, grouping by column(s) causes individual records to be grouped together as record tuples. Because SQL queries return atomic records, `SELECT` can only refer to the `GROUP BY` column(s) and aggregations of other columns.   
+
+```{r incorrect sql group by query}
+# Incorrect SQL query ----------------------------------------------------------
+# odbc::dbSendQuery(
+#   tsql_conn,
+#   "
+#   SELECT
+#   p.platform_name,
+#   COUNT(course_id) AS total_courses,
+#   c.course_id
+#   
+#   FROM education.course AS c
+#   INNER JOIN education.platform AS p
+#     ON c.platform_id = p.platform_id
+#   
+#   GROUP BY p.platform_id, p.platform_name
+#   
+#   -- this query will generate an error as course_id is no longer an atomic 
+#   -- record once grouped by platform_id and platform_name 
+#   "
+# )
+```
 
+In SQL, `WHERE` and `HAVING` are separate filtering methods. `WHERE` is executed first, across individual records, before the `GROUP BY` statement. `HAVING` is executed after `GROUP BY` and filters whole groups, so its condition is typically based on an aggregation.     
+
+As `SELECT` is executed after `GROUP BY` and `HAVING`, this also means that the `SELECT` statement can only refer to the column(s) being grouped and aggregations of other columns.   
+
+```{r sql group by query, results='hold'}
+# Perform group by and aggregate SQL query -------------------------------------
+odbc::dbSendQuery(
+  tsql_conn,
+  "
+  SELECT
+  p.platform_id, 
+  p.platform_name,
+  COUNT(course_id) AS total_courses,
+  AVG(course_length) AS avg_course_length, 
+  MIN(course_length) AS min_course_length,
+  MAX(course_length) AS max_course_length
+  
+  FROM education.course AS c
+  INNER JOIN education.platform AS p
+    ON c.platform_id = p.platform_id
+  
+  WHERE course_length IS NOT NULL
+  
+  GROUP BY p.platform_id, p.platform_name
+  
+  HAVING COUNT(course_id) > 1
+  
+  -- join course and platform tables, filter out courses without a course length
+  -- and then group records by platform_id and platform_name 
+  -- filter to exclude platforms which only have one course
+  -- finally select platform id and name, total course count, average course
+  -- length, min course length and max course length
+  "
+) %>%
+  odbc::dbFetch() %>%
+  knitr::kable()
+```
+
+The equivalent R `dplyr` syntax uses `filter()` before and after `group_by()`, with aggregations performed inside `summarise()`. R also allows `mutate()` to be used after `group_by()` to create a new column from a group-based calculation while keeping every individual record.  
+
+```{r dbplyr group by query, results='hold'}
+# Perform group by and aggregate using R syntax --------------------------------
+tbl(tsql_conn, in_schema("education", "course")) %>%
+  inner_join(tbl(tsql_conn, in_schema("education", "platform")),
+             by = c("platform_id" = "platform_id")) %>%
+  filter(!is.na(course_length)) %>%
+  group_by(platform_id, platform_name) %>%
+  summarise(total_courses = n_distinct(course_id),
+            avg_course_length = mean(course_length),
+            min_course_length = min(course_length),
+            max_course_length = max(course_length)) %>%
+  filter(total_courses > 1) %>%
+  ungroup() %>% 
+  collect()  
+```
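+
+The grouped `mutate()` behaviour can be illustrated by rendering the SQL that `dbplyr` generates for it. The chunk below is a minimal sketch under stated assumptions: `course_sim` is a hypothetical stand-in for `education.course`, built with `dbplyr::lazy_frame()` and `dbplyr::simulate_mssql()`, and its values are invented. The rendered query should contain an `OVER (PARTITION BY ...)` clause, as the group average is attached to every record rather than collapsing each group into one row.   
+
+```{r dbplyr grouped mutate translation, results='hold'}
+# Inspect the SQL generated for a grouped mutate using a simulated table -------
+library(dplyr)
+library(dbplyr)
+
+# Hypothetical stand-in for the course table (invented values)
+course_sim <- lazy_frame(course_id = 1L,
+                         platform_id = 1L,
+                         course_length = 10,
+                         con = simulate_mssql())
+
+# Attach the platform-level average course length to each course record
+window_sql <- course_sim %>%
+  group_by(platform_id) %>%
+  mutate(avg_course_length = mean(course_length, na.rm = TRUE)) %>%
+  ungroup() %>%
+  sql_render()
+
+window_sql
+```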
+
+
+## Writing SQL `WINDOW` functions
+
+
+## Writing SQL subqueries  
 
 
 # Production friendly workflows  
 
 
 # Other resources  
++ A great series of blog posts by Vebash Naidoo, with [part 1](https://sciencificity-blog.netlify.app/posts/2020-12-12-using-the-tidyverse-with-databases/), [part 2](https://sciencificity-blog.netlify.app/posts/2020-12-20-using-the-tidyverse-with-dbs-partii/) and [part 3](https://sciencificity-blog.netlify.app/posts/2020-12-31-using-tidyverse-with-dbs-partiii/) containing advanced tips for using `dbplyr` to query SQL databases in R.     
 + Emily Riederer's excellent [blog post](https://emilyriederer.netlify.app/post/sql-r-flow/) containing ideas for creating flexible SQL <> R project workflows.   
-+ Julia Evan's entertaining [programming zine](https://wizardzines.com/zines/sql/) explaining all things SQL. (Paywalled resource)    
-
-
-    
++ Julia Evans's entertaining [programming zine](https://wizardzines.com/zines/sql/) explaining all things SQL. (Paywalled resource)    
diff --git a/tutorials/p-sql_to_r_workflows/p-sql_to_r_workflows_part_2.md b/tutorials/p-sql_to_r_workflows/p-sql_to_r_workflows_part_2.md