# Google Data Analytics - 3 Prepare - Module 3 - Database essentials
## 0. Overview
>[! quote] When you analyze large datasets, you’ll access much of the data from a database. In this part of the course, you will learn about databases, including how to access them and extract, filter, and sort the data they contain. You’ll also explore metadata to discover its many facets and how analysts use it to better understand their data.
### Learning Objectives
>[! info] Learning objectives for Course 3, module 3
>- [ ] Describe databases with reference to their functions and components
>- [ ] Explain [[Metadata]] as it relates to [[Database|databases]]
>- [ ] Discuss the importance of metadata and how it relates to the work of a data analyst
>- [ ] Recall the issues and steps involved in accessing data from multiple sources
>- [ ] Explain the use of filters and sorting functionality in spreadsheets
>- [ ] Demonstrate how to use spreadsheet functionality to import and inspect a given set of data
>- [ ] Demonstrate how to use SQL functions to extract data from a given database
---
### Glossary for Module 3
- [[Administrative metadata]]
- [[CSV (comma separated values) file]]
- [[Data governance]] | [[Descriptive metadata]]
- [[Foreign key]] | [[FROM]]
- [[Geolocation]]
- [[Metadata]] | [[Metadata repository]]
- [[Naming conventions]] | [[Normalized database]] | [[Notebook]]
- [[Primary key]]
- [[Redundancy]]
- [[Schema]] | [[SELECT]] | [[Structural metadata]]
- [[WHERE]] | [[World Health Organization]]
---
## 1. Work with databases
### 1.1 Video: _All about databases_
>[! cue] DEF [[Metadata]]
- Metadata helps you understand your database; it serves as a reference guide.
>[! info] Metadata
>Data about data.
This part of the process is where analysts figure out what data is going to be the most helpful to them.
### 1.2 Video: _Database features and components_
- [[Database|Databases]] are essential tools for analysts.
>[! cue] Def [[Relational database]]
>[! info] Relational database
>A database that contains a series of related tables that can be connected via their relationships.
In order for tables to have a relationship, one or more fields must exist in both tables.
There are two types of keys:
>[! cue] Def [[Primary key]]
Primary keys must be unique and can never be null or blank.
>[! cue] Def [[Foreign key]]
A foreign key is a column (or group of columns) that refers to a field that is the primary key of another table.
The presenter made a point of emphasizing a real understanding of the concepts of [[Primary key]] and [[Foreign key]], sketched below.
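A minimal sketch of the two keys in generic SQL (not BigQuery-specific); the table and column names are hypothetical:
```SQL
-- customer_id is the primary key of customers: unique, never null or blank.
CREATE TABLE customers (
  customer_id INT PRIMARY KEY,
  customer_name VARCHAR(100)
);

-- orders.customer_id is a foreign key: it refers to the primary key of customers.
CREATE TABLE orders (
  order_id INT PRIMARY KEY,
  customer_id INT,
  FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
);
```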
### 1.3 Reading: _Maximize databases in data analytics_
- Relational databases enable fast analysis because of the ability to link, or relate, tables to one another.
>[! cue] Def [[Normalized database]]
Normalization is a process of organizing data in a relational database to eliminate redundancy.
- Example: creating tables and establishing relationships between those tables, as in the sketch below.
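As a sketch of why those relationships matter: once data is normalized, a join reassembles related tables on their keys instead of duplicating fields. Table and column names here are hypothetical:
```SQL
-- Customer details live only in customers; each orders row stores just a
-- customer_id, and the join recombines the data without redundancy.
SELECT
  orders.order_id,
  customers.customer_name
FROM orders
JOIN customers
  ON orders.customer_id = customers.customer_id;
```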
### 1.4 Reading: _Inspect a dataset, a guided hands on tour_
For this activity, a messy dataset is provided in a spreadsheet.
The key takeaway is that the data provided might not be sufficient to answer all the questions being posed.
### 1.5 Quiz: _Test your knowledge on working with databases_
4 questions, untimed; 75% minimum to pass.
#### Results:
75%, due to an error on my part with the definition of [[Relational database]].
It was a fill-in-the-blank question and I interpreted it incorrectly (fields that can be connected, as opposed to the definition's tables that can be connected). I understand the mistake.
---
## 2. Manage data with metadata
### 2.1 Video: _Demystify metadata_
> Let's talk about how you can describe that data. ... [[Metadata]]
E.g., a demo of a photo and its metadata.
Sending and receiving email also includes additional metadata.
Metadata is information used to describe data.
> Metadata is used in database management to help data analysts interpret the contents of the data within the database.
Three types of metadata are encountered as an analyst:
1. [[Descriptive metadata]]
2. [[Structural metadata]]
3. [[Administrative metadata]]
### 2.2 Reading: _Metadata is as important as the data itself_
This reading went over a basic tour of different "encounters" with metadata.
- emails
- websites
- books
- audiobooks
- Photos
### 2.3 Reading: _Metadata and metadata repositories_
#### Benefits of Metadata
- Reliability: metadata helps analysts confirm that the data they are utilizing is
    - Accurate
    - Precise
    - Relevant
    - Timely
- Consistency: consistent databases make life easier. Data can be
    - Organized
    - Classified
    - Stored
    - Accessed; efficient access is enabled when consistent methods are used.
#### [[Metadata repository]]
- These repos help ensure data is both reliable and consistent.
- They are specialized databases created specifically to store and manage metadata.
- Collibra!!! (one example of a commercial metadata-management tool)
- They enable discovery of which tables/columns are available for analysis and might be relevant to a project; see the sketch below.
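A minimal sketch of such discovery in BigQuery (using `INFORMATION_SCHEMA`, a BigQuery feature rather than a dedicated repository tool; the dataset is the public one used later in these notes):
```SQL
-- List the tables available in a dataset by querying its metadata.
SELECT table_name
FROM `bigquery-public-data.sunroof_solar.INFORMATION_SCHEMA.TABLES`;
```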
#### Metadata on external db's
- Analysts should take great care to understand all the data that they will be using. This includes verifying external sources to avoid biased results.
### 2.4 Video: _Manage data with metadata_
==essential tool: used to create a single source of truth==
Metadata is stored in a single central location
>[! cue] Def: [[Data governance]]
Another important aspect is [[Data governance]]
>[! info] Data governance
>A process to ensure the formal management of a company's data assets.
A key benefit is that it enables a company to:
- better control its data
- manage issues related to data security and data privacy
- maintain data integrity
- maintain usability, and manage internal and external data flows
Data governance is more than just rules and definitions.
Metadata analysts are mentioned here.
### 2.5 Video: _Megan: Fun with metadata_
Great video from a metadata practitioner.
### 2.6 Quiz: _Test your knowledge on metadata_
4 questions, untimed, 75% needed to continue
#### Results:
100%
---
## 3. Access different data sources
### 3.1 Video: _So many places to find data_
- Discuss different places analysts go to connect with data
2 basic types of data:
- internal
- external
Gathering data can be complicated.
Sometimes internal data doesn't give you the full picture.
Openness has created a lot of data for analysts to use.
### 3.2 Reading: _Working with csv files_
To use CSV files, you need them on your local machine first.
Once local, you will then need to import that data into an application.
- Chrome can display a CSV in the browser.
- Python needs specific commands.
- Tableau: not covered yet.
- Spreadsheet programs often have no problem importing a CSV into the spreadsheet environment.
### 3.3 Reading: _Step-by-step: Import data from spreadsheets and databases_
Basic workflow:
1. Get the CSV onto your local machine.
2. Select an application to process it (Google Sheets).
3. Upload the CSV, i.e., import it into Sheets.
### 3.4 Video: _Import data from spreadsheets and databases_
Covers all the basics.
Of note:
The presenter mentioned that they are often in spreadsheets viewing data, or subsets of data, to see whether the data is useful.
>[! cue] Spreadsheets are useful
So the spreadsheet tool, though boring and like 4,500 years old, is a vital component of the prepare phase of the [[data analysis process]].
### 3.5 Reading: _Import data dynamically_
Learning how to import data dynamically.
>[! cue] Function: `IMPORTRANGE`
>REF: [Documentation](https://support.google.com/docs/answer/3093340?hl=en&ref_topic=9199554)
The [[IMPORTRANGE|=IMPORTRANGE()]] function imports all or part of a dataset from another Google Sheet.
It takes two arguments:
- The URL of the source sheet
- The name of the sheet and the range of cells to import
```spreadsheet
=IMPORTRANGE("URL","sheet_name!cell_range")
```
Example:
```spreadsheet
=IMPORTRANGE("https://docs.google.com/thisisatestabc123", "sheet1!A1:F13")
```
>[! cue] Function: `IMPORTHTML`
>REF: [Documentation](https://support.google.com/docs/answer/3093339?hl=en)
[[IMPORTHTML|=IMPORTHTML()]] is a basic web-scraping technique: it imports an HTML table or list from a web page into your Google Sheet.
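Its documented form is `IMPORTHTML(url, query, index)`, where `query` is either `"table"` or `"list"` and `index` picks which table or list on the page to pull. The URL and index below are placeholders:
```spreadsheet
=IMPORTHTML("https://en.wikipedia.org/wiki/Example_page", "table", 1)
```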
>[! cue] Function: `IMPORTDATA`
>REF: [Documentation](https://support.google.com/docs/answer/3093335?hl=en)
Slightly different from [[IMPORTRANGE|=IMPORTRANGE()]]:
[[IMPORTDATA|=IMPORTDATA()]] imports data from a given URL in .csv (comma-separated values) or .tsv (tab-separated values) format.
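It takes the URL of the file as its single required argument; the URL below is a placeholder:
```spreadsheet
=IMPORTDATA("https://www.example.com/data.csv")
```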
### 3.6 Reading: _Explore public datasets_
[[Open data]] results in a lot of public datasets. Yippee.
This reading covers a list of open data sources.
#### General open data sources
1. [Google Cloud Public Datasets](https://cloud.google.com/public-datasets)
2. [Dataset Search](https://datasetsearch.research.google.com/)
3. [Kaggle](https://www.kaggle.com/datasets?utm_medium=paid&utm_source=google.com+search&utm_campaign=datasets&gclid=CjwKCAiAt9z-BRBCEiwA_bWv-L6PpACh6RzmrJjQjmNGCCE7kky1FCtc6Jf1qld-4NwDMYL0WsUyxBoCdwAQAvD_BwE)
4. [BigQuery](https://cloud.google.com/bigquery/public-data)
#### Public health datasets
1. [Global Health Observatory](https://www.who.int/data/collections)
2. [The Cancer Imaging Archive (TCIA) dataset](https://cloud.google.com/healthcare/docs/resources/public-datasets/tcia)
3. [1000 Genomes](https://cloud.google.com/life-sciences/docs/resources/public-datasets/1000-genomes)
#### Public climate datasets
1. [National Climatic Data Center](https://www.ncei.noaa.gov/products)
2. [NOAA Public Dataset Gallery](https://www.climate.gov/maps-data/datasets)
#### Public social-political datasets
1. [UNICEF State of the World's Children](https://data.unicef.org/resources/dataset/sowc-2019-statistical-tables/)
2. [CPS Labor Force Statistics](https://www.bls.gov/cps/tables.htm)
3. [Stanford Open Policing Project](https://openpolicing.stanford.edu/)
### 3.7 Quiz: _Test your knowledge on accessing data sources_
4 questions, 75% min score required
#### Results:
100%
---
## 4. Sort and filter data
### 4.1 Video: _Sort and filter to focus on relevant data_
No notes
### 4.2 Hands-on: _Clean data in spreadsheets with sorting and filtering_
No notes
### 4.3 Self-reflection: _Compare databases and spreadsheets_
No notes
### 4.4 Quiz: _Test your knowledge on sorting and filtering_
4 questions, untimed. 75% minimum score to pass.
#### Results:
100%
---
## 5. Large datasets in SQL
### 5.1 Video: _Get to know BigQuery, including sandbox and billing options_
Goes over the options between the sandbox account type and the free-trial account type.
For the course, only the sandbox is needed.
#### Sandbox account type limitations
- 12 projects at a time
- No INSERT of new records into a database
- No UPDATE of field values in existing records
The presenter assures that for this course, the sandbox is sufficient.
[[BigQuery]] is a database management system in the cloud.
The course authors do not delineate a difference between a database and database management system (DBMS) software, though there is a difference.
Suffice it to say that BigQuery is a DBMS.
### 5.2 Reading: _Set up your BigQuery account_
Expanded info and directions on setting up a BigQuery account.
Getting to the sandbox is convoluted; it's recommended to access it from the docs page:
[BigQuery Sandbox Docs](https://cloud.google.com/bigquery/docs/sandbox#limits)
### 5.3 Reading: _Get Started with BigQuery_
#### Explorer
>[! tip] Use the `+ADD` button to add datasets
In this activity, we added a public dataset and learned some features of BigQuery.
### 5.4 Step-by-step: _BigQuery in action_
This step-by-step walked through the process that will be executed in the video:
```SQL
SELECT *
FROM `bigquery-public-data.sunroof_solar.solar_potential_by_postal_code`
WHERE state_name = 'Pennsylvania'
LIMIT 1000;
```
Note that the FROM source is a BigQuery resource path (`project.dataset.table`). This should auto-populate when the data source was previously selected.
### 5.5 Video: _BigQuery in action_
The presenter walks through a live demo of the process from the prior section.
### 5.6 Hands-on: _Introduction to BigQuery_
### 5.7 Reading: _In depth guide: SQL best practices_
1. Use snake_case for column names.
2. Use CamelCase for table names (see the sketch below).
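A quick sketch combining both conventions; the table and column names are hypothetical:
```SQL
-- snake_case for columns, CamelCase for the table name.
SELECT
  ticket_holder_name,
  number_of_tickets
FROM
  TicketsByOccasion;
```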
### 5.8 Hands-on: _Create a custom table in BigQuery_
This activity practiced importing data into a new dataset in BigQuery.
### 5.9 Hands-on: _Choose the right tool for the job_
### 5.10 Hands-on: _More practice with SQL_
### 5.11 Quiz: _Test your knowledge on using SQL with large datasets_
---
## 6. Module 3 challenge
### 6.1 Review: _Review the Glossary of terms covered in this module_
- [[Administrative metadata]]
- [[CSV (comma separated values) file]]
- [[Data governance]] | [[Descriptive metadata]]
- [[Foreign key]] | [[FROM]]
- [[Geolocation]]
- [[Metadata]] | [[Metadata repository]]
- [[Naming conventions]] | [[Normalized database]] | [[Notebook]]
- [[Primary key]]
- [[Redundancy]]
- [[Schema]] | [[SELECT]] | [[Structural metadata]]
- [[WHERE]] | [[World Health Organization]]
### 6.2 Challenge: _Module 3 Challenge Quiz_
10 questions, timed. Minimum of 80% to pass.
#### Results
100%
---
## Summary and takeaways
>[! summary-top] Module 3 Key Takeaways
>Note page summary