# SQL Wizard

This notebook is designed to teach you the language of SQL. SQL stands for Structured Query Language. It is a language designed to query databases. These queries allow you to pull data from the database, update rows in the database, and even delete data permanently. There are many different SQL databases, each of which can have its own plugins to extend its capabilities even further. Remember that apartment searching tool I made for you? That's all using SQL :) It just also used a plugin meant for operating on geographic locations.

The most common SQL database is SQLite. It is a lightweight library which any program can embed within itself to give it access to a SQL database. Most other databases are separate servers which must be run as entirely separate software hosted on a server somewhere. Because of this, SQLite is very easy to get started with.

Run the following cell to load SQL capabilities into Jupyter:

In [1]:
%load_ext sql



### Getting started with SQL

Before we can get started with SQLite, let's first open a database! This is something you only need to do once in Jupyter since it will keep the database open for you. Run the following cell to open the database called "properties.db", which as you'll find out contains information about every Railey property at Deep Creek Lake.

SQL databases all generally follow this model: every database file can contain multiple "tables", and each table contains rows and columns. If you know Excel, this should be very familiar. Each Excel file can contain multiple sheet tabs at the bottom, which let you store multiple spreadsheets in the same file. Those tabs at the bottom are equivalent to SQL tables.
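To make the spreadsheet analogy concrete, here is a minimal Python sketch using the standard library's `sqlite3` module and a throwaway in-memory database. The table names `cabins` and `owners` are invented for illustration and are not part of `properties.db`.

```python
import sqlite3

# ":memory:" creates a temporary database that disappears when the connection closes.
connection = sqlite3.connect(":memory:")

# Two hypothetical tables, like two sheet tabs in one Excel file.
connection.execute("CREATE TABLE cabins (name TEXT, bedrooms INTEGER)")
connection.execute("CREATE TABLE owners (name TEXT, phone TEXT)")

# SQLite keeps a catalog of its tables in a built-in table called sqlite_master.
rows = connection.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
table_names = [row[0] for row in rows]
print(table_names)

connection.close()
```

One database, two tables: the same shape as one Excel file with two tabs.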

In [2]:
%sql sqlite:///properties.db

### Your first query

SQL queries are made of individual statements. These statements start with a word which specifies what you'd like to do, such as `SELECT` to retrieve data, `UPDATE` to update existing rows, and `DELETE` to delete records from the database. Let's practice a SELECT statement.

The syntax of `SELECT` is as follows: `SELECT <the names of columns I would like returned separated by commas> FROM <table name>;`. SQL also has a shorthand for selecting all columns: `SELECT * FROM <table name>;`
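If it helps to see both forms side by side before trying the exercise, here is a small sketch that runs them from Python with the standard `sqlite3` module. The `cabins` table and its rows are made up for this example; they are not the data in `properties.db`.

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# Hypothetical table with three columns and two rows.
connection.execute("CREATE TABLE cabins (name TEXT, bedrooms INTEGER, has_hot_tub INTEGER)")
connection.executemany(
    "INSERT INTO cabins VALUES (?, ?, ?)",
    [("Lakeside", 3, 1), ("Hilltop", 5, 0)],
)

# Named columns: only the columns you list come back, in the order you list them.
some_columns = connection.execute("SELECT bedrooms, name FROM cabins;").fetchall()

# The * shorthand: every column comes back, in the order the table defines them.
all_columns = connection.execute("SELECT * FROM cabins;").fetchall()

print(some_columns)
print(all_columns)

connection.close()
```

Each returned row is a tuple with one entry per selected column.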

One last thing before we dive in! To make a Jupyter notebook cell run SQL, you must start the cell with `%%sql`. Without this, you'll likely get some errors since Jupyter will try to run it as Python code. I will include this for you to make your life easier, but you should know why it's there :)

Additionally, some queries will return a lot of data, so you may want to click on the area to the left of the returned data to shrink it.

### Exercise 1: Getting started with SELECT

As a first exercise, try selecting all columns from the table `all_railey_properties`.

In [None]:
%%sql
SELECT * FROM all_railey_properties;

## Indexing review

Great job with that exercise! Indexing is a valuable tool when working with sequences and we'll be relying on it heavily in the rest of the module. 

Let's learn a little bit more about indexing. The first problem we'll look at is how to get the last character in a string programmatically. Above, since you know the value ahead of time, you can simply count and hard code the index of the last character. But what if you're not working on data with a known size? Here's a small demo - enter your name in the box below:

In [4]:
student_name = input()

 Christopher


In [5]:
length = len(student_name)
print(length)

11


If you want to print the last character of a string whose length you don't know ahead of time, you can use the length to index the last character:

In [6]:
last_index = length - 1
print(student_name[last_index])

r


### Exercise 2: Why do we use `length - 1` for the index of the last character instead of `length`?

Answer: 

### One More Indexing Trick

This code is powerful because it works for strings of any length, not just 6-letter strings. Python also has a more idiomatic way of indexing the last item in a sequence.

In [7]:
print(student_name[-1])

r


Using negative indices starts from the end of a sequence and moves backwards, so `-2` is the second to last character and so on.
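Here is a tiny sketch of how negative indices line up with the length-based approach from earlier. The string `word` is just a made-up example value.

```python
word = "python"  # hypothetical example string

last = word[-1]            # same item as word[len(word) - 1]
second_to_last = word[-2]  # same item as word[len(word) - 2]

print(last, second_to_last)
```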

## Iterating through sequences

A very important part of working with data is iteration: going through the items in a list one by one, either until you find the one you need or to do some analysis of each one. Let's examine a wordle letter by letter:

In [8]:
wordle = "carry"

for letter in wordle:
    print(letter)

c
a
r
r
y


The syntax for a `for` loop is:

```
for NEW_VARIABLE in SEQUENCE:
    # Code which gets called once for each item in SEQUENCE, with NEW_VARIABLE being updated to
    # the next item in the sequence after the block inside the for loop is run
```

`for` loops are great whenever you want to run a bit of code for each element in a sequence. Let's do something more interesting with the block of code!

### Exercise 1: Print a message each time you see an R in the wordle

Hint: use an if statement

In [9]:
for letter in wordle:
    # Your code goes here
    pass

### Exercise 2: Count how many R's you see in the wordle and print the result after the for loop

Hint: This builds upon your previous exercise. Don't be afraid to create a new variable

In [10]:


for letter in wordle:
    pass

### Advanced looping

Brief reprieve: remember that you can exit a loop early with the `break` keyword. This is useful when you only want to act on the first instance of something: act on it, then break. It helps answer questions like "Are there any R's in this word?" Since you only need to find one R to answer that question, checking the rest of the word after you've found one is redundant work, so stopping as soon as possible saves time.
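As a quick illustration on different data (so the exercise below is still yours to solve), this sketch stops at the first negative number in a list. The `temperatures` list and the `checked` counter are invented for the example.

```python
temperatures = [4, 7, -2, 9, -5]  # hypothetical readings

checked = 0
found_negative = False

for reading in temperatures:
    checked += 1
    if reading < 0:
        found_negative = True
        break  # stop looping: one negative reading answers the question

print(found_negative, checked)
```

The loop looked at only three of the five readings before stopping.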

### Exercise 3: Print only one message for the first R in the wordle

Hint: Copy your code from the first exercise and figure out how to modify it to only print once

In [11]:
for letter in wordle:
    # Your code goes here
    pass

Did you know that if you just want to find out whether an element is in a list, Python has an easy shortcut for that?

In [12]:
if 'r' in wordle:
    print("We have an R!")
else:
    print("No R in the wordle")

We have an R!


Neat, right? Feel free to change the code above and play around with it. The `in` operator with if statements will be very useful later in this notebook, so keep it in mind.

## List of lists

A lot of data is multidimensional. If we have a list of words, to Python that looks like a bunch of sequences inside one large sequence. We'll go through some simple examples to build some skills before we work with the full wordle dictionary.


In [13]:
dictionary = ['apple', 'butt', 'carp', 'dick']

If you loop through `dictionary` with a for loop, each word will be accessed one after another.

### Exercise 1: Print each word in the dictionary

Hint: use a for loop

In [14]:
# Your code here

But how are you supposed to access the letters in each word if you want to process each letter, not each word? Easy! Use a for loop inside of your first for loop :)

### Exercise 2: Print each letter in the dictionary (in order)

Hint: your code from the previous answer should be very useful here

In [15]:
# Your code here

### Multidimensional indexing

Congratulations! You just did multidimensional indexing! What now? Multidimensional indexing is just a fancy way of saying we'll need to index a sequence which contains sequences multiple times to get one single item from our dictionary.

Let's take a look at an example:

In [16]:
first_word = dictionary[0]
first_letter_of_first_word = first_word[0]

print(first_word)
print(first_letter_of_first_word)

apple
a


Try modifying the example above to get the second letter of the first word, the second letter of the second word, or even the last letter of the last word. (Bonus points if you remember the trick from before)

Here's another trick: indexing is an expression which returns a value. In the example above, we store that value in a temporary variable called `first_word`. We don't need to do that though, we can actually combine the indexing for a particular letter into one line without a temporary variable. Try playing around with the indexes below to get a feel for it.

In [17]:
a_cool_letter = dictionary[0][0]
print(a_cool_letter)

a


In [18]:
# Technically we don't even need the letter variable either. As the programmer, it's up to you to decide how
# explicit you want your code to be. There is no single "right" amount of explicitness
print(dictionary[0][0])

a


## Indexing with numbers

Instead of using a for loop to automatically go through each element in a sequence, occasionally it's useful to use the for loop to produce indices instead of values. That's a lot of words which probably prompts a "why?", so let's jump into an example:

In [19]:
for index in range(5):
    print(wordle[index])

c
a
r
r
y


Well that didn't explain anything but trust me it will in due time. Really this probably just looks like a complicated and annoying way to print each letter. Before we get into why this is useful I'd like to highlight that index is a variable that gets assigned the values 0-4 (If you like an illustration, feel free to add a `print(index)`). The indexing operator, `[]`, can take any expression, not just integer literals. These facts are why this works.
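To see both facts at once, here is a small sketch on a made-up list called `colors`. It prints the index itself, the item at that index, and the item at an index computed by an expression.

```python
colors = ["red", "green", "blue"]  # hypothetical list, three items long

for index in range(3):
    # index is 0, then 1, then 2; any expression that produces an integer can go inside [].
    print(index, colors[index], colors[2 - index])
```

The last column walks the list backwards because `2 - index` counts down while `index` counts up.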

Now to answer the question I've been avoiding. Why is this ever useful? It's a niche thing but it's useful if you ever want to go through multiple sequences in lock step at the same time. Modify the example above to print each guessed letter alongside the actual letter in that position in the wordle:

In [20]:
# Here is a wordle guess a user input
wordle_guess = "carts"

for index in range(5):
    print(wordle[index])

c
a
r
r
y


### Exercise 1: Print how many letters are in the correct position

Building on top of your previous solution, if you have the guess's letter at position x and the wordle's letter at position x, now you can compare them and count how many are correct. This builds off of a lot of previous examples so don't be afraid to look back

In [21]:
# Your code goes here

## Validating data

Imagine you didn't get the wordle dictionary from a reputable site and instead got one with a bunch of extra crap in it. Data cleaning and validation is a huge part of data analysis. It means answering questions like "Is this a valid word?" and "Is it the right size?" or, for other datasets, "Does every person have an email address, or is anyone's email address empty?"

Sometimes you just need answers: "Is the data ok". Other times you'll need to decide how you want to fix problems in your data.

Let's start with a simple example. Your sketchy wordle dictionary has words that are too short and too long in it. A very simple solution to this problem is to go through every entry in the sketchy dictionary and add it to a "known good wordles" list if it's the appropriate size. Let's try this below:

Hint: `.append()` is a method to add values to a list
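In case `.append()` is rusty, here is a minimal sketch on a made-up list (the name `shopping_list` is just for illustration):

```python
shopping_list = []  # hypothetical empty list

shopping_list.append("eggs")   # the list is now ["eggs"]
shopping_list.append("flour")  # new items always go on the end

print(shopping_list)
```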

In [22]:
sketchy_dictionary = ["carts", "carry", "carriage", "ropes", "tights", "dude", "horse"]
validated = []

# Your code goes here

        
print(validated)

[]


Congrats! You successfully cleaned and validated that bad dictionary

## Analyzing some wordles

Finally! The main event :) First things first, let's talk about CSV (comma-separated values) files. Our wordle datasets are stored in two different CSV files. Please take the opportunity to go look at them by clicking the folder icon on the sidebar and double clicking either of the two CSV files to explore them. When you're done, click on the Wordle Master tab to come back here.

CSV files are a very simple "spreadsheet" format. They give you rows and columns. These files have one column called "word". If you were to open the file in a text editor, it would look like:

```
word,
aback,
abase,
abate,
abbey,
abbot,
abhor,
abide,
abled,
abode,
...
```

Each column in a CSV file is separated with a comma (which is where it gets its name from) and each row of the spreadsheet is separated with a newline character. It's a very simple text format, but it's very easy to work with in any programming language and is therefore pretty ubiquitous. Sometimes, but not always, the first row of a CSV file is a header row which provides a label for each column. If you are going to process a CSV file, it's important to know whether the first line is a header or part of the data so you can process it accordingly.
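Here is a small sketch of what that looks like in practice, using Python's built-in `csv` module on a CSV document held in a string. The column names and rows are invented for the example, and `io.StringIO` is only there so the string can stand in for a file.

```python
import csv
import io

# A small hypothetical CSV document: one header row, then two data rows.
csv_text = "name,bedrooms\nLakeside,3\nHilltop,5\n"

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)     # the first row holds the column labels
data_rows = list(reader)  # everything after the header is data

print(header)
print(data_rows)
```

Notice that every value comes back as a string, even the numbers. If the header row were treated as data, `name` and `bedrooms` would show up as if they were a property.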

Below is some Python code which reads these CSV files into two lists. We can go through it together after you take a look. See if you can take a guess as to what's happening:

In [23]:
valid_solutions = []

import csv

with open("valid_solutions.csv", "r") as csvfile:
    csvreader = csv.reader(csvfile)

    # This skips the first header row of the CSV file.
    next(csvreader)
    
    for row in csvreader:
        valid_solutions.append(row[0])

In [24]:
valid_guesses = []

import csv

with open("valid_guesses.csv", "r") as csvfile:
    csvreader = csv.reader(csvfile)

    # This skips the first header row of the CSV file.
    next(csvreader)
    
    for row in csvreader:
        valid_guesses.append(row[0])

Now the variables `valid_solutions` and `valid_guesses` are usable in other cells below! The block of code is pretty straightforward but introduces a lot of new things we won't dive into too much in this module. All this code does is use the built-in Python `csv` library to read a CSV file (opened with the `open` function). We use that reader to read each entry from the first column of each row into the lists `valid_guesses` and `valid_solutions`.

Wordle has two word lists: one for all possible valid guesses (every 5 letter word in the dictionary) and a hand-curated one from which the solution is chosen (so the player doesn't feel cheated by an esoteric word they've never heard of before). Since our word lists are in Python lists now, let's use our data processing skills to see how long each list is. After you figure that out, determine how many more words there are in the `valid_guesses` list than in the `valid_solutions` list:

In [25]:
# Your code goes here

Look at you! Answering interesting questions about the wordle dataset and we've barely got started! Let's try something harder!

### Exercise 1: Count how many solutions start with the letter A

In [26]:
# Your code goes here

### Exercise 2: Count how many solutions contain an A

In [27]:
# Your code goes here

### Exercise 3: Pick a wordle from the solutions list at random

Hint: You've done this before but this is a good time to break out google and see if you can figure it out again if you forgot ;)

In [28]:
# Your code goes here

### Exercise 4: Determine if a given word is in the solution list

Hint: There's an operator we've used before in this module which lets you test whether a value is contained in a given sequence. If you can't find it or remember it, there's always the trusty for loop.

In [29]:
is_valid_solution1 = "grass"
is_valid_solution2 = "aiyee"

# Your code goes here

### Exercise 5: Print all the words that start with "sh"

This one is the first exercise which has a bit of practical value for you. If you have a word that starts with a green 'sh', this will let you see which solutions are still possible.

In [30]:
# Your code goes here

### Aside about comparing more than just one letter

Slicing a list is something which goes hand in hand with indexing, but we have not touched on it yet. If you remember, slicing is a way to return a subset of a list. This can make a problem like the one above easier.

In [31]:
to_slice = "shell"

print(to_slice[0:1])
print(to_slice[0:2])
print(to_slice[:2])
print()
print(to_slice[2:5])
print(to_slice[2:])

s
sh
sh

ell
ell


Play around with the indices above and try to get a feel for it! What happens if the starting index is after the ending index? What happens if the starting index is negative and you don't provide an ending index? Can you use two negative indexes? Can you use negative indexes to get the last two characters of the word?

As a refresher, the slicing syntax appears inside the indexing operator and instructs python to return a subset of the list where the number before the colon is the starting index and the number after the colon is where python goes up to but doesn't include. If the starting number is left out, that means start from the beginning and go to the ending index. If the ending index is left out that means start at the starting index and go to the end. If they are both left out, it returns a copy of the list from the beginning to the end. Not the most useful thing but possible.
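The four cases described above, sketched on a made-up list called `letters`:

```python
letters = ["a", "b", "c", "d", "e"]  # hypothetical list

middle = letters[1:3]     # starts at index 1, stops before index 3
from_start = letters[:3]  # start omitted: begins at index 0
to_end = letters[3:]      # end omitted: runs through the last item
full_copy = letters[:]    # both omitted: a new list with the same items

print(middle, from_start, to_end)
# Equal contents, but not the very same list object.
print(full_copy == letters, full_copy is letters)
```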

The reason I bring all of this up is because of the previous exercise. An easy way to get the first two letters is with slicing. The cell below applies the idea to the three-letter prefix "sha"; for Exercise 5 itself, compare `solution[:2]` against "sh" instead:

In [32]:
for solution in valid_solutions:
    if solution[:3] == "sha":
        print(solution)

shack
shade
shady
shaft
shake
shaky
shale
shall
shalt
shame
shank
shape
shard
share
shark
sharp
shave
shawl


The alternative would be writing a much larger if statement with `and` and explicit indexing. This is a fairly compact and readable solution, so long as the reader understands slicing syntax.
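For comparison, here is a sketch of both versions side by side. It uses a short made-up `words` list instead of the real solutions list so the result is easy to check by eye.

```python
words = ["shack", "sheep", "stash", "shawl"]  # hypothetical sample, not the real solutions list

with_indexing = []
with_slicing = []

for word in words:
    # Explicit indexing: one comparison per letter, joined with `and`.
    if word[0] == "s" and word[1] == "h" and word[2] == "a":
        with_indexing.append(word)
    # Slicing: one comparison against the whole prefix.
    if word[:3] == "sha":
        with_slicing.append(word)

print(with_indexing)
print(with_slicing)
```

Both lists come out the same; the slicing version just says it in fewer words, and it stays one comparison no matter how long the prefix gets.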