China is an economic superpower. It is one of the world’s largest economies, and has tremendous growth each year. But Japan is also one of the world’s largest economies - how much of Asia’s total GDP does China generate, compared to the other Asian countries? How could you answer this question?
Turn to Page 30 and take two minutes to write down what you think.
We have our countries table, which lists every country in the world and shows the GDP of each. But to answer this question, we need to learn two new things:
How to write code that checks if a country is in Asia.
How to write a query that uses that check, so that we can generate a table showing only countries in Asia. Essentially, we want to create a filter that traps all the rows we want - getting rid of the ones we don’t.
Booleans and Comparison
(Time 10 minutes)
Let’s start with our first item: how to write Boolean expressions.
What do you think each of these expressions evaluates to?
5 + 3
4 * 2
3 > -2
8 < (1 + 1)
What type do the last two expressions produce?
We recommend projecting or displaying what the last two expressions evaluate to during a live-coding session.
The last two expressions produce a new type, called a Boolean. A Boolean can only be one of two values: true or false. Computer scientists and data scientists use Boolean values whenever they are asking yes-or-no questions of data. For example, the expression 2 > 1 is asking the question "is 2 greater than 1?". The answer is yes, so the computer will produce true.
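For the live-coding demonstration, the four expressions above can be entered one at a time in the Interactions Area. A minimal sketch, with the value each one produces shown as a comment:

```pyret
# Arithmetic expressions produce Numbers
5 + 3          # 8
4 * 2          # 8

# Comparison expressions produce Booleans
3 > -2         # true
8 < (1 + 1)    # false
```

Note that true and false are printed without quotation marks, which is a quick way to show students they are not Strings.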
Some students may confuse true and false for Strings, because "true and false are words". Point out that what is printed does not have any quotation marks, and only Strings are in quotes!
What does each of these expressions evaluate to?
18 > 18
18 >= 18
-5 <= 20
-4 == -4
3 == 2
(-8 + 8) == 0
(12 - 4) == (-4 + 12)
Is there a difference between = and ==?
You can point out the difference between definitions (=) and equality expressions (==) by defining x = 4 in the Interactions Area and then entering x = 10. This produces an error because x is already defined, whereas x == 10 produces false.
== is very different from =, which defines a variable to be equal to some value, whereas == asks a question: are these two things equal?
Pyret also allows you to ask "are these two values NOT equal?" with this operator: <>.
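The difference between defining and comparing can be sketched as follows. The name x and the value 4 are taken from the teacher note above; everything else is illustrative:

```pyret
x = 4          # definition: x now names the value 4
               # (entering x = 10 after this would be an error,
               #  because x is already defined)

x == 4         # true:  "is x equal to 4?"
x == 10        # false: "is x equal to 10?"
x <> 10        # true:  "is x NOT equal to 10?"

18 > 18        # false: 18 is not strictly greater than itself
18 >= 18       # true:  "greater than or equal to" includes equality
```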
Turn to Page 31 in your workbooks, and complete the exercise. Call over the teacher when you have finished the worksheet.
The exercise contains challenge questions where students must compare Strings for equality. Some students may have some intuition about this, but this activity "salts the waters" with a discussion of String comparisons.
The second table has expressions that evaluate to Booleans, but they differ from the other Boolean expressions because they compare Strings for equality.
Strings can only be equal if they are EXACTLY equal, down to every character. If two strings have the same characters, but one is upper case and the other is lower case, they are NOT equal!
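A few comparisons, chosen here purely for illustration, show how strict String equality is:

```pyret
"Asia" == "Asia"     # true:  every character matches
"Asia" == "asia"     # false: upper-case A vs. lower-case a
"Asia" == "Asia "    # false: the second String has a trailing space
"Asia" <> "asia"     # true:  the two Strings are NOT equal
```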
How might the last expression (continent == "Asia") be useful to us, if we want to find out how China’s GDP stacks up compared to other Asian countries?
A very common bug when writing sieve queries is for students to use the incorrect case, or add extra spaces, within the target String. If students are having trouble with their programs, or if their sieve queries produce completely empty tables, ask them if their target String is exactly what they want it to be.
(Time 20 minutes)
Now that we know how to write Boolean expressions, we can start using them inside our queries. A "sieve" is a tool used to separate gold from worthless dirt, and we use the sieve operator to separate the rows we care about from the ones we don’t.
"Why not call it filter?" Sometimes we use filters to keep stuff we like, and other times we use them to remove stuff we dislike. Sieve might be a strange word, but there’s no ambiguity about "keeping the gold"!
Type restaurants-sieved into the Interactions Area.
What is different between the restaurants table and the restaurants-sieved table?
Guide students’ discussion towards these points:
restaurants-sieved has fewer rows
All of the restaurants in restaurants-sieved have a rating greater than or equal to 4.0
Let’s explore this query piece by piece:
Every sieve query starts with the same keyword: sieve.
Next we name the table we want to sieve. In this case, it is the restaurants table.
After the table name comes the keyword using, which tells Pyret what columns are being used to ask a question of each row.
Then we give the name of each column being used, followed by a colon (:). In this case we are only using one column, which is the rating column.
On the next line is the most important part: a Boolean expression that asks the question "is the restaurant’s rating at least 4.0?". Notice that within this expression, we can use the column name the same way we use a variable.
Finally, like all table queries, we finish with the end keyword.
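Putting the pieces described above together, the query would look roughly like the sketch below. The three rows and the name column are placeholder assumptions, not the real dataset. In the starter file the restaurants table is already defined, so only the sieve query itself is needed there.

```pyret
# Placeholder data (assumption): stands in for the real restaurants table
restaurants = table: name, rating
  row: "Restaurant A", 4.5
  row: "Restaurant B", 3.2
  row: "Restaurant C", 4.0
end

# sieve <table> using <column>:  <Boolean expression>  end
restaurants-sieved = sieve restaurants using rating:
  rating >= 4.0
end
```

With this placeholder data, restaurants-sieved keeps Restaurant A and Restaurant C and drops Restaurant B, whose rating is below 4.0.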
Complete the next three exercises in the Definitions Area: Low-Calorie, CA Presidents, and Asian Countries.
On Page 45, write down what a sieve query is for.
Warning: When attempting the presidents-sieved problem, some students will probably run into a bug if they’re not careful about capitalization!
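The capitalization bug can be demonstrated with a small sketch. The table name sample-presidents, the column name home-state, and all of the rows are hypothetical placeholders, since the real presidents table may be laid out differently:

```pyret
# Hypothetical starter table: names, columns and values are placeholders
sample-presidents = table: name, home-state
  row: "President A", "CA"
  row: "President B", "NY"
  row: "President C", "CA"
end

# Target String matches the data exactly: two rows are kept
ca-exact = sieve sample-presidents using home-state:
  home-state == "CA"
end

# Lower-case target String: no row matches, so the table is empty
ca-wrong-case = sieve sample-presidents using home-state:
  home-state == "ca"
end
```

An empty result like ca-wrong-case is the cue to ask students whether their target String is exactly what they want it to be.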
3-Step Table Plans
(Time 30 minutes)
sieve queries will now become the first step in our Table Plan. Before we worry about ordering or selecting, we’ll ask whether or not we want to drop any rows from our dataset. If the answer is no, we can skip ahead to ordering. But if it’s yes, we’ll write that sieve query first.
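A three-step plan of this kind might be sketched as follows. The table sample-countries, its column names other than continent, and all of the rows are made-up placeholders for illustration, not real GDP figures:

```pyret
# Hypothetical starter table (placeholder names and numbers)
sample-countries = table: name, continent, gdp
  row: "Country A", "Asia", 300
  row: "Country B", "Europe", 500
  row: "Country C", "Asia", 900
  row: "Country D", "Africa", 100
end

# Step 1 - sieve: drop the rows we do not want
asian = sieve sample-countries using continent:
  continent == "Asia"
end

# Step 2 - order: sort the remaining rows, largest GDP first
asian-ordered = order asian:
  gdp descending
end

# Step 3 - select: keep only the columns we care about
asian-gdps = select name, gdp from asian-ordered end
```

Each step takes the table produced by the previous one, which is why the sieve comes first: later steps then work on fewer rows.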
Turn to Page 32 in your workbook, and fill in the Table Plan that will solve the Word Problem and get you from the start table to the end table. When you’re done, type your solution queries into the Definitions Area under "Recent Title/Area".
Sometimes it’s obvious what your end table will have to look like, but a lot of the time you’ll need to figure that out for yourself. For practice, turn to Page 33 and read the Word Problem carefully. This time, you’ll have to fill in the end table yourself, before you start your table plan! When you’re done, type your solution queries into the Definitions Area under "Title and Overseas".
Data scientists often have to work with enormous tables, containing thousands or even millions of rows! When figuring out your table plan, it’s helpful to create just a small "starter table" so you can think things through. But what makes a good starter table?
A good starter table contains at least the columns that matter - whether we’ll be ordering, selecting, or sieving by those columns.
A good starter table has enough rows to be a representative sample of the dataset.
A good starter table has rows in truly random order, so that we’ll notice if we need to order the table or not.
A good starter table has a representative sample of rows from our full table. For example, a starter table based on presidents isn’t very good if it only has Democratic presidents, or only presidents from the 1800s. That’s a sampling bias that makes it harder to realize what we need to sieve by!
It will take some practice for you to get good at making starter tables, but you can start by identifying bad ones! Turn to Page 34, and write down what’s wrong with each of these tables.
Turn to Page 35 and read the Word Problem. This time you’ll need to come up with a good starter table and end table! When you’re done, type your solution queries into the Definitions Area under "Asian GDPs".
Now it’s time to return to our original question: how much of Asia’s total GDP does China generate, compared to the other Asian countries?
Use the table you’ve created to generate a chart, showing the GDP of every country in Asia.
Some students will likely use pie charts, bar charts, or perhaps something else for this. Point out the differences, and ask the class to discuss the pros and cons of each chart!
(Time 15 minutes)
Open the Sieve Syntax Errors file, and see if you can fix all the bugs you find. Once you’re done, uncomment each query by removing the hash sign (#) and click Run.
Take a few minutes and record your findings on Page 30. Do your findings match your hypothesis? What new questions does this raise?