
Students explore plotting points to represent documents, data normalization, the dimensionality of language, and training the model.

Lesson Goals

Students will be able to…​

  • Recognize that for a bag-of-words model, dimensionality reflects how many different words are in the corpus.

  • Define a model as a simplified representation of the relationship(s) between variables in a dataset.

Student-facing Lesson Goals

  • Let’s think about training, the act of transforming data into a model.

Materials

🔗The Dimensionality of Natural Language

Overview

We made bags of words out of jazz vocalizations in order to build meaningful "sentences" from very few different words. Obviously, most student essays will contain many more words than these jazz vocalizations do. What happens when we try to handle something closer to ordinary “language”?

Launch

So far, we’ve looked at four documents.

  • Document a: "doo be doo be doo"

  • Document b: "doo doo be doo be"

  • Document c: "doo be doo be doo doo doo"

  • Document d: "be bop bop bop be bop bop"

Although the documents contain 24 words in total, there are just three unique words: doo, be, and bop. That means we can plot each of these documents as a point in a three-dimensional space.
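
As an illustration (a hypothetical Python sketch, not part of the lesson's starter files), each document can be written down as a vector of word counts along the doo, be, and bop axes; the ordering of the axes is an arbitrary choice made for this sketch.

    # Hypothetical sketch: each document becomes a point in (doo, be, bop) space.
    documents = {
        "a": "doo be doo be doo",
        "b": "doo doo be doo be",
        "c": "doo be doo be doo doo doo",
        "d": "be bop bop bop be bop bop",
    }

    vocabulary = ["doo", "be", "bop"]   # one axis per unique word in the corpus

    def bag_of_words(text, vocab):
        """Count how many times each vocabulary word appears in the text."""
        words = text.split()
        return [words.count(w) for w in vocab]

    for name, text in documents.items():
        print(name, bag_of_words(text, vocabulary))
    # a [3, 2, 0]
    # b [3, 2, 0]
    # c [5, 2, 0]
    # d [0, 2, 5]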

Obviously, most texts will contain more than three unique words!

Take a minute to consider what it would look like to plot a point for "doo be bop ski bop bop" in four-dimensional space.

Having trouble visualizing a four-dimensional space? You’re not alone!

Fortunately, computers, unlike humans, have no issue working with multi-dimensional spaces, even spaces with hundreds of thousands of dimensions.

Investigate

A training corpus is a collection of data used to train AI/ML models, enabling them to learn patterns and make predictions.

For a plagiarism checker, the corpus is the set of documents previously fed into the checker.

  • What sorts of documents make up the training corpus of an effective plagiarism detector? List as many as you can.

  • The corpus would likely include:

    • essays written and submitted by students currently in the class

    • essays written and submitted by students previously in the class

    • Wikipedia articles

    • articles on relevant topics that are available on the internet, etc.

A teacher who wants to catch plagiarism will likely opt for a plagiarism detector that has been trained on an extremely large collection of documents. Processing a large training corpus will produce a complex, multi-dimensional model. Every additional unique word adds another dimension to the space.

  • Let’s say your teacher asks all 20 students in her class to write a 500-word essay. She plans to feed those 20 essays into a plagiarism detector to use as the training corpus, allowing her to detect if two students submitted essays that were a little too similar. About how many dimensions will there be in the model?

  • Students should provide a wide range of estimates.

  • The largest possible estimate would be 10,000 dimensions (20 essays multiplied by 500 words), but it is not a good estimate, because we commonly repeat and reuse words like "the", "and", "a", and so on. (One way to check such an estimate is sketched after this list.)

  • Before making an estimate, students might have clarifying questions, like:

    • Did all of the students write about the same topic?

    • How sophisticated is the student writing?

    • Did all students actually write 500 words?

  • A reasonable prediction would probably be that there would be at least a few thousand dimensions in the model.

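One way to sanity-check an estimate like this (a hypothetical Python sketch, not part of the lesson materials): pool all of the essays and count the distinct words, since each distinct word contributes one dimension to the model.

    # Hypothetical sketch: the dimensionality of a bag-of-words model is the
    # number of distinct words across the whole training corpus.
    def count_dimensions(essays):
        unique_words = set()
        for essay in essays:
            unique_words.update(essay.lower().split())
        return len(unique_words)

    # Two toy "essays"; with 20 real essays of ~500 words each, the count stays
    # far below the 10,000-word ceiling because common words repeat everywhere.
    essays = ["the elephant is the national animal of thailand",
              "the blue whale is the largest animal on earth"]
    print(count_dimensions(essays))   # 12 distinct words across both toy essays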

Synthesize

Although we can’t visualize the multi-dimensional spaces for wiki-article and student-essay, we can apply what we have learned to consider angle differences.

wiki-article = "The elephant has been a contributor to Thai society and its icon for many centuries. The elephant has had a considerable impact on Thai culture. The Thai elephant is the official national animal of Thailand. The elephant found in Thailand is the Indian elephant, a subspecies of the Asian elephant."

student-essay = "The elephant is a contributor to Thai society. It has been an icon of Thai life for many centuries. The elephant, which it is possible to see found in every part of Thailand, is the Indian elephant, which is a subspecies of the Asian elephant. The Thai elephant has a considerable impact on culture. The elephant is the official national animal of Thailand."

  • Do you predict that the angle difference for the wiki-article and student-essay will be closer to 0° or closer to 90°?

  • Since the student essay is nearly identical to the Wikipedia article, we would expect a difference closer to zero. (It’s actually ~23.706°.) A sketch of how such an angle can be computed follows below.
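
Here is a hedged sketch of how such an angle can be computed in Python (illustrative code, not the actual angle-difference function from the starter file): build a bag of words for each text over their combined vocabulary, then take the angle between the two count vectors.

    import math
    import re

    def angle_difference(text_a, text_b):
        """Angle (in degrees) between the bag-of-words vectors of two texts."""
        words_a = re.findall(r"[a-z]+", text_a.lower())
        words_b = re.findall(r"[a-z]+", text_b.lower())
        vocab = sorted(set(words_a) | set(words_b))
        a = [words_a.count(w) for w in vocab]
        b = [words_b.count(w) for w in vocab]
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        cosine = min(1.0, dot / (norm_a * norm_b))   # clamp to avoid rounding errors
        return math.degrees(math.acos(cosine))

    print(round(angle_difference("doo be doo be doo", "be bop bop bop be bop bop"), 1))
    # 78.1 degrees: those two jazz documents point in fairly different directions.

Run on the wiki-article and student-essay texts above, this kind of computation should land in the neighborhood of the ~24° quoted in the answer, though the exact value depends on how punctuation and capitalization are handled.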

🔗Training a Model

Overview

Now that we’ve seen how to create a compressed representation of one piece of text, how can we handle many pieces of text?

Launch

The problem with a corpus:

Imagine running angle-difference on the same corpus documents over and over, rebuilding their bags of words from scratch each time. That would take a very long time!

To avoid repetition, we need the next step of the process: training.

Training generates a bag of words for each document in the corpus and stores it; this stored collection is the model. When a new document comes in, it is compared against the model without referring back to the original documents.
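
Sketched in Python (the function and document names here are assumptions for illustration, not the starter file's actual code), training builds one bag of words per corpus document and stores it; checking a new submission then only touches those stored bags.

    import math
    from collections import Counter

    def angle_between(bag_a, bag_b):
        """Angle (degrees) between two bags of words stored as Counters."""
        dot = sum(count * bag_b[word] for word, count in bag_a.items())
        norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
        norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
        cosine = min(1.0, dot / (norm_a * norm_b))   # clamp to avoid rounding errors
        return math.degrees(math.acos(cosine))

    def train(corpus):
        """Training: build and store one bag of words per document, once."""
        return {name: Counter(text.lower().split()) for name, text in corpus.items()}

    def check(model, new_text):
        """Compare a new document against every stored bag, never the raw corpus."""
        new_bag = Counter(new_text.lower().split())
        return {name: angle_between(bag, new_bag) for name, bag in model.items()}

    model = train({"doc-a": "doo be doo be doo", "doc-d": "be bop bop bop be bop bop"})
    print(check(model, "doo doo be doo be"))
    # doc-a: (nearly) 0 degrees, doc-d: about 78 degrees

This is the same idea behind the distance-to function discussed below: the expensive bag-building happens once during training, and every later query reuses the stored model.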

Investigate

Specifically, let’s suppose the teacher wants a plagiarism detector for (short) animal essays. In addition to the paragraph we’ve already seen about the elephant, she gathers up paragraphs describing nine other animals. Each one is turned into a bag of words and added to our model. All of this work is done only once; the resulting model can then be used on many different student submissions.

Once a model is trained, the corpus can be queried as many times as we want without having to repeat any of the work done during training!

Let’s return to the Plagiarism Detection Starter File.

  • We’ve seen that angle-difference takes in any two articles we give it, builds their bags of words, and computes the angle difference between them.

  • The Plagiarism Detection Starter File actually contains a hidden training corpus! Try typing chimpanzee-article to take a peek. Now try snail-article. What do you observe?

  • The distance-to function is much more powerful than angle-difference, allowing us to compare any article to all of the articles that we trained our model on without recomputing the bags for each of those documents every time.

Turn to the first section of Exploring the Model and complete the questions to explore how distance-to works.

  • What are some advantages of working with distance-to instead of angle-difference?

  • It’s nice to be able to see many angle differences, rather than just the one that we have specified. distance-to does many times the work of angle-difference!

  • Is distance-to sophisticated enough to be able to determine with certainty whether or not plagiarism occurred?

  • No. If two essays have an unusually small angle difference, that is a signal for a human to investigate further. A plagiarism detector cannot conclusively decide if plagiarism occurred.

  • Imagine that there’s an actual teacher out there who desperately wants to catch the student who handed in student-essay. He really wants the plagiarism detector to declare without a shade of doubt that the student is guilty. What ideas do you have for how he might be able to improve the model to get more conclusive results?

  • Solicit student answers before exploring the next iteration.

Removing common words can simplify text processing and increase focus on more meaningful words.

  • What did "cleaning" our bags of words entail? What did we remove from the bags when we used this function?

  • We removed words that are commonly used in the English language.

  • Can you think of any reasons or scenarios when it might be useful to "clean" text of commonly used words?

  • Invite student discussion before sharing the explanation provided in the lesson.

The common words that are often filtered out in text analysis are called stopwords.

  • Did removing stopwords from the corpus improve the model? Why or why not?

  • Removing the stopwords (words that contribute little to the meaning of the text) let the model focus on the more meaningful content. Removing stopwords from the corpus dramatically reduced the angle difference between student-essay and elephant-essay to zero! (A sketch of this kind of cleaning follows below.)
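
As an illustration (the stopword list and function name below are invented for this sketch, not taken from the starter file), "cleaning" a bag of words just means dropping the very common words before counting:

    import re
    from collections import Counter

    # A tiny, illustrative stopword list; real text-analysis tools use much longer ones.
    STOPWORDS = {"the", "a", "an", "is", "it", "of", "to", "and", "which", "has", "been"}

    def clean_bag_of_words(text):
        """Build a bag of words after dropping punctuation and stopwords."""
        words = re.findall(r"[a-z]+", text.lower())
        return Counter(w for w in words if w not in STOPWORDS)

    print(clean_bag_of_words("The elephant is the official national animal of Thailand."))
    # Counter({'elephant': 1, 'official': 1, 'national': 1, 'animal': 1, 'thailand': 1})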

Synthesize

  • Now that you understand a little bit more about how plagiarism detection programs work: Can you explain to your teacher how a plagiarism detector could have mistakenly flagged your friend? Or do you agree with your teacher, that the plagiarism detector can be trusted with certainty?

  • Students' responses will vary.

These materials were developed partly through support of the National Science Foundation (awards 1042210, 1535276, 1648684, 1738598, 2031479, and 1501927). Bootstrap by the Bootstrap Community is licensed under a Creative Commons 4.0 Unported License. This license does not grant permission to run training or professional development. Offering training or professional development with materials substantially derived from Bootstrap must be approved in writing by a Bootstrap Director. Permissions beyond the scope of this license, such as to run training, may be available by contacting contact@BootstrapWorld.org.