CMU 15-112 Fall 2018: Fundamentals of Programming and Computer Science
Homework 8 (Due Sunday 21-Oct, at 5pm)

Reminder: all problems that are not explicitly marked COLLABORATIVE must be completed individually, as stated in the course syllabus.

To start:
1. Go to your folder named 'week8'
2. Create a file named hw8.py. There is no starter file this week!
3. Please also download and unzip these datasets. Place the text files in the same directory as hw8.py for easy access. You'll want to use these to test Problem 7, but don't submit these files to Autolab!
4. Edit hw8.py using Pyzo.
5. When you're done, submit hw8.py to Autolab. For this hw, you may submit up to 5 times.
Do not use recursion this week.
Do not hardcode the test cases in your solutions.
Don't forget to write test cases! Otherwise, you'll lose style points.

Midsemester Course Survey [10pts]
Take the midsemester course survey here! We want your feedback on how the course is going, so we can keep trying to make it better.

Note: this survey will be anonymous. When you complete the survey, you'll get a link to a different form where you can enter your andrewID to receive the homework credit. We will not be able to map your survey response to your andrewID.

If you want to give additional feedback, you can also give specific TAs feedback using the Midsemester TA Survey. This is not required and not worth any points, but is highly appreciated.

Collaborative problems

COLLABORATIVE: movieAwards(oscarResults) [10pts]
Write the function movieAwards(oscarResults) that takes a set of tuples, where each tuple holds the name of a category and the name of the winning movie, then returns a dictionary mapping each movie to the number of the awards that it won. For example, if we provide the set:

  { 
    ("Best Picture", "The Shape of Water"), 
    ("Best Actor", "Darkest Hour"),
    ("Best Actress", "Three Billboards Outside Ebbing, Missouri"),
    ("Best Director", "The Shape of Water"),
    ("Best Supporting Actor", "Three Billboards Outside Ebbing, Missouri"),
    ("Best Supporting Actress", "I, Tonya"),
    ("Best Original Score", "The Shape of Water")
  }

the program should return:

  { 
    "Darkest Hour" : 1,
    "Three Billboards Outside Ebbing, Missouri" : 2,
    "The Shape of Water" : 3,
    "I, Tonya" : 1
  }

Note: Remember that sets are unordered! For the example above, the returned set may be in a different order than what we have shown, and that is ok.

COLLABORATIVE: instrumentedLinearSearch(lst, item) and instrumentedBinarySearch(lst, item) [10pts]
Write the functions instrumentedLinearSearch(lst, item) and instrumentedBinarySearch(lst, item). Each of these functions should work just like the functions we went over in class, except that instead of returning the index of an element, they should return a list of all the indicies that were checked. For example, instrumentedLinearSearch([2, 4, 6, 8, 10, 12], 8) would return [0, 1, 2, 3], since it has to check each element in order until it reaches 8. On the other hand, instrumentedBinarySearch([2, 4, 6, 8, 10, 12, 14], 12) would return [3, 5] since it starts at the middle, narrows the search down to the top half, then immediately finds 12.

Note: Binary Search should round down when choosing a mid-index over an even number of elements. Otherwise, you won't pass the test cases!

COLLABORATIVE: containsPythagoreanTriple(a) [10pts]
Write the function containsPythagoreanTriple(a) that takes a list of positive integers and returns True if there are 3 values (a,b,c) anywhere in the list such that (a,b,c) form a Pythagorean Triple (where a**2 + b**2 == c**2). So [1,3,6,2,5,1,4] returns True because of (3,4,5).

A naive solution would be to check every possible triple (a,b,c) in the list. That runs in O(n**3). You'll have to do better than that!

Solo problems

Big-O Calculation (Manually graded) [20pts]
In a triple-quoted string at the top of your file (just below your name), include solutions to this exercise. For each of the following functions:
1. State in just a few words what it does in general.
2. Write the Big-O time or number of loops for each line of the program, then state the resulting Big-O of the whole program.
3. Provide an equivalent Python function that is more than a constant-factor faster (so its worst-case Big-O runtime is in a different function family). The better your solution's Big-O runtime, the more points you get!
4. Write the Big-O time or number of loops for each line of the new program, then state the resulting Big-O of the whole program.
def slow1(lst): # N is the length of the list lst assert(len(lst) >= 2) a = lst.pop() b = lst.pop(0) lst.insert(0, a) lst.append(b) def slow2(lst): # N is the length of the list lst counter = 0 for i in range(len(lst)): if lst[i] not in lst[:i]: counter += 1 return counter import string def slow3(s): # N is the length of the string s maxLetter = "" maxCount = 0 for c in s: for letter in string.ascii_lowercase: if c == letter: if s.count(c) > maxCount or \ s.count(c) == maxCount and c < maxLetter: maxCount = s.count(c) maxLetter = c return maxLetter def slow4(a, b): # a and b are lists with the same length N n = len(a) assert(n == len(b)) result = abs(a[0] - b[0]) for c in a: for d in b: delta = abs(c - d) if (delta > result): result = delta return result

instrumentedSelectionSort(a) and instrumentedBubbleSort(a) [10pts]
Write the functions instrumentedSelectionSort(a) and instrumentedBubbleSort(a). Each of these should work just like the versions given in the course notes, except instead of returning None, they return a triple of values: the number of comparisons, the number of swaps, and the time in seconds to run the sort.

Facebook friends [15pts]
When performing tasks on large amounts of data, we frequently need to format the dataset in a way that allows us to perform our task efficiently. For example, if we want to study social networks to find mutual friendships between specific users, we may want to format our data in such a way where we don't need to iterate over the entire dataset for every single search. This is especially true if we need to perform many searches. Perhaps we have recently discussed a data type that is well suited to this task...?

For this problem, we will provide a text file of hundreds of anonymized Facebook friendships. Please download them here if you haven't already, move the unzipped text files to the same directory as hw8.py, and then open the files in a text editor to see how the data is represented. Individual users are identified by a unique integer. Each line consists of a pair of integers representing a "friendship" between two users. For example, the line "123 234" indicates that User #123 lists User #234 as a friend. See below for a very short example of how these text files are formatted:

403 401 403 402 401 402 402 400 401 399 402 403 403 399 401 403 402 403 402 401 400 402 399 401 403 402 399 403
Given a specific user, we want to be able to efficiently search for all of the IDs listed as friends by that user. If we have to perform lots and lots of searches, an inefficient approach would be to iterate over every line of the text file for each query, and when a line begins with the user's ID, the friend's ID is appended to a list, and the list is returned after the entire file has been searched. This is potentially very slow for large datasets and many queries! We don't want to have to iterate over the entire dataset every time we want to know who one person is friends with.

Let's do better than that. We'll format the dataset once in a way where we can search for a specific user's friends easily, and then we'll create some functions that will quickly perform those searches on our formatted dataset. We should be able to call these functions many times without having to worry about how big our dataset is.
1. First, write the function formatDataset(filename), which reads the specified file and returns friendData, which is the dataset represented in a format of your choosing. For example, you might choose to return friendData as a string, though this is likely a poor choice. You could also return friendData as some sort of list or set or dictionary. It's up to you! This function does not have to be exceptionally efficient. Just make sure formatDataset(filename) runs in O(n³) time or better, and does not time out the Autograder (i.e. it should run in under 1 minute.). You should be able to do better than O(n³ pretty easily, as our solution function runs in O(n). This function will only be called once, and once we have friendData, we'll use that as an input for our queries.
  
  Hint: In this problem, n scales with the number of unique users in the input file which (in this problem) means that it also scales with the number of lines. For large enough datasets, the number of friends each user has is fairly constant. (i.e. maybe user 123 has 10 friends in a dataset of 100 ids. User 123 would also have about 10 friends in a dataset of 1000 ids.).
2. Now, write the function friendsOfUser(id, friendData) where id is the integer representing a specific user, and friendData contains the stored output of your previous function. (Do not loop over the original input file for each query!) Your function should return the set of user IDs identified by the input user as Facebook friends.
  
  For example, if friendData is the formatted data from our previous example, friendsOfUser(401, friendData) should return:
  
  {402, 399, 403}
3. Finally, write the function mutualFriends(id1, id2, friendData) where id1 and id2 are integers representing two specific users, and friendData contains the stored output of formatDataset(filename). (Again, do not loop over the original input file for each query!) Your function should return the set of user IDs identified by BOTH users as Facebook friends.
  
  For example, if friendData is the formatted data from our previous example, mutualFriends(401, 403, friendData) should return:
  
  {402, 399}
Here's the catch: friendsofUser(id, friendData) and mutalFriends(id1, id2, friendData) must run efficiently! The autograder is going to test these functions many times with different user IDs. For full credit, friendsofUser(id, friendData) and mutalFriends(id1, id2, friendData) must run in O(n) time or better (where n is the number of unique user IDs in the input file)! Only partial credit is given for a function that runs in O(n²), and no credit for any function that times out the autograder. This means you can't simply check the entire text file for every combination of users. Remember that formatDataset(filename) does not have to be exceedingly efficient, so think carefully about how you create friendData to make searches most efficient!

Hint 2: When testing your functions, first make sure everything works on a very small dataset, like the example given above. Then try it on successively larger datasets from the zip file. Does the time increase proportionally to the size of the file / number of unique user IDs? Or is it too slow to run on the largest dataset? Use these files to empirically approxiamte the efficiency of your functions. Do they match your expectation?

Note: With the exception of the short example above, the datasets used in this assignment are real! Their source is Snap Datasets: Stanford Large Network Dataset Collection (Jure Leskovec and Andrej Krevl, Jun. 2014).

friendsOfFriends(d) [15pts]
Background: we can create a dictionary mapping people to sets of their friends. For example, we might say:
d = { } d["jon"] = set(["arya", "tyrion"]) d["tyrion"] = set(["jon", "jaime", "pod"]) d["arya"] = set(["jon"]) d["jaime"] = set(["tyrion", "brienne"]) d["brienne"] = set(["jaime", "pod"]) d["pod"] = set(["tyrion", "brienne", "jaime"]) d["ramsay"] = set()
With this in mind, write the function friendsOfFriends(d) that takes such a dictionary mapping people to sets of friends and returns a new dictionary mapping all the same people to sets of their friends of friends. For example, since Tyrion is a friend of Pod, and Jon is a friend of Tyrion, Jon is a friend-of-friend of Pod. This set should exclude any direct friends, so Jaime does not count as a friend-of-friend of Pod (since he is simply a friend of Pod) despite also being a friend of Tyrion's.

Thus, in this example, friendsOfFriends should return:
{ 'tyrion': {'arya', 'brienne'}, 'pod': {'jon'}, 'brienne': {'tyrion'}, 'arya': {'tyrion'}, 'jon': {'pod', 'jaime'}, 'jaime': {'pod', 'jon'}, 'ramsay': set() }
Note 1: your function should not modify the initial provided dictionary!

Note 2: you may assume that everyone listed in any of the friend sets also is included as a key in the dictionary.

Note 3: you may assume that a person will never be included in their own friend set. You should also not include people in their own friends-of-friends sets!

Note 4: you may not assume that if Person1 lists Person2 as a friend, Person2 will list Person1 as a friend! Sometimes friendships are only one-way. =(

Hint: How is this different or similar to the facebook friends problem?

Bonus problems

Bonus/Optional: threeWayMergesort [1pt]
First, write the function threeWayMergesort(L) that takes a list L and does a modified version of mergesort on L, where instead of combining two sorted sublists at a time, you should combine three sorted sublists at a time (so after one pass, you have sublists of size 3, and after two passes, sublists of size 9, and so on). You may not assume that the size of the list is a power of 3.

Next, in a triple-quoted string just below your threeWayMergesort function, write a short proof that the function runs in O(nlogn) -- that is, in the same big-oh worst-case runtime as normal two-way mergesort. Then, write the function threeWayMergesortTiming() that empirically demonstrates that threeWayMergesort runs in the same big-oh as normal (two-way) mergesort (no more than a constant time faster or slower, even as n grows large). So all that extra work to write threeWayMergesort was for naught. Sigh.

When you are done with this function, run it, get the timing data that proves your point, and include it with a very brief explanation at the end of that triple-quoted string just below threeWayMergesort.

Bonus/Optional: Sorting animation (Manually graded) [2pt]
Write an animation function for your favorite sorting algorithm! You'll want to use the animation starter code, but you can be creative regarding how you visualize the sorting process. Look at the links in the sorting course notes for inspiration. Your animation should clearly indicate the name of the sorting algorithm being demonstrated, and should start by showing a randomly-generated (i.e. not hard-coded) sequence of each integer from 0 to 20, inclusive. Pressing the right arrow should advance the sorting algorithm by one step, and the left arrow should undo the sorting algorithm by one step. Make sure the animation doesn't crash if you press the arrow keys too many times! Pressing "r" should generate and display a new randomly-generated sequence of unsorted integers. For each step, it should be clear which integers are being compared, moved, swapped, stored, etc. In other words, it should be useful to someone trying to learn how this sorting algorithm works. TAs will grade your animations manually, so make sure your animation runs when you run hw8.py.

IMPORTANT: If you attempt this bonus problem, look back at how we separated autograded and manually graded problems in hw5, and make sure to place your code beneath the following lines:

###################################################################### # GRAPHICS/ANIMATION PROBLEMS # ignore_rest # The autograder will ignore all code below here ###################################################################### from tkinter import *

Suggestion: If you plan to attempt this bonus problem, pay close attention to how many Autolab submissions you have left and be careful that your animation code is not causing Autolab to return errors for your required problems!

CMU 15-112 Fall 2018: Fundamentals of Programming and Computer Science Homework 8 (Due Sunday 21-Oct, at 5pm)

CMU 15-112 Fall 2018: Fundamentals of Programming and Computer Science
Homework 8 (Due Sunday 21-Oct, at 5pm)