A Working List of Probability Questions: Week 3

Programming, Statistics, and Past Interview Questions
July 6, 2020 | Updated August 5, 2020

Welcome to the last week of my trading interview prep series! My focus this week is to review programming and statistics concepts, along with exploring more challenging problems in areas I’ve already done practice problems in. Aside from these topics, I’ll mostly be pulling past interview questions from Glassdoor pages and Leetcode. As always, I’ll be providing solutions to the problems I do and a list of resources I use for reviewing concepts, along with a list of extra problems that I work on after this week is over.

Day 13   |   Day 14   |   Day 15   |   Day 16   |   Day 17   |   More  

Day Thirteen: July 6

Most of today’s problems are past interview questions in trading!

Problem 1. You have a drawer with an infinite number of two colors of socks, which exist in equal probability. What is the expected number of attempts at taking out socks individually from the drawer before a matching pair is found?
Solution. This is a past Jane Street interview question that should be answered without pen and paper. Note that a matching pair of socks must be found by the third draw by the pigeonhole principle. After our first draw, we have probability $\frac{1}{2}$ of drawing the same color on our second draw and $\frac{1}{2}$ of drawing the other color. Should we draw the other color, we are guaranteed to complete a matching pair on the third draw, so we have an equal probability of completing the pair in two or three draws. Then the expected number of attempts is $2.5$.

Problem 2. There is a raffle with 80 total tickets. Three tickets win a prize. I buy 5 tickets. What is the approximate probability that I win exactly one prize? Solution. We can approach this problem either combinatorially or by brute force (I prefer the combinatorial method but the brute force method is more intuitive). There are $\binom{3}{1}$ ways to choose one (and only one) of the three winning tickets. Of the remaining $77$ tickets, there are $\binom{77}{4}$ ways to choose four non-winning tickets. Multiplying these together gives the number of outcomes in which we win exactly one prize. Finally, there are $\binom{80}{5}$ ways to choose $5$ tickets (winning or non-winning), so dividing our numerator by this gives the probability of winning exactly one prize, about $16.89\%$. To brute force it, consider each of the three prizes. The probability that the first prize goes to one of my tickets is $\frac{5}{80}$. Then, considering that we can only win one of the three prizes,

$$\begin{aligned} P(\text{win only first prize}) &= \tfrac{5}{80} \cdot \tfrac{75}{79} \cdot \tfrac{74}{78} \\ P(\text{win only second prize}) &= \tfrac{75}{80} \cdot \tfrac{5}{79} \cdot \tfrac{74}{78} \\ P(\text{win only third prize}) &= \tfrac{75}{80} \cdot \tfrac{74}{79} \cdot \tfrac{5}{78} \end{aligned}$$

Summing these three probabilities gives the same answer as our combinatorial solution above.
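
As a quick sanity check, here's a short snippet that evaluates the hypergeometric expression above directly (the nCk helper is just a binomial coefficient):

#include <iostream>

// Sanity check for the raffle problem: P(exactly one winning ticket among my 5)
// via the hypergeometric expression C(3,1) * C(77,4) / C(80,5).
double nCk(int n, int k) {
    double result = 1.0;
    for (int i = 1; i <= k; i++) {
        result *= (n - k + i) / static_cast<double>(i);
    }
    return result;
}

int main() {
    double p = nCk(3, 1) * nCk(77, 4) / nCk(80, 5);
    std::cout << p << "\n";  // prints roughly 0.1689
    return 0;
}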

Problem 3. Consider a deck with $26$ black and $26$ red cards. You draw one card at a time, and before each draw you can either guess that it will be red or simply observe the result. If you guess and the card is red you get $\$1$, and the game ends whenever you decide to guess. What is your strategy for this game and the expected earnings? Solution. I have no idea what the right answer is, so solving this question was rather fun. The most pedestrian strategy is to guess on the first card, with probability $\frac{1}{2}$ of guessing correctly and EV $\$0.50$. However, we can aim to slightly improve on this by basing our decision on previously drawn cards. We can find the expected value of guessing at any given round based on the number of red and black cards left in the deck. Let $f(r,b)$ be the expected payoff of guessing in the state where $r$ red and $b$ black cards remain: the probability of guessing correctly is $f(r,b) = \frac{r}{r+b}$. This is where my strategy gets a bit more dicey: I'd first draw $5$ cards with the hope of picking out more black cards than red cards, which would increase the EV of the rest of the deck. Should I end up behind after this initial sequence of draws, I'm more likely to draw black cards than red cards in the coming turns, so I'll continue drawing until I reach a bit over a $50\%$ chance of drawing another red card. This is an inexact science and I don't really have an answer that encapsulates every situation that could arise, other than drawing cards up to the point where few enough cards remain, or the fraction of red cards is high enough, that my probability of winning is maximized. This strategy introduces more variability into our expected payoff: instead of having a strictly $50\%$ chance of getting a dollar, we introduce a bit of randomness into our draw and can end up above or below a $50\%$ chance of guessing a red card correctly. Doing this, I run the risk of never getting back up to $50\%$ should I draw $26$ red cards right off the bat (an extremely unlikely and unfortunate outcome). Update: I'll be posting a revised answer to this question soon!

My calculation of the expected payoff is also a bit sketchy and relies on a symmetry argument: suppose we draw a sequence of cards after which we have probability $p$ of the next card being red (EV $p$). We were just as likely to have drawn the color-flipped sequence, in which each $i^\text{th}$ card is the opposite color of its counterpart in our actual draw, giving EV $1-p$. Applying this to any state at which we decide to guess, we are equally likely to face a payoff of $p$ as a payoff of $1-p$, so the expected value of the overall game is $\$0.50$ regardless of our strategy. All in all, I'm not really sure I gain anything by taking on the increased variability, but I'm inclined to be less risk averse given that I cannot lose money.
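
To make the strategy discussion concrete, here's a small dynamic programming sketch of the value function $f(r,b)$ defined above: in each state we either guess now (success probability $\frac{r}{r+b}$) or observe one card and continue. Running it on the full $26$/$26$ deck returns $0.5$, consistent with the symmetry argument.

#include <algorithm>
#include <iostream>
#include <vector>

// f(r, b): best probability of eventually guessing a red card correctly when
// r red and b black cards remain and we may either guess now or observe a card.
double value(int R, int B) {
    std::vector<std::vector<double>> f(R + 1, std::vector<double>(B + 1, 0.0));
    for (int r = 0; r <= R; r++) {
        for (int b = 0; b <= B; b++) {
            if (r + b == 0) continue;  // no cards left, nothing to guess
            double guessNow = static_cast<double>(r) / (r + b);
            double observe = 0.0;
            if (r > 0) observe += static_cast<double>(r) / (r + b) * f[r - 1][b];
            if (b > 0) observe += static_cast<double>(b) / (r + b) * f[r][b - 1];
            f[r][b] = std::max(guessNow, observe);
        }
    }
    return f[R][B];
}

int main() {
    std::cout << value(26, 26) << "\n";  // prints 0.5
    return 0;
}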

Problem 4. How would I design the elevators for a new 40-story office building that has an average of 100 people per floor to most efficiently fill and empty the building, given a standard 9-5 workday and traffic conditions in my city? The answer should be as detailed as possible, including expected passengers per car, time per stop, average floor stops per trip at various hours, etc. Solution. This is a Google PM question, but the nature of this question feels very trading-esque. Note that $4000$ people must enter and exit the building each day, presumably concentrated around $9$ and $5$. Some people will come in early and some will funnel in a bit later in the workday, so I'll assume arrivals follow roughly a normal distribution centered at 9:00 with a standard deviation of $30$ minutes. Then we have approximately the following number of people using the elevators at various times during the morning:

7:30 - 8:00: $80$
8:00 - 8:30: $560$
8:30 - 9:00: $1360$
9:00 - 9:30: $1360$
9:30 - 10:00: $560$
10:00 - 10:30: $80$

During lunch some people will probably leave the office, but it's quite unlikely that the elevators will be operating at full capacity since some people (perhaps even a majority) will stay in for lunch. Let's say that there are $10$ elevators and each elevator carries an average of $10$ people, disregarding weight, and that we have $1360$ people being funnelled into the building in a $30$-minute interval. Then each elevator should make $\frac{1360}{100} = 13.6$ trips carrying people from the first floor up to the higher floors (the $10$ elevators move $100$ people per round of trips). Since this office complex is relatively large, we should consider splitting up the floors serviced by each elevator so that the time spent travelling is minimized. We can break this up so that each pair of elevators services $8$ floors: elevators 1 & 2 servicing floors $1$-$8$, elevators 3 & 4 servicing floors $1$, $9$-$16$, etc. Then each elevator makes an average of $9$ floor stops per trip, with the exception of elevators $1$, $2$, which average $8$ stops per trip. While this helps reduce inefficiency, the elevators that travel the farthest must still run between the first floor and the top $8$ floors, making an average of one trip every $2.2$ minutes. It takes relatively little time for people to exit an elevator, so we'll allocate an average of $5$ seconds per stop before moving on to the next floor. Then elevators $9$, $10$ will spend $40$ seconds on stops among the top floors, which gives them about $90$ more seconds to travel between floor 1 and floor 40 and fill up with another load of people. Note that this answer could be made more detailed, for instance by maximizing the carrying capacity of the higher-floor elevators using the fact that people on lower floors will take the stairs, along with other variables I haven't considered.

Problem 5. Consider a card game with two black cards and two red cards. You draw a card each round: if it's black you lose a dollar and if it's red you win a dollar. You can choose to stop at any time. What's your winning strategy and its associated expected value? Solution. Note that by playing the game out completely we ensure breaking even. To cover the obvious cases: if we finish the third draw up one dollar ($RBR$) or the second draw up two dollars ($RR$), it makes sense to end the game, since we are guaranteed to lose part of our winnings by continuing. Also, if we're behind at any stage we continue drawing, since playing out the deck guarantees an even outcome, and if we're even at any point we should keep drawing, since we will either end up with positive gains or play out the entire game and finish even. Let's consider what happens when we draw $R$ first. Then we are up one dollar and have probability $\frac{1}{3}$ of drawing another red (net $+2$), probability $\frac{2}{3}\cdot \frac{1}{2} = \frac{1}{3}$ of drawing two blacks and being forced to finish the game (net $+0$), and probability $\frac{1}{3}$ of drawing $BR$ afterward and stopping (net $+1$). So conditional on drawing a red first, we have EV $\frac{1}{3} \cdot 2 + \frac{1}{3} \cdot 1 = 1$. Now consider drawing $B$ first: if the next two cards are both red (probability $\frac{1}{3}$) we can stop at $+\$1$, and we break even otherwise (probability $\frac{2}{3}$). Then our total EV is $\frac{1}{2} \cdot 1 + \frac{1}{2}(\frac{1}{3} \cdot 1 + \frac{2}{3} \cdot 0) = \frac{2}{3}$ using this strategy. This can be extended to any number of cards, as sketched below.
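
Here's a sketch of that extension: the expected additional profit $V(r,b)$ with $r$ red and $b$ black cards left follows a simple optimal-stopping recursion in which stopping is worth $0$ going forward. It reproduces the $\frac{2}{3}$ above and works for any deck size.

#include <algorithm>
#include <iostream>
#include <vector>

// V(r, b): expected additional profit with r red (+$1) and b black (-$1) cards
// left, when we may stop at any time; stopping is worth 0 going forward.
double value(int R, int B) {
    std::vector<std::vector<double>> V(R + 1, std::vector<double>(B + 1, 0.0));
    for (int r = 0; r <= R; r++) {
        for (int b = 0; b <= B; b++) {
            if (r + b == 0) continue;  // V(0,0) = 0
            double draw = 0.0;
            if (r > 0) draw += static_cast<double>(r) / (r + b) * (1.0 + V[r - 1][b]);
            if (b > 0) draw += static_cast<double>(b) / (r + b) * (-1.0 + V[r][b - 1]);
            V[r][b] = std::max(0.0, draw);  // stop whenever drawing has negative EV
        }
    }
    return V[R][B];
}

int main() {
    std::cout << value(2, 2) << "\n";    // prints 0.666667
    std::cout << value(26, 26) << "\n";  // value of the same game on a full deck
    return 0;
}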

Day Fourteen: July 7

Today's problems are a mix of programming questions from Leetcode and some past trading interview questions.

Problem 1. Given a string, find the length of the longest substring without repeating characters. Solution. This is a Leetcode medium question. We can solve it with a sliding window technique and minimize time complexity by using a hash map, giving an $O(n)$ solution. Starting from the first index of the string, we insert each character into the hash map and track the current window size with a counter variable. Should we come across a character already in our hash map, we slide the left edge of our window just past that character's previous occurrence and keep scanning, updating the counter whenever the current window is the longest seen so far.

class Solution {
public:
    int lengthOfLongestSubstring(string s) {
        int left(0), counter(0);
        unordered_map<char, int> myMap;  // most recent index of each character

        for (int i = 0; i < s.size(); i++) {
            // if s[i] has been seen, move the window's left edge just past
            // its previous occurrence (never moving the edge backwards)
            if (myMap.find(s[i]) != myMap.end()) {
                left = std::max(left, myMap[s[i]] + 1);
            }
            myMap[s[i]] = i;
            counter = std::max(counter, i - left + 1);  // longest window so far
        }
        return counter;
    }
};

This solution runs faster than $54\%$ of C++ submissions, so it could definitely be streamlined or improved. I implemented a basic sliding window and didn't dedicate much time to optimizing it.

Problem 2. There are $3$ balls distributed among $4$ boxes. What is the probability that the maximum number of balls in any given box is exactly $2$? Solution. This is an incredibly simple combinatorics question that I got wrong on a recent HackerRank. The number of ways to arrange $n$ indistinguishable balls into $k$ boxes is $\binom{n+k-1}{n}$, so there are $\binom{6}{3} = 20$ ways to arrange the balls into boxes. You can learn more about balls-and-boxes combinatorial methods here. To find the desired probability, consider the case where $2$ balls sit in one box. Then the final ball is in one of the three remaining boxes. There are $\binom{4}{1}$ ways to pick the box that carries two balls and $\binom{3}{1}$ ways to pick which of the remaining boxes holds the last ball. Then, treating each arrangement as equally likely, we have probability $\frac{\binom{4}{1} \binom{3}{1}}{\binom{6}{3}} = \frac{3}{5}$ of the given scenario occurring.

If you didn't know the formula for the total number of ways to arrange the balls into boxes, we can still derive it with combinatorial methods. There are three cases to consider: all balls in different boxes, two balls in one box and the final ball in another, and all three balls in one box. There are $\binom{4}{3}$ ways to choose three boxes that each contain one ball, $\binom{4}{1}\binom{3}{1}$ ways to choose one box to hold two balls and a second box to hold the final ball (as previously calculated), and $\binom{4}{1}$ ways to choose one box to contain all the balls. Then our probability of obtaining the desired outcome is $\frac{\binom{4}{1}\binom{3}{1}}{\binom{4}{3} + \binom{4}{1}\binom{3}{1} + \binom{4}{1}} = \frac{3}{5}$.
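
Here's a brute-force enumeration of the $20$ arrangements (under the same assumption as above, that each arrangement is equally likely), confirming the $\frac{3}{5}$:

#include <algorithm>
#include <iostream>

int main() {
    // Enumerate all (b1, b2, b3, b4) with b1 + b2 + b3 + b4 = 3, treating each
    // arrangement as equally likely, and count those whose maximum entry is 2.
    int total = 0, maxIsTwo = 0;
    for (int b1 = 0; b1 <= 3; b1++)
        for (int b2 = 0; b1 + b2 <= 3; b2++)
            for (int b3 = 0; b1 + b2 + b3 <= 3; b3++) {
                int b4 = 3 - b1 - b2 - b3;
                int mx = std::max(std::max(b1, b2), std::max(b3, b4));
                total++;
                if (mx == 2) maxIsTwo++;
            }
    std::cout << maxIsTwo << " / " << total << "\n";  // prints 12 / 20
    return 0;
}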

Problem 3. You have a box filled with cash. The cash value is uniformly and randomly distributed from 0 to 1000. You are trying to win the box in an auction: you win the box if you bid at least the value of the cash in the box; you win nothing if you bid less (but you lose nothing). If you win the box, you can resell it for 150% of its value. How much should you bid to maximize the expected value of your profit (resale of box minus bid)? Solution. We should bid $\$0$, as the expected value of this game is negative for any positive bid. I'll assume that cash means there will not be fractions of a dollar in the box. Let $X$ be the value in the box and $Y$ the amount we bid, which is also what we pay whenever we win. Then our profit function is $P(X) = 1.5X - Y$ when $X\leq Y$, and $0$ otherwise. Since $\mathbb{E}[P(X)] = 1.5\,\mathbb{E}[X] - \mathbb{E}[Y]$, where both expectations are taken only over the outcomes in which we win the box, we first need to find these two expected values. Each integer between $0$ and $1000$ is equally likely to be in the box ($P(X=x)= \frac{1}{1001}$). Since we only realize the payout of the box if $X\leq Y$, we only need to sum up to $Y$:

$$\mathbb{E}[X] = \sum_{x= 0}^Y \frac{x}{1001} = \frac{Y(Y+1)}{2002}$$

To find the expected amount we pay, note that we pay $Y$ dollars for the box if $X\leq Y$ and $0$ otherwise. Given $Y$, the probability that $X\leq Y$ is $\frac{Y+1}{1001}$, since each potential value of $X$ up to and including $Y$ has probability $\frac{1}{1001}$. Then $\mathbb{E}[Y] = \frac{Y(Y+1)}{1001}$. It follows that

$$\begin{aligned} \mathbb{E}[P(X)] &= 1.5\,\mathbb{E}[X] - \mathbb{E}[Y] = \frac{1.5\,Y(Y+1)}{2002} - \frac{Y(Y+1)}{1001} \\ &= \frac{-Y(Y+1)}{4004} \end{aligned}$$

which is negative for any positive bid, so we should not bid anything for the box. Our optimal bid is then $\$0$, with expected payoff $\$0$. Note that my original assumption was the impetus for using a summation; had I assumed that any value was possible in the box, we would have to treat $X$ as a continuous variable.
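
For the continuous variant just mentioned, the same conclusion falls out of a short integral. Assuming the cash value is uniform on $[0, 1000]$ and we bid $Y$,

$$\mathbb{E}[\text{profit}] = \int_0^Y (1.5x - Y)\,\frac{dx}{1000} = \frac{0.75Y^2 - Y^2}{1000} = -\frac{Y^2}{4000} \leq 0$$

so the optimal bid is still $\$0$.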

Problem 4. Imagine I flip 100 coins in a sequence, and you need to guess the sequence in one try. You are allowed to ask one yes/no question. What do you ask to maximize the probability of guessing the sequence? Solution. This question is a bit open-ended, as there isn't really a definitive correct answer. Without information, we have probability $(\frac{1}{2})^{100}$ of guessing the entire sequence correctly. We can double our probability to $(\frac{1}{2})^{99}$ by asking about any specific index in the sequence (and we should ask about the first one, since it is the only index we are guaranteed to reach). We could instead ask whether there are more heads than tails, but a yes to that question (which happens slightly less than half the time, since ties are possible) provides minimal benefit in choosing the value of any specific index until very late in the sequence, and a no is essentially wasted since we don't know whether there are more tails than heads or whether they are equal in number. Alternatively, we could ask about the second or the third index, which would give us probability $\frac{1}{2}$ or $\frac{1}{4}$ of guessing the first two or three indices respectively. However, this still leaves the first index up to chance and runs the risk of getting no index correct should we guess the first one wrong, since the game ends there. I can't really think of any other way to gather information that would be readily pertinent to guessing the sequence, and asking for the first index seems like the best option since it guarantees that at least one index is correct. Were I more confident in my understanding of the coin flips or the nature of the sequence, I might employ the riskier strategy of asking about some non-first index, or even whether the number of heads exceeds the number of tails if I were confident in my ability to guess more than $50\%$ of the sequence. As it stands, however, asking about anything but the first index seems to be extremely high risk, low reward.

Day Fifteen: July 8

Here are some more betting games from previous interviews.

Problem 1. You have all the clubs from a deck, 13 cards, and you can choose 2 from the deck and get paid their product, where all face cards are considered to be 0. You can pay $\$1$ to reveal the difference of any two cards you choose. How much would you pay to play this game?
Solution. Our probability of getting a nontrivial payout (not picking a $J, Q, K$) just guessing off the top is $\frac{\binom{10}{2}}{\binom{13}{2}} = \frac{15}{26}$. We'll aim to improve on this and maximize our payout from this game. Let's index the cards as $C_1, C_2, \dots, C_{13}$, with $C_i$ denoting the $i^\text{th}$ card from the left. WLOG let $C_1$ be the card that we compare all of the others to. If $C_1$ is a face card, then there are $2$ more face cards among the next $12$ cards, so it will take at most $11$ comparisons for us to encounter a second face card, at which point we'll see that the difference between $C_1$ and the new face card is $0$. From there, we know that each previous comparison revealed the exact value of its card, which allows us to select the $9$ and $10$ for a $\$90$ payout. Note that this is the worst-case scenario: there is a chance we come across the second zero much earlier, in which case it's simply our job to find the $9$ and $10$ through direct comparison.

If the first card is not a face card, then there are $3$ face cards among the next $12$ cards that we compare to $C_1$. Then, within $11$ comparisons, we'll come across at least $2$ face cards, at which point we'll notice that two of the comparisons yield the same difference. The two cards that share a difference with $C_1$ must be face cards, and that common difference is simply the value of $C_1$. From there, we can reverse engineer the values of the rest of the cards we compared in order to find the $9$ and $10$. Again, this may happen much earlier than the $11^\text{th}$ comparison, in which case we'd simply find the $9$ and $10$ through further comparisons. From these two cases, we've confirmed that spending $\$11$ guarantees a payout of $\$90$. I'd then pay up to $\$79$ to play this game, which ensures that I always have a nonnegative profit. Note that $79$ is not the expected value of the game; we'd have to consider the multitude of cases that can arise with the cards, which is relatively tedious to calculate, so I'm inclined to pay up to, or maybe a bit more than, $\$79$. The reason I'm willing to go beyond $\$79$ is that paying that amount guarantees a nonnegative payout from a worst-case analysis, while the expected profit of playing is strictly positive since we haven't considered the game on average.

Problem 2. We randomly sample $a,b,c$ from a uniform distribution on $[0,1]$. What is the probability that $ax^2 + 2bx + c = 0$ has real roots? Solution. In order for the given quadratic to have real roots, we must have $4b^2 - 4ac \geq 0$, or $b^2 \geq ac$. This can be solved by finding the probability that $b^2 \geq ac$ on the unit cube, but I'm not particularly good at multivariable calculus, so I opted for a solution using random variables. For a uniformly distributed r.v. $X$ on $[0,1]$, $-\ln(X)$ follows an exponential distribution with $\lambda = 1$. Then, let $A,B,C$ be independent uniform random variables on $[0,1]$ and

$$P(B^2 \geq AC) = P\left(-\ln(B) \leq \frac{-\ln(A) - \ln(C)}{2}\right) = P\left(X \leq \frac{Y}{2}\right)$$

for $X$ exponential and $Y$ with pdf $xe^{-x}$ for $x>0$. Since $X$ and $Y$ are independent, we have

$$\int_0^\infty P\left(X \leq \frac{Y}{2} \,\Big|\, Y = x\right)f_Y(x) \, dx = \int_0^\infty \left[1 - e^{-x/2}\right]xe^{-x} \, dx = \frac{5}{9}$$

Then the probability that our quadratic equation has real roots is $\frac{5}{9}$. This solution relies on two facts: the negative log of a uniform distribution on $[0,1]$ is exponential, and the sum of two independent exponentials with parameter $\lambda$ has pdf $\lambda^2 xe^{-\lambda x}$. A more basic version of this problem with $ax^2 + bx + c = 0$ can be found here.
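
As a cross-check that sidesteps the exponential trick, we can condition on $A$ and $C$ directly: $B$ is uniform on $[0,1]$ and $\mathbb{E}[\sqrt{U}] = \int_0^1 \sqrt{u}\,du = \frac{2}{3}$ for a uniform $U$ on $[0,1]$, so

$$P(B^2 \geq AC) = \mathbb{E}\left[1 - \sqrt{AC}\right] = 1 - \mathbb{E}[\sqrt{A}]\,\mathbb{E}[\sqrt{C}] = 1 - \left(\tfrac{2}{3}\right)^2 = \tfrac{5}{9}$$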

Problem 3. Let's play a game: I give you a 12-sided die and will pay you whatever the die lands on. If you are unhappy with the roll, you can choose to roll two 6-sided dice instead and I will pay you their sum. How much are you willing to pay to play this game? Solution. This is a Jane Street interview question that I came across on their Glassdoor page. Note that the expected value of the second roll is $7$ (the sum of the expected values of two fair dice), so if we roll less than a $7$ on the $12$-sided die, we are incentivised to roll again. This occurs with probability $\frac{1}{2}$, so we have a $\frac{1}{2}$ chance of an expected $\$7$ payoff. On the other hand, with probability $\frac{1}{2}$ we roll at least a $7$, giving an expected payoff of $\frac{1}{6}(7 + 8 + 9 + 10 + 11 + 12) = 9.5$ for that scenario. Then the total expected value is $\frac{1}{2} \cdot 7 + \frac{1}{2} \cdot 9.5 = 8.25$, so I'd pay up to $\$8.25$ to play this game, assuming I use the optimal strategy described above.

Problem 4. Given $n$ non-negative integers $a_1, a_2, \dots, a_n$, where each represents a point at coordinate $(i, a_i)$, $n$ vertical lines are drawn such that the two endpoints of line $i$ are $(i, a_i)$ and $(i, 0)$. Find two lines which, together with the x-axis, form a container that holds the most water. Solution. This is another Leetcode medium question. The brute force solution compares every possible pair of lines and the volumes of the containers they form, which runs in $O(n^2)$ time. Instead, we can approach the question with the following intuition: given two lines, the area of the container they form is limited by the height of the shorter line. In hopes of finding a taller line, we shift the pointer at the shorter line one position inward, which decreases the width but adds the potential of increasing the limiting height. After each move we check the area again and update the maximum area when applicable. We move the shorter line rather than the taller one, since moving the taller line would decrease the width with no possibility of improved height. This method won't always give a larger area at each step, but it covers every possibility of a taller line adding area as we traverse the array. This gives us the following solution:

class Solution {
public:
    int maxArea(vector<int>& height) {
        int size = height.size();
        int left = 0;  
        int right = size - 1;
        int maxArea = 0;
        
        while(left != right){
            maxArea = max(maxArea, containerArea(height[left], height[right], right - left));
            
            if(height[right] > height[left]){
                left++;
            }else{
                right--;
            }
        }
        return maxArea;
    }
    
    int containerArea(int lHeight, int rHeight, int width){
        return width * min(lHeight, rHeight);
    }
};

This solution runs in $O(n)$ time, but it does waste some work. When we shift from one line to another, we do so in hopes that a taller line will compensate for the smaller width. We can skip the intermediate checks by remembering the height of the line we just abandoned and advancing until we find something taller, for example replacing left++ with int prev = height[left]; while (left < right && height[left] <= prev) { left++; } (and doing the analogous thing for right), which saves the iterations the original while loop would otherwise waste on lines that can't improve the area.

Day Sixteen: July 9

Here are a couple more solutions to some leetcode and past interview questions at trading firms.

Problem 1. What is the expected value of the largest of four dice rolls? Solution. This is a pretty straightforward order statistic question. Let $X_i$ be the random variable denoting the outcome of the $i^\text{th}$ die roll. Our goal is to find the pmf of $Z = \max\{X_1, X_2, X_3, X_4\}$, which is the $4^\text{th}$ order statistic. One thing to keep in mind is that the four dice rolls are independent, so we can multiply their probabilities together when combining cdfs. Then,

$$\begin{aligned} P(Z \leq z) &= P(X_1, X_2, X_3, X_4 \leq z) = P(X_1 \leq z) \cdots P(X_4 \leq z)\\ &= P(X_1 \leq z)^4 = \left(\frac{z}{6} \right)^4 \end{aligned}$$

Since each $X_i$ is discretely distributed, we have that $P(X_i = x) = P(X_i \leq x) - P(X_i \leq x-1)$, so $p_{Z}(z) = \frac{z^4 - (z-1)^4}{6^4}$. We can directly use this pmf to calculate the expected value of $Z$:

$$\mathbb{E}[Z] = \sum_{z} p_Z(z)\,z = \sum_{z=1}^6 z \cdot \frac{z^4 - (z-1)^4}{6^4} = \frac{6797}{1296} \approx 5.24$$

Then, the expected value of the largest of four independent dice rolls is about $5.24$. Note that we can extend this problem to $n$ dice rolls by employing the same strategy: calculate the cdf of $Z = \max\{X_1 , \dots, X_n\}$, where $X_1, \dots, X_n$ are the outcomes of $n$ independent dice rolls, then apply the definition of expected value, as sketched below.
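
Here's a small sketch of that extension to $n$ dice, evaluating the same pmf numerically:

#include <cmath>
#include <iostream>

// Expected value of the maximum of n independent six-sided dice, using
// P(Z <= z) = (z/6)^n and p(z) = (z/6)^n - ((z-1)/6)^n.
double expectedMax(int n) {
    double ev = 0.0;
    for (int z = 1; z <= 6; z++) {
        double p = std::pow(z / 6.0, n) - std::pow((z - 1) / 6.0, n);
        ev += z * p;
    }
    return ev;
}

int main() {
    std::cout << expectedMax(4) << "\n";   // prints roughly 5.2446
    std::cout << expectedMax(10) << "\n";  // more dice push the expectation toward 6
    return 0;
}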

Problem 2. If I toss 3 Tic Tacs onto a tic-tac-toe board at random, and I pay you $\$20$ if the pieces create a tic-tac-toe and you pay me $\$1$ if they don't, do you want to play the game? Solution. This is a past interview question at SIG. My intuition led me to solve it combinatorially since it's a variant of the balls-and-boxes question from a couple of days ago. Assuming the Tic Tacs don't have to land on distinct squares of the board, there are $\binom{9}{3} + \binom{9}{1}\binom{8}{1} + \binom{9}{1} = 165$ ways to arrange the Tic Tacs on the board (this can also be computed as $\binom{9+3-1}{3}$). Additionally, there are $8$ arrangements that create a tic-tac-toe ($6$ across the rows and columns plus $2$ for the diagonals). Then, treating each arrangement as equally likely, our expected payout for this game is

$$\mathbb{E}[X] = \frac{8}{165} \cdot 20 + \frac{157}{165} \cdot (-1) \approx 0.018$$

Our expected payout from playing this game is about $2$ cents per turn, so I would play. The original question on SIG's Glassdoor page can be found here; it documents the scenario in which all the Tic Tacs land on distinct squares and the payout for a tic-tac-toe is only $9$ dollars instead of $20$.

Problem 3. Flip a fair coin repeatedly until either the pattern $HHT$ or the pattern $HTT$ occurs. What's the probability that $HHT$ occurs before $HTT$? Solution. This is a classic question that has a relatively elegant solution. We could model this with a Markov chain, but that approach is rather extensive and unnecessary. Note that since both sequences begin with $H$, any leading tails can be ignored up until the appearance of the first $H$. Then the three flips starting from that first $H$ take each of the following forms with probability $\frac{1}{4}$:

$HHH$: $HHT$ will appear first as soon as the next tails shows up
$HHT$: $HHT$ appears first
$HTT$: $HTT$ appears first
$HTH$: the sequence is effectively reset, since only the trailing $H$ matters and we are back in the same position as after the first $H$

Note that we don't consider the $4$ three-flip outcomes that begin with a tail, due to our earlier observation. The first two cases guarantee that $HHT$ is completed first, the third guarantees $HTT$ appears first, and the final case is a reset. Then $HHT$ appears first with probability $\frac{2}{3}$, twice the probability of $HTT$ appearing first. You can find a similar question dealing with consecutive tennis wins here.
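
To spell out the reset step, let $p$ be the probability that $HHT$ appears first; conditioning on the three flips starting at the first head gives

$$p = \tfrac{1}{4}(1) + \tfrac{1}{4}(1) + \tfrac{1}{4}(0) + \tfrac{1}{4}\,p \implies p = \tfrac{2}{3}$$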

Problem 4. You roll a die until you get a $5$. What is the expected value of the minimum value rolled? Solution. This is a past Jane Street interview question. It can be solved with the tail sum formula, $\mathbb{E}[X] = \sum_{x=1}^\infty P(X\geq x)$, but I approached it combinatorially. Let $X$ be the minimum value rolled up to and including the first $5$. The maximum possible value of $X$ is $5$, since any $6$s rolled never become the minimum once the $5$ appears, so $\mathbb{E}[X] = \sum_{x=1}^5 x\,P(X = x)$. Consider $P(X=1)$: since $1$ is the smallest possible value for $X$, we don't care whether $2, 3, 4$ are rolled before the first $5$; we only need a $1$ to be rolled before the first $5$, which happens with probability $\frac{1}{2}$. For $P(X=2)$ we need $2$ to be rolled before the $5$ and $1$ to not be rolled before the $5$. Of the $3!$ equally likely relative orderings of the first appearances of $\{1,2,5\}$, only one ($251$) satisfies this condition, so $P(X=2) = \frac{1}{3!} = \frac{1}{6}$. For $P(X=3)$ we need $3$ rolled before the $5$ and both $1$ and $2$ rolled after it. Only $3512$ and $3521$ satisfy this, giving probability $\frac{2}{4!} = \frac{1}{12}$. For $P(X=4)$ we need $4$ rolled before the $5$ and $1, 2, 3$ rolled afterward, which has probability $\frac{3!}{5!}$. Finally, for $P(X=5)$ we need the $5$ to come before all of $1, 2, 3, 4$, which has probability $\frac{4!}{5!} = \frac{1}{5}$. This gives

$$\mathbb{E}[X] = \sum_{x=1}^5 x\,P(X=x) = \frac{1}{2} + 2 \cdot \frac{1!}{3!} + 3 \cdot \frac{2!}{4!} + 4 \cdot \frac{3!}{5!} + 5 \cdot \frac{4!}{5!} = \frac{137}{60}$$

Then the expected value of the minimum value rolled is $\frac{137}{60} \approx 2.28$.
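
For comparison, the tail-sum route mentioned above is also quick: the minimum is at least $x$ exactly when the $5$ appears before all of $1, \dots, x-1$, i.e. when $5$ is first among the $x$ values $\{1, \dots, x-1, 5\}$, so $P(X \geq x) = \frac{1}{x}$ for $x = 1, \dots, 5$ and

$$\mathbb{E}[X] = \sum_{x=1}^{5} P(X \geq x) = 1 + \tfrac{1}{2} + \tfrac{1}{3} + \tfrac{1}{4} + \tfrac{1}{5} = \tfrac{137}{60}$$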

Day Seventeen: July 10

Today is the last day I’ll be consistently updating my website with new problems but stay tuned below for more resources in the future!

Problem 1. Suppose there are $n$ points on a circle. How many regions are formed by connecting all of these points to one another? Solution. This is a problem that can be solved by studying the first few cases. Let $R(n)$ be the number of regions formed by $n$ points, and let $R'(n)$ be our guess at $R(n)$. We have $R(1)=1$ and $R(2)=2$, and at first glance the recurring pattern looks like $R'(n) = 2^{n-1}$. However, this breaks down at $n=6$, since $R(6) = 31 \neq 32$. We can instead approach the problem by considering the number of regions that each new chord between points creates. For $n$ points, connecting each pair gives $\binom{n}{2}$ chords. In the cases $n=1, 2$ we have $\binom{n}{2} = 0, 1$ respectively, while $R(1) = 1$ and $R(2) = 2$: in both cases there is one more region than there are chords, which makes sense because $R(0) = 1$ and one region is always present before we consider any points on the circle. So we might guess $R'(n) = \binom{n}{2} + 1$. This works through $n = 3$ but breaks down from $n = 4$ onward: $R(4) = 8$, $R(5) = 16$, and $R(6) = 31$, while the estimated values are $R'(4) = 7$, $R'(5) = 11$ and $R'(6) = 16$. The difference between $R(n)$ and $R'(n)$ is explained by the number of intersections between chords inside the circle (convince yourself of this: each interior intersection splits an existing region in two). Each intersection comes from a pair of crossing chords, which is determined by $4$ points, so there are $\binom{n}{4}$ intersections for $n$ points in general position. Then $R(n) = \binom{n}{2} + \binom{n}{4} + 1$ (you can check that adding $\binom{n}{4}$ doesn't break any of the smaller cases). For a proof of why $\binom{n}{2} + \binom{n}{4} + 1 = 2^{n-1}$ for $n \in [1,5]$, see the top comment on this Stack Exchange post.
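
Here's a quick check of the formula against the powers of two; the two columns agree only up to $n = 5$:

#include <iostream>

// Compare R(n) = C(n,2) + C(n,4) + 1 with 2^(n-1).
long long nCk(long long n, long long k) {
    if (k < 0 || k > n) return 0;
    long long result = 1;
    for (long long i = 1; i <= k; i++) result = result * (n - k + i) / i;
    return result;
}

int main() {
    for (int n = 1; n <= 8; n++) {
        long long regions = nCk(n, 2) + nCk(n, 4) + 1;
        std::cout << n << ": " << regions << " vs " << (1LL << (n - 1)) << "\n";
    }
    return 0;
}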

Problem 2. You are given a 100-sided die. After you roll once, you can choose to either get paid the dollar amount of that roll OR pay one dollar for one more roll. What is the expected value of the game? Solution. Let $X$ be the payout from playing this game. For this approach I'm assuming that I'll want to stop playing the game once I roll above some threshold $x\in [1,100]$, and I'll find the value of $x$ that maximizes $\mathbb{E}[X]$. In my opinion, this approach is flawed as we aren't keeping track of the amount of money we've already paid, which prevents us from maximizing profit, but this might be subverted by the fact that we're calculating the EV of the game overall. So our strategy is to keep our winnings if we roll $X > x$ and keep rerolling if we get $X\leq x$. An interesting thing to note is that since each roll of the die is independent, every reroll has exactly the same conditions and distribution as the first roll (except that we've paid a dollar per reroll). Then on our first roll we have one of two cases:
1. we roll $X> x$, with probability $\frac{100 - x}{100}$ and expected winnings $\frac{101 + x}{2}$
2. we roll $X\leq x$, with probability $\frac{x}{100}$, and we pay $\$1$ to reroll for an expected payout of $\mathbb{E}[X] - 1$
Combining these two, we have

$$\begin{aligned} \mathbb{E}[X] &= \frac{100 - x}{100} \cdot \frac{101 + x }{2} + \frac{x(\mathbb{E}[X] - 1)}{100}\\ &= \frac{101 + x}{2} - \frac{x}{100-x} \end{aligned}$$

We can then differentiate to find the maximum:

$$\begin{aligned} \frac{dE}{dx} &= \frac{-x}{(100-x)^2} - \frac{1}{100-x} + \frac{1}{2}\\ &= \frac{x^2 - 200x + 9800}{2(x-100)^2} = 0 \end{aligned}$$

Solving for $x$ gives $x\approx 85.86$. Since $x$ can only take on discrete values, we check $85$ and $86$ to find that $\mathbb{E}[X \mid x = 85] = 87.33$ and $\mathbb{E}[X \mid x = 86] = 87.36$. Then, the expected value of the game under this optimal threshold strategy is about $\$87.36$.
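
A quick numerical scan of the threshold strategies, using the closed form $\mathbb{E}[X] = \frac{101+x}{2} - \frac{x}{100-x}$ derived above, confirms the optimum:

#include <iostream>

int main() {
    // Evaluate the threshold strategy "reroll (paying $1) whenever the roll is <= x"
    // for every threshold and report the best one.
    double best = 0.0;
    int bestX = 0;
    for (int x = 0; x <= 99; x++) {
        double ev = (101.0 + x) / 2.0 - static_cast<double>(x) / (100 - x);
        if (ev > best) { best = ev; bestX = x; }
    }
    std::cout << bestX << " " << best << "\n";  // prints 86 87.3571
    return 0;
}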

Problem 3. Suppose two players are playing a game. The first player chooses an integer between 1-30. The second player picks a different one. After that, a random integer $x$ between 1-30 is generated. The player whose number is closer to $x$ gets paid $x$ dollars. Would you go first or second? What number should you pick? Solution. Let $m$ and $n$ be Player 1's and Player 2's choices respectively. Player 2 maximizes his payout by choosing $n$ adjacent to $m$, but he must decide whether it is optimal to choose just above or just below Player 1's number. In order for Player 2 to choose above, the expected value of claiming the numbers above $m$ must be greater than the expected value of claiming the numbers below $m$, that is

$$\begin{aligned} \frac{1}{30} \sum_{i = m + 1}^{30} i &> \frac{1}{30} \sum_{i = 1}^{m-1} i \\ \frac{30 \cdot 31}{2} &> \frac{m(m+1)}{2} + \frac{m(m-1)}{2} = m^2\\ m &< \sqrt{465} \approx 21.56 \end{aligned}$$

Then, should Player 1 choose $m \leq 21$, Player 2 would choose $n = m+1$, and should Player 1 choose $m\geq 22$, Player 2 would choose $n = m-1$. Player 1's expected value is clearly maximized at either $m=21$ or $m=22$, and a quick calculation finds that $\mathbb{E}[21] = \frac{77}{10}$ and $\mathbb{E}[22] = \frac{78}{10}$. Then Player 1 would choose $22$ and Player 2 would choose $21$. Since Player 1's expected payoff is slightly higher than Player 2's, I would choose to go first and pick $22$.
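
Here's a small brute-force sketch of the game, assuming Player 2 best-responds with an adjacent number (an adjacent pick dominates any farther pick on the same side); it confirms that Player 1 should pick $22$ for an expected payoff of $7.8$:

#include <cstdlib>
#include <iostream>

// For each Player 1 pick m, let Player 2 pick the better of m - 1 and m + 1,
// then pay x dollars to whichever player is closer to a uniform x in 1..30.
int main() {
    double bestPayoff = 0.0;
    int bestM = 0;
    for (int m = 1; m <= 30; m++) {
        int candidates[2] = {m - 1, m + 1};
        double bestP2 = -1.0, p1Payoff = 0.0;
        for (int n : candidates) {
            if (n < 1 || n > 30) continue;
            double p1 = 0.0, p2 = 0.0;
            for (int x = 1; x <= 30; x++) {
                if (std::abs(x - m) < std::abs(x - n)) p1 += x / 30.0;
                else p2 += x / 30.0;  // no ties are possible since |m - n| = 1
            }
            if (p2 > bestP2) { bestP2 = p2; p1Payoff = p1; }
        }
        if (p1Payoff > bestPayoff) { bestPayoff = p1Payoff; bestM = m; }
    }
    std::cout << bestM << " " << bestPayoff << "\n";  // prints 22 7.8
    return 0;
}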

Problem 4. A survey is given to all passengers on a number of different planes. The survey asks each person how full their plane was. The people answer honestly. If 50% of people claim that their plane was 80% full, while the other 50% claim that their plane was 20% full, how full was the average plane? Solution. This is a relatively simple question that's intuitively misleading. An incorrect approach would be to average the two reported figures for an answer of 50% fullness. Instead, note that since the number of people who claim their plane was 80% full equals the number who claim theirs was 20% full, and an 80%-full plane carries four times as many passengers as a 20%-full plane of the same size, there must be $4$ times as many planes flying at 20% capacity as at 80%. For example, one plane with 80 of 100 seats filled is matched by four planes with 20 of 100 seats filled, and those five planes average $\frac{160}{500} = 32\%$ full. Then, the average capacity of each plane is

$$\frac{1}{5}\cdot 80\% + \frac{4}{5} \cdot 20\% = 32\%$$

Note that the first approach is wrong because averaging the reported percentages implicitly assumes there are equally many planes of each type, when equal numbers of passengers actually implies many more of the emptier planes. You can find this question on Jane Street's Glassdoor.

Extra Problems

Problem. Given an array of length $n$, find the largest possible sum of consecutive elements of the array. Solution. This question is one of the classic dp questions from Brian Dean's academic site. The approach is essentially Kadane's algorithm: keep track of the maximum sum of a subarray ending at the current index and, for each new element, decide whether to extend that subarray or start fresh from the new element (extending is only worth it when the running sum is positive). Along the way, we update the overall maxSum by comparing it against the best subarray ending at the current position.

int maxContiguousSum(int arr[], int n){
    int maxSum = arr[0];  // best sum over any subarray seen so far
    int maxSub = arr[0];  // best sum over subarrays ending at the current index

    for(int i = 1; i < n; i++){
        // either extend the previous subarray or start fresh at arr[i]
        maxSub = max(arr[i], maxSub + arr[i]);
        maxSum = max(maxSum, maxSub);
    }

    return maxSum;
}

This solution runs in $O(n)$ time.

Problem. We randomly sample from a uniform distribution on $[0,1]$ until the sum of all sampled values exceeds $1$. What is the expected number of samples we take? Solution. Let $N$ be the number of samples we take for the running sum to exceed $1$. Since $N$ is a discrete random variable, we would like to find its pmf $P(N = n)$. We can find this by taking $P(N \leq n) - P(N < n) = P(N \leq n) - P(N \leq n-1)$, or equivalently $P(N > n - 1) - P(N > n)$. Note that $P(N > n)$ means that it takes more than $n$ samples to exceed $1$, or equivalently that the first $n$ uniform r.v.'s that we draw sum to less than $1$. The probability of this occurring is simply $\frac{1}{n!}$, as shown in Day 5 Problem 5. Then, we have

$$P(N= n) = \frac{1}{(n-1)!} - \frac{1}{n!} = \frac{n-1}{n!}$$

where $n = 1, 2, 3, \dots$ We can use the definition of expected value to solve for the expected number of samples we need to take:

$$\begin{aligned} \mathbb{E}[N] &= \sum_{n=1}^\infty nP(N=n) = \sum_{n=1}^\infty \frac{n(n-1)}{n!} = \sum_{n=2}^\infty \frac{1}{(n-2)!}\\ &= \sum_{n=0}^\infty \frac{1}{n!} = e \end{aligned}$$

Then we expect to draw $e$ times in order for the sum to exceed $1$.
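
As a final sanity check, a quick Monte Carlo sketch of this experiment lands near $e \approx 2.718$:

#include <iostream>
#include <random>

// Estimate the expected number of uniform draws needed for the running sum
// to exceed 1; the answer should come out near e = 2.71828...
int main() {
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    const int trials = 1000000;
    long long totalDraws = 0;
    for (int t = 0; t < trials; t++) {
        double sum = 0.0;
        int draws = 0;
        while (sum <= 1.0) {
            sum += uniform(gen);
            draws++;
        }
        totalDraws += draws;
    }
    std::cout << static_cast<double>(totalDraws) / trials << "\n";  // roughly 2.718
    return 0;
}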