Deadline: 2021-11-14 11:59pm
To submit your work, simply push it to the dedicated repository created for your group.
We will grade only the latest version of the files pushed before the deadline; any later modifications will be ignored.
The objectives of this homework assignment are the following:
This project must be done using GitHub and respect the following requirements:
You can create one or several RMarkdown files to answer the following problems:
In this problem, we consider a Monte-Carlo approach for approximating the area of a shape
Let us assume that the x and y coordinates are uniformly distributed between 0 and 1, that is
If you want to understand why this method works, see the bottom of the page.
a) Create a function named find_area() that approximates the area of the shape. It should have three arguments:

- B: the number of points used for the approximation, with default value 5000;
- seed: a positive integer that controls the generation of random numbers, with default value 10;
- make_plot: a Boolean value that controls whether or not a graph should be made (see below for details; use FALSE as default).

Your function should look like:
find_area <- function(B = 5000, seed = 10, make_plot = FALSE){
# Control seed
set.seed(seed)
# Simulate B points
point = matrix(runif(2*B, 0, 1), nrow = B, ncol = 2)
...
return(area_hat)
}
When enabling the plot by setting make_plot = TRUE, the function find_area() should produce a graph showing the square, the shape, and the B points in two distinct colors according to whether each point falls inside or outside the circle. See below for an example.
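A possible completion of the skeleton above, as a sketch only: it assumes the shape is the circle of radius 0.5 centered at (0.5, 0.5), inscribed in the unit square, whose exact area is π/4.

```r
# Sketch of find_area(), assuming the shape is the circle of radius 0.5
# centered at (0.5, 0.5), inscribed in the unit square.
find_area <- function(B = 5000, seed = 10, make_plot = FALSE) {
  # Control seed
  set.seed(seed)
  # Simulate B points uniformly in the unit square
  point <- matrix(runif(2 * B, 0, 1), nrow = B, ncol = 2)
  # Check which points fall inside the circle
  inside <- (point[, 1] - 0.5)^2 + (point[, 2] - 0.5)^2 <= 0.25
  # The proportion of points inside estimates the area of the shape
  area_hat <- mean(inside)
  if (make_plot) {
    plot(point, col = ifelse(inside, "blue", "red"), pch = 20,
         asp = 1, xlab = "x", ylab = "y")
  }
  return(area_hat)
}
```

For example, find_area(B = 10^6) should return a value close to pi/4, about 0.7854.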
b) Verify that, by running find_area(B = 10^6, make_plot = TRUE), the function returns the value
Global Navigation Satellite Systems, or GNSS, are systems with global coverage that use satellites to provide autonomous geo-spatial positioning. They allow small electronic receivers to determine their location (longitude, latitude, and altitude/elevation) to high precision (within a few meters) using time signals (i.e. “distances”, informally speaking) transmitted along a line of sight by radio from satellites. Currently, there exist only three global operational GNSS: the United States' Global Positioning System (GPS), Russia’s GLONASS and the European Union’s Galileo. However, China is in the process of expanding its regional BeiDou Navigation Satellite System into a global system by 2020. Other countries, such as India, France or Japan, are in the process of developing regional and global systems.
Obviously, GNSS are very complex systems, and in this exercise we will consider an extremely simplified setting to illustrate the basic concepts behind satellite positioning. If you are interested in learning more about GNSS, an excellent introduction to get started on this topic can be found here: “An Introduction to GNSS”.
For simplicity, let us start by assuming that the earth is a motionless perfect circle in a two-dimensional space. Next, we assume that three motionless GNSS-like satellites are placed around the earth. The position of these satellites is assumed to be known and we will assume that they are synchronized (i.e. they all have the same “time”). Our simplified setting can be represented as follows:
Now, suppose that you are somewhere on our flat earth with a GNSS-like receiver. The way you will be able to compute your position with such a system is by first determining the distance (or a closely related notion) between yourself and each satellite. The computation of these distances is done by comparing the time at which a signal is emitted by a satellite and the time at which it is received by your device. More precisely, let
| Satellite | Position (x) | Position (y) |
|-----------|--------------|--------------|
| 1 | -300 | 300 |
| 2 | 300 | 300 |
| 3 | 0 | -300 |
Finally, we let
While there exist several methods to solve such an estimation problem, the most common is known as the “least-squares adjustment” and will be used in this problem. It is given by:
a) Write a general function named get_position() that takes as a single argument a vector of observed distances and returns an object of an appropriate class having custom summary and plot functions. For example, suppose we observe the vector of distances
position = get_position(c(453.2136, 288.8427, 418.3106))
summary(position)
## The estimated position is:
## X = 99.9958
## Y = 100.003
plot(position)
Note that inside the get_position() function, you need to estimate the position; you can use the R function optim() for this purpose. It has the syntax:
optim(par = starting_values, fn = my_objective_function, arg1 = arg1, ..., argN = argN)
The arguments are:

- starting_values: the starting values for the optimization, here a vector of three values;
- my_objective_function: the objective function, here the one we defined above. Its first argument must be the vector of parameters to optimize;
- arg1 to argN: additional arguments passed on to the objective function.

b) Generalize the function get_position()
of point a) to also accept a matrix of observed distances (but keep only one argument!).
c) Verify that the function you wrote at point b) displays the same graph when running
position = get_position(dist_mat)
summary(position)
where the inputted matrix is
## [,1] [,2] [,3]
## [1,] 458.9474 337.1013 363.1112
## [2,] 337.0894 458.9355 363.0993
## [3,] 442.5835 442.5835 283.9493
## [4,] 520.1845 520.1845 184.0449
## [5,] 534.1411 499.0299 191.3455
## [6,] 499.1322 534.2434 191.4479
## [7,] 542.0904 470.4216 212.7515
## [8,] 470.4070 542.0758 212.7369
## [9,] 541.6032 429.4569 250.9978
## [10,] 429.4120 541.5583 250.9528
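The core of get_position() for point a) can be sketched as follows. This is only an illustration under stated assumptions: the satellite positions are those of the table above, the objective is a plain least-squares criterion, and the three starting values correspond to the two position coordinates plus a common offset (clock-error-like) term added to every distance; adapt it to the objective function actually defined in the problem.

```r
# Satellite positions from the table above (one row per satellite)
satellites <- matrix(c(-300,  300,
                        300,  300,
                          0, -300), ncol = 2, byrow = TRUE)

# Least-squares objective: theta = (x, y, offset); the offset term is an
# assumption here, standing in for a receiver clock-error component.
my_objective_function <- function(theta, distances) {
  d_true <- sqrt((satellites[, 1] - theta[1])^2 +
                 (satellites[, 2] - theta[2])^2)
  sum((distances - d_true - theta[3])^2)
}

get_position <- function(distances) {
  fit <- optim(par = c(0, 0, 0), fn = my_objective_function,
               distances = distances)
  structure(list(estimate = fit$par[1:2]), class = "position")
}

# Custom summary method for the "position" class
summary.position <- function(object, ...) {
  cat("The estimated position is:\n")
  cat("X =", object$estimate[1], "\n")
  cat("Y =", object$estimate[2], "\n")
}
```

With the vector of distances from point a), this sketch recovers a position close to (100, 100).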
This exercise is dedicated to coding your own “logistic regression” using an arbitrary probability function and the optim() function.
Recall the tutorial on logistic regression, where a probability of success is defined by the sigmoid function
Similarly to the tutorial, we further express
We will compare a standard glm
fit for the titanic
data set with family = binomial(link = "logit")
to our own model.
Your goal is to code the log-likelihood function, fit the standard glm model on the training set (follow the tutorial), fit the Cauchy model, and finally compare the results on the test set.
(a) Load the titanic data set. Create a training set data_train and a test set data_test, keeping the Survived, Pclass, Sex, SibSp, Parch, Fare columns. Fit the logistic regression on the training set with Survived being the response variable.
(b) Code the negative log-likelihood function nll(w, X, y), where w is a vector of unknown coefficients, X is a matrix of predictors, and y is the response variable. Use the optimisation function optim() with target nll and method = "BFGS".
(c) For both models, calculate the survival probability on data_test. Using a threshold of 0.5, predict survival, build the confusion matrices with table(), and compare the accuracies of both models. Which model performs better?
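As an illustration of point (b), here is a sketch of nll() together with an optim() call. It assumes the success probability is modeled with the standard Cauchy CDF (pcauchy in R) applied to the linear predictor, in place of the sigmoid; the simulated X and y below are hypothetical stand-ins for the titanic design matrix and response.

```r
# Negative log-likelihood, assuming p = pcauchy(X %*% w)
nll <- function(w, X, y) {
  p <- pcauchy(X %*% w)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

# Hypothetical example on simulated data (not the titanic data):
set.seed(10)
X <- cbind(1, rnorm(500))                      # intercept + one predictor
y <- rbinom(500, 1, pcauchy(X %*% c(-0.5, 1))) # simulate from the model
fit <- optim(par = rep(0, ncol(X)), fn = nll, X = X, y = y,
             method = "BFGS")
fit$par  # estimated coefficients, close to c(-0.5, 1) here
```

On the real data, X would be the model matrix built from the selected columns of data_train (including an intercept column) and y the Survived variable.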
This exercise aims to be a guided tutorial towards building your own (simple) neural network. It is entirely based on James Loy’s tutorial on how to do so on Python.
A neural network is, briefly speaking, a way to map input data to output data using mathematical functions. In a neural network, you have inputs, call them
For simplicity in this tutorial, we will consider the biases to be equal to 0 and the network to only have a single hidden layer. The following picture, taken again from James Loy’s tutorial, should graphically clarify the situation.
The output of a neural network with one hidden layer is given by considering the following function:
We will also impose a parametric form for
We will start by imposing some random values to the weights and we will update them by “training” the neural network, which consists of two steps: the feedforward step and the back propagation step:
Graphically, the following picture should clarify the situation, taken again from James Loy’s tutorial:
Now that the setting is clarified, we will build the neural network step by step.
(a) Create a matrix containing the following 4 observations (inputs):
and a vector containing the following 4 outcomes, corresponding to the 4 observations above:
(b) Generate a 3x4 matrix of weights
(c) Create and store the sigmoid function (as a function of a generic x
) into R
. Then, create the derivative of this function.
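A minimal sketch for point (c):

```r
# Sigmoid function of a generic x
sigmoid <- function(x) {
  1 / (1 + exp(-x))
}

# Its derivative, using the identity sigma'(x) = sigma(x) * (1 - sigma(x))
sigmoid_derivative <- function(x) {
  sigmoid(x) * (1 - sigmoid(x))
}
```

Note that in James Loy's tutorial the derivative is applied to values that have already been passed through the sigmoid, in which case it is coded as x * (1 - x) instead.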
(d) Create and store the loss function (as a function of a generic neural net as created in part b)) into R
. For the purpose of this exercise, we assume that the loss (or cost) is simply the sum of the squared errors, i.e.
(e) Create the feedforward function (as a function of a generic neural net as created in part b)). To do so, use the sigmoid function from point c): in particular, assign to the first layer of the neural net the value of the sigmoid of the matrix product of the input layer with the weights (the matrix product operator is %*% in R). Do the operations in the order mentioned.
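A sketch of point (e), assuming the neural net is stored as a list with elements input, weights1, weights2, layer1 and output (a hypothetical representation; adapt the element names to your own object from part b)):

```r
sigmoid <- function(x) 1 / (1 + exp(-x))

# Feedforward: compute the hidden layer first, then the output.
feedforward <- function(nn) {
  nn$layer1 <- sigmoid(nn$input %*% nn$weights1)
  nn$output <- sigmoid(nn$layer1 %*% nn$weights2)
  nn
}
```

Returning the modified list (rather than relying on side effects) is the idiomatic way to "update" an object in R.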
(f) Create the back propagation function (as a function of a generic neural net as created in part b)). To do so, we will update the weights by a gradient descent-type procedure. This works as follows:
Hints: the transpose in the formula above corresponds to the t() function in R. The layer values involved are those stored by the feedforward function. Also, note that the last part is simply the sigmoid derivative evaluated at the output created using the feedforward function above. So both of these elements are stored in the neural net you input to the backpropagation function, and therefore the above derivative should not be that hard to write.
The notation is as before. Note that
Then, update weights1 by adding the derivative with respect to
(g) “Train” the neural network: iterate feedforward and backpropagation 1500 times on the neural net created in b). Store the values of the loss function along the iterations to be able to plot them. Then display the predicted vs the actual observations.
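Putting the points above together, here is a hedged end-to-end sketch. The input matrix and outcome vector are the ones used in James Loy's tutorial (an assumption, since they are not reproduced above), the loss is the sum of squared errors, and the weight updates follow the gradient-descent derivatives described in point (f).

```r
sigmoid <- function(x) 1 / (1 + exp(-x))
# Derivative applied to values already passed through the sigmoid
sigmoid_derivative <- function(x) x * (1 - x)

# Inputs and outcomes as in James Loy's tutorial (assumed)
X <- matrix(c(0, 0, 1,
              0, 1, 1,
              1, 0, 1,
              1, 1, 1), nrow = 4, byrow = TRUE)
y <- matrix(c(0, 1, 1, 0), ncol = 1)

set.seed(10)
nn <- list(input    = X,
           weights1 = matrix(runif(12), nrow = 3, ncol = 4),
           weights2 = matrix(runif(4),  nrow = 4, ncol = 1),
           y = y, output = matrix(0, 4, 1))

# Loss: sum of squared errors
loss <- function(nn) sum((nn$y - nn$output)^2)

feedforward <- function(nn) {
  nn$layer1 <- sigmoid(nn$input %*% nn$weights1)
  nn$output <- sigmoid(nn$layer1 %*% nn$weights2)
  nn
}

backpropagation <- function(nn) {
  d_output   <- 2 * (nn$y - nn$output) * sigmoid_derivative(nn$output)
  d_weights2 <- t(nn$layer1) %*% d_output
  d_weights1 <- t(nn$input) %*%
    ((d_output %*% t(nn$weights2)) * sigmoid_derivative(nn$layer1))
  nn$weights1 <- nn$weights1 + d_weights1
  nn$weights2 <- nn$weights2 + d_weights2
  nn
}

# Train: 1500 iterations, tracking the loss along the way
loss_history <- numeric(1500)
for (i in 1:1500) {
  nn <- feedforward(nn)
  nn <- backpropagation(nn)
  loss_history[i] <- loss(nn)
}
# plot(loss_history, type = "l")               # loss along the iterations
# cbind(predicted = round(nn$output, 3), actual = y)  # predicted vs actual
```

The loss should decrease steadily and the predictions should end up close to the 0/1 outcomes.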
To understand why the method presented indeed calculates the area of the shape, let us show that
Let
Finally, to estimate the area of
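The key step can be sketched as follows (assuming the shape S is contained in the unit square). Since (X, Y) is uniform on the unit square, its joint density equals 1 there, so

$$
\Pr\{(X, Y) \in S\} \;=\; \iint_{S} f_{X,Y}(x, y)\, dx\, dy \;=\; \iint_{S} 1 \, dx\, dy \;=\; \operatorname{area}(S).
$$

By the law of large numbers, the proportion of simulated points falling in S, namely $\frac{1}{B}\sum_{b=1}^{B} I\{(X_b, Y_b) \in S\}$, therefore converges to area(S) as B grows, which justifies the Monte-Carlo approximation used above.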