
CS 180 Final Project Report

Aathreya Kadambi

Getting to build a NeRF! Exciting! :-)

Neural Fields

Neural Radiance Fields

Adventure Log

8. The Penultimate Stage: Pain and Perseverance (hehehe, that sounds pretty cool)

After several iterations, I noticed that sometimes the model wouldn't train well at all (note: I was using a batch size of 5000 instead of 10000 due to memory limitations; even on Colab, 5000 was working better for me for some reason). Either way, the training results were highly unpredictable. Sometimes the model would barely train, and other times it would get the expected results (as good as the staff solution and what other people posted on the Ed). I'm not yet sure why this is the case, but here are some GIFs from the 1000, 2000, 3000, and 4000 iteration checkpoints of my best model so far:
[Four GIFs: renders at the 1000, 2000, 3000, and 4000 iteration checkpoints]
These aren't too bad, but they aren't quite as good as I wanted. Based on what I saw in the Ed, perhaps there was an issue somewhere in my sampling. The results were highly inconsistent though, so I thought maybe there was something wrong with my model.

I'm so cooked, I've been working on this for hours... and my results have only gotten worse! I read one too many motivational quotes and thought to "break things so that I could rebuild them better". But I only really accomplished that first part, it seems. 😂 😭

I'm really struggling to debug or figure out what's up because, to be honest, every test I've written and every visualization runs as expected. There's only one suspicious thing relative to the spec, but I really don't think it's that suspicious, and people on the Ed seem to agree with me (no staff confirmation though, so perhaps we're all wrong). Luckily I still have the model files from the run above, but I've never been able to train a model that good again. Somehow, after I tried breaking things, I've been getting high PSNRs but weird-looking outputs. I really need a more systematic way to debug and visualize what's happening here.

After watching a few Instagram reels, I came across one about taking advice from water or something, so I took a shower, and woah: after coming back, I rewrote my code in a Jupyter notebook and it's kind of working (and, to my delight, not using up a crazy amount of memory on my computer). I actually think reducing the batch size to 5000 helps, although this is mainly a hunch. That's how I did it the first time, but I switched to 10000 on subsequent runs based on the spec and because I thought it would boost performance by decreasing the variance of the stochastic gradient descent... or something. I now think 5000 actually helps because of memory limitations, and possibly because of how PyTorch handles memory fragmentation or something related to that, but in any case, I've gotten it to work again.
This time I also have training loss and PSNR curves. Unfortunately, I don't have the same for validation on this local version; it would have been too costly in both time and GPU memory (I would have had to split the validation work into chunks instead of one huge batch, which would take a while to process). It was probably doable, but since I'm still figuring out how to make my model training consistent, I didn't do it. Here are my training loss and PSNR:
And finally, here's a simple graph of memory usage over time (kind of vague, but it roughly corresponds to where I collected it in the "Becoming Memory Efficient" section below). The y-axis is in GB.

I made this plot to show a weird aspect of using PyTorch on MPS: for some reason, it doesn't seem to free up memory as readily as it does on CUDA GPUs. For example, see these plots, which I got from Kaggle:

You'll notice there are two curves: the orange is "reserved" memory and the blue is "allocated" memory. For some reason, PyTorch only exposes the allocated memory on MPS, whereas it can show both on CUDA. A little sus, but hey, I still love PyTorch. The reason you see higher reserved memory is that I ran a validation step in Kaggle; I didn't record its usage in the blue curve, but it's what causes the spike in "reserved" memory. Essentially, PyTorch doesn't really "free up" all the memory it deallocates; it keeps it reserved for faster reuse later, which makes things faster... unless you don't have access to a lot of memory, in which case it kind of makes things slower (at least, that's what I've been noticing). The ups and downs in the CUDA plots happen because memory gets deallocated (but stays reserved) at the end of each training epoch in my training loop.
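For reference, the print_memory_usage calls in the training loop below boil down to something like the following sketch. This isn't my exact code, just the idea; the MPS branch assumes a PyTorch version that exposes torch.mps.current_allocated_memory (which, as mentioned above, is the only number MPS gives you).

import torch

GB = 1024 ** 3

def print_memory_usage(tag: str) -> None:
    # Print GPU memory stats at a labeled point in the training loop.
    if torch.cuda.is_available():
        alloc = torch.cuda.memory_allocated() / GB
        reserved = torch.cuda.memory_reserved() / GB
        print(f"Memory at {tag}: Allocated = {alloc:.3f} GB, Reserved = {reserved:.3f} GB")
    elif torch.backends.mps.is_available():
        # MPS only exposes an allocated number, no reserved breakdown
        alloc = torch.mps.current_allocated_memory() / GB
        print(f"Memory at {tag}: Allocated = {alloc:.3f} GB")
    else:
        print(f"Memory at {tag}: no GPU backend available")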
7. Becoming Memory Efficient

I think the first step to becoming memory efficient is profiling. Since everything was quite slow and glitchy on my laptop, I used Google Colab to print the reserved and allocated GPU memory at each step of the loop. My training loop was:

for epoch in range(EPOCHS):
  # TRAINING
  model.train()
  
  print_memory_usage(f"Prior to Epoch {epoch+1}")
  r_o, r_d, pixels = train_dataset.sample_rays(BATCH_SIZE)
  x = sample_along_rays(r_o, r_d, perturb=True, n_samples=N_SAMPLES)

  r_d_expanded = np.repeat(r_d[:,np.newaxis,:], N_SAMPLES, axis=1)

  x = x.astype(np.float32); r_d_expanded = r_d_expanded.astype(np.float32); 
  pixels = pixels.astype(np.float32) # pixel is great name
  X = torch.from_numpy(x); D = torch.from_numpy(r_d_expanded); P = torch.from_numpy(pixels)
  print_memory_usage(f"Right after torch.from_numpy")
  
  X = X.to(device); D = D.to(device); P = P.to(device)
  print_memory_usage(f"Right after .to(device)")
  
  density, rgb = model(X, D)
  print_memory_usage(f"Right after calling model")
  
  P_pred = volrend(density, rgb, N_SAMPLES)
  print_memory_usage(f"Right after volume render")
  
  l = loss(P_pred, P)
  print_memory_usage(f"Right after evaluating loss")

  optimizer.zero_grad()
  print_memory_usage(f"Right after zero_grad")

  l.backward()
  print_memory_usage(f"Right after backward")

  optimizer.step()
  print_memory_usage(f"Right after optimizer.step()")

  optimizer.zero_grad()
  print_memory_usage(f"Right after zero grad")

  t_losses.append(l.item())
  t_psnrs.append(psnr(l).item())

  if epoch % 5 == 4:
    # VALIDATION
    model.eval()
    i = random.randint(0, 9) # pick a random image

    r_o = val_dataset.rays_o[i*40000:(i+1)*40000]
    r_d = val_dataset.rays_d[i*40000:(i+1)*40000]
    pixels = val_dataset.pixels[i*40000:(i+1)*40000]
    x = sample_along_rays(r_o, r_d, perturb=True, n_samples=N_SAMPLES)
    r_d_expanded = np.repeat(r_d[:,np.newaxis,:], N_SAMPLES, axis=1)

    x = x.astype(np.float32); r_d_expanded = r_d_expanded.astype(np.float32); 
    pixels = pixels.astype(np.float32) # pixel is great name
    X = torch.from_numpy(x); D = torch.from_numpy(r_d_expanded); P = torch.from_numpy(pixels)
    X = X.to(device); D = D.to(device); P = P.to(device)

    density, rgb = model(X, D)
    P_pred = volrend(density, rgb, N_SAMPLES)

    val_l = loss(P_pred, P)
    P = P.detach()
    P_pred = P_pred.detach()
    v_losses.append(val_l.item())
    v_psnrs.append(psnr(val_l).item())
    optimizer.zero_grad()

    print(f"Epoch {epoch+1}, Loss: {t_losses[-1]}, psnr: {t_psnrs[-1]}, Validation Loss: {v_losses[-1]}, 
          psnr: {v_psnrs[-1]}", "Threads used:", torch.get_num_threads())

  if epoch % 50 == 49:
    # concatenate predicted and ground-truth pixels and save them as one image
    image = torch.cat((P_pred, P)).reshape((200, 400, 3)).detach().cpu().numpy()
    cv2.imwrite(f'lego-truck-reconstruction-epoch-{epoch+1}.png', 255*image)
and the output was (sorry for how verbose this is):

Memory at Prior to Epoch 1: Allocated = 0.029 GB, Reserved = 2.403 GB
Memory at Right after torch.from_numpy: Allocated = 0.029 GB, Reserved = 2.403 GB
Memory at Right after .to(device): Allocated = 0.033 GB, Reserved = 2.403 GB
Memory at Right after calling model: Allocated = 1.690 GB, Reserved = 2.403 GB
Memory at Right after volume render: Allocated = 1.693 GB, Reserved = 2.405 GB
Memory at Right after evaluating loss: Allocated = 1.693 GB, Reserved = 2.405 GB
Memory at Right after zero_grad: Allocated = 1.693 GB, Reserved = 2.405 GB
Memory at Right after backward: Allocated = 0.038 GB, Reserved = 2.405 GB
Memory at Right after optimizer.step(): Allocated = 0.038 GB, Reserved = 2.405 GB
Memory at Right after zero grad: Allocated = 0.036 GB, Reserved = 2.405 GB
Memory at Prior to Epoch 2: Allocated = 0.029 GB, Reserved = 2.405 GB
Memory at Right after torch.from_numpy: Allocated = 0.029 GB, Reserved = 2.405 GB
Memory at Right after .to(device): Allocated = 0.034 GB, Reserved = 2.405 GB
Memory at Right after calling model: Allocated = 1.691 GB, Reserved = 2.405 GB
Memory at Right after volume render: Allocated = 1.694 GB, Reserved = 2.405 GB
Memory at Right after evaluating loss: Allocated = 1.694 GB, Reserved = 2.405 GB
Memory at Right after zero_grad: Allocated = 1.694 GB, Reserved = 2.405 GB
Memory at Right after backward: Allocated = 0.039 GB, Reserved = 2.405 GB
Memory at Right after optimizer.step(): Allocated = 0.039 GB, Reserved = 2.405 GB
Memory at Right after zero grad: Allocated = 0.037 GB, Reserved = 2.405 GB
Memory at Prior to Epoch 3: Allocated = 0.030 GB, Reserved = 2.405 GB
Memory at Right after torch.from_numpy: Allocated = 0.030 GB, Reserved = 2.405 GB
Memory at Right after .to(device): Allocated = 0.034 GB, Reserved = 2.405 GB
Memory at Right after calling model: Allocated = 1.691 GB, Reserved = 2.405 GB
Memory at Right after volume render: Allocated = 1.694 GB, Reserved = 2.405 GB
Memory at Right after evaluating loss: Allocated = 1.694 GB, Reserved = 2.405 GB
Memory at Right after zero_grad: Allocated = 1.694 GB, Reserved = 2.405 GB
Memory at Right after backward: Allocated = 0.039 GB, Reserved = 2.405 GB
Memory at Right after optimizer.step(): Allocated = 0.039 GB, Reserved = 2.405 GB
Memory at Right after zero grad: Allocated = 0.037 GB, Reserved = 2.405 GB
Memory at Prior to Epoch 4: Allocated = 0.037 GB, Reserved = 2.405 GB
Memory at Right after torch.from_numpy: Allocated = 0.033 GB, Reserved = 2.405 GB
Memory at Right after .to(device): Allocated = 0.037 GB, Reserved = 2.405 GB
Memory at Right after calling model: Allocated = 1.691 GB, Reserved = 2.405 GB
Memory at Right after volume render: Allocated = 1.694 GB, Reserved = 2.405 GB
Memory at Right after evaluating loss: Allocated = 1.694 GB, Reserved = 2.405 GB
Memory at Right after zero_grad: Allocated = 1.694 GB, Reserved = 2.405 GB
Memory at Right after backward: Allocated = 0.039 GB, Reserved = 2.405 GB
Memory at Right after optimizer.step(): Allocated = 0.039 GB, Reserved = 2.405 GB
Memory at Right after zero grad: Allocated = 0.037 GB, Reserved = 2.405 GB
Memory at Prior to Epoch 5: Allocated = 0.029 GB, Reserved = 2.405 GB
Memory at Right after torch.from_numpy: Allocated = 0.029 GB, Reserved = 2.405 GB
Memory at Right after .to(device): Allocated = 0.033 GB, Reserved = 2.405 GB
Memory at Right after calling model: Allocated = 1.690 GB, Reserved = 2.405 GB
Memory at Right after volume render: Allocated = 1.693 GB, Reserved = 2.405 GB
Memory at Right after evaluating loss: Allocated = 1.693 GB, Reserved = 2.405 GB
Memory at Right after zero_grad: Allocated = 1.693 GB, Reserved = 2.405 GB
Memory at Right after backward: Allocated = 0.038 GB, Reserved = 2.405 GB
Memory at Right after optimizer.step(): Allocated = 0.038 GB, Reserved = 2.405 GB
Memory at Right after zero grad: Allocated = 0.036 GB, Reserved = 2.405 GB
Epoch 5, Loss: 0.17976847290992737, psnr: 7.452865123748779, 
      Validation Loss: 0.1619608849287033, psnr: 7.9058990478515625 Threads used: 1
Memory at Prior to Epoch 6: Allocated = 13.341 GB, Reserved = 14.152 GB
Memory at Right after torch.from_numpy: Allocated = 13.341 GB, Reserved = 14.152 GB
Memory at Right after .to(device): Allocated = 13.346 GB, Reserved = 14.152 GB
Memory at Right after calling model: Allocated = 15.003 GB, Reserved = 15.370 GB
Memory at Right after volume render: Allocated = 15.006 GB, Reserved = 15.372 GB
Memory at Right after evaluating loss: Allocated = 15.006 GB, Reserved = 15.372 GB
Memory at Right after zero_grad: Allocated = 15.006 GB, Reserved = 15.372 GB
Memory at Right after backward: Allocated = 13.350 GB, Reserved = 14.896 GB
Memory at Right after optimizer.step(): Allocated = 13.350 GB, Reserved = 14.898 GB
Memory at Right after zero grad: Allocated = 13.348 GB, Reserved = 14.898 GB
Memory at Prior to Epoch 7: Allocated = 13.315 GB, Reserved = 14.898 GB
Memory at Right after torch.from_numpy: Allocated = 13.315 GB, Reserved = 14.898 GB
Memory at Right after .to(device): Allocated = 13.319 GB, Reserved = 14.898 GB
Memory at Right after calling model: Allocated = 14.976 GB, Reserved = 15.412 GB
Memory at Right after volume render: Allocated = 14.979 GB, Reserved = 15.412 GB
Memory at Right after evaluating loss: Allocated = 14.979 GB, Reserved = 15.412 GB
Memory at Right after zero_grad: Allocated = 14.979 GB, Reserved = 15.412 GB
Memory at Right after backward: Allocated = 13.324 GB, Reserved = 15.594 GB
Memory at Right after optimizer.step(): Allocated = 13.324 GB, Reserved = 15.594 GB
Memory at Right after zero grad: Allocated = 13.322 GB, Reserved = 15.594 GB
Memory at Prior to Epoch 8: Allocated = 13.322 GB, Reserved = 15.594 GB
Memory at Right after torch.from_numpy: Allocated = 13.318 GB, Reserved = 15.594 GB
Memory at Right after .to(device): Allocated = 13.322 GB, Reserved = 15.594 GB
Memory at Right after calling model: Allocated = 14.976 GB, Reserved = 15.594 GB
Memory at Right after volume render: Allocated = 14.979 GB, Reserved = 15.594 GB
Memory at Right after evaluating loss: Allocated = 14.979 GB, Reserved = 15.594 GB
Memory at Right after zero_grad: Allocated = 14.979 GB, Reserved = 15.594 GB
Memory at Right after backward: Allocated = 13.324 GB, Reserved = 15.594 GB
Memory at Right after optimizer.step(): Allocated = 13.324 GB, Reserved = 15.594 GB
Memory at Right after zero grad: Allocated = 13.322 GB, Reserved = 15.594 GB
Memory at Prior to Epoch 9: Allocated = 13.314 GB, Reserved = 15.594 GB
Memory at Right after torch.from_numpy: Allocated = 13.314 GB, Reserved = 15.594 GB
Memory at Right after .to(device): Allocated = 13.318 GB, Reserved = 15.594 GB
Memory at Right after calling model: Allocated = 14.975 GB, Reserved = 15.594 GB
Memory at Right after volume render: Allocated = 14.978 GB, Reserved = 15.594 GB
Memory at Right after evaluating loss: Allocated = 14.978 GB, Reserved = 15.594 GB
Memory at Right after zero_grad: Allocated = 14.978 GB, Reserved = 15.594 GB
Memory at Right after backward: Allocated = 13.323 GB, Reserved = 15.594 GB
Memory at Right after optimizer.step(): Allocated = 13.323 GB, Reserved = 15.594 GB
Memory at Right after zero grad: Allocated = 13.321 GB, Reserved = 15.594 GB
Memory at Prior to Epoch 10: Allocated = 13.314 GB, Reserved = 15.594 GB
Memory at Right after torch.from_numpy: Allocated = 13.314 GB, Reserved = 15.594 GB
Memory at Right after .to(device): Allocated = 13.319 GB, Reserved = 15.594 GB
Memory at Right after calling model: Allocated = 14.976 GB, Reserved = 15.594 GB
Memory at Right after volume render: Allocated = 14.979 GB, Reserved = 15.594 GB
Memory at Right after evaluating loss: Allocated = 14.979 GB, Reserved = 15.594 GB
Memory at Right after zero_grad: Allocated = 14.979 GB, Reserved = 15.594 GB
Memory at Right after backward: Allocated = 13.324 GB, Reserved = 15.594 GB
Memory at Right after optimizer.step(): Allocated = 13.324 GB, Reserved = 15.594 GB
Memory at Right after zero grad: Allocated = 13.322 GB, Reserved = 15.594 GB
---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
 in ()
     52         X = X.to(device); D = D.to(device); P = P.to(device)
     53 
---> 54         density, rgb = model(X, D)
     55         P_pred = volrend(density, rgb, N_SAMPLES)
     56 

9 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in relu(input, inplace)
   1702         result = torch.relu_(input)
   1703     else:
-> 1704         result = torch.relu(input)
   1705     return result
   1706 

OutOfMemoryError: CUDA out of memory. Tried to allocate 1.22 GiB...
Looking through the above, I'm actually doing fine on memory until validation time! At that point, we suddenly throw a ton of stuff onto the GPU, and it never gets freed (perhaps because I never call .backward() on the validation loss). To fix this, I tried wrapping the validation step in torch.no_grad() to prevent the loss from holding onto all the intermediate activations needed for gradients.
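Concretely, the validation block from the loop above becomes roughly the following (same variable names as in the loop; only the forward pass changes):

# VALIDATION (sketch) -- the forward pass is wrapped in torch.no_grad()
# so no autograd graph is built and activations can be freed immediately
model.eval()
with torch.no_grad():
    density, rgb = model(X, D)
    P_pred = volrend(density, rgb, N_SAMPLES)
    val_l = loss(P_pred, P)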

This seemed to work on CUDA and Google Colab in terms of memory usage, but locally I was still seeing a jump of around 15-20 GB in allocations right before validation, which was significantly slowing down my training. Even stranger, Google Colab was outputting completely black images, whereas my local version at least produced some kind of structure. Until I could figure out more, I decided to move validation to happen just once after everything else. My guess was that (since torch only has MPS support for a single memory number, rather than a breakdown of allocated and reserved) the issue was actually with memory reservation locally, and that was slowing everything down; it seemed that only after the next five iterations did the displayed memory usage come back down. Very strange to me. This change reduced the local running time five-fold (which makes sense).

Still, I was having major issues because my PSNR wasn't going up as fast as the staff solution (it was still a vague cloud even after 1000 iterations). I'll continue this in the next section.
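As a quick reference, the psnr() used for these curves is just a conversion from the MSE loss; a minimal sketch, assuming pixel values normalized to [0, 1] (so the peak signal is 1):

import torch

def psnr(mse: torch.Tensor) -> torch.Tensor:
    # PSNR = 10 * log10(MAX^2 / MSE); with pixels in [0, 1], MAX = 1
    return 10.0 * torch.log10(1.0 / mse)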
6. Network Implementation and Volume Rendering

I was able to implement the volrend function without too much pain. The main issue was that I shifted T over to accommodate the fact that the sum for T_i goes up to i − 1 rather than i, but I filled it in with zeros instead of ones (you need ones because e^0 = 1, and I was concatenating after taking torch.exp). Seeing tests pass feels awesome:
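As an aside, the transmittance shift described above looks roughly like this in code. This is a sketch rather than my exact implementation (the shapes and the per-sample step size argument are assumptions):

import torch

def volrend_sketch(density: torch.Tensor, rgb: torch.Tensor, step_size: float) -> torch.Tensor:
    # density: (rays, samples, 1), rgb: (rays, samples, 3)
    alpha = 1.0 - torch.exp(-density * step_size)             # per-sample opacity
    # Transmittance T_i = exp(-sum_{j<i} sigma_j * delta_j): take exp of the
    # cumulative sum, shift right by one, and pad the front with ONES (e^0 = 1),
    # not zeros -- which is exactly the bug described above.
    T = torch.exp(-torch.cumsum(density * step_size, dim=1))
    T = torch.cat([torch.ones_like(T[:, :1]), T[:, :-1]], dim=1)
    weights = T * alpha                                        # (rays, samples, 1)
    return (weights * rgb).sum(dim=1)                          # rendered color per ray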
I also implemented the network, and my preliminary results were... well, difficult. I don't think my memory usage was great, and to be honest, I wasn't sure why. Based on some rough computations I did, my model should have taken far less than a GB per run. But the memory usage on both my computer and Google Colab was mind-boggling, and my computer started glitching like crazy. I decided I needed to get to the bottom of this and dedicated a section to it (see "Becoming Memory Efficient" above).
5. Sampling

I had many things to debug here, starting with, surprisingly, camera_to_world! I was a bit silly and actually used w2c in this method instead of c2w. My tests weren't able to catch this because, regardless of the matrix, A⁻¹Ax = x... so maybe the test was a bit silly.

After that, I had this other issue:
If you look closely, the rays only go through the bottom-right quadrant of the image, and three-fourths of them don't go through the image at all. I realized it was because I used K = np.diag([focal, focal, 1]) instead of the full intrinsic matrix including the principal point. After that, I got to this delightful image:
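For context, here's a sketch of the intrinsic matrix I should have been using, assuming the principal point sits at the image center:

import numpy as np

def intrinsic_matrix(focal: float, width: int, height: int) -> np.ndarray:
    # Full K: np.diag([focal, focal, 1]) implicitly puts the principal point at
    # (0, 0), which is why all the rays went through one corner of the image.
    return np.array([
        [focal, 0.0,   width  / 2.0],
        [0.0,   focal, height / 2.0],
        [0.0,   0.0,   1.0],
    ])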
4. Creating Rays from Camera

For the first part, implementing the camera_to_world function was fairly easy with numpy, and I simply tested it with the identity matrix to make sure it would work for all vectors in R^4. An important note: while the spec said to test x == camera_to_world(c2w.inv(), camera_to_world(c2w, x)), it made more sense to use np.isclose (or to manually check for floating-point error) instead of exact equality.
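Here's a rough sketch of that round-trip check, assuming points are stored as (N, 4) homogeneous row vectors (my actual helper may differ slightly):

import numpy as np

def camera_to_world(c2w: np.ndarray, x_c: np.ndarray) -> np.ndarray:
    # Apply a 4x4 camera-to-world transform to (N, 4) homogeneous points.
    return x_c @ c2w.T

# world -> camera -> world should recover x up to floating point error,
# so compare with np.allclose rather than ==.
c2w = np.eye(4)
c2w[:3, 3] = [1.0, 2.0, 3.0]                          # simple non-identity transform
x = np.hstack([np.random.rand(8, 3), np.ones((8, 1))])  # homogeneous points (w = 1)
x_roundtrip = camera_to_world(np.linalg.inv(c2w), camera_to_world(c2w, x))
assert np.allclose(x, x_roundtrip)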

I also wrote tests for pixel_to_camera, but didn't for pixel_to_ray to save time.
3. Starting NeRFs, Playing With Data

Before starting Part 2.1, I got this nice visualization of the cameras working with plotly, to make sure I was comfortable with the data:
It's far from perfect, but I'd rather move to part 2.1 for now than continue to make this visualization pretty.
2. Neural Fields, PyTorch Segfault Disasters, and White Screens

I started with the neural fields. In particular, I'm building the architecture mentioned on the CS 180 website. One issue that has been quite painful: even though my custom torch Dataset seems fine to me, creating a dataloader like dataloader = DataLoader(img_data, batch_size=N_SAMPLE, shuffle=True) has been producing segfault after segfault... and debugging hasn't helped. To avoid spending too much time on it, I decided to just load the data myself.
Doing it my way ended up looking like this:

for epoch in range(EPOCHS):
    random_indices = random.sample(list(range(len(img_data))), BATCH_SIZE)
    X, Y = img_data[random_indices]
    print(X.shape)
    X = torch.from_numpy(X)
    print(X.shape, X.device)
    break
which outputted:

(10000, 2)
torch.Size([10000, 2]) cpu
as one would expect. Based on this output, I think there should be no need to unsqueeze or do any funny business like that.
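For context, here's a minimal sketch of a dataset that supports that kind of list indexing, assuming the image is stored as flattened coordinate/color arrays (this also folds in the [0, 1] normalization I only figured out later, see below):

import numpy as np
from torch.utils.data import Dataset

class ImageFieldDataset(Dataset):
    # Flattened (x, y) -> RGB pairs stored as numpy arrays, so indexing with a
    # list of indices returns a whole batch at once (no DataLoader needed).
    def __init__(self, image: np.ndarray):
        h, w, _ = image.shape
        ys, xs = np.mgrid[0:h, 0:w]
        self.coords = np.stack([xs.ravel() / w, ys.ravel() / h], axis=1).astype(np.float32)
        self.colors = (image.reshape(-1, 3) / 255.0).astype(np.float32)

    def __len__(self):
        return len(self.coords)

    def __getitem__(self, idx):
        return self.coords[idx], self.colors[idx]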

Another strange thing happened when I updated my positional encoding class to work with batches. My forward initially looked something like this:
def forward(self, x: torch.Tensor) -> torch.Tensor:
    print("here")
    batch_size, seq_len = x.shape
    print(batch_size, seq_len)

    temp = 2 * self.L + 1
    print("here7", [batch_size, seq_len * temp])
    y = torch.zeros([batch_size, seq_len * temp], device="cpu")

    print("here7")
    for i in range(batch_size):
        for j in range(seq_len):
            idx = j * temp
            y[i, idx] = x[i, j]
            print(idx)
            for k in range(self.L):
                factor = 2**k * math.pi * x[i, j]
                y[i, idx + 2*k + 1] = torch.sin(factor)
                y[i, idx + 2*k + 2] = torch.cos(factor)
    print("here5")
    return y
            
which didn't really work (it caused a segfault), but changing the `torch.zeros` to `np.zeros` and then returning `torch.from_numpy(y)` magically worked... I guess because of memory allocation issues. I still don't know exactly why, or how to initialize with `torch` from the start, but this seemed like a more memory-stable solution for now, so I went with it.
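For comparison, here's a vectorized sketch of the same encoding without the per-element Python loops (the feature ordering differs from my loop version above, but that shouldn't matter to the network):

import math
import torch

class PositionalEncoding(torch.nn.Module):
    # Maps x to [x, sin(2^0 pi x), cos(2^0 pi x), ..., sin(2^{L-1} pi x), cos(2^{L-1} pi x)]
    # per coordinate, giving an output of shape (batch, dim * (2L + 1)).
    def __init__(self, L: int):
        super().__init__()
        self.L = L

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = [x]
        for k in range(self.L):
            out.append(torch.sin(2**k * math.pi * x))
            out.append(torch.cos(2**k * math.pi * x))
        return torch.cat(out, dim=-1)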

After this, I was still getting a segfault! But this time, it was because the batch size was too large; 10k is definitely too large of a batch size for some computers (I have a MacBook Pro with an M2 chip).

At this point, I was feeling like I finally deserved my "I'm the computer fairy, I make the segfaults go away!" sticker... but then I got ANOTHER segfault. 🥲 This time, it was in `optimizer.step()`. I must have set something up weirdly; I've never had this many issues setting up a simple neural network. Then again, I usually don't have to make my own custom dataset; someone usually does that for me. I think I either messed up something in the custom dataset or in the positional encodings. Anyhow, I needed to debug it, so I used the following snippet:
for param in model.parameters():
    print("hi", param)
    print("grad", param.grad)
    if param.grad is not None:
        if torch.isnan(param.grad).any() or torch.isinf(param.grad).any():
            print("NaN or Inf in gradients!")
            break
        else:
            print(param.grad)
            
which also segfaulted, on the third "hi" and "grad". By visual inspection, I didn't notice anything funky about the parameters or their grads. With some more print debugging, I realized the segfault was coming from the torch.isnan(...).any() and torch.isinf(...).any() calls. At this point, I also thought about it and realized... my positional encodings don't involve any model parameters, so they definitely shouldn't be the reason for the segfault, considering the forward call worked.

After debugging into the adam.py file, I found that the issue was actually the line state['exp_avg'] = torch.zeros_like(p, memory_format=torch.preserve_format). I should've known... more memory allocation issues. At this point, I decided I should probably figure out what was going on. I temporarily replaced that line with numpy computations, but it still wasn't working. Me, thinking I could debug this with print statements (look at the terminal, lol):
Wow... now I've discovered that "lerp" isn't working.
In the end... it was just a `pip install --upgrade torch torchvision torchaudio` that fixed my problems! Crazy... but now I can call myself the computer fairy. :-) After this, using the dataloader also worked.

After the segfault fiasco, I noticed that my model was suffering severely from vanishing gradients, so I put in some batchnorms. Still, nothing was working, and since the gradients seemed small even in the earlier layers, I decided to get rid of the batchnorm and make sure my positional encodings were correct. Printing them out for [[1.0, 1.0]], I got:

Example Positional Encoding:
tensor([[ 1.0000e+00,  1.0000e+00, -8.7423e-08,  1.7485e-07,  3.4969e-07,
          6.9938e-07,  1.3988e-06,  2.7975e-06,  5.5951e-06,  1.1190e-05,
          2.2380e-05,  4.4760e-05, -1.0000e+00,  1.0000e+00,  1.0000e+00,
          1.0000e+00,  1.0000e+00,  1.0000e+00,  1.0000e+00,  1.0000e+00,
          1.0000e+00,  1.0000e+00, -8.7423e-08,  1.7485e-07,  3.4969e-07,
          6.9938e-07,  1.3988e-06,  2.7975e-06,  5.5951e-06,  1.1190e-05,
          2.2380e-05,  4.4760e-05, -1.0000e+00,  1.0000e+00,  1.0000e+00,
          1.0000e+00,  1.0000e+00,  1.0000e+00,  1.0000e+00,  1.0000e+00,
          1.0000e+00,  1.0000e+00]])
            
After I looked at this output and thought about it, I realized something sus... it makes no sense to do positional encodings this way, because the input x will always be an integer, so sin(2^k π x) and cos(2^k π x) will always be very close to −1, 0, or 1.... That doesn't seem very useful! That's when I noticed the sentence in the spec: "You would want to normalize both the coordinates (x = x / image_width, y = y / image_height) and the colors (rgbs = rgbs / 255.0) to make them within the range of [0, 1]." So I need to normalize the coordinates too... at least now I know why! :-) I guess I'll have to read more carefully from now on.

Even after fixing this, though, the gradients were still basically zero, and after a handful of epochs I noticed that the parameters just weren't changing, to something like 15 significant figures, between training iterations. So I added back the batch norms, which fixed that. On the other hand, I started getting these weird images (in epochs 1, 61, 101, and 161):
These look kind of pretty... or like a broken computer screen, but they're not quite what we want. I noticed that the signs of the average gradients for each parameter were alternating a lot between epochs, so I thought I'd play with the learning rate a bit. But after increasing the learning rate a lot, the gradients eventually went to zero. I was a bit self-conscious about adding the batchnorms when the architecture on the website didn't say they were necessary, so I removed them, but I got the same issue. And it always converged to a white screen. Mysterious.

Interestingly, switching back from using dataloader to doing:

random_indices = random.sample(list(range(len(img_data))), BATCH_SIZE)
X, Y = img_data[random_indices]
X = torch.from_numpy(X); Y = torch.from_numpy(Y)
            
completely changed my results!
Still a bit cooked, though.... But these were definitely prettier. Some more from different iterations:
Given how pretty these were, I started saving them as videos. After scrolling through Ed (somehow there weren't thaaat many people with problems at this stage, so I was like, hmm... what am I doing so wrong?), I noticed someone mention their final loss was close to 80-ish (scrolling back, I somehow can't find it again, but I feel like I saw something like that). That said, someone else mentioned losses around 0.1, but I figured using 0-255 pixel values would change the scale of the loss (and the effective step size) by a huge factor, so I decided to try that. I got this loss (using MSE):
Beautiful. But guess what video I got (for the generated image over time):
It turned out, though, that the error was just from how I was building the dataset class / loading in the data....
With all of that trouble out of the way, I was finally able to get some meaningful results (as I would find out later though, there was still one more major issue).
Here are some notes from my hyperparameter tuning (note: I was using BatchNorms between layers here because without them my results were very poor, or so I thought):
| Experiment | L | LR | Other Parameters | Training Time (s)* | Loss | PSNR | Notes |
|---|---|---|---|---|---|---|---|
| 6 | 5 | 0.01 | 3000 Epochs, 10k Batch, 3 linear layers, Hidden Dimension 256 | 231.2 | 0.00296 | 25.293 | |
| 1 | 10 | " | " | 236.5 | 0.00163 | 27.867 | Spec Params |
| 2 | 15 | " | " | 244.7 | 0.00167 | 27.776 | |
| 3 | 25 | " | " | 278.9 | 0.00152 | 28.189 | |
| 4 | 50 | " | " | 331.6 | 0.00129 | 28.907 | |
| 5 | 100 | " | " | 384.8 | 0.00162 | 27.903 | |
| 13 | 50 | 0.001 | " | 307.6 | 0.00144 | 28.405 | |
| 12 | " | 0.002 | " | 296.5 | 0.00131 | 28.820 | |
| 11 | " | 0.005 | " | 310.3 | 0.00117 | 29.323 | |
| 17 | " | 0.006 | " | 344.3 | 0.00120 | 29.191 | |
| 14 | " | 0.007 | " | 289.1 | 0.00120 | 29.192 | |
| 15 | " | 0.008 | " | 339.1 | 0.00132 | 28.798 | |
| 16 | " | 0.009 | " | 333.7 | 0.00121 | 29.187 | |
| 7 | " | 0.02 | " | 302.6 | 0.00156 | 28.056 | |
| 8 | " | 0.05 | " | 298.9 | 0.00242 | 26.166 | |
| 9 | " | 0.1 | " | 298.8 | 0.00453 | 23.444 | |
| 10 | " | 1 | " | 290.7 | 0.218 | 6.621 | |
| 17 | " | 0.005 | 3000 Epochs, 10k Batch, 7 linear layers, Hidden Dimension 256 | 503.2 | 0.000903 | 30.444 | |
| 18 | " | 0.01 | 3000 Epochs, 10k Batch, 7 linear layers, Hidden Dimension 1024 | 1004.2 | 0.000726 | 31.390 | |
| 21 | " | 0.0001 | 3000 Epochs, 10k Batch, 7 linear layers, Hidden Dimension 512, only one BatchNorm after the first linear layer | 800.2 | 0.00153 | 28.144 | |
| 19 | " | 0.005 | 3000 Epochs, 10k Batch, 7 linear layers, Hidden Dimension 1024, only one BatchNorm after the first linear layer | 788.2 | 0.000696 | 31.574 | |
| 20 | " | 0.01 | " | 896.1 | 0.213 | 6.713 | |
| 22 | " | 0.001 | 3000 Epochs, 10k Batch, 12 linear layers, Hidden Dimension 512, NO BatchNorm | 1699.7 | 0.000591 | 32.283 | TRIUMPH! |

A " entry means "same as the row above."
* Training includes some time for saving checkpoints.
After (more than) 22 experiments, it finally worked... AFTER I REMOVED THE BATCHNORMS.... OMG... if I had just trusted the instructions from the beginning and followed them to a tee, I would've gotten it. I don't know why I tried to be fancy. I have learned my lesson. There are still many things I don't understand, though, like how my losses and PSNR with and without the batch norms look pretty similar, yet without the batchnorms I get coherent images and with the batchnorms my images just look blurry!! I've placed my final video in my gallery.

After all of this, I decided to redo my hyperparameter search without the batchnorms, which I've included in my report for this section.
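For reference, here is a rough sketch of what the winning configuration (experiment 22: 12 linear layers, hidden dimension 512, no BatchNorm) looks like as a module. The ReLU/Sigmoid choices follow the spec's architecture, and the input dimension is just whatever my positional encoding produces; this is an illustration, not a copy of my exact code.

import torch.nn as nn

def make_neural_field(L: int = 50, hidden: int = 512, n_layers: int = 12) -> nn.Sequential:
    # MLP from positionally-encoded (x, y) to RGB, with no BatchNorm anywhere.
    in_dim = 2 * (2 * L + 1)                      # each coordinate -> itself + L sin/cos pairs
    layers, dim = [], in_dim
    for _ in range(n_layers - 1):
        layers += [nn.Linear(dim, hidden), nn.ReLU()]
        dim = hidden
    layers += [nn.Linear(dim, 3), nn.Sigmoid()]   # RGB in [0, 1]
    return nn.Sequential(*layers)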
1. Started Project

You can use these dropdown arrows to check out what happened at each step!