CIS565-Fall-2018 · LanLou123 · Oct 15, 2018
diff --git a/README.md b/README.md
@@ -1,19 +1,83 @@
 CUDA Rasterizer
 ===============
 
-[CLICK ME FOR INSTRUCTION OF THIS PROJECT](./INSTRUCTION.md)
+### Sample Rasterization
 
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
+#### ```resolution```: 900x900 ```GLTF model```: cesiummilktruck ```shader```: Blinn-Phong, perspective-corrected, bilinear-textured
+![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/truck.gif)
 
-* (TODO) YOUR NAME HERE
-	* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+### Introduction
 
-### (TODO: Your README)
+  Rasterization is an efficient rendering technique widely used in computer graphics, especially in games. Like ray tracing and path tracing, it does one basic thing: it turns a 3D scene into a 2D image on the screen.
+  Unlike ray tracing or path tracing, however, rasterization does not follow light through further interactions with the geometry. It only determines what is directly visible at each screen pixel, records the color, depth, specular, and similar values there, and shades the scene from those. As a consequence, rasterization is much more efficient, but realistic results are harder to achieve.
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+### Features:
 
+- Basic features:
+  - Vertex shading
+  - Primitive assembly 
+  - Rasterization
+  - Fragment shading
+  - A depth buffer for storing and depth testing fragments
+  - Fragment-to-depth-buffer writing (with atomics for race avoidance)
+  - Simple lighting scheme with toggleable Lambert and Blinn-Phong shading
+- Extra:
+  - UV texture mapping with bilinear texture filtering and perspective-correct texture coordinates
+  - Toggleable support for rasterizing additional primitives: lines and points
+  - Correct color interpolation across a primitive
+  - Attempted SSAO, but the result is not accurate
+
+## Debug view:
+
+albedo buffer|depth buffer|
+------------|--------
+![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/diffuse.gif) | ![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/depth.gif)  
+
+normal buffer|specular buffer|
+------------|--------
+![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/normal.gif) | ![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/spec.gif)
+
+#### Combined:
+![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/res.gif)
+
+## Support for other primitives:
+
+point|line|triangle
+-----|----|-----
+![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/p.gif) | ![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/line.gif) | ![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/lamm.gif)
+
+
+
+
+## Perspective-correct texture coordinates
+
+In the following two comparisons, the right-hand image shows what we get when we use simple linear interpolation to acquire values such as normal, albedo, and depth. The result is clearly wrong and looks weird. The reason is that this interpolation uses only the barycentric weights and the per-vertex values we want to interpolate; it does not take depth (z) into account, which is essential for correctly mapping 3D data onto the 2D screen (depth information must not be lost). What we should do instead is use both the barycentric weights and the z values to compute the result.
+
+corrected duck|not-corrected duck
+-----|----
+![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/yes.gif) | ![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/non.gif)
+
+corrected checkerboard|not-corrected checkerboard
+-----|----
+![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/perspcorrect.JPG) | ![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/notcorrected.JPG)
+
+## Correct color interpolation between points on a primitive
+
+![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/colorinterp.gif)
+
+# Performance analysis
+
+## Breakdown of pipeline time consumption:
+
+![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/imgg.JPG)
+
+As the graph above shows, the vertex transform costs almost the same amount of time for every model, since it is basically a parallel data-copying process. For the same reason, primitive assembly time is about the same regardless of the model. Most of the time is clearly spent in rasterization, because each thread performs many iterative checks. Moreover, I put the bilinear filtering inside the triangle rasterization kernel, which may push that time even higher. Finally, rendering time is also the same for all models, simply because the shaders are very simple.
+
+## With and without perspective correction:
+
+![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/pcomp.JPG)
+
+The image above was measured on the duck model with and without perspective correction. It shows that perspective correction somewhat decreases rasterization efficiency, probably because of the extra computation with z values needed to interpolate attributes.
 
 ### Credits
 

diff --git a/renders/colorinterp.gif b/renders/colorinterp.gif
diff --git a/renders/depth.gif b/renders/depth.gif
diff --git a/renders/diffuse.gif b/renders/diffuse.gif
diff --git a/renders/imgg.JPG b/renders/imgg.JPG
diff --git a/renders/lambert.gif b/renders/lambert.gif
diff --git a/renders/lamm.gif b/renders/lamm.gif
diff --git a/renders/line.gif b/renders/line.gif
diff --git a/renders/non.gif b/renders/non.gif
diff --git a/renders/normal.gif b/renders/normal.gif
diff --git a/renders/notcorrected.JPG b/renders/notcorrected.JPG
diff --git a/renders/p.gif b/renders/p.gif
diff --git a/renders/pcomp.JPG b/renders/pcomp.JPG
diff --git a/renders/perspcorrect.JPG b/renders/perspcorrect.JPG
diff --git a/renders/res.gif b/renders/res.gif
diff --git a/renders/spec.gif b/renders/spec.gif
diff --git a/renders/truck.gif b/renders/truck.gif
diff --git a/renders/yes.gif b/renders/yes.gif
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
@@ -2,9 +2,11 @@ set(SOURCE_FILES
     "rasterize.cu"
     "rasterize.h"
     "rasterizeTools.h"
+    "common.h"
+    "common.cu"
     )
 
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_60
     )
diff --git a/src/common.cu b/src/common.cu
@@ -0,0 +1,51 @@
+#include "common.h"
+
+void checkCUDAErrorFn(const char *msg, const char *file, int line) {
+    cudaError_t err = cudaGetLastError();
+    if (cudaSuccess == err) {
+        return;
+    }
+
+    fprintf(stderr, "CUDA error");
+    if (file) {
+        fprintf(stderr, " (%s:%d)", file, line);
+    }
+    fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err));
+    exit(EXIT_FAILURE);
+}
+
+
+namespace StreamCompaction {
+    namespace Common {
+
+        /**
+         * Maps an array to an array of 0s and 1s for stream compaction. Elements
+         * which map to 0 will be removed, and elements which map to 1 will be kept.
+         */
+        __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
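+            // Each thread maps one element: nonzero -> 1 (keep), zero -> 0 (discard).
+            int idx = blockIdx.x * blockDim.x + threadIdx.x;
+            if (idx < n)
+            {
+                // Write both outcomes so bools does not need to be zero-initialized.
+                bools[idx] = idata[idx] ? 1 : 0;
+            }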
+        }
+
+        /**
+         * Performs scatter on an array. That is, for each element in idata,
+         * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
+         */
+        __global__ void kernScatter(int n, int *odata,
+                const int *idata, const int *bools, const int *indices) {
+            // Each thread copies one kept element to its compacted position.
+			int idx = blockIdx.x*blockDim.x + threadIdx.x;
+			if (idx < n)
+			{
+				if (bools[idx])
+					odata[indices[idx]] = idata[idx];
+			}
+        }
+
+    }
+}
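
The two kernels in `common.cu` are the map and scatter halves of stream compaction; the exclusive scan that produces `indices` is not part of this diff. As a sanity reference, the whole pipeline can be sketched on the host in plain C++. `compactReference` is a hypothetical name, and the serial scan below merely stands in for whatever GPU scan the project uses.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference for map -> exclusive scan -> scatter.
// Returns the nonzero elements of idata in their original order.
std::vector<int> compactReference(const std::vector<int>& idata) {
    const std::size_t n = idata.size();
    std::vector<int> bools(n);
    std::vector<int> indices(n);

    // Map: 1 for elements to keep, 0 for elements to drop.
    for (std::size_t i = 0; i < n; ++i) {
        bools[i] = idata[i] != 0 ? 1 : 0;
    }

    // Exclusive scan of bools: indices[i] is the output slot of element i.
    int running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        indices[i] = running;
        running += bools[i];
    }

    // Scatter: running now equals the number of kept elements.
    std::vector<int> odata(static_cast<std::size_t>(running));
    for (std::size_t i = 0; i < n; ++i) {
        if (bools[i]) {
            odata[static_cast<std::size_t>(indices[i])] = idata[i];
        }
    }
    return odata;
}
```

Comparing a GPU result against this reference on inputs that contain zeros is a cheap way to confirm that every entry of `bools` is written, not only the entries for kept elements.
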
diff --git a/src/common.h b/src/common.h
@@ -0,0 +1,132 @@
+#pragma once
+
+#include <cuda.h>
+#include <cuda_runtime.h>
+
+#include <cstdio>
+#include <cstring>
+#include <cmath>
+#include <algorithm>
+#include <chrono>
+#include <stdexcept>
+
+#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
+#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
+
+/**
+ * Check for CUDA errors; print and exit if there was a problem.
+ */
+void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1);
+
+inline int ilog2(int x) {
+    int lg = 0;
+    while (x >>= 1) {
+        ++lg;
+    }
+    return lg;
+}
+
+inline int ilog2ceil(int x) {
+    return x == 1 ? 0 : ilog2(x - 1) + 1;
+}
+
+namespace StreamCompaction {
+    namespace Common {
+        __global__ void kernMapToBoolean(int n, int *bools, const int *idata);
+
+        __global__ void kernScatter(int n, int *odata,
+                const int *idata, const int *bools, const int *indices);
+
+	    /**
+	    * This class is used for timing the performance
+	    * Uncopyable and unmovable
+        *
+        * Adapted from WindyDarian(https://github.com/WindyDarian)
+	    */
+	    class PerformanceTimer
+	    {
+	    public:
+		    PerformanceTimer()
+		    {
+			    cudaEventCreate(&event_start);
+			    cudaEventCreate(&event_end);
+		    }
+
+		    ~PerformanceTimer()
+		    {
+			    cudaEventDestroy(event_start);
+			    cudaEventDestroy(event_end);
+		    }
+
+		    void startCpuTimer()
+		    {
+			    if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); }
+			    cpu_timer_started = true;
+
+			    time_start_cpu = std::chrono::high_resolution_clock::now();
+		    }
+
+		    void endCpuTimer()
+		    {
+			    time_end_cpu = std::chrono::high_resolution_clock::now();
+
+			    if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); }
+
+			    std::chrono::duration<double, std::milli> duro = time_end_cpu - time_start_cpu;
+			    prev_elapsed_time_cpu_milliseconds =
+				    static_cast<decltype(prev_elapsed_time_cpu_milliseconds)>(duro.count());
+
+			    cpu_timer_started = false;
+		    }
+
+		    void startGpuTimer()
+		    {
+			    if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); }
+			    gpu_timer_started = true;
+
+			    cudaEventRecord(event_start);
+		    }
+
+		    void endGpuTimer()
+		    {
+			    cudaEventRecord(event_end);
+			    cudaEventSynchronize(event_end);
+
+			    if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); }
+
+			    cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end);
+			    gpu_timer_started = false;
+		    }
+
+		    float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015
+		    {
+			    return prev_elapsed_time_cpu_milliseconds;
+		    }
+
+		    float getGpuElapsedTimeForPreviousOperation() //noexcept
+		    {
+			    return prev_elapsed_time_gpu_milliseconds;
+		    }
+
+		    // remove copy and move functions
+		    PerformanceTimer(const PerformanceTimer&) = delete;
+		    PerformanceTimer(PerformanceTimer&&) = delete;
+		    PerformanceTimer& operator=(const PerformanceTimer&) = delete;
+		    PerformanceTimer& operator=(PerformanceTimer&&) = delete;
+
+	    private:
+		    cudaEvent_t event_start = nullptr;
+		    cudaEvent_t event_end = nullptr;
+
+		    using time_point_t = std::chrono::high_resolution_clock::time_point;
+		    time_point_t time_start_cpu;
+		    time_point_t time_end_cpu;
+
+		    bool cpu_timer_started = false;
+		    bool gpu_timer_started = false;
+
+		    float prev_elapsed_time_cpu_milliseconds = 0.f;
+		    float prev_elapsed_time_gpu_milliseconds = 0.f;
+	    };
+    }
+}
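
`ilog2ceil` in `common.h` rounds a base-2 logarithm up. One common use, assumed here rather than shown in this diff, is padding a buffer length to the next power of two before a tree-based scan. The sketch below copies the two helpers so that it is self-contained; `paddedSize` is a hypothetical name. Both helpers are only meaningful for `x >= 1`.

```cpp
// Copied from common.h: floor(log2(x)) for x >= 1.
inline int ilog2(int x) {
    int lg = 0;
    while (x >>= 1) {
        ++lg;
    }
    return lg;
}

// Copied from common.h: ceil(log2(x)) for x >= 1.
inline int ilog2ceil(int x) {
    return x == 1 ? 0 : ilog2(x - 1) + 1;
}

// Hypothetical helper: smallest power of two that is >= n, for n >= 1.
inline int paddedSize(int n) {
    return 1 << ilog2ceil(n);
}
```

With these definitions a length of 5 pads to 8, a length of 8 stays at 8, and a length of 9 pads to 16.
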
diff --git a/src/main.cpp b/src/main.cpp
@@ -17,7 +17,7 @@
 //-------------------------------
 //-------------MAIN--------------
 //-------------------------------
-
+int starttime;
 int main(int argc, char **argv) {
     if (argc != 2) {
         cout << "Usage: [gltf file]. Press Enter to exit" << endl;
@@ -57,6 +57,7 @@ int main(int argc, char **argv) {
     // Launch CUDA/GL
     if (init(scene)) {
         // GLFW main loop
+		starttime = GetTickCount();
         mainLoop();
     }
 
@@ -97,16 +98,18 @@ void mainLoop() {
 //---------RUNTIME STUFF---------
 //-------------------------------
 float scale = 1.0f;
-float x_trans = 0.0f, y_trans = 0.0f, z_trans = -10.0f;
+float x_trans = 0.0f, y_trans = 0.0f, z_trans = -10.f;
 float x_angle = 0.0f, y_angle = 0.0f;
 void runCuda() {
     // Map OpenGL buffer object for writing from CUDA on a single GPU
     // No data is moved (Win & Linux). When mapped to CUDA, OpenGL should not use this buffer
     dptr = NULL;
+	int timert = GetTickCount() - starttime;
 
+	//y_angle = 0.001*timert;
 	glm::mat4 P = glm::frustum<float>(-scale * ((float)width) / ((float)height),
 		scale * ((float)width / (float)height),
-		-scale, scale, 1.0, 1000.0);
+		-scale, scale, 3, 1000.0);
 
 	glm::mat4 V = glm::mat4(1.0f);
 
@@ -382,13 +385,13 @@ void mouseMotionCallback(GLFWwindow* window, double xpos, double ypos)
 	if (mouseState == ROTATE)
 	{
 		//rotate
-		x_angle += (float)s_r * diffy;
+		x_angle += -(float)s_r * diffy;
 		y_angle += (float)s_r * diffx;
 	}
 	else if (mouseState == TRANSLATE)
 	{
 		//translate
-		x_trans += (float)(s_t * diffx);
+		x_trans += (float)(-s_t * diffx);
 		y_trans += (float)(-s_t * diffy);
 	}
 }
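
The `starttime` / `timert` pair added to `main.cpp` relies on `GetTickCount()`, which is a Windows API. If portability matters, the same elapsed-milliseconds value can be obtained from the standard library. The following is only a sketch; `Clock` and `elapsedMs` are hypothetical names, not part of this PR.

```cpp
#include <chrono>

// Hypothetical portable stand-in for the GetTickCount() pair:
// a monotonic clock that is not affected by system time changes.
using Clock = std::chrono::steady_clock;

// Milliseconds elapsed since the given start point.
inline long long elapsedMs(Clock::time_point start) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               Clock::now() - start)
        .count();
}
```

Under this sketch, `starttime` would become a `Clock::time_point` set with `Clock::now()` before `mainLoop()`, and `runCuda()` would call `elapsedMs(starttime)`.
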