Tags attention1 bottleneck1 cuda5 Cuda1 deep-learning4 fine-tuning1 Fine‑Tuning1 gpu4 gpu-kernels4 HyLoRA1 json1 learning-journey6 LLaMA1 llm1 LoRA1 matrix multiplication1 matrix-multiplication2 memory-optimization4 mistral1 nlp1 optimization4 parallel-computing2 peft1 performance4 Performance Trade‑off1 Perplexity1 phi-41 profiling3 qlora1 shared memory1 shared-memory2 structured-data-extraction1 SVD1 tiling2 transformer1 transformers1 triton4 warp-divergence2