  1. PyTorch Optimizer: AdamW and Adam with weight decay

    Oct 31, 2020 · Yes, Adam and AdamW weight decay are different. Loshchilov and Hutter pointed out in their paper (Decoupled Weight Decay Regularization) that the way weight decay is implemented in … (a minimal sketch of the difference appears after this result list).

  2. Why do NLP models usually use AdamW as the optimizer instead of SGD?

    Next, the article explains how AdamW fixes the flaw in the Adam optimizer that weakens L2 regularization. After reading it, you should have a solid grasp of AdamW, the workhorse neural-network optimizer of the LLM era.

  3. Have you switched from AdamW to Muon? Why or why not? - 知乎

    AdamW: We used to get along so well. AdamW: Every day was happy, all of us together. AdamW: And now you say you are switching to Muon? That is just too strange. AdamW is over; there is absolutely no chance of a comeback. AdamW: It is not over! …

  4. Why does the AdamW optimizer work well with a large weight decay? - 知乎

    Why does the AdamW optimizer work well with a large weight decay? I originally thought only models like ViT need AdamW to speed up convergence, with a large weight decay encouraging weight sparsity, but after experiments (several models on classification and detection tasks, …

  5. Can Muon really challenge Adam? What new optimization ideas does it embody? - 知乎

    Muon, by contrast, runs a Newton-Schulz iteration on every parameter matrix to compute an orthogonal approximation of its gradient momentum, so the corresponding node must hold the complete parameter state matrix; Kimi's work achieves this with one extra gather step, … (a Newton-Schulz sketch appears after the result list).

  6. Implementation of AdamW is deprecated and will be removed in a …

    Feb 22, 2023 · Implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW

  7. What is the proper way to weight decay for Adam Optimizer

    Jun 9, 2017 · Since the Adam optimizer keeps a pair of running averages (mean and variance) for the gradients, I wonder how it should properly handle weight decay. I have seen two ways of …

  8. python - How does a decaying learning rate schedule with …

    Jan 25, 2022 · To see this, consider the second line within the for-loop in the AdamW algorithm, the decoupled decay step θ_t ← θ_{t−1} − γλθ_{t−1}. But what if the learning rate shrinks after each epoch because we use (say) an … (the scheduler sketch after the result list shows this effect numerically).

  9. Why do transformers need Adam? - 知乎

    … convergence rate. Second, under the assumption of heavy-tailed noise, the author finds that the expected L2 norm for SGD is not bounded (it tends to positive infinity), so SGD does worse than Adam; but they propose a remedy, which is to clip …

  10. learning rate in torch.optim.AdamW has no effect?

    Oct 8, 2024 · I am fine-tuning BLIP-2 on the RSICD dataset using LoRA, working on Colab with an A100. Strangely, I am finding that when I set the learning rate in the code …
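
A recurring point across Results 1, 4, and 7 is the difference between folding the decay term into the gradient (torch.optim.Adam with weight_decay, i.e. an L2 penalty that then flows through the adaptive moment estimates) and decoupling it from the gradient entirely (torch.optim.AdamW). The single-step sketch below only illustrates that contrast, with Adam reduced to one bias-corrected update; it is not a full optimizer implementation.

```python
import torch

torch.manual_seed(0)
w = torch.randn(5)                  # toy parameter vector
g = torch.randn(5)                  # gradient of the loss alone (no penalty)
lr, wd, b1, b2, eps = 1e-3, 1e-2, 0.9, 0.999, 1e-8

def adam_update(grad, m, v, t):
    """One bias-corrected Adam step computed from `grad`."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return lr * m_hat / (v_hat.sqrt() + eps)

m = v = torch.zeros_like(w)

# 1) L2-style decay (Adam + weight_decay): wd * w is added to the gradient,
#    so the decay passes through m/v and is rescaled by 1/sqrt(v_hat) --
#    parameters with large gradient variance are effectively decayed less.
w_adam = w - adam_update(g + wd * w, m, v, t=1)

# 2) Decoupled decay (AdamW): the weights are shrunk directly by lr * wd,
#    and the Adam step sees only the raw loss gradient.
w_adamw = (w - lr * wd * w) - adam_update(g, m, v, t=1)

print(w_adam)
print(w_adamw)
```

In PyTorch, `torch.optim.Adam(params, weight_decay=1e-2)` implements the first scheme and `torch.optim.AdamW(params, weight_decay=1e-2)` the second; with `weight_decay=0` the two coincide.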
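
Result 8 (and, indirectly, Result 10) asks how a decaying learning-rate schedule interacts with AdamW's decoupled decay. In torch.optim.AdamW the decay step is w ← w − lr_t · weight_decay · w, so the scheduled learning rate scales the effective decay as well. A quick way to see this numerically, assuming a parameter whose loss gradient is zero so that only the decay term acts:

```python
import torch

# With grad == 0, the Adam part of the update vanishes and torch.optim.AdamW
# reduces to w <- w * (1 - lr_t * weight_decay) per step.
w = torch.nn.Parameter(torch.ones(1))
opt = torch.optim.AdamW([w], lr=0.1, weight_decay=0.5)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=1, gamma=0.5)  # halve lr each step

for step in range(3):
    w.grad = torch.zeros_like(w)   # no loss gradient; only the decay acts
    opt.step()
    print(f"step={step}  lr={opt.param_groups[0]['lr']:.3f}  w={w.item():.4f}")
    sched.step()
```

Each halving of the learning rate also halves the fraction of weight removed per step, so the effective regularization anneals along with the schedule; keeping it constant would require adjusting weight_decay yourself (the Loshchilov & Hutter paper instead applies a separate schedule multiplier to both the gradient step and the decay term).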
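
Results 3 and 5 refer to Muon, whose core idea is to replace Adam's element-wise rescaling with a per-matrix step: the gradient momentum of each weight matrix is pushed toward the nearest orthogonal matrix via a Newton-Schulz iteration. Because that iteration operates on the whole matrix at once, each worker needs the full momentum/parameter matrix locally, which is why sharded setups (as in the Kimi work mentioned above) gather it first. The sketch below uses the classic cubic Newton-Schulz iteration for readability; public Muon implementations use a tuned quintic variant, and all names and hyperparameters here are illustrative rather than Muon's actual API.

```python
import torch

def newton_schulz_orthogonalize(m: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Push the singular values of `m` toward 1 without an explicit SVD.

    Cubic Newton-Schulz iteration X <- 1.5*X - 0.5*(X @ X.T) @ X; after
    normalizing so the singular values are at most 1, each step drives the
    nonzero ones toward 1, yielding an approximately orthogonal matrix.
    """
    x = m / (m.norm() + 1e-7)          # scale so the iteration converges
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x

def muon_style_step(weight, grad, momentum_buf, lr=0.02, momentum=0.95):
    """One illustrative Muon-style update for a single 2-D weight matrix."""
    momentum_buf.mul_(momentum).add_(grad)              # accumulate gradient momentum
    update = newton_schulz_orthogonalize(momentum_buf)  # orthogonalize the whole matrix
    weight.add_(update, alpha=-lr)                      # apply the orthogonalized step
    return weight

# Toy usage: a 128x64 weight matrix and its momentum buffer.
w = torch.randn(128, 64)
buf = torch.zeros_like(w)
w = muon_style_step(w, torch.randn_like(w), buf)
```

The orthogonalization equalizes the scale of the update across directions of the matrix; the distributed cost mentioned in the snippet comes from needing the whole matrix on one device for the iteration.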