Show HN: QwQ-32B APIs – o1 like reasoning at 1% the cost
Ubicloud is an open source alternative to AWS. Today, we launched our inference APIs, built with open source AI models. QwQ-32B-Preview is one of those models, and it can provide o1-like reasoning at 1% of the cost.
QwQ is licensed under Apache 2.0 [1] and Ubicloud under AGPL v3. We deploy open models on a cloud stack that can run anywhere. This allows us to offer great price / performance.
From an accuracy standpoint, QwQ does well in math and coding domains. For example, in the MMLU-Pro Computer Science LLM Benchmark, the accuracy rankings are as follows: Claude-3.5 Sonnet (82.5), QwQ-32B-Preview (79.1), and GPT 4o 2024-11-20 (73.1). [2]
You can start evaluating QwQ (and Llama 3B / 70B) by logging into the Ubicloud console: https://console.ubicloud.com/create-account
We also provide an AI chat box for convenience. We price the API endpoints at $0.60 per million tokens, which is 100x lower than o1’s output token price. Also, when using open models, your first million tokens each month are free. This way, you can start evaluating these models today.
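To make the pricing concrete, here is a minimal sketch of the arithmetic. The $0.60 price and the free first million tokens come from the post. The $60 o1 figure is only what the "100x" claim implies, and subtracting the free allowance from the billed tokens is an assumption about how billing works. The function and constant names are made up for illustration.

```python
QWQ_PRICE_PER_M = 0.60          # USD per million tokens, from the post
O1_OUTPUT_PRICE_PER_M = 60.00   # assumption: implied by the "100x lower" claim
FREE_TOKENS_PER_MONTH = 1_000_000  # free allowance for open models, from the post


def monthly_cost(tokens_used, price_per_m=QWQ_PRICE_PER_M,
                 free_tokens=FREE_TOKENS_PER_MONTH):
    """Return the USD cost for one month of usage.

    Assumes the free allowance is simply subtracted from the billed tokens.
    """
    billable = max(0, tokens_used - free_tokens)
    return billable / 1_000_000 * price_per_m


# 11M tokens on QwQ: 10M billable after the free million.
qwq_bill = monthly_cost(11_000_000)
# 10M output tokens at the assumed o1 price, with no free allowance.
o1_bill = monthly_cost(10_000_000, O1_OUTPUT_PRICE_PER_M, free_tokens=0)
print(f"QwQ: ${qwq_bill:.2f}, o1: ${o1_bill:.2f}")
```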
## OpenAI o1 or QwQ-32B
In math and coding benchmarks, QwQ-32B ties with o1 and outperforms Claude 3.5 Sonnet. In our own qualitative tests, however, we found that o1 performed better.
For example, we asked both models to “add a pair of parentheses to the incorrect equation: 1 + 2 * 3 + 4 * 5 + 6 * 7 + 8 * 9 = 479, to make the equation true.” [3]
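For reference, the puzzle is small enough to brute-force. The sketch below is not how either model approached it. It just tries every way of wrapping one pair of parentheses around a run of consecutive operands and keeps the placements that evaluate to 479. The function names are made up for illustration.

```python
from itertools import combinations

TOKENS = "1 + 2 * 3 + 4 * 5 + 6 * 7 + 8 * 9".split()
TARGET = 479


def parenthesizations(tokens):
    """Yield every expression with one pair of parentheses added.

    Operands sit at even indices; the pair opens before one operand
    and closes after a later one.
    """
    operand_positions = range(0, len(tokens), 2)
    for start, end in combinations(operand_positions, 2):
        candidate = (tokens[:start] + ["("] + tokens[start:end + 1]
                     + [")"] + tokens[end + 1:])
        yield " ".join(candidate)


def solve(tokens, target):
    # eval is acceptable here only because the input is a fixed arithmetic string.
    return [expr for expr in parenthesizations(tokens) if eval(expr) == target]


for expr in solve(TOKENS, TARGET):
    print(expr, "=", TARGET)
```

One placement this finds is 1 + 2 * (3 + 4 * 5 + 6) * 7 + 8 * 9, which works out to 1 + 2 * 29 * 7 + 72 = 479.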
QwQ’s answer shows iterative reasoning steps, where the model enumerates candidate answers using light heuristics. o1’s answer to the same question feels like an iterative deepen-and-test (though not purely depth-first). When we asked the models harder questions, o1 seemed to understand the question better and to employ more complex strategies. [3][4]
Finally, we found that o1’s advantage in reasoning compounded with its other advantages. For example, we asked both models to write example Python programs. Looking at the answers, it became clear that o1 was trained on a larger data set and that it was aware of Python libraries that QwQ-32B didn’t know about. Further, QwQ-32B at times flip-flopped between English and Chinese, making it harder for us to understand the model. [3]
Now, if we think that o1 has these advantages, why the heck are we doing a Show HN on QwQ-32B (and other open weight models)? Two reasons.
First, QwQ is still comparable to o1 and Ubicloud offers it for 100x less. You can employ a dozen QwQ-32Bs, prompt them with different search strategies, use VMs to verify their results, and still come in under what o1 costs. In the short term, combining these classic AI search strategies with AI models feels much more efficient than trying to “teach” an uber AI model.
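Here is a rough sketch of that idea, under stated assumptions. `ask_model` and `verify` are hypothetical callables that you would supply yourself (an API call and a checker running in a VM, say). The strategies run one after another for simplicity. The $60 o1 price is only what the "100x" claim implies, and the cost comparison assumes every call produces the same number of tokens, which may not hold in practice.

```python
QWQ_PRICE_PER_M = 0.60   # USD per million tokens, from the post
O1_PRICE_PER_M = 60.00   # assumption: implied by the "100x" claim


def fan_out(question, strategies, ask_model, verify):
    """Ask the model once per strategy; return the first verified answer.

    ask_model(prompt) -> str and verify(answer) -> bool are hypothetical
    callables provided by the caller.
    """
    for strategy in strategies:
        answer = ask_model(f"{strategy}\n\n{question}")
        if verify(answer):
            return answer
    return None


def fan_out_cost(num_calls, tokens_per_call, price_per_m=QWQ_PRICE_PER_M):
    """USD cost of num_calls calls, assuming equal token counts per call."""
    return num_calls * tokens_per_call / 1_000_000 * price_per_m


# Stand-in for a real model, used only to show the control flow.
def fake_model(prompt):
    return "4" if "step by step" in prompt else "5"


answer = fan_out(
    "What is 2 + 2?",
    ["Guess quickly.", "Work step by step."],
    fake_model,
    verify=lambda a: a == "4",
)
print(answer)
print(fan_out_cost(12, 10_000), "vs", fan_out_cost(1, 10_000, O1_PRICE_PER_M))
```

Under these assumptions, a dozen QwQ calls cost 12 x $0.60 = $7.20 per million tokens each, against $60 for a single o1 call of the same length.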
Second, we think open source fosters collaboration and trust, and that is a superpower that compounds over time. We foresee a future where open source AI not only delivers top-quality results, but also surpasses proprietary models in some areas. If you believe in that future and are looking for someone to partner with on the infrastructure side, please hit us up at info@ubicloud.com!
[1] https://qwenlm.github.io/blog/qwq-32b-preview/
[2] https://huggingface.co/blog/wolfram/llm-comparison-test-2024...
Congrats on the launch, QwQ is fantastic!
Hey, author here.
I took extensive notes when playing around with o1 and QwQ-32B. When reading my notes later, I realized that I used the pronoun "they" to refer to a reasoning model. Somehow, it just didn't feel right to refer to o1 or QwQ-32B as "it."
Further, the rate of progress in this field was stunning. The Meta Chain-of-Thought paper that I linked in the Show HN came out last Friday.
So, if our AI overlords are reading this Show HN in the future, I'd like them to know that I landed on the right side of history on this one.