Posts: 96 · Comments: 454 · Joined: 1 yr. ago

  • Doubt it; I don't think Bezos or Musk gives a single shit what you think.

  • I may very well be; still researching.

  • Funny, tell that to the billionaires who have private jets.

  • "AVIF is an image file format that uses AV1 compression algorithms." Yes, that's what I mean.

  • What's your knowledge of LLMs, if any at all?

  • Yes. As far as scalability goes, cheaper, more efficient models can be used in applications that require thousands of calls a day.

  • This is peak-bubble news. AI is rapidly becoming more energy efficient. These events will be looked back on like pets.com's valuation reaching hundreds of millions before it died.

  • The student loans are never being paid back, just like the federal debt.

  • AI Model Efficiency Index 2.1 — Methodology Summary

    Goal: Rank AI models by real-world value (performance per dollar) using harder, less-contaminated benchmarks.

    Benchmarks Used (8 metrics):

    • 20% SWE-bench – real-world coding tasks (repo-level bug fixes)
    • 15% MMLU-Pro – harder general knowledge (resists saturation)
    • 15% Humanity's Last Exam – extremely difficult academic reasoning
    • 15% GPQA Diamond – PhD-level science questions
    • 10% ARC-AGI – abstract reasoning and problem-solving
    • 15% Chatbot Arena Elo – human preference (crowdsourced rankings)
    • 10% RULER – long-context robustness (32k–128k tokens)
    • 10% EQBench – emotional intelligence and creative quality

    Why This Mix?

    • Reduces gaming and contamination (avoids relying on easy, memorized benchmarks like vanilla MMLU).
    • Captures multiple capability dimensions: coding, reasoning, long-context, human preference, and creativity.
    • Harder benchmarks are less saturated, making score differences meaningful.

    Calculation:

    1. Normalize all 8 benchmark scores to 0–100 scale.
    2. Compute weighted composite score for each model.
    3. Divide composite score by blended API cost (3:1 input:output token ratio).
    4. Rank by efficiency index (higher = better value).

    Coverage:

    • Includes only models with complete or near-complete data across all 8 metrics.
    • Excludes enterprise/niche models (Cohere, AI21, Baichuan) due to incomplete benchmark coverage or opaque pricing.
    • All models are 2025 releases with public pricing and APIs.
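    The calculation steps above can be sketched in Python. The benchmark keys, example prices, and the weight-normalization step are my assumptions for illustration, not the author's actual pipeline (note that the listed weights sum to 1.10 as written, so the sketch normalizes by the total weight):

    ```python
    # Sketch of the efficiency-index calculation described above.
    # Benchmark keys and any example figures are illustrative only.

    WEIGHTS = {
        "swe_bench": 0.20,      # real-world coding tasks
        "mmlu_pro": 0.15,       # harder general knowledge
        "hle": 0.15,            # Humanity's Last Exam
        "gpqa_diamond": 0.15,   # PhD-level science questions
        "arc_agi": 0.10,        # abstract reasoning
        "arena_elo": 0.15,      # human preference (normalized to 0-100)
        "ruler": 0.10,          # long-context robustness
        "eqbench": 0.10,        # emotional/creative quality
    }

    def blended_cost(input_price: float, output_price: float, ratio: float = 3.0) -> float:
        """Blend per-token prices at the stated 3:1 input:output token ratio."""
        return (ratio * input_price + output_price) / (ratio + 1)

    def efficiency_index(scores: dict, input_price: float, output_price: float) -> float:
        """Weighted composite of 0-100 scores divided by blended API cost."""
        total_w = sum(WEIGHTS.values())  # normalize in case weights don't sum to 1
        composite = sum(w * scores[k] for k, w in WEIGHTS.items()) / total_w
        return composite / blended_cost(input_price, output_price)
    ```

    Ranking is then just sorting models by `efficiency_index` in descending order.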
  • Okay, here's the new ranking with the more spread-out weighting, and honestly it looks a lot more reasonable.

    1. DeepSeek V3.2-Exp (Sep 2025) — 69.26
    2. Kimi K2 Thinking (Nov 2025) — 66.19
    3. Gemini 2.5 Flash (May 2025) — 58.73
    4. Qwen 3 Max (Jul 2025) — 55.56
    5. GPT-5 (Aug 2025) — 21.25
    6. o3 (Apr 2025) — 20.39
    7. Gemini 2.5 Pro (Mar 2025) — 19.98
    8. Gemini 3 Pro (Nov 2025) — 19.82
    9. Claude 3.5 Sonnet (Aug 2025) — 10.17
    10. GPT-5 Pro (Aug 2025) — 1.96
  • Thank you, I'll factor that into the index. If you have any other recommendations for making the index more robust, let me know. The goal is to make this dependent on real-world API costs; I don't care if the newest, smartest model is released if it costs $100 every time to use.

  • Thank you, but there's no need to be sorry. It's something that's been accepted and fuels my desire for deeper knowledge.

  • This image describes my soul.

  • We will rebuild with Rust when all is said and done.