Research - Frontier lab performance converges, but OpenAI maintains its edge following the launch of o1, as planning and reasoning emerge as a major frontier.
Foundation models demonstrate their ability to break out of language as multimodal research drives into mathematics, biology, genomics, the physical sciences, and neuroscience.
US sanctions fail to stop Chinese (V)LLMs rising up community leaderboards. (View Highlight)
Industry - NVIDIA remains the most powerful company in the world, enjoying a stint in the $3T club, while regulators probe the concentrations of power within GenAI.
More established GenAI companies bring in billions of dollars in revenue, while start-ups begin to gain traction in sectors like video and audio generation.
Although companies begin to make the journey from model to product, long-term questions around pricing and sustainability remain unresolved.
Driven by a bull run in public markets, AI companies reach $9T in value, while investment levels grow healthily in private companies. (View Highlight)
Politics - While global governance efforts stall, national and regional AI regulation has continued to advance, with controversial legislation passing in the US and EU.
The reality of compute requirements forces Big Tech companies to reckon with real-world physical constraints on scaling and their own emissions targets.
Meanwhile, governments’ own attempts to build capacity continue to lag.
Anticipated AI effects on elections, employment and a range of other sensitive areas are yet to be realized at any scale. (View Highlight)
Safety - A vibe-shift from safety to acceleration takes place as companies that previously warned us about the pending extinction of humanity need to ramp up enterprise sales and usage of their consumer apps.
Governments around the world emulate the UK in building up state capacity around AI safety, launching institutes and studying critical national infrastructure for potential vulnerabilities.
Every proposed jailbreaking ‘fix’ has failed, but researchers are increasingly concerned with more sophisticated, long-term attacks. (View Highlight)
For much of the year, both benchmarks and community leaderboards pointed to a chasm between GPT-4 and ‘the best of the rest’. However, Claude 3.5 Sonnet, Gemini 1.5, and Grok 2 have all but eliminated this gap as model performance now begins to converge.
● On both formal benchmarks and vibes-based analysis, the best-funded frontier labs are able to rack up scores within low single digits of each other on individual capabilities.
● Models are now consistently highly capable coders and strong at factual recall and math, but less good at open-ended question-answering and multi-modal problem solving.
● Many of the variations are sufficiently small that they are now likely to be the product of differences in implementation. For example, GPT-4o outperforms Claude 3.5 Sonnet on MMLU, but apparently underperforms it on MMLU-Pro - a benchmark designed to be more challenging.
● Considering the relatively subtle technical differences between architectures and likely heavy overlaps in pre-training data, model builders are now increasingly having to compete on new capabilities and product features. (View Highlight)
The OpenAI team had clearly clocked the potential of inference compute early, with OpenAI o1 appearing within weeks of papers from other labs exploring the technique.
● By shifting compute from pre- and post-training to inference, o1 reasons through complex prompts step-by-step in a chain-of-thought (COT) style, employing RL to sharpen the COT and the strategies it uses. This unlocks the possibility of solving multi-layered math, science, and coding problems where LLMs have historically struggled, due to the inherent limitations of next-token prediction.
● OpenAI report significant improvements on reasoning-heavy benchmarks versus 4o, with the starkest gain on AIME 2024 (competition math), where it scored a whopping 83.83 versus 13.4.
● However, this capability comes at a steep price: 1M input tokens of o1-preview cost $15, while 1M output tokens will set you back $60. This makes it 3-4x more expensive than GPT-4o (see the cost sketch below).
● OpenAI is clear in its API documentation that it is not a like-for-like 4o replacement and that it is not the best model for tasks that require consistently quick responses, image inputs, or function calling. (View Highlight)
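As a rough illustration of the pricing gap, here is a minimal cost sketch. The o1-preview rates are those quoted above; the GPT-4o list prices ($5/1M input, $15/1M output at the time) and the token counts are assumptions for illustration only.

```python
# Rough per-call cost comparison: o1-preview vs GPT-4o, using $-per-1M-token rates.
# GPT-4o prices and the example token counts are illustrative assumptions.

PRICES = {
    "o1-preview": {"input": 15.00, "output": 60.00},  # rates quoted above
    "gpt-4o":     {"input": 5.00,  "output": 15.00},  # assumed list price at the time
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical reasoning-heavy request: long prompt, long (hidden) chain-of-thought output.
inp, out = 4_000, 8_000
for model in PRICES:
    print(f"{model}: ${call_cost(model, inp, out):.3f}")

ratio = call_cost("o1-preview", inp, out) / call_cost("gpt-4o", inp, out)
print(f"o1-preview is ~{ratio:.1f}x more expensive on this mix")  # ~3.9x
```

Note that o1 also bills its hidden reasoning tokens as output tokens, so the effective gap on real workloads can be larger than the raw rate ratio.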
The community were quick to put o1 through its paces, finding that it performed significantly better than other LLMs on certain logical problems and puzzles. Its true edge shone through, however, on complex math and science tasks, with a viral video of a PhD student reacting with astonishment as it reproduced a year of his PhD code in approximately an hour. However, the model remains weaker on certain kinds of spatial reasoning. Like its predecessors, it can’t play chess to save its life… yet. (View Highlight)
In April, Meta dropped the Llama 3 family, 3.1 in July, and 3.2 in September. Llama 3.1 405B, their largest to-date, is able to hold its own against GPT-4o and Claude 3.5 Sonnet across reasoning, math, multilingual, and long-context tasks. This marks the first time an open model has closed the gap with the proprietary frontier.
● Meta stuck to the same decoder-only transformer architecture that it’s used since Llama 1, with minor adaptations, namely more transformer layers and attention heads.
● Meta used an incredible 15T tokens to train the family. While this blew through the “Chinchilla-optimal” amount of training compute, they found that both the 8B and 70B models improved log-linearly up to 15T.
● Llama 3.1 405B was trained on over 16,000 H100 GPUs, the first Llama model trained at this scale.
● Meta followed up with Llama 3.2 in September, which incorporated 11B and 90B VLMs (Llama’s multimodal debut).
The former was competitive with Claude 3 Haiku, the latter with GPT-4o-mini. The company also released 1B and 3B text-only models, designed to operate on-device.
● Llama-based models have now racked up over 440M downloads on Hugging Face. (View Highlight)
With open source commanding considerable community support and becoming a hot button regulatory issue, some researchers have suggested that the term is often used misleadingly. It can be used to lump together vastly different openness practices across weights, datasets, licensing, and access methods. (View Highlight)
With new model families reporting incredibly strong benchmark performance straight out-of-the-gate, researchers have increasingly been shining a light on dataset contamination: when test or validation data leaks into the training set. Researchers from Scale retested a number of models on a new Grade School Math 1000 (GSM1k) that mirrors the style and complexity of the established GSM8k benchmark, finding significant performance drops in some cases. (View Highlight)
A team from the University of Edinburgh flagged the number of mistakes in MMLU, including wrong ground-truth labels, unclear questions, and multiple correct answers. While error rates were low across most individual topics, there were big spikes in certain fields, such as virology, where 57% of the analyzed instances contained errors.
● On a manually corrected MMLU subset, models broadly gain in performance, although they worsened on professional law and formal logic. This suggests that inaccurate MMLU instances are being learned during pre-training.
● In more safety-critical territory, OpenAI has warned that SWE-bench, which evaluates models’ ability to solve real-world software issues, was underestimating the autonomous software engineering capabilities of models, as it contained tasks that were hard or impossible to solve.
● The researchers partnered with the creators of the benchmark to create SWE-bench Verified. (View Highlight)
The LMSYS Chatbot Arena Leaderboard has emerged as the community’s favorite method of formalizing evaluation by “vibes”. But as model performance improves, it’s beginning to produce counterintuitive results. ● The arena, which allows users to interact with two randomly selected chatbots side-by-side, provides a rough crowdsourced evaluation.
● However, controversially, this led to GPT-4o and GPT-4o Mini receiving the same scores, with the latter also outperforming Claude Sonnet 3.5.
● This has led to concerns that the ranking is essentially becoming a way of assessing which writing style users happen to prefer most.
● Additionally, as smaller models tend to perform less well on tasks involving more tokens, the 8k context limit arguably gives them an unfair advantage.
● However, the early version of the vision leaderboard is now beginning to gain traction and aligns better with other evals. (View Highlight)
Deficiencies in both reasoning capabilities and training data mean that AI systems have frequently fallen short on math and geometry problems. With AlphaGeometry, a symbolic deduction engine comes to the rescue.
● A Google DeepMind/NYU team generated millions of synthetic theorems and proofs using symbolic engines, using them to train a language model from scratch.
● AlphaGeometry alternates between the language model proposing new constructions and symbolic engines performing deductions until a solution is found.
● Impressively, it solved 25 out of 30 on a benchmark of Olympiad-level geometry problems, nearing human International Mathematical Olympiad gold medalist performance. The next best AI performance scored only 10.
● It also demonstrated generalisation capabilities - for example, finding that a specific detail in a 2004 IMO problem was unnecessary for the proof. (View Highlight)
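The alternation at the heart of AlphaGeometry can be pictured with a short schematic sketch. The `language_model` and `symbolic_engine` objects below are hypothetical stand-ins, not DeepMind's implementation; this illustrates the shape of the loop only.

```python
# Schematic AlphaGeometry-style loop: a symbolic engine deduces everything it can
# from the current premises; when it gets stuck, a language model proposes a new
# auxiliary construction (a point, line, or circle), and deduction resumes.
# Both components are hypothetical stand-ins.

def solve_geometry(problem, goal, language_model, symbolic_engine, max_rounds=10):
    state = symbolic_engine.initial_state(problem)
    for _ in range(max_rounds):
        # 1) Exhaust purely symbolic deduction from the current premises.
        state = symbolic_engine.deduce_closure(state)
        if symbolic_engine.proves(state, goal):
            return symbolic_engine.extract_proof(state, goal)
        # 2) Stuck: ask the LM (trained on synthetic theorems/proofs) for a construction.
        construction = language_model.propose_construction(problem, state)
        state = symbolic_engine.add_construction(state, construction)
    return None  # no proof found within the round budget
```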
Research suggests that models are robust in the face of deeper layers - which are meant to handle complex, abstract, or task-specific information - being pruned intelligently. Maybe it’s possible to go even further.
● A Meta/MIT team looking at open-weight pre-trained LLMs concluded that it’s possible to do away with up to half a model’s layers and suffer only negligible performance drops on question-answering benchmarks.
● They identified optimal layers for removal based on similarity and then “healed” the model through small amounts of efficient fine-tuning.
● NVIDIA researchers took a more radical approach by pruning layers, neurons, attention heads, and embeddings, and then using knowledge distillation for efficient retraining.
● The MINITRON models, derived from Nemotron-4 15B, achieved comparable or superior performance to models like Mistral 7B and Llama-3 8B while using up to 40x fewer training tokens. (View Highlight)
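A toy sketch of the similarity-based pruning-then-healing recipe described above, assuming a Llama-style decoder exposing `model.model.layers` and per-layer hidden states captured on a calibration set. The redundancy score and healing step are heavily simplified.

```python
# Toy depth-pruning sketch: score blocks of consecutive layers by how little they
# change the hidden states, drop the most redundant block, then "heal" the model
# with a short fine-tuning run (e.g. LoRA). Simplified illustration only.
import torch

@torch.no_grad()
def block_redundancy(hidden_states, start, n_drop):
    """1 - cosine similarity between activations entering layer `start` and those
    leaving layer `start + n_drop - 1`; a low score means the block is redundant."""
    h_in, h_out = hidden_states[start], hidden_states[start + n_drop]
    cos = torch.nn.functional.cosine_similarity(h_in, h_out, dim=-1)
    return 1.0 - cos.mean().item()

def prune_block(model, hidden_states, n_drop):
    layers = model.model.layers  # assumes a Hugging Face Llama-style stack
    scores = [block_redundancy(hidden_states, s, n_drop)
              for s in range(len(layers) - n_drop)]
    start = min(range(len(scores)), key=scores.__getitem__)  # most redundant block
    kept = [layer for i, layer in enumerate(layers)
            if not (start <= i < start + n_drop)]
    model.model.layers = torch.nn.ModuleList(kept)
    return model  # follow with a small "healing" fine-tune before evaluation
```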
As Andrej Karpathy and others have argued, current large model sizes could be a reflection of inefficient training. Using these big models to refine and synthesize training data could help train capable smaller models.
● Google have embraced this approach, distilling Gemini 1.5 Flash from Gemini 1.5 Pro, while Gemma 2 9B was distilled from Gemma 2 27B, and Gemma 2B from a larger unreleased model.
● There is also community speculation that Claude 3 Haiku, a highly capable smaller model, is a distilled version of the larger Opus, but Anthropic has never confirmed this.
● These distillation efforts are going multimodal too. Black Forest Labs have released FLUX.1 dev, an open-weight text-to-image model distilled from their Pro model.
● To support these efforts, the community has started to produce open-source distillation tools, like arcee.ai’s DistillKit, which supports both Logit-based and Hidden States-based distillation.
● Llama 3.1 405B is also being used for distillation, after Meta updated its terms so output logits can be used to improve any models, not just Llama ones.
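At its core, the logit-based flavour of distillation that DistillKit and the updated Llama 3.1 terms enable reduces to a temperature-scaled KL term between teacher and student token distributions, blended with the usual cross-entropy. A minimal sketch, not any particular lab's recipe:

```python
# Minimal logit-distillation loss: the student mimics the teacher's softened
# next-token distribution, blended with standard cross-entropy on the labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    t = temperature
    # Soft targets: KL divergence between softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    # Hard targets: ordinary next-token cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits.reshape(-1, student_logits.size(-1)),
                           labels.reshape(-1), ignore_index=-100)
    return alpha * soft + (1 - alpha) * hard
```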
Models built for mobile compete with their larger peers → As big tech companies think through large-scale end-user deployment, we’re starting to see high-performing LLM and multimodal models that are small enough to run on smartphones.
● Microsoft’s phi-3.5-mini is a 3.8B LM that competes with larger 7B-8B class models like Llama 3.1 8B. It performs well on reasoning and question-answering, but its size restricts its factual knowledge. To enable on-device inference, the model was quantized to 4 bits, reducing its memory footprint to approximately 1.8GB.
● Apple introduced MobileCLIP, a family of efficient image-text models optimized for fast inference on smartphones. Using novel multimodal reinforced training, they improve the accuracy of compact models by transferring knowledge from an image captioning model and an ensemble of strong CLIP encoders.
● Hugging Face also got in on the action with SmolLM, a family of small language models, available in 135M, 360M, and 1.7B formats. (View Highlight)
It’s possible to shrink the memory requirements of LLMs by reducing the precision of their parameters. Researchers are increasingly managing to minimize the performance trade-offs.
● Microsoft’s BitNet uses a “BitLinear” layer to replace standard linear layers, employing 1-bit weights and quantized activations.
● It shows competitive performance compared to full-precision models and demonstrates a scaling law similar to full-precision transformers, with significant memory and energy savings.
● Microsoft followed up with BitNet b1.58, with ternary weights to match full-precision LLM performance at 3B size while retaining efficiency gains.
● Meanwhile, ByteDance’s TiTok (Transformer-based 1-Dimensional Tokenizer) quantizes images into compact 1D sequences of discrete tokens for image reconstruction and generation tasks. This allows images to be represented with as few as 32 tokens, instead of hundreds or thousands. (View Highlight)
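The b1.58 scheme boils down to absmean scaling followed by rounding every weight to {-1, 0, +1}. A simplified sketch of the quantization step only (real training keeps full-precision master weights and uses a straight-through estimator, which is omitted here):

```python
# Ternary ("1.58-bit") weight quantization in the spirit of BitNet b1.58:
# scale by the mean absolute weight, round to {-1, 0, +1}, keep the scale for
# dequantization. Simplified; activations are also quantized in the real method.
import torch

def quantize_ternary(w: torch.Tensor, eps: float = 1e-5):
    scale = w.abs().mean().clamp(min=eps)    # absmean scaling factor
    w_q = (w / scale).round().clamp_(-1, 1)  # each weight becomes -1, 0, or +1
    return w_q, scale

def dequantize(w_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_q * scale

w = torch.randn(4, 8)
w_q, s = quantize_ternary(w)
print(w_q.unique())                           # values drawn from {-1, 0, 1}
print((w - dequantize(w_q, s)).abs().mean())  # average quantization error
```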
Will representation fine-tuning unlock on-device personalization? → Parameter-efficient fine-tuning (e.g. via LoRA) is nothing new, but Stanford researchers believe a more targeted approach offers greater efficiency and adaptation.
● Inspired by model interpretability research, ReFT (Representation Fine-tuning) doesn’t alter the model’s weights. Instead, it manipulates the model’s internal representations at inference time to steer its behavior.
● While it comes with a slight inference-time penalty, ReFT requires 15-65x fewer parameters compared to weight-based fine-tuning methods.
● It also enables more selective interventions on specific layers and token positions, enabling fine-grained control over the adaptation process.
● The researchers show its potential in few-shot adaptation where a chat model is given a new persona with just five examples. Combined with the small storage footprint for learned interventions, it could be used for real-time personalization on devices with sufficient compute power (View Highlight)
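Conceptually, a ReFT-style intervention leaves the base weights frozen and instead edits hidden representations at chosen layers and token positions at inference time. The low-rank edit below is a simplified illustration of that idea (the paper's LoReFT variant learns a projection into and out of a low-dimensional subspace); it is not the authors' pyreft API.

```python
# Sketch of a representation-level intervention: a tiny trainable module edits the
# hidden state at selected token positions of one frozen transformer layer.
# Simplified LoReFT-style low-rank edit, for illustration only.
import torch
import torch.nn as nn

class LowRankIntervention(nn.Module):
    def __init__(self, hidden_dim: int, rank: int = 4):
        super().__init__()
        self.down = nn.Linear(hidden_dim, rank)  # project hidden state into a small subspace
        self.up = nn.Linear(rank, hidden_dim)    # map the learned edit back to model width

    def forward(self, hidden: torch.Tensor, positions: list) -> torch.Tensor:
        # hidden: [batch, seq, hidden_dim]; only the chosen token positions are edited.
        edited = hidden.clone()
        sub = hidden[:, positions, :]
        edited[:, positions, :] = sub + self.up(self.down(sub))
        return edited

# Usage idea: attach as a forward hook on one layer of a frozen model and train only
# the intervention's ~2 * rank * hidden_dim parameters, which is what makes storing
# many personas/adaptations on-device cheap.
```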
Models that combine attention and other mechanisms are able to maintain or even improve accuracy, while reducing computational costs and memory footprint.
● Selective state-space models like Mamba, designed last year to handle long sequences more efficiently, can to some extent compete with transformers, but lag on tasks that require copying or in-context learning. That said, Falcon’s Mamba 7B shows impressive benchmark performance versus similar-sized transformer models.
● Hybrid models appear to be a more promising direction. Combined with self-attention and MLP layers, AI21’s 8B Mamba-2-Hybrid outperforms the 8B Transformer across knowledge and reasoning benchmarks, while being up to 8x faster at generating tokens at inference time.
● In a nostalgia trip, there are early signs of a comeback for recurrent neural networks, which had fallen out of fashion due to training and scaling difficulties.
● Griffin, trained by Google DeepMind, mixes linear recurrences and local attention, holding its own against Llama-2 while being trained on 6x fewer tokens. (View Highlight)
By transferring knowledge from a larger, more powerful model, one could improve the performance of subquadratic models, allowing us to harness their efficiency on downstream tasks.
● MOHAWK is a new method for distilling knowledge from a large, pre-trained transformer model (teacher) to a smaller, subquadratic model (student) like a state-space model (SSM).
● It aligns i) the sequence transformation matrices of the student and teacher models and ii) the hidden states of each layer, then iii) transfers the remaining weights of the teacher model to the student model and fine-tunes it.
● The authors create Phi-Mamba, a new student model combining Mamba-2 and an MLP block and a variant called Hybrid-Phi-Mamba that retains some attention layers from the teacher model.
● MOHAWK can train Phi-Mamba and Hybrid-Phi-Mamba to achieve performance close to the teacher model. Phi-Mamba is distilled with only 3B tokens: less than 1% of the data used to train the previously best-performing Mamba models, and 2% of that used for the Phi-1.5 model itself. (View Highlight)
The transformer continues to reign supreme (for now) → Work with transformer alternatives and hybrid models is interesting, but at this stage remains niche. One paradigm still seems to rule them all. (View Highlight)
Synthetic data starts gaining more widespread adoption… → Last year’s report pointed to the divide of opinion around synthetic data, with some finding it useful and others fearing its potential to trigger model collapse by compounding errors. Opinion seems to be warming.
● As well as being the main source of training data for the Phi family, synthetic data was used by Anthropic when training Claude 3 to help represent scenarios that might have been missing in the training data.
● Hugging Face used Mixtral-8x7B Instruct to generate over 30M files and 25B tokens of synthetic textbooks, blog posts, and stories to recreate the Phi-1.5 training dataset, which they dubbed Cosmopedia.
● To make this process easier, NVIDIA released the Nemotron-4-340B family, a suite of models designed specifically for synthetic data generation, available via a permissive license. Meta’s Llama can also be used for synthetic data generation.
● It also appears possible to create synthetic high-quality instruction data by extracting it directly from an aligned LLM, with techniques like Magpie. Models fine-tuned this way sometimes perform comparably to Llama-3-8B-Instruct. (View Highlight)
Researchers are probing the quantity of synthetic data that triggers these kinds of outcomes and whether any mitigations work. ● A Nature paper from Oxford and Cambridge researchers found model collapse occurs across various AI architectures, including fine-tuned language models, challenging the idea that pre-training or periodic exposure to small amounts of original data can prevent degradation (measured by perplexity score).
● This creates a “first mover advantage”, as sustained access to diverse, human-generated data will become increasingly critical for maintaining model quality.
● However, these results are primarily focused on a scenario where real data is replaced with synthetic data over generations. In practice, real and synthetic data usually accumulate.
● Other research suggests that, provided the proportion of synthetic data doesn’t get too high, collapse can usually be avoided. (View Highlight)
Team Hugging Face built a 15T token dataset for LLM pre-training, using 96 CommonCrawl snapshots, which produces LLMs that outperform other open pre-training datasets. They also released an instruction manual.
● FineWeb, the dataset, was created through a multi-step process including base filtering, independent MinHash deduplication per dump, selected filters derived from the C4 dataset, and the team’s custom filters.
● The text extraction using the trafilatura library produced higher quality data than default CommonCrawl WET files, even though the resulting dataset was meaningfully smaller.
● They found deduplication drove performance improvements up to a point, after which returns diminished and further deduplication eventually worsened performance.
● The team also used llama-3-70b-instruct to annotate 500k samples from FineWeb, scoring each for educational quality on a scale from 0 to 5. FineWeb-edu, which filtered out samples scored below 3, outperformed FineWeb and all other open datasets, despite being significantly smaller. (View Highlight)
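The educational-quality filter amounts to scoring every document and keeping those at or above a threshold of 3. A schematic sketch, with a hypothetical `score_educational_quality` callable standing in for the small classifier the team trained on the LLM annotations:

```python
# Schematic FineWeb-edu-style filtering: score each document for educational
# quality on a 0-5 scale and keep those scoring >= 3. The scorer is a hypothetical
# stand-in for a classifier distilled from llama-3-70b-instruct annotations.

THRESHOLD = 3.0

def filter_edu(documents, score_educational_quality):
    kept = []
    for doc in documents:
        score = score_educational_quality(doc["text"])  # float in [0, 5]
        if score >= THRESHOLD:
            kept.append({**doc, "edu_score": score})
    return kept

# Toy usage with a dummy scorer (real pipelines run the classifier over billions
# of web documents in parallel).
docs = [{"id": 1, "text": "Photosynthesis converts light energy into chemical energy..."},
        {"id": 2, "text": "BUY NOW!!! limited offer, click here"}]
dummy_scorer = lambda text: 4.2 if "Photosynthesis" in text else 0.5
print([d["id"] for d in filter_edu(docs, dummy_scorer)])  # -> [1]
```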
While retrieval and embeddings are not new, growing interest in retrieval augmented generation (RAG) has prompted improvements in the quality of embedding models.
● Following the playbook that’s proven effective in regular LLMs, massive performance improvements have come from scale (GritLM has ~ 47B parameters vs the 110M common among prior embedding models).
● Similarly, the usage of broad web scale corpora and improved filtering methods have led to large improvements in the smaller models.
● Meanwhile, ColPali is a vision-language embedding model that exploits the visual structure of documents, not just their text embeddings, to improve retrieval.
● Retrieval models are one of the few subdomains where open models commonly outperform proprietary models from the biggest labs. On the MTEB Retrieval Leaderboard, OpenAI’s embedding model ranks 29th, while NVIDIA’s open NV-Embed-v2 is top. (View Highlight)
Traditional RAG solutions usually involve creating text snippets 256 tokens at a time with sliding windows (128 overlapping the prior chunk). This makes retrieval more efficient, but significantly less accurate.
● Anthropic solved this using ‘contextual embeddings’, where a prompt instructs the model to generate text explaining the context of each chunk in the document.
● They found that this approach leads to a reduction of top-20 retrieval failure rate of 35% (5.7% → 3.7%).
● It can then be scaled using Anthropic’s prompt caching.
● As Fernando Diaz of CMU observed in a recent thread, this is a great example of techniques pioneered on one area of AI research (e.g. early speech retrieval and document expansion work) being applied to another. Another version of “what is new, is old”.
● Research from Chroma shows that the choice of chunking strategy can affect retrieval performance by up to 9% in recall. (View Highlight)
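The standard chunking recipe above (256-token windows with 128 tokens of overlap) and the 'contextual' twist of prepending an LLM-written blurb situating each chunk in its document can both be sketched briefly. The `llm` and `detokenize` callables and the prompt wording are illustrative assumptions, not Anthropic's exact implementation.

```python
# Sliding-window chunking (256-token windows, 128-token overlap) plus a
# contextual-retrieval-style enrichment step: an LLM writes a short sentence
# situating each chunk within the whole document, which is prepended before
# embedding/indexing. `llm` and `detokenize` are hypothetical callables.

def chunk_tokens(tokens, size=256, overlap=128):
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

CONTEXT_PROMPT = (
    "<document>{doc}</document>\n"
    "Here is a chunk from the document:\n<chunk>{chunk}</chunk>\n"
    "Write one short sentence situating this chunk within the overall document, "
    "to improve search retrieval of the chunk."
)

def contextualize_chunks(document_text, token_chunks, llm, detokenize):
    enriched = []
    for chunk in token_chunks:
        chunk_text = detokenize(chunk)
        context = llm(CONTEXT_PROMPT.format(doc=document_text, chunk=chunk_text))
        enriched.append(f"{context}\n\n{chunk_text}")  # this string gets embedded
    return enriched
```

Prompt caching is what makes this affordable: the full document in the prompt is cached once and reused across all of its chunks.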
Many commonly used RAG benchmarks are repurposed retrieval or question answering datasets. They don’t effectively evaluate the accuracy of citations, the importance of each piece of text to the overall answer, or the impact of conflicting points of information.
● Researchers are now pioneering novel approaches, like Ragnarök, which introduces a novel web-based arena for human evaluation through pairwise system comparisons. This addresses the challenge of assessing RAG quality beyond traditional automated metrics.
● Meanwhile, Researchy Questions provides a large-scale collection of complex, multi-faceted questions that require in-depth research and analysis to answer, drawn from real user queries. (View Highlight)
Data curation is an essential part of effective pre-training, but is often done manually and inefficiently. This is both hard to scale and wasteful, especially for multimodal models.
● Usually, an entire dataset is processed upfront, which doesn’t account for how the relevance of a training example can change over the course of learning. These methods are frequently applied before training, so cannot adapt to changing needs during training.
● Google DeepMind’s JEST selects entire batches of data jointly, rather than individual examples independently. The selection is guided by a ‘learnability score’ (determined by a pre-trained reference model) which evaluates how useful it will be for training. It’s able to integrate data selection directly into the training process, making it dynamic and adaptive.
● JEST uses lower-resolution image processing for both data selection and part of the training, significantly reducing computational costs while maintaining performance benefits. (View Highlight)
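JEST's selection signal is a 'learnability' score: data that the current learner still finds hard but a pre-trained reference model finds easy gets prioritised. A heavily simplified per-example sketch (the actual method scores and samples whole batches jointly and uses low-resolution scoring, which this omits):

```python
# Simplified JEST-style data selection: learnability = learner loss - reference
# model loss. High scores flag examples the learner hasn't mastered yet but that a
# strong reference model considers learnable (i.e. not noise or junk).
import torch

@torch.no_grad()
def learnability_scores(batch, learner_loss_fn, reference_loss_fn):
    # Both callables return per-example losses of shape [super_batch_size].
    return learner_loss_fn(batch) - reference_loss_fn(batch)

def select_sub_batch(super_batch, learner_loss_fn, reference_loss_fn, keep_ratio=0.25):
    scores = learnability_scores(super_batch, learner_loss_fn, reference_loss_fn)
    k = max(1, int(keep_ratio * scores.numel()))
    top = torch.topk(scores, k).indices
    # Keep only the most learnable fraction of the super-batch for the training step.
    return {key: value[top] for key, value in super_batch.items()}
```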
Models produced by DeepSeek, 01.AI, Zhipu AI, and Alibaba have achieved strong spots on the LMSYS leaderboard, displaying particularly impressive results in math and coding.
● The strongest models from Chinese labs are competitive with the second-most powerful tier of frontier models produced by US labs, while challenging the SOTA on certain subtasks.
● These labs have prioritized computational efficiency to compensate for constraints around GPU access, learning to stretch their resources much further than their US peers.
● Chinese labs have different strengths. For example, DeepSeek has pioneered techniques like Multi-head Latent Attention to reduce memory requirements during inference and an enhanced MoE architecture.
● Meanwhile, 01.AI has focused less on architectural innovation and more on building a strong Chinese language dataset to compensate for its relative paucity in popular repositories like Common Crawl. (View Highlight)
And Chinese open source projects win fans around the world → To drive international uptake and evaluation, Chinese labs have become enthusiastic open source contributors. A few models have emerged as strong contenders in individual sub-domains.
● DeepSeek has emerged as a community favorite on coding tasks, with deepseek-coder-v2 praised for its combination of speed, lightness, and accuracy.
● Alibaba released the Qwen-2 family recently, and the community has been particularly impressed by its vision capabilities, ranging from challenging OCR tasks to its ability to analyse complex art work.
● At the smaller end, the NLP lab at Tsinghua University has funded OpenBMB, a project that has spawned the MiniCPM project.
● These are small <2.5B parameter models that can run on-device. Their 2.8B vision model is only marginally behind GPT-4V on some metrics, while their 8.5B Llama 3-based model surpasses it on others.
● Tsinghua University’s Knowledge Engineering Group has also created CogVideoX - one of the most capable text-to-video models. (View Highlight)
Moving on from diffusion models for text-to-image, Stability AI have continued to search for refinements that increase quality while bringing about greater efficiency.
● Adversarial diffusion distillation speeds up image generation by reducing the sampling steps needed to create high-quality images from potentially hundreds down to 1-4, while maintaining high fidelity.
● It combines adversarial training with score distillation: the model is trained just using a pre-trained diffusion model as a guide.
● As well as unlocking single-step generation, the authors focused on reducing computational complexity and improving sampling efficiency.
● Rectified flow improves upon traditional diffusion methods by connecting data and noise through a direct, straight line, rather than a curved path.
● They combined this with a novel transformer-based architecture for text-to-image that allows for a bidirectional flow of information between text and image components. This enhances the model’s ability to generate more accurate and coherent high-resolution images based on textual descriptions. (View Highlight)
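Rectified flow's 'straight line' is literal: training pairs a data sample and a noise sample along a linear interpolation and regresses the constant velocity between them. A minimal sketch of the training objective (text conditioning and the timestep-sampling tricks used in the actual model are omitted):

```python
# Minimal rectified-flow training loss: interpolate linearly between data x0 and
# noise x1, and train the network to predict the constant velocity (x1 - x0).
# Conditioning and weighted timestep sampling are omitted for brevity.
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0: torch.Tensor) -> torch.Tensor:
    x1 = torch.randn_like(x0)                                  # pure-noise endpoint
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
    xt = (1 - t) * x0 + t * x1                                 # straight-line interpolation
    v_target = x1 - x0                                         # constant velocity along the line
    v_pred = model(xt, t.flatten())                            # network predicts the velocity
    return F.mse_loss(v_pred, v_target)
```

Sampling then just integrates the predicted velocity from noise back towards data, which is part of why these models need fewer sampling steps than curved diffusion trajectories.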
Both Google DeepMind and OpenAI have given us sneak previews of highly powerful text-to-video diffusion models. But access remains heavily gated and neither has supplied much technical detail.
● OpenAI’s Sora is able to generate videos up to a minute long, while maintaining 3D consistency, object permanence, and high resolution. It uses spacetime patches, similar to the tokens used in transformer models, but for visual content, to learn efficiently from a vast dataset of videos.
● Sora was also trained on visual data in its native size and aspect ratio, removing the usual cropping and resizing that reduces quality.
● Google DeepMind’s Veo combines text and optional image prompts with a noisy compressed video input, processing them through encoders and a latent diffusion model to create a unique compressed video representation.
● The system then decodes this representation into a final high-resolution video.
● Also in the fight are Runway’s Gen-3 Alpha, Luma’s Dream Machine, and Kling by Kuaishou. (View Highlight)
In keeping with the gated approach of other labs, Meta has brought together its work on different modalities via the Make-A-Scene and Llama families to build Movie Gen.
● The core of Movie Gen is a 30B video generation model and a 13B audio generation model, capable of producing 16-second videos at 16 frames per second and 45-second audio clips respectively.
● These models leverage joint optimization techniques for text-to-image and text-to-video tasks, as well as novel audio extension methods for generating coherent audio for videos of arbitrary lengths.
● Movie Gen’s video editing capabilities combine advanced image editing techniques with video generation, allowing for both localized edits and global changes while preserving original content.
● The models were trained on a combination of licensed and publicly available datasets.
● Meta used A/B human evaluation comparisons to demonstrate positive net win rates against competing industry models across their four main capabilities. The researchers say they intend to make the model available in future, but don’t commit to a timeline or release strategy. (View Highlight)
In a sign that AI has truly come of age as both a scientific discipline and a tool to accelerate science, the Royal Swedish Academy of Sciences awarded Nobel Prizes to OG pioneers in deep learning, alongside the architects of its best-known application (so far) in science. The news was celebrated by the entire field. (View Highlight)
DeepMind and Isomorphic Labs released AlphaFold 3, the successor to AF2, which can now model how small-molecule drugs, DNA, RNA, and antibodies interact with protein targets.
● There were substantial and surprising algorithmic changes from AF2: all equivariance constraints were removed in favor of simplicity and scale, while the Structure Module was replaced with a diffusion model to build the 3D coordinates.
● Unsurprisingly, the researchers claim that AF3 performs exceptionally well in comparison to other methods (esp. for small molecule docking), although this was not compared to stronger baselines.
● Notably, no open-source code was made available (yet).
Several independent groups are working on reproducing the work openly. (View Highlight)
The decision to not release code for the AF3 publication was highly controversial, with many blaming Nature. Politics aside, there has been a race by start-ups and AI communities to make their model the go-to alternative.
● The first horse out of the gate was Baidu with their HelixFold3 model, which was comparable to AF3 for ligand binding. They provide a web server and their code is fully open-sourced for non-commercial use.
● Chai-1 from Chai Discovery (backed by OpenAI) recently released a molecular structure prediction model that has taken off in popularity due to its performance and high quality implementation. The web server is also available for commercial drug discovery use.
● We are still waiting for a fully open-sourced model with no restrictions (e.g. using outputs for training of other models). ● Will DeepMind fully release AF3 sooner if they begin to fear alternative models are becoming the community’s favourite? (View Highlight)
The secretive protein design team at DeepMind finally “came out of stealth” with their first model AlphaProteo, a generative model that is able to design sub-nanomolar protein binders with 3- to 300-fold better affinities.
● While few technical details were given, it seems it was built on top of AlphaFold3 and is likely a diffusion model. ‘Hotspots’ on the target epitope can also be specified.
● The model was able to design protein binders with 3- to 300-fold better binding affinities than previous works (e.g. RFDiffusion).
● The “dirty secret” of the protein design field is that the in silico filtering is just as (if not more) important than the generative modelling, with the paper suggesting that AF3-based scoring is key.
● They also use their confidence metrics to screen a large number of possible novel targets for which future protein binders could be designed. (View Highlight)
The Bitter Lesson: Equivariance is dead…long live equivariance! → Equivariance is the idea of giving a model the inductive biases to natively handle rotations, translations and (sometimes) reflections. It has been at the core of Geometric Deep Learning and biomolecular modelling research since AlphaFold 2. However, recent works by top labs have questioned the existing mantra.
● The first shots were fired by Apple, with a paper that obtained SOTA results on predicting the 3D structures of small molecules using a non-equivariant diffusion model with a transformer encoder.
● Remarkably, the authors showed that using the domain-agnostic model did not deleteriously impact generalization and was consistently able to outperform specialist models (assuming sufficient scale was used). ● Next was AlphaFold 3, which infamously dropped all the equivariance and frame constraints from the previous model in favour of another diffusion process coupled with augmentations and, of course, scale.
● Regardless, the greatly improved training efficiency of equivariant models means the practice is likely to stay for a while (at least for academic groups working on large systems such as proteins). (View Highlight)
Since 2019, Meta had been publishing transformer-based language models (Evolutionary Scale Models) trained on large-scale amino acid and protein databases. When Meta terminated these efforts in 2023, the team founded EvolutionaryScale. This year, they released ESM3, a frontier multimodal generative model that was trained over sequences, structures and functions of proteins rather than sequences alone.
● The model is a bidirectional transformer that fuses tokens that represent each of the three modalities as separate tracks into a single latent space.
● Unlike traditional masked language modelling, ESM3’s training process uses a variable masking schedule, exposing the model to diverse combinations of masked sequence, structure, and function. ESM3 learns to predict completions for any combination of modalities.
● ESM3 was prompted to generate new green fluorescent proteins (GFP) with low sequence similarity to known ones. (View Highlight)
The fundamental problem with research at the intersection of biology and ML is that there are very few people with the skills to both train a frontier model and give it a rigorous biological appraisal.
● Two works from late 2023, PoseCheck and PoseBusters, showed that ML models for molecule generation and protein-ligand docking gave structures (poses) with gross physical violations.
● Even the AlphaFold3 paper didn’t get away without a few bruises when a small start-up showed that using a slightly more advanced conventional docking pipeline beat AF3.
● A new industry consortium led by Valence Labs, including major pharma companies (e.g. Recursion, Relay, Merck, Novartis, J&J, Pfizer), is developing Polaris, a benchmarking platform for AI-driven drug discovery. Polaris will provide high-quality datasets, facilitate evaluations, and certify benchmarks.
● Meanwhile, Recursion’s work on perturbative map-building led them to create a new set of benchmarks and metrics. (View Highlight)
To determine the properties of physical materials and how they behave under reactions, it is necessary to run atomic-scale simulations that today rely on density functional theory. This method is powerful, but slow and computationally expensive. While faster, alternative approaches that calculate force fields (interatomic potentials) tend to have insufficient accuracy to be useful, particularly for reactive events and phase transitions.
● In 2022, equivariant message passing neural networks (MPNN) combined with efficient many-body messages (MACE) were introduced at NeurIPS.
● Now, the authors present MACE-MP-0, which uses the MACE architecture and is trained on the Materials Project Trajectory dataset, which contains millions of structures, energies, magnetic moments, forces and stresses.
● The model reduces the number of message passing layers to two by considering interactions involving four atoms simultaneously, and it only uses nonlinear activations in selective parts of the network.
● It is capable of molecular dynamics simulation across a wide variety of chemistries in the solid, liquid and gaseous phases. (View Highlight)
Deep learning, originally inspired by neuroscience, is now making inroads into modelling the brain itself. BrainLM is a foundation model built on 6,700 hours of human brain activity recordings generated by functional magnetic resonance imaging (fMRI), which detects changes in blood oxygenation. The model learns to reconstruct masked spatiotemporal brain activity sequences and, importantly, it can generalise to held-out distributions. This model can be fine-tuned to predict clinical variables, e.g. age, neuroticism, PTSD, and anxiety disorder scores, better than a graph convolutional model or an LSTM. (View Highlight)
Classical atmospheric simulation methods like numerical weather prediction are costly and unable to make use of diverse and often scarce atmospheric data modalities. But, foundation models are well suited here. Microsoft researchers created Aurora, a foundation model that produces forecasts for a wide range of atmospheric forecasting problems such as global air pollution and high-resolution medium-term weather patterns. It can also adapt to new tasks by making use of a general-purpose learned representation of atmospheric dynamics. (View Highlight)
Foundation models for the mind: reconstructing what you see → MindEye2 is a generative model that maps fMRI activity to a rich CLIP space, from which images of what the individual sees are reconstructed using a fine-tuned Stable Diffusion XL. The model is trained on the Natural Scenes Dataset, an fMRI dataset built from 8 subjects whose brain responses were captured over 30-40 hours of scanning sessions as they looked at hundreds of rich naturalistic stimuli from the COCO dataset, each shown for 3 seconds. (View Highlight)
Decoding speech from brain recordings with implantable microelectrodes could enable communication for patients with impaired speech. In a recent case, a 45-year-old man with amyotrophic lateral sclerosis (ALS) with tetraparesis and severe motor speech damage underwent surgery to implant microelectrodes into his brain. The arrays recorded neural activity as the patient spoke in both prompted and unstructured conversational settings. At first, cortical neural activity was decoded into a small vocabulary of 50 words with 99.6% accuracy by predicting the most likely English phoneme being attempted. Sequences of phonemes were combined into words using an RNN, before moving to a larger 125,000-word vocabulary enabled by further training. (View Highlight)
François Chollet, the creator of Keras, has partnered with Zapier co-founder Mike Knoop to launch the ARC prize, offering a $1M prize fund for teams that make significant progress on the ARC-AGI benchmark. ● Chollet created the benchmark back in 2019 as a means of measuring models’ ability to generalize, focusing on tasks that are easy for humans but hard for AI. The tasks require minimal prior knowledge and emphasise visual problem-solving and puzzle-like tasks to make it resistant to memorization.
● Historically, LLMs have performed poorly on the benchmark, with performance peaking at about 34%.
● Chollet is sceptical of LLMs’ ability to generalize to new problems outside of their training data and is hoping the prize will encourage new research directions that will lead to a more human-like form of intelligence.
● The highest score so far is 46% (short of the 85% target). It’s been achieved by the Minds AI team, who have used an LLM-based approach, employing active inference, fine-tuning the LLM on test task examples and expanding it with synthetic examples to improve performance. (View Highlight)
On novel tasks, where LLMs are unable to rely on memory and retrieval, performance often degrades. This suggests that they still often struggle to generalize beyond familiar patterns without external help.
● Even advanced LLMs like GPT-4 have difficulty reliably simulating state transitions in text-based games, especially for environment-driven changes. Their inability to consistently grasp causality, physics, and object permanence makes them poor world-modellers, even on relatively straightforward tasks.
● Researchers found that LLMs accurately predict direct action results, like a sink turning on, around 77% of the time, but struggle with environmental effects, such as water filling a cup in the sink, achieving only 50% accuracy for these indirect changes.
● Other research evaluated LLMs on planning domains, including Blocksworld and Logistics. GPT-4 produced executable plans 12% of the time. However, using iterative prompting with external verification, Blocksworld plans hit 82% accuracy and Logistics plans 70% accuracy after 15 rounds of feedback.
When re-run with o1, performance jumped but was still far from perfect. (View Highlight)
Researchers are exploring methods to generate stronger internal reasoning processes, variously targeting both training and inference. The latter approach appears to underpin OpenAI o1’s jump in capabilities.
● Quiet-STaR from a joint Stanford-Notbad AI team generates internal rationales during pre-training, using a parallel sampling algorithm and custom meta-tokens to mark the beginning and end of these “thoughts.” ● The approach employs a reinforcement learning-inspired technique to optimize the usefulness of generated rationales, rewarding those that improve the model’s ability to predict future tokens.
● Meanwhile, Google DeepMind have targeted inference, showing that for many types of problems, strategically applying more computation at test time can be more effective than using a much larger pre-trained model. ● A Stanford/Oxford team have also looked at scaling inference compute, finding that repeated sampling can significantly improve coverage. They suggest that using weaker and cheaper models with many attempts can outperform single attempts from their stronger and more expensive peers. (View Highlight)
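The repeated-sampling result is easy to picture: draw many candidates from a cheap model and count a problem as covered if at least one passes an automatic verifier. A toy sketch with hypothetical `generate` and `verify` callables (e.g. an LLM sampler and a unit-test runner):

```python
# Toy repeated-sampling loop: coverage@k = fraction of problems where at least one
# of k sampled candidates passes verification. `generate` and `verify` are
# hypothetical stand-ins for an LLM call and an automatic checker.

def coverage_at_k(problems, generate, verify, k=64):
    solved = 0
    for problem in problems:
        candidates = [generate(problem, temperature=0.8) for _ in range(k)]
        if any(verify(problem, candidate) for candidate in candidates):
            solved += 1
    return solved / len(problems)

# As k grows, a weak-but-cheap model's coverage can overtake a single sample from a
# stronger, pricier model - provided the task has a reliable automatic verifier.
```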
Open-endedness gathers momentum as a promising research direction → One path to improving the robustness of LLM reasoning is to embrace an open-ended approach such that they’re capable of generating new knowledge.
● In a position paper, a Google DeepMind team framed open-ended systems as able to “continuously generate artifacts that are novel and learnable to an observer”.
● They outline potential paths towards open-ended foundation models, including reinforcement learning, self-improvement, task generation, and evolutionary algorithms.
● On the self-improvement front, we saw STRATEGIST, a method for allowing LLMs to learn new skills for multi-agent games.
● The researchers used a bi-level tree search approach, combining high-level strategic learning with low-level simulated self-play for feedback. It outperformed RL and other LLM-based approaches on Game of Pure Strategy and The Resistance: Avalon at action planning and dialogue generation. (View Highlight)
After prolonged training beyond the point of overfitting (known as grokking), some researchers have argued that transformers learn to reason over parametric knowledge through composition and comparison tasks.
● Researchers at Ohio State University argued that a fully grokked transformer outperformed then-SOTA models like GPT-4-Turbo and Gemini-1.5-Pro on complex reasoning tasks with a large search space.
● They conducted mechanistic analyses to understand the internal workings of the models during grokking, revealing distinct generalizing circuits for different tasks.
● However, they found that while fully grokked models performed well on comparison tasks (e.g. comparing attributes based on atomic facts), they were less good at out-of-distribution generalization in composition tasks.
● This raises questions about whether these are really meaningful reasoning capabilities versus memorization by another name, although the researchers believe that enhancing the transformer with better cross-layer memory sharing could resolve this. (View Highlight)
For agents to be useful, they need to be robust to real-world stochasticity, which SOTA models have historically struggled with. We’re beginning to see signs of progress.
● DigiRL is a novel autonomous reinforcement learning approach for training in-the-wild device control agents specifically for Android devices. The method involves a two-stage process: offline reinforcement learning followed by offline-to-online reinforcement learning. (View Highlight)
To improve planning, approaches like MCTS, which helped to power AlphaGo, are slowly returning to the fore. Early results are promising, but will they be enough?
● MultiOn and Stanford combined an LLM with MCTS, along with a self-criticism mechanism and direct preference optimization, to learn from different success and failure criteria.
● They found this improved Llama-3 70B’s zero-shot performance from 18.6% to 81.7% in real-world booking scenarios, after a day of data collection, and up to 95.4% with online search.
● The longer-term question will be whether next-token prediction loss is too fine-grained.
● This risks limiting the ability of RL and MCTS to achieve agentic behavior by focusing too much on individual tokens and hindering the exploration of broader, more strategic solutions. (View Highlight)
One of the big bottlenecks for training RL agents is a shortage of training data. Standard approaches like converting pre-existing environments (e.g. Atari) or manually building them are labor-intensive and don’t scale.
● Genie (winner of a Best Paper award at ICML 2024) is a world model that can generate action-controllable virtual worlds. It analyzed 30,000 hours of video game footage from 2D platformer games, learning to compress the visual information and infer the actions that drive changes between frames.
● By learning a latent action space from video data, it can handle action representations without requiring explicit action labels, which distinguishes it from other world models.
● Genie is both able to imagine entirely new interactive scenes and demonstrate significant flexibility: it can take prompts in various forms, from text descriptions to hand-drawn sketches, and bring them to life as playable environments.
● This approach demonstrated applicability beyond games, with the team successfully applying the hyperparameters from the game model to robotics data, without fine tuning. (View Highlight)
New lab Sakana AI has been focused on attempting to enhance the creative capabilities of current frontier models. One of their first papers looks at using foundation models to automate research itself.
● The AI Scientist is an end-to-end framework designed to automate the generation of research ideas, implementation, and the production of research papers.
● After being given a starting template, it brainstorms novel research directions, before executing the experiments, and writing them up. The researchers claim their LLM-powered reviewer evaluates the generated papers with near-human accuracy.
● The researchers used it to generate example papers about diffusion, language modeling, and grokking. These were convincing at first glimpse, but contained some flaws on closer examination.
● Yet, the system periodically displayed signs of unsafe behavior, e.g. importing unfamiliar Python libraries and editing code to extend experiment timelines. (View Highlight)
Meta’s TestGen-LLM combines multiple LLMs, prompts and configurations to leverage different models’ strengths to improve unit testing coverage for Android code on Instagram and Facebook.
● It uses an “assured” approach, filtering generated tests to ensure they build successfully, pass reliably, and increase coverage before recommending them. This is the first large-scale industrial deployment of an approach that combines LLMs with verifiable guarantees of code improvement, addressing concerns about LLM hallucinations and reliability in a software engineering context.
● In deployment, TestGen-LLM improved about 10% of test classes it was applied to, with 73% of its recommendations accepted by developers. (View Highlight)
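The 'assured' part is essentially a chain of hard filters applied to every candidate test before a developer sees it. A schematic sketch with hypothetical `builds`, `passes_n_times`, and `coverage_of` helpers (Meta's internal tooling is not public):

```python
# Schematic of an "assured" test-generation filter: a candidate unit test is only
# recommended if it builds, passes repeatedly (i.e. is not flaky), and measurably
# increases coverage. The three helper callables are hypothetical stand-ins.

def assured_filter(candidate_tests, test_class, builds, passes_n_times, coverage_of):
    baseline_coverage = coverage_of(test_class, extra_tests=[])
    recommended = []
    for test in candidate_tests:
        if not builds(test_class, test):
            continue                                     # discard: doesn't compile
        if not passes_n_times(test_class, test, n=5):
            continue                                     # discard: failing or flaky
        if coverage_of(test_class, extra_tests=[test]) <= baseline_coverage:
            continue                                     # discard: adds no new coverage
        recommended.append(test)                         # safe to surface to a developer
    return recommended
```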
Wayve’s LINGO-2 is the second generation of its vision-language-action model, that, unlike its predecessor, can both generate real-time driving commentary and control a car, linking language explanations directly with decision-making and actions. Meanwhile, the company is using generative models to enhance its simulator with more real-world detail. PRISM-1 creates realistic 4D simulations of dynamic driving scenarios using only camera inputs. It enables more effective testing and training by accurately reconstructing complex urban environments, including moving elements like pedestrians, cyclists, and vehicles, without relying on LiDAR or 3D bounding boxes. (View Highlight)
Despite all eyes being on Gemini, the Google DeepMind team has steadily been increasing its robotics output, improving the efficiency, adaptability, and data collection of robots.
● The team created AutoRT, a system that uses a VLM for environmental understanding and an LLM to suggest a list of creative tasks the robot could carry out. These models are then combined with a robot control policy.
This helps to scale up deployment quickly in previously unseen environments.
● RT-Trajectory enhances robotic learning through video input. For each video in the dataset of demonstrations, a 2D sketch of the gripper performing the task is overlaid. This provides practical visual hints to the model as it learns.
● The team have also improved the efficiency of transformers. SARA-RT is a novel ‘up-training’ method to convert pre-trained or fine-tuned robotic policies from quadratic to linear attention, while maintaining quality.
● Researchers have found Gemini 1.5 Pro’s multimodal capabilities and long context window makes it an effective way of interacting with robots via natural language. (View Highlight)
Historically, robotics had significantly fewer open source datasets, tools, and libraries than other areas of AI - creating an artificially high barrier to entry. Hugging Face’s LeRobot aims to bridge the gap, hosting pre-trained models and datasets of human-collected demonstrations. And the community’s loving it. (View Highlight)
While consumer demand for the Vision Pro has been lacklustre so far, it’s taking robotics research by storm, where its high-res displays, advanced tracking, and processing power are being leveraged by researchers working on teleoperation - controlling robot movements and actions at a distance. Systems like Open-TeleVision and Bunny-Vision Pro use it to help enable precise control of multi-finger robotic hands (at a 3,000 mile distance in the case of the former), demonstrating improved performance on complex manipulation tasks compared to previous approaches. They address challenges such as real-time control, safety through collision avoidance, and effective bimanual coordination. (View Highlight)
Last year, a non-finetuned GPT-4 via one API call was highly competitive with Google’s Med-PaLM2 on certain medical knowledge benchmarks. Gemini has ridden to the rescue.
● The Med-Gemini family of multimodal models for medicine are finetuned from Gemini Pro 1.0 and 1.5 using various medical datasets and incorporate web search for up-to-date information. They achieved SOTA 91.1% accuracy on MedQA, surpassing GPT-4.
● For multimodal tasks (e.g. in radiology and pathology), Med-Gemini set a new SOTA on 5 out of 7 datasets.
● When quality errors in questions were fixed, model performance improved and it exhibited strong reasoning across other benchmarks. It also achieved high precision and recall in retrieving rare findings in lengthy EHRs - a challenging “needle-in-a-haystack” task.
● In a preliminary study, clinicians rated Med-Gemini’s outputs equal or better than human-written examples in most cases. (View Highlight)
Enterprise automation set to get an AI-first upgrade → Traditional Robotic Process Automation (RPA), embodied by UiPath, has struggled with high set-up costs, brittle execution, and burdensome maintenance. Two novel approaches, FlowMind (JP Morgan) and ECLAIR (Stanford), use foundation models to address these limitations. FlowMind focuses on financial workflows, using LLMs to generate executable workflows via APIs. In experiments on the NCEN-QA dataset, FlowMind achieved 99.5% accuracy in workflow understanding. ECLAIR takes a broader approach, using multimodal models to learn from demonstrations and interact directly with graphical user interfaces across various enterprise settings. On web navigation tasks, ECLAIR improved completion rates from 0% to 40%. (View Highlight)
As AI emerges as the new competitive battleground, big tech companies begin to hold more details of their work close to their chest. Frontier labs have meaningfully cut publication levels for the first time since this report began, while academia gets into gear. (View Highlight)
Amid growing demand for hardware to power GenAI workloads, every major lab depends on NVIDIA. Its market cap hit $3T in June, making it only the third US company to reach this milestone (following Microsoft and Apple). Following blowout earnings in Q2, its position looks as unassailable as ever. (View Highlight)
NVIDIA has already booked significant pre-sales on its new Blackwell family of GPUs and is making a strong play for governments.
● The new Blackwell B200 GPU and GB200 Superchip promise significant performance gains over the Hopper architecture of H100 fame. NVIDIA claims they can reduce cost and energy consumption by up to 25x versus an H100. In a mark of NVIDIA’s power, every major AI lab CEO provided a supporting quote in the press release.
● While the Blackwell architecture was delayed by manufacturing issues, the company is still confident of booking several billion in revenue from it by the end of the year.
● Jensen Huang, NVIDIA’s founder and CEO, is expanding the pitch, outlining the company’s vision of sovereign AI.
● He has argued that every government needs to build its own LLM to preserve its national heritage. (View Highlight)
AMD and Intel have started to invest in their software ecosystems, while AMD has made a heavy pitch to the open source community using ROCm (its CUDA competitor). However, they are yet to develop compelling alternatives to NVIDIA’s portfolio of networking solutions. AMD is hoping its planned $4.9B acquisition of server builder ZT Systems will change this. Meanwhile, Intel has seen its hardware sales decline. Short of regulatory intervention, a change in research paradigm or supply constraints, NVIDIA’s position seems unassailable. (View Highlight)
We looked at the $6B invested in AI chip challengers since 2016 and asked what would have happened if investors had just bought the equivalent amount of NVIDIA stock at that day’s price. The answer: that $6B would be worth $120B of NVIDIA stock today (20x!) vs. the $31B (5x) in its startup contenders. (View Highlight)
A vocal minority of analysts and commentators aren’t convinced. They point to the decline in GPU scarcity, how only a few companies are currently generating reliable revenue from AI-first offerings, and how even Big Tech’s infrastructure build-out is unlikely to be big enough to justify the company’s current valuation. The market is currently ignoring these voices and seems more inclined to agree with early Tesla investor James Anderson’s view that the company could be worth “double-digit trillions” in a decade. (View Highlight)
The real large-scale GPU cluster growth has come from H100s. The largest continues to be Meta’s 350k H100s, followed by xAI’s 100k cluster and Tesla’s 35k. Meanwhile, Lambda, Oracle and Google have been building large clusters summing to over 72k H100s. Companies including Poolside, Hugging Face, DeepL, Recursion, Photoroom and Magic have built over 20k-worth of H100 capacity. Moreover, the first GB200 clusters are going live (e.g. 10,752 at the Swiss National Supercomputing Centre), while OpenAI will have access to 300,000 by the end of next year. (View Highlight)
By last year’s count, NVIDIA was used 19x more than all of its peers combined in AI research papers (note the log-scale y-axis!). This year, this lead has compressed to 11x, due in part to the 522% growth in papers that use TPUs (gap is now 34x with NVIDIA). We also note the 353% growth in the use of Huawei’s Ascend 910, the 61% growth of large AI chip start-up contenders and the new appearance of Apple’s silicon. (View Highlight)
Usage of A100s continues to grow (+59% YoY) alongside the H100 (+477%) and the 4090 (+262%), albeit from a much lower base. The V100 (now 7 years old, -20%), continues to be used at half the rate of the A100 (now 4 years old), further demonstrating the longevity of NVIDIA systems for AI research. (View Highlight)
AI chip start-ups: Meanwhile in start-up land, Cerebras appears to be pulling ahead of the pack with 106% growth in the number of AI research papers that make use of its wafer-scale systems. Groq, which recently launched its LPU, saw its first usage in AI research papers last year. Meanwhile, Graphcore was acquired by SoftBank in mid-2024.
Unlike their common enemy, NVIDIA, these AI chip start-ups have mostly pivoted from selling systems to offering inference services on top of open models. (View Highlight)
Ever since the A100 launch in 2020, NVIDIA has been cutting down the time to ship its next datacenter GPU while significantly increasing the TFLOPs they deliver. In fact, timelines have come down by 60% from A100 to H100 and down a further 80% from H200 to GB200. During that time, TFLOPs have gone up 6x. Large cloud companies are buying huge amounts of these GB200 systems: Microsoft between 700k - 1.4M, Google 400k and AWS 360k. OpenAI is rumored to have at least 400k GB200 to itself. (View Highlight)
The speed of data communication between GPUs within a node (scale-up fabric), as well as between nodes (scale-out fabric), is critical to large-scale cluster performance. NVIDIA’s technology for the former, NVLink, has seen bandwidth per link, the number of links, and the total number of GPUs connected per node all increase significantly over the last 8 years. Coupled with their InfiniBand technology for connecting nodes into large-scale clusters, NVIDIA is ahead of the pack. Meanwhile, Chinese companies like Tencent have reportedly innovated around sanctions for similar outcomes. Their Xingmai 2.0 high-performance computing network, which is said to support over 100,000 GPUs in a single cluster, improves network communication efficiency by 60% and LLM training by 20%. That said, it is not clear whether Tencent possesses clusters of this size. (View Highlight)
On publishing their Llama 3 family of models, Meta shared a breakdown of the 8.6 job interruptions per day they experienced during a 54-day period of pre-training Llama 3 405B. GPUs tend to fail more frequently than CPUs, and not all clusters are created equal. Continuous monitoring is essential, misconfigurations and dead-on-arrival components happen too often due to insufficient testing, and low-cost power, affordable networking rates and availability are paramount. More on power needs in the Politics section! (View Highlight)
While big tech companies have long produced their own hardware, these efforts are accelerating as they seek to at least improve their bargaining power with NVIDIA - but these aren’t tackling the most challenging workloads. ● Known for its TPUs, Google has unveiled Axion, built on the Armv9 architecture and instruction set. These will be made available through Google Cloud for general-purpose workloads and achieve 30% better performance than the fastest general-purpose Arm-based instances currently available.
● Meta has unveiled the second generation of its in-house AI inference accelerator, which more than doubles the compute and memory bandwidth of its predecessor. The chip is currently used for ranking and recommendation algorithms, but Meta plans to expand its capabilities to cover training for generative AI.
● Meanwhile, OpenAI has been hiring from Google’s TPU team and is in talks with Broadcom about developing a new AI chip.
● Sam Altman has also reportedly been in talks with major investors, including the UAE government, about a multi-trillion dollar initiative to boost chip production. (View Highlight)
Riding the NVIDIA tidal wave, AI chip challengers are fighting for a slice of the (VC and customer) pie. ● Cerebras, known for its Wafer-Scale Engine, which integrates an entire supercomputer’s worth of compute onto one wafer-sized processor, has filed to IPO on $136M in revenue for H1 2024 (up 15.6x YoY), 87% of which came from a single customer. ● The company has raised over $700M, with customers in the compute-intensive energy and pharma sectors. It recently launched an inference service to serve LLMs with faster token generation.
● Meanwhile, Groq raised a $640M Series D at a $2.8B valuation for its Language Processing Unit, designed solely for AI inference tasks.
● So far, Groq has landed partnerships with Aramco, Samsung, Meta, and green compute provider Earth Wind & Power.
● Both companies are focusing on speed as a core differentiator and are working on cloud services, with Cerebras recently launching an inference service.
● This helps them swerve NVIDIA’s software ecosystem advantage, but gives them a new (challenging) competitor in the form of cloud services providers. (View Highlight)
While SoftBank starts to build its own chip empire (after prematurely selling NVIDIA). Known for betting big, SoftBank is entering the arena, tasking subsidiary Arm with launching its first AI chips in 2025 and acquiring struggling UK start-up Graphcore for a rumoured $600-700M. ● Arm is already a player in the AI world, but historically, its instruction set architecture has not been optimal for the large-scale parallel processing infrastructure required for datacenter training and inference. It’s also struggled against NVIDIA’s entrenched datacenter business and mature software ecosystem. ● With a current market cap of over $140B, markets aren’t bothered. The company is reportedly already in talks with TSMC and others about manufacturing.
● SoftBank also scooped up Graphcore, which pioneered Intelligent Processing Units, a processor designed to handle AI workloads more efficiently than GPUs and CPUs, using small volumes of data. Despite its sophistication, the hardware was often not a logical choice for genAI applications as they took off.
● The company will operate semi-autonomously under the Graphcore brand.
● Meanwhile, Softbank’s talks with Intel on designing a GPU challenger stalled after they were unable to agree on requirements. (View Highlight)
As US export controls widen, previously sanctions-compliant chips have found themselves on the wrong side of tougher performance thresholds. That hasn’t deterred chip manufacturers.
● In last year’s report, we documented how NVIDIA had booked over $1B in sales of the A800/H800 (their special China-compliant chips) to major Chinese AI labs. The US then banned sales to China, forcing a rethink. ● US Commerce Secretary Gina Raimondo has warned that “if you redesign a chip around a particular cutline that enables [China] to do AI I’m going to control it the very next day”. ● NVIDIA’s new China chip, the H20, is theoretically significantly weaker than top-line NVIDIA hardware if you measure by raw computing power. However, NVIDIA has optimised it for LLM inference workloads, where it is reportedly around 20% faster than the H100, and it is expected to generate $12B in sales.
● China, however, is proportionally becoming less important to US chip manufacturers: it has gone from representing 20% of NVIDIA’s data center business to “mid-single digits”, according to the company. (View Highlight)
While Chinese labs face restrictions in their ability to import hardware, there are currently no controls on their local affiliates renting access to it overseas. ByteDance rents access to NVIDIA H100s via Oracle in the US, while Alibaba and Tencent are reportedly in conversations with NVIDIA about setting up their own US-based datacenters. Meanwhile, Google and Microsoft have directly pitched big Chinese firms on their cloud offerings.
The US is planning to make hyperscalers report this kind of usage via a KYC scheme, but is yet to draw up plans to prohibit it. (View Highlight)
Many of the buzziest start-ups working on generative AI are raising at record, often three-digit, revenue multiples. While these might indicate investor confidence in future returns, they set a high bar, as many of these companies currently have no identified path to profitability. However, this isn’t true for everyone, as the biggest model providers see revenue begin to ramp up. (View Highlight)
OpenAI is on course to see revenues triple in the space of a year, but training, inference, and staffing costs mean losses are continuing to mount. They’re not the only leader in search of functional economics. (View Highlight)
Meta has produced an incredible vibe shift in public markets by ditching its substantial metaverse investments and pivoting hard into open source AI with its Llama models. Mark Zuckerberg is, arguably, the de facto messiah of open source AI, counterpositioning against OpenAI, Anthropic, and Google DeepMind. (View Highlight)
Over the summer, Anthropic and then Vercel launched the capability for their chat agents Claude and V0 to open coding environments in which code is written and run in the browser to solve a user’s request. This brings previously static code snippets to life, enabling users to iterate with the agent in real time, and to reduce the barrier for creating software products. Needless to say, social media GenAI fans loved this! Below are examples of Claude Artifacts and V0 generating a playable Minesweeper game from a single prompt. (View Highlight)
The most successful technology companies, like Apple, Google, or TikTok, have taken a product-first approach, rather than simply building a foundational technology and an API. As base model performance converges, OpenAI, Anthropic, and Meta are visibly putting more thought into what their ‘product’ looks and feels like - whether it’s Claude’s Artifacts, OpenAI’s Advanced Voice functionality, or Meta’s hardware partnerships and lip-syncing tools. Simply building a good model won’t be all you need. (View Highlight)
In last year’s report, we touched on Databricks and Mosaic’s combined LLM strategy, which focused on fine-tuning models on customers’ data. Is the ‘bring your own model’ era over?
● The Mosaic research team, now folded into Databricks, open-sourced DBRX in March. A 132B MoE model, DBRX was trained on just over 3,000 NVIDIA GPUs at a cost of $10M. Databricks is pitching the model as a foundation for enterprises to build on and customize, while remaining in control of their own data.
● Meanwhile, Snowflake’s Arctic is pitched as the most efficient model for enterprise workflows, based on a set of metrics covering tasks including coding and instruction following.
● It’s unclear how much enterprises are willing to invest in costly custom model tuning, given the constant set of releases and improvements driven by bigger players.
● With readily available open source frontier models, the appeal of training custom models is increasingly unclear. (View Highlight)
Given the high compute costs involved, model builders increasingly rely on partnership arrangements with established Big Tech companies. Antitrust regulators worry that this will further entrench incumbents.
● Regulators have particularly zeroed in on the close relationship between OpenAI and Microsoft, along with Anthropic’s ties to Google and Amazon.
● Regulators fear that big tech companies are either essentially buying out competition or providing friendly service provision deals to companies that they’ve invested in - potentially disadvantaging competitors.
● They’re particularly nervous about the influence NVIDIA wields over the ecosystem and its decision to make direct investments.
France is contemplating NVIDIA-specific charges.
● Big Tech companies are attempting to place some clear blue water between themselves and start-ups, with Microsoft and Apple both voluntarily surrendering their OpenAI board observer seats. (View Highlight)
Regulatory action can only do so much to shape a market when economic logic dictates otherwise. Given the converging performance of many of ‘the rest’ and these companies’ high capex needs, consolidation is unsurprising. Given some of the regulatory hurdles, we’ve seen the rise of pseudo-acquisitions, where a Big Tech company i) hires the founders and much of the team of a start-up; ii) the start-up exits the model-building game to focus on its enterprise offer; iii) investors are paid out via a licensing agreement. This model has been used by Microsoft with Inflection and Amazon with Adept. However, regulators on both sides of the Atlantic have become wise to the move and are beginning to scrutinize these arrangements. (View Highlight)
By far the most widely-used AI-powered developer tool, Copilot adoption is growing 180% year-over-year and its annual revenue run rate is now $2B (double its 2022 figure). Copilot (40% of Github revenue) alone is now a bigger business than Github was when Microsoft acquired it. However, it’s just one of a number of coding companies, some of which are raising blockbuster rounds. (View Highlight)
In a now familiar cycle, we’re seeing specialist tools and frameworks gain popularity before struggling to scale and enter production, while incumbents demonstrate impressive resilience and adaptability.
● Following the explosive growth of vector databases, the uniqueness of searching in vector space has worn off.
Existing database providers have launched their own vector search methods.
● Hyperscalers like AWS, Azure, and Google Cloud have expanded their native DB offerings to support vector search and retrieval at scale, while data clouds like MongoDB, Snowflake, Databricks and Confluent are seeking to capture RAG workloads from their existing customer base.
● Core vector DB providers like Pinecone and Weaviate now support traditional keyword search (in the vein of Elasticsearch and OpenSearch), along with simple and efficient filtering and clustering - a minimal sketch of this hybrid pattern follows below. ● Over in framework land, the likes of LangChain and LlamaIndex achieved popularity for experimentation, but their high-level abstractions and limited flexibility have been called out as a source of friction by some developers as their needs become more sophisticated. (View Highlight)
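To make that hybrid pattern concrete, here is a minimal, vendor-agnostic sketch that blends dense vector similarity with keyword overlap and a metadata filter. The `embed` and `keyword_score` functions are illustrative stand-ins (a real system would call an embedding model and a BM25 index), and all names and weights are assumptions rather than any provider’s API.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in embedding: deterministic random vector keyed on the text.
    # In practice this would call an embedding model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

def keyword_score(query: str, doc: str) -> float:
    # Crude token overlap as a stand-in for BM25-style lexical scoring.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def hybrid_search(query, docs, alpha=0.7, metadata_filter=None, k=3):
    """Blend dense cosine similarity with lexical overlap, after metadata filtering."""
    qv = embed(query)
    scored = []
    for doc in docs:
        if metadata_filter and not metadata_filter(doc["meta"]):
            continue  # simple pre-filter on metadata, as vector DBs now support
        dense = float(np.dot(qv, embed(doc["text"])))
        lexical = keyword_score(query, doc["text"])
        scored.append((alpha * dense + (1 - alpha) * lexical, doc))
    return [d for _, d in sorted(scored, key=lambda s: s[0], reverse=True)[:k]]

docs = [
    {"text": "Quarterly revenue grew 20% on GPU demand", "meta": {"year": 2024}},
    {"text": "How to fine-tune a small language model", "meta": {"year": 2023}},
]
print(hybrid_search("GPU revenue growth", docs, metadata_filter=lambda m: m["year"] == 2024))
```

The weighting parameter alpha is the usual knob: pushing it towards 1 favours semantic recall, pushing it towards 0 favours exact keyword matches.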
Are AI agents going commercial? While H is being cagey about the specifics of its work, its early team contained experts in reinforcement learning and multi-agent systems. Other agentic efforts are already up and running.
● Devin, launched by Cognition, made a splash in March. Pitched as “the first AI software engineer”, it is meant to plan and execute tasks requiring thousands of decisions, while fixing mistakes and learning over time.
● The product itself split users, attracting fans, as well as detractors who point to the need for guardrails and manual intervention.
Either way, investors are impressed, and within six months of launch, the company secured a $2B valuation.
● Devin has an open source competitor in OpenDevin, which beat the proprietary Devin on SWE-bench by 13 percentage points.
● MultiOn is also betting big on RL, with its autonomous web agent - Agent Q (see slide 65) - combining search, self-critique, and RL. It will be made available to users later this year.
● Meta’s TestGen-LLM has gone from paper to product at breakneck speed (4 months), being integrated into Qodo’s Cover-Agent. (View Highlight)
With $165M raised, Perplexity has emerged as the buzziest AI-first search challenger, while Google is rolling out its own search summaries. Both companies are finding that the output is only as good as the information it draws on. ● Within 18 months of being founded, Perplexity hit a $1B valuation, with rumours that it is already looking to potentially triple it. The LLM analyzes user input, sourcing responses either via a web search or from its knowledge base, before producing a summary with in-line citations.
● Google has rolled out summary boxes to illustrate the potential of Gemini to power up its standard offering.
● Both services, however, have struggled with reliability issues. Gemini was found to be using satirical Reddit posts as advice sources (e.g. advising users to consume a rock a day), while Perplexity struggles with the same hallucination issues that hit other LLM-powered services.
● OpenAI has started trialling a prototype search function - SearchGPT - which will eventually be integrated into ChatGPT. While we don’t know technical specifics yet, promotional imagery suggests a Perplexity-esque user experience. (View Highlight)
While copyright concerns are nothing new in generative AI, 2024 saw model builders come under greater scrutiny from media organizations, record labels, and content creators.
● OpenAI and Google are negotiating with major media organizations, hoping that licensing arrangements will take the sting out of criticism. In a similar vein, Eleven Labs has started a voice actor program.
● Some start-ups are swerving this altogether and are embracing ethical certification schemes. The best-known is Fairly Trained, started by ex-Stability AI executive Ed Newton-Rex.
● At the other end of the spectrum, Meta and Perplexity have doubled down on ‘fair use’ arguments and have demonstrated little appetite for compromise with their critics.
● As labs approach the data ceiling, YouTube scraping is in the spotlight.
● OpenAI reportedly transcribed millions of hours of YouTube videos to power its audio transcription model. Meanwhile, Eleuther AI’s widely-used Pile dataset contains subtitles from 173,536 YouTube videos. Internal documents from both RunwayML and NVIDIA showed they mass scraped YouTube. (View Highlight)
The central question about whether creators’ copyright has been infringed by model builders via the use of their work for training remains unresolved, but more expansive arguments have been shot down in the courts.
● Cases continue against Anthropic, OpenAI, Meta, Midjourney, Runway, Udio, Suno, Stability and others from news outlets, image suppliers, authors, creative artists, and record labels.
● So far, model builders have failed to get any of these cases dismissed in full, but have managed to narrow their scope significantly.
● For example, claims from two groups of authors against OpenAI and Meta arguing that the companies were guilty of vicarious copyright infringement because all of their models’ outputs are “infringing derivative work” failed, because they were unable to demonstrate “substantial similarity”. Only their original claims on the ground of copyright infringement have been allowed to proceed.
● A similar pruning happened with the cases against Midjourney, Runway, and Stability with plaintiffs told to focus on the original scraping, with many of their wider claims dismissed.
● Amid this uncertainty, Adobe, Google, Microsoft, and OpenAI have taken the unusual step of indemnifying their customers against any legal claims they might face on copyright grounds. (View Highlight)
Last year, a Cruise vehicle struck a pedestrian in San Francisco. The company lost its licence to operate in California and saw significant leadership turnover. General Motors, Cruise’s historically distant parent, has pumped $850M into the company after previously cutting 25% of the workforce and halting market expansion.
The company has resumed testing in Phoenix (with a human in the vehicle) and GM plans to seek external investment. Despite this additional runway, existential questions still loom over the company, signalling the high standards companies operating in the space are held to. (View Highlight)
Humanoid start-ups like Figure, Sanctuary, and 1X have raised close to a billion dollars from corporate investors, including Samsung, Microsoft, Intel, OpenAI, and NVIDIA. Can the tech overcome its limitations?
● Replicating the complexity of human motion and engineering human-like dexterity has historically proven to be an expensive and technically difficult endeavor.
● Start-ups are betting that sophisticated VLMs, real-world training data and simulation, along with better hardware can change this.
● Avid SOAI readers, however, will be familiar with the story of self-driving - where breakthroughs were promised every year, before companies undershot for half a decade.
● Customers must also be convinced that humanoids are more efficient than cheaper, non-humanlike industrial robot systems.
● The appetite for non-humanoid robotics start-ups remains healthy, despite Amazon’s recent pseudo-acquisition of Covariant, a Bay Area robotics foundation model builder. (View Highlight)
Visual effects are an expensive and labor-intensive business, so Hollywood producers have been slowly trying to integrate generative AI, amid a backlash from artists and animators. While much of this work has been done quietly and in post-production, eagle-eyed viewers have spotted clear signs of gen AI-related mishaps in the background of HBO and Netflix productions. This ties back to long-standing issues around models’ ability to represent physics and geometry accurately and consistently. Our prediction never said the output would be good. (View Highlight)
In the first deal of its kind, Runway has struck a partnership with film and games studio Lionsgate (famous for films like John Wick, Twilight, and the Hunger Games franchise). Runway will train a new generative model on Lionsgate’s catalogue of 20,000 titles, while Lionsgate said that it would use Runway’s models to support “capital-efficient content creation opportunities”. Financial details remain unclear at this stage, but we know that Lionsgate will initially use the model for storyboarding, before deploying it for the creation of visual effects. (View Highlight)
Due to a combination of scientific disagreement, commercial pressures, personality clashes, and the availability of capital, small bands of researchers have broken away from the biggest labs, indicating a deepening ecosystem.
● Japan-based Sakana AI, co-founded by Llion Jones, famously the only author of Attention Is All You Need not to have left Google, and David Ha, emerged from stealth with $30M and three models based on the evolution-inspired approach of ‘model merging’, where existing models are combined and the most promising become ‘parents’ to the next generation. ● Paris-based H Company, led by a team of experienced DeepMinders, raised a $220M round to build action models for RPA.
● Following board drama at OpenAI (more on this later), co-founder Ilya Sutskever left to start Safe Superintelligence Inc. a lab focused on building safe AGI with zero short-term commercial pressures or goals.
● Most recently, a number of the original Stable Diffusion creators launched Black Forest Labs to focus on image and video generation.
They’ve already released FLUX.1, their first family of open source image models, which has rapidly begun to contend with Midjourney on quality. (View Highlight)
…but entrepreneurship is hard. Being a great engineer isn’t always a sign you’ll be a great founder. Some former staffers at labs have experienced early success, others … less so. Safe Sign Technologies, founded by a former solicitor and an ex-DeepMind researcher, went through an acquisition without the founding team having to dilute to external investors. At the other end of the spectrum, the ex-DeepMind founding team over at H Company couldn’t get to launch without disintegrating, even with over $200M in the bank. (View Highlight)
ElevenLabs, the market leader in text-to-speech (TTS) hit unicorn status at the start of the year, with a $1.1B valuation. With the big labs approaching the space tentatively, it has much of the field to itself.
● Alongside its flagship TTS product, the company has expanded into dubbing in foreign languages, voice isolation, and has previewed an early text-to-music model. Likely seeking to avoid a copyright blow-up, the company has opted not to release it immediately, but has provided an API for sound effect generation.
● 62% of Fortune 500 companies now have at least one employee using ElevenLabs.
● Meanwhile, the frontier labs have approached the space with caution, likely out of concern that misuse of voice generation capabilities could result in a potential backlash.
● GPT-4o’s voice outputs have been restricted to preset voices for general release, while OpenAI has said it is yet to make a decision on whether it will ever make its Voice Engine (which can allegedly recreate a voice based on a 15-second recording) widely available.
● Meanwhile, Cartesia is betting on state space models for efficient TTS. (View Highlight)
GenAI applications continue to see fast growth. Avatar video generation product Synthesia continues to grow exponentially across enterprise, small businesses and creators. Once considered to be “fringe”, Synthesia is now used by the majority of the Fortune 100 for learning and development, marketing, sales enablement, information security and customer service. Over 24M videos have been generated with the service since its launch in 2020, 2.5x more than last year. (View Highlight)
AI-first products begin to demonstrate their stickiness in enterprise… In last year’s report, we charted how GenAI products were struggling to retain paying customers beyond their initial ‘wow’ effect and trial periods. New data from US corporate fintech Ramp suggests that both spend and retention are beginning to improve significantly from the 2022 to 2023 cohorts. Top performers include OpenAI, Grammarly, Anthropic, Midjourney, Otter, and ElevenLabs. (View Highlight)
Analysis of the 100 highest revenue grossing AI companies using Stripe reveals that, as a group, they are generating revenue at a much faster pace than previous waves of equivalently well-performing SaaS companies.
Strikingly, the average AI company that has reached $30M+ annualised revenue took just 20 months to get there, compared to 65 months for equally promising SaaS companies. (View Highlight)
While text-to-speech benefits from a ‘wow effect’, speech recognition has the potential to automate away mundane tasks at scale. Investors are beginning to take note.
● A string of start-ups working to apply speech recognition to a range of use cases, including customer support and call centers, have scored funding rounds in the last year or so, including Assembly AI ($50M), Deepgram ($72M), PolyAI ($50M), and Parloa ($66M). PolyAI’s revenue is set to triple this year.
● These start-ups are focused on plugging shortages of call center staff and allowing for more natural speech from customers, including corrections, hesitation, interruption, and topic changes - areas traditional automated systems have struggled with.
● While AI-powered transcription and audio analysis isn’t new, accuracy is improving as a result of larger datasets and transformer models.
● For example, Assembly AI has built Universal-1, a multilingual model trained on 12.5M hours of speech that runs faster, with less compute, fewer errors and better ambient noise handling than OpenAI’s Whisper. (View Highlight)
For more than a decade, Alexa and Siri have delivered royally underwhelming consumer voice agent experiences. The launch of OpenAI’s GPT-4o and Paris-based Kyutai’s Moshi voice agents crosses the uncanny valley. Both systems think and speak at the same time to ensure maximum flow between speaker and agent. OpenAI showed how two phones running GPT-4o could hold a compelling voice conversation with one another. Moshi’s inference speed was impressive and borderline too fast, producing occasionally jarring interruptions to the human speaker if they paused too long. Google’s NotebookLM’s ability to generate conversational podcasts based on research is also winning fans. More recently, Hugging Face implemented a speech-to-speech pipeline with voice activity detection, speech-to-text, an LLM and text-to-speech (sketched below). (View Highlight)
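As a rough illustration of that pipeline shape (voice activity detection → speech-to-text → LLM → text-to-speech), here is a sketch in which every stage is a hypothetical stub rather than the actual Hugging Face implementation; real systems stream audio through each stage to keep latency low.

```python
from dataclasses import dataclass

@dataclass
class AudioChunk:
    samples: list  # raw PCM samples (placeholder)

# The four stages below are hypothetical stubs standing in for real models.
def voice_activity_detect(stream):
    """Yield only the chunks that contain speech energy."""
    for chunk in stream:
        if sum(abs(s) for s in chunk.samples) > 1.0:
            yield chunk

def speech_to_text(chunk: AudioChunk) -> str:
    return "transcribed user utterance"      # e.g. a Whisper-class model

def llm_reply(text: str) -> str:
    return f"reply to: {text}"               # e.g. a small instruction-tuned LLM

def text_to_speech(text: str) -> AudioChunk:
    return AudioChunk(samples=[0.0])         # e.g. a streaming TTS model

def speech_to_speech(stream):
    """Chain the stages; production systems stream each stage to minimise latency."""
    for chunk in voice_activity_detect(stream):
        yield text_to_speech(llm_reply(speech_to_text(chunk)))

# Usage: list(speech_to_speech([AudioChunk([0.5, 0.9]), AudioChunk([0.0])]))
```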
Given Apple is publishing work on foundation models that will power Apple Intelligence features, it’s reasonable to ask how long-lasting or deep any OpenAI partnership is likely to be.
● Apple has kept up a steady tempo of research publications and has released a series of highly capable smaller open models with a focus on on-device inference.
● In July, they released a paper documenting the models that will power Apple Intelligence features.
● The server and smaller on-device versions of the model demonstrate competitive performance in instruction following, tool use, writing, and math.
● The on-device 3B model outperforms Gemma-7B and Mistral-7B in human evaluations.
● Apple argues this is a sign that data quality is a far more important determinant of performance than data quantity. Pre-training included web pages, math, code, and certain licensed datasets.
● They’re also investing in the MLX array framework for AI research on Apple silicon. (View Highlight)
● Unsloth, since launching at the end of last year, has quickly emerged as a popular open source project, offering up to 30x faster training and fine-tuning by leveraging GPU kernel improvements (a plain LoRA sketch follows after this list).
● The focus is on optimizing the attention mechanism when using LoRA for efficient fine-tuning. Unsloth manually derives gradients for 6 matrix operations, related to LoRA and attention inputs.
● By carefully arranging the order of matrix multiplications and using in-place operations, it’s possible to significantly boost speed and memory efficiency.
● These optimizations are applied across all model components, not just the attention mechanism. (View Highlight)
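For readers unfamiliar with the LoRA setup these optimizations target, below is a minimal plain-PyTorch sketch of a low-rank adapter wrapped around a frozen linear layer. The rank, scaling, and layer sizes are arbitrary illustrations, and this is not Unsloth’s fused implementation - just the structure whose gradients and matrix-multiplication order such kernels optimize.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update B @ A (rank r)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)           # only the adapters receive gradients
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Base projection plus scaled low-rank correction.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(512, 512))
x = torch.randn(4, 512)
loss = layer(x).pow(2).mean()
loss.backward()                               # gradients flow only into A and B
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # adapter params only
```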
Low-Rank Adaptation is a method to fine-tune large models such that their generations improve along aspects that the user cares about, such as characters, styles or concepts. Platforms such as Civit.ai make it easy for users to train LoRAs using their own training examples. These LoRAs are shared on a marketplace for anyone to use.
Moreover, a popular workflow is to use the output of a LoRA model to condition the generation of a few second video using products like Runway that allow users to set the start and end image frame. It’s surely a matter of time before generative audio is added to the mix! (View Highlight)
Google famously launched their smart glasses in 2014 just as deep learning-based computer vision research was starting to show promise and a few years before the augmented reality hype really started peaking. The product flopped and was pulled in 2015. Meanwhile in 2020, Meta started collaborating with popular sunglasses brand Ray-Ban to develop smart glasses. The first version was released in 2021 and the second, with enhanced audio capabilities and an integration to Meta AI, launched in 2023 for $299. It has become a hit. While sales numbers aren’t shared, Zuckerberg stated that many styles and colors sold out. It’s likely that the form factor, quality audio and changing opinions towards privacy contributed to this change of fate. (View Highlight)
Less successful have been attempts to build AI-powered gadgets designed to act as assistants. The two most famous are the Rabbit R1 and the Humane AI Pin. These gadgets combine standard voice assistant capabilities with other features, including a camera, image analysis, and language translation. Early reviews have been near universally negative, with common complaints including unreliability, poor battery life, and a lack of useful features. While reviewers often believed there was a world in which these devices could be useful, they complained customers were paying high sums ($699 for the Pin, $199 for the R1) to beta test products that weren’t ready for the market. (View Highlight)
Attention is all you need… to build, raise billions for, and sell your AI start-up. Noam Shazeer of Character.ai sold his team back to Google for $2.5B, while Adept was acqui-hired into Amazon and Inflection into Microsoft for $650M. These deals all involved hiring founders and star employees while paying enough money to investors as a technology licensing fee to get the deals through. (View Highlight)
The US introduces limited frontier model rules via executive order… After securing voluntary commitments from the big labs in July 2023, the White House decided to make them binding, with Joe Biden signing an executive order on frontier model regulation in October that year.
● Executive Order 14110 was primarily directed at government agencies. Measures include mandating the development of cybersecurity standards, requiring federal agencies to publish AI use policies, directing agencies to address AI-related critical infrastructure risks, and commissioning a labor market study.
● Most notably, the EO mandated labs to notify the federal government and share the results of safety tests before the public deployment, if the model used more than 10^26 FLOPS of computing power in training (slightly more than GPT-4 and Gemini Ultra).
● It also set out additional requirements for companies working on the use of AI for biological synthesis.
● The crucial downside of executive orders is that they can be revoked at the stroke of a pen. The Republican platform for the coming presidential election commits to doing exactly that. (View Highlight)
…while states pursue their own, more controversial, rules. With little prospect of bipartisan consensus emerging around broader federal AI regulation, states are pursuing their own AI laws, most notably California with SB 1047.
● Bills so far tend to be focused on the disclosure of AI usage, reporting for certain high-risk use cases, and consumer opt-outs. For example, Colorado’s state legislature passed legislation that includes reporting requirements for high-risk systems and creates a reporting mechanism for algorithmic discrimination risks.
● However, the most comprehensive and controversial has been California’s SB 1047. Sponsored by the existential risk org the Center for AI Safety, the bill creates a safety and liability regime for foundation models.
● The original draft of the bill spooked industry, with an unconventional method of determining in-scope models*, burdensome reporting and compliance procedures (with accompanying criminal penalties for perjury), and a new government body to oversee frontier models.
● Following pushback by tech companies, VCs, and prominent state Democrats, the bill was significantly amended.
While Anthropic and Elon Musk supported the amended version, OpenAI, Meta, and a trade group representing Big Tech remained opposed.
● Governor Gavin Newsom vetoed the bill, arguing that it risked giving “the public a false sense of security” while “curtailing the very innovation that fuels advancement in favor of the public good”. (View Highlight)
In March, the European Parliament passed the AI Act after an intensive Franco-German influence campaign to weaken certain provisions. Questions about implementation, however, remain unanswered.
● With the passage of the act, Europe is now the first bloc in the world to adopt a full-scale regulatory framework for AI. Enforcement will be rolled out in stages, with the ban on “unacceptable risk” uses (e.g. deception, social scoring) coming into force in February 2025.
● France and Germany managed to secure changes that tiered the foundation model regulations, with a basic set of rules applying to all models and additional regulations for those being deployed in a sensitive environment.
● The proposed full ban on facial recognition has been watered down to allow its use by law enforcement.
● While industry is concerned about the law, the months of consultation and large amount of secondary legislation required means it still has time to shape the specifics of implementation if it engages constructively. (View Highlight)
A combination of the EU AI Act and long-standing GDPR requirements around privacy and data transfers has left US labs struggling to adapt their services. Anthropic’s Claude wasn’t accessible to European users until May 2024, while Meta won’t offer multimodal models to European customers. Meanwhile, Apple is rebelling against the EU’s Digital Markets Act, claiming that its interoperability requirements are incompatible with its positions on privacy and security. As a result, it is delaying the European launch of Apple Intelligence. (View Highlight)
As model builders search for more data to meet their insatiable appetites, opt-out policies are coming under scrutiny.
● Under questioning from Australian lawmakers, Meta’s global privacy director admitted that the company automatically scraped posts for model training going back to 2007, provided users had not explicitly marked them as private.
● Users in the EU have been granted a global opt-out option following regulatory pressure. The company has confirmed that it will not offer this to users elsewhere unless compelled to do so by local regulators.
● The UK’s Information Commissioner’s Office asked Meta to pause in June, but after the company gave users a window in which to object, the regulator allowed it to proceed.
● Meta aren’t alone. X has stopped using European users’ public posts following a court battle, while the Irish Data Protection Commission is now investigating Alphabet’s use of user data to train Gemini. (View Highlight)
The new UK Labour Government has signalled that it intends to break with its predecessor’s approach of only regulating AI via existing legislation, but only subtly.
● At the November Bletchley Summit (more on that later), AWS, Anthropic, Google, Google DeepMind, Inflection AI, Meta, Microsoft, Mistral AI and OpenAI voluntarily agreed to ‘deepen’ the access that they provided to the UK Government.
● Anthropic has given the UK AISI pre-deployment access to Claude Sonnet 3.5, while Google DeepMind made some of the Gemini family available.
● The new UK Government has signalled that it will pass legislation codifying these previously voluntary commitments, but suggested that it will not pursue broader regulation, implicitly ruling out an EU-style approach.
● Observers previously thought this legislation would be published immediately, but the timeline has lengthened, as the government pursues a consultation process in the face of industry pushback.
● This follows on from an industry consultation that the previous government ran on similar questions, which concluded that immediate frontier model regulation was unnecessary, but there would likely be a time when this would change (View Highlight)
China was the first country to begin setting down generative AI guardrails, with comprehensive (originally voluntary) guidelines appearing from 2022 onwards. The country’s censorship apparatus is now stepping in.
● While top Chinese labs continue to produce SOTA models, the government, via the Cyberspace Administration of China, is keen to ensure that models avoid giving ‘incorrect’ answers to political questions while also avoiding the appearance of being censored.
● Before releasing a model, labs have to submit their models to tests with tens of thousands of questions to calibrate their refusal rate.
They usually achieve this by building a spam-filter type classifier. There’s also a booming industry of consultants assisting labs.
● There are also other inconveniences, including a ban on domestic Hugging Face access. The officially sanctioned “mainstream values corpus” acts as an inferior replacement source of training data.
● While big companies like Alibaba, ByteDance, and Tencent can afford the compliance burden and use their global footprint to get around some restrictions, start-ups are likely to suffer. (View Highlight)
Shortly after the publication of the last State of AI Report, the US controlled the export of NVIDIA’s sanctions-compliant A800 and H800 chips, but its actions have broadened out. ● Not only has the US barred the export of certain items, it has actively attempted to interfere with stockpiling efforts, either blocking shipments of goods or leaning on international partners to do so, ahead of the deadlines for restrictions. This has affected NVIDIA, Intel, and ASML.
● This was followed with letters from the Commerce Department instructing US manufacturers to cease sales to Chinese semiconductor maker SMIC’s most advanced facility.
● The US is also escalating beyond restricting technology sales and is moving to either block or restrict US investment in Chinese start-ups working on a broad range of applications deemed detrimental to national security, including semiconductors, defense, surveillance, and audio, image, and video recognition.
● Considering sharply diminished US investment in Chinese start-ups, the impact will largely be symbolic. (View Highlight)
Skeptics of US sanctions have long warned that they could inadvertently spur local innovation. These efforts continue to struggle with quality and performance issues, despite generous government subsidy.
● Despite corruption concerns, China has continued to deepen its semiconductor subsidy program. In May, the government unveiled a third state-backed investment fund of $47.5B. The finance ministry is the biggest shareholder, along with a coalition of Chinese banks.
● Huawei caused a stir with the release of the Ascend 910B, a 7nm chip for AI training, which on paper, was a near match for the NVIDIA A100.
● However, SMIC has struggled to manufacture these chips at scale: 4/5 are reportedly defective. The company’s cloud services CEO has all but admitted that the company will struggle to innovate beyond 7nm for the foreseeable future.
● Previously buzzy semiconductor start-ups such as X-Epic have started to lay off staff as the market has cooled, while memory chip maker YMTC ran into severe financial trouble late last year. (View Highlight)
For a combination of political and cultural reasons, Japan has historically been a placid market for both venture capital and AI start-ups. The government is suddenly keen to get a slice of the action.
● The Japanese government sees VC and AI as a potential vehicle for kickstarting a long-stagnant economy, while Japan presents an opportunity for investors who’d rather not have to raise from deep-pocketed Gulf states.
● Tokyo-based Sakana has already pulled in $200M from US investors like Lux Capital and Khosla Ventures, while a16z is reported to be planning a Japan office. ● In turn, Japanese government-funded investment vehicles have invested in two of US VC NEA’s funds and are actively exploring others. Mitsubishi is said to be investing in the second AI fund of Stanford’s Andrew Ng. ● Meanwhile, the country is also priding itself on a light-touch approach to regulation, is focusing on industry-led oversight, and seems unsympathetic to copyright claims around generative AI. However, it has created a UK-style safety institute. ● Sensing the momentum, Microsoft has announced $2.9B of investment in Japanese AI and cloud infrastructure. (View Highlight)
With the capex needs of frontier labs beginning to grow beyond what traditional VC alone can supply, labs are beginning to look further afield. Alarm bells are already beginning to ring in the corridors of power.
● Following the downfall of FTX, its 8% stake in Anthropic was sold primarily to Mubadala, the government of Abu Dhabi’s sovereign wealth fund. A Saudi bid was turned down on national security grounds, although Saudi investors Prince Alwaleed Bin Talal and Kingdom Holding participated in X.ai’s Series B.
● Most controversially, G42, an Emirati AI-focused holding company had struck a partnership with OpenAI to work in the country’s finance, energy, and healthcare sectors.
● G42’s holdings in prominent Chinese technology companies, including Bytedance, prompted panic in the US intelligence community.
● In the end, G42 was pressured into divesting its Chinese holdings and accepting a $1.5B investment from Microsoft, with Microsoft President Brad Smith joining the board. (View Highlight)
Public compute efforts pale in comparison to private. The UK, US, and EU are all beginning to ramp up their public compute offerings, subsidising researchers’ and start-ups’ access to expensive hardware. But efforts remain tentative.
● The UK recently froze investment in a number of projects, most significantly a planned national supercomputing facility at Edinburgh.
● Meanwhile, the EU is using grants to make small amounts of compute available to start-ups through a competitive process, along with small sums of money (€250K and 2 million computational hours).
● It recently published a call for proposals for its AI Factories initiative, which will allow developers and researchers to access the EuroHPC network of supercomputers and other resources, including data repositories, skills training, and co-working hubs – leaving the potential hosts with the flexibility to bundle various resources as they see fit.
● The US National AI Research Resource is now operational, with researchers applying for a year-long access on the condition that their work is published openly afterwards.
● At the bolder end of the spectrum, the Indian government has indicated its willingness to fund half the cost of a 10,000 NVIDIA GPU cluster, which it would look to establish in under 18 months, provided private partners are prepared to foot some of the bill. (View Highlight)
Big tech companies have signed up to a range of 2030 climate commitments, with Microsoft even pledging to be carbon negative. AI energy consumption means they’re currently headed in the wrong direction.
● According to Google’s 2024 Environmental Report, the company’s greenhouse gas emissions have climbed by 48% since 2019, while Microsoft’s carbon emissions have jumped by 30% since 2020. xAI’s 100k H100 cluster is thought to be powered by gas generators.
● Meanwhile, Goldman Sachs is estimating that data center power demand is on course to grow 160% by 2030, although they note that demand was growing sharply even before the genAI boom took off.
● Tech companies are trying to shape a review of the Greenhouse Gas Protocol, which sets the rules for carbon accounting.
● Critics argue offsets don’t represent emissions accurately.
Over 50% of Amazon and Microsoft’s renewable energy comes from purchasing clean energy certificates. (View Highlight)
When the war in Ukraine started, start-ups enthusiastically sent their equipment to be trialled on the frontline. The Ukrainians haven’t always been impressed.
● Drones produced by US start-ups have frequently fallen short of their benchmark performances on range and payload, while their high-powered designs worked against them. Their advanced comms, designed to make them more secure, gave them an easy signature for Russian electronic warfare to detect.
● While the off-the-shelf Chinese DJI drone remains ubiquitous, it appears that the Ukrainians are working hard to build a domestic ecosystem of drone and ground robotics start-ups.
● At least 67 models of domestically-built UAVs have been certified and 250 teams are working on UGVs.
● Alongside Helsing, there are still signs that international partners are helping on software. For example, Swiss autonomy start-up Auterion’s Skynode is helping FPV drones lock onto targets from long distance to mitigate the effect of electronic warfare. (View Highlight)
The debate over AI’s economic impact intensifies. 2023 saw discussion about the extent to which different industries were exposed to AI. While organizations (e.g. the IMF) continue to publish this work, the debate has begun to move on to its wider economic effects.
● Prominent economist Daron Acemoglu started a row when he argued in a paper for Economic Policy and some Goldman Sachs research that AI would have a minor macroeconomic impact, increasing Total Factor Productivity* by < 0.55% over the next 10 years, while deepening inequality.
● Acemoglu assumes that AI will be able to drive further automation of tasks but have little effect on the efficiency of currently capital-intensive tasks - unlike previous waves of automation - while creating new ‘negative’ tasks (e.g. producing disinformation or targeted ads). These assumptions sparked criticism.
● On automation itself, influential economics commentator Noah Smith argued that comparative advantage is likely to hold for the foreseeable future - even if AI becomes more efficient than humans at everything, the cost of energy and compute will incentivize people to apply it only to the most important tasks.
● This is fortunate, as universal basic income, the policy lever many AI luminaries such as Sam Altman and Demis Hassabis have advocated as a response to AI’s impact, may not be a panacea. A sizeable trial funded by Altman found UBI slightly reduced the number of hours worked, but led to little in the way of increased education or entrepreneurship. (View Highlight)
With their ability to communicate directly with western audiences limited, Russia Today was found to be operating a network of 1,000 fake X accounts via a tool called Meliorator. There are also signs that Russian state-linked actors have used fake imagery around the Israel-Hamas conflict to stir controversy. But there’s little evidence to suggest that this material is being viewed or believed by more than a small number of people.
● A recent review published in Nature poured cold water on the significance of the issue, finding that research tended to over-focus on fringe groups, overstate the role of bots, and failed to actually demonstrate real-world effect.
● In a similar vein, a study from the Alan Turing Institute found that AI-enabled disinformation had no impact on UK or European elections this year, with volumes low and exposure largely confined to small groups of political partisans. (View Highlight)
Is AI going to be nationalized? (Spoiler alert: no.) As capabilities accelerate and tensions with China grow, a small chorus of voices have suggested that the US government may need to intervene and start a new Manhattan Project. Not everyone is convinced.
● Ex-OpenAI staffer Leopold Aschenbrenner revived this discussion with ‘Situational Awareness’, a 165-page PDF arguing that, based on scaling laws, AGI by 2027 is plausible and that “the nation’s leading AI labs [are] basically handing the key secrets for AGI to the CCP on a silver platter”.
● Aschenbrenner advocates for the government nationalizing the major AI labs and building a national AGI project.
● Critics have accused Aschenbrenner of alarmism and questioned his timelines, pointing to constraints in data, energy, and compute.
● However, it’s clear that both government and labs are taking these questions more seriously. OpenAI appointed retired U.S. Army General Paul M. Nakasone to its Board of Directors and created a new Safety and Security Committee.
● This follows reports that the company’s systems were breached by hackers last year. (View Highlight)
Gone are the days of US congressional hearings and world tours to promote the (existential) AI safety agenda: leading frontier model companies are now accelerating the distribution of their AI products to consumers.
2023: AI is dangerous 2024: Plz use my app (View Highlight)
A community of red teamers (led by the anonymous Pliny the Prompter) have managed to break the defenses outlined on the previous slide, with GPT-4o mini’s Instruction Hierarchy being compromised within hours.
● While much of this work is done by ethically-motivated groups, the UK’s AI Safety Institute has expressed alarm at how models from leading labs comply with harmful requests “under relatively simple attacks”.
● Although jailbreak attacks are mostly harmless, DeepKeep, an Israeli cybersecurity start-up, made Llama 2 reveal sensitive personal data. ● Meanwhile, a team at UIUC has shown that GPT-4’s ability to leverage tool use and long context means it can hack websites by performing tasks like SQL injections without human feedback. With the right context, it can also exploit one-day vulnerabilities.
● Other research has illustrated the vulnerability of multi-agent environments to ‘infectious attacks’, where single agents are jailbroken, before contaminating the others. (View Highlight)
Coming up with endless potential attacks to red team models is challenging. Labs are increasingly using LLMs to scale the process of finding and patching vulnerabilities, including two teams at Meta.
● Rainbow Teaming employs an open-ended search algorithm to create prompts that are designed to elicit potentially unsafe or biased responses from the target LLM.
● By varying their approach and content, they can systematically explore LLM weaknesses. This was used as part of the safety testing for Llama 3.
● Rather than evolutionary search, AdvPrompter uses a single LLM, going through an alternating process of generating adversarial prompts and fine-tuning on them.
● Once trained, AdvPrompter can quickly produce new adversarial prompts adapted to different instructions. (View Highlight)
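Schematically, automated red-teaming methods of this kind share a generate-score-keep loop. The sketch below is a generic toy illustration, not the actual Rainbow Teaming or AdvPrompter algorithms; the attacker, target, and judge functions are hypothetical stubs.

```python
import random

def attacker_generate(seed_prompt: str) -> str:
    # Stand-in for an attacker LLM that mutates a candidate prompt.
    return seed_prompt + random.choice([" step by step", " as a fictional story", " in JSON"])

def target_respond(prompt: str) -> str:
    # Stub target model: "complies" only for one mutation type.
    return "complied" if "story" in prompt else "refused"

def judge_unsafe(response: str) -> float:
    # Stub safety judge scoring whether the response was unsafe.
    return 1.0 if response == "complied" else 0.0

def red_team(seeds, iterations=50, keep=10):
    """Iteratively mutate prompts and keep those most likely to elicit unsafe output."""
    archive = [(0.0, s) for s in seeds]
    for _ in range(iterations):
        _, parent = random.choice(archive)
        child = attacker_generate(parent)
        score = judge_unsafe(target_respond(child))
        archive.append((score, child))
        archive = sorted(archive, key=lambda t: t[0], reverse=True)[:keep]
    return archive

print(red_team(["Explain how to do X"]))
```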
To improve the robustness of image classifiers to adversarial attack, a Google DeepMind team drew inspiration from biological visual systems, specifically the concept of microsaccades (small, involuntary eye movements).
● They feed the model multiple smaller, slightly blurrier versions of the same image. This improves robustness without needing special training.
● CrossMax Ensembling combines predictions from different layers of the model.
● Even if an adversarial attack confuses the final output, the predictions from earlier layers are often still accurate. By combining these, the model becomes stronger against attacks.
● The proposed method achieves state-of-the-art (SOTA) adversarial accuracy on the CIFAR-10 and CIFAR-100 datasets without adversarial training (View Highlight)
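A toy PyTorch sketch of the multi-resolution idea: classify several downscaled-and-re-upscaled (slightly blurrier) copies of an image and aggregate the predictions. The median over softmax outputs used here is a simplified stand-in, not the paper’s exact CrossMax layer-ensembling scheme, and the untrained ResNet is purely illustrative.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights=None).eval()   # untrained; purely illustrative

def multiscale_predict(image: torch.Tensor, scales=(1.0, 0.75, 0.5)):
    """image: (1, 3, H, W). Blur via down/upsampling at several scales, then aggregate."""
    probs = []
    h, w = image.shape[-2:]
    for s in scales:
        small = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        blurred = F.interpolate(small, size=(h, w), mode="bilinear", align_corners=False)
        with torch.no_grad():
            probs.append(F.softmax(model(blurred), dim=-1))
    # Median over the per-scale distributions is robust to any one scale being fooled.
    return torch.median(torch.stack(probs), dim=0).values

pred = multiscale_predict(torch.rand(1, 3, 224, 224))
print(pred.argmax(dim=-1))
```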
Beyond jailbreaking, research points to the potential of more stealthy attacks. While jailbreaking is often the public face of safety challenges, the potential attack surface is much wider, covering everything from training, through to preference data and fine-tuning.
● Anthropic published an eye-catching paper arguing that it was possible to train LLMs to act as ‘sleeper agents’, exhibiting safe behavior on their initial release, before turning malicious at a later date. This was resistant to safety training techniques, such as supervised fine-tuning, reinforcement learning, and adversarial training.
● Researchers from Google and Technical University of Darmstadt found that poisoning the preference pairs that RLHF relies on was an effective way to manipulate a model. They only needed to compromise <5% of the data, indicating the dangers of the widespread use of public and uncurated datasets for preference training.
● Berkeley and MIT researchers created a dataset that seems benign but trains models to produce harmful outputs in response to encoded requests. When applied to GPT-4, the model consistently acted on harmful instructions while evading common safeguards. (View Highlight)
While there’s a significant body of work on how pre-training performance scales, there’s much less clarity on how downstream performance does. A team of researchers has scrutinized the role of multiple-choice questions.
● They argue that standard performance metrics like accuracy mask the clear scaling trends visible in raw model outputs, making capability prediction difficult. These metrics compress and distort the original probability data, obscuring subtle improvements that occur as models get larger.
● This would appear to strengthen the argument that ‘emergent capabilities’ are the artificial product of poor metric construction, rather than real capability jumps.
● As the metrics rely on comparing the correct choice against specific incorrect choices, the researchers argue that we need to understand how probabilities change for both correct and incorrect answers as scale increases.
● This will also involve developing new evaluation techniques that preserve more of the raw probability information. (View Highlight)
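A tiny numerical illustration of the argument, using entirely invented probabilities: the model’s probability on the correct choice improves smoothly with scale, while a thresholded accuracy-style metric looks flat before it suddenly ‘emerges’.

```python
import numpy as np

# Hypothetical per-question probabilities assigned to the correct choice (4 options)
# by three model sizes; numbers are invented purely to illustrate the metric gap.
p_correct = {
    "small":  np.array([0.26, 0.27, 0.25, 0.28]),
    "medium": np.array([0.31, 0.34, 0.30, 0.36]),
    "large":  np.array([0.52, 0.61, 0.48, 0.66]),
}

for name, p in p_correct.items():
    accuracy = (p > 0.40).mean()        # crude thresholded metric: counted only if clearly above chance
    mean_logprob = np.log(p).mean()     # continuous metric preserves the gradual trend
    print(f"{name:7s} accuracy={accuracy:.2f}  mean log-prob={mean_logprob:.2f}")
# The thresholded metric jumps 0.00 -> 0.00 -> 1.00, while log-prob improves steadily at every scale.
```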
Ensuring accurate, honest responses is crucial in alignment. However, research points to the interplay of training data, optimization techniques, and the limitations of current architectures making this difficult to guarantee.
● Anthropic has zeroed in on RLHF, arguing that SOTA AI assistants show consistent sycophantic behavior (e.g. biased feedback, being swayed by factually incorrect prompts, conforming to beliefs, mimicking errors). The weakness lies in human preference data, with human evaluators preferring supportive responses.
● Optimizing against preference models that don’t sufficiently prioritize or accurately assess truthfulness means models deprioritize drawing on their factual knowledge base for certain queries.
● Similarly, research from the Centro Nacional de Inteligencia Artificial in Chile found that LLMs can overestimate the depth of nonsensical or pseudo-profound statements, thanks to RLHF combined with an absence of contextual understanding. (View Highlight)
First proposed as an alternative to RLHF in 2023, DPO has no explicit reward function and comes with efficiency advantages because it doesn’t sample from a policy during training or require extensive hyperparameter tuning. Despite its novelty, the method has already been used to align Llama 3.1 and Qwen2.
● However, there are signs that the “over-optimization” that is traditionally associated with RLHF can also happen with DPO and other kinds of direct alignment algorithms (DAAs), despite the absence of a reward model (the standard DPO objective is sketched after this list). This worsens the more models are allowed to deviate from their starting point as they learn to align with human preferences.
● This could be the result of underconstrained objectives, where the algorithm unintentionally assigns high probabilities to out-of-distribution data.
● This is inherent to DAAs, but can be partially mitigated through careful parameter tuning and increased model size. (View Highlight)
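For reference, here is a minimal sketch of the standard DPO objective (not the paper’s code; tensor names are illustrative). The frozen reference model and the β parameter are the only brake on how far the policy drifts, which is the knob the over-optimization results above turn on.

```python
# Minimal sketch of the standard DPO objective. Inputs are summed log-probs of
# whole responses under the trainable policy and a frozen reference model;
# variable names are illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how much more (or less) likely the policy makes each
    # response relative to the reference model, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference likelihood: push chosen above rejected. Lower
    # beta (or longer training) lets the policy drift further from the
    # reference, which is where over-optimization can bite.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Usage on a dummy batch of 4 preference pairs:
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(float(loss))
```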
Thanks to a combination of innate advantages and innovations designed to improve its efficiency, RLHF doesn’t look set to be displaced by offline direct alignment methods at scale anytime soon.
● Testing online vs. offline approaches across datasets covering summarization, helpfulness, conversational ability, and harmlessness, a Google DeepMind team found that RLHF emerged as the winner across all of them.
● They argue that this stems from on-policy sampling, which more effectively improves generative tasks and cannot be easily replicated by offline algorithms, even with similar data or model scaling.
● Cohere for AI has explored scrapping the Proximal Policy Optimization algorithm in RLHF (which treats each token as an individual action), in favor of their RLOO (REINFORCE Leave-One-Out) Trainer, which treats the entire generation as one action, distributing rewards across the full sequence (a minimal sketch of the advantage computation follows this list).
● They find this leads to a 50-75% GPU use reduction and a 2-3x increase in training speed versus PPO, depending on model size. (View Highlight)
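A hedged sketch of the leave-one-out advantage at the heart of RLOO (variable names are illustrative, and the full Cohere trainer includes details omitted here): each sampled completion is scored as a single action, and its baseline is simply the mean reward of the other samples for the same prompt.

```python
# Minimal sketch of REINFORCE Leave-One-Out (RLOO): k completions are sampled
# per prompt, each scored as a whole sequence, and every sample's baseline is
# the mean reward of the *other* k-1 samples. Variable names are illustrative.
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: shape [k] - one scalar reward per sampled completion."""
    k = rewards.shape[0]
    baselines = (rewards.sum() - rewards) / (k - 1)   # leave-one-out mean
    return rewards - baselines

def rloo_loss(seq_logps: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """seq_logps: summed log-prob of each full completion under the policy.
    The whole generation is treated as one action, so the advantage weights the
    sequence-level log-prob rather than per-token values (as PPO does)."""
    advantages = rloo_advantages(rewards).detach()
    return -(advantages * seq_logps).mean()

# Usage with k=4 sampled completions for one prompt:
loss = rloo_loss(seq_logps=torch.randn(4, requires_grad=True),
                 rewards=torch.tensor([1.0, 0.2, -0.5, 0.8]))
loss.backward()
```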
A Google DeepMind team has combined the simplicity of direct alignment from preferences (DAP) with the online policy learning of RLHF to create direct alignment from AI feedback. Here, an LLM serves as an annotator, choosing between two responses during each training iteration. This keeps the advantages of online learning without requiring a separate reward model. This is essentially a form of online DPO. They found it outperformed traditional RLHF and offline DPO across summarization, harmlessness, and helpfulness tasks. (View Highlight)
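The control flow can be sketched as follows; `policy.sample`, `annotator.prefer`, and `dpo_update` are placeholders for components the paper assumes (on-policy sampling, an LLM judge, and a DPO-style optimizer such as the loss sketched earlier), not its actual API.

```python
# Hedged sketch of one online direct-alignment-from-AI-feedback step. The
# policy, annotator, and dpo_update objects are placeholders, not the paper's
# implementation; only the control flow is being illustrated.
def online_daif_step(policy, ref_policy, annotator, prompt, dpo_update):
    # 1. On-policy sampling: two candidate responses from the *current* policy.
    response_a = policy.sample(prompt)
    response_b = policy.sample(prompt)

    # 2. AI feedback: an LLM annotator picks the preferred response, standing
    #    in for both the human labeller and a separate reward model.
    chosen, rejected = annotator.prefer(prompt, response_a, response_b)

    # 3. Direct alignment update on the freshly labelled pair (online DPO).
    dpo_update(policy, ref_policy, prompt, chosen, rejected)
```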
LLMs suffer from two main reliability errors: responses that are inconsistent with their internal knowledge (hallucinations) and ones that share information that does not accord with established external knowledge.
● A recent paper from the University of Oxford focuses on a subset of hallucinations called confabulations, where LLMs produce arbitrary and incorrect generations.
● They measure the LLM’s uncertainty by generating multiple answers to a question and using another model to group them together by similar meaning. Higher entropy scores across groups suggest confabulation (see the sketch after this list).
● Meanwhile, Google DeepMind have introduced SAFE, which evaluates the factuality of LLM responses by breaking them down into individual facts, using search engines to verify facts, and clustering semantically similar statements.
● They’ve also curated LongFact, a new benchmark dataset for evaluating long-form factuality across 38 topics. (View Highlight)
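A minimal sketch of the semantic-entropy idea referenced above: sample several answers, cluster them by meaning, then compute entropy over the clusters. The `same_meaning` callable is a stand-in for the paper’s entailment-based equivalence check, and the toy exact-match example is purely illustrative.

```python
# Hedged sketch of semantic entropy for detecting confabulations.
import math

def semantic_entropy(answers, same_meaning):
    # Greedy clustering: put each answer into the first cluster it matches.
    clusters = []
    for ans in answers:
        for cluster in clusters:
            if same_meaning(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    # Entropy over the empirical distribution of meaning-clusters: many
    # distinct meanings with similar mass -> high entropy -> likely confabulation.
    n = len(answers)
    probs = [len(c) / n for c in clusters]
    return -sum(p * math.log(p) for p in probs)

# Usage with a toy equivalence check (exact string match stands in for the
# NLI/LLM-based check used in the paper):
answers = ["Paris", "Paris", "Lyon", "It is Paris"]
print(semantic_entropy(answers, same_meaning=lambda a, b: a == b))
```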
Anthropic’s interpretability team used sparse autoencoders - neural networks that learn efficient representations of data by emphasizing important features and ensuring only a few are active at any one time - to decompose activations of Claude 3 Sonnet into interpretable components. They also showed that by ‘pinning’ a feature to ‘active’ you could control the output - famously turning up the strength of the Golden Gate feature. (View Highlight)
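In code, the ‘pinning’ trick amounts to clamping one learned feature before decoding and substituting the edited activation back into the forward pass. A hedged sketch, assuming a trained SAE with `encode`/`decode` methods (illustrative names, not Anthropic’s internals):

```python
# Hedged sketch of feature steering with a trained sparse autoencoder: clamp
# one learned feature to a fixed value, decode, and patch the edited activation
# back into the model's forward pass at the same layer.
import torch

def steer_activation(sae, activation, feature_idx, strength=10.0):
    """activation: residual-stream vector(s) captured at some layer."""
    features = sae.encode(activation)          # sparse feature activations
    features[..., feature_idx] = strength      # 'pin' the chosen feature on
    return sae.decode(features)                # edited activation to patch back in
```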
SAEs aren’t new, but researchers have often struggled to balance sparsity against reconstruction quality, and with latents dying during training (i.e. features that become inactive). OpenAI researchers have worked on a methodology that scales.
● The researchers introduce the TopK activation function, which directly constrains the number of active features. For each input, only the k highest-activating features are kept, while the rest are set to zero - providing direct control over the sparsity level (a minimal sketch follows this list).
● They also managed to reduce dead latents to only 7%, an improvement on previous methods, where up to 90% could become inactive in large models.
● The OpenAI team also demonstrated both the potential and desirability of scaling, training a 16 million latent autoencoder on GPT-4 activations, finding clear scaling laws. (View Highlight)
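A minimal sketch of a TopK sparse autoencoder in the spirit of the method described above; architecture and training details (such as the auxiliary loss used to revive dead latents) are omitted, and the hyperparameters are illustrative.

```python
# Minimal TopK sparse autoencoder sketch: only the k largest pre-activations
# are kept per input, giving direct control over sparsity.
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = self.encoder(x)
        # Keep the k highest pre-activations, zero the rest.
        topk = torch.topk(pre, self.k, dim=-1)
        sparse = torch.zeros_like(pre)
        sparse.scatter_(-1, topk.indices, torch.relu(topk.values))
        return sparse

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encode(x))

# Usage: reconstruct a batch of (hypothetical) residual-stream activations.
sae = TopKSAE(d_model=768, n_latents=16384, k=32)
x = torch.randn(8, 768)
recon_loss = ((sae(x) - x) ** 2).mean()
```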
Maybe the black box just isn’t that opaque after all? We’ve seen a run of interpretability research, including the work on SAEs, arguing that high-level semantic concepts are encoded “linearly” in model representations - and that they can be decoded (a linear-probe sketch follows this list).
● A Chicago/Carnegie Mellon team introduce a simplified model where words and sentences are represented by binary “concept” variables. They prove that these concepts end up being represented linearly within the model’s internal space, thanks to next-token prediction and the tendency of gradient descent to find simple, linear solutions.
● This linearity was also the theme of work from the Moscow-based AI Research Institute, which argued that transformations happening within the model can be approximated with simple linear operations.
● Google has introduced a popular new method for decoding intermediate neurons. Patchscopes takes a hidden representation from an LLM and ‘patches’ it into a different prompt. That prompt is then used to generate a description or answer a question, revealing the information encoded in the representation. (View Highlight)
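To illustrate what “linearly encoded” buys you in practice, here is a hedged linear-probe sketch: fit a single direction on hidden states labelled for a concept, then decode the concept from new activations with one dot product. `hidden_states` and `labels` are placeholders for activations and annotations you would collect yourself; this is not the method of any specific paper above.

```python
# Hedged sketch of linear decoding: if a concept is represented linearly, a
# single learned direction suffices to read it off a hidden state.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_concept_direction(hidden_states: np.ndarray, labels: np.ndarray):
    """hidden_states: [n_examples, d_model]; labels: [n_examples] in {0, 1}."""
    probe = LogisticRegression(max_iter=1000).fit(hidden_states, labels)
    direction = probe.coef_[0]                       # the concept direction
    return direction / np.linalg.norm(direction)

def concept_score(hidden_state: np.ndarray, direction: np.ndarray) -> float:
    # One projection decodes the concept from the activation.
    return float(hidden_state @ direction)
```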
Mark Zuckerberg has said that continued exponential growth potentially requires data centers powered by 1GW of electricity (close to the output of a meaningful nuclear power plant), versus the 50-100MW facilities typical today.
● Microsoft and OpenAI’s planned $100B+ Stargate supercomputer is estimated internally to need potentially as much as 5GW to power it. For comparison, the Grand Coulee Dam, the US’ biggest power plant, produces 6.8GW.
● Microsoft is set to buy all the output of the revived Three Mile Island nuclear plant.
● A facility on this scale would require its own power plant, as the grid would not be able to handle it. Ireland, Germany, Singapore, China, and the Netherlands have introduced restrictions on data centers due to capacity concerns.
● Alongside energy, builders of standard-sized data centers are seeing years-long waits for back-up generators and cooling, and challenges sourcing basic components like cables and transformers. (View Highlight)