Microsoft has just unveiled Phi-2, its latest artificial intelligence (AI) language model, positioning it as a notable advance over competitors such as Llama 2 and Mistral 7B. With 2.7 billion parameters, Phi-2 matches the performance of much larger models, even outperforming Google’s recent Gemini Nano 2 despite that model’s roughly half a billion additional parameters.
Microsoft researchers highlight the lower toxicity and bias of Phi-2’s responses compared to Llama 2. Despite these promising results, Phi-2 is currently limited to research use only: commercial use is prohibited under a custom license from Microsoft Research. This restriction will disappoint companies hoping to build products on the technology.
In recent months, Microsoft’s Machine Learning Foundations research team has released a series of small language models (SLMs) called “Phi” that achieve remarkable performance on a variety of benchmarks. The first model, the 1.3-billion-parameter Phi-1, achieved top performance in Python coding among existing SLMs (notably on the HumanEval and MBPP benchmarks). Microsoft then expanded the scope to common-sense reasoning and language understanding, creating a new 1.3-billion-parameter model called Phi-1.5, whose performance is comparable to models five times larger.
The Phi-1 language model is a specialized transformer for basic Python coding. Its training used a variety of data sources, including subsets of Python code from The Stack v1.2, Q&A content from StackOverflow, competition code from code_contests, and synthetic Python textbooks and exercises generated with gpt-3.5-turbo-0301. Even though the model and datasets are relatively small compared to modern large language models (LLMs), Phi-1 demonstrated an impressive accuracy of over 50% on the simple Python coding benchmark HumanEval.
Uses
Due to the nature of the training data, Phi-1 is best suited for prompts that use the following code format:
Code format:

def print_prime(n):
    """
    Print all prime numbers between 1 and n
    """
    for num in range(2, n + 1):
        for i in range(2, num):
            if num % i == 0:
                break
        else:
            print(num)
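In practice this means giving Phi-1 a function signature followed by a docstring and letting the model complete the body. A minimal sketch of building such a prompt (the helper name is ours, not part of any official tooling):

```python
def make_phi1_prompt(signature: str, docstring: str) -> str:
    """Build a Phi-1-style completion prompt: a function signature
    followed by a docstring, which the model is expected to complete."""
    return f'{signature}\n    """{docstring}"""\n'

prompt = make_phi1_prompt("def print_prime(n):",
                          "Print all prime numbers between 1 and n")
```

The completion returned by the model would then be appended after the docstring, mirroring the code format shown above.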
Limits of Phi-1
- Limited package scope: 99.8% of the Python scripts in our fine-tuning dataset use only the typing, math, random, collections, datetime, and itertools packages. If the model generates Python scripts that use other packages, we strongly recommend users manually review all API usage.
- Replicating scripts from the web: Since our model is trained on Python scripts found online, there is a small risk that it will reproduce such scripts, especially if they appear repeatedly across different online sources.
- Generating inaccurate code: The model often generates incorrect code. We encourage users to view these outputs as inspiration rather than definitive solutions.
- Unreliable responses to alternative formats: Although the model appears to understand instructions in formats such as Q&A or chat, it often provides inaccurate answers, even when it seems confident. Its capabilities in non-coding formats are much more limited.
- Limits of natural language understanding: As a coding assistant, Phi-1’s main purpose is to answer programming-related questions. Although it has some natural-language-understanding capability, its primary function is not to hold general conversations or exercise common sense the way an AI assistant would. Its strength lies in support and advice in the context of programming and software development.
- Potential bias: Phi-1, like other AI models, is trained on web and synthetic data. This data may contain biases and errors that could affect the model’s output. Biases can come from a variety of sources, such as unbalanced representation, stereotypes, or controversial opinions in the training data. The model may therefore sometimes generate answers that reflect these biases or errors.
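The package limitation above lends itself to a simple automated check. A minimal sketch (assuming the generated code parses as valid Python; the helper name is ours) that flags imports outside the well-covered set using the standard ast module:

```python
import ast

# Packages covering 99.8% of Phi-1's fine-tuning scripts, per the model
# card; imports outside this set deserve manual review.
ALLOWED = {"typing", "math", "random", "collections", "datetime", "itertools"}

def flag_unusual_imports(source: str) -> set:
    """Return top-level imported packages that fall outside ALLOWED."""
    tree = ast.parse(source)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found - ALLOWED
```

A generated script that imports, say, requests would be flagged for the manual API review the model card recommends.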
Warning of security risks
Vigilance is essential when operating Phi-1. Although the model is powerful, it can inadvertently introduce security vulnerabilities into the generated code. Examples include:
- Directory traversal: The code may not implement security controls against directory traversal attacks, which may allow unauthorized access to sensitive files on your system.
- Injection attacks: Gaps in string escaping may leave the application vulnerable to SQL injection, operating-system command injection, or other injection attacks.
- Misunderstanding of requirements: The model can sometimes misunderstand or oversimplify user requirements, resulting in incomplete or unreliable solutions.
- Missing input validation: In some cases, the model may neglect validation or sanitization of user input, opening the door to attacks such as cross-site scripting (XSS).
- Insecure defaults: The model may recommend or generate code with insecure default settings, such as weak password requirements or unencrypted data transfers.
- Poor error handling: Inadequate error handling can inadvertently reveal sensitive information about the system or the application’s internal workings.
Given these potential pitfalls and others not explicitly mentioned, it is important to thoroughly review, test, and verify generated code before deploying it to an application, especially those that are sensitive for security reasons. When in doubt, always consult security experts or conduct rigorous penetration testing.
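As an illustration of the directory-traversal point above, here is a hedged sketch of the kind of server-side guard that generated file-handling code often omits (the base directory and helper name are hypothetical):

```python
from pathlib import Path

BASE_DIR = Path("/srv/app/uploads")  # hypothetical sandbox root

def safe_resolve(user_path: str) -> Path:
    """Resolve a user-supplied relative path and reject any result that
    escapes BASE_DIR (directory traversal via '..' or absolute paths)."""
    candidate = (BASE_DIR / user_path).resolve()
    if not candidate.is_relative_to(BASE_DIR.resolve()):
        raise ValueError(f"path escapes sandbox: {user_path}")
    return candidate
```

A request for "notes.txt" resolves inside the sandbox, while "../../etc/passwd" is rejected; generated code that concatenates paths without such a check is exactly the pattern to look for during review.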
Microsoft today released Phi-2, a 2.7-billion-parameter language model that exhibits exceptional reasoning and language-understanding capabilities and delivers leading performance among base language models with fewer than 13 billion parameters. On complex benchmarks, Phi-2 matches or outperforms models up to 25 times larger, thanks to new innovations in model scaling and training-data curation.
With its compact size, Phi-2 is an ideal playground for researchers, particularly for exploring mechanistic interpretability, improving safety, or fine-tuning experiments on a variety of tasks. We have made Phi-2 available in the Azure AI Studio model catalog to encourage language-model research and development.
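For readers who want to experiment, a minimal sketch of loading the public microsoft/phi-2 checkpoint via the Hugging Face transformers library (this assumes transformers and a backend such as PyTorch are installed; the first call downloads several gigabytes of weights):

```python
MODEL_ID = "microsoft/phi-2"  # public Hugging Face checkpoint

def generate(prompt: str, max_new_tokens: int = 128) -> str:
    """Load Phi-2 and complete `prompt` with greedy decoding."""
    # Imported lazily so the module loads without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

if __name__ == "__main__":
    print(generate('def print_prime(n):\n    """Print primes up to n"""\n'))
```

Azure AI Studio exposes the same model through its catalog, so the checkpoint name is the only detail tied to the Hugging Face route.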
Main ideas behind the creation of Phi-2
The massive scale-up of language models to hundreds of billions of parameters has unlocked a host of new capabilities that have redefined the landscape of natural language processing. An open question is whether these capabilities can be achieved at a smaller scale through strategic training choices such as data selection.
Work on the Phi models aims to answer this question by training SLMs that achieve performance comparable to that of much larger models (though still far from frontier models). Two key insights underpin Phi-2’s break with the traditional scaling laws of language models:
First, the quality of the training data plays an essential role in model performance. This has long been known, but the idea is taken to the extreme here by focusing on “textbook-quality” data, extending the earlier work “Textbooks Are All You Need.” The training data contains synthetic datasets specifically created to teach the model common-sense reasoning and general knowledge, including, but not limited to, science, everyday activities, and theory of mind.
Second, Microsoft used innovative scaling techniques, starting from its 1.3-billion-parameter Phi-1.5 model and embedding its knowledge within the 2.7-billion-parameter Phi-2. This scaled knowledge transfer not only accelerates training convergence but also yields a significant boost in Phi-2’s benchmark scores.
Training details
Phi-2 is a transformer-based model with a next-word prediction objective, trained on 1.4T tokens from multiple passes over a mixture of synthetic and web datasets for NLP and coding. Training took 14 days on 96 A100 GPUs. Phi-2 is a base model that has undergone neither reinforcement learning from human feedback (RLHF) nor instruction fine-tuning. Nevertheless, it shows better behavior with respect to toxicity and bias than existing open-source models that have been aligned. This matches what was observed with Phi-1.5, thanks to the tailored data-curation technique.
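The next-word prediction objective can be made concrete with a toy example: the training loss is the average cross-entropy of the model’s probability for the true next token given each prefix. A sketch in plain Python, with a hypothetical token sequence and a stand-in uniform “model”:

```python
import math

# Toy corpus already tokenized into integer ids (hypothetical).
tokens = [3, 1, 4, 1, 5]

def next_token_loss(probs_fn, tokens):
    """Average cross-entropy of predicting each token from its prefix,
    i.e. the next-word prediction objective (with a stand-in model)."""
    losses = []
    for t in range(1, len(tokens)):
        prefix, target = tokens[:t], tokens[t]
        p = probs_fn(prefix)[target]  # probability of the true next token
        losses.append(-math.log(p))
    return sum(losses) / len(losses)

# Stand-in "model": uniform distribution over a 6-token vocabulary.
uniform = lambda prefix: [1 / 6] * 6
```

A real transformer replaces the uniform stand-in with learned, context-dependent probabilities; training drives this loss down over the 1.4T-token corpus.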
Evaluation of Phi-2
Below we summarize Phi-2’s performance on academic benchmarks compared to popular language models. The benchmarks span several categories: Big Bench Hard (BBH) (3-shot with CoT), commonsense reasoning (PIQA, WinoGrande, ARC easy and challenge, SIQA), language understanding (HellaSwag, OpenBookQA, MMLU (5-shot), SQuADv2 (2-shot), BoolQ), mathematics (GSM8k (8-shot)), and coding (HumanEval, MBPP (3-shot)).
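The k-shot setup mentioned above simply prepends k solved examples to the test question before asking the model to continue. A minimal sketch (the exact formatting is illustrative, not Microsoft’s evaluation harness):

```python
def few_shot_prompt(examples, question):
    """Concatenate k solved examples ahead of the test question,
    the standard k-shot evaluation setup (e.g. 8-shot for GSM8k)."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"

demo = few_shot_prompt([("2+2?", "4"), ("3+3?", "6")], "5+5?")
```

The model’s continuation after the final "A:" is then scored against the reference answer; "with CoT" means the solved examples include worked reasoning steps rather than bare answers.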
With only 2.7 billion parameters, Phi-2 outperforms the Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it performs better than the 25-times-larger Llama-2-70B model on multi-step reasoning tasks, i.e. coding and math. Additionally, Phi-2 matches or even surpasses the recently announced Google Gemini Nano 2, despite being smaller.
Of course, we are aware of the current difficulties in evaluating models and know that many public benchmarks can leak into the training data. For our first model, Phi-1, we conducted a comprehensive decontamination study to eliminate this possibility, which can be found in our first report, “Textbooks Are All You Need.”
Ultimately, we believe that the best way to evaluate a language model is to test it on real use cases. With this in mind, we also evaluated Phi-2 on several internal Microsoft datasets and tasks and compared it to Mistral and Llama-2. We observed similar trends: on average, Phi-2 outperforms Mistral-7B, and Mistral-7B in turn outperforms the Llama-2 models (7B, 13B, and 70B).
Microsoft’s release of Phi-2 certainly represents a significant advance in the field of artificial intelligence language models, outperforming its competitors in terms of performance despite its relatively modest size of 2.7 billion parameters. Phi-2’s ability to compete with larger models, including Google’s Gemini Nano 2, is undeniably impressive and shows the effectiveness of its architecture.
Microsoft’s focus on reducing “toxicity” and bias in Phi-2’s responses compared to other models such as Llama 2 is a positive development. This attention to response quality highlights the importance of building language models that minimize bias and undesirable output, thereby improving the reliability and usefulness of AI in various contexts.
However, the disappointment lies in the restriction on commercial use of Phi-2, which limits its use to research purposes only. This decision can be seen as an obstacle for companies that want to use this technology to develop innovative products. The restriction raises questions about Microsoft’s strategy for commercializing its AI advances and could potentially hinder Phi-2’s adoption in practical and lucrative applications.
Ultimately, while Phi-2’s technological advances are laudable, limiting it to research purposes could undermine its real impact on the market and raise questions about Microsoft’s willingness to share its innovations with the commercial world.
Source: Microsoft (1, 2)
And you?
Although Phi-2 appears to outperform its competitors, to what extent does this comparison take into account criteria other than the number of parameters, such as contextual accuracy and semantic understanding?
Microsoft criticizes Google’s demonstration with Gemini Nano 2. What are the specific elements of this demonstration and how does Phi-2 approach these aspects differently or similarly?
See also:
According to Rémi Louf, LLMs (Large Language Models) can generate JSON (JavaScript Object Notation) that is valid 100% of the time
Starling-7B: a new open source LLM (Large Language Model) almost as efficient as GPT-4, according to a study from the University of California
LLMs ranked by hallucination rate: according to an evaluation by Vectara, GPT-4 is the AI language model that hallucinates the least, suggesting that Google’s LLMs are the least reliable