Soon after rolling out its AI chatbot Q in late November, Amazon faced a barrage of negative reviews saying it gave false answers, known in industry parlance as hallucinations.

That’s left Amazon insiders looking for answers, and some are putting part of the blame on a less-capable version of Anthropic’s Claude, one of the base models that underpin the Q chatbot service. The cloud giant is now significantly ramping up an existing team of human staffers who manually review and fix the chatbot’s responses, Business Insider has learned.

Early Q stumbles are the result of a “rushed” launch that gave little time to test the chatbot properly, according to six current and former Amazon employees who were directly involved in the project. They asked not to be identified because they are not authorized to speak to the press. They said employees have repeatedly raised these concerns, and the team is now under pressure to improve the quality of Q’s answers, even as the project faces constraints over Amazon’s computing resources.

Amazon’s Q is a ChatGPT-like service that offers business customers quick answers to work-related or project-specific questions. It’s arguably the company’s highest-profile generative AI product so far, and an answer to popular chatbots from rivals such as Microsoft, Google, and OpenAI.

Despite the rushed launch, Q came out more than a year after ChatGPT and many months after Google’s Bard, highlighting how far behind Amazon is in the generative AI race. Q’s early challenges may be a setback for its efforts to catch up.

“Q should be more polished, given how far behind we are,” one of the Amazon employees told BI. “We had very limited time to test it.”

An Amazon spokesperson said Q is not based on a single AI model, and its launch followed standard operating procedure.

“Amazon Q is powered by Amazon Bedrock and takes advantage of many of the latest high-performing foundation models, using logic to route tasks to the model that is the best fit for the job,” the spokesperson added in a statement. “During the preview period we have received a significant amount of positive feedback from customers, and we continue to rapidly improve Amazon Q to make it even more useful for our customers.”

Claude Instant 1.2 versus Claude 2.1

Bedrock, the AWS cloud service that powers Q, provides access to a number of AI models, including Anthropic’s Claude 2.1, Meta’s Llama 2, and Amazon’s own Titan offering. Q can tap into the model that’s best for different use cases. The selling point for Q, one employee told BI, is that any company can take a base model, apply its own fine-tuning using company-specific proprietary data, and launch a bespoke chatbot for its own use.

Though Q is powered by Bedrock, Anthropic’s Claude is one of the major underlying base models, according to people familiar with the project. They said Q primarily used Claude Instant 1.2, a cheaper, lighter, and faster version of the AI model that was released in August. Internally, some employees believe upgrading to Claude 2.1, a more advanced version that came out a week before Q’s launch in November, would improve Q’s performance. The day after unveiling Q, Amazon announced that Claude 2.1 was available on Bedrock.
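Amazon hasn’t published how Q’s routing works, but the general pattern Bedrock customers use can be sketched. The snippet below is purely illustrative, not Amazon’s code: the routing rule is a made-up heuristic, while the model IDs are Bedrock’s published identifiers for the models named in this article at the time of writing.

```python
# Illustrative sketch of Bedrock-style model selection (NOT Amazon Q's
# actual routing logic; the length-based rule here is a toy heuristic).
import json

# Published Bedrock model IDs at the time of writing (treat as assumptions).
MODEL_IDS = {
    "claude-instant": "anthropic.claude-instant-v1",
    "claude-2.1": "anthropic.claude-v2:1",
    "llama2-13b": "meta.llama2-13b-chat-v1",
    "titan-text": "amazon.titan-text-express-v1",
}

def route(prompt: str) -> str:
    """Toy router: cheap, fast model for short questions, bigger model otherwise."""
    return MODEL_IDS["claude-instant"] if len(prompt) < 80 else MODEL_IDS["claude-2.1"]

def build_request(prompt: str) -> dict:
    """Build the arguments for a Bedrock invoke_model call for a Claude model."""
    body = json.dumps({
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": 300,
    })
    return {"modelId": route(prompt), "body": body}

# With AWS credentials configured, the request could then be sent via boto3:
#   import boto3
#   client = boto3.client("bedrock-runtime")
#   resp = client.invoke_model(**build_request("What does Amazon Q cost?"))
```

In this sketch, swapping Claude Instant 1.2 for Claude 2.1 is a one-line change to the model ID, which is why some employees saw the upgrade as a quick path to better answers.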

It’s no surprise that Amazon relies on Anthropic for some of Q’s base models. In September, Amazon agreed to invest up to $4 billion in the AI startup, and Anthropic CEO Dario Amodei gave a keynote speech at AWS’s annual re:Invent conference in November. Anthropic didn’t respond to a request for comment.

More approachable, but too simple

Currently, Amazon Q is only offered in preview mode to select customers.

Randall Hunt, VP of cloud strategy at Caylent, an AWS partner, told BI that Q now appears to be using the latest Claude model in many cases, based on his tests. Still, he said many of Q’s responses are too simple and often lack broader context, which may be unappealing to more advanced cloud customers.

“For now, Q definitely makes AWS more approachable to new users. But I believe power users will find it more difficult to take advantage of it,” Hunt said.

‘Human in the loop’

The bigger concern for Q is its propensity to hallucinate, people familiar with the project said.

For example, during the pre-launch testing period, Amazon employees found that Q was providing inaccurate pricing details and made-up product information, one of the people said. At one point, if the answer contained a competitor’s name, like Oracle, it would be blocked out for unknown reasons, this person said. Platformer previously reported on similar problems.

In response, Amazon is beefing up Q’s human evaluation, a common AI practice known as “human in the loop,” people involved in the project told BI. The company had this process in place before launch, manually checking the accuracy and quality of Q’s answers, while controlling bias. However, there’s a task force now to ramp up these efforts. Hallucination is one of the key areas this team is focused on addressing, one of the people said.

“When Q came out, people realized how bad it was,” one of the people said. “The task force is to improve it.”

‘Lack of leadership’

Hallucinations are a common problem among AI chatbots. Other companies, like Microsoft and Google, also saw their chatbots share inaccurate information during previous public demos.

Still, Corey Quinn of the Duckbill Group, a company that helps customers manage AWS bills, told BI that Q’s shortcomings reflect Amazon’s “lack of leadership” in the AI space. AWS may be the market leader in cloud computing, but that has created a “delusion” and “sense of entitlement” about its market position in AI, he said.

Quinn previously tweeted a series of inaccurate answers he found on Q. He also published his findings in a separate blog post titled “AWS’s (de) Generative AI Blunder.” It’s unclear how many of these issues have been fixed.

“Are customers helped or hindered by having a bot giving plausible-yet-wrong information?” Quinn told BI in an email.

Jockeying for resources

Another challenge for the Q team is internal competition for AWS’s computing capacity.

The emergence of generative AI has dramatically increased demand for GPUs from Nvidia and other suppliers. That means AWS often has to prioritize external customers over internal tests, which further slows down Q’s development, one of the people said.

Q is only one part of Amazon’s three-layer approach to AI. The first layer is user apps, like Q, built on top of AI language models. The second layer comprises the large language models themselves, such as Claude, Llama 2, and Amazon’s own Titan. The third layer is the computing power and chips, including Amazon’s Trainium and Inferentia AI cloud chips along with Nvidia GPUs.

Something “good,” not “something asap”

Amazon’s race to catch up in AI and the intense competition have created what some employees dub “AI fatigue,” as BI previously reported. AWS executives say it’s very early, and it’s unlikely one model or application will “rule” the AI landscape, as AWS CEO Adam Selipsky recently told employees in an internal all-hands meeting.

“It is still early,” Selipsky said. “I don’t even know if it’s Day One. I don’t know if it’s Day 0.1 or something.”

Some AWS employees, however, say it feels like the company is in a mad dash to release new products, even if they are subpar. Amazon hurried Q’s launch, for example, in part so it could announce the product at re:Invent, AWS’s big annual conference in late November, they said.

“Q came up very abruptly,” one of the people said. “We need to build something good in generative AI — not something asap.”

Do you work at Amazon? Got a tip?

Contact the reporter Eugene Kim via the encrypted-messaging apps Signal or Telegram (+1-650-942-3061) or email (ekim@businessinsider.com). Reach out using a nonwork device. Check out Business Insider’s source guide for other tips on sharing information securely.


