The AI Supply Chain Is Not Impervious

Joao Correia

January 25, 2024 - Technical Evangelist

AI was the leading story of 2023. For context, the ChatGPT entry was Wikipedia’s most viewed article of the year, and numerous organizations worldwide have already deployed AI in testing or production. Whether these organizations have realized its benefits or are still exploring its potential, AI’s rise over such a short period is hard to dispute.

However, with the rise of AI, and large language models in particular, came security concerns. Through direct web interfaces or API calls, several attack vectors were identified that could compromise the language models, training data, or user data. The risks were amplified when AI was integrated into third-party applications.

As 2023 drew to a close, security researchers uncovered another potential vulnerability in AI platforms: the exploitation of publicly exposed Hugging Face API tokens. This issue is reminiscent of the unsecured public Amazon S3 buckets of yore (so 2022!). It was discovered that many prominent organizations, including Meta and other AI companies, had either hard-coded these tokens in publicly accessible code repositories or published them on various websites. Analysis revealed that these tokens provided read and, crucially, write access to the underlying data, including the datasets used to train AI models. Manipulating this data could insert false, misleading, or malicious content into the training set, which would in turn affect user-facing applications and portals. Training data is also among an AI company’s most valuable assets, and abuse of these tokens could lead to its loss or corruption in ways that are hard to detect.

Hugging Face has acknowledged the issue, invalidated the exposed tokens, and is working on more fine-grained controls for token usage.

This situation bears similarities to other supply chain incidents in which credentials were exposed in public code repositories, a problem that prompted initiatives like GitHub’s Secret Scanning feature. It underscores a fundamental principle: mishandling secrets, such as posting credentials publicly, invites misuse. Yet this oversight persists.
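To illustrate the kind of check a secret scanner performs, here is a minimal sketch in Python. It assumes that Hugging Face user tokens start with the prefix "hf_" followed by a long run of letters and digits; the exact pattern, and the helper names find_token_candidates and scan_tree, are illustrative assumptions and not part of any official tool. A dedicated scanner covers many more credential types and should be preferred in practice.

```python
import re
from pathlib import Path

# Assumption: tokens look like "hf_" followed by 30 or more letters/digits.
# This pattern is illustrative and may not match every real token format.
TOKEN_PATTERN = re.compile(r"\bhf_[A-Za-z0-9]{30,}\b")


def find_token_candidates(text):
    """Return (line_number, redacted_match) pairs for token-like strings."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in TOKEN_PATTERN.finditer(line):
            # Keep only the first few characters so the report itself
            # does not leak the secret.
            hits.append((lineno, match.group(0)[:6] + "..."))
    return hits


def scan_tree(root):
    """Scan every readable file under root, skipping the .git directory."""
    results = {}
    for path in Path(root).rglob("*"):
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip rather than abort the scan
        hits = find_token_candidates(text)
        if hits:
            results[str(path)] = hits
    return results


if __name__ == "__main__":
    for file_name, hits in scan_tree(".").items():
        for lineno, redacted in hits:
            print(f"{file_name}:{lineno}: possible token {redacted}")
```

Running a check like this before each commit, for example from a pre-commit hook, catches a token-like string before it ever reaches a public repository.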

Therefore, it is essential to treat access tokens like any other sensitive credential: keep them private and secure. For CI/CD pipelines, consider using environment variables or other secure methods that don’t involve hardcoding secrets in your project’s code. At the very least, if secrets must live in local files, ensure those files are added to Git’s ignore list (.gitignore) so they are never committed.
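As a minimal sketch of the environment variable approach, the Python snippet below reads a token at runtime instead of embedding it in source code. The variable name HF_TOKEN and the helpers load_token and redact are assumptions made for illustration; use whatever name your CI/CD system injects.

```python
import os


def load_token(var_name="HF_TOKEN", environ=None):
    """Read an access token from the environment instead of source code.

    var_name is an assumed variable name; environ can be overridden for tests.
    """
    env = os.environ if environ is None else environ
    token = env.get(var_name, "").strip()
    if not token:
        # Fail early and loudly rather than falling back to a hardcoded value.
        raise RuntimeError(
            f"{var_name} is not set; configure it in your CI/CD secret store"
        )
    return token


def redact(token):
    """Return a short, safe-to-log form of the token."""
    return token[:4] + "..." if len(token) > 4 else "***"


if __name__ == "__main__":
    try:
        print("Loaded token:", redact(load_token()))
    except RuntimeError as err:
        print("Configuration error:", err)
```

Because the token never appears in the repository, rotating it only requires updating the secret store, and logs show just the redacted form.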

Lastly, to safeguard your Java projects from potentially compromised packages, always rely on trusted and vetted sources like SecureChain for Java to help you secure your applications.

Summary