☠️ DarkBERT is ChatGPT for the Dark Web

Part linguistic exercise, part AI police?

Hello, and thanks for reading One AI Thing. Get smarter about artificial intelligence, one thing at a time.

👀 Today’s Thing: DarkBERT is ChatGPT for the Dark Web

🤖 A team of South Korean researchers just published “DarkBERT: A Language Model for the Dark Side of the Internet,” a paper detailing their newly trained AI. DarkBERT, like ChatGPT and Google Bard, is a large language model (LLM) trained on vast amounts of data found online. Unlike those “vanilla” LLMs, however, DarkBERT was trained on data from the dark web, an anonymized and intentionally hidden part of the Internet that most folks will never set a virtual foot in.

🎧 How do computers process language, anyway? Way back in 2020, I spoke with Hugging Face’s Sam Shleifer about the natural language processing (NLP) technology powering all kinds of chat applications.

📖 Backstory

☞ Researchers crawled the dark web via the Tor network, gathering information for a dark web database they created after filtering the raw data.

☞ The South Korean team fed the data to RoBERTa, a language model released in 2019 as a more thoroughly trained version of Google's BERT. The result, DarkBERT, is a model able to analyze the coded and dialect-heavy language used on the dark web.

☞ Dark Web training corpus or not, the DarkBERT team claims their model is even better than other LLMs in the wild today: “Our evaluations show that DarkBERT outperforms current language models and may serve as a valuable resource for future research on the Dark Web.”
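
🧪 Curious what "filtering the raw data" might look like in practice? Here's a minimal Python sketch, not the researchers' actual pipeline. The function names, the word-count threshold, and the choice of steps (drop near-empty pages, drop exact duplicates) are my own assumptions, picked only to illustrate the idea of cleaning a crawled corpus before training.

```python
import hashlib
import re

# Hypothetical threshold: pages with fewer words than this are treated as noise.
MIN_WORDS = 5


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def filter_pages(pages: list[str], min_words: int = MIN_WORDS) -> list[str]:
    """Return pages that are long enough and not duplicates of an earlier page.

    This is an illustrative filter only; the real DarkBERT preprocessing
    is described in the paper and is not reproduced here.
    """
    seen_hashes = set()
    kept = []
    for page in pages:
        norm = normalize(page)
        # Skip pages with too little text to be useful for training.
        if len(norm.split()) < min_words:
            continue
        # Skip pages whose normalized text has already been seen.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        kept.append(page)
    return kept


if __name__ == "__main__":
    crawled = [
        "Hello world",
        "this is a longer page about something",
        "This  is a longer page about   something",
        "another page with enough words here",
    ]
    for page in filter_pages(crawled):
        print(page)
```

The cleaned list is what you would then hand to a training job. Real pipelines typically go much further than this toy version.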

🔑 Keys to Understanding

🥇 As the researchers put it, “Many of the underground activities prevalent in the Dark Web are immoral/illegal in nature, ranging from content hosting such as data leaks to drug sales.” As such, they highlighted the potential of DarkBERT as a cybersecurity and law enforcement tool.

🥈 This table from the DarkBERT paper gives a glimpse into the kinds of things anonymous types have been up to on the Dark Web (DUTA and CoDA are the names of two publicly available Dark Web datasets):

🥉 Victor Tangermann, Futurism: “The team suggests DarkBERT could be used for a variety of cybersecurity-related tasks, such as detecting sites that sell ransomware or leak confidential data. It could also be used to crawl through the countless dark web forums that get updated daily and monitor them for any exchange of illicit information.

“Overall, we'll believe it when we see it. But even if the system works as intended, do we really want to start letting AI police the internet?”

🕵️ Need More?

Searching for a certain kind of AI thing? Reply to this email and let me know what you'd like to see more of.

Until the next thing,

- Noah

p.s. Want to sign up for the One AI Thing newsletter or share it with a friend? You can find me here.