I am learning about the TF-IDF vectorizer, among other things. Writing my thesis is still my priority, but I use whatever spare minutes I have to get some insight into ML algorithms.
I have become interested in TF-IDF (term frequency-inverse document frequency) because it has been described to me as the gold standard for capturing what a short document is about. In fact, Wikipedia tells me that a 2015 survey found that 83% of text-based recommender systems in digital libraries used TF-IDF.
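TF-IDF is pretty simple. It is an encoder: it takes some text (anything from a single word to a tweet to an entire document) and outputs a fixed-length vector of real numbers, one entry per term in the vocabulary. Here is how it works: each term gets a weight that grows with how often it appears in the document and shrinks with how many documents in the collection contain it.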
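To make that concrete for myself, here is a minimal sketch in plain Python. It assumes whitespace tokenization and the textbook weighting (term frequency times log(N / df)); the function name `tf_idf` and the toy documents are my own illustration, not any library's API. As far as I know, libraries such as scikit-learn default to a smoothed and normalized variant, so their numbers will differ from these.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (vocab, vectors) for a list of document strings.

    Illustrative sketch only: lowercases, splits on whitespace, and uses
    tf = count / document length and idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary fixes the vector length: one slot per distinct term.
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # An empty document gets an all-zero vector.
        vectors.append(
            [(counts[term] / total) * idf[term] if total else 0.0 for term in vocab]
        )
    return vocab, vectors


docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, vectors = tf_idf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, [round(w, 3) for w in vec])
```

In this toy corpus "the" appears in every document, so its weight is zero everywhere, while "dog" (which appears in only one document) gets the largest weight in its document. That is the whole intuition: common words are discounted, distinctive words stand out.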
I also read/heard some news about the following topics:
- How weird it is that French corporations are finding their borrowing costs lower than the French government's. Fitch has downgraded France to A+ (I think).
- Erdogan, Turkey's president, is trying to get rid of the main opposition figure. His party, the AKP, would lose an election held today, according to public opinion polls.
- Germanium is very important, and China has made it hard to access after the US imposed export controls in 2023. Germanium is essential for producing the thermal-imaging tools that are key in military equipment like fighter jets. Ufff! US companies such as Lockheed Martin are looking for alternative suppliers.
- There is a sub-prime lending scandal brewing in the auto industry in the US. Experts say it's not a repeat of 2008, most likely because it's happening in a less exposed sector. Tricolor sells cheap used cars on credit, mostly to Hispanics in Texas, California, and Nevada. Tricolor is accused of fraud: they would borrow money from the big boys (JP Morgan included) and then lend it on to desperate customers. Afterwards, they would bundle these loans into asset-backed securities and sell them on some weird market to pay back the big boys.
- US universities are facing tough questions regarding security and safe debate spaces.