
Node) Extracting the most frequently used words from Google News

Juzdalua 2023. 3. 7. 19:10

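I used compromise (imported as nlp), a lightweight natural language processing library.

One inconvenience was that it did not recognize certain words as nouns.

So I also used keyword_extractor, a simple keyword-extraction library that splits a string into words and drops stop words.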
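To make the idea of stop-word-based extraction concrete, here is a minimal sketch in plain TypeScript. It is not how keyword_extractor is implemented; `STOP_WORDS` and `extractKeywords` are hypothetical names, and the stop-word list is a tiny made-up sample rather than a real one.

```typescript
// Hypothetical, deliberately tiny stop-word list; real libraries ship much larger ones.
const STOP_WORDS = new Set(["a", "an", "and", "as", "at", "of", "the", "to"]);

// Lowercase the text, split on whitespace, trim punctuation from both ends of
// each token, then drop empty tokens, pure numbers, and stop words.
function extractKeywords(text: string): string[] {
  return text
    .toLowerCase()
    .split(/\s+/)
    .map((token) => token.replace(/^[^a-z0-9]+|[^a-z0-9]+$/g, ""))
    .filter(
      (token) =>
        token.length > 0 && !/^\d+$/.test(token) && !STOP_WORDS.has(token)
    );
}

const sample = extractKeywords(
  "Power Rankings, Week 21: Knicks, Nuggets rise as Bucks stay at top"
);
console.log(sample);
// → ["power", "rankings", "week", "knicks", "nuggets", "rise", "bucks", "stay", "top"]
```

Duplicates are kept on purpose, so the result can be fed straight into a frequency count.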

From the titles of three NBA articles, I picked out the words that are used at least twice across the titles.

import nlp from 'compromise';
import keyword_extractor from 'keyword-extractor';

// Word -> number of occurrences across all titles.
const keywords: { [key: string]: number } = {};
// Subset of `keywords` that meets the frequency threshold.
const mostUsedKeywords: { [key: string]: number } = {};
const MIN_COUNT = 2;

// Options for compromise's normalize().
const normalizeOptions = {
  contractions: true,   // expand contractions: "isn't" -> "is not"
  possessives: false,   // if true: "Google's tax return" -> "Google tax return"
  plurals: false,       // if true: "batmobiles" -> "batmobile"
};

// Options for keyword_extractor.extract().
const extractOptions = {
  // language: "english",
  remove_digits: true,
  return_changed_case: true,    // return keywords in lower case
  return_chained_words: false,
  remove_duplicates: false,     // keep repeats so they can be counted
  return_max_ngrams: 1,
};

const titles = [
  "'Suns' Devin Booker named NBA Western Conference Player of the ... - Yahoo Sports",
  "Power Rankings, Week 21: Knicks, Nuggets rise as Bucks stay at top - NBA.com",
  "NBA Power Rankings: Knicks soar to brink of contention; over/unders confidence check - The Athletic",
];

// Visit each title exactly once and count every extracted keyword.
titles.forEach((item) => {
  // Drop the trailing " - <publisher>" suffix.
  const title = item.split(' - ')[0];
  const normalizedTitle = nlp(title).normalize(normalizeOptions).out('text');

  const wordsList: string[] = keyword_extractor.extract(normalizedTitle, extractOptions);
  wordsList.forEach((word) => {
    keywords[word] = (keywords[word] ?? 0) + 1;
  });
});

console.log(Object.entries(keywords));

// Keep only the words that reach the threshold.
Object.entries(keywords).forEach(([word, count]) => {
  if (count >= MIN_COUNT) {
    mostUsedKeywords[word] = count;
  }
});

// Sort by count in ascending order and rebuild an object.
const sortable = Object.entries(mostUsedKeywords)
  .sort(([, a], [, b]) => a - b)
  .reduce<{ [key: string]: number }>((r, [k, v]) => ({ ...r, [k]: v }), {});

console.log(sortable);

The output is as follows.

[
  [ 'suns', 1 ],        [ 'devin', 1 ],
  [ 'booker', 1 ],      [ 'named', 1 ],
  [ 'nba', 2 ],         [ 'western', 1 ],
  [ 'conference', 1 ],  [ 'player', 1 ],
  [ 'power', 2 ],       [ 'rankings', 2 ],
  [ 'week', 1 ],        [ 'knicks', 2 ],
  [ 'nuggets', 1 ],     [ 'rise', 1 ],
  [ 'bucks', 1 ],       [ 'stay', 1 ],
  [ 'top', 1 ],         [ 'soar', 1 ],
  [ 'brink', 1 ],       [ 'contention', 1 ],
  [ 'over/unders', 1 ], [ 'confidence', 1 ],
  [ 'check', 1 ]
]
{ nba: 2, power: 2, rankings: 2, knicks: 2 }
✨  Done in 0.70s.
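The script sorts the filtered counts in ascending order and stores them in an object. If a ranked list with the most frequent word first is more convenient, an array of pairs may be a better fit, since an array keeps its order explicitly. A minimal sketch, where `topWords` is a hypothetical helper and the sample counts are made up for illustration:

```typescript
// Return [word, count] pairs whose count is at least minCount, sorted by
// count with the highest first. Ties keep their original insertion order
// because Array.prototype.sort is stable.
function topWords(
  counts: Record<string, number>,
  minCount: number
): [string, number][] {
  return Object.entries(counts)
    .filter(([, count]) => count >= minCount)
    .sort(([, a], [, b]) => b - a);
}

// Made-up counts, only to show the shape of the result.
const exampleCounts: Record<string, number> = {
  suns: 1,
  nba: 2,
  power: 2,
  week: 1,
  knicks: 3,
};

console.log(topWords(exampleCounts, 2));
// → [ [ 'knicks', 3 ], [ 'nba', 2 ], [ 'power', 2 ] ]
```

Returning pairs instead of an object also avoids surprises with keys that look like integers, which JavaScript objects always list first regardless of insertion order.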

 

https://www.npmjs.com/package/compromise

 


https://www.npmjs.com/package/keyword-extractor

 
