Node) Extracting the most frequently used words from Google News

Juzdalua 2023. 3. 7. 19:10

I used compromise (imported as nlp), a lightweight natural language processing library.

One drawback was that certain words were not recognized as nouns.

So I also used keyword-extractor (imported as keyword_extractor), a simple word extraction library.
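To make the idea of stop-word based extraction concrete, here is a minimal sketch of what such a step does in principle. It is not keyword-extractor's actual implementation: the function name extractKeywords and the tiny STOP_WORDS list are hypothetical, chosen only for illustration.

```typescript
// Hypothetical, deliberately tiny stop-word list; real libraries ship much larger ones.
const STOP_WORDS = new Set(["a", "an", "and", "as", "at", "of", "the", "to"]);

// Lowercase the text, split on whitespace, trim punctuation from both ends of each token,
// then drop empty tokens, digit-only tokens, and stop words.
function extractKeywords(text: string): string[] {
  return text
    .toLowerCase()
    .split(/\s+/)
    .map((token) => token.replace(/^[^a-z0-9]+|[^a-z0-9]+$/g, ""))
    .filter(
      (token) =>
        token.length > 0 && !/^\d+$/.test(token) && !STOP_WORDS.has(token)
    );
}

console.log(extractKeywords("Knicks, Nuggets rise as Bucks stay at top"));
// [ 'knicks', 'nuggets', 'rise', 'bucks', 'stay', 'top' ]
```

A sketch like this has no notion of parts of speech, which is why it does not depend on a word being recognized as a noun.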

From three NBA article titles, I looked for words that were counted more than five times.
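The counting and threshold step can be sketched on its own, without either library. The helper names countWords and keepFrequent are hypothetical, and the sample word lists are made up for illustration. Each list is visited exactly once here, so the counts equal the real number of occurrences.

```typescript
// Count how often each word appears across all word lists, visiting each list once.
function countWords(wordLists: string[][]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const words of wordLists) {
    for (const word of words) {
      counts[word] = (counts[word] ?? 0) + 1;
    }
  }
  return counts;
}

// Keep only the words whose count is at least minCount.
function keepFrequent(
  counts: Record<string, number>,
  minCount: number
): Record<string, number> {
  const kept: Record<string, number> = {};
  for (const word of Object.keys(counts)) {
    if (counts[word] >= minCount) {
      kept[word] = counts[word];
    }
  }
  return kept;
}

// Made-up sample input: one word list per headline.
const sampleCounts = countWords([
  ["suns", "nba", "power"],
  ["power", "rankings", "knicks"],
  ["nba", "rankings", "knicks"],
]);
console.log(keepFrequent(sampleCounts, 2));
// { nba: 2, power: 2, rankings: 2, knicks: 2 }
```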

import nlp from 'compromise';
import keyword_extractor from "keyword-extractor";

(async () => {
  
  const news: { [key: string]: string[] } = {};
  const keywords: { [key: string]: number } = {};
  const mostUsedKeywords: { [key: string]: number } = {};

  const normalrizeOptions = {
    contractions: true,    // turn "isn't" to "is not"
    possessives: false,    // turn "Google's tax return" to "Google tax return"
    plurals: false,    // turn "batmobiles" into "batmobile"
  }

  const extractOptions = {
    // language:"english",
    remove_digits: true,
    return_changed_case:true,
    return_chained_words: false,
    remove_duplicates: false,
    return_max_ngrams: 1
  }
  
  const titles = [
  "'Suns' Devin Booker named NBA Western Conference Player of the ... - Yahoo Sports",
  "Power Rankings, Week 21: Knicks, Nuggets rise as Bucks stay at top - NBA.com",
  "NBA Power Rankings: Knicks soar to brink of contention; over/unders confidence check - The Athletic"
  ]
  
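  // NOTE: the outer for loop repeats the forEach pass once per title,
  // so with three titles every word is counted three times.
  // The counts in the output below reflect this; drop the outer loop to count each title once.
  for (let i = 0; i < titles.length; i++) {
    titles.forEach((item) => {
      // Strip the trailing " - <publisher>" part of the headline.
      const title = item.split(" - ")[0];
      const doc = nlp(title);
      const normalizedTitle = doc.normalize(normalrizeOptions).out('text');

      const wordsList: string[] = keyword_extractor.extract(normalizedTitle, extractOptions);
      wordsList.forEach((word: string) => {
        // Tally each extracted word.
        keywords[word] = (keywords[word] ?? 0) + 1;
      });
    });
  }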
  
  console.log(Object.entries(keywords))
  Object.entries(keywords).forEach((keywordSet) => {
    if(keywordSet[1] > 5){
      mostUsedKeywords[keywordSet[0]] = keywordSet[1]
    }
  });

  // console.log(mostUsedKeywords)

  const sortable = Object.entries(mostUsedKeywords)
    .sort(([, a], [, b]) => a - b)
    .reduce((r, [k, v]) => ({ ...r, [k]: v }), {});

  console.log(sortable);
})();

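The output is as follows. Because the outer for loop repeats the forEach pass once per title, each count is three times the actual number of occurrences in the three titles.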

[
  [ 'suns', 3 ],        [ 'devin', 3 ],
  [ 'booker', 3 ],      [ 'named', 3 ],
  [ 'nba', 6 ],         [ 'western', 3 ],
  [ 'conference', 3 ],  [ 'player', 3 ],
  [ 'power', 6 ],       [ 'rankings', 6 ],
  [ 'week', 3 ],        [ 'knicks', 6 ],
  [ 'nuggets', 3 ],     [ 'rise', 3 ],
  [ 'bucks', 3 ],       [ 'stay', 3 ],
  [ 'top', 3 ],         [ 'soar', 3 ],
  [ 'brink', 3 ],       [ 'contention', 3 ],
  [ 'over/unders', 3 ], [ 'confidence', 3 ],
  [ 'check', 3 ]
]
{ nba: 6, power: 6, rankings: 6, knicks: 6 }
✨  Done in 0.70s.
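The sort in the code above is ascending and rebuilds an object, which makes no visible difference when every remaining count is equal. If a ranked list is wanted instead, one possible approach is to keep the entries as an array sorted in descending order. The helper name topWords and the sample counts are hypothetical.

```typescript
// Return up to `limit` [word, count] pairs, highest count first;
// ties are broken alphabetically so the order is deterministic.
function topWords(
  counts: Record<string, number>,
  limit: number
): Array<[string, number]> {
  return Object.entries(counts)
    .sort(([wordA, a], [wordB, b]) => b - a || wordA.localeCompare(wordB))
    .slice(0, limit);
}

// Made-up sample counts.
console.log(topWords({ nba: 2, suns: 1, knicks: 2, power: 2 }, 3));
// [ [ 'knicks', 2 ], [ 'nba', 2 ], [ 'power', 2 ] ]
```

Returning an array avoids relying on object key order to carry the ranking.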

https://www.npmjs.com/package/compromise

https://www.npmjs.com/package/keyword-extractor