Baker Tech Note: June 2012

Sunday, June 17, 2012

[Twitter Streaming API] Shouting "apple" at the center of Twitter

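I am looking into collecting information from Twitter's global timeline. From what I found, the Twitter Streaming API is the best fit, so I tried it right away.

The Twitter Streaming API does not behave like the request/response REST style used by ordinary web application APIs. You send a single request, and once the connection to Twitter is established, it keeps returning tweets in a steady trickle. The client waits for them and keeps processing them. In other words, it is a push-style API.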

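People shouting "apple" at the center of Twitter

I use Node.js.

The prototype uses the filter endpoint to pull tweets containing the keyword _apple_ from the global timeline. Output goes to standard output. To stop it, press Ctrl-C (not very elegant).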

// Saved as app.js
var https = require('https'); // Connect using the https module.
var host  = 'stream.twitter.com'; // Host that serves the Streaming API.

var find_one = 'apple';

// Options passed to the get method
var options = {
    host : host,
    port : 443,
    // Use the filter endpoint. I am not entirely sure about this part.
    path : '/1/statuses/filter.json?track=' + find_one,
    auth : 'TWITTERID:PASSWORD' // Set your own Twitter user ID and password
};

// Obtain the request object
var req = https.get(options,
  function(response){
    console.log('Response: ' + response.statusCode);
  }
);

// When the request object receives the "response" event
req.on('response', function(response){
    // Each "data" event delivers one chunk of the response body
    response.on('data', function(chunk){
        // Parse the received data as JSON.
        // A chunk is not guaranteed to be exactly one tweet, so skip anything unparsable.
        var tweet;
        try {
            tweet = JSON.parse(chunk);
        } catch (e) {
            return;
        }
        // Only handle messages that have a user property with a name property
        if(tweet && tweet.user && 'name' in tweet.user){
            // Tweets in kanji/Japanese arrive unfiltered, so drop them here.
            // This is a bit sloppy (something to improve later).
            // indexOf returns 0 for a match at the very start, so compare with >= 0.
            if(tweet.text.toLowerCase().indexOf(find_one) >= 0){
                console.log('[' + tweet.user['name'] + ']\n' + tweet.text);
            }
        }
    });
});

// When the request object receives the "error" event
req.on('error', function(e){
  console.log(e);
});

// Keep the program from stopping on an unexpected exception
process.on('uncaughtException', function(err){
  console.log('uncaughtException: ' + err);
});

Let's run it.

$ node app.js 
Response: 200
[Q U E]
so my Screen just went black!! I don't feel like going downtown to the Apple store today
[Yoel Wonsey]
@JmSamuel1415 Just go eat a rocky mountain apple
[jose antonio rivera ]
RT @ComediaChistes: — Oye, me enteré que Apple demandó a tu novia por violación de patentes. —¿Si? ¿Por qué? —Porqué está tan plana como ...
[iPhone iPad Actu]
#iphone #geek Post-Jobs, Apple unleashes new iPhone http://t.co/nYJFdAYA #ipod #ipad #apple
[iPhone iPad Actu]
#iphone #geek Steel Battalion Heavy Armor Gameplay Partie 2HD (fr) http://t.co/kqwcvnBt #ipod #ipad #apple
[danielle Bergman]
RT @AcneSkinSite: An apple a day keeps the acne away! Honey helps restore a youthful look! Try an Apple Honey Mask for endless benefits& ...
[Dol Noppadol]
RT @JAlLBREAK: Apple Now Throws in a MagSafe 2 Converter with Every Thunderbolt Display http://t.co/Jdi486kK
[Alniedawn F E]
RT @AcneSkinSite: An apple a day keeps the acne away! Honey helps restore a youthful look! Try an Apple Honey Mask for endless benefits& ...
[TAYLOR Ann'Marie]
Shorty jus gave me sum fye shxt..apple bacardi sprite tini! Fye

(and so on, endlessly)
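One thing worth noting about the prototype: a `data` event hands over whatever bytes have arrived so far, which is not necessarily one whole tweet. As far as I understand the Streaming API, messages are separated by `\r\n` and blank lines may be sent as keep-alives, so a client can buffer the chunks and parse only complete lines. The following is a minimal sketch of that idea, not a definitive implementation. `createMessageSplitter` is a hypothetical helper name, and it assumes the chunks arrive as strings (for example after calling `response.setEncoding('utf8')`).

```javascript
// Hypothetical helper: buffers incoming chunks and calls onMessage
// once for every complete, parsable JSON line.
function createMessageSplitter(onMessage) {
  var buffer = '';
  return function (chunk) {
    buffer += chunk;                     // assumes chunk is already a string
    var lines = buffer.split('\r\n');
    buffer = lines.pop();                // last piece may be incomplete; keep it
    lines.forEach(function (line) {
      if (line.trim() === '') { return; } // skip blank keep-alive lines
      var message;
      try {
        message = JSON.parse(line);
      } catch (e) {
        return;                          // ignore lines that are not valid JSON
      }
      onMessage(message);
    });
  };
}

// Simulate one tweet arriving split across two chunks.
var seen = [];
var feed = createMessageSplitter(function (msg) { seen.push(msg.text); });
feed('{"text":"red ap');
feed('ple"}\r\n\r\n{"text":"green apple"}\r\n');
console.log(seen);
```

In app.js this could be wired up roughly as `response.on('data', createMessageSplitter(handleTweet))`, where `handleTweet` stands for whatever function prints the tweet.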

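Open issues

I put this prototype together while reading various people's blogs, but several things still bother me.

According to the Twitter Streaming API documentation, the filter endpoint should be called with the POST method, yet GET worked.
Tweets in kanji apparently are not caught by the filter and get sent through anyway. I may need to check whether a tweet is English-only.
I should use a developer account.
In any case, I need to read the Twitter Streaming API documentation carefully.
I need to study Node.js's http and https modules and the process object again.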
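As a starting point for the "unfiltered kanji tweets" issue, here is a rough sketch of a stricter client-side check. It rejects text containing hiragana, katakana, or CJK ideographs and then looks for the keyword case-insensitively. The function names and the character ranges are my own assumptions, and this is not real language detection: a Spanish tweet that mentions apple would still pass.

```javascript
// Assumed character ranges: hiragana/katakana, CJK ideographs, half-width katakana.
var CJK = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uff66-\uff9f]/;

// True when the text contains at least one Japanese/Chinese character.
function hasCjk(text) {
  return CJK.test(text);
}

// Case-insensitive substring match; also matches at position 0.
function mentions(text, keyword) {
  return text.toLowerCase().indexOf(keyword.toLowerCase()) !== -1;
}

// Hypothetical filter: keep a tweet only if it has text, has no CJK
// characters, and mentions the keyword.
function keepTweet(tweet, keyword) {
  return typeof tweet.text === 'string' &&
         !hasCjk(tweet.text) &&
         mentions(tweet.text, keyword);
}

console.log(keepTweet({ text: 'Apple store today' }, 'apple'));
console.log(keepTweet({ text: 'appleを買った' }, 'apple'));
```

The first call should report that the tweet is kept and the second that it is dropped. A proper solution would probably look at the tweet's language metadata or use a language detection library instead.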

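The module I plan to try next is ntwitter.

ntwitter: a handy module for working with the Twitter Streaming API from Node.js

About the process object

From the process section of the Node.js documentation:

Event: 'uncaughtException'
function (err) { }
Emitted when an exception bubbles all the way back to the event loop. If a listener is added for this exception, the default action (which is to print a stack trace and exit) will not occur.

Part of the Twitter Streaming API documentation

I have quoted some of the Streaming API material from the Twitter API documentation. I will translate it into Japanese later.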

Overview

The set of streaming APIs offered by Twitter give developers low latency access to Twitter's global stream of Tweet data. A proper implementation of a streaming client will be pushed messages indicating Tweets and other events have occurred, without any of the overhead associated with polling a REST endpoint.

Twitter offers several streaming endpoints, each customized to certain use cases.

Public streams: Streams of the public data flowing through Twitter. Suitable for following specific users or topics, and data mining.
User streams: Single-user streams, containing roughly all of the data corresponding with a single user's view of Twitter.
Site streams: The multi-user version of user streams. Site streams are intended for servers which must connect to Twitter on behalf of many users.

An app which connects to the Streaming APIs will not be able to establish a connection in response to a user request, as shown in the above example. Instead, the code for maintaining the Streaming connection is typically run in a process separate from the process which handles HTTP requests:

FAQ から

How do I use the Twitter platform?

Twitter offers a platform with a number of different ways to interact with it.

Web Intents, Tweet Button and Follow Button is the simplest way to bring basic Twitter functionality to your site. It provides features like the ability to tweet, retweet, or follow using basic HTML and Javascript. You can also embed individual tweets.

More complex integrations can utilize our REST, Search, and Streaming APIs. The Streaming API allows you to stream tweets in real time as they happen. The Search API provides relevant results to ad-hoc user queries from a limited corpus of recent tweets. The REST API allows access to the nouns and verbs of Twitter such as reading timelines, tweeting, and following.

To use the REST and Streaming API, you should register an application and get to know the ways of OAuth and explore Twitter Libraries.

What is the version of the REST API?

In the API documentation there is a version place marker in the example request URL. Currently only one version of the API exists, that version is 1. This means any REST API queries will be of the format: https://api.twitter.com/1/statuses/user_timeline.json. The Streaming API is currently on version 1 as well, while the Search API is unversioned.

How are rate limits determined on the Streaming API?

At any one moment of time there are X amount of tweets in the public firehose. You're allowed to be served up to 1% of whatever X is per a "streaming second."

If you're streaming from the sample hose at https://stream.twitter.com/1/statuses/sample.json, you'll receive a steady stream of tweets, never exceeding 1% X tweets in the public firehose per "streaming second."

If you're using the filter feature of the Streaming API, you'll be streamed Y tweets per "streaming second" that match your criteria, where Y tweets can never exceed 1% of X public tweets in the firehose during that same "streaming second." If there are more tweets that would match your criteria, you'll be streamed a rate limit message indicating how many tweets fell outside of 1%.
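To make the 1% rule concrete for myself, here is a small arithmetic sketch. The function name, the example numbers, and the choice to round the cap down are all my own assumptions; the FAQ does not say how the cap is rounded or what X actually is.

```javascript
// X = public tweets in one "streaming second", Y = tweets matching the filter.
// Returns how many matching tweets would be delivered and how many would be
// reported in the rate limit message, under the assumptions stated above.
function filterStreamDelivery(publicTweets, matchingTweets) {
  var cap = Math.floor(publicTweets / 100);        // 1% of X (rounding is assumed)
  var delivered = Math.min(matchingTweets, cap);   // Y can never exceed the cap
  return {
    cap: cap,
    delivered: delivered,
    limited: matchingTweets - delivered            // tweets that fell outside 1%
  };
}

// Made-up numbers: 5000 public tweets, 80 of them match the filter.
console.log(filterStreamDelivery(5000, 80));
```

With these made-up numbers the cap is 50 tweets, so 50 are delivered and the remaining 30 would be counted in the rate limit message.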

How do I keep from running into the rate limit?

Caching. We recommend that you cache API responses in your application or on your site if you expect high-volume usage. For example, don't try to call the Twitter API on every page load of your hugely popular website. Instead, call our API once a minute and save the response to your local server, displaying your cached version on your site. Refer to the Terms of Service for specific information about caching limitations.
Rate limiting by active user. If your site keeps track of many Twitter users (for example, fetching their current status or statistics about their Twitter usage), please consider only requesting data for users who have recently signed in to your site.
Scale your use of the API with the number of users you have. When using OAuth to authenticate requests with the API, the rate limit applied is specific to that user_token. This means, every user who authorizes your application to act on their behalf, has their own bucket of API requests for you to use.
Request only what you need, and only when you need it. For example, polling the REST API looking for new data is inefficient for both your application, and the Twitter API. Instead consider using one of the Streaming APIs as a signal of when to make REST API requests.
Consider using a combination of the APIs to achieve your goal. You can't do everything with one API, but by combining them you can do most things. For example, instead of using the Search API for all your querying, use the Streaming API to track keywords and follow users Tweets, and save the Search API for the more complex queries.

These are just some example strategies. To work out different solutions for you to achieve your goals, search through discussions on the topic or start your own.

How do password resets effect application authorization?

When using OAuth, application connectivity and permissions do not change when a user resets their password on twitter.com. The relationship between Twitter, a user, and a third-party application do not involve a username and password combination. When a Twitter user changes their password, we'll now ask the user whether they would also like to revoke any of their application authorizations, but any revocations are manually executed by the end user.

As of March 12, 2012 it is still possible to connect to the The Streaming APIs via Basic Auth credentials. If the password belonging to a user account that connects to the Streaming API via basic auth is changed, the new password will need to be used to regain that connection.

I don't want to require users to authenticate but 150 requests per hour is not enough for my app, what should I do?

Rethink not wanting to require authentication. It's the primary means to grow your application's capabilities. We recommend requiring authentication to make use of potentially 350 requests per hour per access token/user. Consider also investigating whether the Streaming API's follow filter will work for you.

How do I count favorites?

Favorite counts aren't available as part of tweet objects in the REST, Streaming or Search APIs at this time. User streams and Site streams both stream events when an authenticated user favorites tweets or has their tweets favorited. Using these authenticated streaming APIs, you can count favorites in real-time as they happen. This is currently the only scalable means to count favorite activity.

How do I count retweets?

Tweets in the REST and Streaming APIs contain a field called retweet_count that provides the number of times that tweet has been retweeted. You can obtain the retweet count for any arbitrary tweet by using GET statuses/show/:id.

You can count retweets as they happen by using a The Streaming APIs. In particular, User streams and Site streams allow you to be streamed retweet events about/around an authenticated user in real time.

I keep hitting the rate limit. How do I get more requests per hour?

REST & Search API Whitelisting is not provided. Resourceful use of more efficient REST API requests, authentication, and Streaming APIs will allow you to build your integration successfully without requiring whitelisting. Learn more about rate limits or see the rate limiting FAQ for more information.

Monday, June 11, 2012

Getting started with Part-of-Speech Tagging

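I am looking into extracting only the nouns from English text. Unlike Japanese, English separates words with spaces, so the morphological analysis that Japanese needs in order to split off particles such as "no", "wa", and "ga" appears to be simple. So, among natural language processing (NLP) techniques, I went looking for a tool that determines the part of speech of each word.

Tools of this kind are called POS taggers (POS tagging: Part-of-Speech tagging). After looking around, I found that the "Brill tagger", developed by Eric Brill in 1993, is well known and that quite a few tools use this engine.

http://en.wikipedia.org/wiki/Brill_tagger

Brill is reportedly working at Microsoft now. His original tool is described as "error-driven" and "transformation-based": apparently, after an initial analysis based on a dictionary and on context, the user corrects the result and the dictionary grows from those corrections.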
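To get a feel for what "transformation-based" means, here is a toy sketch: every word first gets its most likely tag from a dictionary, and then an ordered list of contextual rules rewrites tags based on the previous tag. The lexicon and the two rules are made up for illustration and are not taken from Brill's tagger or from any library. The "error-driven" part, where rules are learned by comparing the output against correctly tagged text, is left out entirely.

```javascript
// Toy lexicon (made up): most likely tag for each known word.
var LEXICON = { i: 'PRP', want: 'VBP', to: 'TO', can: 'MD', the: 'DT', run: 'NN' };

// Toy transformation rules (made up): change `from` to `to`
// when the previous word carries the tag `prevTag`.
var RULES = [
  { from: 'NN', to: 'VB', prevTag: 'TO' },
  { from: 'NN', to: 'VB', prevTag: 'MD' }
];

// Step 1: initial tagging from the dictionary; unknown words default to NN.
function initialTags(words) {
  return words.map(function (word) {
    var key = word.toLowerCase();
    var known = Object.prototype.hasOwnProperty.call(LEXICON, key);
    return [word, known ? LEXICON[key] : 'NN'];
  });
}

// Step 2: apply each rule in order across the whole sentence.
function applyRules(tagged) {
  var out = tagged.map(function (pair) { return [pair[0], pair[1]]; });
  RULES.forEach(function (rule) {
    for (var i = 1; i < out.length; i++) {
      if (out[i][1] === rule.from && out[i - 1][1] === rule.prevTag) {
        out[i][1] = rule.to;
      }
    }
  });
  return out;
}

console.log(applyRules(initialTags(['I', 'want', 'to', 'run'])));
console.log(applyRules(initialTags(['the', 'run'])));
```

In the first sentence "run" starts out as NN and is changed to VB because it follows TO. In the second it follows DT, no rule fires, and it stays NN.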

My goal this time is not to study natural language processing. It is enough if nouns are identified, and I wanted a JavaScript tool if possible, so I tried pos-js.

https://github.com/fortnightlabs/pos-js
http://code.google.com/p/jspos/

Installing pos-js

Install the library into the working directory.

pos-js (js pos: JavaScript part-of-speech tagger) also uses Eric Brill's engine.

    [baker@www tagger-js]$ npm install pos
    npm http GET https://registry.npmjs.org/pos
    npm http 200 https://registry.npmjs.org/pos
    npm http GET https://registry.npmjs.org/pos/-/pos-0.1.1.tgz
    npm http 200 https://registry.npmjs.org/pos/-/pos-0.1.1.tgz
    pos@0.1.1 ./node_modules/pos

Running the sample program

First, I tried the sample program from the project site.

The sample code, saved as sample.js in the working directory:

var pos = require('pos');
// Split the text into words with the lexer (tokenizer):
// create a Lexer, call its lex method, and keep the result in words as an array.
var words = new pos.Lexer().lex("This is some sample text. This text can contain multiple sentences.");
// Create a Tagger and tag the words (part-of-speech detection). The output is an array of [word, tag] arrays.
var taggedWords = new pos.Tagger().tag(words);
for(var i in taggedWords){           // declare i so it does not leak as a global
    var taggedWord = taggedWords[i];
    var word = taggedWord[0];        // the word
    var tag = taggedWord[1];         // the tag
    console.log(word + " /" + tag);
}

Here is the result.

$ node sample.js 
This /DT     # determiner
is /VBZ      # verb, present tense
some /DT     # determiner
sample /NN   # noun
text /NN     # noun
. /.         # period
This /DT     # determiner
text /NN     # noun
can /MD      # modal
contain /VB  # verb
multiple /JJ # adjective
sentences /NNS  # noun, plural
. /.         # period

Here is the tag list, quoted for reference. Note that NNP means a proper noun (singular).

    --- ----------------------- -------------
    TAG sense                   sample
    --- ----------------------- -------------
    CC Coord Conjuncn           and,but,or
    CD Cardinal number          one,two
    DT Determiner               the,some
    EX Existential there        there
    FW Foreign Word             mon dieu
    IN Preposition              of,in,by
    JJ Adjective                big
    JJR Adj., comparative       bigger
    JJS Adj., superlative       biggest
    LS List item marker         1,One
    MD Modal                    can,should
    NN Noun, sing. or mass      dog
    NNP Proper noun, sing.      Edinburgh
    NNPS Proper noun, plural    Smiths
    NNS Noun, plural            dogs
    POS Possessive ending       's
    PDT Predeterminer           all, both
    PP$ Possessive pronoun      my,one's
    PRP Personal pronoun        I,you,she
    RB Adverb                   quickly
    RBR Adverb, comparative     faster
    RBS Adverb, superlative     fastest
    RP Particle                 up,off
    SYM Symbol                  +,%,&
    TO "to"                     to
    UH Interjection             oh, oops
    URL url                     http://www.google.com/
    VB verb, base form          eat
    VBD verb, past tense        ate
    VBG verb, gerund            eating
    VBN verb, past part         eaten
    VBP Verb, present           eat
    VBZ Verb, present           eats
    WDT Wh-determiner           which,that
    WP Wh pronoun               who,what
    WP$ Possessive-Wh           whose
    WRB Wh-adverb               how,where
    , Comma                     ,
    . Sent-final punct          . ! ?
    : Mid-sent punct.           : ; --
    $ Dollar sign               $
    # Pound sign                #
    " quote                     "
    ( Left paren                (
    ) Right paren               )
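Since the goal is to pull out only the nouns, the tagger output can be filtered by tag. Below is a minimal sketch under one assumption of mine: that "noun" means every tag starting with NN (NN, NNS, NNP, NNPS). `extractNouns` is a hypothetical helper, and the input is copied from the sample.js run above so the snippet works without the pos module.

```javascript
// Hypothetical helper: keep the words whose tag starts with "NN".
// Expects the [word, tag] pairs that the tagger returns.
function extractNouns(taggedWords) {
  return taggedWords
    .filter(function (pair) { return /^NN/.test(pair[1]); })
    .map(function (pair) { return pair[0]; });
}

// [word, tag] pairs taken from the sample.js output.
var tagged = [
  ['This', 'DT'], ['is', 'VBZ'], ['some', 'DT'], ['sample', 'NN'],
  ['text', 'NN'], ['.', '.'], ['This', 'DT'], ['text', 'NN'],
  ['can', 'MD'], ['contain', 'VB'], ['multiple', 'JJ'],
  ['sentences', 'NNS'], ['.', '.']
];

console.log(extractNouns(tagged)); // [ 'sample', 'text', 'text', 'sentences' ]
```

In a real script the argument would be the result of `new pos.Tagger().tag(words)` instead of the hard-coded array.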

Sunday, June 3, 2012

A web-based statistical analysis environment: RStudio and MongoDB

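On a Sakura VPS, I set up an environment for statistical analysis that uses the NoSQL database MongoDB as the place where data is fetched from and stored.

Statistical analysis tool: R
Database: MongoDB
R library for connecting R and MongoDB: rmongodb
User interface for R: RStudio

The ultimate goal is an environment that automatically and periodically accumulates data in MongoDB and lets me analyze it in RStudio whenever needed.

Internet data -- (Spidering Tool) -- MongoDB -- rmongodb -- R -- RStudio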

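Installing R & RStudio

If you are already using the EPEL (Extra Packages for Enterprise Linux) repository, installing R is just a matter of running yum.
http://fedoraproject.org/wiki/EPEL
As I recall, it pulled in about 20 dependent packages.
RStudio is an IDE (integrated development environment) for the statistical analysis software R. Think of it as the Eclipse or Visual Studio of R. I installed the server edition.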
http://rstudio.org/

# Install R
# Assumes EPEL is already enabled
  $ sudo yum install R
# Download the RStudio rpm package and install it
  $ wget http://download2.rstudio.org/rstudio-server-0.96.228-x86_64.rpm
  $ sudo rpm -Uvh rstudio-server-0.96.228-x86_64.rpm

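Once installed, the server starts and the startup script is set up as well. RStudio listens on port 8787 by default, so make sure port 8787 is open. Going to http://yoursite.com:8787/ should bring up the login screen. Login uses the existing Linux user accounts as they are.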

rmongodb

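I installed MongoDB quite a while ago, so I will skip that part. I believe it could be installed with yum. I vaguely remember the installation taking a long time, with compiling and so on.
rmongodb is an R library that adds the ability to connect to MongoDB. R itself is not particularly rich in features for talking to data sources, and the stance seems to be that libraries should take care of that.
The rmongodb published on GitHub did not compile properly for me. Getting it from CRAN (the Comprehensive R Archive Network) seems to work.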

# Install rmongodb, the library that connects MongoDB and R
# The version distributed on GitHub fails to install for some reason, so fetch it from CRAN and install that
 $ wget http://cran.r-project.org/src/contrib/rmongodb_1.0.3.tar.gz 
 $ sudo R CMD INSTALL rmongodb_1.0.3.tar.gz

Trying out RStudio

First, I checked from the command line that rmongodb works.

$ R                 # start R from the command line
(output omitted)
> library(rmongodb) # load rmongodb
rmongodb package (mongo-r-driver) loaded
Use 'help("mongo")' to get started.

Now the same thing from RStudio.

Go to http://yoursite.com:8787/
Log in as a Linux user
Select the Packages tab in the bottom-right pane
Check rmongodb

The console then shows the same message as on the command line above (naturally).
It looks quite pleasant to use.

Inserting and retrieving data in MongoDB

I went through the sample source to see how to use MongoDB from R.

library(rmongodb)                                  # load the library first

# insert
mongo <- mongo.create()                            # connect
if (mongo.is.connected(mongo)) {                   # check the connection
    buf <- mongo.bson.buffer.create()              # create a buffer in R for one BSON record
    mongo.bson.buffer.append(buf, "name", "baker") # set a field and its value in the buffer
    mongo.bson.buffer.append(buf, "age", 50L)
    b <- mongo.bson.from.buffer(buf)               # turn the buffer into a BSON object
    mongo.insert(mongo, "test.people", b)          # add it to db: test, collection: people
}

# select
mongo <- mongo.create()
if (mongo.is.connected(mongo)) {
    buf <- mongo.bson.buffer.create()
    mongo.bson.buffer.append(buf, "age", 18L)
    query <- mongo.bson.from.buffer(buf)

    # Find the first 100 records
    #    in collection people of database test where age == 18
    # The query is {age: 18}
    # A cursor over the matching records is returned
    cursor <- mongo.find(mongo, "test.people", query, limit=100L)
    # Step through the matching records and display them
    # mongo.cursor.next advances the cursor one record at a time
    while (mongo.cursor.next(cursor))
        print(mongo.cursor.value(cursor))
    mongo.cursor.destroy(cursor)                  # release the cursor
    # Nothing is printed for now: the only record has age 50, so the
    # age == 18 query matches nothing (limit=100L is just an upper bound) :-p
}
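To convince myself what the `limit` argument does, here is a tiny in-memory JavaScript sketch. It is not rmongodb or a MongoDB driver: `findWithLimit` is a hypothetical helper that only supports exact-match queries. The point is that the limit is an upper bound on the number of results, not a minimum number of stored records.

```javascript
// Hypothetical stand-in for a find call: exact-match filter, then cap the result.
function findWithLimit(docs, query, limit) {
  var matches = docs.filter(function (doc) {
    return Object.keys(query).every(function (key) {
      return doc[key] === query[key];
    });
  });
  return matches.slice(0, limit);   // at most `limit` records, possibly fewer
}

// The collection holds just the record inserted above.
var people = [{ name: 'baker', age: 50 }];

console.log(findWithLimit(people, { age: 18 }, 100)); // no record has age 18
console.log(findWithLimit(people, { age: 50 }, 100)); // the single matching record
```

So the select above returns nothing because no record has age 18, and a query for age 50 would return the one record even though the limit is 100.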

Let's use mongo, the MongoDB client, to confirm that the data is actually stored in MongoDB.

# mongo        (check the result from the client)
> show dbs     # list the databases
admin
error_logger
local
test
> use test     # switch to the database test
switched to db test
> show collections   # list the collections in the current database
foo
people
system.indexes
users
> db.people.find()   # fetch every record in test.people
{ "_id" : ObjectId("4fcb042f3eee4d39039e1b87"), "name" : "baker", "age" : 50 }
>
