Backing up your tweets to Camlistore with Python and Camlipy

If, like me, you want to back up everything you ever publish, it makes sense to back up your tweets. Camlistore, also known as a "personal storage system for life", seems like a perfect fit for the job.

I chose Python since I recently created Camlipy, a Python client for Camlistore, and I have already played with the Twitter API using requests (see Using the Twitter API v1.1 with Python).

Requirements

$ sudo pip install requests requests_oauthlib camlipy

Twitter API

Here is the Twitter API OAuth part from my previous article: you must set your API keys and run the script once to generate your OAuth tokens.

In [1]:
import json
from requests_oauthlib import OAuth1
from urlparse import parse_qs
import sys
import datetime
import locale

import requests

REQUEST_TOKEN_URL = "https://api.twitter.com/oauth/request_token"
AUTHORIZE_URL = "https://api.twitter.com/oauth/authorize?oauth_token="
ACCESS_TOKEN_URL = "https://api.twitter.com/oauth/access_token"

CONSUMER_KEY = "XXXXXXXXX"
CONSUMER_SECRET = "XXXXXXXXX"

OAUTH_TOKEN = ""
OAUTH_TOKEN_SECRET = ""

TWITTER_API_TIMELINE = 'https://api.twitter.com/1.1/statuses/user_timeline.json'


def setup_oauth():
    """Authorize your app via identifier."""
    # Request token
    oauth = OAuth1(CONSUMER_KEY, client_secret=CONSUMER_SECRET)
    r = requests.post(url=REQUEST_TOKEN_URL, auth=oauth)
    credentials = parse_qs(r.content)

    resource_owner_key = credentials.get('oauth_token')[0]
    resource_owner_secret = credentials.get('oauth_token_secret')[0]

    # Authorize
    authorize_url = AUTHORIZE_URL + resource_owner_key
    print 'Please go here and authorize: ' + authorize_url

    verifier = raw_input('Please input the verifier: ')
    oauth = OAuth1(CONSUMER_KEY,
                   client_secret=CONSUMER_SECRET,
                   resource_owner_key=resource_owner_key,
                   resource_owner_secret=resource_owner_secret,
                   verifier=verifier)

    # Finally, Obtain the Access Token
    r = requests.post(url=ACCESS_TOKEN_URL, auth=oauth)
    credentials = parse_qs(r.content)
    token = credentials.get('oauth_token')[0]
    secret = credentials.get('oauth_token_secret')[0]

    return token, secret


def get_oauth():
    oauth = OAuth1(CONSUMER_KEY,
                   client_secret=CONSUMER_SECRET,
                   resource_owner_key=OAUTH_TOKEN,
                   resource_owner_secret=OAUTH_TOKEN_SECRET)
    return oauth

if not OAUTH_TOKEN:
    token, secret = setup_oauth()
    print "OAUTH_TOKEN: " + token
    print "OAUTH_TOKEN_SECRET: " + secret
    print
    sys.exit()

Now we can call the Twitter API; we'll hit the statuses/user_timeline endpoint (check out Working with Timelines if needed).

In [2]:
def fetch_tweets(since_id=None, max_id=None, count=200):
    """ Fetch tweets. """
    params = {'count': count}
    if since_id:
        params['since_id'] = since_id
    if max_id:
        params['max_id'] = max_id

    oauth = get_oauth()
    r = requests.get(url=TWITTER_API_TIMELINE,
                     params=params, auth=oauth)
    return r.json()

Here is the backup_timeline function, which automatically handles the max_id parameter, since a single API call can only return up to 200 tweets. As described in Working with Timelines, we subtract 1 from the lowest id already seen so the oldest tweet of a batch is not fetched twice.

In [3]:
def backup_timeline(since_id=None, max_id=None, count=200):
    """ Fetch the whole timeline, paginating with max_id. """
    tweets = []
    while True:
        batch = fetch_tweets(since_id=since_id, max_id=max_id, count=count)
        tweets.extend(batch)
        if batch:
            # Subtract 1 so the oldest tweet of this batch is not
            # returned again by the next call.
            max_id = batch[-1]['id'] - 1
        if len(batch) < count:
            break
    return tweets

Now we can fetch our timeline (it will only return the most recent 3,200 tweets, which is the API limit).

In [4]:
tweets = backup_timeline()

And we keep the most recent id for the next time we run the backup.

In [5]:
since_id = tweets[0]['id']

Check that backup_timeline returns an empty result when called with since_id.

In [6]:
backup_timeline(since_id=since_id)
Out[6]:
[]

Getting started with Camlistore and Camlipy

If you have never heard of Camlistore before, you should read the website and the project overview. You should also check out the Camlipy documentation.

I am assuming you have at least a little knowledge about how Camlistore works.

Each blob is identified by its unique blobref. A blobref looks like sha1-25f2b42fae7398bc8857ed17d56d7d1e072c9832.
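To make this concrete, a blobref is just the name of the hash function followed by the hex digest of the blob's raw bytes. Here is a quick sketch (the blobref helper below is only for illustration; Camlipy computes blobrefs for you when uploading blobs):

import hashlib

def blobref(data):
    # A blobref is the hash name plus the hex digest of the blob's bytes.
    return 'sha1-' + hashlib.sha1(data).hexdigest()

print blobref('Hello, Camlistore!')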

In [7]:
from camlipy import Camlistore
c = Camlistore('http://localhost:3179')

First, we need to create a permanode; it will hold the reference to the static set, since we will create a new static set each time we perform a backup.

In [8]:
p = c.permanode_by_title('twitter_backups', create=True)
In [9]:
p
Out[9]:
<Schema Permanode:sha1-25f2b42fae7398bc8857ed17d56d7d1e072c9832>

Next, we create a new static set containing each tweet's blobref (most recent first) with the add_to_static_set helper, and we wrap this set inside another static set that we set as the permanode's camliContent.

In [10]:
s = c.add_to_static_set([c.put_blob(json.dumps(tweet)) for tweet in tweets])
p.set_camli_content(c.add_to_static_set([s]))

Finally, we store the newest id, since_id, for the next run in a permanode attribute (permanode attributes must be str), which creates a new claim.

In [11]:
p.set_attr('since_id', str(since_id))
In [12]:
p.get_attr('since_id')
Out[12]:
u'369579788119199746'

The final script

Here it is, without the Twitter API code from above.

In [13]:
# Fetch the permanode
p = c.permanode_by_title('twitter_backups')

# Retrieve the since_id stored during the previous backup
since_id = int(p.get_attr('since_id'))

# Try to retrieve the existing static set
static_set_blobref = p.get_camli_content()

if static_set_blobref:
    # A static set already exists, so we load it
    static_set = c.static_set(static_set_blobref)
else:
    # We create a new static set
    static_set = c.static_set()

# Store each new tweet as a blob
new_tweets = backup_timeline(since_id=since_id)
tweets_blobrefs = [c.put_blob(json.dumps(tweet)) for tweet in new_tweets]
if tweets_blobrefs:
    # Wrap the new blobrefs in a static set and append it to the existing one
    new_static_set_blobref = static_set.update([c.add_to_static_set(tweets_blobrefs)])
    p.set_camli_content(new_static_set_blobref)
    # Keep the newest id for the next run
    p.set_attr('since_id', str(new_tweets[0]['id']))

Accessing tweets

In [14]:
p2 = c.permanode_by_title('twitter_backups', create=True)
In [15]:
s2 = c.static_set(p2.get_camli_content())

To retrieve the tweets starting with the most recent, we must iterate over the list of sets in reverse order.

In [16]:
tweets = []
# Iterate over the backup sets, most recent backup first
for static_set_blobref in s2.members[::-1]:
    static_set = c.static_set(static_set_blobref)
    # Each member is the blobref of a JSON-encoded tweet
    for tweet_blobref in static_set.members:
        tweet = json.loads(c.get_blob(tweet_blobref).read())
        tweets.append(tweet)
In [17]:
tweets[0]
Out[17]:
{u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Mon Aug 19 22:01:02 +0000 2013',
 u'entities': {u'hashtags': [{u'indices': [49, 60], u'text': u'Camlistore'}],
  u'symbols': [],
  u'urls': [{u'display_url': u'camlipy.readthedocs.org/en/latest/',
    u'expanded_url': u'https://camlipy.readthedocs.org/en/latest/',
    u'indices': [61, 84],
    u'url': u'https://t.co/9GNCYTvvCu'}],
  u'user_mentions': []},
 u'favorite_count': 0,
 u'favorited': False,
 u'geo': None,
 u'id': 369579788119199746,
 u'id_str': u'369579788119199746',
 u'in_reply_to_screen_name': None,
 u'in_reply_to_status_id': None,
 u'in_reply_to_status_id_str': None,
 u'in_reply_to_user_id': None,
 u'in_reply_to_user_id_str': None,
 u'lang': u'en',
 u'place': None,
 u'possibly_sensitive': False,
 u'retweet_count': 0,
 u'retweeted': False,
 u'source': u'web',
 u'text': u'I just released Camlipy 0.1.1, Python client for #Camlistore https://t.co/9GNCYTvvCu',
 u'truncated': False,
 u'user': {u'contributors_enabled': False,
  u'created_at': u'Tue Nov 09 22:56:48 +0000 2010',
  u'default_profile': False,
  u'default_profile_image': False,
  u'description': u'Python developer, paranoid about backups (http://t.co/HS7Z0kW9Iy and http://t.co/0Zwvyje93x)',
  u'entities': {u'description': {u'urls': [{u'display_url': u'docs.bakthat.io',
      u'expanded_url': u'http://docs.bakthat.io',
      u'indices': [42, 64],
      u'url': u'http://t.co/HS7Z0kW9Iy'},
     {u'display_url': u'bakserver.bakthat.io',
      u'expanded_url': u'http://bakserver.bakthat.io',
      u'indices': [69, 91],
      u'url': u'http://t.co/0Zwvyje93x'}]},
   u'url': {u'urls': [{u'display_url': u'thomassileo.com',
      u'expanded_url': u'http://thomassileo.com',
      u'indices': [0, 22],
      u'url': u'http://t.co/9g1uECbDWb'}]}},
  u'favourites_count': 26,
  u'follow_request_sent': False,
  u'followers_count': 132,
  u'following': False,
  u'friends_count': 348,
  u'geo_enabled': False,
  u'id': 213842895,
  u'id_str': u'213842895',
  u'is_translator': False,
  u'lang': u'fr',
  u'listed_count': 5,
  u'location': u'France',
  u'name': u'Thomas Sileo',
  u'notifications': False,
  u'profile_background_color': u'FFFFFF',
  u'profile_background_image_url': u'http://a0.twimg.com/profile_background_images/309927553/nvblackplaid_twitter.br.jpg',
  u'profile_background_image_url_https': u'https://si0.twimg.com/profile_background_images/309927553/nvblackplaid_twitter.br.jpg',
  u'profile_background_tile': False,
  u'profile_image_url': u'http://a0.twimg.com/profile_images/2383014688/mflvmo7ahhlvfgy945c6_normal.jpeg',
  u'profile_image_url_https': u'https://si0.twimg.com/profile_images/2383014688/mflvmo7ahhlvfgy945c6_normal.jpeg',
  u'profile_link_color': u'262626',
  u'profile_sidebar_border_color': u'000000',
  u'profile_sidebar_fill_color': u'545454',
  u'profile_text_color': u'9C9C9C',
  u'profile_use_background_image': True,
  u'protected': False,
  u'screen_name': u'trucsdedev',
  u'statuses_count': 114,
  u'time_zone': u'Paris',
  u'url': u'http://t.co/9g1uECbDWb',
  u'utc_offset': 7200,
  u'verified': False}}
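
From here, you can process the recovered tweets however you like. As a minimal sketch (the filename below is arbitrary), you could dump everything back to a plain JSON file:

# Dump the recovered tweets to a regular JSON file (filename is arbitrary)
with open('tweets_backup.json', 'w') as f:
    json.dump(tweets, f, indent=2)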

Conclusion

Don't hesitate to let me know if you have any questions or suggestions!

You should follow me on Twitter
