This repository was archived by the owner on May 29, 2018. It is now read-only.

Description
arXiv makes its data available via S3 buckets; see: http://arxiv.org/help/bulk_data_s3
Some highlights from this page:
- Papers are available in both PDF and LaTeX source
- Complete PDF data is about 270GB
- Complete source data is about 190GB
- All data lives in requester-pays buckets, so we'll have to cover the cost of the pull ($0.12/GB, so roughly $55 total for both: 460GB x $0.12/GB)
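
As a quick sanity check on the transfer cost, here is a minimal sketch that only multiplies the sizes and the per-GB price quoted above. The names `SIZES_GB`, `PRICE_PER_GB_USD`, and `transfer_cost` are made up for illustration, and the figures are the ones listed here, not current AWS pricing.

```python
# Figures taken from the highlights above; they are approximate.
PRICE_PER_GB_USD = 0.12
SIZES_GB = {"pdf": 270, "source": 190}


def transfer_cost(sizes_gb, price_per_gb):
    """Return the estimated requester-pays cost (USD) for each dataset."""
    return {name: size * price_per_gb for name, size in sizes_gb.items()}


costs = transfer_cost(SIZES_GB, PRICE_PER_GB_USD)
total = sum(costs.values())

for name, cost in costs.items():
    print(f"{name}: ${cost:.2f}")
print(f"total: ${total:.2f}")
```

Pulling only one of the two datasets (for example, only the source) would cut the bill roughly in half, which may matter depending on which analysis option we pick.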
Questions:
- Where are we going to host the local copies? Anyone want to volunteer server space?
- How will analysis work?
  - Option 1: grep PDF text for URLs and/or known DOIs for software
  - Option 2: parse the source directly, maybe with plasTeX? This could get expensive and difficult, but may give more reliable results
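
To make Option 1 more concrete, here is a rough sketch of what the "grep" step might look like once the PDF text has already been extracted to a plain string (the extraction itself is not shown). The regular expressions are deliberately simplified assumptions, and `find_software_references` and `known_dois` are hypothetical names, not part of any existing tool.

```python
import re

# Simplified patterns for illustration; real URLs and DOIs have more edge cases.
URL_RE = re.compile(r"https?://[^\s<>)\]]+")
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s<>]+")

# Punctuation that often trails a URL or DOI at the end of a sentence.
TRAILING = ".,;:"


def find_software_references(text, known_dois=()):
    """Return URLs, DOIs, and DOIs matching a known list found in text."""
    urls = [u.rstrip(TRAILING) for u in URL_RE.findall(text)]
    dois = [d.rstrip(TRAILING) for d in DOI_RE.findall(text)]
    # DOIs are case-insensitive, so compare in lower case.
    known = {d.lower() for d in known_dois}
    matched = [d for d in dois if d.lower() in known]
    return {"urls": urls, "dois": dois, "known_dois": matched}


sample = (
    "We used the code at https://github.com/example/tool. "
    "Archived as doi:10.5281/zenodo.12345, see also 10.1000/xyz."
)
result = find_software_references(sample, known_dois=["10.5281/ZENODO.12345"])
print(result)
```

This kind of matching is cheap to run over the whole corpus, but it will miss URLs broken across lines by PDF text extraction, which is one reason Option 2 may give more reliable results.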