arXiv data pull

arXiv makes its data available through S3 buckets; see: http://arxiv.org/help/bulk_data_s3

Some highlights from this page:
- Papers are available as both PDF and LaTeX source
- Complete PDF data is about 270GB
- Complete source data is about 190GB
- All data lives in requester-pays buckets, so we'll have to cover the cost of the pull ($0.12/GB, about $55 total for both)
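
As a quick sanity check on the transfer cost, here is a minimal sketch that just multiplies the sizes quoted above by the quoted per-GB price. The constant and function names are made up for illustration, and the real bill will depend on the current bucket sizes and S3 pricing.

```python
# Rough transfer-cost estimate from the figures quoted in this issue.
# All names here are illustrative; the sizes and price are the ones listed above.
PDF_GB = 270
SOURCE_GB = 190
PRICE_PER_GB_USD = 0.12


def transfer_cost(size_gb, price_per_gb=PRICE_PER_GB_USD):
    """Return the estimated download cost in USD for size_gb gigabytes."""
    return size_gb * price_per_gb


if __name__ == "__main__":
    for label, size in [
        ("PDF", PDF_GB),
        ("source", SOURCE_GB),
        ("both", PDF_GB + SOURCE_GB),
    ]:
        print(f"{label}: {size} GB -> ${transfer_cost(size):.2f}")
```

Pulling only one of the two sets would cut the cost roughly in proportion, which may matter depending on which analysis option we pick.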

Questions:
- Where are we going to host the local copies?  Anyone want to volunteer server space?
- How will analysis work?
  - Option 1: grep the PDF text for URLs and/or known DOIs for software
  - Option 2: parse the source directly, maybe with plasTeX?  This could get expensive and difficult, but may give more reliable results
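
To make Option 1 a bit more concrete, here is a rough sketch of what the grep step might look like. It assumes the PDFs have already been converted to plain text by some other tool, and the regexes are deliberately loose approximations rather than a full URL or DOI grammar. The function name and the sample text are hypothetical.

```python
import re

# Loose patterns: good enough for a first pass, not a strict URL/DOI grammar.
URL_RE = re.compile(r"https?://[^\s<>)\]}]+")
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s<>]+")


def find_references(text):
    """Return the URLs and DOI-like strings found in already-extracted text."""
    # Strip trailing punctuation that usually belongs to the sentence, not the link.
    urls = [u.rstrip(".,;:") for u in URL_RE.findall(text)]
    dois = [d.rstrip(".,;:)") for d in DOI_RE.findall(text)]
    return {"urls": urls, "dois": dois}


if __name__ == "__main__":
    sample = (
        "Code at https://github.com/example/tool. "
        "Archived as doi:10.5281/zenodo.12345, see also (http://example.org/x)."
    )
    print(find_references(sample))
```

Matching against a list of known software DOIs would then just be a set intersection on the `dois` output. Line breaks and hyphenation introduced by PDF text extraction will likely split some links, which is one reason Option 2 might end up more reliable.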

