Merged
Changes from all commits (20 commits):
4072349 add initial processing script for full text dumps (CarsonDavis, Feb 21, 2025)
c650645 add processing scripts for csv (CarsonDavis, Feb 25, 2025)
8f8d64e Added command (dhanur-sharma, Apr 4, 2025)
c456efe Added option for full_text export (dhanur-sharma, Apr 4, 2025)
7f17724 Added changelog (dhanur-sharma, Apr 4, 2025)
4b2f32c Update sde_collections/management/commands/export_urls_to_csv.py (CarsonDavis, Apr 7, 2025)
4410186 Merge pull request #1300 from NASA-IMPACT/1298-csv-export-command-for… (CarsonDavis, Apr 7, 2025)
0b3f61b Merge pull request #1301 from NASA-IMPACT/staging (CarsonDavis, Apr 11, 2025)
c0e017e add script to dump curated url list with excludes (CarsonDavis, Apr 24, 2025)
e518189 Merge pull request #1303 from NASA-IMPACT/url_dump_script (CarsonDavis, Apr 25, 2025)
2ff7dce Merge branch 'dev' into 1232-process-the-full-text-dump (CarsonDavis, Apr 28, 2025)
5279857 update the processing script (CarsonDavis, Apr 29, 2025)
7ae5fde Merge branch '1232-process-the-full-text-dump' of github.com:NASA-IMP… (CarsonDavis, Apr 29, 2025)
14b1fea add changelog entry for clean_text_dump.py (CarsonDavis, Apr 30, 2025)
43f324f Merge pull request #1304 from NASA-IMPACT/1232-process-the-full-text-… (CarsonDavis, Apr 30, 2025)
77031c7 add a config restricting blank issues (CarsonDavis, Jun 11, 2025)
c4776d8 Merge pull request #1306 from NASA-IMPACT/add_config (dhanur-sharma, Jun 11, 2025)
809c11f Added https://science.data.nasa.gov/ to CORS Allowed origins (dhanur-sharma, Nov 21, 2025)
24b3417 add pbr as a bandit dependency (CarsonDavis, Nov 21, 2025)
54e7b08 Merge pull request #1310 from NASA-IMPACT/new-url-cors (CarsonDavis, Nov 21, 2025)
1 change: 1 addition & 0 deletions .github/ISSUE_TEMPLATE/config.yml
@@ -0,0 +1 @@
blank_issues_enabled: false
12 changes: 6 additions & 6 deletions .pre-commit-config.yaml
@@ -61,18 +61,18 @@ repos:
          - types-requests

  - repo: https://github.com/PyCQA/bandit
-    rev: '1.7.0'
+    rev: "1.7.0"
    hooks:
      - id: bandit
-        args: ['-r', '--configfile=bandit-config.yml']
+        args: ["-r", "--configfile=bandit-config.yml"]
+        additional_dependencies:
+          - pbr

  - repo: https://github.com/zricethezav/gitleaks
-    rev: 'v8.0.4'
+    rev: "v8.0.4"
    hooks:
      - id: gitleaks
-        args: ['--config=gitleaks-config.toml']
-
-
+        args: ["--config=gitleaks-config.toml"]

ci:
  autoupdate_schedule: weekly
17 changes: 17 additions & 0 deletions CHANGELOG.md
@@ -12,7 +12,13 @@ For each PR made, an entry should be added to this changelog. It should contain
- etc.

## Changelog
### 3.1.??
- 1232-process-the-full-text-dump
  - Description: A script was added at `/scripts/sde_dump_processing/clean_text_dump.py` that cleans full-text dumps from Sinequa. The Sinequa dump does not respect normal CSV newline formatting, so a dump of 1.8 million records becomes a CSV of 900 million lines. The script can detect the headers and process the dump for the three possible sources (TDAMM, SDE, and scripts) to produce a final, clean CSV. It has a simple CLI for setting the input and output paths, the log verbosity, etc. Because the input files can be very large, the script streams them instead of holding them in memory. A minimal illustrative sketch of the streaming approach follows this entry.
  - Changes:
    - add file `/scripts/sde_dump_processing/clean_text_dump.py`
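Below is a minimal sketch of the streaming clean-up pattern this entry describes. It is not the actual `clean_text_dump.py` from the PR; the record-start heuristic, the assumed column layout (url, source, full text), and the CLI flags are all assumptions made for the example.

```python
# Illustrative sketch only; NOT the contents of clean_text_dump.py.
# Assumes rows were wrapped across physical lines and that each logical
# record starts with a URL in the first column (a hypothetical heuristic).
import argparse
import csv
import logging
import re

RECORD_START = re.compile(r"^https?://")  # hypothetical record-start heuristic


def clean_dump(input_path: str, output_path: str) -> None:
    """Stream the dump line by line and stitch wrapped records back together."""
    with open(input_path, encoding="utf-8", errors="replace") as src:
        with open(output_path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.writer(dst)
            buffered: list[str] = []
            for line in src:  # streamed, so a 900-million-line file never sits in memory
                if RECORD_START.match(line) and buffered:
                    writer.writerow(to_row(buffered))
                    buffered = []
                buffered.append(line.rstrip("\n"))
            if buffered:
                writer.writerow(to_row(buffered))


def to_row(lines: list[str]) -> list[str]:
    """Collapse buffered physical lines into url, source, full_text columns (assumed layout)."""
    head = lines[0].split(",", 2)  # url, source, start of the full text
    url = head[0]
    source = head[1] if len(head) > 1 else ""
    pieces = ([head[2]] if len(head) > 2 else []) + lines[1:]
    return [url, source, " ".join(pieces).strip()]


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Clean a Sinequa full-text dump (illustrative sketch).")
    parser.add_argument("input")
    parser.add_argument("output")
    parser.add_argument("-v", "--verbose", action="store_true", help="enable debug logging")
    args = parser.parse_args()
    logging.basicConfig(level=logging.DEBUG if args.verbose else logging.INFO)
    logging.info("Cleaning %s -> %s", args.input, args.output)
    clean_dump(args.input, args.output)
```

The point that matches the description is that the file is read one physical line at a time and logical records are flushed as soon as they are complete, so memory use stays flat regardless of dump size.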

### 3.1.0
- 1209-bug-fix-document-type-creator-form
  - Description: The dropdown on the pattern creation form needs to default to the multi option, since the doc type creator form is used for the majority of multi-URL pattern creations. This should be applied to doc types, division types, and titles as well.
  - Changes:
@@ -183,3 +189,14 @@ For each PR made, an entry should be added to this changelog. It should contain
- physics_of_the_cosmos
- stsci_space_telescope_science_institute
- Once the front end has been updated to allow for tag edits, all astrophysics collections will be marked to be run through the pipeline

- 1298-csv-export-command-for-urls
  - Description: Added a new Django management command to export URLs (DumpUrl, DeltaUrl, or CuratedUrl) to CSV files for analysis or backup purposes. The command allows filtering by collection and provides configurable export options. An illustrative sketch of the batched export pattern follows this entry.
  - Changes:
    - Created a new management command `export_urls_to_csv.py` to extract URL data to CSV format
    - Implemented options to filter exports by model type and specific collections
    - Added support for excluding full text content with the `--full_text` flag to reduce file size
    - Included proper handling for paired fields (tdamm_tag_manual, tdamm_tag_ml)
    - Added automatic creation of a dedicated `csv_exports` directory for storing export files
    - Implemented batched processing to efficiently handle large datasets
    - Added progress reporting during export operations
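As referenced above, here is a minimal sketch of the batched export pattern this entry describes. It is not the actual `export_urls_to_csv.py` command; the option names (other than `--full_text`, which the entry mentions), the exported field names, and the output filename are assumptions for illustration.

```python
# Illustrative sketch of a batched CSV export command; NOT the PR's export_urls_to_csv.py.
import csv
from pathlib import Path

from django.core.management.base import BaseCommand

from sde_collections.models.delta_url import CuratedUrl


class Command(BaseCommand):
    help = "Export CuratedUrl rows to a CSV file in batches (sketch)."

    def add_arguments(self, parser):
        parser.add_argument("--collection", help="config_folder of a single collection to export")
        parser.add_argument("--full_text", action="store_true", help="include the full text column")

    def handle(self, *args, **options):
        # Dedicated directory for export files
        export_dir = Path("csv_exports")
        export_dir.mkdir(exist_ok=True)

        queryset = CuratedUrl.objects.all()
        if options["collection"]:
            queryset = queryset.filter(collection__config_folder=options["collection"])

        # Field names are assumptions for the example
        fields = ["url", "scraped_title"]
        if options["full_text"]:
            fields.append("scraped_text")

        out_path = export_dir / "curated_urls.csv"
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(fields)
            # iterator(chunk_size=...) streams rows from the database in batches
            for i, row in enumerate(queryset.values_list(*fields).iterator(chunk_size=2000), 1):
                writer.writerow(row)
                if i % 10000 == 0:
                    self.stdout.write(f"Exported {i} rows...")

        self.stdout.write(self.style.SUCCESS(f"Done: {out_path}"))
```

Whether `--full_text` includes or excludes the text column in the real command may differ from this sketch; the sketch only shows the batching, filtering, and directory-creation mechanics.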
1 change: 1 addition & 0 deletions config/settings/base.py
@@ -102,6 +102,7 @@
    "http://sciencediscoveryengine.nasa.gov",
    "https://localhost:4200",
    "http://localhost:4200",
    "https://science.data.nasa.gov/",
]

# MIGRATIONS
168 changes: 168 additions & 0 deletions scripts/dump_url_list_excludes_includes.py
@@ -0,0 +1,168 @@
"""
This script is meant to be run from within a shell. You can do it in the following way:

Establish a coding container:

```shell
tmux new -s docker_django
tmux attach -t docker_django
tmux kill-session -t docker_django
```

```bash
dmshell
```

Copy and paste this code into the shell and run it.

To get the output out of the container:

```bash
docker cp 593dab064a15:/tmp/curated_urls_status.json ./curated_urls_status.json
```

To move it onto your local machine:

```bash
scp sde:/home/ec2-user/sde_indexing_helper/curated_urls_status.json .
```
"""

import concurrent.futures
import json
import os
from collections import defaultdict

from django.db import connection

from sde_collections.models.delta_url import CuratedUrl


def process_chunk(chunk_start, chunk_size, total_count):
    """Process a chunk of curated URLs and return data grouped by collection"""
    # Close any existing DB connections to avoid sharing connections between processes
    connection.close()

    # Get the chunk of data with collection information
    curated_urls_chunk = (
        CuratedUrl.objects.select_related("collection")
        .all()
        .with_exclusion_status()
        .order_by("url")[chunk_start : chunk_start + chunk_size]
    )

    # Group URLs by collection folder name
    collection_data = defaultdict(list)
    for url in curated_urls_chunk:
        collection_folder = url.collection.config_folder
        included = not url.excluded  # Convert to boolean inclusion status

        collection_data[collection_folder].append({"url": url.url, "included": included})

    # Save to a temporary file
    temp_path = f"/tmp/chunk{chunk_start}.json"
    with open(temp_path, "w") as f:
        json.dump(dict(collection_data), f)

    processed = min(chunk_start + chunk_size, total_count)
    print(f"Processed {processed}/{total_count} URLs")

    return temp_path


def export_curated_urls_with_status():
    """Export all curated URLs with their inclusion status, grouped by collection"""
    output_path = "/tmp/curated_urls_status.json"

    # Get the total count and status statistics
    curated_urls = CuratedUrl.objects.all().with_exclusion_status()
    total_count = curated_urls.count()
    excluded_count = curated_urls.filter(excluded=True).count()
    included_count = curated_urls.filter(excluded=False).count()

    print(f"Total URLs: {total_count}")
    print(f" Excluded: {excluded_count}")
    print(f" Included: {included_count}")

    # Define chunk size and calculate number of chunks
    chunk_size = 10000
    chunk_starts = list(range(0, total_count, chunk_size))

    # Process chunks in parallel
    temp_files = []
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        # Submit all tasks
        future_to_chunk = {
            executor.submit(process_chunk, chunk_start, chunk_size, total_count): chunk_start
            for chunk_start in chunk_starts
        }

        # Collect results as they complete
        for future in concurrent.futures.as_completed(future_to_chunk):
            chunk_start = future_to_chunk[future]
            try:
                temp_file = future.result()
                temp_files.append(temp_file)
            except Exception as e:
                print(f"Chunk starting at {chunk_start} generated an exception: {e}")

    # Combine all temp files into final output
    combined_data = {}

    # Sort temp files by chunk start position
    temp_files.sort(key=lambda x: int(os.path.basename(x).replace("chunk", "").split(".")[0]))

    for temp_file in temp_files:
        with open(temp_file) as infile:
            chunk_data = json.load(infile)
            # Merge chunk data into combined data
            for collection_folder, urls in chunk_data.items():
                if collection_folder not in combined_data:
                    combined_data[collection_folder] = []
                combined_data[collection_folder].extend(urls)

        # Clean up temp file
        os.unlink(temp_file)

    # Write the final combined data
    with open(output_path, "w") as outfile:
        json.dump(combined_data, outfile, indent=2)

    # Verify export completed successfully
    if os.path.exists(output_path):
        file_size_mb = os.path.getsize(output_path) / (1024 * 1024)
        print(f"Export complete. File saved to: {output_path}")
        print(f"File size: {file_size_mb:.2f} MB")

        # Sanity check: Count the total included and excluded URLs in the final file
        final_included = 0
        final_excluded = 0

        # Read the file back and count
        with open(output_path) as infile:
            file_data = json.load(infile)
            for collection_folder, urls in file_data.items():
                for url_data in urls:
                    if url_data["included"]:
                        final_included += 1
                    else:
                        final_excluded += 1

        print("\nSanity check on final file:")
        print(f"Total URLs in file: {final_included + final_excluded}")
        print(f" Included: {final_included}")
        print(f" Excluded: {final_excluded}")

        # Check if counts match
        if final_included == included_count and final_excluded == excluded_count:
            print("✅ Counts match database query results!")
        else:
            print("⚠️ Warning: Final counts don't match initial database query!")
            print(f" Database included: {included_count}, File included: {final_included}")
            print(f" Database excluded: {excluded_count}, File excluded: {final_excluded}")
    else:
        print("ERROR: Output file was not created!")


# Run the export function
export_curated_urls_with_status()