This ticket contains a potential solution for NationalSecurityAgency/datawave-accumulo-plugins#2
I did not add these details directly in order to avoid conflating specific implementation requirements with the currently known requirements.
Stage 1. Define possible components
- Top Level SimpleHDFSClassLoaderFactory class
- HDFS file fetcher (pluggable for testing)
- ContextPath structure
- Manifest File structure
- Context cleanup thread
Context path structure
The context path should be similar to the following:
hdfs://test:8020/contexts/contextA/manifest.json
This manifest file format should be machine readable. (JSON is not required but is used in this example.)
Contexts are re-loadable due to limitations in client code.
Directory and Manifest file structure
The directory should contain a manifest file and jars.
/tmp/local-contexts/contextA/manifest.json
/tmp/local-contexts/contextA/Iterators.jar
/tmp/local-contexts/contextA/IteratorsV2.jar
The manifest file should consist of jar names and checksum values.
{
"context": "contextA",
"jars": [
{
"name": "Iterators.jar",
"checksum": "f2ca1bb6c7e907d06dafe4687e579fce76b37e4e93b7605022da52e6ccc26fd2"
},
{
"name": "IteratorsV2.jar",
"checksum": "934ee77f70dc82403618c154aa63af6f4ebbe3ac1eaf22df21e7e64f0fb6643d"
}
]
}
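As a sketch, the checksum fields could be validated by hashing each downloaded jar with SHA-256 and comparing the result against the manifest entry. The class and method names below are illustrative, not part of the plugin API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Hypothetical helper: hash a jar's bytes with SHA-256 and render the digest
// as lowercase hex, matching the "checksum" values in the manifest above.
public class JarChecksum {
    public static String sha256(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        return HexFormat.of().formatHex(digest.digest(data));
    }

    // Convenience overload for a jar on the local filesystem.
    public static String sha256(Path jar) throws IOException, NoSuchAlgorithmException {
        return sha256(Files.readAllBytes(jar));
    }
}
```

A fetched jar would be rejected (or re-fetched) whenever its computed digest does not equal the manifest's checksum.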
Stage 2. Create Factory
Create a SimpleHDFSClassLoaderFactory that implements the ContextClassLoaderFactory interface.
This class should use a cache that quickly returns classloaders for already defined context names.
This cache should store the classloader and the local directory used for the contextPath file cache.
The class should perform a property lookup to get the corresponding contextPath for a given context name (see the ContextManager class).
It should resolve contextPaths to local directories and attempt to load classes from there.
Local File Cache Directory Resolution
This class should resolve context paths to a local directory location based on the immediate parent directory of the manifest file.
The local directory location should be a user-defined directory. (Similar to the VFS_CACHE_DIR property)
As an example:
ContextPath: hdfs://test:8020/contexts/contextA/manifest.json
User-defined dir: /tmp/local-contexts
Resolved dir: /tmp/local-contexts/contextA
The class should throw an error if that directory doesn't exist.
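A minimal sketch of that resolution, assuming the context name is always the manifest file's parent directory name (the class name is illustrative):

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical resolver: maps a contextPath (the HDFS manifest location) to
// the local file-cache directory by taking the manifest's parent directory
// name and appending it to the user-defined cache directory.
public class ContextDirResolver {
    public static Path resolveLocalDir(String contextPath, Path userDefinedDir) {
        // hdfs://test:8020/contexts/contextA/manifest.json -> /contexts/contextA/manifest.json
        Path manifest = Paths.get(URI.create(contextPath).getPath());
        String contextName = manifest.getParent().getFileName().toString(); // contextA
        return userDefinedDir.resolve(contextName); // /tmp/local-contexts/contextA
    }
}
```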
Once the directory is confirmed to exist, the class should use the manifest file to validate the jars and generate a new list of jar URLs. This list will then be used to create a new URLClassLoader.
This new classloader should be cached, along with the file cache dir, and then returned to the ClassLoaderUtil.
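The caching behavior could be sketched as follows; the ContextClassLoaderFactory interface, property lookup, and manifest validation are elided, and all names besides URLClassLoader are illustrative:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ClassLoaderCacheSketch {
    // Each cache entry keeps both the classloader and the local file-cache
    // directory, as described above.
    record Entry(ClassLoader loader, Path localDir) {}

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    public ClassLoader getClassLoader(String contextName, Path localDir) {
        return cache.computeIfAbsent(contextName, name -> {
            if (!Files.isDirectory(localDir)) {
                throw new IllegalStateException("Local context dir does not exist: " + localDir);
            }
            // Manifest validation elided; here every jar in the dir is used.
            List<URL> urls = new ArrayList<>();
            try (DirectoryStream<Path> jars = Files.newDirectoryStream(localDir, "*.jar")) {
                for (Path jar : jars) {
                    urls.add(jar.toUri().toURL());
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return new Entry(new URLClassLoader(urls.toArray(new URL[0])), localDir);
        }).loader();
    }
}
```

Repeated lookups for an already defined context name return the cached loader without touching the filesystem again.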
Write a test that stages the directory layout and a jar file, then successfully loads a class from that jar using the SimpleHDFSClassLoaderFactory.
Stage 3. Fetch Files from HDFS
Create a class that will perform the following steps when given a manifest file location:
- Create a lock file in the user-defined directory.
/tmp/local-contexts/contextA.lock
- Create a unique temp directory for the context
/tmp/local-contexts/tmp-contextA-<uuid>
- Download the manifest file to the temp dir and use the contents to copy and validate defined jars from the source HDFS location to the tmp dir.
- Perform a rename operation on the directory to promote it to the new context name.
/tmp/local-contexts/tmp-contextA-<uuid> -> /tmp/local-contexts/contextA
- Delete the lock file.
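The steps above might look roughly like this, with the actual HDFS copy hidden behind a pluggable fetcher interface (as Stage 1 suggests); all names are illustrative:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.UUID;

public class ContextFetchSketch {
    // Pluggable downloader: copies the manifest and its jars from HDFS into
    // destDir, validating checksums along the way. Swappable for testing.
    public interface HdfsFetcher {
        void fetch(String manifestLocation, Path destDir) throws Exception;
    }

    public static void fetchContext(String manifestLocation, String contextName,
                                    Path userDefinedDir, HdfsFetcher fetcher) throws Exception {
        Path lock = userDefinedDir.resolve(contextName + ".lock");
        Files.createFile(lock);                                  // 1. create lock file
        try {
            Path tmp = userDefinedDir.resolve("tmp-" + contextName + "-" + UUID.randomUUID());
            Files.createDirectories(tmp);                        // 2. unique temp dir
            fetcher.fetch(manifestLocation, tmp);                // 3. download + validate
            Files.move(tmp, userDefinedDir.resolve(contextName),
                    StandardCopyOption.ATOMIC_MOVE);             // 4. promote via rename
        } finally {
            Files.delete(lock);                                  // 5. delete lock file
        }
    }
}
```

The atomic rename means a concurrent reader only ever sees a missing context directory or a complete one, never a half-populated one.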
Modify the SimpleHDFSClassLoaderFactory to use this class when the local context directory doesn't exist and the lock file also doesn't exist.
Write an IT for testing loading classes from HDFS using the SimpleHDFSClassLoaderFactory in a single Tserver.
Stage 4. Support multiple processes
Modify the SimpleHDFSClassLoaderFactory to do the following:
- Check if the lock file exists; if it does, wait to load classes until a user-defined period of time has passed since the lock file was last modified.
- If the wait completes, the class should touch the lock file to reset its modification date and proceed with fetching files from HDFS.
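One way to sketch that wait, assuming a simple polling loop and treating the user-defined period as a staleness threshold (names and polling interval are illustrative):

```java
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

public class LockWaitSketch {
    /**
     * Waits on an existing lock file. Returns true if the lock went stale
     * (idle past the user-defined threshold) and we claimed it by touching it;
     * returns false if the lock disappeared, meaning another process finished.
     */
    public static boolean waitAndClaim(Path lockFile, long staleAfterMillis) throws Exception {
        while (Files.exists(lockFile)) {
            long modified;
            try {
                modified = Files.getLastModifiedTime(lockFile).toMillis();
            } catch (NoSuchFileException e) {
                return false; // deleted between checks: the other process finished
            }
            if (System.currentTimeMillis() - modified >= staleAfterMillis) {
                // Reset the modification date and take over the fetch.
                Files.setLastModifiedTime(lockFile, FileTime.fromMillis(System.currentTimeMillis()));
                return true;
            }
            Thread.sleep(Math.min(staleAfterMillis, 250)); // poll until stale or deleted
        }
        return false;
    }
}
```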
Stage 5. Cleanup old contexts
Start a thread that checks the property definitions every minute; any cached context that is no longer defined should be removed from the cache.
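A sketch of that loop, with the property scan abstracted into a supplier of the currently defined context names (all names are illustrative):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class ContextCleanupSketch {
    // One pass: drop every cached context that is no longer defined.
    public static void cleanupOnce(Map<String, ?> cache, Set<String> definedContexts) {
        cache.keySet().retainAll(definedContexts);
    }

    // Daemon thread that re-reads the definitions once a minute.
    public static ScheduledExecutorService start(Map<String, ?> cache,
                                                 Supplier<Set<String>> definedContexts) {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "context-cleanup");
            t.setDaemon(true);
            return t;
        });
        exec.scheduleAtFixedRate(() -> cleanupOnce(cache, definedContexts.get()),
                1, 1, TimeUnit.MINUTES);
        return exec;
    }
}
```

Evicting the entry drops the cached URLClassLoader reference; whether the local file-cache directory should also be deleted at that point is a separate decision.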