AI Revolutionizes Quality: Never Miss a Software "Quality Issue" Again!
Hi everyone, I'm Tak@! I'm a systems integrator, and I mainly develop web services that leverage generative AI. My focus is always on designing services that are backed by solid technology.
In this column, I hope to clearly convey the fascinating and practical aspects of AI technology.
This time, the theme is exploring how AI can assist in software quality assurance (QA).
Specifically, I'll be discussing how the previously manual task of reviewing design documents has transformed with the introduction of AI. We'll look at how it's improved quality and streamlined operations, focusing on examples and concrete figures from a case study at Hitachi Solutions, Ltd.
I'll also explain "AI4QA" (Artificial Intelligence for Quality Assurance), a concept where AI brings new value to quality assurance, and "BERT," its core technology, in a way that's easy to understand, even if you don't have specialized knowledge. So please stick with me until the end!
Solving Software Quality Assurance Headaches with AI
In software development, quality is always a critical concern. Design document reviews, in particular, are a vital process that influences the quality of subsequent development stages. However, for many years, this review process has presented numerous challenges.
Challenges of Manual Reviews
When conducting document reviews manually, there were two main challenges:
Quantitative Aspect: Massive Documents and Limited Resources
Modern software is incredibly large and complex. This often leads to a colossal amount of design documentation. Checking every single page thoroughly within limited personnel and time is an extremely challenging task. This could easily lead to oversights or insufficient reviews.
Qualitative Aspect: Reliance on Reviewer Knowledge and Experience
Another issue was that the quality of reviews heavily depended on the knowledge and experience of the assigned reviewer. While veteran engineers could provide insightful feedback based on deep understanding, less experienced reviewers might miss critical issues. This resulted in inconsistencies in review outcomes, leading to "unevenness" in quality. It was difficult to ensure consistent quality regardless of who performed the review.
Limitations of Traditional Keyword-Based Checking
To address these challenges, "keyword-based checking" methods have been tried in the past. This approach searches documents for specific keywords according to predefined check rules. For example, a rule might state that if an expression like "in the case of XX" appears more than twice, the conditions are probably overly complex and could lead to bugs. Open-source document checking tools such as RedPen were also utilized.
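To make the idea concrete, here's a minimal sketch of what a keyword-rule check of this kind might look like in Python. The rules, thresholds, and messages are entirely illustrative; the actual rule set used in the case study isn't published.

```python
import re

# Hypothetical check rules: (pattern, max allowed occurrences, message).
# The real rules and thresholds from the case study are not published.
RULES = [
    (r"in the case of", 2, "Many conditional phrases; the conditions may be overly complex."),
    (r"\betc\.", 1, "Vague enumeration; the specification may be ambiguous."),
]

def keyword_check(sentence: str) -> list[str]:
    """Return warning messages for a single design-document sentence."""
    warnings = []
    for pattern, max_count, message in RULES:
        count = len(re.findall(pattern, sentence, flags=re.IGNORECASE))
        if count > max_count:
            warnings.append(f"{message} (matched {count} times)")
    return warnings

# Flagged purely on keyword counts, with no understanding of context.
print(keyword_check(
    "In the case of A, do X; in the case of B, do Y; in the case of C, do Z."
))
```

Because the check is purely lexical, a perfectly clear sentence that happens to repeat a phrase gets flagged just the same as a genuinely problematic one.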
However, this method had several significant drawbacks:
- Tedious Keyword Definition: Defining and registering each check keyword beforehand was a very time-consuming process.
- Difficulty in Understanding Context: The biggest challenge was that it only picked up keywords, failing to accurately understand the "context" or "meaning" of a sentence. For instance, even if a keyword like "in the case of XX" was present, the tool couldn't determine if it was truly a problematic expression or simply appropriate within that specific context.
- High False Positive Rate: Due to this difficulty in understanding context, initial trials at Hitachi Solutions reportedly saw an astonishing 90% false positive rate. This placed a heavy burden on human reviewers, who had to verify each flagged item, actually making the process less efficient.
I've personally experienced this when reviewing design documents with complex conditional branching. Automated tools would flag numerous warnings, but the vast majority were false positives, making me wish the tool could "judge a bit smarter." Such a high rate of false positives was a major factor in lowering user motivation for these tools.
AI4QA: How AI Transforms Quality Assurance
To overcome these traditional challenges, AI technology has garnered significant attention as a solution. In particular, the approach of leveraging AI directly within the quality assurance process is called "AI4QA" (Artificial Intelligence for Quality Assurance). This differs from "QA4AI," which focuses on ensuring the quality of AI-powered products. AI4QA aims to streamline and enhance quality assurance activities themselves through AI.
What is AI4QA?
The goal of AI4QA is to simultaneously improve quality, speed, and cost by harnessing the power of AI. Specifically, it aims to maintain quality more accurately, quickly, and with less cost by supporting or automating various quality assurance tasks previously performed by humans.
Mr. Shinsuke Matsuki, General Manager of the Research, Planning & Development Department at Veriserve Corporation, also describes AI4QA as an "approach" to how AI technology is put to use in quality technology. By bringing AI's strengths into quality assurance, tools such as test automation have evolved, giving rise to "better tools."
The Vision for AI-Driven Reviews
Based on the AI4QA philosophy, the vision for document reviews directly addresses the traditional challenges:
- Automated Full-Page Checks Without Human Intervention: First, tools can automatically check all document pages without human intervention. This allows for quick reviews of large volumes of documents, reducing the risk of oversights.
- Consistent Results for Everyone, Replicating Expert Insights: With AI providing the judgment criteria, the review results become consistent regardless of the reviewer's knowledge or experience. Furthermore, AI can replicate the "insights" gained by veteran engineers over years of experience, providing higher quality feedback.
This truly means that AI enables "quality standardization" and the "sharing of expert techniques."
BERT: The Power of AI that Understands Context
So, what specific AI technology solved the issues of traditional keyword-based checking and enabled intelligent, context-aware reviews? This is where BERT, a powerful natural language processing model, comes in.
What is BERT?
BERT, which stands for "Bidirectional Encoder Representations from Transformers," is a deep learning model introduced by Google in 2018. On its release it achieved state-of-the-art results across a wide range of natural language processing (NLP) benchmarks, generating significant buzz.
BERT's greatest feature is that it is pre-trained to consider the context on both sides of each word at once, looking at what comes before and after it. Previous models primarily learned in one direction only, so this bidirectional approach allows for a much deeper understanding of context. It's like a careful reader who grasps the meaning of a word from everything written around it, not just from what came before.
BERT is built on a model architecture called the Transformer. The Transformer uses a mechanism called "attention" to weigh how relevant and important each word is to every other word, which lets it interpret context accurately.
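As a quick way to see this contextual ability in action, the sketch below uses the Hugging Face transformers library and the publicly released bert-base-uncased model (my choice for illustration; the case study doesn't specify its tooling) to let BERT fill in a masked word using the words on both sides of it.

```python
# pip install transformers torch
from transformers import pipeline

# Load a pre-trained BERT model for masked-word prediction.
# "bert-base-uncased" is Google's publicly released English model.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses the words both before and after [MASK] to guess the missing token.
for prediction in fill_mask("If the input value is invalid, the system shall [MASK] an error message."):
    print(f'{prediction["token_str"]:>10}  score={prediction["score"]:.3f}')
```

Because the prediction depends on the whole sentence, the same blank would be filled differently in a different context, and that sensitivity to context is exactly what keyword matching lacks.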
How BERT Solved the Challenges
Applying BERT to design reviews dramatically improved the problems faced by traditional keyword-based checking:
- High-Accuracy Detection through Contextual Understanding: BERT deeply understands the context and meaning of sentences. It considers the entire situation in which words are used, not just the presence of keywords. As a result, in a validation experiment at Hitachi Solutions, the false positive rate was successfully reduced from approximately 90% with keyword-based methods to a mere 5%.
- Flexibility Without Manual Word Registration: BERT processes sentences by breaking them down into smaller units called "subwords." This eliminates the need to pre-register new words or technical terms, making it flexible enough to handle any text (see the tokenizer sketch just after this list).
- Automatic Feature Learning: AI automatically learns patterns and characteristics of sentences that should be flagged, removing the need for humans to define complex judgment logic beforehand. This also eliminated much of the tedium in the preparation phase.
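Here's a minimal look at the subword behaviour mentioned in the list above, again using the public bert-base-uncased tokenizer as a stand-in for whatever tokenizer the production system actually uses.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare or domain-specific term is split into known subword pieces,
# so nothing needs to be registered in a dictionary beforehand.
print(tokenizer.tokenize("idempotency"))
# prints a list of pieces such as ['ide', '##mp', ...]; the exact split depends on the vocabulary
```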
How to Prepare AI Training Data?
For an AI model like BERT to operate with high accuracy, high-quality "training data" is essential. AI learns from provided data (labeled data) as "examples."
Creating training data presented its own challenges. Assigning one of nine predefined "quality concern categories" (such as "ambiguous terminology concern" or "complex condition concern") to design document content required specialized knowledge and was time-consuming: one person could process only 50-80 items per day.
To address this, a clever approach was devised for efficient training data creation. It involved first analyzing publicly available design documents using traditional keyword checking tools, and then visually refining those results. The keyword checking tool would highlight problematic areas and suggest candidate quality concern categories, allowing humans to quickly judge and classify the information.
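In code, that pre-labelling step might look something like the sketch below. The keyword-to-category mapping is purely illustrative; the article mentions nine categories but doesn't publish the mapping actually used.

```python
# Hypothetical pre-labelling step: a keyword tool proposes a candidate
# "quality concern category" and a human reviewer confirms or corrects it.
CANDIDATE_MAP = {
    "in the case of": "complex condition concern",
    "etc.": "ambiguous terminology concern",
    "as appropriate": "ambiguous terminology concern",
}

def propose_category(sentence: str) -> str | None:
    """Suggest a candidate category from simple keyword hits, for a human to verify."""
    for keyword, category in CANDIDATE_MAP.items():
        if keyword in sentence.lower():
            return category
    return None

sentence = "Handle the remaining patterns as appropriate."
print(f"candidate: {propose_category(sentence)}")  # the reviewer then accepts, changes, or discards it
```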
This method improved individual work efficiency by approximately 6-9 times, enabling one person to assign quality concern categories to 450-500 data items per day. As a result, 71 files and approximately 32,000 sentence data points were collected.
Of course, initial training data preparation still requires time and effort. In this case, approximately 1.5 person-months were needed for about 8,500 training data items. However, once the learning model is built, subsequent accuracy improvements can be easily achieved simply by adding more text and quality concern categories for training.
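For a rough feel of how such labelled sentences could then be turned into a classifier, here's a heavily abbreviated fine-tuning sketch using the Hugging Face Trainer API. The model name, hyperparameters, and two-sentence dataset are placeholders; Hitachi Solutions hasn't published the configuration it actually used, and a Japanese BERT checkpoint would be the natural choice for Japanese documents.

```python
# pip install transformers datasets accelerate torch
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Tiny illustrative dataset: sentence text plus one of the nine
# quality-concern categories, encoded as integer labels 0-8.
data = Dataset.from_dict({
    "text": ["In the case of A, in the case of B, in the case of C, do X.",
             "Handle other errors as appropriate."],
    "label": [0, 1],  # 0 = complex condition concern, 1 = ambiguous terminology concern (illustrative)
})

model_name = "bert-base-uncased"  # placeholder; swap in a Japanese BERT for Japanese documents
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=9)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qa-review-model", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
)
trainer.train()  # with real labelled data, the model learns to assign categories to new sentences
```

Adding more labelled sentences and rerunning the training step is then all it takes to refresh the model, which is what makes the later accuracy improvements comparatively cheap.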
Concrete Results and Future Prospects of AI Utilization
Hitachi Solutions' initiative clearly demonstrates the tangible benefits AI brings to software quality assurance.
Quantitative Case Study: Hitachi Solutions' Initiative
A comparison of review man-hours for a basic design document (approx. 1500 pages) yielded striking results:
| Item | Manual (Hours) | BERT (Hours) | Reduction |
|---|---|---|---|
| Document check based on specific points | 300.0 | 162.0 | -46.0% |
| Creating issue list, Q&A interactions | 30.0 | 17.0 | -43.3% |
| Total | 510.0 | 360.0 | -29.4% |

(Note: the total also covers review steps not itemized above, which is why the listed rows do not sum to it.)
They managed to reduce overall review man-hours by 29.4% (150 hours). This is a significant benefit for teams dealing with large volumes of documents.
Furthermore, quality improvements were observed as follows:
- Issue Reproduction Rate: This metric indicates how well the AI can identify issues similar to a veteran engineer's findings. It improved to 85.7% with BERT-based text classification, compared to 29.7% with keyword-based checking.
- Issue Miss Rate: This metric indicates how often issues are overlooked. It drastically decreased to 4.4% with BERT, compared to 69.2% with keyword-based checking.
- False Positive Rate: This metric indicates how often incorrect issues are flagged. It was roughly similar at 9.3% with BERT, compared to 9.1% with keyword-based checking.
These figures illustrate that AI not only accelerates tasks but also significantly enhances quality itself. In particular, the dramatic reduction in overlooked issues directly leads to a lower risk of bugs in later stages of development.
Various Possibilities of AI Utilization
Beyond design reviews, AI is starting to play a valuable role in various aspects of software quality assurance:
- Automated Test Case Generation: AI can analyze past data and specifications to automatically create effective test cases. This allows for coverage of scenarios that humans might not conceive of.
- Bug Prediction and Analysis: By learning from past test results and development history, AI can predict potential future bugs and issues, warning developers in advance.
- Visual UI Verification: Using AI's image recognition technology, visual defects or design inconsistencies in user interfaces can be automatically detected.
- Chatbot Applications: Generative AI-powered chatbots are being used to automatically answer citizen inquiries, as seen in the Kawasaki City Office example, or to assist with research data management at universities. Specifically, "RAG (Retrieval Augmented Generation)" technology allows them to retrieve information from external databases, including up-to-date information not in their training data or company-specific knowledge, to generate more accurate responses.
- Defect Localization and Correction: AI is also being applied to "defect localization," identifying the exact location of code defects, and "automated program repair" for simple defects.
- Defect Report Management: Natural language processing technologies like BERT are used to identify duplicate bug reports among large volumes of submissions (a small sketch follows this list).
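To illustrate that last item, here's a small sketch that flags likely duplicate bug reports by comparing sentence embeddings with cosine similarity. It uses the sentence-transformers library and a small public model purely for convenience; real deployments would pick their own model and threshold.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small public embedding model, chosen for the example

reports = [
    "Login button does nothing when clicked on the settings page.",
    "Clicking the login button on the settings screen has no effect.",
    "CSV export fails for files larger than 10 MB.",
]

embeddings = model.encode(reports, convert_to_tensor=True)
similarities = util.cos_sim(embeddings, embeddings)

# Flag report pairs whose similarity exceeds an illustrative threshold.
THRESHOLD = 0.8
for i in range(len(reports)):
    for j in range(i + 1, len(reports)):
        score = similarities[i][j].item()
        if score > THRESHOLD:
            print(f"Possible duplicate: report {i} and report {j} (similarity {score:.2f})")
```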
As such, AI is supporting human work in many stages of quality assurance, enabling smarter and more efficient activities.
Challenges and Future of AI Adoption
While AI is a powerful tool, its implementation comes with several challenges:
- Effort in Training Data Creation: As mentioned earlier, achieving high AI accuracy requires significant effort in preparing high-quality training data beforehand. Moreover, differences in quality perspectives among veteran engineers can necessitate harmonization.
- AI Model Reliability and Safety: AI output is not always guaranteed to be reliable. Especially with generative AI, risks such as "hallucinations" (generating information not based on facts), "bias" reflecting prejudices in the training data, and unintended information leaks have been pointed out. Establishing evaluation criteria and testing methods to minimize these risks remains a major challenge.
- Changing Human Roles and Continuous Learning: As AI automates many tasks, the role of human testers will likely shift from mere "bug finders" to "quality consultants." New skill development will be essential, including proficiency in using AI tools, data analysis skills to interpret test results and derive insights, and strong communication abilities with developers.
To address these challenges, various initiatives are underway, such as implementing "MLOps (Machine Learning Operations)" for continuous AI learning and automated training data additions, and creating comprehensive test viewpoints for safety evaluation.
By allowing AI and humans to leverage their respective strengths and collaborate, higher quality software development will undoubtedly be realized.
Conclusion
In this column, we explored how AI enhances quality and streamlines operations, focusing on AI's application in software quality assurance, particularly in design document reviews.
Traditional keyword-based checking methods struggled with contextual understanding and high false positive rates. However, advanced natural language processing models like BERT have brought about dramatic improvements with their ability to understand context.
The evolution of quality assurance brought about by AI not only shortens work time but also improves review quality and reduces oversights, significantly contributing to the overall quality of the final product. This is clear evidence that "AI4QA," the concept of AI deeply engaging with and enhancing the quality assurance process itself, is becoming a reality.
Of course, AI is not a panacea. Challenges remain, such as preparing training data and ensuring AI reliability and safety. However, these challenges will also be resolved through the collaboration of AI and humans.
AI will undoubtedly create an environment where we humans can focus on more creative and valuable work, making the future of quality assurance brighter and more robust. I'm personally very excited to see how future AI advancements will continue to transform the software development landscape.